Sequence inference method based on KV cache segment recomputation and related device
By performing segment filtering and key-value caching and recalculation on the input sequence, the performance degradation and accuracy instability caused by memory overflow in long sequence inference are solved, and efficient and accurate output sequence generation is achieved.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- HUAWEI TECH CO LTD
- Filing Date
- 2025-03-19
- Publication Date
- 2026-07-02
AI Technical Summary
In long sequence reasoning scenarios, existing technologies are prone to memory overflow, which leads to decreased reasoning performance and unstable accuracy, especially when the length of the input sequence exceeds the window size, resulting in a significant drop in reasoning accuracy.
By filtering the input sequence into segments, identifying key tokens, and recalculating them in conjunction with the key-value vectors of relevant segments, the key-value cache is updated. The correlation between segments is considered, thereby improving the accuracy and efficiency of inference and generating the output sequence.
While ensuring efficiency, it avoids a significant drop in accuracy, guarantees the accuracy and stability of the output sequence generated by inference, reduces the amount of computation, and improves inference efficiency.
Description
A sequence reasoning method and related equipment based on KV cache segmented recalculation
[0001] This application claims priority to Chinese Patent Application No. 202411495522.3, filed on October 23, 2024, entitled "A Sequence Reasoning Method and Related Device Based on KV Cache Segment Recalculation", the entire contents of which are incorporated herein by reference.
Technical Field
[0002] This application relates to the field of artificial intelligence (AI) technology, and in particular to a sequence reasoning method, reasoning platform, computing device cluster, computer-readable storage medium, and computer program product.
Background Technology
[0003] With the continuous development of artificial intelligence (AI) technology, especially the rapid development of large language models (LLMs), inference services based on AI models such as LLMs have been widely applied. For example, an LLM-based inference service can be applied to chatbot scenarios to answer user questions; an LLM-based chatbot can conduct multi-turn dialogues to meet business needs. As another example, an LLM-based inference service can be applied to document summarization scenarios to understand and summarize the content of one or more documents, thereby generating document summaries.
[0004] Inference services typically take sequences as input and produce sequences as output, hence the name sequence inference. Sequence inference can be categorized into long sequence inference and short sequence inference based on the length of the input sequence. AI models are usually trained using short sequences; for example, an LLM can be trained on text sequences of length 2K (K for kilo) or 4K during the pre-training phase. As a result, long sequence inference scenarios are prone to Out of Memory (OOM) errors. An OOM error occurs when the available memory space is less than the required memory space. This can cause frequent data exchange between memory and persistent storage during inference, drastically increasing inference latency and significantly degrading inference performance.
[0005] To address this issue, the industry has proposed a sliding window scheme to mitigate the performance degradation caused by memory overflow. However, when the length of the input sequence exceeds the window size, it can lead to a significant drop in inference accuracy.
Summary of the Invention
[0006] This application provides a sequence reasoning method. This method identifies key tokens for target segments in an input sequence and, by combining these tokens with other segments preceding the segment containing the key token, recalculates the key-value vector to update the key-value cache. Since the updated key-value cache considers the correlation between segments, reasoning based on the updated cache can improve the accuracy of the generated output sequence, avoiding a significant drop in accuracy while maintaining efficiency. This application also provides a corresponding inference platform, computing device cluster, computer-readable storage medium, and computer program product.
[0007] Firstly, this application provides a sequence inference method. This method can be executed by an inference platform. In some cases, the inference platform may also be referred to as an inference acceleration platform. The inference platform can be a software system. This software system can be standalone software, such as a standalone AI model inference tool or suite. Alternatively, the software system can be integrated into other software as a functional module, plugin, component, or applet. In some examples, the inference platform can also be a hardware system. The hardware system can be a cluster of computing devices with sequence inference capabilities; when the computing device cluster runs, it executes the sequence inference method of this application.
[0008] Specifically, the inference platform receives an input sequence and, based on the evaluation metrics of at least one segment in the input sequence, selects target segments from the at least one segment. The inference platform can determine key tokens in the target segments. For a first segment among the target segments, it updates the key-value cache based on the key tokens of the first segment and related tokens, obtaining an updated key-value cache. Related tokens include tokens in the first segment that precede the key token of the first segment, or tokens in a second segment. The second segment is a segment among the target segments that precedes the first segment. The inference platform can then infer and generate an output sequence based on the updated key-value cache.
[0009] This method filters the input sequence into segments, identifies key tokens within the selected segments, and recalculates the key-value vector of the key token by combining it with other segments preceding the segment containing the key token, thereby updating the key-value cache (KV cache). Since the updated KV cache considers the correlation between segments, inference based on the updated KV cache improves the accuracy of the generated output sequence. Furthermore, this method pre-filters the input sequence into segments and updates and reuses the KV cache for the selected segments, eliminating the need to update the KV cache for each segment and improving inference efficiency. This method maintains efficiency while avoiding a significant drop in accuracy, ensuring the stability of accuracy.
[0010] In some possible implementations, at layer t of the first artificial intelligence (AI) model, the inference platform can determine the relevance between the key token and related tokens based on the key token and related tokens of the first fragment. The inference platform can then obtain the updated key-value cache for layer t+1 of the first AI model based on the relevance.
[0011] In this method, the inference platform determines the token relevance across segments (or inter-segment relevance), updates the key-value cache based on the relevance, and performs inference based on the updated key-value cache, which can improve the accuracy of the generated output sequence.
[0012] In some possible implementations, the inference platform can obtain the output of the t-th layer of the first AI model based on the relevance, and then use the key-value weight matrix to map the updated key-value vector based on the output of the t-th layer. The updated key-value vector is then used to replace the key-value vector cached in the (t+1)-th layer of the first AI model to obtain the updated key-value cache of the (t+1)-th layer of the first AI model.
[0013] This method recalculates the output of the t-th layer of the first AI model based on relevance, then updates the key-value vector through the key-value weight matrix based on the output, and replaces the key-value cache of the (t+1)-th layer of the first AI model with this, laying the foundation for subsequent key-value cache reuse and improving the accuracy of subsequently generated tokens.
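The recalculation described above can be illustrated with a minimal sketch. It assumes single-head attention over plain Python lists and omits residual connections, normalization, and feed-forward sublayers; the function names (`recompute_kv`, `matvec`), the weight matrices `w_k` and `w_v`, and the cache layout (layer index, then token position) are hypothetical and not taken from this application.

```python
import math

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(vec, mat):
    # Row vector times matrix: (d,) x (d, d_out) -> (d_out,)
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

def recompute_kv(q_key, related_k, related_v, w_k, w_v):
    """Recompute the layer-t output for one key token, then map it to layer t+1 K/V."""
    scale = math.sqrt(len(q_key))
    # Relevance between the key token and each related token (attention weights).
    scores = [sum(a * b for a, b in zip(q_key, k)) / scale for k in related_k]
    weights = softmax(scores)
    # Layer-t output: relevance-weighted sum of the related value vectors.
    dim = len(related_v[0])
    out = [sum(w * v[j] for w, v in zip(weights, related_v)) for j in range(dim)]
    # Map the output through the (assumed) key and value weight matrices.
    return matvec(out, w_k), matvec(out, w_v), weights

# Illustrative values only: identity projections, one key token at position 5.
identity = [[1.0, 0.0], [0.0, 1.0]]
cache = {1: {5: ([0.0, 0.0], [0.0, 0.0])}}  # layer t+1 = 1
new_k, new_v, weights = recompute_kv(
    q_key=[1.0, 0.0],
    related_k=[[1.0, 0.0], [0.0, 1.0]],
    related_v=[[2.0, 0.0], [0.0, 2.0]],
    w_k=identity,
    w_v=identity,
)
cache[1][5] = (new_k, new_v)  # replace the cached key-value vector in layer t+1
```

In this sketch the related token whose key aligns with the key token's query receives the larger weight, so it dominates the recomputed vectors written back to the cache.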
[0014] In some possible implementations, the inference platform can also filter at least one segment from the input sequence based on its relevance to the input sequence, obtaining filtered segments. The inference platform then further refines these filtered segments based on their information content to obtain the target segment.
[0015] This method segments the input sequence, performs coarse screening based on the relevance of the segments to the input sequence, and then performs a second screening (fine screening) on the segments based on their information content. On the one hand, this reduces the number of tokens that need to be recalculated in the key-value cache, significantly reducing the amount of computation and accelerating inference. On the other hand, by screening multiple times in different dimensions, it can identify segments with higher importance, identify key tokens for these segments, and recalculate the key-value vectors for these key tokens to update the key-value cache. The output sequence is then generated based on the updated key-value cache, which can improve inference accuracy.
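The coarse-then-fine screening can be sketched as follows. This is an illustration under stated assumptions: the relevance and information values are taken as precomputed numbers, and the threshold `rel_threshold`, the cut-off `top_k`, and the function name `select_target_segments` are hypothetical rather than part of this application.

```python
def select_target_segments(segments, relevance, information, rel_threshold, top_k):
    """Coarse screen by relevance, then fine screen by information content."""
    # Coarse screening: keep segment indices whose relevance exceeds the threshold.
    coarse = [i for i in range(len(segments)) if relevance[i] > rel_threshold]
    # Fine screening: rank the survivors by information content and keep the top_k.
    ranked = sorted(coarse, key=lambda i: information[i], reverse=True)[:top_k]
    # Restore the original order, since segments of the input sequence are ordered.
    return [segments[i] for i in sorted(ranked)]

segments = ["s0", "s1", "s2", "s3"]
relevance = [0.9, 0.2, 0.8, 0.7]
information = [1.0, 5.0, 3.0, 2.0]
targets = select_target_segments(segments, relevance, information,
                                 rel_threshold=0.5, top_k=2)
```

Here segment `s1` has the highest information content but is dropped in the coarse pass, which shows why the two screening dimensions are applied in sequence.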
[0016] In some possible implementations, the inference platform can also input at least one segment from the input sequence into a first AI model, and determine the relevance of at least one segment to the input sequence based on the degree of dispersion of the shallow attention of the first AI model. And / or, the inference platform can also input at least one segment from the input sequence into a second AI model, determine the information content of at least one segment based on the probability distribution output by the second AI model, and determine the relevance of at least one segment to the input sequence based on the information content. The second AI model includes the first m layers of the first AI model and an adapter layer, where m is a positive integer. And / or, the inference platform can also input at least one segment from the input sequence into a third AI model, and determine the relevance of at least one segment to the input sequence based on the semantic similarity, output by the third AI model, between the at least one segment and the input sequence.
[0017] This method supports determining the correlation between a fragment and the input sequence through multiple methods or any combination of multiple methods, thus providing a reference for fragment selection and having high availability.
[0018] In some possible implementations, for a segment in the input sequence, if the relevance values determined by the first AI model, the second AI model, and the third AI model are all higher than a threshold, the inference platform can retain the segment as a candidate segment.
[0019] This method combines three methods to determine relevance: relevance determined by a first AI model, relevance determined by a second AI model, and relevance determined by a third AI model, to perform fragment filtering, thereby improving the accuracy of fragment filtering.
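The combined rule can be sketched in a few lines. The function name `retain_candidates`, the tuple layout, and the use of a single shared threshold are assumptions made for illustration.

```python
def retain_candidates(relevances, threshold):
    """Return indices of segments whose three relevance values all exceed the threshold."""
    return [
        idx
        for idx, (rel_first, rel_second, rel_third) in enumerate(relevances)
        if all(r > threshold for r in (rel_first, rel_second, rel_third))
    ]

# Each tuple holds the (first-model, second-model, third-model) relevance of one segment.
example = [(0.9, 0.8, 0.7), (0.9, 0.4, 0.9), (0.6, 0.6, 0.6)]
kept = retain_candidates(example, threshold=0.5)
```

The second segment is discarded because one of its three relevance values falls below the threshold, even though the other two are high.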
[0020] In some possible implementations, the inference platform can also input the filtered fragments into the first artificial intelligence (AI) model for inference, obtain the probability distribution of the output vocabulary of the pre-filling stage, and determine the information content of the filtered fragments based on the probability distribution of the output vocabulary.
[0021] This method inputs filtered fragments into a model for inference. Based on the probability distribution of the output vocabulary from the pre-filling stage during inference, the information content is determined using this probability distribution, which can provide a reference for subsequent secondary screening of fragments based on information content. Notably, since it does not require inputting every fragment into the model for inference, the number of inference iterations or the computational load of inference are significantly reduced.
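This application does not state how information content is derived from the probability distribution. One plausible reading, shown here purely as an assumption, is the Shannon entropy of the output vocabulary distribution, averaged over the positions of a segment. The function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of one vocabulary probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def segment_information(distributions):
    """Mean entropy over the per-position distributions of one segment (assumed metric)."""
    return sum(entropy(d) for d in distributions) / len(distributions)

# Toy two-word vocabulary: one uncertain position, one fully certain position.
info = segment_information([[0.5, 0.5], [1.0, 0.0]])
```

Other measures, such as the self-information of the observed tokens, would fit the same interface.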
[0022] In some possible implementations, the inference platform can also acquire the attention of tokens in the target fragment, which is obtained when the first artificial intelligence model performs inference on the target fragment. Accordingly, the inference platform determines the key tokens in the target fragment based on the attention of tokens in the target fragment.
[0023] This method can reuse the attention of tokens output during model inference to identify key tokens in target segments, thus helping to update the key-value cache.
[0024] In some possible implementations, the inference platform can also determine the key-value vector of the token in the target fragment and cache the key-value vector to obtain a key-value cache.
[0025] This method caches key-value vectors, allowing reuse of these vectors during the decoding phase to obtain new tokens. Since it stores the memory of the large model (such as a language model, including but not limited to LLM), secondary inference by the large model is unnecessary. By using feature lookup (such as key-value vectors) instead of computation, it achieves on-demand use, improving the inference efficiency of the large model. Taking text sequence inference as an example, by offloading and persisting the key-value cache (KV Cache) from hot documents and historical sessions to storage, similarity retrieval in subsequent sessions enables the reuse of relevant KV Caches. Furthermore, for long sequences, this method can determine the relevance between segments, such as attention between segments (including but not limited to cross attention), and then recalculate the KV Cache based on cross attention, improving the accuracy of KV Cache retrieval or query, thereby increasing inference precision. Through segment selection and KV Cache recalculation, this method not only achieves sequence compression and efficient inference but also avoids a significant drop in precision while maintaining efficiency, ensuring the stability of accuracy.
[0026] In some possible implementations, the inference platform can also perform incremental pre-filling and full decoding based on the updated key-value cache to obtain the output sequence. Incremental pre-filling refers to generating the first token for each segment of the target sequence using the updated key-value cache. Full decoding can generate the remaining tokens for each segment of the input sequence by querying the updated key-value cache.
[0027] This method achieves stability of accuracy by performing incremental pre-filling and full decoding based on the updated key-value cache, thereby avoiding a significant drop in accuracy while ensuring efficiency.
[0028] Secondly, this application provides a reasoning platform. The reasoning platform includes:
[0029] The segment filtering module is used to receive the input sequence and filter the target segment from at least one segment according to the evaluation index of at least one segment in the input sequence.
[0030] The key-value cache update module is used to determine the key token in the target fragment. For the first fragment in the target fragment, the key-value cache is updated according to the key token of the first fragment and related tokens to obtain the updated key-value cache. Related tokens include tokens in the first fragment before the key token of the first fragment or tokens in the second fragment. The second fragment is the fragment in the target fragment that is located before the first fragment.
[0031] The inference module is used to infer and generate the output sequence based on the updated key-value cache.
[0032] In some possible implementations, the key-value cache update module is specifically used for:
[0033] In the t-th layer of the first artificial intelligence (AI) model, the relevance between the key token and the related token is determined based on the key token and related tokens of the first fragment.
[0034] Based on the relevance, obtain the updated key-value cache of the (t+1)th layer of the first AI model.
[0035] In some possible implementations, the key-value cache update module is specifically used for:
[0036] Based on the relevance, obtain the output of the t-th layer of the first AI model;
[0037] Based on the output of layer t, the updated key-value vector is obtained by mapping using the key-value weight matrix. The updated key-value vector is then used to replace the key-value vector cached in layer t+1 of the first AI model, so as to obtain the updated key-value cache of layer t+1 of the first AI model.
[0038] In some possible implementations, the fragment filtering module is specifically used for:
[0039] Based on the correlation between at least one segment in the input sequence and the input sequence, filter at least one segment to obtain filtered segments;
[0040] Based on the amount of information in the filtered segments, the target segments are obtained by further selection of the filtered segments.
[0041] In some possible implementations, the fragment filtering module is also used for:
[0042] Input at least one segment from the input sequence into a first artificial intelligence (AI) model, and determine the relevance of at least one segment to the input sequence based on the degree of dispersion of the shallow attention of the first AI model; and / or,
[0043] At least one segment from the input sequence is input into a second AI model. The information content of the at least one segment is determined based on the probability distribution output by the second AI model. The correlation between the at least one segment and the input sequence is determined based on the information content. The second AI model comprises the first m layers of the first AI model and an adapter layer, where m is a positive integer; and / or,
[0044] At least one segment from the input sequence is input into a third AI model. Based on the semantic similarity between at least one segment output by the third AI model and the input sequence, the relevance of at least one segment to the input sequence is determined.
[0045] In some possible implementations, the fragment filtering module is specifically used for:
[0046] For a segment in the input sequence, if the relevance determined by the first AI model, the second AI model, and the third AI model are all higher than the threshold, then the candidate segment is retained.
[0047] In some possible implementations, the fragment filtering module is also used for:
[0048] The filtered segments are input into the first artificial intelligence (AI) model for inference to obtain the probability distribution of the output vocabulary in the pre-filling stage;
[0049] The information content of the filtered segments is determined based on the probability distribution of the output vocabulary.
[0050] In some possible implementations, the key-value cache update module is also used for:
[0051] Acquire the attention of tokens in the target segment, which is obtained when the first artificial intelligence model performs inference on the target segment;
[0052] The key-value cache update module is specifically used for:
[0053] Identify the key tokens in the target fragment based on attention to the tokens in the target fragment.
[0054] In some possible implementations, the key-value cache update module is also used for:
[0055] Determine the key-value vector of the token in the target fragment and cache the key-value vector to obtain the key-value cache.
[0056] In some possible implementations, the inference module is specifically used for:
[0057] Based on the updated key-value cache, perform incremental pre-filling and full decoding to obtain the output sequence.
[0058] Thirdly, this application provides a computing device cluster. The computing device cluster includes at least one computing device, and the at least one computing device includes at least one processor and at least one memory. The at least one processor and the at least one memory communicate with each other. The at least one processor is used to execute instructions stored in the at least one memory to cause the computing device or the computing device cluster to perform the sequence reasoning method as described in the first aspect or any implementation thereof.
[0059] Fourthly, this application provides a computer-readable storage medium storing instructions that instruct a computing device or a cluster of computing devices to execute the sequence reasoning method described in the first aspect or any implementation thereof.
[0060] Fifthly, this application provides a computer program product containing instructions that, when run on a computing device or a cluster of computing devices, causes the computing device or cluster of computing devices to execute the sequence reasoning method described in the first aspect or any implementation thereof.
[0061] The implementations provided in the above aspects of this application may be further combined to provide additional implementations.
Attached Figure Description
[0062] To more clearly illustrate the technical methods of this application, the accompanying drawings used will be briefly described below.
[0063] Figure 1 is a schematic diagram of an architecture for compressing key-value caching using a sliding window provided in this application;
[0064] Figure 2 is a schematic diagram of the architecture of an inference platform provided in this application;
[0065] Figure 3 is a flowchart of a sequence reasoning method provided in this application;
[0066] Figure 4 is a schematic diagram of a correlation-based segment filtering method provided in this application;
[0067] Figure 5 is a schematic diagram of a fragment sorting method based on information content provided in this application;
[0068] Figure 6 is a schematic diagram of an updated key-value cache provided in this application;
[0069] Figure 7 shows an application scenario of a sequence reasoning method provided in this application;
[0070] Figure 8 is a schematic diagram of the structure of a computing device provided in this application;
[0071] Figure 9 is a schematic diagram of the structure of a computing device cluster provided in this application;
[0072] Figure 10 is a schematic diagram of another computing device cluster provided in this application;
[0073] Figure 11 is a schematic diagram of another computing device cluster provided in this application.
Detailed Implementation
[0074] The terms "first" and "second" used in the embodiments of this application are for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of indicated technical features. Therefore, a feature defined with "first" or "second" may explicitly or implicitly include one or more of that feature.
[0075] First, some technical terms involved in the embodiments of this application will be introduced.
[0076] Artificial intelligence (AI) is the ability to correctly interpret external data, learn knowledge from that data, and use that knowledge to achieve specific goals and tasks. Currently, a significant branch of AI is natural language processing (NLP). NLP uses algorithms such as machine learning (ML) and deep learning (DL) to build language models (LM), enabling computers to interpret, process, and understand human language.
[0077] A language model is a probability distribution model of words in a natural language. The most common data in natural language processing is text data, such as a piece of natural language text, which can be viewed as a discrete time series. Assuming the words in a natural language text of length T are w1, w2, ..., wT, then in the discrete time series, wt (1 ≤ t ≤ T) can be seen as the output or label at time step t. Given a word sequence of length T, w1, w2, ..., wT, the language model will calculate the probability of this word sequence: P(w1, w2, ..., wT). Based on this, the language model can predict the next most likely word using the input context (e.g., several preceding words).
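The joint probability above is commonly factorized with the chain rule of probability, which is what lets a model predict the next word from the preceding context. This factorization is standard background and is not stated explicitly in this application:

```latex
P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1})
```

For t = 1 the conditioning context is empty, so the first factor is simply P(w_1).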
[0078] Language models can also be categorized by parameter size into small language models and large language models (LLMs). Small language models can be simply referred to as small models, and large language models as large models. For ease of description, this application uses an LLM as an example. An LLM is a language model constructed from a deep neural network containing hundreds of billions of weights, which can be trained using self-supervised learning methods on a large amount of unlabeled text.
[0079] When language models such as LLM employ transformer networks, they exhibit the following significant characteristics: each inference round generates only one token as output. This token is then combined with all previously generated tokens to form the input sequence for the next round of inference. Here, the token represents the smallest unit of meaning that the language model can understand and generate, serving as its fundamental unit. Depending on the tokenization scheme used, a token can represent a word, a portion of a word, or even just a character. Tokens are assigned numerical values or identifiers, arranged as sequences or vectors, and input into or output from the language model, forming its linguistic components. This generation process is repeated until a complete output sequence is generated. Because each round's input sequence only adds one token to the previous round's, it leads to significant redundant computation. Therefore, the industry has introduced key-value caches (KV caches) to reduce redundant computation.
[0080] KV Cache is used to store and reuse information computed by models (such as LLMs using transformer networks) in the self-attention layer. Self-attention (or internal attention) is the core of the transformer network; it allows the model to compare each element with other elements in a sequence (such as an input sequence) to determine the relationships between elements. For language models, elements can be tokens, for example, obtained by tokenizing the user-input text sequence.
[0081] In the self-attention layer of a model or network, each element of the sequence can be transformed into three representations: a query vector, a key vector, and a value vector. These can also be abbreviated as Q, K, and V. Q represents the current element, used to match with the key vector K. K represents other elements in the sequence, used to match with the query vector Q. V represents other elements in the sequence; when the key vector K matches the query vector Q, the corresponding value vector V is used to construct the output. Specifically, by calculating the similarity between the query vector and all key vectors, the model can record the attention that should be given to each element in the sequence when generating each output element. This attention can also be called an attention score, attention rating, or attention weight. The attention score determines the contribution of the value vector to the output and can be used for weighted calculation to obtain the model's output at that layer.
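The weighted calculation can be sketched for a single query vector. The sketch assumes the dot product as the similarity measure, scaling by the square root of the vector dimension, and softmax normalization, as in the standard transformer formulation; the paragraph above only speaks of "similarity", so these choices are assumptions.

```python
import math

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    # Similarity between the query and every key vector.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Softmax turns the scores into attention weights that sum to 1.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The output is the attention-weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights

out, w = attention([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [4.0, 0.0]])
```

With two identical keys the weights are equal, so the output is the plain average of the two value vectors.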
[0082] The introduction of KV Cache divides the inference process into two phases: prefill and decode. The prefill phase generates the first output token. During prefill, the key and value vectors of each transformer layer are computed in full and cached, forming a key cache and a value cache. At this stage, the number of floating-point operations is the same as without KV Cache. This phase involves a large number of General Matrix Multiplication (GEMM) operations, making it a compute-intensive task. The decode phase generates the second and all subsequent output tokens until generation is complete; that is, it generates every token other than the first token produced by the prefill phase. During decode, since the KV Cache (such as the key cache and value cache) already stores the key and value vectors from previous rounds, each round of inference in this phase can read data from the KV Cache and add the newly calculated key and value vectors to the corresponding caches. At this stage, the number of floating-point operations is reduced, the inference speed is significantly improved compared to the prefill phase, and the task type becomes memory-intensive.
[0083] KV Cache stores the key and value vectors of previously generated tokens, so generating a new token only requires interacting with the cached key and value vectors, which reduces the per-token complexity to linear in the sequence length. Because the key and value vectors of earlier tokens are not recomputed, redundant calculation is avoided and inference efficiency improves significantly.
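The two phases can be illustrated with a toy cache. This is a sketch, not the implementation of this application: `KVCache`, `project`, `prefill`, and `decode_step` are hypothetical names, and `project` is a stand-in for the real key and value projections of a transformer layer.

```python
class KVCache:
    """Per-layer store of key and value vectors, appended once per token."""

    def __init__(self, num_layers):
        self.keys = [[] for _ in range(num_layers)]
        self.values = [[] for _ in range(num_layers)]

    def append(self, layer, key, value):
        self.keys[layer].append(key)
        self.values[layer].append(value)

    def __len__(self):
        return len(self.keys[0])

def project(token_id, layer):
    # Hypothetical stand-in for the K/V projections of a transformer layer.
    return [float(token_id), float(layer)], [float(token_id) * 2.0, float(layer)]

def prefill(cache, prompt, num_layers):
    # Prefill: compute and cache K/V for every prompt token in every layer.
    for tok in prompt:
        for layer in range(num_layers):
            key, value = project(tok, layer)
            cache.append(layer, key, value)

def decode_step(cache, new_token, num_layers):
    # Decode: only the newest token needs fresh K/V; earlier ones are read from the cache.
    for layer in range(num_layers):
        key, value = project(new_token, layer)
        cache.append(layer, key, value)
    return len(cache)  # number of cached positions available to the next step

cache = KVCache(num_layers=2)
prefill(cache, [1, 2, 3], num_layers=2)
positions = decode_step(cache, 4, num_layers=2)
```

Each decode step adds one entry per layer rather than recomputing the whole sequence, which is the source of the saving described above.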
[0084] Inference services built on AI models such as LLM typically take sequences as input and output, hence the term sequence inference. Sequence inference can also be categorized into long sequence inference and short sequence inference based on the length of the input sequence. AI models, such as LLM, are usually trained using short sequences; for example, LLM can be trained on text sequences of length 2K or 4K during the pre-training phase. This can lead to Out of Memory (OOM) errors in long sequence inference scenarios. OOM occurs when the available memory space is less than the required memory space. In this case, intermediate results generated during inference are frequently swapped in and out of memory and persistent storage, drastically increasing inference latency, especially for the first token, thus becoming a performance bottleneck.
[0085] The industry has proposed a sliding window solution to address performance bottlenecks caused by memory overflow. This solution caches the most recent key and value vectors (collectively, key-value vectors). A plain sliding window has a drawback: when the length of the input sequence exceeds the cache size, the key-value vectors of the initial tokens are overwritten, causing the performance of the AI model to collapse. Based on the attention sink phenomenon, in which AI models tend to focus strongly on the initial tokens during processing, the solution therefore retains the key-value vectors of the initial tokens in addition to caching the key-value vectors of the most recently generated tokens. As illustrated in Figure 1, when generating token 7, the key-value vectors of the initial tokens 0 to 3 are retained using the attention sink, while the key-value vectors of tokens 4 to 6 are retained using a sliding window. The key-value vectors are retained in the same manner when generating token 8 and token 9. This can prevent performance collapse.
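The retention rule can be sketched as a function that lists which positions stay cached. The defaults of four sink tokens and a window of three match the token 7 example above; the positions returned for later tokens assume the window advances by one position per generated token, which this application does not spell out. The function name is hypothetical.

```python
def retained_positions(next_token, num_sink=4, window=3):
    """Positions whose key-value vectors stay cached when generating `next_token`."""
    cached = list(range(next_token))  # tokens 0 .. next_token - 1 exist so far
    if len(cached) <= num_sink + window:
        return cached  # everything still fits, nothing is evicted
    # Keep the attention-sink tokens plus the most recent `window` tokens.
    return cached[:num_sink] + cached[-window:]

when_7 = retained_positions(7)
when_8 = retained_positions(8)
```

Under these assumptions the cache never holds more than `num_sink + window` positions, regardless of how long generation runs.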
[0086] However, when the length of the input sequence exceeds the window size set during pre-training, the input sequence is typically segmented into multiple segments. These segments can be subsets of the sequence, also known as subsequences. For example, if the sequence is a 100K token sequence, the inference service can divide it into 25 segments, each 4K in length; each segment can be a 4K subsequence. For each segment, the inference service stores the key-value vector of the token using a sliding window to avoid memory overflow and performance crashes. However, in this approach, the key-value vector of each segment is calculated independently, without considering the correlation between segments. This means that the token obtained from the key-value cache query may not be the accurate token, leading to unstable inference accuracy of the AI model. This is especially true when performing long sequence inference, such as inference for sequences longer than 128K, where accuracy drops significantly.
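The segmentation arithmetic in this example can be checked with a short sketch. It assumes K = 1024 and fixed-length, non-overlapping segments; the function name is hypothetical.

```python
def split_into_segments(tokens, segment_len):
    """Split an ordered token sequence into consecutive fixed-length subsequences."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]

K = 1024  # assumed meaning of "K"
tokens = list(range(100 * K))
segments = split_into_segments(tokens, 4 * K)
```

If the sequence length is not a multiple of the segment length, the last segment returned by this sketch is simply shorter.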
[0087] In view of this, this application provides a sequence inference method. This method can be executed by an inference platform. In some cases, the inference platform may also be referred to as an inference acceleration platform. The inference platform can be a software system. This software system can be standalone software, such as a standalone AI model inference tool or suite. Alternatively, the software system can be integrated into other software as a functional module, plugin, component, or applet. In some examples, the inference platform can also be a hardware system. The hardware system can be a cluster of computing devices with sequence inference capabilities; when the computing device cluster runs, it executes the sequence inference method of this application.
[0088] Specifically, the inference platform receives an input sequence and, based on the evaluation metrics of at least one segment in the input sequence, selects a target segment from at least one segment. The inference platform then determines key tokens within the target segment. These key tokens can be tokens with high importance within the target segment, such as tokens with importance scores higher than a set value. Importance scores can be determined in different ways; in some examples, the importance score can be the attention score of the AI model's inference output for the target segment. For the first segment within the target segment, the inference platform updates the key-value cache (KV Cache) based on the key tokens and related tokens of the first segment, obtaining an updated KV Cache. Related tokens include tokens preceding the key tokens in the first segment or tokens in the second segment. It should be noted that the tokens in the input sequence are ordered; therefore, the segments divided from the input sequence are also ordered, and the tokens within each segment are also ordered. The second segment is the target segment preceding the first segment; specifically, the spatial order (or position) of the second segment precedes the first segment. Next, the inference platform generates an output sequence based on the updated key-value cache.
[0089] This method filters the input sequence into segments, identifies key tokens within the selected segments, and recalculates the key-value vector of the key token by combining it with other segments preceding the segment containing the key token, thereby updating the KV Cache. Since the updated KV Cache considers the correlation between segments, inference based on the updated KV Cache improves the accuracy of the generated output sequence. Furthermore, this method pre-filters the input sequence into segments and updates and reuses the KV Cache for the selected segments, eliminating the need to update the KV Cache for each segment and improving inference efficiency. This method maintains efficiency while avoiding a significant drop in accuracy, ensuring accuracy stability.
[0090] To make the technical solution of this application clearer and easier to understand, the system architecture of this application is described below with reference to the accompanying drawings.
[0091] Referring to Figure 2, which shows a schematic diagram of the architecture of an inference platform, the inference platform 20 includes a segment filtering module 202, a key-value cache update module 204, and an inference module 206. The functions and collaboration process of the segment filtering module 202, the key-value cache update module 204, and the inference module 206 are described in detail below.
[0092] The segment filtering module 202 is used to receive an input sequence. This input sequence can be a long sequence or a short sequence. Figure 2 illustrates this using a long input sequence as an example. The segment filtering module 202 is also used to filter target segments from at least one segment based on an evaluation metric of at least one segment in the input sequence. For any of the at least one segment, the evaluation metric can include at least one of the segment's information content and its relevance to the input sequence. Information content is used to quantitatively measure information, and can be equal to the reduction in uncertainty after obtaining the information. Based on this, the information content of a segment can be the reduction in uncertainty after obtaining the segment. The information content of a segment can be determined based on the segment's entropy.
[0093] The key-value cache update module 204 is used to determine the key tokens in the target segment and, for the first segment of the target segment, to update the key-value cache based on the key tokens and related tokens of the first segment, obtaining the updated key-value cache. A key token is an important token in the target segment. For example, for the first segment of the target segment, the key token can be a token in the first segment with an importance score greater than a set value, or the token with the highest importance score in the first segment. The importance score can include, but is not limited to, the attention score output by the AI model when inferring the target segment. Related tokens include tokens preceding the key token in the first segment or tokens in the second segment. The second segment is the segment in the target segment that precedes the first segment. "Preceding the token" and "preceding the segment" can refer to spatial order or positional order. For example, when the input sequence is divided into segments 1 to N, the target segment is segment K, and the key token is the s-th token in segment K, the second segment can be segments 1 to K-1, and the related tokens can be the tokens in segments 1 to K-1 and the 1st to (s-1)-th tokens in segment K. Here, N, K, and s are greater than 1. It should be noted that when the segment is segment 1, the related tokens can include tokens within segment 1. When the segment is segment 2, segment 3, ... or segment N, the related tokens can include tokens across segments. If the key token is the first token within a segment, the related tokens may not include tokens within that segment.
[0094] The inference module 206 is used to infer and generate the output sequence based on the updated key-value cache. Specifically, when generating the second token through the last token, the inference module 206 can generate the corresponding tokens by querying the updated key-value cache. Because the KV Cache is updated first and KV Cache retrieval is then performed on the updated KV Cache, the accuracy of the retrieval results improves. This KV Cache reuse further improves the accuracy of the generated output sequence, ensuring both efficient inference and stable accuracy.
[0095] Furthermore, the segment filtering module 202 may include at least one of the filtering submodule 2022, the inference submodule 2024, or the sorting submodule 2026. The functions of the filtering submodule 2022, the inference submodule 2024, and the sorting submodule 2026 are described in detail below.
[0096] The filtering submodule 2022 filters segments based on the relevance of at least one segment to the input sequence, obtaining filtered segments. This retains segments with high relevance and filters segments with low relevance, achieving a coarse screening of segments in the input sequence. The inference submodule 2024 and the sorting submodule 2026 filter at least one segment in the input sequence based on information content. Specifically, the inference submodule 2024 and the sorting submodule 2026 can select target segments from the filtered segments based on their information content. This achieves a fine screening of segments in the input sequence.
[0097] The inference submodule 2024 is used to input at least one segment from the input sequence into the AI model in parallel for inference, obtaining the probability distribution of the output vocabulary in the pre-filling stage. To distinguish it from other AI models, the original AI model can be referred to as the first AI model. The output vocabulary may include several words or tokens corresponding to several words. The probability distribution of the output vocabulary can be represented by a vector, where each element represents the probability of a word in the output vocabulary or the probability of a token corresponding to a word. Taking a segment from the input sequence as an example, the inference submodule 2024 inputs the segment into the AI model. The AI model can predict the probability that each token in the vocabulary is the next token of the segment through forward inference, thereby obtaining the probability distribution of the output vocabulary. The sorting submodule 2026 is used to determine the information content of at least one segment based on the probability distribution of the output vocabulary, sort the at least one segment in the input sequence according to its information content, and, based on the sorting result, select the target segment from at least one segment (e.g., a filtered segment).
[0098] It should be noted that when the segment filtering module 202 includes the filtering submodule 2022 but not the inference submodule 2024 and the sorting submodule 2026, or when the segment filtering module 202 does not enable information-based segment filtering, the segments retained by the filtering submodule 2022 can be used directly as target segments. In other words, the filtering submodule 2022 determines the relevance of at least one segment to the input sequence, and then determines the target segment based on that relevance. When the segment filtering module 202 includes the filtering submodule 2022, the inference submodule 2024, and the sorting submodule 2026, or when the segment filtering module 202 enables information-based segment filtering, the segments retained by the filtering submodule 2022 can be used as input to the inference submodule 2024, thereby achieving dual filtering of segments. When the segment filtering module 202 includes the inference submodule 2024 and the sorting submodule 2026 but not the filtering submodule 2022, or when the segment filtering module 202 does not enable relevance-based segment filtering, the segments in the input sequence can directly enter the inference submodule 2024 and the sorting submodule 2026 for information-based filtering.
[0099] In some possible implementations, the inference submodule 2024 can also output the attention (or attention score, attention rating) of tokens in at least one segment. Correspondingly, the key-value cache update module 204 can determine the key tokens in the target segment based on the attention of tokens in the target segment. For example, the key-value cache update module 204 can determine the token with the highest attention as the key token. As shown in Figure 2, the key-value cache update module 204 includes an identification submodule 2042 and a recalculation submodule 2044. The identification submodule 2042 determines the key tokens in the target segment based on the attention of tokens in the target segment. The recalculation submodule 2044 is used to update the key-value cache for a first segment in the target segment based on the key tokens of the first segment and related tokens, obtaining the updated key-value cache.
[0100] Based on the inference platform 20 shown in Figure 2, this application also provides a sequence inference method. The sequence inference method of this application is described in detail below with reference to the accompanying drawings.
[0101] Referring to Figure 3, a flowchart of a sequence inference method is shown. This method can be executed by the inference platform 20 and specifically includes the following steps:
[0102] S302. The inference platform 20 receives the input sequence.
[0103] The input sequence refers to the token sequence formed from user input. The tokens in the sequence are ordered, and this order can be spatial, meaning that the token corresponding to text input earlier precedes the token corresponding to text input later. Specifically, users can input text, which can be natural language text or code. This text can be tokenized using a tokenization scheme to form the input sequence. Sub-word granularity tokenization schemes can include, but are not limited to, Byte Pair Encoding (BPE) and WordPiece. These tokenization schemes automatically generate a word segmentation dictionary by statistically analyzing the frequency of substrings in the text, effectively addressing the out-of-vocabulary (OOV) word problem while maintaining a certain degree of semantic integrity. OOV words are words that did not appear during training but appear during testing.
[0104] It should be noted that this embodiment uses text as an example for input. In other possible implementations of this application, user input may also include one or more of images, voice, or video, and this application does not limit this.
[0105] S304. The inference platform 20 selects the target segment from at least one segment based on the evaluation metric of at least one segment in the input sequence.
[0106] The input sequence may include at least one segment. For example, when the input sequence is a long sequence, it can be divided into multiple segments. The inference platform 20 can divide the input sequence according to the set segment length to obtain at least one segment. Specifically, the inference platform 20 can use a uniform division method to divide the input sequence to obtain at least one segment. When the length of the input sequence is less than or equal to the segment length, the inference platform 20 can treat the input sequence as a single segment. When the length of the input sequence is greater than the segment length, the inference platform 20 can divide the input sequence according to the segment length to obtain multiple segments. The length of the last segment can be less than or equal to the set segment length.
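The uniform division described above can be sketched as follows. The function name `split_into_segments` is a hypothetical choice, and token IDs are represented as plain integers for illustration.

```python
def split_into_segments(tokens, segment_length):
    """Uniformly divide a token sequence; the last segment may be shorter."""
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    # A sequence no longer than the segment length is treated as a single segment.
    if len(tokens) <= segment_length:
        return [list(tokens)]
    return [
        list(tokens[start:start + segment_length])
        for start in range(0, len(tokens), segment_length)
    ]


segments = split_into_segments(list(range(10)), 4)
print([len(segment) for segment in segments])
```

With a 100K-token sequence and a segment length of 4K, this division yields the 25 segments used in the earlier example.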
[0107] The inference platform 20 can determine an evaluation metric for at least one segment, and select a target segment from the at least one segment based on the evaluation metric. The evaluation metric refers to an indicator that evaluates a segment from one or more dimensions. The evaluation metric may include at least one of the following: the relevance of the segment to the input sequence or the amount of information contained in the segment.
[0108] In some possible implementations, the inference platform 20 can perform segment filtering based on the correlation between at least one segment and the input sequence. The inference platform 20 can determine the correlation between at least one segment and the input sequence in various ways. A detailed description is provided below with reference to the accompanying drawings.
[0109] Referring to Figure 4, which shows a schematic diagram of correlation-based segment filtering, this example uses a long input sequence comprising N segments: segment 1…segment K…segment N. The inference platform 20 inputs at least one segment from the input sequence, such as segment 1 to segment N, into a first AI model (the original model) and, based on the dispersion of the shallow attention of the first AI model, determines the correlation between at least one segment and the input sequence. Here, the shallow attention of the AI model refers to the attention output by the shallow layers of the AI model. The shallow layers of the AI model refer to the first l layers, where l is a positive integer. Typically, l can be configured as 1, 2, or 3 based on empirical values. The shallow attention of the AI model can vary with the input data. Dispersion, also known as variability, can be measured by one or more of the following: range, interquartile range, mean difference, variance, standard deviation, and coefficient of variation. A higher dispersion indicates a lower correlation.
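Several of the dispersion measures listed above can be computed with the standard library, as in the sketch below. The text only states that a higher dispersion indicates a lower correlation, so the mapping `1 / (1 + dispersion)` in `relevance_from_dispersion` is an assumption made purely for illustration, as are the function names and the sample attention weights.

```python
import statistics


def dispersion(values, measure="std"):
    """Dispersion of a list of attention weights under the chosen measure."""
    if measure == "range":
        return max(values) - min(values)
    if measure == "variance":
        return statistics.pvariance(values)
    if measure == "std":
        return statistics.pstdev(values)
    if measure == "cv":  # coefficient of variation
        return statistics.pstdev(values) / statistics.fmean(values)
    raise ValueError(f"unknown measure: {measure}")


def relevance_from_dispersion(values, measure="std"):
    # Assumed mapping: any monotonically decreasing function of dispersion would do.
    return 1.0 / (1.0 + dispersion(values, measure))


flat_attention = [0.25, 0.25, 0.25, 0.25]   # hypothetical shallow-layer attention weights
peaked_attention = [0.7, 0.1, 0.1, 0.1]
print(relevance_from_dispersion(flat_attention) > relevance_from_dispersion(peaked_attention))
```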
[0110] Similarly, the inference platform 20 inputs at least one segment from the input sequence, such as segment 1 to segment N, into the second AI model. Based on the probability distribution output by the second AI model, it determines the information content of at least one segment and, based on the information content, determines the relevance of at least one segment to the input sequence. The second AI model may include the first m layers of the original AI model and an adapter layer, where m is a positive integer. For example, m can be 2 or 3. In some examples, the second AI model can be a draft model. This allows the platform to utilize a fixed, shallow subnetwork of the original model as a draft model and add a lightweight adapter layer to it to decide whether to perform further computation on the segment.
[0111] The inference platform 20 also inputs at least one segment from the input sequence, such as segment 1 to segment N, into the third AI model. Based on the semantic similarity between the at least one segment output by the third AI model and the input sequence, the relevance between at least one segment and the input sequence is determined. The third AI model is used to determine semantic similarity; specifically, it can be a dual-path model. One path of the dual-path model is used to determine the vector of the segment, and the other path is used to determine the vector of the input sequence. The semantic similarity between at least one segment and the input sequence can be a vector similarity obtained by vectorizing the segment and the input sequence. Vector similarity can be represented by vector distance. Vector distance can include, but is not limited to, Euclidean distance and cosine distance.
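The two vector measures named above can be sketched as follows, assuming the segment and the input sequence have already been vectorized by the two paths of the dual-path model. The vectors here are hypothetical placeholders, and the function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; cosine distance is 1 minus this value."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance; a smaller distance means a higher similarity."""
    return math.dist(a, b)


segment_vector = [0.2, 0.8, 0.1]    # hypothetical output of the segment path
sequence_vector = [0.25, 0.7, 0.0]  # hypothetical output of the input-sequence path
print(cosine_similarity(segment_vector, sequence_vector))
print(euclidean_distance(segment_vector, sequence_vector))
```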
[0112] In this application, for a candidate segment in the input sequence, when the relevance determined by the first AI model, the relevance determined by the second AI model, and the relevance determined by the third AI model are all higher than a threshold, the inference platform 20 retains the candidate segment, and then the inference platform 20 can obtain the target segment from the retained candidate segment.
[0113] Alternatively, the inference platform 20 can determine the overall correlation based on the correlation determined by the first AI model, the correlation determined by the second AI model, and the correlation determined by the third AI model. For example, the overall correlation can be determined by weighted operation. When the overall correlation is greater than a threshold, it means that the segment is strongly correlated with the input sequence, and the inference platform 20 can retain the candidate segment.
[0114] It should be noted that the inference platform 20 can calculate the rejection probability of a corresponding segment based on relevance (relevance score or relevance rating). As shown in Figure 4, the filtering submodule 2022 of the inference platform 20 can output rejection probabilities for the same segment using different relevance determination methods, such as rejection probability 1, rejection probability 2, and rejection probability 3. The inference platform 20 obtains a comprehensive rejection probability based on rejection probabilities 1, 2, and 3. When the comprehensive rejection probability is higher than the corresponding threshold, the segment is filtered. When the comprehensive rejection probability is lower than the threshold, the segment is retained.
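The combination of rejection probabilities can be sketched as follows. The text does not specify how the comprehensive rejection probability is computed, so the weighted average used here is an assumption borrowed from the weighted operation mentioned for the overall correlation; the function names, the weights, and the 0.5 threshold are hypothetical, and a probability exactly equal to the threshold is treated as filtered.

```python
def combined_rejection_probability(probabilities, weights):
    """Weighted average of per-method rejection probabilities (assumed combination rule)."""
    if len(probabilities) != len(weights):
        raise ValueError("one weight is required per rejection probability")
    total_weight = sum(weights)
    return sum(p * w for p, w in zip(probabilities, weights)) / total_weight


def keep_segment(probabilities, weights, threshold=0.5):
    """Retain the segment only when the combined rejection probability is below the threshold."""
    return combined_rejection_probability(probabilities, weights) < threshold


# Rejection probabilities 1, 2, and 3 for one segment (illustrative values).
print(keep_segment([0.1, 0.2, 0.3], [1.0, 1.0, 1.0]))
print(keep_segment([0.9, 0.8, 0.7], [1.0, 1.0, 1.0]))
```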
[0115] In other possible implementations, the greater the amount of information in the prediction of the first AI model (e.g., a large model), the higher the confidence and accuracy of the first AI model in its prediction. Based on this, the inference platform 20 can filter segments according to the information content of at least one segment. Specifically, referring to Figure 5, which shows a schematic diagram of segment sorting based on information content, the inference platform 20 can input at least one segment from the input sequence, such as segment 1 to segment N, in parallel into the first AI model for inference, obtaining the probability distribution of the output vocabulary in the pre-filling stage (the stage in which the first token is inferred). The probability distributions of the output vocabulary for the first token of segment 1 to segment N can be expressed as $\{P_1, \ldots, P_k, \ldots, P_N\}$, where $P_1$ is the probability distribution of the output vocabulary for the first token of segment 1, and so on, and $P_N$ is the probability distribution of the output vocabulary for the first token of segment N. In some possible implementations, the inference platform 20 can concatenate the segments with the input sequence, and then input the segments concatenated with the input sequence into the first AI model for inference.
[0116] The inference platform 20 can determine the information content of at least one segment based on the probability distribution of the output vocabulary. Specifically, the inference platform 20 can first determine the entropy (denoted as S) and the perplexity (PPL) of at least one segment based on the probability distribution of the output vocabulary. The entropy can be a conditional entropy: for two random variables X and Y, the conditional entropy is the entropy of Y given that X is known. In this inference scenario, the conditional entropy can be the entropy of generating the first token given a known segment. For a segment $T = \{T_1, \ldots, T_N\}$, where each $T_i$ represents a token and N is greater than 1, the probability that the first AI model, such as an LLM, generates this segment is $P(T_1, \ldots, T_N)$. Since the probability P is affected by the segment length, the perplexity can be obtained by taking the negative N-th root of this probability, as shown below:
$$\mathrm{PPL}(T) = P(T_1, \ldots, T_N)^{-\frac{1}{N}} \quad (1)$$
[0117] Where N is the segment length, PPL is the perplexity, and P is the probability. The higher the probability, the lower the perplexity.
[0118] Applying the chain rule to the probability in formula (1) gives:
$$P(T_1, \ldots, T_N) = P(T_1)\,P(T_2 \mid T_1)\,P(T_3 \mid T_1, T_2) \cdots P(T_N \mid T_1, \ldots, T_{N-1}) \quad (2)$$
[0119] The perplexity can then be rewritten as:
$$\mathrm{PPL}(T) = \left( \prod_{i=1}^{N} P(T_i \mid T_1, \ldots, T_{i-1}) \right)^{-\frac{1}{N}} = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log P(T_i \mid T_1, \ldots, T_{i-1}) \right) \quad (3)$$
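As a numerical illustration, the perplexity of a segment can be computed from the per-token conditional probabilities produced by the chain rule. The sketch below works in log space for numerical stability, which is a common choice rather than something the text prescribes; the function name `perplexity` and the sample probabilities are hypothetical.

```python
import math


def perplexity(conditional_probabilities):
    """Negative N-th root of the segment probability, computed via log probabilities.

    conditional_probabilities[i] is P(T_{i+1} | T_1, ..., T_i) for one segment.
    """
    n = len(conditional_probabilities)
    # Chain rule: log P(T_1, ..., T_N) is the sum of the conditional log probabilities.
    log_probability = sum(math.log(p) for p in conditional_probabilities)
    return math.exp(-log_probability / n)


confident = [0.9, 0.8, 0.95]   # hypothetical per-token probabilities
uncertain = [0.2, 0.1, 0.3]
print(perplexity(confident), perplexity(uncertain))
```

A segment whose tokens are predicted with higher probability receives a lower perplexity, in line with the relationship stated above.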
[0120] Then, the inference platform 20 can determine the information content of at least one segment based on the entropy and perplexity of at least one segment. Specifically, as shown below:
$$I_j = -w \cdot S_j - (1 - w) \cdot \mathrm{PPL}_j \quad (4)$$
[0121] Where $I_j$ represents the information content of the j-th segment, $S_j$ represents the entropy of the j-th segment, $\mathrm{PPL}_j$ represents the perplexity of the j-th segment, and w is the weight applied to the entropy term.
[0122] The inference platform 20 can sort at least one segment in the input sequence according to its information content, and select the target segment from the at least one segment based on the sorting result. As shown in Figure 5, the inference platform 20 can sort the segments according to their information content and select the top K segments as the target segments. The inference platform 20 can cache the key and value of the target segments for reuse in subsequent inference.
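The scoring and top-K selection can be sketched as follows, using formula (4) for the information content. The weight `w = 0.5`, the entropy and perplexity values, and the function names are illustrative assumptions; returning the selected indices in their original order is also an assumption, made so that later steps can process the target segments in spatial order.

```python
import math


def entropy(distribution):
    """Shannon entropy (in nats) of a probability distribution over the output vocabulary."""
    return -sum(p * math.log(p) for p in distribution if p > 0.0)


def information_content(segment_entropy, segment_perplexity, w):
    """Formula (4): I_j = -w * S_j - (1 - w) * PPL_j."""
    return -w * segment_entropy - (1.0 - w) * segment_perplexity


def select_top_k(scores, k):
    """Indices of the k highest-scoring segments, returned in their original order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Hypothetical (entropy, perplexity) pairs for four segments.
measurements = [(2.0, 4.0), (0.5, 1.5), (1.0, 3.0), (0.2, 0.8)]
scores = [information_content(s, ppl, w=0.5) for s, ppl in measurements]
print(select_top_k(scores, 2))
```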
[0123] It should be noted that the inference platform 20 can first filter at least one segment based on its relevance to the input sequence, thus performing a coarse screening of at least one segment in the input sequence. Then, based on the information content of the filtered segments, it selects the target segment from the filtered segments, thereby performing a fine screening of the coarsely screened segments (filtered segments) using information content. That is, the inference platform 20 can input the segments retained based on relevance coarse screening into the first AI model for inference, obtain the probability distribution of the output vocabulary of the pre-filling stage, determine the information content of the segments based on the probability distribution, and then sort the segments according to the information content. This can significantly reduce the computational load of the first AI model's inference, thereby accelerating the inference process.
[0124] In other possible implementations of the embodiments of this application, the inference platform 20 may also perform segment filtering by one of relevance or information content, or by other evaluation indicators, and this application does not limit this.
[0125] Furthermore, the inference platform 20 can also determine the key-value vectors of the tokens in the target segment and cache them to obtain a key-value cache. A key-value vector includes a key vector and a value vector. For any token in the target segment, the inference platform 20 can use the output of the token at layer t of the first AI model as the input to layer t+1 of the first AI model, and obtain the key-value vector of the token through a key-value weight matrix mapping.
[0126] The key-value weight matrix includes a key weight matrix (or simply key matrix) and a value weight matrix (or simply value matrix). The key weight matrix is a set of weights used to linearly transform the input vector (or input features) into a key vector, and the value weight matrix is a set of weights used to linearly transform the input vector into a value vector. These weight matrices can be part of the parameters of the first AI model and can be learned during the learning process of the first AI model. For example, the weight matrices can be updated and optimized during the training phase using the backpropagation algorithm.
[0127] Mapping refers to transforming a vector from one vector space to another. In this embodiment, the inference platform 20 can map the input of the (t+1)th layer to the vector space of key vectors using the key weight matrix, thereby obtaining a key vector, and can map the input of the (t+1)th layer to the vector space of value vectors using the value weight matrix, thereby obtaining a value vector. Here, a vector space is a set that is closed under vector addition and scalar multiplication, in this case over the real numbers.
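The mapping from a layer input to a cached key-value vector can be sketched as a pair of matrix-vector products. The sketch assumes row-major weight matrices of shape (output dimension × input dimension) and a cache keyed by (layer, token position); the names `matvec`, `project_key_value`, and `cache_token` and the tiny 2×2 matrices are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (out_dim x in_dim) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def project_key_value(layer_input, key_weights, value_weights):
    """Linearly map the input of a layer to its key vector and value vector."""
    return matvec(key_weights, layer_input), matvec(value_weights, layer_input)


def cache_token(kv_cache, layer, position, layer_input, key_weights, value_weights):
    """Store the key-value vector of one token for one layer."""
    kv_cache[(layer, position)] = project_key_value(layer_input, key_weights, value_weights)


kv_cache = {}
layer_input = [1.0, 2.0]                 # output of layer t, used as input to layer t+1
key_weights = [[1.0, 0.0], [0.0, 1.0]]   # illustrative key weight matrix
value_weights = [[2.0, 0.0], [0.0, 3.0]] # illustrative value weight matrix
cache_token(kv_cache, 1, 0, layer_input, key_weights, value_weights)
print(kv_cache[(1, 0)])
```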
[0128] The inference platform 20 caches key-value vectors so that they can be reused from the KV cache during the decoding phase, replacing recalculation with lookup. Furthermore, for long-sequence inference, considering the limited cache space, the inference platform 20 can also persist the key-value vectors, for example, by performing KV cache offloading or persistence. When repeated or similar input sequences are subsequently received, the key-value vectors can be read from the persistent medium (such as a hard disk) and KV cache retrieval can be performed to achieve KV cache reuse. This improves inference efficiency and accelerates inference.
[0129] S306. The inference platform 20 determines the key token in the target segment.
[0130] A key token is a token that the first AI model focuses on in a target segment. In a target segment, a key token is more important than other tokens. In some examples, importance can be represented by attention. Based on this, the inference platform can determine the key tokens in a target segment based on the attention of tokens within that segment. For example, when attention is represented by an attention score, for any segment in the target segment, the inference platform 20 can identify the token with the highest attention score as the key token in that segment. Alternatively, for any segment in the target segment, the inference platform 20 can compare the attention scores of tokens in the segment with a set value; if a token's attention score is greater than the set value, that token is identified as the key token in that segment.
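Both selection rules described above can be sketched as follows. The attention scores are hypothetical per-token values for one segment, and the function names and the set value of 0.25 are illustrative assumptions.

```python
def key_tokens_by_max(attention_scores):
    """Index of the single token with the highest attention score."""
    best = max(range(len(attention_scores)), key=lambda i: attention_scores[i])
    return [best]


def key_tokens_by_threshold(attention_scores, set_value):
    """Indices of all tokens whose attention score is greater than the set value."""
    return [i for i, score in enumerate(attention_scores) if score > set_value]


scores = [0.1, 0.6, 0.3]  # hypothetical attention scores of the tokens in one segment
print(key_tokens_by_max(scores))
print(key_tokens_by_threshold(scores, 0.25))
```

The threshold rule can return several key tokens or none at all, whereas the maximum rule always returns exactly one.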
[0131] S308. For the first segment in the target segment, the inference platform 20 updates the key-value cache based on the key token and related tokens of the first segment, and obtains the updated key-value cache.
[0132] The first segment can be any segment in the target segment. In practice, the inference platform 20 can update the key-value cache sequentially according to the spatial order of the segments in the target segment. The key-value cache is used to accelerate inference during the decoding stage. Based on this, the first segment can be the j-th segment in the target segment, where j is greater than or equal to 1. After the key-value cache of the j-th segment has been updated, the first segment can be the (j+1)-th segment in the target segment.
[0133] Relevant tokens include tokens preceding the key token in the first segment or tokens in the second segment. The first and second segments are segments of the input sequence. Since the input sequence is ordered, the segments within the input sequence are also ordered, and the tokens within the segments are also ordered. The order of tokens within a segment, and the order between segments, can be spatial order. Tokens preceding the key token in the first segment refer to tokens that spatially precede the key token in the first segment. Similarly, the second segment can be a segment in the target segment that precedes the first segment.
[0134] For example, if the first segment is the j-th segment in the target segment, and the key token in the first segment is the k-th token in that segment, then the related tokens may include the tokens of the first segment to the (j-1)-th segment in the target segment, as well as the first token to the (k-1)-th token in the j-th segment.
[0135] At layer t of the first AI model, the inference platform 20 can determine the relevance between key tokens and related tokens based on the key tokens and related tokens of the first segment. Related tokens can include multiple tokens, and the relevance between key tokens and related tokens can be the relevance between the key token and each related token. In some examples, key tokens can also include multiple tokens; therefore, the relevance between key tokens and related tokens can be the relevance between each key token and each related token. For example, if the j-th segment in the target segment includes q key tokens, for each of the q key tokens, the inference platform 20 can determine the relevance between that key token and each of its related tokens, thus obtaining $n_1 + n_2 + \ldots + n_q$ relevance scores. Here, $n_1$ represents the number of related tokens of the first key token of the j-th segment, $n_2$ represents the number of related tokens of the second key token of the j-th segment, and so on, and $n_q$ represents the number of related tokens of the q-th key token of the j-th segment.
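The set of related tokens for one key token can be enumerated as in the sketch below. It assumes 1-based segment and token indices and takes the tokens preceding the key token from the j-th segment itself; the function name `related_tokens` and the segment lengths are hypothetical.

```python
def related_tokens(segment_lengths, j, k):
    """(segment, token) pairs related to the k-th token of the j-th target segment.

    Indices are 1-based; segment_lengths[s - 1] is the number of tokens in segment s.
    """
    # All tokens of the preceding target segments 1 .. j-1 (cross-segment tokens).
    related = [
        (segment, token)
        for segment in range(1, j)
        for token in range(1, segment_lengths[segment - 1] + 1)
    ]
    # Tokens 1 .. k-1 of the j-th segment itself (intra-segment tokens).
    related += [(j, token) for token in range(1, k)]
    return related


print(related_tokens([2, 3], j=2, k=3))
```

When the key token is the first token of the first segment, the list is empty, since nothing precedes it.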
[0136] Relevance is used to measure the degree of association between tokens. Related tokens can include at least one of tokens within the same segment or tokens across segments. Based on this, relevance can include intra-segment relevance or cross-segment relevance. Cross-segment relevance, also known as inter-segment relevance, can be illustrated using a key token in the first segment as an example. Cross-segment relevance can be the relevance between a key token in the first segment and related tokens of that key token in the second segment. In some examples, cross-segment relevance can be cross-attention. Cross-attention can be obtained by calculating the similarity between the query vector of the key token and the key vector of the related tokens. For example, the inference platform 20 can calculate the similarity between the query vector of the key token and the key vector of the related tokens. This similarity can be represented by vector distance, which includes, but is not limited to, Euclidean distance and cosine distance. Then, the inference platform 20 can normalize the vector distance, for example, by using a softmax function, to obtain cross-attention. It should be noted that intra-segment relevance can also be determined in a similar way, and this application does not limit this.
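The computation of cross-attention weights can be sketched as follows. The text leaves the similarity measure open (Euclidean distance and cosine distance are given as examples); this sketch swaps in the scaled dot product used by standard attention as the similarity, which is an assumption, and then normalizes with softmax as described. The function names and vectors are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(scores)
    exponentials = [math.exp(score - peak) for score in scores]
    total = sum(exponentials)
    return [value / total for value in exponentials]


def cross_attention_weights(query, related_keys):
    """Normalized similarity between a key token's query and each related token's key."""
    scale = math.sqrt(len(query))  # scaled dot-product similarity (assumed measure)
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in related_keys]
    return softmax(scores)


query = [1.0, 0.0]                                   # query vector of the key token
related_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # key vectors of the related tokens
print(cross_attention_weights(query, related_keys))
```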
[0137] Then, the inference platform 20 can obtain the updated key-value cache of the (t+1)th layer of the first AI model based on the relevance. Specifically, the inference platform 20 can obtain the output of the tth layer of the first AI model based on the relevance. For example, the inference platform 20 can use the relevance as a weight to perform a weighted operation on the value vectors of relevant tokens, thereby obtaining the output of the tth layer of the first AI model. This output incorporates information from other fragments and can therefore be called a fused output. The output of the tth layer can be used as the input of the (t+1)th layer. Based on the output of the tth layer, the inference platform 20 can use the key-value weight matrix to map and obtain an updated key-value vector, and then use the updated key-value vector to replace the key-value vector cached in the (t+1)th layer of the first AI model, thereby obtaining the updated key-value cache of the (t+1)th layer of the first AI model. In this embodiment, the inference platform 20 can map the input of the (t+1)th layer to the vector space of key vectors using the key weight matrix, thereby obtaining an updated key vector, and map the input of the (t+1)th layer to the vector space of value vectors using the value weight matrix, thereby obtaining an updated value vector.
[0138] To facilitate understanding, an example is provided below. Referring to Figure 6, which illustrates an update of the key-value cache, the inference platform 20 uses the attention scores obtained from each layer in the pre-filling stage to identify key tokens in the target segment. Take one key token, token i, as an example, where token i is the token ranked i-th across all segments of the target segment. At layer t, the inference platform 20 recalculates the cross-attention between token i and its preceding related tokens (such as token 1 to token i-1), thereby enabling token i to acquire information from other segments and forming the output O′_i of layer t of the first AI model. The inference platform 20 feeds the output O′_i of layer t into layer t+1, maps it using the key-value weight matrices W_k and W_v to obtain the updated key-value vectors K′_i and V′_i, and uses K′_i and V′_i to replace the K_i and V_i originally cached in layer t+1 of the first AI model.
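The update of the key-value cache for one key token can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of this application: the layer output is taken to be only the relevance-weighted sum of value vectors (residual connections, normalization, and feed-forward sublayers are omitted), the cache is a plain dictionary keyed by (layer, token), and all names (`fused_output`, `update_next_layer_cache`, `w_k`, `w_v`) are hypothetical.

```python
def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def fused_output(relevance, values):
    """Layer-t output O'_i: relevance-weighted sum of the related tokens' value vectors."""
    dim = len(values[0])
    return [sum(r * v[d] for r, v in zip(relevance, values)) for d in range(dim)]

def update_next_layer_cache(cache, layer, token, o_i, w_k, w_v):
    """Map O'_i with W_k and W_v and overwrite the (K_i, V_i) cached at layer t+1."""
    k_new = matvec(w_k, o_i)
    v_new = matvec(w_v, o_i)
    cache[(layer + 1, token)] = (k_new, v_new)
    return k_new, v_new

# Toy data: token 5 attends to two related tokens at layer 0.
rel = [0.75, 0.25]
vals = [[1.0, 0.0], [0.0, 1.0]]
o_5 = fused_output(rel, vals)

w_k = [[1.0, 0.0], [0.0, 1.0]]  # assumed key weight matrix (identity, for readability)
w_v = [[2.0, 0.0], [0.0, 2.0]]  # assumed value weight matrix

kv_cache = {(1, 5): ([0.0, 0.0], [0.0, 0.0])}  # stale entry from the pre-filling stage
update_next_layer_cache(kv_cache, layer=0, token=5, o_i=o_5, w_k=w_k, w_v=w_v)
```

After the call, the entry for token 5 at layer 1 holds vectors derived from the fused output rather than the vectors computed from the segment in isolation.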
[0139] In the pre-filling stage, the key-value cache obtained by the first AI model for each fragment does not contain information from other fragments. This method improves the overall accuracy by combining other fragments to update the key-value cache of some key tokens in each layer.
[0140] It should be noted that the inference platform 20 can also persistently store the identifier of the key token in the target fragment together with the updated key-value vector, for example, by performing KV cache offloading (moving the data from memory to storage) or KV cache persistence. When a repeated inference task is performed later, the identifier of the key token and the updated key-value vector can be read back into memory, and KV cache retrieval and KV cache reuse can be performed, which reduces unnecessary calculations by replacing recomputation with lookup.
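A minimal sketch of offloading and reloading the updated key-value vectors is given below. The JSON file layout, the file name `kv_cache.json`, and the functions `offload_kv` and `load_kv` are assumptions made for illustration only; a production system would more likely use a binary tensor format and a dedicated storage tier.

```python
import json
import os
import tempfile

def offload_kv(path, entries):
    """Persist {key_token_id: (key_vector, value_vector)} from memory to storage."""
    payload = {str(tok): {"k": k, "v": v} for tok, (k, v) in entries.items()}
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)

def load_kv(path):
    """Read persisted key-value vectors back into memory for reuse."""
    with open(path, "r", encoding="utf-8") as fh:
        payload = json.load(fh)
    return {int(tok): (entry["k"], entry["v"]) for tok, entry in payload.items()}

# Offload the updated vectors of key token 7, then restore them for a repeated task.
store_path = os.path.join(tempfile.mkdtemp(), "kv_cache.json")
updated = {7: ([0.1, 0.2], [0.3, 0.4])}
offload_kv(store_path, updated)
restored = load_kv(store_path)
```

Keying the stored entries by the key token identifier is what allows a later task to look up the vectors instead of recomputing them.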
[0141] S310, the inference platform 20 infers and generates the output sequence based on the updated key-value cache.
[0142] Specifically, the inference platform 20 can perform incremental pre-filling and full decoding based on the updated key-value cache to obtain the output sequence. For example, the inference platform 20 can use the updated key-value cache as context, concatenate the context with the input sequence, and input the concatenation result as a prompt into the first AI model, such as an LLM, to perform incremental pre-filling and full decoding and obtain the output sequence. Incremental pre-filling refers to generating the first token for each segment in the target segment using the updated key-value cache. Full decoding generates the remaining tokens for each segment of the input sequence by querying the updated key-value cache.
[0143] Based on the above description, this application provides a sequence reasoning method. This method filters the input sequence into segments, identifies key tokens in the selected target segments, and recalculates the key-value vectors of the key tokens by combining them with other segments preceding the segment containing the key token, thereby updating the KV Cache. The updated KV Cache considers the correlation between segments; therefore, reasoning based on the updated KV Cache can improve the accuracy of the generated output sequence. This method avoids a significant drop in accuracy while maintaining efficiency, ensuring accuracy stability. Furthermore, by filtering important segments, this method can significantly reduce the GPU memory used for reasoning, enabling reasoning for infinitely long sequences. In addition, by filtering the input sequence into segments and updating the KV Cache for key tokens in the selected target segments, the key-value vectors of highly relevant tokens can be cached, improving cache hit rate, thereby increasing KV Cache reuse and improving reasoning performance.
[0144] To facilitate understanding, the sequence reasoning method of this application will be introduced below with a specific application scenario.
[0145] Refer to Figure 7 for an application scenario of a sequence reasoning method. This scenario uses a text sequence as the input sequence for illustration and includes the following steps:
[0146] S702, the inference platform 20 receives a text sequence.
[0147] The text sequence can be a sequence obtained by tokenizing text.
[0148] S704. Inference platform 20 determines whether the length of the text sequence exceeds the limit. If yes, execute S706; otherwise, execute S726.
[0149] If the length of the text sequence exceeds the limit, it indicates that the inference for the text sequence is a long sequence inference, and steps S706 to S724 can be executed to improve inference efficiency while maintaining or slightly reducing inference accuracy. If the length of the text sequence does not exceed the limit, it indicates that the inference for the text sequence is a short sequence inference, and the AI model can be used directly for inference, executing the normal inference process.
[0150] It should be noted that S704 described above is an optional step in the embodiments of this application, and the sequence reasoning method of this application may or may not execute S704. In other words, the sequence reasoning method of this application can directly perform reasoning without confirming whether the text sequence is a long sequence or a short sequence, and the reasoning platform 20 can unify the sequence reasoning process for long sequences and short sequences.
[0151] S706, the inference platform 20 concatenates fragments from the text sequence with the text sequence to determine the correlation between the fragments and the text sequence.
[0152] The inference platform 20 can determine the relevance between the fragment and the text sequence through various methods, models, and plugins. For specific implementation details, please refer to the description of the embodiment shown in Figure 3. Furthermore, when determining the relevance between the fragment and the text sequence, the inference platform 20 may not need to concatenate the fragment and the text sequence. For example, the inference platform 20 can input the fragment and the text sequence separately into the model to determine the relevance.
[0153] S708. For each segment in the text sequence, the inference platform 20 determines whether the correlation between the segment and the text sequence is greater than a threshold. If yes, then execute S710; otherwise, execute S712.
[0154] S710, the inference platform 20 retains the fragment.
[0155] S712, the inference platform 20 discards the fragment.
[0156] S714, the inference platform 20 inputs the retained fragments into the LLM for inference, records the probability distribution over the output vocabulary when generating the first token, and determines the information content based on the probability distribution. Furthermore, the inference platform 20 caches the key-value vectors and attention from the pre-filling stage.
[0157] S716, the inference platform 20 sorts the retained fragments according to the amount of information to obtain the top K target fragments in terms of information content.
[0158] S718, the inference platform 20 determines the key token of each segment in the target segment based on the attention cached in the pre-filling stage, recalculates the key-value vector of the key token based on the key token and related tokens of each segment in the target segment, and updates the key-value cache of the key token.
[0159] The specific implementation of S714 to S718 described above can be found in the description of the embodiment shown in Figure 3, and will not be repeated here.
[0160] S720, the inference platform 20 uses the updated key-value cache as context.
[0161] S722, the inference platform 20 concatenates the context and the text sequence, and performs incremental pre-filling and full decoding based on the concatenation result to obtain the answer.
[0162] S724, the reasoning platform 20 outputs the answer to the user.
[0163] The answer can be the output text sequence. The output text sequence can be obtained by detokenizing the output token sequence.
[0164] S726, the inference platform 20 performs normal inference.
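The control flow of steps S704 to S716 can be summarized in a short sketch. The relevance and information values are assumed to have been computed elsewhere and are passed in as dictionaries; the function names `select_target_segments` and `run_inference`, and the callables standing in for the long-sequence and normal inference paths, are hypothetical.

```python
def select_target_segments(segments, relevance, information, threshold, top_k):
    """S708 to S716: retain segments above the threshold, then keep the top K by information."""
    retained = [s for s in segments if relevance[s] > threshold]  # S710 / S712
    ranked = sorted(retained, key=lambda s: information[s], reverse=True)  # S716
    return ranked[:top_k]

def run_inference(tokens, max_len, long_path, normal_path):
    """S704: dispatch to the long-sequence path only when the length limit is exceeded."""
    if len(tokens) > max_len:
        return long_path(tokens)
    return normal_path(tokens)

# Toy data: four segments with assumed relevance and information scores.
segments = ["a", "b", "c", "d"]
relevance = {"a": 0.9, "b": 0.2, "c": 0.7, "d": 0.8}
information = {"a": 1.0, "b": 5.0, "c": 3.0, "d": 2.0}
targets = select_target_segments(segments, relevance, information, threshold=0.5, top_k=2)

route = run_inference(list(range(10)), 4, lambda t: "long", lambda t: "normal")
```

Segment "b" has the highest information score but is discarded because its relevance is below the threshold, which reflects the order of the two filtering stages.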
[0165] By filtering segments of the input text sequence and reusing key-value caches, this method improves inference efficiency, including reducing inference latency and increasing inference throughput, with minimal loss of inference accuracy. Tests using a 128KB long sequence demonstrate that the proposed sequence inference method achieves approximately a 10-fold speedup in the pre-filling stage while maintaining consistent accuracy. Furthermore, by filtering important segments, this method significantly reduces the GPU memory used for inference, enabling inference of infinitely long sequences.
[0166] Based on the aforementioned sequence reasoning method, this application also provides a reasoning platform 20. The structure of the reasoning platform 20 of this application will be described below with reference to the accompanying drawings.
[0167] Referring to Figure 2, which shows a schematic diagram of the structure of an inference platform 20, the inference platform 20 includes:
[0168] The segment filtering module 202 is used to receive an input sequence and filter target segments from at least one segment according to the evaluation index of at least one segment in the input sequence.
[0169] The key-value cache update module 204 is used to determine the key token in the target fragment. For the first fragment in the target fragment, the key-value cache is updated according to the key token of the first fragment and related tokens to obtain the updated key-value cache. The related tokens include tokens in the first fragment before the key token of the first fragment or tokens in the second fragment. The second fragment is the fragment in the target fragment that is located before the first fragment.
[0170] The inference module 206 is used to infer and generate the output sequence based on the updated key-value cache.
[0171] For example, the above-mentioned fragment filtering module 202, key-value cache update module 204, and inference module 206 can be implemented in hardware or in software.
[0172] When implemented in software, the fragment filtering module 202, key-value cache update module 204, and inference module 206 can be applications running on computing devices, such as computing engines. These applications can also be virtualized and provided to users as virtualization services. Virtualization services can include virtual machine (VM) services, bare metal server (BMS) services, or container services. VM services can be services that use virtualization technology to create virtual machine (VM) resource pools on multiple physical hosts to provide VMs for users to use on demand. BMS services are services that use virtualization technology to create BMS resource pools on multiple physical hosts to provide BMS for users to use on demand. Container services are services that use virtualization technology to create container resource pools on multiple physical hosts to provide containers for users to use on demand. A VM is a simulated virtual computer, that is, a logical computer. A BMS is a scalable, high-performance computing service with computing performance indistinguishable from traditional physical machines and features secure physical isolation. Containers are a kernel virtualization technology that provides lightweight virtualization to isolate user space, processes, and resources. It should be understood that the VM service, BMS service, and container service mentioned above are merely specific examples. In practical applications, virtualization services can also include other lightweight or heavyweight virtualization services, which are not specifically limited here.
[0173] When implemented in hardware, the fragment filtering module 202, key-value cache update module 204, and inference module 206 may include at least one computing device, such as a server. Alternatively, the fragment filtering module 202, key-value cache update module 204, and inference module 206 may also be implemented using an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The PLD may be a complex programmable logical device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.
[0174] In some possible implementations, the key-value cache update module 204 is specifically used for:
[0175] In the t-th layer of the first artificial intelligence (AI) model, the relevance between the key token and the related token is determined based on the key token and related tokens of the first fragment.
[0176] Based on the relevance, obtain the updated key-value cache of the (t+1)th layer of the first AI model.
[0177] In some possible implementations, the key-value cache update module 204 is specifically used for:
[0178] Based on the relevance, obtain the output of the t-th layer of the first AI model;
[0179] Based on the output of layer t, the updated key-value vector is obtained by mapping using the key-value weight matrix. The updated key-value vector is then used to replace the key-value vector cached in layer t+1 of the first AI model, so as to obtain the updated key-value cache of layer t+1 of the first AI model.
[0180] In some possible implementations, the fragment filtering module 202 is specifically used for:
[0181] Based on the correlation between at least one segment in the input sequence and the input sequence, filter at least one segment to obtain filtered segments;
[0182] Based on the amount of information in the filtered segments, the target segments are obtained by further selection of the filtered segments.
[0183] In some possible implementations, the fragment filtering module 202 is also used for:
[0184] Input at least one segment from the input sequence into a first artificial intelligence (AI) model, and determine the relevance of at least one segment to the input sequence based on the degree of dispersion of the shallow attention of the first AI model; and / or,
[0185] At least one segment from the input sequence is input into a second AI model. The information content of the at least one segment is determined based on the probability distribution output by the second AI model. The correlation between the at least one segment and the input sequence is determined based on the information content. The second AI model comprises the first m layers of the first AI model and an adapter layer, where m is a positive integer; and / or,
[0186] At least one segment from the input sequence is input into a third AI model. Based on the semantic similarity between at least one segment output by the third AI model and the input sequence, the relevance of at least one segment to the input sequence is determined.
[0187] In some possible implementations, the fragment filtering module 202 is specifically used for:
[0188] For a segment in the input sequence, if the relevance determined by the first AI model, the second AI model, and the third AI model are all higher than the threshold, then the candidate segment is retained.
[0189] In some possible implementations, the fragment filtering module 202 is also used for:
[0190] The filtered segments are input into the first artificial intelligence (AI) model for inference to obtain the probability distribution of the output vocabulary in the pre-filling stage;
[0191] The information content of the filtered segments is determined based on the probability distribution of the output vocabulary.
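One possible way to turn a probability distribution over the output vocabulary into an information measure is Shannon entropy. This application does not state which measure is used, so the choice of entropy and the function name `information_content` are assumptions made for illustration.

```python
import math

def information_content(probs):
    """Shannon entropy, in bits, of a probability distribution over the vocabulary."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

# A uniform distribution over four tokens versus a fully certain prediction.
uniform_bits = information_content([0.25, 0.25, 0.25, 0.25])
certain_bits = information_content([1.0, 0.0, 0.0, 0.0])
```

Whether a higher or lower value should rank a segment higher is a design decision that this sketch leaves open.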
[0192] In some possible implementations, the key-value cache update module 204 is also used for:
[0193] Obtain the attention of the tokens in the target segment, where the attention is obtained by the first artificial intelligence model performing inference on the target segment;
[0194] The key-value cache update module 204 is specifically used for:
[0195] Identify the key tokens in the target fragment based on attention to the tokens in the target fragment.
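Identifying key tokens from attention can be sketched as a top-k selection. The sketch assumes that a token's importance is the total attention it receives from all tokens in the segment and that a fixed number of key tokens is kept; both the scoring rule and the name `key_tokens` are assumptions, not details given in this application.

```python
def key_tokens(attention, num_key):
    """Return the indices of the num_key tokens that receive the most attention.

    attention[q][k] is the attention that token q pays to token k.
    """
    n = len(attention[0])
    received = [sum(row[k] for row in attention) for k in range(n)]
    ranked = sorted(range(n), key=lambda k: received[k], reverse=True)
    return sorted(ranked[:num_key])  # restore sequence order

# Toy causal attention matrix for a three-token segment.
attn = [
    [1.0, 0.0, 0.0],
    [0.7, 0.3, 0.0],
    [0.6, 0.1, 0.3],
]
picked = key_tokens(attn, 2)
```

Returning the indices in sequence order keeps the later recalculation consistent with the rule that related tokens precede the key token.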
[0196] In some possible implementations, the key-value cache update module 204 is also used for:
[0197] Determine the key-value vector of the token in the target fragment and cache the key-value vector to obtain the key-value cache.
[0198] In some possible implementations, the inference module 206 is specifically used for:
[0199] Based on the updated key-value cache, perform incremental pre-filling and full decoding to obtain the output sequence.
[0200] This application also provides a computing device 800. As shown in Figure 8, the computing device 800 includes: a bus 802, a processor 804, a memory 806, and a communication interface 808. The processor 804, the memory 806, and the communication interface 808 communicate with each other via the bus 802. The computing device 800 may be a server or a terminal device. It should be understood that this application does not limit the number of processors and memories in the computing device 800.
[0201] Bus 802 can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. Buses can be categorized as address buses, data buses, control buses, etc. For ease of illustration, only one line is used in Figure 8, but this does not imply that there is only one bus or one type of bus. Bus 802 can include pathways for transmitting information between various components of computing device 800 (e.g., memory 806, processor 804, communication interface 808).
[0202] Processor 804 may include any one or more processors such as a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
[0203] Memory 806 may include volatile memory, such as random access memory (RAM). Memory 806 may also include non-volatile memory, such as read-only memory (ROM), flash memory, hard disk drive (HDD), or solid-state drive (SSD). Memory 806 stores executable program code, which processor 804 executes to implement the aforementioned sequence reasoning method. Specifically, memory 806 stores instructions for the inference platform 20 to execute the sequence reasoning method.
[0204] The communication interface 808 uses transceiver modules such as, but not limited to, network interface cards and transceivers to enable communication between the computing device 800 and other devices or communication networks.
[0205] This application also provides a computing device cluster. The computing device cluster includes at least one computing device. The computing device can be a server, such as a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device can also be a terminal device such as a desktop computer, a laptop computer, or a smartphone.
[0206] As shown in Figure 9, the computing device cluster includes at least one computing device 800. The memory 806 of one or more computing devices 800 in the computing device cluster may store the same instructions used by the inference platform 20 to execute the sequence reasoning method.
[0207] In some possible implementations, one or more computing devices 800 in the computing device cluster can also be used to execute some of the instructions used by the inference platform 20 to execute the sequence inference method. In other words, a combination of one or more computing devices 800 can jointly execute the instructions used by the inference platform 20 to execute the sequence inference method.
[0208] It should be noted that the memory 806 in different computing devices 800 in the computing device cluster can store different instructions for executing some functions of the inference platform 20.
[0209] Figure 10 illustrates one possible implementation. As shown in Figure 10, two computing devices 800A and 800B are connected via a communication interface 808. The memory in computing device 800A stores instructions for executing the functions of the fragment filtering module 202. The memory in computing device 800B stores instructions for executing the functions of the key-value cache update module 204 and the inference module 206. In other words, the memory 806 of computing devices 800A and 800B jointly stores the instructions used by the inference platform 20 to execute the sequence inference method.
[0210] The connection method between the computing devices shown in Figure 10 takes into account that the sequence reasoning method provided in this application requires a large amount of resources for KV cache recalculation and reuse. Therefore, the functions implemented by the key-value cache update module 204 and the reasoning module 206 can be assigned to a separate computing device. For example, the functions implemented by the fragment filtering module 202 can be executed by computing device 800A, and the functions implemented by the key-value cache update module 204 and the reasoning module 206 can be executed by computing device 800B.
[0211] It should be understood that the functions of computing device 800A shown in Figure 10 can also be performed by multiple computing devices 800. Similarly, the functions of computing device 800B can also be performed by multiple computing devices 800.
[0212] In some possible implementations, one or more computing devices in a computing device cluster can be connected via a network. This network can be a wide area network (WAN) or a local area network (LAN), etc. Figure 11 illustrates one possible implementation. As shown in Figure 11, two computing devices 800C and 800D are connected via a network. Specifically, they are connected to the network through communication interfaces in each computing device. In this type of possible implementation, the memory 806 in computing device 800C stores instructions for executing the functions of the fragment filtering module 202. Simultaneously, the memory 806 in computing device 800D stores instructions for executing the functions of the key-value cache update module 204 and the inference module 206.
[0213] The connection method between the computing devices shown in Figure 11 takes into account that the sequence reasoning method provided in this application requires a large amount of resources for KV cache recalculation and reuse. Therefore, the functions implemented by the fragment filtering module 202 can be assigned to a separate computing device, such as computing device 800C, and the functions implemented by the key-value cache update module 204 and the reasoning module 206 can be assigned to computing device 800D.
[0214] It should be understood that the functions of the computing device 800C shown in Figure 11 can also be performed by multiple computing devices 800. Similarly, the functions of the computing device 800D can also be performed by multiple computing devices 800.
[0215] This application also provides a computer-readable storage medium. The computer-readable storage medium can be any available medium on which a computing device can store data, or a data storage device, such as a data center, containing one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state drive). The computer-readable storage medium includes instructions that cause the computing device to execute the sequence reasoning method described above as applied to the inference platform 20.
[0216] This application also provides a computer program product containing instructions. The computer program product may be a software or program product containing instructions, capable of running on a computing device or stored on any usable medium. When the computer program product is run on at least one computing device, it causes the at least one computing device to perform the above-described sequence reasoning method.
[0217] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the protection scope of the technical solutions of the embodiments of the present invention.
Claims
1. A sequence inference method, characterized in that, The method includes: Receiving an input sequence; Selecting a target segment from at least one segment in the input sequence based on an evaluation index of the at least one segment; Identifying the key tokens in the target segment; For a first segment in the target segment, updating the key-value cache according to the key token of the first segment and related tokens to obtain an updated key-value cache, where the related tokens include tokens in the first segment that precede the key token of the first segment, or tokens in a second segment, and the second segment is a segment in the target segment that is located before the first segment; Inferring and generating an output sequence based on the updated key-value cache.
2. The method according to claim 1, characterized in that, The step of updating the key-value cache based on the key token and related tokens of the first fragment to obtain the updated key-value cache includes: At layer t of the first artificial intelligence (AI) model, the relevance between the key token and the related token is determined based on the key token and related token of the first fragment. Based on the relevance, the updated key-value cache of the (t+1)th layer of the first AI model is obtained.
3. The method according to claim 2, characterized in that, The step of obtaining the updated key-value cache of the (t+1)th layer of the first AI model based on the relevance includes: Based on the relevance, the output of the t-th layer of the first AI model is obtained; Based on the output of the t-th layer, an updated key-value vector is obtained by mapping using the key-value weight matrix. The updated key-value vector is then used to replace the key-value vector cached in the (t+1)-th layer of the first AI model, thereby obtaining the updated key-value cache of the (t+1)-th layer of the first AI model.
4. The method according to any one of claims 1 to 3, characterized in that, The step of selecting target segments from at least one segment in the input sequence based on an evaluation metric includes: Based on the correlation between at least one segment in the input sequence and the input sequence, the at least one segment is filtered to obtain a filtered segment; Based on the amount of information in the filtered segments, the filtered segments are further selected to obtain the target segments.
5. The method according to claim 4, characterized in that, The method further includes: At least one segment from the input sequence is input into a first artificial intelligence (AI) model, and the relevance of the at least one segment to the input sequence is determined based on the degree of dispersion of the shallow attention of the first AI model; and / or, At least one segment from the input sequence is input into a second AI model. The information content of the at least one segment is determined based on the probability distribution output by the second AI model. The correlation between the at least one segment and the input sequence is determined based on the information content. The second AI model includes the first m layers of the first AI model and an adapter layer, where m is a positive integer; and / or, At least one segment of the input sequence is input into a third AI model, and the relevance of the at least one segment to the input sequence is determined based on the semantic similarity between the at least one segment output by the third AI model and the input sequence.
6. The method according to claim 5, characterized in that, The step of filtering at least one segment of the input sequence to obtain filtered segments based on the correlation between at least one segment of the input sequence and the input sequence includes: For a segment in the input sequence, if the relevance determined by the first AI model, the second AI model, and the third AI model are all higher than a threshold, then the candidate segment is retained.
7. The method according to any one of claims 4 to 6, characterized in that, The method further includes: The filtered fragments are input into the first artificial intelligence (AI) model for inference to obtain the probability distribution of the output vocabulary in the pre-filling stage; The information content of the filtered segments is determined based on the probability distribution of the output vocabulary.
8. The method according to any one of claims 1 to 7, characterized in that, The method further includes: Obtaining the attention of the tokens in the target segment, where the attention is obtained by the first artificial intelligence model performing inference on the target segment; The determination of the key token in the target fragment includes: Based on the attention of the tokens in the target fragment, the key tokens in the target fragment are determined.
9. The method according to any one of claims 1 to 8, characterized in that, The method further includes: Determine the key-value vector of the token in the target fragment, and cache the key-value vector to obtain the key-value cache.
10. The method according to any one of claims 1 to 9, characterized in that, The step of inferring and generating the output sequence based on the updated key-value cache includes: Based on the updated key-value cache, incremental pre-filling and full decoding are performed to obtain the output sequence.
11. A reasoning platform, characterized in that, The inference platform includes: A segment filtering module is used to receive an input sequence and filter target segments from at least one segment in the input sequence according to an evaluation index of at least one segment in the input sequence. The key-value cache update module is used to determine the key token in the target segment, and for the first segment in the target segment, update the key-value cache according to the key token of the first segment and related tokens to obtain the updated key-value cache. The related tokens include tokens in the first segment before the key token of the first segment or tokens in the second segment, where the second segment is the segment in the target segment that is located before the first segment. The inference module is used to infer and generate an output sequence based on the updated key-value cache.
12. The platform according to claim 11, characterized in that, The key-value cache update module is specifically used for: At layer t of the first artificial intelligence (AI) model, the relevance between the key token and the related token is determined based on the key token and related token of the first fragment. Based on the relevance, the updated key-value cache of the (t+1)th layer of the first AI model is obtained.
13. The platform according to claim 12, characterized in that, The key-value cache update module is specifically used for: Based on the relevance, the output of the t-th layer of the first AI model is obtained; Based on the output of the t-th layer, an updated key-value vector is obtained by mapping using the key-value weight matrix. The updated key-value vector is then used to replace the key-value vector cached in the (t+1)-th layer of the first AI model, thereby obtaining the updated key-value cache of the (t+1)-th layer of the first AI model.
14. The platform according to any one of claims 11 to 13, characterized in that, The segment filtering module is specifically used for: Based on the correlation between at least one segment in the input sequence and the input sequence, the at least one segment is filtered to obtain a filtered segment; Based on the amount of information in the filtered segments, the filtered segments are further selected to obtain the target segments.
15. The platform according to claim 14, characterized in that the segment filtering module is further configured to: input the at least one segment in the input sequence into a first artificial intelligence (AI) model, and determine the correlation between the at least one segment and the input sequence based on a degree of dispersion of shallow-layer attention of the first AI model; and/or input the at least one segment in the input sequence into a second AI model, determine information content of the at least one segment based on a probability distribution output by the second AI model, and determine the correlation between the at least one segment and the input sequence based on the information content, wherein the second AI model comprises the first m layers of the first AI model and an adapter layer, m being a positive integer; and/or input the at least one segment in the input sequence into a third AI model, and determine the correlation between the at least one segment and the input sequence based on a semantic similarity, output by the third AI model, between the at least one segment and the input sequence.
16. The platform according to claim 15, characterized in that the segment filtering module is specifically configured to: for a segment in the input sequence, retain the segment if the correlations determined using the first AI model, the second AI model, and the third AI model are all higher than a threshold.
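The two-stage selection of claims 14 to 16 can be sketched as follows. The sketch takes the three per-segment correlation scores and the information content as precomputed inputs. It assumes, as illustration only, that one shared threshold applies to all scorers and that the second stage keeps the `keep` segments with the highest information content; the claims do not fix the second-stage rule. The names `passes_all_scorers` and `select_target_segments` are hypothetical.

```python
from typing import Dict, List, Sequence

def passes_all_scorers(scores: Sequence[float], threshold: float) -> bool:
    # Claim 16: retain a segment only if every scorer exceeds the threshold.
    return all(score > threshold for score in scores)

def select_target_segments(
    segment_scores: Dict[str, Sequence[float]],
    information: Dict[str, float],
    threshold: float,
    keep: int,
) -> List[str]:
    """Return the ids of the target segments in their original order.

    segment_scores : segment id -> correlation scores from the scorers
    information    : segment id -> information content of the segment
    keep           : number of segments kept in stage two (assumed rule)
    """
    # Stage 1: correlation-based filtering.
    filtered = [sid for sid, scores in segment_scores.items()
                if passes_all_scorers(scores, threshold)]
    # Stage 2 (assumption): rank by information content, keep the top ones.
    ranked = sorted(filtered, key=lambda sid: information[sid], reverse=True)
    selected = set(ranked[:keep])
    # Preserve input order so that "first" and "second" segments stay ordered.
    return [sid for sid in filtered if sid in selected]
```

Returning the segments in input order matters for the later cache update, which distinguishes a first segment from a second segment located before it.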
17. The platform according to any one of claims 14 to 16, characterized in that the segment filtering module is further configured to: input the filtered segments into the first artificial intelligence (AI) model for inference, to obtain a probability distribution over the output vocabulary in the pre-filling stage; and determine the information content of the filtered segments based on the probability distribution over the output vocabulary.
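One possible way to turn the pre-filling probability distributions of claim 17 into an information-content score is sketched below. The claim does not name a specific measure; this sketch assumes the mean Shannon entropy (in bits) of the per-position distributions, and whether a higher or lower value should be preferred is left to the selection step. The names `entropy_bits` and `segment_information` are hypothetical.

```python
import math
from typing import Sequence

def entropy_bits(probs: Sequence[float]) -> float:
    # Shannon entropy in bits; zero-probability entries contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def segment_information(distributions: Sequence[Sequence[float]]) -> float:
    """Mean entropy over the positions of one segment.

    distributions : one probability distribution over the output
                    vocabulary per token position in the segment
    """
    if not distributions:
        return 0.0
    return sum(entropy_bits(d) for d in distributions) / len(distributions)
```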
18. The platform according to any one of claims 11 to 17, characterized in that the key-value cache update module is further configured to: obtain attention of the tokens in the target segments, wherein the attention is obtained by the first artificial intelligence (AI) model performing inference on the target segments; and the key-value cache update module is specifically configured to: determine the key tokens in the target segments based on the attention of the tokens in the target segments.
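The attention-based choice of key tokens in claim 18 can be sketched as below. The sketch assumes, for illustration only, that the attention is given as one matrix (rows are query positions, columns are attended positions, already aggregated over heads and layers) and that the key tokens are the `top_k` positions receiving the most total attention. The claim only states that the key tokens are determined based on attention. The name `select_key_tokens` is hypothetical.

```python
from typing import List, Sequence

def select_key_tokens(attention: Sequence[Sequence[float]], top_k: int) -> List[int]:
    """Return the positions of the key tokens in ascending order.

    attention : attention[i][j] is the weight that query position i
                assigns to position j within the segment
    top_k     : number of key tokens to keep (assumed selection rule)
    """
    num_tokens = len(attention[0])
    # Total attention received by each position (column sums).
    received = [sum(row[j] for row in attention) for j in range(num_tokens)]
    ranked = sorted(range(num_tokens), key=lambda j: received[j], reverse=True)
    return sorted(ranked[:top_k])
```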
19. The platform according to any one of claims 11 to 18, characterized in that the key-value cache update module is further configured to: determine key-value vectors of the tokens in the target segments, and cache the key-value vectors to obtain the key-value cache.
20. The platform according to any one of claims 11 to 19, characterized in that the inference module is specifically configured to: perform incremental pre-filling and full decoding based on the updated key-value cache to obtain the output sequence.
21. A computing device cluster, characterized in that the computing device cluster comprises at least one computing device, the at least one computing device comprising at least one processor and at least one memory, the at least one memory storing computer-readable instructions; and the at least one processor executes the computer-readable instructions to cause the computing device cluster to perform the sequence reasoning method according to any one of claims 1 to 10.
22. A computer-readable storage medium, characterized in that it comprises computer-readable instructions, wherein the computer-readable instructions are used to implement the sequence reasoning method according to any one of claims 1 to 10.
23. A computer program product, characterized in that it comprises computer-readable instructions, wherein the computer-readable instructions are used to implement the sequence reasoning method according to any one of claims 1 to 10.