Text processing method and apparatus, electronic device, and computer readable medium
By segmenting ultra-long texts into text fragments and generating cached data, the problems of information forgetting and logical confusion in the processing of ultra-long texts by large language models are solved, and efficient and accurate text analysis is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING QIYI CENTURY SCI & TECH CO LTD
- Filing Date
- 2026-02-06
- Publication Date
- 2026-06-19
AI Technical Summary
Large language models suffer from problems such as information loss and logical confusion when processing extremely long texts, and existing techniques struggle to analyze such texts effectively, resulting in inaccurate analysis results and high costs.
The ultra-long text is segmented into multiple text fragments. Each text fragment is pre-filled using a large language model to generate a key-value cache set, from which cached data for that fragment is generated. The cached data of the fragments associated with a question is then fused to generate the response text.
It improves the efficiency and accuracy of ultra-long text analysis, avoids information loss and logical confusion, captures long-distance dependencies, and enhances the completeness of analysis results.
Figure CN122240578A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of computer technology, specifically to text processing methods, apparatuses, electronic devices, and computer-readable media.
Background Technology
[0002] With the rapid development of artificial intelligence technology, Large Language Models (LLMs) have shown great potential in the field of text processing. However, when the length of the input text exceeds the upper limit, LLMs suffer from problems such as information forgetting and logical confusion, and the reasoning cost increases superlinearly, making them unsuitable for in-depth analysis of extremely long texts such as film and television scripts.
[0003] In existing technologies, understanding and analyzing extremely long texts usually requires either manual reading or automated analysis based on summaries. The former is time-consuming, costly, and highly subjective, and is prone to failing to fully grasp long-distance dependencies in extremely long texts due to limitations in memory and attention. The latter loses a large number of key details, leading to distorted analysis results.
Summary of the Invention
[0004] This application provides text processing methods, apparatus, electronic devices, and computer-readable media, which improve the efficiency and accuracy of analyzing very long texts and reduce labor costs.
[0005] In a first aspect, embodiments of this application provide a text processing method, the method comprising: receiving a question text for an extremely long text, wherein the extremely long text is text whose length exceeds the upper limit of the input length of a large language model; retrieving cached data of target text segments associated with the question text from a pre-established cache library, wherein the cache library is used to store cached data of each text segment in the extremely long text, and the cached data of each text segment is generated based on a key-value cache set generated when the large language model pre-fills the text segment; fusing the cached data of the target text segments to obtain fused cached data; and generating a response text for the question text using the large language model based on the fused cached data.
[0006] In a second aspect, embodiments of this application provide a text processing apparatus, comprising: a receiving unit for receiving a question text for an ultra-long text, wherein the ultra-long text is text whose length exceeds the upper limit of the input length of a large language model; a retrieval unit for retrieving cached data of target text segments associated with the question text from a pre-established cache library, wherein the cache library stores cached data of each text segment in the ultra-long text, and the cached data of each text segment is generated based on a key-value cache set generated when the large language model pre-fills the text segment; a fusion unit for fusing the cached data of the target text segments to obtain fused cached data; and a generation unit for generating a response text for the question text using the large language model based on the fused cached data.
[0007] In a third aspect, embodiments of this application provide an electronic device, including: one or more processors; and a storage device having one or more programs stored thereon, wherein when the one or more programs are executed by the one or more processors, the one or more processors implement the method as described in any embodiment of the first aspect.
[0008] In a fourth aspect, embodiments of this application provide a computer-readable medium having a computer program stored thereon that, when executed by a processor, implements the method as described in any embodiment of the first aspect.
[0009] The text processing method, apparatus, electronic device, and computer-readable medium provided in this application, upon receiving a question text for an excessively long text, first retrieve cached data of target text segments associated with the question text from a pre-established cache library. The cache library stores cached data of each text segment within the excessively long text. The cached data of each text segment is generated based on a key-value cache set generated during pre-filling processing of the text segment by a large language model. Then, the cached data of the target text segments are fused to obtain fused cached data. Finally, based on the fused cached data, a response text for the question text is generated using a large language model. On one hand, since the cached data is generated based on a key-value cache set generated during pre-filling processing of text segments by a large language model, it can serve as an intermediate representation of the text segments. By converting the excessively long text into cached data of multiple text segments, the large language model can avoid directly processing the excessively long text during inference, preventing problems such as information forgetting and logical confusion caused by the text length exceeding the input length limit, thus improving analysis efficiency and the accuracy of analysis results. On the other hand, by fusing cached data of target text segments related to the question text, the large language model can simultaneously access all information related to the question text during the reasoning process. This enables it to capture long-distance dependencies between text segments related to the question text, avoids missing details, and further improves the accuracy of the analysis results.
Attached Figure Description
[0010] Other features, objects, and advantages of this application will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings:
Figure 1 is a flowchart of an embodiment of the text processing method of this application;
Figure 2 is a schematic diagram of applying the text processing method of this application to a script analysis scenario;
Figure 3 is a schematic structural diagram of an embodiment of the text processing apparatus of this application;
Figure 4 is a schematic structural diagram of an electronic device used to implement the embodiments of this application.
Detailed Implementation
[0011] All actions involving the acquisition of signals, information, or data in this application are carried out in accordance with the relevant data protection laws and policies of the country where the application is located, and with the authorization of the owner of the relevant device.
[0012] The present application will now be described in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and not intended to limit it. Furthermore, it should be noted that, for ease of description, only the parts relevant to the invention are shown in the accompanying drawings.
[0013] It should be noted that, unless otherwise specified, the embodiments and features described in this application can be combined with each other. This application will now be described in detail with reference to the accompanying drawings and embodiments.
[0014] Referring to Figure 1, a flow 100 of an embodiment of the text processing method according to this application is shown. This text processing method can be applied to various electronic devices with data processing capabilities. For example, such electronic devices may include, but are not limited to: servers, smartphones, tablets, e-book readers, laptops, handheld computers, desktop computers, etc. The processor in the aforementioned electronic device can be the execution entity of the text processing method.
[0015] This text processing method includes the following steps:
Step 101: Receive the question text for the ultra-long text, which is text whose length is greater than the upper limit of the input length of the large language model.
[0016] In this embodiment, a large language model refers to a neural network model for natural language processing that contains a very large number of parameters. Large language models are characterized by enormous scale and multi-task learning capability, and are trained with powerful computing resources on abundant training data. They are widely used in the field of natural language processing and demonstrate outstanding performance in various natural language processing tasks.
[0017] Ultra-long text refers to text whose length exceeds the input length limit of a large language model, such as novels and film / television scripts. Text formats can include, but are not limited to, .txt, .pdf, .docx, and .doc formats. For example, the large language model is Llama-3-70B, whose input limit is typically 8192 tokens. A film / television script can reach hundreds of thousands of words, far exceeding this limit, and therefore falls under the category of ultra-long text.
[0018] The question text is a question entered by the user in natural language, requesting analysis of a specific aspect of an ultra-long text. For example, for a film or television script, the user's question text might be "Please analyze the character arc of the protagonist, Li Ming, from the first episode to the last episode." In practice, the entity executing the text processing method can receive the question text through a user interface, which can be a web front-end or an API (Application Programming Interface).
[0019] Step 102: Retrieve cached data of target text fragments associated with the question text from a pre-established cache library. The cache library is used to store cached data of each text fragment in the ultra-long text. The cached data of each text fragment is generated based on the key-value cache set generated when the large language model pre-fills the text fragment.
[0020] In this embodiment, the cache library can be a structured storage system used to store cached data for each text segment in a very long text. In practice, the cache library can be a distributed key-value database, such as Redis, or a document database, such as MongoDB; no specific limitation is made here.
[0021] A text segment can refer to a fragment within a very long text. The very long text can be pre-segmented into multiple text segments. For each text segment, it can first be pre-filled using a large language model to obtain a key-value cache set. Then, based on this key-value cache set, cached data for that text segment can be generated. For example, the key-value cache set can be used directly as cached data, or the key-value caches in the key-value cache set can be quantized or otherwise processed to obtain cached data; no specific limitations are specified here.
[0022] Pre-filling is a crucial stage in the inference process of a large language model, performed before the autoregressive generation stage. Specifically, it refers to the process by which the large language model, after receiving a text segment, computes and caches intermediate states through forward propagation. The key-value cache set is the product of this pre-filling process, specifically a collection of intermediate computation results generated by multiple self-attention sublayers of the large language model. Each self-attention sublayer generates an intermediate computation result that serves as a key-value cache (KV cache). During pre-filling, the large language model performs global semantic analysis and dependency modeling on the input sequence corresponding to the text segment. This ensures that the pre-filled key-value cache not only contains the semantics of the token itself but also includes contextual information from all other related tokens in the input sequence. Therefore, the key-value cache can represent the semantic features of the text segment and the relationships between tokens within it.
[0023] In practice, a large language model may include multiple Transformer layers, each of which may contain a self-attention sublayer and a feed-forward network (FFN) sublayer. The self-attention sublayer computes the correlation between all tokens in the input sequence and generates three tensors: Query (Q), Key (K), and Value (V). Here, Query is used for querying or focusing, Key represents the identity of a token, and Value represents the actual information content carried by the token. When a text fragment $T_k$ is input into the large language model for pre-filling, for the $l$-th Transformer layer, the Key tensor $K_k^{(l)}$ and Value tensor $V_k^{(l)}$ generated by its self-attention sublayer can be extracted, yielding the key-value cache $(K_k^{(l)}, V_k^{(l)})$ of that self-attention sublayer for the text fragment. $K_k^{(l)}$ and $V_k^{(l)}$ each have dimension $h \times n_k \times d$, where $h$ is the number of attention heads, $n_k$ is the length of the token sequence obtained by tokenizing text fragment $T_k$, and $d$ is the dimension of each attention head. Here $l$ is a positive integer less than or equal to $L$, where $L$ is the number of Transformer layers in the large language model. Collecting the Key and Value tensors generated by the self-attention sublayers in the $L$ Transformer layers yields the key-value cache set $\{(K_k^{(l)}, V_k^{(l)})\}_{l=1}^{L}$ corresponding to text fragment $T_k$, which can be denoted as $C_k$.
[0024] In this embodiment, each text fragment may carry one or more tags, which can be used to indicate the episode number, scene number, chapter number, character, etc., corresponding to the text fragment. After receiving the question text, the target text fragments associated with the question text can first be determined based on the keywords in the question text and the tags of each text fragment in the long text. There are usually multiple target text fragments. Then, cached data of the target text fragments associated with the question text is retrieved from the cache library.
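The tag-based lookup described above can be sketched as follows. This is a minimal illustration, not the implementation of this application: the in-memory dictionaries, the fragment IDs, the tag strings, and the "at least one tag mentioned" matching rule are all assumptions made for the example.

```python
from typing import Dict, List, Set

# Hypothetical in-memory stand-ins for the tag index and the cache library.
fragment_tags: Dict[str, Set[str]] = {
    "ep1_scene1": {"episode 1", "scene 1", "Li Ming"},
    "ep1_scene2": {"episode 1", "scene 2", "Wang Fang"},
    "ep3_scene1": {"episode 3", "scene 1", "Li Ming"},
}
cache_library: Dict[str, bytes] = {
    fragment_id: f"<cached data of {fragment_id}>".encode()
    for fragment_id in fragment_tags
}


def find_target_fragments(question: str) -> List[str]:
    """Return IDs of fragments with at least one tag mentioned in the question."""
    lowered = question.lower()
    return [
        fragment_id
        for fragment_id, tags in fragment_tags.items()
        if any(tag.lower() in lowered for tag in tags)
    ]


def retrieve_cached_data(question: str) -> Dict[str, bytes]:
    """Fetch the cached data of every target fragment from the cache library."""
    return {fid: cache_library[fid] for fid in find_target_fragments(question)}


question = "Please analyze the character arc of Li Ming in episode 3."
targets = find_target_fragments(question)
retrieved = retrieve_cached_data(question)
```

Plain substring matching is used only to keep the sketch short; a stricter rule (for example, requiring all extracted keywords to match) would narrow the result.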
[0025] In traditional text processing methods, excessively long texts exceed the input length limit of large language models. Directly inputting such texts can lead to information forgetting, logical inconsistencies, and a superlinear increase in inference costs. This embodiment converts excessively long texts into cached data of multiple text segments. This eliminates the need for the large language model to directly process the excessively long text during inference, avoiding information forgetting and logical inconsistencies caused by exceeding the input length limit, thus improving analysis efficiency and accuracy. Furthermore, upon receiving the question text, only the cached data associated with the question text needs to be retrieved, rather than all of the cached data, further enhancing analysis efficiency.
[0026] Step 103: Fuse the cached data of the target text fragments to obtain fused cached data.
[0027] In this embodiment, for each target text segment, its corresponding key-value cache set can first be determined based on the cached data of the target text segment. After obtaining the key-value cache sets corresponding to each target text segment, the key-value caches output by the same network layer in each key-value cache set can be concatenated to obtain multiple fused cache data corresponding to multiple network layers.
[0028] Understandably, since the cached data for each text segment is generated based on the key-value cache set produced when the text segment is pre-filled using the aforementioned large language model, for a given target text segment, its corresponding key-value cache set can be determined through its cached data and the inverse process of cache data generation. For example, if the key-value cache set is directly used as the cached data during the cache data generation stage, then the cached data of the target text segment is its corresponding key-value cache set; if the cached data is obtained by quantizing the key-value cache set during the cache data generation stage, then the cached data of the target text segment can be dequantized to obtain its corresponding key-value cache set.
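The layer-wise concatenation can be sketched as below. Tensors are modelled as nested Python lists of shape [heads][tokens][head_dim], and the helper names (`concat_tokens`, `fuse_cache_sets`, `make_cache_set`) are hypothetical; the sketch assumes the key-value cache sets have already been recovered from the cached data (for example, by dequantization) and shows only the concatenation along the token axis.

```python
from typing import List, Tuple

# One tensor is modelled as nested lists with shape [heads][tokens][head_dim].
Tensor = List[List[List[float]]]
# A key-value cache set holds one (K, V) pair per Transformer layer.
KVCacheSet = List[Tuple[Tensor, Tensor]]


def concat_tokens(tensors: List[Tensor]) -> Tensor:
    """Concatenate tensors along the token axis, head by head."""
    num_heads = len(tensors[0])
    return [
        [row for t in tensors for row in t[head]]
        for head in range(num_heads)
    ]


def fuse_cache_sets(cache_sets: List[KVCacheSet]) -> KVCacheSet:
    """Fuse the cache sets of several fragments, one layer at a time."""
    num_layers = len(cache_sets[0])
    fused: KVCacheSet = []
    for layer in range(num_layers):
        keys = concat_tokens([cs[layer][0] for cs in cache_sets])
        values = concat_tokens([cs[layer][1] for cs in cache_sets])
        fused.append((keys, values))
    return fused


def make_cache_set(num_layers: int, num_heads: int, num_tokens: int,
                   head_dim: int, fill: float) -> KVCacheSet:
    """Build a dummy cache set whose entries all equal `fill`."""
    def tensor() -> Tensor:
        return [[[fill] * head_dim for _ in range(num_tokens)]
                for _ in range(num_heads)]
    return [(tensor(), tensor()) for _ in range(num_layers)]


fragment_a = make_cache_set(2, 2, 3, 4, fill=1.0)  # fragment with 3 tokens
fragment_b = make_cache_set(2, 2, 5, 4, fill=2.0)  # fragment with 5 tokens
fused = fuse_cache_sets([fragment_a, fragment_b])
```

After fusion, each layer holds a single (K, V) pair whose token axis has length 3 + 5 = 8, with the entries of the first fragment preceding those of the second.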
[0029] When processing very long texts, long-distance dependencies may exist between different text segments. Long-distance dependencies refer to associations between elements in an input sequence, such as a sentence, a paragraph of text, or a piece of code, that are related to each other but separated by a large amount of unrelated or secondary information. The distance here typically refers to the number of positions between tokens in the input sequence. Traditional processing methods struggle to capture these long-distance dependencies, leading to inaccurate analysis results. This embodiment fuses cached data from multiple target text segments, enabling the large language model to achieve cross-segment information interaction during inference, thereby capturing long-distance dependencies and improving the accuracy of the analysis results.
[0030] Step 104: Based on the fused cached data, use a large language model to generate response text for the question text.
[0031] In this embodiment, the question text can be input into the large language model, and the fused cache data corresponding to each network layer can be directly injected into the network layer of the large language model as context information, replacing the traditional pre-filling process stage, so that the large language model can directly enter the autoregressive generation stage and generate the response text word by word.
[0032] After the question text is input into the large language model, it is converted into an input sequence. Then, skipping the pre-filling stage, it enters the autoregressive stage. Here, the large language model first calculates a probability distribution based on the input sequence to predict the most likely first word, such as the word with the highest probability. This word is then appended to the end of the input sequence, forming a new sequence. Next, based on the new sequence, another probability distribution is calculated to predict the second word. This process is iterated until the large language model generates a special word representing the end, at which point it stops, resulting in a complete word sequence, i.e., the response text.
[0033] In each probability distribution calculation, the Transformer layer of the large language model integrates and analyzes the input sequence, capturing contextual relationships. Ultimately, it outputs a feature vector encompassing contextual information for the last position in the sequence—the position where a new word will be generated. This feature vector is then mapped onto the entire vocabulary through a linear layer to obtain the probability score for each word in the vocabulary. Applying a normalization function, such as the softmax function, to all probability scores converts them into probability values, thus yielding the probability distribution.
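The autoregressive loop and the softmax normalization described above can be sketched as follows. The scoring function `toy_score_fn` is a hypothetical stand-in for the large language model and its fused cache; only the control flow (score, normalize, pick the most probable token, append, stop at the end token) reflects the text.

```python
import math
from typing import Callable, List


def softmax(scores: List[float]) -> List[float]:
    """Normalize raw scores into a probability distribution."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(score_fn: Callable[[List[int]], List[float]],
                  prompt: List[int], eos_id: int,
                  max_new_tokens: int = 20) -> List[int]:
    """Generate tokens one by one, always taking the most probable token."""
    sequence = list(prompt)
    generated: List[int] = []
    for _ in range(max_new_tokens):
        probs = softmax(score_fn(sequence))
        next_id = max(range(len(probs)), key=probs.__getitem__)
        if next_id == eos_id:  # special end token: stop generating
            break
        sequence.append(next_id)  # append and continue with the new sequence
        generated.append(next_id)
    return generated


VOCAB_SIZE, EOS_ID = 5, 4


def toy_score_fn(sequence: List[int]) -> List[float]:
    """Hypothetical stand-in for the model: favours (last token + 1)."""
    target = min(sequence[-1] + 1, EOS_ID)
    return [3.0 if i == target else 0.0 for i in range(VOCAB_SIZE)]


response = greedy_decode(toy_score_fn, prompt=[0], eos_id=EOS_ID)
```

Taking the highest-probability token at each step is only one decoding choice, matching the example given in the text; sampling from the distribution would fit the same loop.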
[0034] In the method provided in the above embodiments of this application, on the one hand, the cached data is generated based on the key-value cache set generated when the large language model pre-fills text fragments, so it can serve as an intermediate representation of the text fragments. By converting ultra-long texts into cached data of multiple text fragments, the large language model can avoid directly processing ultra-long texts during inference, thus avoiding problems such as information forgetting and logical confusion caused by text length exceeding the input length limit, thereby improving analysis efficiency and the accuracy of analysis results. On the other hand, by fusing cached data of target text fragments related to the question text, the large language model can simultaneously access all information related to the question text during inference, thereby capturing long-distance dependencies between text fragments related to the question text, avoiding omission of details, and further improving the accuracy of analysis results.
[0035] In some optional embodiments, cached data for each text segment in a very long text can be generated through the following steps:
Step 201: Segment the excessively long text to obtain multiple text fragments.
[0036] Specifically, segmentation rules can be pre-set to segment extremely long texts into multiple text fragments. For example, segmentation can be based on the number of characters, the structure of the text, or its semantic theme; no specific limitations are specified here. For instance, for film and television scripts, segmentation can be based on episodes or scenes, with each episode or scene constituting an independent text fragment; for biographical texts, segmentation can be based on different stages or events experienced by the character, with each stage or event corresponding to a text fragment. Each resulting text fragment can be assigned a unique identifier, denoted as Module ID.
[0037] Optionally, the extremely long text may include a film or television script. For a film or television script, the script elements can be identified first. Script elements include at least one of the following: episode number, scene number, and character name. In practice, rule-based Natural Language Processing (NLP) techniques, such as regular expressions and heuristic rules, can be used to parse the extremely long script and identify the script elements within the text. Then, based on these script elements, the extremely long text can be segmented into multiple text fragments.
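A regular-expression-based split by episode and scene might look like the sketch below. The heading formats ("Episode N" and "Scene N ...") and the sample script are assumptions for illustration; real scripts vary and would need their own patterns and heuristics.

```python
import re
from typing import Dict, List, Tuple

# Assumed heading formats; real scripts need patterns of their own.
EPISODE_RE = re.compile(r"^Episode (\d+)\s*$")
SCENE_RE = re.compile(r"^Scene (\d+)\b")


def segment_script(script: str) -> Dict[Tuple[int, int], str]:
    """Split a script into text fragments keyed by (episode, scene)."""
    collected: Dict[Tuple[int, int], List[str]] = {}
    episode = scene = None
    for line in script.splitlines():
        ep_match = EPISODE_RE.match(line)
        sc_match = SCENE_RE.match(line)
        if ep_match:
            # A new episode starts; wait for its first scene heading.
            episode, scene = int(ep_match.group(1)), None
        elif sc_match and episode is not None:
            scene = int(sc_match.group(1))
            collected[(episode, scene)] = [line]
        elif episode is not None and scene is not None:
            collected[(episode, scene)].append(line)
    return {key: "\n".join(lines).strip() for key, lines in collected.items()}


SAMPLE = """Episode 1
Scene 1 INT. OFFICE - DAY
LI MING: I quit.
Scene 2 EXT. STREET - NIGHT
WANG FANG: Where will you go?
Episode 2
Scene 1 INT. TRAIN - DAY
LI MING: Somewhere new.
"""

fragments = segment_script(SAMPLE)
```

Keying each fragment by (episode, scene) also gives it a unique identifier and a natural set of tags for later retrieval.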
[0038] As an example, based on episode numbers, an extremely long script can be segmented by episode. Specifically, the extremely long text can be denoted as $T$, and $T$ can be segmented by episode to obtain a set of text fragments $\{T_e\}_{e=1}^{E}$, where $T_e$ is the text fragment corresponding to episode $e$ and $E$ is the total number of episodes.
[0039] As another example, based on the scene numbers within each episode, an extremely long script can be segmented by scene. Specifically, for episode $e$, its corresponding text fragment $T_e$ can be further segmented to obtain text fragments $\{T_{e,s}\}_{s=1}^{S_e}$, where $T_{e,s}$ is the complete text of the $s$-th scene in text fragment $T_e$ and $S_e$ is the total number of scenes in episode $e$.
[0040] As another example, based on character names, an extremely long script can be segmented by character. For each character $c$, the scenes in which the character's name appears can be retrieved, and the text fragments corresponding to the retrieved scenes can be aggregated to obtain the text fragment $T_c$ corresponding to that character. Text fragment $T_c$ contains all of the character's dialogue and related action descriptions.
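The character-based aggregation can be illustrated with a short sketch. The scene fragments, their (episode, scene) keys, and the case-insensitive name match are assumptions for the example; the function name `build_character_fragment` is hypothetical.

```python
from typing import Dict, Tuple


def build_character_fragment(scene_fragments: Dict[Tuple[int, int], str],
                             character_name: str) -> str:
    """Aggregate, in script order, every scene that mentions the character."""
    matching = [
        text
        for _, text in sorted(scene_fragments.items())
        if character_name.lower() in text.lower()
    ]
    return "\n\n".join(matching)


# Hypothetical scene-level fragments keyed by (episode, scene).
scene_fragments = {
    (1, 1): "LI MING: I quit.",
    (1, 2): "WANG FANG: Where will you go?",
    (2, 1): "LI MING: Somewhere new.",
}

li_ming_fragment = build_character_fragment(scene_fragments, "Li Ming")
```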
[0041] Unlike simply segmenting by fixed length, segmenting by script elements such as "episode," "scene," and "character" ensures that each generated text fragment corresponds to a narrative and logically consistent unit. For example, a text fragment corresponding to a "scene" contains the complete plot within the same time and space; a text fragment corresponding to a "character" reflects the character's arc. When a text fragment is pre-filled as a key-value cache and cached data is generated, each cached data item carries the complete semantic information of that text fragment, avoiding the contextual tearing and semantic fragmentation that may be caused by fixed-length segmentation, thus laying a reliable data foundation for subsequent retrieval and text generation.
[0042] Furthermore, by providing multi-dimensional segmentation methods such as "episode, scene, and character," it's equivalent to building multiple analytical perspectives for the same script. When processing user question texts, the most relevant dimensions can be flexibly selected for retrieval. For example, when processing the question text "analyze the pacing of episode 3," cached data of the text segment corresponding to "episode 3" can be retrieved; when processing the question text "Li Ming's character development," cached data of the text segment corresponding to "Li Ming" can be retrieved. This multi-dimensional segmentation method greatly enhances the ability of the large language model to answer complex and diverse questions, improving its adaptability to different analytical needs.
[0043] Step 202: For each text segment in the very long text, input the text segment into the large language model, and perform pre-filling processing through the large language model to obtain the key-value cache set corresponding to the text segment. For this step, refer to the description in the above embodiments, which is not repeated here.
[0044] By segmenting extremely long texts into multiple text fragments and generating independent cached data for each fragment, the system can quickly locate the relevant text fragments and their cached data when processing user queries. Compared to processing the entire extremely long text directly, this segmented storage method significantly reduces the retrieval scope and improves retrieval efficiency.
[0045] Step 203: Quantize the key-value cache set corresponding to each text segment in the ultra-long text to obtain the cache data of each text segment in the ultra-long text.
[0046] Quantization is a data compression technique that maps floating-point key-value caches to integer types to reduce storage space usage and improve computational efficiency. Floating-point types may include, but are not limited to, FP16 (Floating-Point 16, half-precision floating-point) and BF16 (Brain Floating Point 16), while integer types may include, but are not limited to, INT8 (INTeger 8, 8-bit integer) and INT4 (INTeger 4, 4-bit integer).
[0047] In practice, linear quantization and other methods can be used to quantize the key-value cache sets corresponding to each text segment in a very long text. See the following quantization formula:
$$x_q = \mathrm{clip}\left(\mathrm{round}\left(\frac{x}{\mathrm{scale}}\right),\ -Q_{max},\ Q_{max}\right)$$
[0048] Where $x$ is a key-value cache in the key-value cache set, i.e., the original floating-point value; scale is the scaling factor, a coefficient used to perform a linear mapping between floating-point and integer representations; $Q_{max}$ is the maximum integer value, i.e., the maximum value of the range of quantized integers, which can be preset based on the type of the quantized integer (for example, the maximum integer value corresponding to the INT8 type is 127); round is the rounding function; clip is the clipping function that restricts the value to the specified range $[-Q_{max}, Q_{max}]$; and $x_q$ is the quantization result corresponding to $x$. After quantizing the key-value caches in the key-value cache set corresponding to each text segment using the above formula, the quantization results are collected to obtain the cached data for that text segment.
[0049] By quantizing the key-value cache set, the floating-point key-value cache is converted into an integer type, significantly reducing the storage space occupied by the cached data and facilitating subsequent retrieval and analysis. Although quantization compresses the data to some extent, with appropriate quantization parameter selection, most of the semantic information in the key-value cache set can still be preserved. This is because the quantization process mainly maps the numerical range of the key-value tensors without changing the tensor's structure and semantic features. Therefore, the generated cached data can effectively represent the semantic content of text fragments, providing reliable data support for subsequent text analysis tasks.
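A minimal sketch of symmetric linear quantization and its inverse is given below, assuming the INT8 range with a maximum integer value of 127 and a clipping range of [-Q_max, Q_max]. The sample values and the scaling factor are made up for illustration, and plain Python lists stand in for the key and value tensors.

```python
from typing import List

Q_MAX = 127  # maximum integer value for INT8


def quantize(values: List[float], scale: float, q_max: int = Q_MAX) -> List[int]:
    """Apply x_q = clip(round(x / scale), -q_max, q_max) to each value."""
    return [max(-q_max, min(q_max, round(x / scale))) for x in values]


def dequantize(q_values: List[int], scale: float) -> List[float]:
    """Map integers back to floating point: x is approximately x_q * scale."""
    return [q * scale for q in q_values]


kv_values = [0.5, -1.0, 0.3, 3.0]   # illustrative key-value cache entries
scale = 2.54 / Q_MAX                # illustrative scaling factor (about 0.02)
quantized = quantize(kv_values, scale)
restored = dequantize(quantized, scale)
```

The last value exceeds the representable range and is clipped to 127, so it is restored as about 2.54 rather than 3.0, while the in-range values are restored to within half a quantization step.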
[0050] In some optional embodiments, step 203 above may further include the following sub-steps:
Step 2031: Input the calibration dataset into the large language model, perform pre-filling processing through the large language model, and statistically analyze the original numerical distribution of the key-value cache in the key-value cache set generated by the large language model.
[0051] A calibration dataset can be a representative set of input data used to estimate the numerical distribution of the parameters to be quantized, denoted as P(x). The calibration dataset does not necessarily represent the final application data (i.e., individual text fragments), but its distribution should be as close as possible to the distribution of the final application data to provide a statistical basis for determining quantization parameters (such as scaling factors). For example, a calibration dataset can be obtained by randomly sampling text fragments from the multiple text fragments of a very long text and aggregating them, thus reducing computational and storage overhead.
[0052] Specifically, calibration data from the calibration dataset can be input into a large language model. The large language model performs pre-filling processing to obtain a key-value cache set, and the original numerical distribution of the key-value caches within this set is statistically analyzed. The pre-filling process and the key-value cache generation process are described in the above embodiments and will not be repeated here to avoid repetition. The original numerical distribution refers to the statistical distribution of the floating-point format key-value caches without any compression.
[0053] Step 2032: Search for the optimal clipping threshold using an optimization algorithm. The optimal clipping threshold is obtained by minimizing the difference between the original numerical distribution and the reconstructed numerical distribution. The reconstructed numerical distribution is the numerical distribution obtained after clipping the data according to a clipping threshold and then performing quantization and dequantization in sequence.
[0054] The clipping threshold is a scalar value T obtained through an optimization algorithm. It is used to symmetrically clip the data before quantization, setting all values less than -T to -T and all values greater than T to T. The optimal clipping threshold is the threshold that minimizes the information loss caused by quantization, and can be denoted as T*. The reconstructed numerical distribution is the statistical distribution of the data obtained after clipping the original floating-point values according to a clipping threshold T, quantizing them to integers, and then dequantizing them back to floating-point numbers. It can be denoted as $\hat{P}_T(x)$.
[0055] Specifically, an objective function, such as the KL divergence, can first be defined, as shown below:
$$T^* = \arg\min_{T} D_{KL}\left(P(x)\,\|\,\hat{P}_T(x)\right)$$
[0056] where $T^*$ denotes the clipping threshold that minimizes the KL divergence.
[0057] Then, an optimization algorithm traverses a series of candidate clipping thresholds T, computes the KL divergence corresponding to each candidate, and selects the candidate that minimizes the KL divergence as the optimal clipping threshold T*.
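The threshold search can be sketched as a simple grid search. Everything below is an illustrative assumption rather than the procedure of this application: the distributions are estimated with smoothed histograms, candidates are evenly spaced up to max(|x|), the calibration values are synthetic (a Gaussian bulk plus a few outliers), and a 4-bit range (maximum integer value 7) with 32 bins is used so that the trade-off is visible on a coarse histogram.

```python
import math
import random
from typing import List

Q_MAX = 7  # INT4 range, chosen so quantization error is visible in a coarse histogram


def reconstruct(values: List[float], threshold: float, q_max: int) -> List[float]:
    """Clip to [-T, T], quantize to integers, then dequantize."""
    scale = threshold / q_max
    out = []
    for x in values:
        clipped = max(-threshold, min(threshold, x))
        out.append(round(clipped / scale) * scale)
    return out


def histogram(values: List[float], limit: float, bins: int,
              eps: float = 1e-6) -> List[float]:
    """Smoothed, normalized histogram over [-limit, limit]."""
    counts = [eps] * bins  # eps keeps every bin strictly positive
    width = 2 * limit / bins
    for x in values:
        index = min(bins - 1, max(0, int((x + limit) / width)))
        counts[index] += 1
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(P || Q) for two discrete distributions with positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def search_clipping_threshold(values: List[float], q_max: int = Q_MAX,
                              num_candidates: int = 50, bins: int = 32) -> float:
    """Return the candidate T minimizing KL(original || reconstructed)."""
    limit = max(abs(x) for x in values)
    original = histogram(values, limit, bins)
    best_threshold, best_kl = limit, float("inf")
    for step in range(1, num_candidates + 1):
        threshold = limit * step / num_candidates
        rebuilt = histogram(reconstruct(values, threshold, q_max), limit, bins)
        kl = kl_divergence(original, rebuilt)
        if kl < best_kl:
            best_threshold, best_kl = threshold, kl
    return best_threshold


rng = random.Random(0)
# Synthetic calibration values: a Gaussian bulk plus a few extreme outliers.
calibration = [rng.gauss(0.0, 1.0) for _ in range(2000)] + [20.0, -20.0] * 5
limit = max(abs(x) for x in calibration)
t_star = search_clipping_threshold(calibration)
```

On data of this shape the selected threshold falls below max(|x|): clipping the few outliers costs little probability mass, whereas stretching the integer range to cover them would collapse most of the bulk onto a handful of levels.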
[0058] Step 2033: Determine the scaling factor based on the optimal clipping threshold and the preset maximum integer value.
[0059] Specifically, based on the optimal clipping threshold T*, the scaling factor is calculated as $\text{scale} = T^* / Q_{\max}$, where $Q_{\max}$ is the preset maximum integer value. It should be noted that after the scaling factor is calculated, its value can be stored in the cache library for use during dequantization.
[0060] Step 2034: Based on the scaling factor and the maximum integer value, quantize the values in the key-value cache set corresponding to each text segment in the ultra-long text to obtain the cached data of each text segment in the ultra-long text. Here, the quantization formula in the above embodiment can be directly used for quantization processing, and it will not be described again here to avoid repetition.
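Steps 2033 and 2034 can be illustrated with a short sketch. The quantization formula itself is given in an earlier embodiment that is not reproduced here, so the round-then-clamp rule below is an assumption, as is the choice of 127 (8-bit signed integers) for the maximum integer value. The function names are hypothetical.

```python
def compute_scale(t_star, q_max=127):
    """Scaling factor from the optimal clipping threshold: scale = T* / Q_max."""
    return t_star / q_max

def quantize(values, scale, q_max=127):
    """Map floats to integers in [-q_max, q_max] (assumed round-then-clamp rule)."""
    quantized = []
    for x in values:
        q = round(x / scale)  # Python rounds exact halves to the nearest even integer
        quantized.append(max(-q_max, min(q_max, q)))
    return quantized

scale = compute_scale(2.54)
cached = quantize([0.0, 1.0, 10.0, -10.0], scale)
print(scale, cached)
```

Values whose magnitude exceeds T* saturate at ±Q_max, which is the symmetric clipping described in paragraph [0054].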
[0061] Understandably, a conventional method for calculating the scaling factor, such as $\text{scale} = \max(|x|) / Q_{\max}$, sacrifices the quantization accuracy of the vast majority of data for a small number of extreme outliers. In this embodiment, by minimizing the KL divergence to find the optimal clipping threshold T*, outliers can be clipped to a reasonable range, allowing the scaling factor to more accurately match the distribution range of the main data. This ensures that most values can be reconstructed with high fidelity after quantization and dequantization, achieving high-fidelity data compression and helping to improve the quality of inference in large language models.
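The effect described above can be checked numerically with a toy example. The data, the clipping threshold of 1.0, and the maximum integer value of 127 are all assumptions chosen for illustration. The sketch compares the mean reconstruction error of the bulk of the data under the two scaling factors.

```python
def mean_abs_error(values, scale, q_max=127):
    """Mean absolute error after quantizing and dequantizing with the given scale."""
    total = 0.0
    for x in values:
        q = max(-q_max, min(q_max, round(x / scale)))
        total += abs(x - q * scale)
    return total / len(values)

q_max = 127
bulk = [i / 100.0 for i in range(-100, 101)]  # main data, spread over [-1, 1]
values = bulk + [50.0]                        # plus one extreme outlier

scale_maxabs = max(abs(v) for v in values) / q_max  # conventional max-abs scaling
scale_clipped = 1.0 / q_max                         # clipping threshold T = 1.0 (assumed)

err_maxabs = mean_abs_error(bulk, scale_maxabs)
err_clipped = mean_abs_error(bulk, scale_clipped)
print(f"max-abs scale: {err_maxabs:.4f}, clipped scale: {err_clipped:.4f}")
```

The trade-off is that the outlier itself is reconstructed as T rather than its original value. The KL-based search is what balances that loss against the gain in resolution for the bulk.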
[0062] In some optional embodiments, step 102 above may further include the following sub-steps:
Sub-step 1021: Extract the target entity from the question text.
[0063] Target entities refer to concrete objects or concepts in the question text that play a decisive role in the retrieval process, such as specific characters, chapters, plots, user intent, and analytical dimensions. In practice, natural language processing techniques, large language models, or rule-based natural language understanding engines can be used to parse the question text and extract the target entities.
[0064] Taking the use of a large language model as an example, the question text can first be filled into a preset prompt template to obtain a prompt. For example: Please extract the target entity from the following user-input question text "Analyze the personality arc of the protagonist Li Ming from the first episode to the last episode". The target entity is the core object or scope that the user is concerned about, such as a specific character, chapter, plot, user intent, analysis dimension, etc.
[0065] Then, the prompt is input into the large language model, which parses it to obtain the target entity. In practice, the parsing result output by the large language model is usually a structured object containing fields such as "Entities", "Analysis_Dimension", and "Time_Span".
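The template-filling and parsing flow can be sketched as below. The text does not specify the serialization of the structured object, so JSON is assumed here. The `llm` argument is a hypothetical callable that takes a prompt string and returns the model's reply, and `fake_llm` is a stand-in used only to make the sketch runnable.

```python
import json

PROMPT_TEMPLATE = (
    "Please extract the target entity from the following user-input question text "
    '"{question}". The target entity is the core object or scope that the user is '
    "concerned about, such as a specific character, chapter, plot, user intent, or "
    "analysis dimension. Reply with a JSON object containing the fields "
    '"Entities", "Analysis_Dimension", and "Time_Span".'
)

REQUIRED_FIELDS = ("Entities", "Analysis_Dimension", "Time_Span")

def extract_target_entities(question, llm):
    """Fill the template, query the model, and validate the structured reply."""
    prompt = PROMPT_TEMPLATE.format(question=question)
    result = json.loads(llm(prompt))
    if not isinstance(result, dict):
        raise ValueError("expected a JSON object")
    missing = [field for field in REQUIRED_FIELDS if field not in result]
    if missing:
        raise ValueError(f"reply is missing fields: {missing}")
    return result

def fake_llm(prompt):
    """Stand-in for a real model call; returns a fixed reply for illustration."""
    return json.dumps({
        "Entities": ["Li Ming", "Episode 1", "Episode 2"],
        "Analysis_Dimension": "Personality Arc",
        "Time_Span": ["start", "end"],
    })

question = ("Analyze the personality arc of the protagonist Li Ming "
            "from the first episode to the last episode")
parsed = extract_target_entities(question, fake_llm)
print(parsed["Entities"])
```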
[0066] For example, a large language model can output the following:
Entities = {"Li Ming", "Episode 1", "Episode 2"}
Analysis_Dimension = "Personality Arc"
Time_Span = ["start", "end"]
Sub-step 1022: Based on the target entity, determine the target text segment in the ultra-long text.
[0067] Specifically, each text segment can carry one or more tags, which can be used to indicate the episode number, scene number, chapter number, character, etc. corresponding to the text segment. After the question text is received, the target text segments associated with the question text can first be determined based on the keywords in the question text and the tags of each text segment in the long text; there are usually multiple target text segments. Then, the cached data of the target text segments associated with the question text is retrieved from the cache library.
[0068] Optionally, after the long text is segmented in the early stage, an identifier can be assigned to each text segment obtained by each segmentation method, and a set of metadata can be generated. The identifier and metadata of each text segment can be stored in the cache library. The metadata can take the form of a triple consisting of a text identifier, a text segment type (Module_Type), and a text segment identifier (Module_Identifier).
[0069] Among them, the text identifier uniquely identifies a long text. Module_Type is the text segment type, which describes the structural unit according to which the text segment is segmented. For example, it can include but is not limited to "episode", "scene", "character", etc. Module_Identifier is the text segment identifier under this text segment type, which uniquely identifies a specific text segment under the given text segment type, and its format depends on the text segment type. For example, the episode number "5", the character name "Li Ming", etc.
[0070] As an example, after a long text is segmented by episode, the text segment corresponding to the 5th episode is assigned its own identifier, and its metadata records the text identifier of that long text, the text segment type "episode", and the text segment identifier "5".
[0071] On this basis, the metadata in the cache library can be retrieved according to the target entity. For example, for the example in sub-step 1021 above, the metadata with Module_Type being "character" and Module_Identifier being "Li Ming" can be retrieved, and the metadata with Module_Type being "episode" and Module_Identifier being 1 and 30 (assuming a total of 30 episodes) can be retrieved. Further, several intermediate episodes can be randomly selected, or several key episodes can be determined based on the large language model, and the metadata whose Module_Identifier corresponds to those episodes can be retrieved. After retrieval, the text segments corresponding to the retrieved metadata can be determined as the target text segments.
[0072] By assigning an identifier and metadata to each text fragment, fast and accurate indexing and retrieval can be achieved in the cache. This process eliminates the need to understand the specific content of the cache; the target fragment can be found simply by matching the metadata, thus improving retrieval efficiency.
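A minimal in-memory sketch of such metadata-keyed retrieval is shown below. `Module_Type` and `Module_Identifier` come from the description, while the `text_id` field name, the `CacheLibrary` class, and the placeholder byte strings standing in for quantized cache data are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SegmentMetadata:
    text_id: str            # identifies the long text (field name is hypothetical)
    module_type: str        # e.g. "episode", "scene", "character"
    module_identifier: str  # e.g. "5" or "Li Ming"

class CacheLibrary:
    """Maps segment metadata to cached data without inspecting the data itself."""

    def __init__(self):
        self._entries = {}

    def put(self, metadata, cached_data):
        self._entries[metadata] = cached_data

    def retrieve(self, text_id, module_type, module_identifiers):
        """Return (metadata, cached_data) pairs whose metadata matches the request."""
        wanted = set(module_identifiers)
        return [
            (meta, data)
            for meta, data in self._entries.items()
            if meta.text_id == text_id
            and meta.module_type == module_type
            and meta.module_identifier in wanted
        ]

library = CacheLibrary()
for episode in range(1, 31):
    meta = SegmentMetadata("script-001", "episode", str(episode))
    library.put(meta, b"<quantized cache for episode %d>" % episode)
library.put(SegmentMetadata("script-001", "character", "Li Ming"), b"<quantized cache>")

episode_hits = library.retrieve("script-001", "episode", ["1", "30"])
character_hits = library.retrieve("script-001", "character", ["Li Ming"])
print([m.module_identifier for m, _ in episode_hits + character_hits])
```

A production cache library would more likely index the metadata in a database or key-value store; the linear scan here only demonstrates that matching needs the metadata alone.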
[0073] In some optional embodiments, the large language model may include multiple transformer layers. Each transformer layer includes a self-attention sublayer. The key-value cache set includes key-value caches generated by the self-attention sublayers in each transformer layer of the large language model. The key-value cache includes key tensors and value tensors. The cached data for each text segment is the quantized result of the key-value cache set generated when the large language model pre-fills the text segment. Based on this, step 103 above may further include the following sub-steps:
Sub-step 1031: Dequantize the target cache data to obtain the target key-value cache set.
[0074] Specifically, if there are N target text fragments, then there are N target cached data, which can be denoted as $D_1, D_2, \dots, D_N$. These target cached data can be quickly dequantized on the GPU to obtain N target key-value cache sets, denoted as $C_1, C_2, \dots, C_N$.
[0075] In practice, the values in the target cache data can be input into the dequantization function to obtain the corresponding dequantization result. The dequantization function can be as follows:
$$\hat{x} = x_q \cdot \text{scale}$$
[0076] where $x_q$ is a value in the cached data, scale is the scaling factor stored in the cache library, and $\hat{x}$ is the dequantization result of $x_q$.
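Dequantization is the inverse mapping of the symmetric quantization used when the cache was stored. The sketch below assumes that each integer is simply multiplied by the stored scaling factor; the function name and the sample values are hypothetical.

```python
def dequantize(quantized, scale):
    """Map stored integers back to floating-point values: x_hat = x_q * scale."""
    return [q * scale for q in quantized]

stored_scale = 0.02                  # scaling factor read back from the cache library
stored_values = [0, 50, 127, -127]   # integers as stored in the cached data
restored = dequantize(stored_values, stored_scale)
print(restored)
```

Values that were clipped before quantization come back as ±T* rather than their original magnitudes, so the restored cache is an approximation of the original floating-point cache.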
[0077] Sub-step 1032: For each transformer layer in the large language model, concatenate the key tensors in the target key-value cache sets corresponding to the self-attention sublayer in that transformer layer to obtain the fused key tensor corresponding to the transformer layer; concatenate the value tensors in the target key-value cache sets corresponding to the self-attention sublayer in that transformer layer to obtain the fused value tensor corresponding to the transformer layer; and combine the fused key tensor and the fused value tensor to obtain the fused cache data corresponding to the transformer layer.
[0078] Specifically, if there are N target text fragments, then there are N target key-value cache sets, i.e., $C_1, C_2, \dots, C_N$. Assuming the large language model has L transformer layers, for the k-th target text segment $T_k$ among the N target text segments, its target key-value cache set is $C_k = \{KV_k^{(l)}\}_{l=1}^{L}$, with $KV_k^{(l)} = (K_k^{(l)}, V_k^{(l)})$, where $KV_k^{(l)}$ is the result of quantizing and then dequantizing the key-value cache generated by the self-attention sublayer in the l-th transformer layer, $K_k^{(l)}$ is the result of quantizing and then dequantizing the key tensor generated by the self-attention sublayer in the l-th transformer layer, and $V_k^{(l)}$ is the result of quantizing and then dequantizing the value tensor generated by the self-attention sublayer in the l-th transformer layer.
[0079] For the l-th transformer layer in the large language model, the quantized and dequantized key tensors generated by the self-attention sublayer of the l-th transformer layer can be concatenated to obtain the fused key tensor corresponding to the l-th transformer layer, denoted as $K_{fused}^{(l)}$. Similarly, the quantized and dequantized value tensors generated by the self-attention sublayer of the l-th transformer layer can be concatenated to obtain the fused value tensor corresponding to the l-th transformer layer, denoted as $V_{fused}^{(l)}$. That is:
[0080] $$K_{fused}^{(l)} = \mathrm{Concat}\left(K_1^{(l)}, K_2^{(l)}, \dots, K_N^{(l)}\right), \quad V_{fused}^{(l)} = \mathrm{Concat}\left(V_1^{(l)}, V_2^{(l)}, \dots, V_N^{(l)}\right)$$
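The per-layer concatenation can be sketched with nested Python lists standing in for tensors. The data layout is an assumption: `cache_sets[k][l]` holds the `(K, V)` pair of segment k at layer l, and each of K and V is a list of per-token vectors, so concatenating along the sequence dimension is list concatenation. The function name is hypothetical.

```python
def fuse_layerwise(cache_sets):
    """Concatenate keys and values of all segments, independently for each layer."""
    num_layers = len(cache_sets[0])
    if any(len(cache_set) != num_layers for cache_set in cache_sets):
        raise ValueError("all cache sets must cover the same number of layers")
    fused = []
    for layer in range(num_layers):
        fused_keys, fused_values = [], []
        for cache_set in cache_sets:
            keys, values = cache_set[layer]
            fused_keys.extend(keys)      # append this segment's tokens in order
            fused_values.extend(values)
        fused.append((fused_keys, fused_values))
    return fused

# Two segments (2 and 3 tokens), two layers, vectors of dimension 2.
segment_a = [([[1.0, 0.0]] * 2, [[0.1, 0.1]] * 2) for _ in range(2)]
segment_b = [([[0.0, 1.0]] * 3, [[0.9, 0.9]] * 3) for _ in range(2)]
fused_cache = fuse_layerwise([segment_a, segment_b])
print(len(fused_cache), len(fused_cache[0][0]))
```

With a tensor library, the same step would typically be a single concatenation call along the sequence axis for each layer; real key and value tensors usually also carry batch and attention-head dimensions, which are omitted here.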
[0081] Concatenating the key and value tensors of multiple independent target text fragments along the sequence dimension essentially physically connects multiple originally isolated context fragments into a continuous, longer context window. When a large language model performs attention calculations, its built-in self-attention mechanism naturally operates within this concatenated, ultra-large context window. Specifically, the fused cache data corresponding to the l-th transformer layer is denoted as $(K_{fused}^{(l)}, V_{fused}^{(l)})$. When a large language model uses this fused cache data, its attention calculation mechanism naturally enables information interaction across caches. For example, when a large language model processes a new lexical unit (token) $x_{new}$ from a user query, the query vector $q_{new}^{(l)}$ generated at the l-th transformer layer performs attention calculation with the whole of $K_{fused}^{(l)}$, that is:
$$\mathrm{Attention}\left(q_{new}^{(l)}, K_{fused}^{(l)}, V_{fused}^{(l)}\right) = \mathrm{softmax}\left(\frac{q_{new}^{(l)} \left(K_{fused}^{(l)}\right)^{\top}}{\sqrt{d_k}}\right) V_{fused}^{(l)}$$
where $d_k$ is the dimension of the key vectors.
[0082] The matrix multiplication $q_{new}^{(l)} \left(K_{fused}^{(l)}\right)^{\top}$ here implicitly computes the degree of association between $q_{new}^{(l)}$ and all lexical units involved in all target key-value cache sets $C_1, C_2, \dots, C_N$. In this way, using only simple concatenation operations, the standard self-attention mechanism can naturally achieve information interaction across cached data, thus functionally realizing deep fusion of global information without introducing more complex dedicated cross-attention modules or additional parameters.
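The cross-cache interaction can be made concrete with a single-query, single-head attention sketch over a fused cache. It assumes standard scaled dot-product attention and uses plain lists in place of tensors; the function name and the toy vectors are hypothetical.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention of one query vector over all cached tokens."""
    d_k = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    peak = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim_v = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return output, weights

# Fused keys and values: 2 tokens from one segment followed by 3 from another.
fused_keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
fused_values = [[0.1, 0.1], [0.1, 0.1], [0.9, 0.9], [0.9, 0.9], [0.9, 0.9]]
new_query = [0.5, 0.5]

context, attention_weights = attend(new_query, fused_keys, fused_values)
print(len(attention_weights), round(sum(attention_weights), 6))
```

Because the softmax is normalized over the whole concatenated sequence, every token from every target segment receives a weight, which is the cross-segment interaction the paragraph describes.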
[0083] Furthermore, fusion is performed independently within each layer, rather than across layers or in a mixed manner. This ensures that when the fused cached data enters the self-attention sublayer of the l-th transformer layer, its mathematical form and semantic meaning are completely consistent with the key-value cache generated by that sublayer in the regular pre-filling stage. Therefore, large language models can directly understand and utilize these fused cached data without any adaptation, maintaining the highest computational efficiency and compatibility.
[0084] Based on this, in step 104 above, the question text can be input into the large language model, the fusion cache data corresponding to each self-attention sub-layer in the large language model can be injected into the self-attention sub-layer, and the response text can be obtained by reasoning through the large language model.
[0085] Therefore, when calculating self-attention, each transformer layer of the large language model no longer uses the dynamically calculated keys and values of that layer, but directly uses the fusion cache data injected from the outside, corresponding to that layer. The fusion cache data injected into each self-attention sub-layer is equivalent to providing the model with a highly condensed contextual environment customized for the current question text. When generating the response text, the attention mechanism of the large language model can consult any part of this contextual environment, thereby obtaining specific details of the target text fragment in the extremely long text, rather than relying solely on the generalized knowledge in the model parameters. This ensures that the large language model can accurately generate response text based on the complete content related to the question text in the extremely long text, greatly improving the accuracy of text analysis.
[0086] Based on the above embodiments, Figure 2 illustrates the process of applying the text processing method to a script analysis scenario. As shown in Figure 2, the processing can be divided into two stages: an offline preprocessing stage and an online analysis and inference stage.
[0087] In the offline processing stage, the script source file is first segmented by episode / scene / character using the script preprocessing module to obtain structured text fragments and generate metadata for each text fragment. Then, the key-value cache generation module calls a pre-trained large language model to generate key-value cache sets for each text fragment. Next, the key-value cache compression module calculates a scaling factor and quantizes the key-value cache sets for each text fragment. The scaling factor, the quantized cache data, and the metadata for each text fragment are then stored in the cache library.
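The first offline step, structured segmentation with metadata generation, might look like the sketch below for the episode case. The header format (a line of the form "Episode N") is purely an assumption about the script source file, and the `text_id` key and function name are hypothetical; scene-based or character-based segmentation would need their own rules.

```python
import re

# Assumed header format: a line consisting only of "Episode <number>".
EPISODE_HEADER = re.compile(r"^Episode[ \t]+(\d+)[ \t]*$", re.MULTILINE)

def segment_by_episode(script_text, text_id):
    """Split a script into episode segments and attach metadata to each one."""
    headers = list(EPISODE_HEADER.finditer(script_text))
    segments = []
    for i, header in enumerate(headers):
        end = headers[i + 1].start() if i + 1 < len(headers) else len(script_text)
        body = script_text[header.end():end].strip()
        metadata = {
            "text_id": text_id,  # hypothetical key name for the text identifier
            "Module_Type": "episode",
            "Module_Identifier": header.group(1),
        }
        segments.append((metadata, body))
    return segments

script = "Episode 1\nLi Ming enters.\nEpisode 2\nLi Ming leaves.\n"
episode_segments = segment_by_episode(script, "script-001")
for metadata, body in episode_segments:
    print(metadata["Module_Identifier"], body)
```

Each resulting segment would then be pre-filled by the large language model, quantized, and stored in the cache library together with its metadata.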
[0088] During the online analysis and inference phase, users or large language model evaluators can input question text through a user query interface. The query understanding module then analyzes the question text to determine the target entity, analysis dimensions, and retrieval plan. Based on the results output by the query understanding module, the cached data retrieval module performs metadata retrieval on the cache library, returning the cached data of the target text fragment corresponding to the retrieved metadata, and extracting the scaling factor stored in the cache library. The dequantization module then uses this scaling factor to dequantize the cached data of the target text fragment, obtaining the dequantized result, i.e., the target key-value cache set. Next, the cross-cache dynamic fusion module concatenates the key and value tensors corresponding to the self-attention sublayers in the same transformer layer of the target key-value cache set, obtaining the fused cached data for each transformer layer. The cross-cache dynamic fusion module can be implemented on a GPU (Graphics Processing Unit) using dedicated CUDA (Compute Unified Device Architecture) kernel functions. Finally, the question text is input into the large language model, and the fused cached data is injected into the corresponding self-attention sublayer in the large language model, yielding the response text output by the large language model.
[0089] By combining offline preprocessing with online intelligent analysis, efficient and in-depth analysis of ultra-long script texts is achieved. In the offline stage, structured segmentation, cache generation, and quantized storage transform the script content into a lightweight and reusable key-value cache resource library, fundamentally overcoming the input length limitations of large language models. In the online stage, relying on semantic understanding and caching fusion technology, a global context environment relevant to user queries is dynamically constructed, enabling the model to generate accurate and coherent responses without repeatedly processing the original text. This improves the efficiency and accuracy of ultra-long text analysis while reducing manual labor costs.
[0090] With further reference to Figure 3, as an implementation of the methods shown in the above figures, this application provides an embodiment of a text processing apparatus. This apparatus embodiment corresponds to the method embodiment shown in Figure 1, and the apparatus can be specifically applied to various electronic devices.
[0091] As shown in Figure 3, the text processing device 300 of this embodiment includes: a receiving unit 301, used to receive a question text for an ultra-long text, wherein the ultra-long text is text whose length is greater than the upper limit of the input length of the large language model; a retrieval unit 302, used to retrieve cached data of target text segments associated with the question text from a pre-established cache library, wherein the cache library is used to store cached data of each text segment in the ultra-long text, and the cached data of each text segment is generated based on the key-value cache set generated when the large language model pre-fills the text segment; a fusion unit 303, used to fuse the cached data of the target text segments to obtain fused cached data; and a generation unit 304, used to generate a response text for the question text based on the fused cached data and using the large language model.
[0092] In some optional implementations of this embodiment, the cached data of each text segment in the ultra-long text is generated through the following steps: the ultra-long text is segmented to obtain multiple text segments; for each text segment in the ultra-long text, the text segment is input into the large language model, and pre-filling processing is performed through the large language model to obtain the key-value cache set corresponding to the text segment, the key-value cache set including the key-value cache generated by each attention sublayer of the large language model; the key-value cache set corresponding to each text segment in the ultra-long text is quantized to obtain the cached data of each text segment in the ultra-long text.
[0093] In some optional implementations of this embodiment, the long text includes a film or television script; the step of segmenting the long text to obtain multiple text fragments includes: querying script elements in the long text, wherein the script elements include at least one of the following: episode number, scene number, and character name; and segmenting the long text based on the script elements to obtain multiple text fragments.
[0094] In some optional implementations of this embodiment, the step of quantizing the key-value cache set corresponding to each text segment in the ultra-long text to obtain the cached data of each text segment in the ultra-long text includes: inputting the calibration dataset into the large language model, performing pre-filling processing through the large language model, and statistically analyzing the original numerical distribution of the key-value cache in the key-value cache set generated by the large language model; searching for an optimal clipping threshold through an optimization algorithm, wherein the optimal clipping threshold is obtained by minimizing the difference between the original numerical distribution and the reconstructed numerical distribution, wherein the reconstructed numerical distribution is the numerical distribution obtained after clipping the data according to the clipping threshold and then performing quantization and dequantization in sequence; determining a scaling factor based on the optimal clipping threshold and a preset maximum integer value; and quantizing the values in the key-value cache set corresponding to each text segment in the ultra-long text based on the scaling factor and the maximum integer value to obtain the cached data of each text segment in the ultra-long text.
[0095] In some optional implementations of this embodiment, the retrieval unit 302 is further configured to: extract target entities from the question text; determine target text segments in the long text based on the target entities; and retrieve cached data of the target text segments from the cache library as target cached data.
[0096] In some optional implementations of this embodiment, the cache library also stores metadata for each text fragment, including the text identifier of the long text, the text fragment type, and the text fragment identifier under the text fragment type; the retrieval unit 302 is further configured to: retrieve the metadata in the cache library based on the target entity, and determine the text fragment corresponding to the retrieved metadata as the target text fragment.
[0097] In some optional implementations of this embodiment, the large language model includes multiple transformer layers, each transformer layer includes a self-attention sub-layer, the key-value cache set includes key-value caches generated by the respective attention sub-layers in the large language model, and the cache data of each text segment is the quantization result of the key-value cache set generated when the large language model performs pre-filling processing on the text segment.
[0098] In some optional implementations of this embodiment, the key-value cache includes a key tensor and a value tensor, wherein the key tensor is the identifier of a word in the text segment, and the value tensor represents the information content actually contained in the word; the fusion unit 303 is further configured to: dequantize the cached data of the target text segment to obtain a target key-value cache set; for each self-attention sublayer in the large language model, concatenate the key tensors corresponding to the self-attention sublayer in the target key-value cache set to obtain a fused key tensor corresponding to the self-attention sublayer; concatenate the value tensors corresponding to the self-attention sublayer in the target key-value cache set to obtain a fused value tensor corresponding to the self-attention sublayer; and summarize the fused key tensor and the fused value tensor to obtain the fused cache data corresponding to the self-attention sublayer.
[0099] In some optional implementations of this embodiment, the generation unit 304 is further configured to: input the question text into the large language model, inject the fusion cache data corresponding to each self-attention sub-layer in the large language model into the self-attention sub-layer, and perform inference through the large language model to obtain the response text.
[0100] The apparatus provided in the above embodiments of this application, upon receiving a question text for an excessively long text, first retrieves cached data of target text segments associated with the question text from a pre-established cache library. The cache library stores cached data of each text segment in the excessively long text, and the cached data of each text segment is generated based on a key-value cache set generated when the large language model pre-fills the text segment. Then, the cached data of the target text segments are fused to obtain fused cached data. Finally, based on the fused cached data, the large language model is used to generate a response text for the question text. On the one hand, since the cached data is generated based on a key-value cache set generated when the large language model pre-fills the text segments, it can serve as an intermediate representation of the text segments. By converting the excessively long text into cached data of multiple text segments, the large language model can avoid directly processing the excessively long text during inference, thus avoiding problems such as information forgetting and logical confusion caused by the text length exceeding the input length limit, thereby improving analysis efficiency and the accuracy of analysis results. On the other hand, by fusing cached data of target text fragments related to the question text, the large language model can simultaneously access all information related to the question text during the reasoning process. This enables it to capture long-distance dependencies between text fragments related to the question text, avoids missing details, and further improves the accuracy of the analysis results.
[0101] Referring now to Figure 4, a schematic diagram of the structure of an electronic device used to implement some embodiments of this application is shown. The electronic device shown in Figure 4 is merely an example and should not be construed as limiting the functionality and scope of the embodiments of this application.
[0102] As shown in Figure 4, electronic device 400 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 401, which can perform various appropriate actions and processes according to a program stored in read-only memory (ROM) 402 or a program loaded from storage device 408 into random access memory (RAM) 403. RAM 403 also stores various programs and data required for the operation of electronic device 400. Processing device 401, ROM 402, and RAM 403 are interconnected via bus 404. Input/output (I/O) interface 405 is also connected to bus 404.
[0103] Typically, the following devices can be connected to I/O interface 405: input devices 406 including, for example, touchscreens, touchpads, keyboards, mice, cameras, microphones, accelerometers, gyroscopes, etc.; output devices 407 including, for example, liquid crystal displays (LCDs), speakers, vibrators, etc.; storage devices 408 including, for example, disks, hard disks, etc.; and communication devices 409. Communication device 409 allows electronic device 400 to communicate wirelessly or by wire with other devices to exchange data. Although Figure 4 shows an electronic device 400 with various devices, it should be understood that it is not required to implement or possess all of the devices shown. More or fewer devices may alternatively be implemented or possessed. Each box shown in Figure 4 can represent one device or multiple devices as needed.
[0104] In particular, according to some embodiments of this application, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, some embodiments of this application include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via communication device 409, or installed from storage device 408, or installed from ROM 402. When the computer program is executed by processing device 401, it performs the functions defined above in the methods of some embodiments of this application.
[0105] It should be noted that the computer-readable medium described in some embodiments of this application may be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. A computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In some embodiments of this application, a computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In some embodiments of this application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals may take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: wires, optical fibers, RF (radio frequency), etc., or any suitable combination thereof.
[0106] In some implementations, clients and servers can communicate using any currently known or future-developed network protocol such as HTTP (Hypertext Transfer Protocol) and can interconnect with digital data communication (e.g., communication networks) of any form or medium. Examples of communication networks include local area networks (“LANs”), wide area networks (“WANs”), the Internet (e.g., the Internet of Things), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.
[0107] The aforementioned computer-readable medium may be included in the aforementioned electronic device; or it may exist independently and not assembled into the electronic device. The aforementioned computer-readable medium carries one or more programs that, when executed by the electronic device, cause the electronic device to: receive a question text for an excessively long text, where the excessively long text is text whose length exceeds the upper limit of the input length of the large language model; retrieve cached data of target text segments associated with the question text from a pre-established cache library, the cache library being used to store cached data of each text segment in the excessively long text, the cached data of each text segment being generated based on a key-value cache set generated when the large language model pre-fills the text segment; fuse the cached data of the target text segments to obtain fused cached data; and generate a response text for the question text based on the fused cached data using the large language model.
[0108] Computer program code for performing operations of some embodiments of this application can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network, including local area networks (LANs) or wide area networks (WANs), or it can be connected to an external computer (e.g., via the Internet using an Internet service provider).
[0109] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.
[0110] The units described in some embodiments of this application can be implemented in software or hardware. The described units can also be housed in a processor; for example, a processor may be described as including a receiving unit, a retrieval unit, a fusion unit, and a generation unit. The names of these units do not, in themselves, limit the units.
[0111] The functions described above in this document can be performed at least in part by one or more hardware logic components. For example, exemplary types of hardware logic components that can be used, without limitation, include: field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), system-on-a-chip (SoCs), complex programmable logic devices (CPLDs), and so on.
[0112] The above description is merely a selection of preferred embodiments of this application and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of the invention involved in the embodiments of this application is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the above-described inventive concept. For example, it covers technical solutions formed by interchanging the above-described features with (but not limited to) technical features with similar functions disclosed in the embodiments of this application.
Claims
1. A text processing method, characterized in that the method includes: receiving a question text for an ultra-long text, where the ultra-long text is text whose length is greater than the upper limit of the input length of a large language model; retrieving, from a pre-established cache library, cached data of target text fragments associated with the question text, where the cache library is used to store cached data of each text fragment in the ultra-long text, and the cached data of each text fragment is generated based on the key-value cache set generated when the large language model pre-fills that text fragment; fusing the cached data of the target text fragments to obtain fused cached data; and generating, based on the fused cached data, a response text for the question text using the large language model.
2. The method according to claim 1, characterized in that the cached data of each text fragment in the ultra-long text is generated through the following steps: segmenting the ultra-long text to obtain multiple text fragments; for each text fragment in the ultra-long text, inputting the text fragment into the large language model and performing pre-filling through the large language model to obtain the key-value cache set corresponding to the text fragment, where the key-value cache set includes the key-value cache generated by each attention sub-layer in the large language model; and quantizing the key-value cache set corresponding to each text fragment in the ultra-long text to obtain the cached data of each text fragment in the ultra-long text.
3. The method according to claim 2, characterized in that the ultra-long text includes a film or television script, and segmenting the ultra-long text to obtain multiple text fragments includes: querying the script elements in the ultra-long text, where the script elements include at least one of the following: episode number, scene number, and character name; and segmenting the ultra-long text based on the script elements to obtain multiple text fragments.
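As an illustration only, segmentation by script elements might be sketched as follows. The heading patterns (`Episode N`, `Scene N`, `NAME:`), the function name `segment_script`, and the sample script are assumptions made for this sketch; the application does not specify how script elements are detected.

```python
import re

# Hypothetical heading patterns; real scripts may mark these elements differently.
EPISODE_RE = re.compile(r"^Episode\s+(\d+)\s*$", re.IGNORECASE)
SCENE_RE = re.compile(r"^Scene\s+(\d+)\b.*$", re.IGNORECASE)
CHARACTER_RE = re.compile(r"^([A-Z][A-Z ]+):")


def segment_script(script):
    """Split a script into one fragment per scene, tagged with its script elements."""
    fragments = []
    episode = None
    current = None
    for raw_line in script.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        episode_match = EPISODE_RE.match(line)
        if episode_match:
            episode = int(episode_match.group(1))
            continue
        scene_match = SCENE_RE.match(line)
        if scene_match:
            # A scene heading starts a new fragment.
            current = {"episode": episode, "scene": int(scene_match.group(1)),
                       "characters": set(), "lines": []}
            fragments.append(current)
            continue
        if current is None:
            continue  # text before the first scene heading is skipped in this sketch
        speaker = CHARACTER_RE.match(line)
        if speaker:
            current["characters"].add(speaker.group(1).strip())
        current["lines"].append(line)
    return fragments


SAMPLE = """Episode 1
Scene 1 INT. KITCHEN
ANNA: Where is the letter?
BEN: I burned it.
Scene 2 EXT. STREET
ANNA: He lied to me.
Episode 2
Scene 1 INT. OFFICE
BEN: I had no choice.
"""

fragments = segment_script(SAMPLE)
for fragment in fragments:
    print(fragment["episode"], fragment["scene"], sorted(fragment["characters"]))
```

Splitting at scene boundaries keeps each fragment self-contained, and the recorded episode number, scene number, and character names can later serve as lookup keys for the fragment.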
4. The method according to claim 2, characterized in that quantizing the key-value cache set corresponding to each text fragment in the ultra-long text to obtain the cached data of each text fragment in the ultra-long text includes: inputting a calibration dataset into the large language model, performing pre-filling through the large language model, and collecting statistics on the original numerical distribution of the key-value caches in the key-value cache set generated by the large language model, where the calibration dataset is obtained by randomly extracting some text fragments from the multiple text fragments of the ultra-long text and aggregating them; searching for an optimal clipping threshold using an optimization algorithm, where the optimal clipping threshold minimizes the difference between the original numerical distribution and a reconstructed numerical distribution, and the reconstructed numerical distribution is the numerical distribution obtained after clipping the data according to the optimal clipping threshold and then performing quantization and dequantization in sequence; determining a scaling factor based on the optimal clipping threshold and a preset maximum integer value; and quantizing, based on the scaling factor and the maximum integer value, the values in the key-value cache set corresponding to each text fragment in the ultra-long text to obtain the cached data of each text fragment in the ultra-long text.
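The calibration procedure can be illustrated with a minimal sketch. Several choices here are assumptions, not taken from the application: a plain grid search stands in for the unspecified optimization algorithm, mean squared error stands in for the unspecified measure of difference between the original and reconstructed distributions, the quantization is symmetric, and the calibration values are made up.

```python
def quantize(values, scale, qmax):
    """Scale, round to integers, and clip the codes to [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [code * scale for code in codes]


def reconstruction_error(values, threshold, qmax):
    """Mean squared error between values and their clip-quantize-dequantize image."""
    scale = threshold / qmax
    restored = dequantize(quantize(values, scale, qmax), scale)
    return sum((v - r) ** 2 for v, r in zip(values, restored)) / len(values)


def search_clip_threshold(calibration_values, qmax, steps=100):
    """Grid search over candidate thresholds (assumes at least one nonzero value)."""
    max_abs = max(abs(v) for v in calibration_values)
    best_threshold, best_error = max_abs, float("inf")
    for i in range(1, steps + 1):
        candidate = max_abs * i / steps
        error = reconstruction_error(calibration_values, candidate, qmax)
        if error < best_error:
            best_threshold, best_error = candidate, error
    return best_threshold


QMAX = 127  # preset maximum integer value, here for signed 8-bit codes

# Hypothetical calibration values standing in for sampled key-value cache entries.
calibration = [0.01 * k for k in range(-50, 51)] + [8.0]
threshold = search_clip_threshold(calibration, QMAX)
scale = threshold / QMAX  # scaling factor from the threshold and the maximum integer

codes = quantize([0.25, -0.4, 8.0], scale, QMAX)
restored = dequantize(codes, scale)
print(threshold, codes, restored)
```

Clipping the integer codes to `[-QMAX, QMAX]` is equivalent to clipping the original values to the threshold, because the scaling factor is defined as the threshold divided by the maximum integer value.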
5. The method according to claim 1, characterized in that retrieving, from a pre-established cache library, cached data of target text fragments associated with the question text includes: extracting a target entity from the question text; determining the target text fragment in the ultra-long text based on the target entity; and retrieving the cached data of the target text fragment from the cache library as the target cached data.
6. The method according to claim 5, characterized in that the cache library also stores metadata of each text fragment, the metadata including the text identifier of the ultra-long text, the text fragment type, and the text fragment identifier under the text fragment type; and determining the target text fragment in the ultra-long text based on the target entity includes: searching the metadata in the cache library based on the target entity, and determining the text fragment corresponding to the retrieved metadata as the target text fragment.
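A minimal sketch of entity-based lookup against fragment metadata is shown below. The class and function names (`FragmentMeta`, `CacheLibrary`, `extract_entities`), the fragment type names, and the substring-matching entity extraction are all hypothetical; the application does not describe how entities are extracted or how the cache library is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FragmentMeta:
    text_id: str        # identifier of the ultra-long text
    fragment_type: str  # e.g. "scene" or "character" (hypothetical type names)
    fragment_id: str    # identifier of the fragment under its type


class CacheLibrary:
    """In-memory stand-in for the cache library, keyed by fragment metadata."""

    def __init__(self):
        self._entries = {}

    def put(self, meta, cached_data):
        self._entries[meta] = cached_data

    def fragment_ids(self, text_id):
        return [m.fragment_id for m in self._entries if m.text_id == text_id]

    def find(self, text_id, entities):
        """Return cached data of fragments whose identifier matches an entity."""
        wanted = {entity.lower() for entity in entities}
        return {meta: data for meta, data in self._entries.items()
                if meta.text_id == text_id and meta.fragment_id.lower() in wanted}


def extract_entities(question, known_ids):
    """Toy entity extraction: keep known identifiers that occur in the question."""
    lowered = question.lower()
    return [name for name in known_ids if name.lower() in lowered]


library = CacheLibrary()
library.put(FragmentMeta("script-1", "character", "Anna"), b"cached-anna")
library.put(FragmentMeta("script-1", "character", "Ben"), b"cached-ben")
library.put(FragmentMeta("script-1", "scene", "E01S02"), b"cached-e01s02")

question = "Why did Anna leave in E01S02?"
entities = extract_entities(question, library.fragment_ids("script-1"))
hits = library.find("script-1", entities)
print(sorted(meta.fragment_id for meta in hits))
```

Keying the library on metadata means only the fragments relevant to the question are loaded, rather than the cached data of the whole ultra-long text.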
7. The method according to claim 1, characterized in that the large language model includes multiple transformer layers, each of which includes a self-attention sub-layer; the key-value cache set includes the key-value cache generated by each self-attention sub-layer in the large language model; and the cached data of each text fragment is the quantized result of the key-value cache set generated when the large language model pre-fills that text fragment.
8. The method according to claim 7, characterized in that the key-value cache includes key tensors and value tensors, where a key tensor serves as the identifier of a word in the text fragment and a value tensor represents the actual information content contained in the word; and fusing the cached data of the target text fragments to obtain fused cached data includes: dequantizing the cached data of the target text fragments to obtain a target key-value cache set; and for each self-attention sub-layer in the large language model, concatenating the key tensors corresponding to the self-attention sub-layer in the target key-value cache set to obtain the fused key tensor corresponding to the self-attention sub-layer, concatenating the value tensors corresponding to the self-attention sub-layer in the target key-value cache set to obtain the fused value tensor corresponding to the self-attention sub-layer, and combining the fused key tensor and the fused value tensor to obtain the fused cached data corresponding to the self-attention sub-layer.
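The per-layer fusion can be sketched as follows. The data layout is an assumption for illustration: each fragment's cached data is a list with one entry per self-attention sub-layer, and each entry holds integer key codes, integer value codes (one row per word), and a scaling factor. The names `fuse_cached_data`, `key_codes`, and `value_codes` are hypothetical.

```python
def dequantize_rows(rows, scale):
    """Turn rows of integer codes back into rows of real values."""
    return [[code * scale for code in row] for row in rows]


def fuse_cached_data(fragment_caches):
    """Dequantize each fragment's cache and concatenate keys and values per layer.

    fragment_caches: list of fragments; each fragment is a list with one entry
    per self-attention sub-layer (all fragments are assumed to have the same depth).
    """
    num_layers = len(fragment_caches[0])
    fused = []
    for layer in range(num_layers):
        keys, values = [], []
        for fragment in fragment_caches:
            entry = fragment[layer]
            # Concatenate along the word (sequence) axis, in fragment order.
            keys.extend(dequantize_rows(entry["key_codes"], entry["scale"]))
            values.extend(dequantize_rows(entry["value_codes"], entry["scale"]))
        fused.append({"keys": keys, "values": values})
    return fused


# Two hypothetical fragments, two layers each, with two-dimensional keys and values.
fragment_a = [
    {"key_codes": [[2, 0], [0, 2]], "value_codes": [[4, 4], [2, 2]], "scale": 0.5},
    {"key_codes": [[1, 1], [3, 3]], "value_codes": [[0, 2], [2, 0]], "scale": 0.5},
]
fragment_b = [
    {"key_codes": [[4, 4]], "value_codes": [[6, 0]], "scale": 0.25},
    {"key_codes": [[8, 0]], "value_codes": [[0, 8]], "scale": 0.25},
]

fused = fuse_cached_data([fragment_a, fragment_b])
print(fused[0]["keys"])
print(fused[0]["values"])
```

Fusion is done independently per sub-layer because each sub-layer produces its own keys and values; caches from different layers are never mixed.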
9. The method according to claim 8, characterized in that generating, based on the fused cached data, a response text for the question text using the large language model includes: inputting the question text into the large language model, injecting the fused cached data corresponding to each self-attention sub-layer in the large language model into that self-attention sub-layer, and performing inference through the large language model to obtain the response text.
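To make the effect of injection concrete, the following toy sketch computes single-head scaled dot-product attention for one question position over fused keys and values plus the position's own key and value. It is an assumption-laden simplification: a real large language model has many heads and layers and would receive the cache through its framework's own mechanism, and the function names and numbers here are hypothetical.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys]
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(score - peak) for score in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    width = len(values[0])
    return [sum(w * value[i] for w, value in zip(weights, values)) for i in range(width)]


def attend_with_injected_cache(query, own_key, own_value, fused_layer):
    """Attend over the injected fused cache followed by the question's own entry."""
    keys = fused_layer["keys"] + [own_key]
    values = fused_layer["values"] + [own_value]
    return attend(query, keys, values)


# Hypothetical fused cache for one self-attention sub-layer.
fused_layer = {
    "keys": [[1.0, 0.0], [0.0, 1.0]],
    "values": [[10.0, 0.0], [0.0, 10.0]],
}

# A zero query weights all three entries equally.
uniform = attend_with_injected_cache([0.0, 0.0], [0.0, 0.0], [5.0, 5.0], fused_layer)

# A query aligned with the first cached key retrieves mostly the first cached value.
focused = attend_with_injected_cache([50.0, 0.0], [0.0, 0.0], [5.0, 5.0], fused_layer)
print(uniform, focused)
```

Because the cached keys and values are already available, the question position can attend to the target fragments without the model re-reading their text.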
10. A text processing device, characterized in that the device includes: a receiving unit, used to receive a question text for an ultra-long text, where the ultra-long text is text whose length is greater than the upper limit of the input length of a large language model; a retrieval unit, used to retrieve, from a pre-established cache library, cached data of target text fragments associated with the question text, where the cache library is used to store cached data of each text fragment in the ultra-long text, and the cached data of each text fragment is generated based on the key-value cache set generated when the large language model pre-fills that text fragment; a fusion unit, used to fuse the cached data of the target text fragments to obtain fused cached data; and a generation unit, used to generate, based on the fused cached data, a response text for the question text using the large language model.
11. An electronic device, characterized in that it includes: one or more processors; and a storage device on which one or more programs are stored, where, when the one or more programs are executed by the one or more processors, the one or more processors implement the method as described in any one of claims 1-9.
12. A computer-readable medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the method as described in any one of claims 1-9.