Method and medium for compressing context of large language model

The window occupancy and semantic drift of a large language model's context are obtained and combined by weighted summation to decide when to compress, and a differentiated compression strategy is applied to each type of subtext. This addresses the blindness of existing context compression for large language models and improves compression efficiency and accuracy.

CN122242527A · Pending · Publication Date: 2026-06-19 · VOYAH AUTOMOBILE TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
VOYAH AUTOMOBILE TECH CO LTD
Filing Date
2026-01-29
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Large language models face a tension in practical applications between context size and the model's input capacity limit. Existing methods compress blindly, for example solely when the window reaches an occupancy threshold, resulting in poor efficiency.

Method used

Before the current round of dialogue with the large language model begins, the window occupancy of the current context text and the semantic drift of the current user input are obtained and combined by weighted summation into a compression trigger value. If the trigger value is greater than or equal to a preset threshold, the context is compressed, with a differentiated compression strategy applied to each type of subtext.

Benefits of technology

It improves the efficiency of context compression for large language models, avoids compressing blindly on window occupancy alone, alleviates information overload and semantic mismatch, and preserves the accuracy of the compressed information.



Abstract

This application discloses a method and medium for compressing the context of a large language model. The method includes: before the current round of dialogue with the large language model begins, obtaining the window occupancy of the current context text and the semantic drift of the user input in the current round, wherein the current context text includes the user input in the current round and the dialogue text from previous rounds, the window occupancy is the proportion of the large language model's input capacity occupied by the current context text, and the semantic drift is the degree of semantic deviation between the user input in the current round and the dialogue text from previous rounds; performing a weighted summation of the window occupancy and the semantic drift to obtain a compression trigger value; and, if the compression trigger value is greater than or equal to a preset compression trigger threshold, compressing the current context text. To a certain extent, this application avoids the blindness of compressing solely because the window has reached an occupancy threshold, thus improving the compression efficiency of the large language model context.

Description

Technical Field

[0001] This application belongs to the field of text compression technology, and in particular relates to a compression method and medium for a large language model context.

Background Technology

[0002] With the development of Large Language Models (LLMs), they have been widely applied in various fields, such as assisted programming. One of the problems faced by LLMs in practical applications is the contradiction between the context size and the model input capacity limit. To alleviate this contradiction, related technologies have proposed schemes to compress the context.

[0003] Currently, related technologies compress the context somewhat blindly, resulting in poor efficiency.

Summary of the Invention

[0004] The embodiments of this application provide a method and medium for compressing large language model contexts, which at least to some extent avoids the blindness of compressing based solely on the window reaching the occupancy threshold, thereby improving the efficiency of large language model context compression.

[0005] Other features and advantages of this application will become apparent from the following detailed description, or may be learned in part from practice of this application.

[0006] The first aspect of this application provides a method for compressing the context of a large language model, comprising:
before the dialogue in the current round of the large language model begins, obtaining the window occupancy of the current context text and the semantic drift of the user input in the current round, wherein the current context text includes the user input in the current round and the dialogue text in previous rounds, the window occupancy is the proportion of the input capacity of the large language model occupied by the current context text, and the semantic drift is the degree of semantic deviation between the user input in the current round and the dialogue text in previous rounds;
performing a weighted summation of the window occupancy and the semantic drift to obtain a compression trigger value; and
if the compression trigger value is greater than or equal to a preset compression trigger threshold, compressing the current context text.

[0007] Optionally, obtaining the semantic drift of the user input in the current round includes:
encoding the user input of the current round into a first semantic vector, and encoding the dialogue text of the previous rounds into a second semantic vector; and
obtaining the semantic similarity between the first semantic vector and the second semantic vector, and determining the semantic drift based on the semantic similarity, wherein the semantic similarity and the semantic drift are negatively correlated.

[0008] Optionally, the current context text includes multiple types of subtext, and compressing the current context text includes: for each type of subtext, compressing the subtext using the compression strategy corresponding to that type, wherein at least some types of subtext have different compression strategies.

[0009] Optionally, the large language model is used to assist programming, and the types of subtext include at least one of: code snippet subtext, non-code subtext, and tool call result subtext. Compressing the subtext using the compression strategy corresponding to the subtext includes:
for the code snippet subtext, parsing the code snippet subtext with an abstract syntax tree parser and removing the non-syntactically necessary elements identified by the parsing, to obtain the compressed code subtext corresponding to the code snippet subtext;
for the non-code subtext, using a lightweight summarization model to generate a summary subtext of the non-code subtext, the summary subtext including at least one of: the user goal, the user input in the current round, and a key dependency list; and
for the tool call result subtext, converting the tool call result subtext into field subtext in a preset format and retaining the field names and key values of the field subtext, to obtain compressed field subtext.

[0010] Optionally, compressing the current context text further includes:
obtaining a dialogue queue, which includes multiple dialogue subtexts sorted by dialogue time from most recent to oldest, each dialogue subtext corresponding to one of the historical rounds;
extracting a reference dialogue fragment from the tail of the dialogue queue, and obtaining a target dialogue fragment in the dialogue queue whose semantic similarity to the reference dialogue fragment is greater than or equal to a preset similarity threshold; and
removing the reference dialogue fragment and the target dialogue fragment from the dialogue queue, aggregating them to obtain an aggregated summary fragment, and inserting the aggregated summary fragment at the end of the dialogue queue.

[0011] Optionally, after obtaining the aggregated summary fragment, the method further includes: setting a fragment identifier for the aggregated summary fragment; and storing the reference dialogue fragment and the target dialogue fragment, and establishing a mapping between the fragment identifier and the storage paths of the reference dialogue fragment and the target dialogue fragment.

[0012] Optionally, each dialogue subtext includes structured fields including: user input for the corresponding round, large language model output for the corresponding round, dialogue timestamp for the corresponding round, and dialogue intent tag for the corresponding round.

[0013] Optionally, the current context text includes multiple different types of information, each type carrying a label that marks it as either retainable or discardable. Before compressing the current context text, the method further includes: obtaining the label corresponding to each type of information, and removing the information labeled as discardable from the current context text.

[0014] Optionally, before obtaining the window occupancy of the current context text and the semantic drift of the user input in the current round, the method further includes: If a text compression instruction is received from the user, then the step of compressing the current context text is executed.

[0015] A second aspect of this application provides a compression device for large language model contexts, comprising:
an acquisition unit, used to obtain the window occupancy of the current context text and the semantic drift of the user input in the current round before the dialogue in the current round of the large language model begins, wherein the current context text includes the user input in the current round and the dialogue text in the previous rounds, the window occupancy is the proportion of the input capacity of the large language model occupied by the current context text, and the semantic drift is the degree of semantic deviation between the user input in the current round and the dialogue text in the previous rounds; and
a compression unit, used to perform a weighted summation of the window occupancy and the semantic drift to obtain a compression trigger value and, if the compression trigger value is greater than or equal to a preset compression trigger threshold, to compress the current context text.

[0016] A third aspect of this application provides a computer-readable storage medium storing at least one computer program instruction, which is loaded and executed by a processor to perform the operations described in any of the methods described in the first aspect.

[0017] A fourth aspect of this application provides an electronic device including one or more processors and one or more memories, wherein at least one piece of program code is stored in the one or more memories, and the at least one piece of program code is loaded and executed by the one or more processors to perform the operation as described in any of the methods in the first aspect.

[0018] The one or more technical solutions provided in the embodiments of this application achieve at least the following technical effects or advantages:
The method for compressing the context of a large language model according to embodiments of this application includes: before the current round of dialogue with the large language model begins, obtaining the window occupancy of the current context text and the semantic drift of the user input in the current round, wherein the current context text includes the user input in the current round and the dialogue text in previous rounds, the window occupancy is the proportion of the input capacity of the large language model occupied by the current context text, and the semantic drift is the degree of semantic deviation between the user input in the current round and the dialogue text in previous rounds; performing a weighted summation of the window occupancy and the semantic drift to obtain a compression trigger value; and, if the compression trigger value is greater than or equal to a preset compression trigger threshold, compressing the current context text.
It can be understood that the window occupancy reflects the input capacity pressure on the large language model, and the semantic drift measures how far the user input in the current round has moved away from the historical dialogue. When the compression trigger value, that is, the weighted sum of the two, reaches or exceeds the compression trigger threshold, it indicates that capacity pressure and topic shift, taken together, have become significant. In this case, compressing the current context can alleviate information overload and semantic mismatch to some extent, and avoids the blindness of compressing solely because the window has reached an occupancy threshold, thereby improving the compression efficiency of the large language model context.

[0019] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit this application.

Brief Description of the Drawings

[0020] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application. The drawings described below show merely some embodiments of this application, and those skilled in the art can derive other drawings from them without creative effort. In the drawings:
Figure 1 shows a flowchart of a method for compressing large language model contexts according to an embodiment of this application;
Figure 2 shows a structural diagram of a large language model context compression device according to an embodiment of this application;
Figure 3 shows a schematic diagram of the structure of a computer system suitable for implementing the electronic device of this application.

Detailed Implementation

[0021] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of the embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.

[0022] Furthermore, the described features, structures, or characteristics can be combined in any suitable manner in one or more embodiments. Numerous specific details are provided in the following description to give a thorough understanding of embodiments of this application. However, those skilled in the art will recognize that the technical solutions of this application can be practiced without one or more of the specific details, or other methods, components, apparatuses, steps, etc., can be employed. In other instances, well-known methods, apparatuses, implementations, or operations are not shown or described in detail to avoid obscuring various aspects of this application.

[0023] The block diagrams shown in the accompanying drawings are merely functional entities and do not necessarily correspond to physically independent entities. That is, these functional entities can be implemented in software, in one or more hardware modules or integrated circuits, or in different models and / or processor devices and / or microcontroller devices.

[0024] The flowcharts shown in the accompanying drawings are merely illustrative and do not necessarily include all content and operations / steps, nor do they necessarily have to be performed in the described order. For example, some operations / steps can be broken down, while others can be combined or partially combined; therefore, the actual execution order may change depending on the specific circumstances.

[0025] It should also be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such uses of these terms can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described.

[0026] With the development of Large Language Models (LLMs), LLMs are widely used in various fields, such as assisted programming. For example, known assisted programming tools include, but are not limited to, GitHub Copilot, Claude Code, and Amazon CodeWhisperer. These tools have shown great potential in software development, performing well in code completion, code generation, and error correction.

[0027] The performance of assisted programming tools is highly dependent on a precise understanding of multi-source heterogeneous programming contexts. Contexts typically include: (1) target code snippets (local code at the current editing location); (2) global project code (related code in other files or modules of the same project); (3) version control history (such as the modification intent and logical evolution in Git commit records); (4) tool call results (such as the detection feedback output by static code analysis tools); and (5) user interaction logs (historical question-answer pairs), etc. Together, this information forms the knowledge base for the large language model's reasoning, directly affecting the semantic accuracy, logical coherence, and task adaptability of code generation.

[0028] However, one of the problems faced by large language models in practical applications is the contradiction between the context size and the model input capacity limit. Although the context window of current mainstream large models has been expanded to 128K-200K tokens, the context overload problem remains significant in complex programming scenarios (such as large-scale project development and multi-round error debugging). In addition, a context exceeding 10K tokens can lead to a 5-10 times increase in response time and a more than 3-fold increase in token consumption cost, affecting user experience and development efficiency. Furthermore, an excessively large context size can also cause the large language model's attention to become scattered, such as a decrease in focus on key code logic and confusion about the modification intentions of different versions of code, thus weakening the accuracy of model inference.

[0029] To alleviate the above contradictions, related technologies have proposed schemes for compressing the context, such as: (1) Simple truncation method: When the context window is close to its maximum capacity, the early dialog content is directly truncated. Although this method is simple, it may lead to the loss of key information. Especially in code generation scenarios, the loss of variable scope and function call relationships will seriously affect the correctness of the generated code.

[0030] (2) Overall summarization method: This method uses a large language model to generate a summary of the entire context. While this method can reduce the number of tokens, it may lose the specific syntactic structure and variable relationships required for code generation, resulting in generated code that does not meet user requirements.

[0031] (3) Retrieval enhancement method based on RAG (Retrieval-Augmented Generation): After retrieving relevant documents, they are concatenated into the context. This method is prone to context bloat in multi-round interactions and is not optimized for code generation scenarios, failing to effectively preserve variable scope and function call relationships.

[0032] (4) Fixed window method: Only the content of the most recent N rounds of dialogue is retained. This method may lose key requirements and constraints that the user has explicitly specified in the early stages of the interaction, especially in the code generation process, where the user may have defined important variables or functions in the early stages of the dialogue, and this information is still crucial in the later stages of code generation.

[0033] It is evident that existing methods for compressing the context of large language models are somewhat blind. Therefore, this application provides a method for compressing the context of large language models, which to some extent avoids the blindness of compressing solely because the window has reached an occupancy threshold.

[0034] The compression method for large language model contexts according to embodiments of this application will be described below with reference to the accompanying drawings.

[0035] Figure 1 shows a flowchart of a method for compressing large language model contexts according to an embodiment of this application.

[0036] The first aspect of this application provides a method for compressing the context of a large language model, including but not limited to:
Step S10. Before the dialogue of the current round of the large language model begins, obtain the window occupancy of the current context text and the semantic drift of the user input in the current round, wherein the current context text includes the user input of the current round and the dialogue text of previous rounds, the window occupancy is the proportion of the input capacity of the large language model occupied by the current context text, and the semantic drift is the degree of semantic deviation between the user input of the current round and the dialogue text of previous rounds.
In this embodiment, the current round can refer to the new round of interaction that the user has just initiated with the large language model, for example by submitting input to it. Before responding to the user input in the current round, the large language model decides whether to compress the current context text.

[0037] It is understandable that the input capacity (or context window, context length) of a large language model can refer to the maximum amount of text (generally measured in terms of tokens or characters) that the large language model can process in a single operation. Input capacity can be used to describe the upper limit of the range of dialogues, documents, or instructions that a large language model can utilize when generating responses.

[0038] The current context text can refer to all the information needed to understand the current dialogue. It includes not only the new question the user has just asked but also historical dialogue text from previous rounds, allowing the model to maintain dialogue coherence. Window occupancy refers to the percentage of the total length of the current context text relative to the model's maximum input capacity. It reflects the degree of context window consumption; when the occupancy is too high, the model may be unable to accept new long text or begin to forget the earliest key information.

[0039] For ease of understanding, the calculation of the window occupancy is expressed by the following formula:
$$R = \frac{T_{\mathrm{cur}}}{T_{\max}}$$
where $R$ denotes the window occupancy, $T_{\mathrm{cur}}$ denotes the total number of tokens in the current context text, and $T_{\max}$ denotes the input capacity of the large language model.
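The ratio described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `window_occupancy` and its parameter names are hypothetical, and the token counts are assumed to come from whatever tokenizer the deployed model uses.

```python
def window_occupancy(context_tokens: int, max_tokens: int) -> float:
    """Fraction of the model's input capacity used by the current context.

    context_tokens: token count of the current context text (current user
                    input plus previous-round dialogue), counted elsewhere.
    max_tokens:     the model's input capacity in tokens.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if context_tokens < 0:
        raise ValueError("context_tokens must be non-negative")
    return context_tokens / max_tokens


if __name__ == "__main__":
    # A 64,000-token context in a 128,000-token window is half full.
    print(window_occupancy(64_000, 128_000))
```

Returning a fraction rather than a percentage keeps the value on the same scale as the example trigger thresholds of 0.4 to 0.6 given later in the description.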

[0040] Understandably, semantic drift can be used to measure the conversational coherence of the current user input relative to historical dialogue text, such as the degree of deviation from previous dialogue text in terms of topic, intent, or core entities. For example, a user might be in an in-depth tech discussion and suddenly ask for a cooking recipe without any transition, which results in high semantic drift.

[0041] In some embodiments, obtaining the semantic drift of the user input in the current round includes:
Step S101. Encode the user input of the current round into a first semantic vector, and encode the dialogue text of the previous rounds into a second semantic vector.
For example, a deep learning model (such as the Sentence-BERT model) is used to encode the current user input and the historical dialogue text. The encoding process of the Sentence-BERT model is illustrated below.

[0042] Sentence-BERT (SBERT) is a variant of the pre-trained Transformer model BERT, built on a Siamese / triplet network architecture and used to generate semantically meaningful sentence embeddings. The Sentence-BERT architecture removes the output layer that BERT uses for downstream tasks (such as classification) and adds a pooling layer on top of BERT's token outputs (usually mean pooling or max pooling; the output of the classification token can also be used). This pooling layer aggregates the variable-length sentence encoding output into a fixed-dimensional dense semantic vector.

[0043] During training, SBERT fine-tunes its model parameters in a supervised manner. Its training samples are sentence pairs or sentence triplets with known semantic relationships, for example positive pairs (similar sentences) and negative pairs (dissimilar sentences). The model uses a Siamese network structure, passing the two sentences through the same BERT encoder with shared weights to obtain their respective vector representations. The training objective (loss function) is typically a mean squared error loss on cosine similarity or a triplet loss. With triplet loss, the model learns to adjust its parameters so that the distance between an anchor sentence and a positive example sentence in the vector space is smaller, by a margin, than its distance to a negative example sentence, thereby improving the semantic discriminative ability of the embedding space. The training process iteratively updates the model parameters through backpropagation and an optimizer, so that the generated sentence vectors reflect semantic content and semantically similar sentences lie close together in the vector space.

[0044] Step S102. Obtain the semantic similarity between the first semantic vector and the second semantic vector, and determine the semantic drift degree based on the semantic similarity, wherein the semantic similarity and the semantic drift degree are negatively correlated.

[0045] For ease of understanding, the calculation of the semantic drift is expressed by the following formula:
$$D = 1 - \frac{1}{k}\sum_{i=1}^{k}\cos\left(\mathbf{v}_{\mathrm{cur}}, \mathbf{v}_i\right)$$
where $D$ denotes the semantic drift, $\cos(\cdot,\cdot)$ denotes cosine similarity, $\mathbf{v}_{\mathrm{cur}}$ denotes the first semantic vector, $\mathbf{v}_i$ denotes the second semantic vector corresponding to the $i$-th historical round, and $k$ denotes the total number of historical rounds (that is, of second semantic vectors).

[0046] It can be understood that there can be one or more second semantic vectors. For example, each historical round corresponds to one second semantic vector. The cosine similarity between the current input vector and each historical vector can be calculated, and the average of these values taken as the semantic similarity.
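As a rough sketch of steps S101 and S102, the following Python computes drift from vectors that are assumed to have been produced already by a sentence encoder such as Sentence-BERT. The function names are hypothetical, and taking drift as one minus the mean cosine similarity is one simple way to satisfy the stated negative correlation, not necessarily the exact mapping intended by the application.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_drift(current: Sequence[float],
                   history: Sequence[Sequence[float]]) -> float:
    """One minus the mean cosine similarity between the current-input vector
    and each historical-round vector. With no history there is nothing to
    drift from, so 0.0 is returned (an assumption of this sketch)."""
    if not history:
        return 0.0
    mean_similarity = sum(cosine_similarity(current, h) for h in history) / len(history)
    return 1.0 - mean_similarity


if __name__ == "__main__":
    current_vec = [1.0, 0.0]
    history_vecs = [[1.0, 0.0], [0.0, 1.0]]  # one on-topic round, one unrelated
    print(semantic_drift(current_vec, history_vecs))
```

Because cosine similarity can be negative, this value can exceed 1; a caller that needs drift on the same 0 to 1 scale as window occupancy may want to clamp it.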

[0047] Step S20. The window occupancy rate and the semantic drift are weighted and summed to obtain a compression trigger value. If the compression trigger value is greater than or equal to a preset compression trigger threshold, the current context text is compressed.

[0048] For ease of understanding, the weighted summation of the window occupancy and the semantic drift is expressed by the following formula:
$$C = \alpha R + \beta D$$
where $R$ denotes the window occupancy, $D$ denotes the semantic drift, $\alpha$ denotes the weight of the window occupancy, $\beta$ denotes the weight of the semantic drift, and $C$ denotes the compression trigger value. $\alpha$ and $\beta$ can be adjusted according to actual needs, for example $\alpha$ greater than $\beta$, or $\alpha$ less than $\beta$.

[0049] For example, the compression trigger threshold can be 0.4-0.6, such as 0.4, 0.5 or 0.6.

[0050] It can be understood that window occupancy reflects the input capacity pressure on the large language model, while semantic drift measures how far the current user input has moved away from the historical dialogue. When the compression trigger value, that is, the weighted sum of the two, reaches or exceeds the compression trigger threshold, it indicates that capacity pressure and topic shift, taken together, have become significant. In this situation, compressing the current context can alleviate information overload and semantic mismatch to some extent, and avoids the blind approach of compressing solely because window occupancy has reached a threshold.
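The trigger decision of step S20 can be sketched as follows. The equal weights of 0.5 are purely illustrative assumptions (the application leaves the weights adjustable), the default threshold of 0.5 is taken from the example range of 0.4 to 0.6, and the function names are hypothetical.

```python
def compression_trigger(occupancy: float, drift: float,
                        w_occupancy: float = 0.5, w_drift: float = 0.5) -> float:
    """Weighted sum of window occupancy and semantic drift."""
    return w_occupancy * occupancy + w_drift * drift


def should_compress(occupancy: float, drift: float, threshold: float = 0.5,
                    w_occupancy: float = 0.5, w_drift: float = 0.5) -> bool:
    """Compress when the trigger value is greater than or equal to the threshold."""
    return compression_trigger(occupancy, drift, w_occupancy, w_drift) >= threshold


if __name__ == "__main__":
    # Nearly full window, but the user is still on topic: no compression.
    print(should_compress(occupancy=0.95, drift=0.0))
    # Fairly full window and a clear topic change: compression is triggered.
    print(should_compress(occupancy=0.7, drift=0.6))
```

With these illustrative weights, a nearly full window alone does not trigger compression while the conversation stays on topic, which is the behaviour that distinguishes this scheme from a pure occupancy threshold.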

[0051] In some embodiments, the current context text includes multiple types of subtext, and the compression of the current context text includes: For each type of subtext, a compression strategy corresponding to the subtext is used to compress the subtext, wherein at least some types of subtext have different compression strategies.

[0052] It can be understood that different types of subtext may differ in information density, redundancy, degree of structure, and importance to the subsequent dialogue, so applying the same compression method to all of them may lead either to insufficient compression and low efficiency, or to excessive compression and the loss of key semantics. Therefore, this application introduces a mechanism that classifies subtext and matches each class to its own strategy, improving compression efficiency through differentiated compression.
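The classify-and-match mechanism can be pictured as a lookup from subtext type to strategy. In the sketch below, the type labels, the function names, and the two trivial stand-in strategies are all hypothetical placeholders chosen only to make the dispatch runnable; they are not the strategies of the application.

```python
from typing import Callable, Dict, List, Tuple


def strip_blank_lines(text: str) -> str:
    """Stand-in strategy: drop empty lines."""
    return "\n".join(line for line in text.splitlines() if line.strip())


def collapse_whitespace(text: str) -> str:
    """Stand-in strategy: collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


# Hypothetical type labels mapped to their (placeholder) strategies.
STRATEGIES: Dict[str, Callable[[str], str]] = {
    "code": strip_blank_lines,
    "non_code": collapse_whitespace,
    "tool_result": collapse_whitespace,
}


def compress_context(subtexts: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Apply the strategy registered for each subtext's type.

    Subtexts of an unregistered type are passed through unchanged.
    """
    compressed = []
    for kind, text in subtexts:
        strategy = STRATEGIES.get(kind, lambda t: t)
        compressed.append((kind, strategy(text)))
    return compressed


if __name__ == "__main__":
    context = [("code", "a = 1\n\nb = 2"), ("non_code", "fix   the  bug")]
    print(compress_context(context))
```

Keeping the strategies in a table means a new subtext type can be supported by registering one more function, without touching the dispatch loop.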

[0053] In some embodiments, the large language model is used to assist programming, and the types of subtext include at least one of: code snippet subtext, non-code subtext, and tool call result subtext. Compressing the subtext using the compression strategy corresponding to the subtext includes:
Step S201. For the code snippet subtext, use an abstract syntax tree parser to parse the code snippet subtext, remove the non-syntactically necessary elements identified by the parsing, and obtain the compressed code subtext corresponding to the code snippet subtext.
It can be understood that an Abstract Syntax Tree (AST) parser transforms the code snippet from text into a tree-like data representation that reflects its syntactic structure (such as functions, loops, and conditional branches). Based on this structure, the non-syntactically necessary elements are identified and removed. These include, but are not limited to, comments, blank lines, unnecessary line breaks, and redundant indentation. By regenerating code from the simplified AST, a syntactically equivalent but smaller compressed code snippet is obtained. This not only achieves a high compression ratio but, more importantly, keeps the compressed code syntactically and logically consistent with the original, avoiding semantic distortion or syntax errors and meeting the precision requirements of programming scenarios.
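For Python source, step S201 can be approximated with the standard-library `ast` module, as sketched below. This is an illustration under assumptions: the function name is hypothetical, `ast.unparse` requires Python 3.9 or later, and other languages would need their own parsers. Note that docstrings are part of the syntax tree and therefore survive this round trip; only comments, blank lines, and non-canonical formatting are dropped.

```python
import ast


def compress_python_code(source: str) -> str:
    """Round-trip Python source through its AST.

    ast.parse discards comments and blank lines because they are not part of
    the syntax tree; ast.unparse then regenerates canonical source text.
    Raises SyntaxError if the snippet is not valid Python.
    """
    tree = ast.parse(source)
    return ast.unparse(tree)


if __name__ == "__main__":
    original = (
        "# add two numbers\n"
        "def add(a, b):\n"
        "\n"
        "    # compute the sum\n"
        "    return a + b  # result\n"
    )
    compressed = compress_python_code(original)
    print(compressed)
    # The compressed snippet parses to the same tree as the original.
    print(ast.dump(ast.parse(compressed)) == ast.dump(ast.parse(original)))
```

Comparing the two `ast.dump` outputs is a cheap way to check that compression has not changed the program's structure.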

[0054] Step S202. For the non-code subtext, a lightweight summarization model is used to generate a summary subtext of the non-code subtext. The summary subtext includes at least one of: the user goal, the user input in the current round, and a key dependency list.
For example, non-code subtext includes, but is not limited to, requirements and questions described in natural language, and explanatory text from the dialogue history. For such text, a lightweight summarization model (such as T5-Small or BART) is used for semantic condensation, extracting the key semantic elements that drive the programming task rather than retaining all literal information. The summary focuses on:
User goal: the user's programming intent or functional goal, inferred and summarized from the dialogue history.
User input in the current round: the core content of the latest request, which is the direct trigger for the large language model's response in the current round.
Key dependency list: key dependency information extracted from the history, such as explicit constraints, technology stack requirements, and interface specifications.
Thus, by transforming lengthy natural language dialogue into a concise summary, the volume of unstructured text is reduced while the core context required for decision-making and generation is retained.

[0055] Taking the T5-Small model as an example, summary generation relies on the model's encoder-decoder Transformer structure: the encoder processes the input non-code subtext sequence through multiple layers of self-attention and fully connected feedforward networks, transforming it into a sequence of hidden states rich in contextual semantics; the decoder then generates the summary token by token in an autoregressive manner, attending to the encoder output through a cross-attention mechanism at each generation step. The model is trained on a large number of supervised samples consisting of long-text and summary pairs, and the training objective is to maximize the likelihood of the reference summary given the input text, i.e., to minimize the standard sequence-to-sequence negative log-likelihood loss. During training, the model parameters are iteratively updated through backpropagation and the Adam optimizer, so that the model learns to identify and condense core elements such as the user goal, the current input, and key dependencies from the complete context. With its relatively small number of parameters (approximately 60 million) and unified text-to-text framework, T5-Small balances summary quality and computational efficiency. It is therefore well suited to resource-limited, real-time compression scenarios, condensing lengthy dialogues into concise summary subtexts containing the specified elements.
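For reference, the sequence-to-sequence negative log-likelihood objective described above is commonly written as follows. The notation is assumed here for illustration and does not appear in this application: $x$ is the input non-code subtext, $y = (y_1, \dots, y_T)$ is the reference summary, and $\theta$ denotes the model parameters.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\left(y_t \mid y_{<t},\, x\right)
```

Minimizing $\mathcal{L}(\theta)$ over the training pairs is equivalent to maximizing the likelihood of each reference summary given its input text.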

[0056] Step S203. For the tool call result subtext, convert the tool call result subtext into a field subtext in a preset format, retain the field names and key values of the field subtext, and obtain the compressed field subtext.

[0057] For example, the results of tool calls can be: static analysis reports, syntax check feedback, execution logs, error feedback, etc.

[0058] Specifically, the tool call result subtext is typically the data response returned after the model calls an external tool or API (such as a code executor, package management query, or database). The strategy for compressing it is structured extraction and formatting. First, the tool call result subtext is converted to a preset format, such as JSON or XML. Then, field names and key values are extracted, for example: {"tool": "pylint", "line": 12, "error_code": "E111", "message": "indentation error"}. This preserves the key factual information in the structured data while removing natural language explanations, formatting delimiters, and other redundant metadata, reducing token overhead.
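A minimal sketch of Step S203 is given below. It assumes a hypothetical raw line layout of the form `path:line: CODE message`; the pattern, the function name `compress_tool_result`, and the sample output are illustrative assumptions rather than the actual output format of any particular tool.

```python
import json
import re

# Hypothetical line layout "path:line: CODE message"; real tools differ.
LINE_PATTERN = re.compile(
    r"^(?P<path>[^:]+):(?P<line>\d+):\s*(?P<error_code>[A-Z]\d+)\s+(?P<message>.+)$"
)

def compress_tool_result(tool: str, raw_output: str) -> str:
    """Keep only field names and key values, serialized as compact JSON."""
    records = []
    for line in raw_output.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match is None:
            continue  # banners, separators, and prose explanations are dropped
        records.append({
            "tool": tool,
            "line": int(match["line"]),
            "error_code": match["error_code"],
            "message": match["message"].strip(),
        })
    return json.dumps(records, separators=(",", ":"))

raw = """************* Module demo
demo.py:12: E111 indentation error
Your code has been rated at 5.00/10
"""
compact = compress_tool_result("pylint", raw)
print(compact)
```

Only the line that matches the assumed layout survives; the banner and the rating line carry no field values and are discarded.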

[0059] In some embodiments, compressing the current context text further includes: Step S304. Obtain a dialogue queue, the dialogue queue including multiple dialogue subtexts sorted by dialogue time from most recent to oldest, each dialogue subtext being the dialogue subtext of one historical round. In some embodiments, each dialogue subtext includes the following structured fields: the user input of the corresponding round, the large language model output of the corresponding round, the dialogue timestamp of the corresponding round, and the dialogue intent tag of the corresponding round.

[0060] Taking an assisted programming tool based on a large language model as an example, the user input includes, but is not limited to: technical requirement descriptions (such as implementing a quicksort function in Python), code snippets to be edited (such as part of a function body), error feedback (such as a TypeError raised at runtime), and syntax consultation. The large language model output includes, but is not limited to: generated code, tool invocation instructions (such as calling pylint to check syntax), and error explanations (such as solutions for undefined variables). The dialogue timestamp records the UTC (Coordinated Universal Time) time of that round of dialogue, for context recovery and sequence tracing. The dialogue intent tag is generated automatically by a pre-trained intent recognition model (a four-class classification model fine-tuned from BERT), for example one of four categories: code generation, debugging, syntax consultation, and tool invocation. The tag provides a basis for subsequent similarity judgment and compression strategy selection.
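The structured fields of a dialogue subtext could be represented as in the sketch below. The class names `Intent` and `DialogueSubtext`, the field names, and the sample values are hypothetical; only the four fields and the four intent categories come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum

class Intent(str, Enum):
    """The four intent categories named in the description."""
    CODE_GENERATION = "code_generation"
    DEBUGGING = "debugging"
    SYNTAX_CONSULTATION = "syntax_consultation"
    TOOL_INVOCATION = "tool_invocation"

@dataclass
class DialogueSubtext:
    user_input: str      # user input of the round
    model_output: str    # large language model output of the round
    timestamp: datetime  # UTC time of the round
    intent: Intent       # tag produced by an intent classifier (not shown)

turn = DialogueSubtext(
    user_input="Implement quicksort in Python",
    model_output="def quicksort(xs): ...",
    timestamp=datetime(2026, 1, 29, 8, 0, tzinfo=timezone.utc),
    intent=Intent.CODE_GENERATION,
)
print(turn.intent.value, turn.timestamp.isoformat())
```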

[0061] For example, the dialogue queue is maintained in memory as a list and synchronized in real time to a persistent cache (such as Redis or a local file), so that the context can be fully restored after a power outage or restart, avoiding information loss.

[0062] The dialogue queue includes multiple dialogue subtexts sorted by dialogue time from most recent to oldest. For example, suppose there are three historical rounds which, in chronological order, produced dialogue subtext A (first round), dialogue subtext B (second round), and dialogue subtext C (third round). The dialogue queue is then {dialogue subtext C, dialogue subtext B, dialogue subtext A}, with dialogue subtext A located at the tail of the queue.

[0063] Step S305. Extract a reference dialogue fragment from the tail of the dialogue queue, and obtain a target dialogue fragment in the dialogue queue whose semantic similarity to the reference dialogue fragment is greater than or equal to a preset similarity threshold. For example: extract the reference dialogue fragment from the tail of the dialogue queue; traverse the queue from the tail toward the head; use the Sentence-BERT model to compute the cosine similarity between the semantic vectors of the reference dialogue fragment and each traversed dialogue; treat dialogues whose similarity is greater than or equal to 0.7 as similar; and stop once N similar dialogues have been collected or the head of the queue is reached. Capping the number at N (for example, N = 4, 5, or 6) avoids over-aggregation, which would blur the information.
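The traversal in Step S305 can be sketched as follows, assuming the semantic vectors have already been computed by some sentence encoder. The function names and the two-dimensional toy vectors are hypothetical, and the direction of traversal (tail toward head) is one reading of the step.

```python
import math
from typing import List, Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def find_similar_indices(
    embeddings: List[Sequence[float]],
    threshold: float = 0.7,
    max_similar: int = 4,
) -> List[int]:
    """Return indices of target fragments similar to the tail entry.

    embeddings[0] belongs to the head (most recent round) and
    embeddings[-1] to the tail (oldest round), which is the reference.
    """
    if not embeddings:
        return []
    reference = embeddings[-1]
    targets: List[int] = []
    # Walk from the entry next to the tail toward the head.
    for i in range(len(embeddings) - 2, -1, -1):
        if cosine_similarity(reference, embeddings[i]) >= threshold:
            targets.append(i)
            if len(targets) == max_similar:
                break  # cap reached, stop to avoid over-aggregation
    return targets

# Toy vectors for the queue {C, B, A}: B is close to A, C is unrelated.
toy = [[0.0, 1.0], [0.9, 0.1], [1.0, 0.0]]
print(find_similar_indices(toy))
```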

[0064] Step S306. Remove the reference dialogue fragment and the target dialogue fragment from the dialogue queue, and perform summary aggregation on the reference dialogue fragment and the target dialogue fragment to obtain an aggregated summary fragment, and insert the aggregated summary fragment into the tail of the dialogue queue.

[0065] For example, the BART-large model extracts common requirements, core code logic, and unresolved issues to generate an aggregated summary dialogue. The newly generated aggregated summary dialogue is then inserted at the tail of the queue, replacing the multiple redundant entries. This reduces token consumption while maintaining the logical coherence of the context.
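The queue update in Step S306 can be sketched as below. The summarizer is passed in as a callable so that the example stays self-contained; the `aggregate_tail` name and the string-joining stand-in for the summarization model are assumptions for illustration only.

```python
from collections import deque
from typing import Callable, Deque, List

def aggregate_tail(
    queue: Deque[str],
    target_indices: List[int],
    summarize: Callable[[List[str]], str],
) -> None:
    """Replace the tail (reference) and target fragments by one summary, in place."""
    if not queue:
        return
    # The reference fragment is always the tail entry.
    removed = sorted(set(target_indices) | {len(queue) - 1})
    fragments = [queue[i] for i in removed]
    for i in reversed(removed):  # delete from the back so indices stay valid
        del queue[i]
    queue.append(summarize(fragments))  # aggregated summary goes to the tail

def fake_summarize(fragments: List[str]) -> str:
    # Stand-in for a summarization model such as BART-large.
    return "SUMMARY(" + " | ".join(fragments) + ")"

queue = deque(["C: add logging", "B: fix sort bug again", "A: fix sort bug"])
aggregate_tail(queue, [1], fake_summarize)
print(list(queue))
```

Three entries become two: the unrelated entry C is untouched, while A and B are replaced by a single aggregated entry at the tail.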

[0066] For example, the BART-large model generates aggregated summaries through its encoder-decoder architecture: its encoder, based on a multi-layer bidirectional Transformer structure, uses a self-attention mechanism to jointly encode the input reference dialogue fragment and the target dialogue fragment, deeply capturing the semantic connections and common patterns between multiple turns of dialogue; the decoder, acting as an autoregressive unidirectional Transformer, generates aggregated summaries word-by-word based on the fused semantic representation output by the encoder. The model's training process relies on a large-scale dialogue summarization dataset, whose samples typically consist of multiple turns of dialogue and corresponding manually annotated summaries (covering common needs, core decisions, and unresolved issues). The training objective is to minimize the cross-entropy loss between the decoder-generated sequence and the standard summary. During parameter optimization, approximately 400 million parameters of the model are updated through backpropagation and the Adam optimizer, enabling it to learn to abstract shared user intents, key code patterns, and unresolved issues from multiple similar dialogues. The resulting aggregated summaries can concisely replace the original multiple dialogue fragments, significantly compressing the context length while preserving the core logical coherence.

[0067] In some embodiments, after obtaining the aggregated summary fragment, the method further includes: setting a fragment identifier for the aggregated summary fragment; and storing the reference dialogue fragment and the target dialogue fragment, and establishing a mapping relationship between the fragment identifier and the storage paths of the reference dialogue fragment and the target dialogue fragment.

[0068] Specifically, to achieve long-term traceability of the text, the reference dialogue fragments and target dialogue fragments that have been aggregated are packaged in JSONL format (each dialogue is one JSON object), together with additional metadata (including the compression timestamp, the identifier of the corresponding aggregated summary, etc.), and stored in a distributed cold storage system (such as MinIO). An association index between the aggregated summary fragments and the cold storage data is established in a relational database (such as SQLite), recording the mapping between the aggregated summary identifier and the cold storage file path. The original reference and target dialogue fragments can then be traced quickly from an aggregated summary fragment, ensuring complete data traceability and, for example, enabling long-term storage and management of dialogue data in assisted programming tools.
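The archive-and-index scheme described above can be sketched with the standard library alone. In this sketch a temporary local directory stands in for the cold storage system, and the table name `summary_index`, the column names, and the function names are hypothetical.

```python
import json
import sqlite3
import tempfile
import uuid
from pathlib import Path
from typing import Dict, List

def archive_fragments(conn: sqlite3.Connection, storage_dir: Path,
                      fragments: List[Dict], compressed_at: str) -> str:
    """Write the original fragments as JSONL and index the file by summary id."""
    summary_id = uuid.uuid4().hex
    path = storage_dir / f"{summary_id}.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for fragment in fragments:
            record = {**fragment, "compressed_at": compressed_at,
                      "summary_id": summary_id}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    conn.execute(
        "INSERT INTO summary_index (summary_id, storage_path) VALUES (?, ?)",
        (summary_id, str(path)),
    )
    conn.commit()
    return summary_id

def load_originals(conn: sqlite3.Connection, summary_id: str) -> List[Dict]:
    """Trace an aggregated summary back to the fragments it replaced."""
    row = conn.execute(
        "SELECT storage_path FROM summary_index WHERE summary_id = ?",
        (summary_id,),
    ).fetchone()
    if row is None:
        raise KeyError(summary_id)
    with open(row[0], encoding="utf-8") as f:
        return [json.loads(line) for line in f]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE summary_index (summary_id TEXT PRIMARY KEY, storage_path TEXT NOT NULL)"
)
storage_dir = Path(tempfile.mkdtemp())  # local stand-in for cold storage
fid = archive_fragments(
    conn, storage_dir,
    [{"user_input": "fix sort bug"}, {"user_input": "fix sort bug again"}],
    "2026-01-29T08:00:00Z",
)
restored = load_originals(conn, fid)
print(len(restored), "fragments restored")
```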

[0069] In some embodiments, the current context text includes multiple types of information, each type carrying a tag, and each tag is either "Retain" or "Discard". Before compressing the current context text, the method further includes: obtaining the tag corresponding to each type of information, and removing the information tagged "Discard" from the current context text.

[0070] Understandably, before compression is performed, the current context text is hierarchically marked according to semantic importance and functional value, thereby providing a structured basis for subsequent differentiated compression.

[0071] For example, top-level information, including initial user requirements, project-level constraints (such as language version and framework limitations), and core architectural decisions, is tagged "Retain" to avoid compilation errors or logical deviations caused by information loss, so that the generated code remains usable in an engineering setting. Mid-level information, covering the code snippets currently being edited, unresolved compile or runtime errors, tool call results, and key variable definitions and their scopes, is also tagged "Retain." Low-level information, such as logs of historical errors that have already been fixed, repetitive confirmation messages, and redundant debug output, is tagged "Discard."

[0072] Therefore, the current context text can be preliminarily filtered before compression is performed, further reducing the text volume.
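The tag-based pre-filtering can be sketched in a few lines. The `ContextItem` type and the sample items are hypothetical; only the two tag values follow the description.

```python
from typing import List, NamedTuple

class ContextItem(NamedTuple):
    text: str
    tag: str  # "Retain" or "Discard"

def prefilter(items: List[ContextItem]) -> List[ContextItem]:
    """Drop everything tagged "Discard" before any compression runs."""
    return [item for item in items if item.tag != "Discard"]

context = [
    ContextItem("Target runtime: Python 3.11, FastAPI", "Retain"),
    ContextItem("TypeError in parse_row (unresolved)", "Retain"),
    ContextItem("Old traceback for a bug fixed earlier", "Discard"),
    ContextItem("OK, got it.", "Discard"),
]
kept = prefilter(context)
print([item.text for item in kept])
```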

[0073] In some embodiments, before obtaining the window occupancy of the current context text and the semantic drift of the user input in the current round, the method further includes: If a text compression instruction is received from the user, then the step of compressing the current context text is executed.

[0074] Specifically, to enhance user control, this application provides a manual compression interface, allowing users to actively trigger compression according to task progress. For example, parameters such as the window occupancy rate and the semantic drift of the large language model can be displayed in real time on the human-computer interaction interface, and the user can manually trigger text compression as needed. The manual instruction has the highest priority and overrides the automatic judgment logic, ensuring a high degree of user control.
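How the manual instruction interacts with the automatic trigger can be sketched as follows. The weights (0.6 and 0.4), the threshold (0.7), and the choice of drift as one minus similarity are assumptions made for illustration; the application only requires a weighted sum compared against a preset threshold and a drift that decreases as similarity increases.

```python
def semantic_drift(similarity: float) -> float:
    """Assumed mapping: drift falls as similarity rises."""
    return 1.0 - similarity

def should_compress(
    used_tokens: int,
    window_tokens: int,
    similarity: float,
    *,
    w_occupancy: float = 0.6,  # assumed weight
    w_drift: float = 0.4,      # assumed weight
    threshold: float = 0.7,    # assumed trigger threshold
    manual_request: bool = False,
) -> bool:
    if manual_request:
        return True  # the manual instruction overrides the automatic logic
    occupancy = used_tokens / window_tokens
    trigger = w_occupancy * occupancy + w_drift * semantic_drift(similarity)
    return trigger >= threshold

# Same 75% occupancy, different drift.
print(should_compress(6000, 8000, similarity=0.2))   # topic changed sharply
print(should_compress(6000, 8000, similarity=0.9))   # topic unchanged
print(should_compress(6000, 8000, similarity=0.9, manual_request=True))
```

With these assumed values, the same occupancy triggers compression only when the new input has drifted away from the history, which is the behaviour the weighted sum is meant to capture.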

[0075] Figure 2 shows a structural diagram of a large language model context compression device according to an embodiment of this application.

[0076] A second aspect of this application provides a compression device 200 for a large language model context, comprising: an acquisition unit 201, used to acquire the window occupancy rate of the current context text and the semantic drift of the user input in the current round before the dialogue in the current round of the large language model begins, wherein the current context text includes the user input in the current round and the dialogue text of previous rounds, the window occupancy rate is the proportion of the input capacity of the large language model occupied by the current context text, and the semantic drift is the degree of semantic deviation between the user input in the current round and the dialogue text of previous rounds; and a compression unit 202, used to perform a weighted summation of the window occupancy rate and the semantic drift to obtain a compression trigger value and, if the compression trigger value is greater than or equal to a preset compression trigger threshold, to compress the current context text.

[0077] A third aspect of this application provides a computer-readable storage medium storing at least one computer program instruction, which is loaded and executed by a processor to perform the operations as described in any of the methods in the first aspect.

[0078] Computer-readable storage media may be portable compact disc read-only memory (CD-ROM) and include program code, and may run on a terminal device, such as a personal computer. However, the computer-readable storage medium of this application is not limited thereto. In this application, the readable storage medium may be any tangible medium that contains or stores a program that may be used by or in conjunction with an instruction execution system, apparatus, or device.

[0079] A readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof.

[0080] Program code for performing the operations of this application can be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as C or similar languages. The program code can execute entirely on the user's computing device, partially on the user's device, as a standalone software package, partially on the user's computing device and partially on a remote computing device, or entirely on a remote computing device or server. In cases involving remote computing devices, the remote computing device can be connected to the user's computing device via any type of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (e.g., via the Internet using an Internet service provider).

[0081] In the several embodiments provided in this application, it should be understood that the disclosed technical content can be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division of units can be a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the displayed or discussed mutual couplings, direct couplings, or communication connections may be through some interfaces; indirect couplings or communication connections between units or modules may be electrical or other forms.

[0082] The units described as separate components may or may not be physically separate. Similarly, the components of the control device may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units can be selected to achieve the purpose of this embodiment, depending on actual needs.

[0083] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, read-only memory (ROM), random access memory (RAM), portable hard drives, magnetic disks, or optical disks.

[0084] Figure 3 shows a schematic diagram of the structure of a computer system suitable for implementing the electronic device of the present application.

[0085] According to a fourth aspect of the present application, an electronic device is provided, including one or more processors and one or more memories, wherein at least one piece of program code is stored in the one or more memories, the at least one piece of program code being loaded and executed by the one or more processors to perform the operations of any of the methods of the first aspect.

[0086] As shown in Figure 3, the electronic device 400 takes the form of a general-purpose computing device. The components of the electronic device 400 may include, but are not limited to: at least one processing unit 410, at least one storage unit 420, and a bus 430 connecting different system components (including storage unit 420 and processing unit 410).

[0087] The storage unit stores program code, which can be executed by the processing unit 410, causing the processing unit 410 to perform the steps described in the "Embodiment Method" section above according to various exemplary embodiments of this application.

[0088] Storage unit 420 may include readable media in the form of volatile storage units, such as random access memory (RAM) 421 and / or cache 422, and may further include read-only memory (ROM) 423.

[0089] Storage unit 420 may also include a program / utility 424 having a set (at least one) of program modules 425, such program modules 425 including but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of these examples may include an implementation of a network environment.

[0090] Bus 430 can represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit bus, or a local bus using any of a variety of bus structures.

[0091] Electronic device 400 can also communicate with one or more external devices 500 (e.g., keyboard, pointing device, Bluetooth device, etc.), and with one or more devices that enable a user to interact with electronic device 400, and / or with any device that enables electronic device 400 to communicate with one or more other computing devices (e.g., router, modem, etc.). This communication can be performed through I / O (input / output) interface 450, which can also be connected to display unit 440 to display the communication content. Furthermore, electronic device 400 can also communicate with one or more networks (e.g., local area network (LAN), wide area network (WAN), and / or public network, such as the Internet) through network adapter 460. As shown, network adapter 460 communicates with other modules of electronic device 400 via bus 430. It should be understood that, although not shown in the figures, other hardware and / or software modules can be used in conjunction with electronic device 400, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.

[0092] The functions described herein can be implemented in hardware, software executed by a processor, firmware, or any combination thereof. If implemented in software executed by a processor, the functions can be stored as one or more instructions or codes on or transmitted via a computer-readable medium. Other examples and embodiments are within the scope and spirit of this invention and the appended claims. For example, due to the nature of software, the functions described above can be implemented using software executed by a processor, hardware, firmware, hardwired, or any combination thereof. Furthermore, the functional units can be integrated into a single processing unit, or each unit can exist physically separately, or two or more units can be integrated into a single unit.

[0093] The above description is merely an embodiment of this application and is not intended to limit this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of the claims of this application.

Claims

1. A method for compressing a large language model context, characterized in that the method comprises: before the dialogue in the current round of the large language model begins, obtaining the window occupancy rate of the current context text and the semantic drift of the user input in the current round, wherein the current context text includes the user input in the current round and the dialogue text of previous rounds, the window occupancy rate is the proportion of the input capacity of the large language model occupied by the current context text, and the semantic drift is the degree of semantic deviation between the user input in the current round and the dialogue text of previous rounds; performing a weighted summation of the window occupancy rate and the semantic drift to obtain a compression trigger value; and if the compression trigger value is greater than or equal to a preset compression trigger threshold, compressing the current context text.

2. The method of claim 1, wherein the step of obtaining the semantic drift of the user input in the current round includes: encoding the user input of the current round into a first semantic vector, and encoding the dialogue text of the previous rounds into a second semantic vector; and obtaining the semantic similarity between the first semantic vector and the second semantic vector, and determining the semantic drift based on the semantic similarity, wherein the semantic similarity and the semantic drift are negatively correlated.

3. The method of claim 1, wherein the current context text includes multiple types of subtext, and the compression of the current context text includes: for each type of subtext, compressing the subtext using a compression strategy corresponding to that subtext, wherein at least some types of subtext have different compression strategies.

4. The method of claim 3, wherein the large language model is used to assist programming; the multiple types of subtext include at least one of: code snippet subtext, non-code subtext, and tool call result subtext; and compressing the subtext using a compression strategy corresponding to the subtext includes: for the code snippet subtext, using an abstract syntax tree parser to parse the code snippet subtext, removing the non-syntactically necessary elements identified by the parsing, and obtaining the compressed code subtext corresponding to the code snippet subtext; for the non-code subtext, using a lightweight summarization model to generate a summary subtext of the non-code subtext, the summary subtext including at least one of: the user goal, the user input in the current round, and a key dependency list; and for the tool call result subtext, converting the tool call result subtext into a field subtext in a preset format and retaining the field names and key values of the field subtext to obtain a compressed field subtext.

5. The method according to any of claims 1 to 4, characterized in that the compression of the current context text further includes: obtaining a dialogue queue, which includes multiple dialogue subtexts sorted by dialogue time from most recent to oldest, each dialogue subtext being the dialogue subtext of one historical round; extracting a reference dialogue fragment from the tail of the dialogue queue, and obtaining a target dialogue fragment in the dialogue queue whose semantic similarity to the reference dialogue fragment is greater than or equal to a preset similarity threshold; and removing the reference dialogue fragment and the target dialogue fragment from the dialogue queue, performing summary aggregation on the reference dialogue fragment and the target dialogue fragment to obtain an aggregated summary fragment, and inserting the aggregated summary fragment at the tail of the dialogue queue.

6. The method of claim 5, wherein, after obtaining the aggregated summary fragment, the method further includes: setting a fragment identifier for the aggregated summary fragment; and storing the reference dialogue fragment and the target dialogue fragment, and establishing a mapping relationship between the fragment identifier and the storage paths of the reference dialogue fragment and the target dialogue fragment.

7. The method of claim 5, wherein each of the dialogue subtexts contains structured fields including: the user input of the corresponding round, the large language model output of the corresponding round, the dialogue timestamp of the corresponding round, and the dialogue intent tag of the corresponding round.

8. The method of claim 1, wherein the current context text includes multiple types of information, each type carrying a tag, and each tag is either a retain tag or a discard tag; and before compressing the current context text, the method further includes: obtaining the tag corresponding to each type of information, and removing the information carrying the discard tag from the current context text.

9. The method of claim 1, wherein, before obtaining the window occupancy rate of the current context text and the semantic drift of the user input in the current round, the method further includes: if a text compression instruction is received from the user, executing the step of compressing the current context text.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores at least one computer program instruction, which is loaded and executed by a processor to perform the operations of the method according to any one of claims 1 to 9.