A multi-stage context compression method and system

By employing a multi-level context compression method with reinforcement learning optimization, the invention addresses information loss, unstable summary quality, and inaccurate resource tracking in data management. It achieves fine-grained control and resource tracking across compression boundaries, and provides multiple recovery paths to meet the resource management and information retention requirements of data processing systems.

CN122285884A · Pending · Publication Date: 2026-06-26 · HANGZHOU MAGIC BYTE TECHNOLOGY CO LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: HANGZHOU MAGIC BYTE TECHNOLOGY CO LTD
Filing Date: 2026-04-01
Publication Date: 2026-06-26

Application Information

Patent Timeline

01 Apr 2026: Application
26 Jun 2026: Publication

CN122285884A

IPC: G06F16/34; G06N3/042; G06N3/092; G06F16/174; H03M7/30

AI Tagging

Technology Topics

Information processing Algorithm

Technical Efficacy Phrases

Guaranteed accuracy; Realize adaptive adjustment


Similar Technology Patents

Electrode sheet
CN224505453U · Reduce the probability of separation and shedding; Reduce the need to repeatedly plug in wires · Mechanical engineering; Biomedical engineering
An in-situ permeability testing system and method for underground coal gasification
CN122259424A · Guaranteed accuracy; Guaranteed reliability · Permeability/surface area analysis; Thermodynamics; Petroleum engineering
Vehicle underbody living body detection method, computer program product, electronic device and vehicle
CN122200729A · Improve targeting; Improve efficiency · Biometric pattern recognition; Alarms
A skeleton structure for a wind tunnel test model, a wind tunnel test model and a method of use
CN122409128A · Adapt to support needs; Improve shape fitting accuracy · Classical mechanics; Structural engineering
Screen color uniformity on-line detection system based on dynamic rotating polarizing spectrum
CN122385147A · Guaranteed accuracy; Realize true online detection · Data acquisition; Light beam


AI Technical Summary

Technical Problem

Existing data management technologies suffer from serious information loss, unstable summary quality, lack of fine-grained control, inaccurate resource tracking, and imperfect recovery mechanisms when processing large amounts of data. This leads to inaccurate resource budget calculations and a lack of effective recovery paths when information is insufficient.

Method used

A multi-level context compression method is adopted, including semantic pre-screening, content budget control, fragment compression, lightweight compression, context folding and full summary compression. Combined with dialogue state graph and reinforcement learning optimization, it achieves fine-grained control and resource tracking across compression boundaries, and provides multiple recovery paths.

Benefits of technology

It prioritizes fine-grained compression over coarse-grained summarization, ensuring data detail preservation, accurate resource tracking across compression boundaries, and providing a flexible recovery mechanism to meet the resource efficiency and information integrity requirements of different application scenarios.

✦ Generated by Eureka AI based on patent content.

Smart Images

Figure CN122285884A_ABST

Patent Text Reader

Abstract

This invention discloses a multi-level context compression method and system, belonging to the field of information processing technology. Before performing multi-level compression, the method first performs semantic pre-screening, using a lightweight embedding model to evaluate the information entropy and semantic importance of messages, achieving context-aware dynamic budget allocation. Then, it performs multi-level compression sequentially: the first level is content budget control, which enforces an upper limit on the aggregated content size of each message before lightweight compression; the second level is fragment compression, selectively removing historical content with feature gating; the third level is lightweight compression, performing lightweight content compression and supporting delayed boundary message processing; the fourth level is context folding, performing fine-grained selective compression based on the dialogue state graph, retaining key decision nodes; and the fifth level is complete summarization, performing a complete content summary when resource usage exceeds a threshold. This invention also introduces reinforcement learning optimization, dynamically adjusting the compression strategy based on task completion rate feedback. This invention achieves a strategy that prioritizes fine-grained compression over coarse-grained summarization, differential compression based on semantic importance, topological compression based on graph structures, and adaptive strategy optimization.


Description

Technical Field

[0001] This invention relates to the field of information processing technology, specifically to context management technology for data processing systems, and more particularly to a context compression method and system that supports multi-level progressive compression and resource tracking across compression boundaries.

Background Technology

[0002] Data processing systems face storage resource constraints when processing large amounts of data. Existing data management technologies typically employ simple truncation or summarization methods, which suffer from the following problems: First, significant information loss. Simple truncation methods directly discard early data content, potentially leading to the loss of important information. Second, inconsistent summarization quality. Automatically generated summaries may omit crucial details, affecting the coherence of subsequent data processing. Third, lack of fine-grained control. Existing technologies often employ an "all or nothing" compression strategy, failing to selectively compress based on content importance. Fourth, inaccurate resource tracking. The system cannot access the original data after compression, leading to inaccurate resource budget calculations. Fifth, inadequate recovery mechanisms. When compression results in insufficient information, there is a lack of effective recovery paths.

[0003] Therefore, there is an urgent need for a more refined and intelligent data compression solution that can support multi-level progressive compression, achieve fine-grained control, maintain the accuracy of resource tracking, and provide multiple recovery paths.

Summary of the Invention

[0004] To address the shortcomings of existing technologies, this invention provides a multi-level context compression method and system that performs multi-level compression sequentially, implementing a strategy that prioritizes fine-grained compression over coarse-grained summarization.

[0005] The present invention provides the following technical solution.

[0006] 1. Multi-level context compression method.

[0007] This invention provides a multi-level context compression method, characterized by comprising the following six processing steps: Level 0 Semantic Pre-screening: Before the compression process begins, a lightweight model (such as an embedding model or a lightweight language model) is used to evaluate the information entropy and semantic importance of each message. Based on the generated importance score, context-aware dynamic budget allocation is performed to assign differentiated resource quotas to messages of different importance.

[0008] Level 1 Content Budget Control: Enforces a maximum size limit for the aggregated content of each message. If the content exceeds the preset limit, it is truncated or replaced. This content replacement operation is invisible to the cache path to ensure cache consistency.

[0009] Level 2 Fragment Compression: This stage uses a feature-gating mechanism to selectively remove historical content. It can be enabled or disabled via a feature switch; when enabled, earlier message content is deleted while system messages and recent context are retained. The amount of resources released is used to raise the trigger threshold of subsequent compression stages, thereby delaying the triggering of the full summary.

[0010] Level 3 Lightweight Compression: Performs lightweight content compression. For messages near the context window boundaries, cached information is used for delayed processing. The compressed message retains its original structure; only the content is compressed to maintain format compatibility.

[0011] Level 4 context folding: Performed before the complete summary, this stage involves fine-grained, context-selective compression. Incremental folding is supported, gradually increasing the folding range until resource usage meets the requirements. If resource usage after folding is below the summary threshold, the fifth-level complete summary is skipped to reduce computational overhead.
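The incremental folding loop in [0011] can be illustrated with a short sketch. It assumes per-message token counts are already known and that "folding" means replacing a message with a short stub; the names `incremental_fold`, `FOLD_STUB_TOKENS`, `step`, and `keep_recent` are hypothetical and not taken from the patent.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Message:
    role: str      # "system", "user", "assistant", ...
    content: str
    tokens: int    # token count, assumed to be precomputed

FOLD_STUB_TOKENS = 10  # hypothetical token cost of a "[folded]" stub

def incremental_fold(messages: List[Message], target_tokens: int, summary_threshold: int,
                     step: int = 2, keep_recent: int = 4) -> Tuple[List[Message], int, bool]:
    """Fold the oldest non-system messages in growing batches until usage fits target_tokens.

    Returns (messages_after_folding, token_usage, skip_full_summary).
    """
    folded = list(messages)
    usage = sum(m.tokens for m in folded)
    # Foldable candidates: non-system messages outside the recent window, oldest first.
    limit = max(len(folded) - keep_recent, 0)
    candidates = [i for i in range(limit) if folded[i].role != "system"]
    done = 0
    while usage > target_tokens and done < len(candidates):
        # Widen the folding range by `step` more messages on each pass.
        for i in candidates[done:done + step]:
            m = folded[i]
            folded[i] = Message(m.role, "[folded]", min(m.tokens, FOLD_STUB_TOKENS))
        done += step
        usage = sum(m.tokens for m in folded)
    # If folding alone brings usage below the summary threshold, Level 5 is skipped.
    return folded, usage, usage < summary_threshold

msgs = [Message("system", "rules", 100)] + [
    Message("user" if i % 2 == 0 else "assistant", f"turn {i}", 500) for i in range(8)
]
folded, usage, skip_summary = incremental_fold(msgs, target_tokens=2500, summary_threshold=3000)
print(usage, skip_summary)  # 2140 True
```

Growing the range in fixed steps keeps folding as narrow as possible, which matches the stated preference for fine-grained compression over a full summary.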

[0012] Level 5 Full Summary: Triggered when resource usage exceeds a threshold. During compression, the pre-compression context state is captured, and the pre-compression resource usage is deducted from the resource budget, enabling resource tracking across compression boundaries. The tracking flag is then reset to distinguish content rounds before and after compression, and the original content array is replaced with the compressed content.
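A minimal sketch of the Level 5 bookkeeping, assuming messages are simplified to `(content, token_count)` pairs and that the "tracking flag" can be modeled as a rounds-since-summary counter. `BudgetTracker`, `full_summary`, and the `summarize` callable are hypothetical illustrations, not the patent's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Msg = Tuple[str, int]  # hypothetical simplified message: (content, token_count)

@dataclass
class BudgetTracker:
    remaining_budget: int          # tokens still available for the whole task
    rounds_since_summary: int = 0  # tracking flag, reset at each compression boundary
    snapshots: int = 0             # number of pre-compression states captured

def full_summary(messages: List[Msg], tracker: BudgetTracker,
                 summarize: Callable[[List[Msg]], Msg], threshold: int) -> List[Msg]:
    usage = sum(tokens for _, tokens in messages)
    if usage <= threshold:
        return messages                  # Level 5 not triggered
    pre_state = list(messages)           # capture the pre-compression context state
    tracker.snapshots += 1
    tracker.remaining_budget -= usage    # deduct pre-compression usage so it is not lost
    tracker.rounds_since_summary = 0     # later rounds are counted as post-compression
    return [summarize(pre_state)]        # replace the original content array

tracker = BudgetTracker(remaining_budget=200_000, rounds_since_summary=5)
history = [("turn a", 60_000), ("turn b", 50_000)]
new_history = full_summary(history, tracker,
                           lambda ms: (f"summary of {len(ms)} messages", 800),
                           threshold=100_000)
print(new_history, tracker.remaining_budget)
```

The point of deducting before replacement is that, once the context holds only the summary, the tokens spent earlier are still reflected in the remaining budget.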

[0013] As shown in attached Figure 1.

[0014] 2. Optimization based on dialogue state graph.

[0015] In one embodiment of the invention, the method further includes context folding based on a dialogue state graph: Graph Construction: Construct a dialogue state transition graph, representing messages as nodes in the graph and dialogue flow as edges.

[0016] Key Node Identification: Identify key decision nodes (i.e., dialogue branching points and information convergence states). Message intent tags, entity information, and sentiment polarity are extracted; node connections are built based on time windows and semantic similarity; and node centrality indices are computed to identify key decision points.

[0017] Topology folding: Selective folding based on graph topology preserves key decision paths and compresses only redundant paths. A graph traversal algorithm is used to determine the optimal folding range, maximizing information retention.
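The graph-based folding in [0015] to [0017] can be sketched as follows. The sketch assumes semantic-similarity pairs are already computed (e.g. by an embedding model) and swaps in plain degree centrality for the unspecified "node centrality index"; the traversal is simplified to a walk along the dialogue flow. All function names, the window size, and the 0.5 cutoff are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

def build_graph(n: int, similar_pairs: Iterable[Tuple[int, int]],
                window: int = 3) -> Dict[int, Set[int]]:
    """Nodes are message indices. Edges link consecutive turns (dialogue flow) and
    semantically similar messages that lie within `window` turns of each other."""
    adj: Dict[int, Set[int]] = defaultdict(set)
    for i in range(n - 1):
        adj[i].add(i + 1)
        adj[i + 1].add(i)
    for i, j in similar_pairs:
        if i != j and abs(i - j) <= window:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def degree_centrality(adj: Dict[int, Set[int]], n: int) -> Dict[int, float]:
    # Degree centrality: fraction of the other nodes each node is connected to.
    return {v: len(adj[v]) / (n - 1) for v in range(n)}

def fold_ranges(n: int, key_nodes: Set[int]) -> List[Tuple[int, int]]:
    """Walk the dialogue path and collect maximal runs of non-key nodes to fold."""
    ranges: List[Tuple[int, int]] = []
    start = None
    for v in range(n):
        if v in key_nodes:
            if start is not None:
                ranges.append((start, v - 1))
                start = None
        elif start is None:
            start = v
    if start is not None:
        ranges.append((start, n - 1))
    return ranges

n = 8
adj = build_graph(n, similar_pairs=[(1, 3), (1, 4), (3, 5), (0, 6)])  # (0, 6) is outside the window
centrality = degree_centrality(adj, n)
key_nodes = {v for v, c in centrality.items() if c >= 0.5}
print(sorted(key_nodes), fold_ranges(n, key_nodes))  # [1, 3] [(0, 0), (2, 2), (4, 7)]
```

Highly connected nodes (here, messages 1 and 3) survive as key decision points, and only the runs between them are candidates for folding.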

[0018] 3. Reinforcement learning optimization strategies.

[0019] In one embodiment of the present invention, the method further includes a reinforcement learning-optimized compression strategy: Agent Deployment: Deploy a lightweight reinforcement learning agent and dynamically adjust compression parameters at each level based on task completion rate feedback.

[0020] State and Action: Define the state space (current token usage, task type, historical compression success rate, user satisfaction index) and action space (adjust thresholds at all levels, select different digest models, switch compression strategy modes).

[0021] Reward Mechanism: Use a reward function to balance resource saving and information retention, and optimize the success rate of long-term tasks.
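The state space, action space, and reward in [0020] and [0021] can be written down as plain data types plus a weighted reward. This is only a sketch: the patent does not give the reward formula, so the linear form and the weights `w_saving`, `w_retention`, and `w_success` are assumptions, as is the `retention` score (any 0 to 1 measure of retained information).

```python
from dataclasses import dataclass
from enum import Enum

@dataclass
class CompressionState:
    token_usage: float         # current token usage as a fraction of the budget
    task_type: str
    hist_success_rate: float   # historical compression success rate
    user_satisfaction: float   # user satisfaction index

class CompressionAction(Enum):
    ADJUST_THRESHOLDS = "adjust level thresholds"
    SELECT_SUMMARY_MODEL = "select a different summary model"
    SWITCH_STRATEGY_MODE = "switch compression strategy mode"

def compression_reward(tokens_before: int, tokens_after: int, retention: float,
                       task_succeeded: bool, w_saving: float = 0.3,
                       w_retention: float = 0.5, w_success: float = 1.0) -> float:
    """Weighted sum balancing resource saving, information retention and task success."""
    saving = (tokens_before - tokens_after) / tokens_before if tokens_before else 0.0
    return w_saving * saving + w_retention * retention + w_success * float(task_succeeded)

print(round(compression_reward(10_000, 4_000, retention=0.8, task_succeeded=True), 2))  # 1.58
```

Giving task success the largest weight reflects the stated goal of optimizing long-term task success rather than raw token savings.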

[0022] 4. Multi-level context compression system.

[0023] The present invention also provides a multi-level context compression system, characterized in that it includes: Multi-level compression engine: As the core coordination component, it includes a sequence executor (ensuring that multi-level compression is executed in a strict order), a condition checker (checking the trigger conditions of each compression level), a feature gating manager (managing the feature on / off status of each level), and a compression result aggregator.
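The four engine components in [0023] (sequence executor, condition checker, feature gating manager, result aggregator) map naturally onto a small pipeline object. The sketch below is hypothetical: contexts are reduced to a list of per-message token counts, and the three example levels and their limits are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Context = List[int]  # hypothetical: one token count per message

@dataclass
class Level:
    name: str
    should_run: Callable[[Context], bool]  # condition checker for this level
    run: Callable[[Context], Context]      # the compression step itself

@dataclass
class MultiLevelEngine:
    levels: List[Level]                                   # sequence executor order
    gates: Dict[str, bool] = field(default_factory=dict)  # feature gating: name -> enabled

    def compress(self, ctx: Context) -> Tuple[Context, List[Tuple[str, object]]]:
        report: List[Tuple[str, object]] = []  # compression result aggregator
        for level in self.levels:              # levels always run in strict order
            if not self.gates.get(level.name, True):
                report.append((level.name, "disabled"))
                continue
            if not level.should_run(ctx):
                report.append((level.name, "skipped"))
                continue
            before = sum(ctx)
            ctx = level.run(ctx)
            report.append((level.name, before - sum(ctx)))  # tokens released
        return ctx, report

engine = MultiLevelEngine(
    levels=[
        Level("budget", lambda c: any(t > 2000 for t in c), lambda c: [min(t, 2000) for t in c]),
        Level("fragment", lambda c: len(c) > 3, lambda c: c[-3:]),
        Level("summary", lambda c: sum(c) > 5000, lambda c: [800]),
    ],
    gates={"fragment": False},
)
ctx, report = engine.compress([15000, 300, 700, 1200, 400])
print(ctx, report)
```

In this run the per-message cap alone brings usage under the summary threshold, so the coarse summary level is skipped, while the gated-off fragment level is reported as disabled.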

[0024] Semantic pre-screening module: Configured to evaluate the information entropy and semantic importance of messages using a lightweight model, generating a dynamic budget map. It includes an information entropy calculation unit, an importance scoring unit, a dynamic budget allocation unit, and a threshold adjustment unit.

[0025] Content budget control module: Configured to enforce a maximum aggregate content size for each message. It includes a content size calculator, a maximum size comparator, a content processor, and a cache consistency guarantee unit, as shown in attached Figure 2.

[0026] Fragment Compression Module: Configured for selective removal of historical content with feature gating. Includes a feature switch checker, a history message selector, a context retainer, and a threshold adjuster.

[0027] Lightweight Compression Module: Configured for lightweight content compression. Includes a lightweight compressor, boundary message handler, and cache optimizer.

[0028] A comparison of fragment compression and lightweight compression is shown in attached Figure 3.

[0029] Context folding module: Configured for fine-grained, context-selective compression, as shown in attached Figure 4.

[0030] Full summary module: Configured to be triggered when resource usage exceeds a threshold, including pre-compression state capture, summary generation, tracking identifier reset, and content replacement functions.

[0031] Cross-boundary resource tracking module: Configured to deduct pre-compression resource usage from the resource budget when the full summary module executes, achieving resource tracking across compression boundaries. It includes a pre-compression state capture unit, a budget calculator, a continuity guarantee unit, and a system synchronizer. The full summary and resource tracking process is shown in attached Figure 5.

[0032] Dialogue State Graph Construction Module (Optional): Configured to construct a dialogue state transition graph, identify key decision nodes and convergence states. Includes a feature extraction unit, a graph construction unit, a convergence state identification unit, and a key node identification unit.

[0033] This invention also includes a compression strategy optimized by reinforcement learning: deploying a lightweight reinforcement learning agent to dynamically adjust compression parameters at each level based on task completion rate feedback; defining a state space: current token usage, task type, historical compression success rate, and user satisfaction index; defining an action space: adjusting thresholds at each level, selecting different summary models, and switching compression strategy modes; and using a reward function to balance resource saving and information retention to optimize long-term task success rate.

[0034] Reinforcement learning optimization includes: collecting data on the correlation between compression decisions and task outcomes to build a feedback dataset; training a compression policy network using policy gradient methods; enabling online learning by continuously optimizing policy parameters based on real-time task feedback; and supporting multi-objective optimization that simultaneously considers resource efficiency, information integrity, and response latency.
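As an illustration of the policy gradient training mentioned in [0034], here is a minimal REINFORCE update on a tabular softmax policy over discrete compression actions. It is a sketch, not the patent's policy network: the four-action setup, the learning rate, and `simulated_task_feedback` (a stand-in for real task-completion feedback) are all hypothetical.

```python
import math
import random
from typing import List

class SoftmaxPolicy:
    """Tabular softmax policy over discrete compression actions, trained with REINFORCE."""

    def __init__(self, n_actions: int, lr: float = 0.1):
        self.theta = [0.0] * n_actions  # one preference per action
        self.lr = lr

    def probs(self) -> List[float]:
        m = max(self.theta)  # subtract the max for numerical stability
        exps = [math.exp(t - m) for t in self.theta]
        z = sum(exps)
        return [e / z for e in exps]

    def sample(self, rng: random.Random) -> int:
        return rng.choices(range(len(self.theta)), weights=self.probs())[0]

    def update(self, action: int, reward: float, baseline: float = 0.0) -> None:
        # REINFORCE: d/d theta_k of log pi(action) = 1[k == action] - pi(k)
        p = self.probs()
        advantage = reward - baseline
        for k in range(len(self.theta)):
            grad = (1.0 if k == action else 0.0) - p[k]
            self.theta[k] += self.lr * advantage * grad

def simulated_task_feedback(action: int) -> float:
    # Hypothetical stand-in for real task-completion feedback: action 2 works best.
    return 1.0 if action == 2 else 0.0

rng = random.Random(0)
policy = SoftmaxPolicy(n_actions=4)
for _ in range(500):
    a = policy.sample(rng)
    policy.update(a, simulated_task_feedback(a))
print([round(p, 3) for p in policy.probs()])
```

Each update moves probability mass toward actions that led to task success; in an online setting the same `update` call would be driven by real feedback rather than the simulated function.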

[0035] The present invention also includes multiple recovery paths: fold-and-flush recovery, which commits the temporarily stored fold summary; reactive compression recovery, which performs full summary compression; resource limit upgrade recovery, which raises the resource budget limit; and resource recovery, which uses the reserved resource quota.
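The four recovery paths, together with the selection rules given in [0036], can be sketched as a single selection function. The patent does not fix the order in which conditions are checked or what "close to the limit" means, so the check order, the 90% `near_ratio`, and the rule that the reserve is used only when it covers the shortfall are assumptions. The folding-based path is labeled fold-and-flush recovery here.

```python
from enum import Enum

class Recovery(Enum):
    FOLD_FLUSH = "fold-and-flush recovery"
    REACTIVE_COMPRESSION = "reactive compression recovery"
    LIMIT_UPGRADE = "resource limit upgrade recovery"
    RESERVE = "resource recovery"

def select_recovery(usage: int, limit: int, foldable: int, task_required: int,
                    reserve: int, near_ratio: float = 0.9) -> Recovery:
    """Pick a recovery path. The check order below is an assumption, not from the source."""
    # 1. Folding alone would bring usage back below the near-limit line.
    if usage - foldable < near_ratio * limit:
        return Recovery.FOLD_FLUSH
    # 2. Near the limit and folding is not enough, but the task fits within the limit:
    #    run a full summary reactively.
    if task_required <= limit:
        return Recovery.REACTIVE_COMPRESSION
    # 3. The limit itself is too small; use the reserved quota if it covers the gap.
    if reserve >= task_required - limit:
        return Recovery.RESERVE
    # 4. Otherwise raise the resource budget limit.
    return Recovery.LIMIT_UPGRADE

print(select_recovery(95_000, 100_000, foldable=1_000, task_required=110_000, reserve=15_000))
```

Trying the cheapest path (folding) first and escalating only when it cannot help mirrors the invention's preference for fine-grained measures over coarse ones.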

[0036] The recovery path selection logic is as follows: when context folding can release sufficient resources, fold-and-flush recovery is chosen; when resource usage is close to the limit and cannot be resolved by folding, reactive compression recovery is chosen; when the resource limit is insufficient to complete the task, resource limit upgrade recovery is chosen; and when a reserved resource quota is available, resource recovery is chosen. The state machine of the recovery path selection logic is shown in attached Figure 7.

Beneficial Effects

[0037] Compared with existing technologies, this invention has the following advantages: first, multi-level progressive compression achieves fine-grained control, from content budget control to full summarization, with each level optimized for different scenarios; second, fine-grained compression takes precedence over coarse-grained summarization, preserving data details as much as possible and performing full summarization only when necessary; third, resource tracking across compression boundaries ensures accurate resource accounting, so the system correctly calculates resource consumption after compression; fourth, multiple recovery paths provide a flexible recovery mechanism that selects the most suitable recovery strategy for the situation at hand; fifth, feature gating allows each compression level to be enabled or disabled independently, facilitating experimentation and optimization; sixth, deep integration with the resource budget system ensures that compression decisions respect resource budget constraints; seventh, dynamic budget allocation based on semantic importance enables differentiated compression, in which high-importance messages receive more retention resources and low-importance messages are compressed first, maximizing the value of retained information under the same resource constraints; eighth, representing the dialogue as a state graph introduces topological features which, unlike simple time-series compression, make it possible to identify and retain key decision nodes and convergence states, preserving the logical structure of the dialogue; ninth, the reinforcement learning-optimized compression strategy adapts itself, continuously tuning compression parameters based on task completion rate feedback to balance resource efficiency and task success rate; tenth, multi-objective optimization simultaneously considers resource efficiency, information integrity, and response latency, meeting the needs of different application scenarios.

Description of the Attached Figures

[0038] Figure 1: Overall flowchart of the multi-level context compression method, showing the sequential execution flow of the five-level compression strategy, i.e., the cascade of content budget control, fragment compression, lightweight compression, context folding, and full summary. Arrows mark the trigger conditions and jump logic between levels, and the resource tracking paths across compression boundaries are highlighted.

[0039] Figure 2: Diagram of the content budget control module, showing the relationships among the content size calculator, upper limit comparator, and content processor. Sample data illustrate the content truncation / replacement process, and the data flow of the cache consistency guarantee mechanism is marked.

[0040] Figure 3: Side-by-side comparison of fragment compression and lightweight compression. The left column shows selective removal of historical messages (fragment compression); the right column shows lightweight content compression, which preserves the original message structure. The figure also indicates the enabled / disabled state of feature gating.

[0041] Figure 4: Context folding workflow diagram, showing the fine-grained selective compression process, including a step-by-step expansion of incremental folding and the conditional branch that skips the full summary.

[0042] Figure 5: Full summary and resource tracking process. The upper part shows summary generation: trigger condition detection, pre-compression state capture, and content replacement. The lower part shows cross-boundary resource tracking: resource budget calculation before and after compression, and the tracking identifier reset.

[0043] Figure 6: System architecture diagram showing all modules, namely the multi-level compression engine (with its sub-modules), the five core compression modules, the cross-boundary resource tracking module, and the recovery path module, arranged hierarchically to show the data interactions between modules.

[0044] Figure 7: State machine diagram of the recovery path selection logic, showing the trigger conditions of the four recovery paths: fold-and-flush recovery, reactive compression recovery, resource limit upgrade recovery, and resource recovery. State transition arrows carry the selection conditions, including resource usage threshold labels.

Detailed Implementation

[0045] To enable those skilled in the art to better understand the present invention, the present invention will be further described in detail below with reference to a specific embodiment of an AI-assisted programming system for children.

[0046] I. Case Overview.

[0047] This implementation case details the application of a multi-level context compression method to a children's AI-assisted programming system. Based on an intelligent programming tutoring system that supports code generation, debugging guidance, and knowledge point explanation, this system faces the challenge of context length limitations when handling long dialogues in children's programming learning. By applying the six-level progressive compression strategy of this invention, the system achieves the optimization goal of prioritizing fine-grained compression over coarse-grained summarization, effectively managing token resource usage while ensuring learning continuity.

[0048] This case study focuses on an AI-assisted programming system for children, which has the following characteristics: 1. Multi-round learning interaction: Children engage in continuous multi-round learning dialogues with AI tutors, with an average of 40-80 rounds of interaction per session, accumulating 120,000-250,000 tokens.

[0049] 2. Frequent code execution: AI tutors need to frequently call tools for code execution, syntax checking, and error diagnosis. Each tool call may return results containing hundreds to tens of thousands of tokens.

[0050] 3. Sensitive to learning coherence: The coherence of the learning process is highly dependent on the historical context, and the understanding of knowledge points established in early dialogues is crucial for subsequent learning.

[0051] 4. Resource constraints: The system operates under a limited token budget, requiring precise tracking and management of token consumption.

[0052] II. Implementation details of the six-level compression.

[0053] This case strictly follows the six-level compression sequence described in the invention to perform context compression. The following details the specific implementation process, triggering conditions, and execution results of each level of compression.

[0054] 2.0 Level Zero: Semantic Pre-screening.

[0055] 2.0.1 Purpose of implementation.

[0056] Semantic pre-screening uses a lightweight embedding model to evaluate the information entropy and semantic importance of each message before the subsequent compression levels run, providing importance signals that guide them.

[0057] 2.0.2 Specific implementation steps.

[0058] Step 1: Load a lightweight language model. Use a pre-trained language model with a parameter size between 100M and 1B, such as DistilBERT or MiniLM, as the base model for information entropy calculation.

[0059] Step 2: Calculate message information entropy. For each message, compute its information entropy with the model. In Python pseudocode: entropy_value = -sum(prob * math.log(prob) for prob in token_probabilities)

[0060] Where prob represents the predicted probability of each token in the message.

[0061] Step 3: Calculate the overall importance score. This considers the following factors: Message position weight: earlier messages have lower weight, and more recent messages have higher weight; Content type weighting: System messages have the highest weight, followed by children's questions, and code execution results have the lowest weight. Knowledge point relevance: Messages that are relevant to the current learning topic have higher weight.

[0062] In Python pseudocode: composite_score = alpha * entropy_value + beta * position_weight + gamma * type_weight + delta * relevance_score, where alpha, beta, gamma, and delta are weighting coefficients.

[0063] Step 4: Message categorization. Messages are categorized into three types based on their composite score:
High importance (score > 0.7): retain more than 90% of the content;
Medium importance (score 0.3-0.7): retain 50-70% of the content;
Low importance (score < 0.3): retain less than 30% of the content, or compress completely.
Step 5: Dynamic budget allocation. Based on the classification results, allocate compression budget quotas to each message category, with higher-importance messages receiving higher quotas and lower-importance messages receiving lower quotas.
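Steps 2 to 5 of the semantic pre-screening can be combined into one small sketch. It assumes all four score inputs are already normalized to [0, 1]; the equal weights and the single retention ratio picked for each category (0.9, 0.6, 0.3, taken from the edges or middle of the ranges above) are hypothetical choices, and the entropy function simply mirrors the pseudocode in [0059].

```python
import math
from typing import Iterable, List, Tuple

def message_entropy(token_probabilities: Iterable[float]) -> float:
    # Mirrors the [0059] pseudocode: -sum(p * log p) over per-token probabilities.
    return -sum(p * math.log(p) for p in token_probabilities if p > 0)

def composite_score(entropy: float, position_weight: float, type_weight: float,
                    relevance: float, alpha: float = 0.25, beta: float = 0.25,
                    gamma: float = 0.25, delta: float = 0.25) -> float:
    # Inputs are assumed to be pre-normalized to [0, 1]; the weights are hypothetical.
    return alpha * entropy + beta * position_weight + gamma * type_weight + delta * relevance

def categorize(score: float) -> str:
    if score > 0.7:
        return "high"
    if score >= 0.3:
        return "medium"
    return "low"

# Hypothetical retention ratios chosen from the ranges in Step 4.
RETENTION = {"high": 0.9, "medium": 0.6, "low": 0.3}

def allocate_budget(messages: List[Tuple[int, float]]) -> List[int]:
    """messages: (token_count, composite_score) pairs -> per-message token quota."""
    return [round(tokens * RETENTION[categorize(score)]) for tokens, score in messages]

print(allocate_budget([(1000, 0.9), (1000, 0.5), (1000, 0.1)]))  # [900, 600, 300]
```

Three equally sized messages receive very different quotas purely because of their scores, which is the differentiated compression this level aims for.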

[0064] 2.1 Level 1: Budget control of tool results.

[0065] 2.1.1 Purpose of implementation.

[0066] Tool result budget control is the first level of compression, performed before lightweight compression. Its main purpose is to enforce a maximum size limit for the aggregated tool result for each message, preventing excessively large results returned by a single tool call from consuming too much context space. Content replacement operations at this level are not visible to the cached lightweight compression path, ensuring cache consistency.

[0067] 2.1.2 Specific implementation steps.

[0068] Step 1: Iterate through the message list and identify messages that contain tool results. For each message, perform the following operations: First, parse the message structure and extract the tool call result. The tool result may contain the following types of content: code execution output, syntax check results, error diagnosis information, knowledge point query results, code suggestions, etc.

[0069] Step 2: Calculate the aggregated tool result size for each message.

[0070] The calculation, in Python pseudocode, is: tool_result_size = command_output_tokens + file_content_tokens + error_message_tokens + metadata_tokens

[0071] Token counts are calculated using the same tokenizer as the LLM.

[0072] Step 3: Compare the size with the preset upper limits. The configuration in this case is as follows:
Maximum tokens for a single tool result: 2000;
Maximum aggregated tool result tokens per message: 5000;
Threshold for excessively long results: 1000 tokens.
Step 4: Process tool results that exceed the limit. Processing methods include:
Truncation: the beginning and end of the tool result are retained and the middle is replaced by an ellipsis marker. Specifically, the first 500 tokens and the last 500 tokens are kept, and the middle part is replaced with "...[content truncated]...". This lets children see the beginning and end of the result and understand its overall structure.

[0073] Replacement processing: For certain types of tool results (such as large error logs), the original content is completely replaced with the fixed string "[Large output compressed]", while the original size and compression method are recorded in the message metadata.
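The truncation and replacement rules in Steps 3 and 4 can be sketched as follows, using the limits from this case (2000 per result, 5000 per message, 500 head and 500 tail tokens). Tokens are modeled as a list of strings, and two points are assumptions not stated in the text: which result kinds get full replacement (`replace_kinds`), and how the per-message cap is enforced (here, by truncating the largest remaining result first).

```python
from typing import Dict, List, Optional, Tuple

HEAD, TAIL = 500, 500
TRUNCATION_MARKER = "...[content truncated]..."
REPLACEMENT = "[Large output compressed]"

def truncate_middle(tokens: List[str]) -> List[str]:
    """Keep the first HEAD and last TAIL tokens and mark the elided middle."""
    if len(tokens) <= HEAD + TAIL:
        return list(tokens)
    return tokens[:HEAD] + [TRUNCATION_MARKER] + tokens[-TAIL:]

def apply_tool_budget(results: List[Tuple[str, List[str]]], per_result_max: int = 2000,
                      per_message_max: int = 5000, replace_kinds=frozenset({"error_log"})):
    """results: (kind, tokens) pairs for one message. Returns (new_results, metadata)."""
    out: List[Tuple[str, List[str]]] = []
    meta: List[Optional[Dict[str, object]]] = []
    for kind, toks in results:
        if len(toks) <= per_result_max:
            out.append((kind, list(toks)))
            meta.append(None)
        elif kind in replace_kinds:  # e.g. large error logs: replace entirely
            out.append((kind, [REPLACEMENT]))
            meta.append({"original_size": len(toks), "method": "replace"})
        else:
            out.append((kind, truncate_middle(toks)))
            meta.append({"original_size": len(toks), "method": "truncate"})
    # Aggregate cap: truncate the largest still-truncatable result until under the cap.
    while sum(len(t) for _, t in out) > per_message_max:
        candidates = [i for i, (_, t) in enumerate(out) if len(t) > HEAD + TAIL + 1]
        if not candidates:
            break
        i = max(candidates, key=lambda k: len(out[k][1]))
        kind, toks = out[i]
        meta[i] = {"original_size": len(toks), "method": "truncate"}
        out[i] = (kind, truncate_middle(toks))
    return out, meta

big = [f"t{i}" for i in range(15000)]
out, meta = apply_tool_budget([("code_output", big), ("error_log", big[:3000])])
print(len(out[0][1]), out[1][1], meta[0])
```

Recording the original size and method in the metadata, rather than in the content, is what lets later stages account for the removed tokens.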

[0074] 2.1.3 Implementation example.

[0075] Here is a specific example of tool result budget control:
Original messages (JSON):
{ "role": "assistant", "content": null, "tool_invocations": [...] }
{ "role": "tool", "invocation_id": "inv_456", "content": "<Code execution result: 15000 tokens>" }
Post-processing message (JSON):
{ "role": "tool", "invocation_id": "inv_456", "content": "<First 500 tokens>...[Content truncated, original size 15000 tokens]...<Last 500 tokens>" }

[0076] 2.1.4 Implementation results.

[0077] In a typical learning session of this case, the tool result budget control level processed 20 messages, 6 of which triggered truncation, releasing approximately 72,000 tokens. The average execution time of this level was 4 milliseconds, with minimal impact on system response latency.

[0078] 2.2 Second level: Fragment compression.

[0079] 2.2.1 Purpose of implementation.

[0080] Fragment compression is the second level of compression. It aggressively removes historical content under feature gating, releasing tokens that are used to raise the automatic compression threshold. This level runs after tool result budget control and before lightweight compression, and mainly handles early children's messages and AI tutor messages.

[0081] 2.2.2 Feature-gated control.

[0082] Fragment compression is controlled via the SEGMENT_COMPACT feature switch. This feature is enabled by default in this case, but can be dynamically adjusted via configuration files or runtime parameters.

[0083] 2.2.3 Specific implementation steps.

[0084] Step 1: Identify removable historical content. The system applies the following rules: retain system messages, which contain important context such as role settings and learning rules; retain the most recent N rounds of dialogue (N = 8 in this case) to ensure learning continuity; and mark early children's messages and AI tutor messages that fall outside the retention window as removable.

[0085] Step 2: Perform aggressive content removal. For identified early messages, perform the following: Children's message processing: Retain the existence marker of the message but clear the content, replace it with the placeholder "[historical learning questions]", and record the timestamp and summary information of the original message in the metadata.

[0086] AI tutor message processing: the message's existence marker is retained, but its content is cleared and replaced with the placeholder "[Historical tutor replies]". If the AI tutor message contains tool calls, the tool call structure is preserved but the returned results are compressed.

[0087] Step 3: Calculate the number of tokens released. In Python pseudocode: released_tokens = sum(msg.original_tokens for msg in removed_messages); adjustment_factor = 0.8; adjusted_threshold = original_threshold + released_tokens * adjustment_factor

[0088] 2.2.4 Implementation example.

[0089] Assume the current learning session has gone through 30 rounds, with the following system status: system message: 1 message, approximately 180 tokens; children's messages: 15, averaging 120 tokens each; AI tutor messages: 14, averaging 700 tokens each (including tool results); current total: approximately 15,200 tokens.

[0090] Fragment compression execution process: retain the most recent 8 rounds of dialogue (the system message plus 4 child / AI tutor message pairs); identify removable content: 11 early children's messages + 10 early AI tutor messages.

[0091] Number of tokens released: 11 × 120 + 10 × 700 = 1320 + 7000 = 8320 tokens.

[0092] Adjusted threshold: Original 100000 + 8320 × 0.8 = 106656 tokens.
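The worked example in [0089] to [0092] can be reproduced with a short sketch of fragment compression. The message layout (one system message followed by alternating child and tutor turns) and the rule that a placeholder's own token cost is ignored follow the example's arithmetic; `Msg`, `fragment_compact`, and `PLACEHOLDERS` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Msg:
    role: str      # "system", "child", or "tutor"
    tokens: int
    content: str = ""

PLACEHOLDERS = {"child": "[historical learning questions]", "tutor": "[Historical tutor replies]"}

def fragment_compact(messages: List[Msg], keep_recent: int = 8, original_threshold: int = 100_000,
                     adjustment_factor: float = 0.8) -> Tuple[List[Msg], int, float]:
    """Clear early non-system messages; keep system messages and the last `keep_recent`."""
    dialogue = [i for i, m in enumerate(messages) if m.role != "system"]
    removable = set(dialogue[:max(len(dialogue) - keep_recent, 0)])
    released_tokens = sum(messages[i].tokens for i in removable)
    adjusted_threshold = original_threshold + released_tokens * adjustment_factor
    compacted = [
        # Keep a placeholder marker; its own token cost is ignored, as in the example.
        Msg(m.role, 0, PLACEHOLDERS[m.role]) if i in removable else m
        for i, m in enumerate(messages)
    ]
    return compacted, released_tokens, adjusted_threshold

# 1 system message, then 15 child messages (120 tokens) alternating with 14 tutor messages (700 tokens).
session = [Msg("system", 180)]
for turn in range(29):
    session.append(Msg("child", 120) if turn % 2 == 0 else Msg("tutor", 700))
compacted, released, threshold = fragment_compact(session)
print(released, threshold)  # 8320 106656.0
```

Only 80% of the released tokens are credited to the threshold, leaving some headroom rather than spending every freed token before the next compression check.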

[0093] 2.2.5 Implementation results.

[0094] In this case, fragment compression releases an average of 7,000-10,000 tokens each time, effectively increasing the trigger threshold for automatic compression by 7%-10% and significantly delaying the triggering of full summary compression.

[0095] 2.3 Third level: Lightweight compression.

[0096] 2.3.1 Purpose of implementation.

[0097] The third level of compression performs lightweight tool result compression. This level supports delayed boundary message handling; for messages approaching the context window boundary, it uses the cache_deleted_input_tokens information reported by the API.

[0098] 2.3.2 Lightweight compression strategy.

[0099] Micro-compression employs a lightweight strategy that compresses only the tool-result portion of a message while preserving the message's complete structure. The advantages of this strategy are: the compressed message retains its original format, which simplifies subsequent processing; the compression granularity is fine, so different compression methods can be chosen according to content type; and compression and decompression overhead is low, minimizing the impact on system performance.

[0100] 2.3.3 Specific implementation steps.

[0101] Step 1: Identify the messages that require micro-compression. The identification criteria are: the message contains a tool call result; the tool result was not already processed at the tool result budget level; and the message lies within a specified range of the context window. Step 2: Apply a lightweight compression algorithm. The compression algorithms implemented in this case include the following. First-and-last token retention algorithm: for structured data (such as JSON or XML), retain the first 200 tokens and the last 200 tokens, and replace the middle part with "...[data structure has been compressed]...".

[0102] Line sampling algorithm: for line-based text (such as code or logs), retain the first 20 lines and the last 20 lines, sample the middle lines at a fixed interval, and use "...[sampled N lines omitted]..." to indicate the number of skipped lines.

[0103] Summary replacement algorithm: For natural language text, a pre-trained small summary model is used to generate a summary of no more than 100 tokens, and the original text is completely replaced with the summary.
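The first two algorithms can be sketched as plain string transforms. This assumes tokens are approximated by whitespace-separated words and that the fixed sampling interval is every 10th middle line (both assumptions); summary replacement is omitted because it depends on a trained summary model.

```python
def keep_head_tail_tokens(text: str, n: int = 200) -> str:
    """First/last token retention for structured data (whitespace tokens)."""
    tokens = text.split()
    if len(tokens) <= 2 * n:
        return text  # short enough: leave unchanged
    marker = "...[data structure has been compressed]..."
    return " ".join(tokens[:n] + [marker] + tokens[-n:])

def sample_lines(text: str, head: int = 20, tail: int = 20, every: int = 10) -> str:
    """Line sampling: keep head and tail, sample the middle at a fixed interval."""
    lines = text.splitlines()
    if len(lines) <= head + tail:
        return text
    middle = lines[head:len(lines) - tail]
    sampled = middle[::every]                  # every `every`-th middle line
    omitted = len(middle) - len(sampled)
    marker = f"...[sampled {omitted} lines omitted]..."
    return "\n".join(lines[:head] + [marker] + sampled + lines[-tail:])
```

For a 100-line log, the output keeps lines 0-19 and 80-99, samples 6 of the 60 middle lines, and reports 54 omitted lines.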

[0104] Step 3: Handle delayed compression of boundary messages. For messages near the context window boundary:
```python
def handle_boundary_message(message, cache_info):
    if cache_info.deleted_tokens > 0:
        # Skip tokens that have already been deleted from the cache
        return skip_compression(message, cache_info.deleted_tokens)
    elif cache_info.partial_cached:
        # Compress only the parts that are not cached
        return compress_uncached_portion(message, cache_info.uncached_range)
```

[0105] 2.3.4 Cache optimization.

[0106] The CACHED_MICROCOMPACT feature controls whether cache optimization is used. When enabled, the system will: cache already compressed message content to avoid duplicate compression; manage the cache using an LRU eviction policy; and return the compressed result directly when a cache hit occurs, reducing execution time to less than 1 millisecond.
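One way to realize the cache described above is an `OrderedDict`-based LRU keyed by a hash of the original content. This is a sketch under assumptions: the class name, SHA-256 keying, and capacity are illustrative, and gating by the CACHED_MICROCOMPACT flag would simply bypass this cache when the flag is off.

```python
import hashlib
from collections import OrderedDict

class CompressionCache:
    """LRU cache of compressed content, keyed by a hash of the original content."""

    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(content: str) -> str:
        return hashlib.sha256(content.encode("utf-8")).hexdigest()

    def get_or_compress(self, content: str, compress) -> str:
        key = self._key(content)
        if key in self._store:
            self.hits += 1
            self._store.move_to_end(key)      # mark as most recently used
            return self._store[key]
        self.misses += 1
        result = compress(content)            # only compress on a miss
        self._store[key] = result
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)   # evict the least recently used entry
        return result
```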

[0107] 2.3.5 Implementation results.

[0108] In this case, the micro-compression level releases an average of 2,500-4,500 tokens per compression, with an execution time of 8-18 milliseconds and a cache hit rate of approximately 62%, significantly improving compression efficiency.

[0109] 2.4 Level 4: Context folding.

[0110] 2.4.1 Purpose of implementation.

[0111] Context folding is the fourth level of compression, performed before automatic compression. This level performs feature-gated, fine-grained context compression, selectively folding parts of the message rather than all of it. Automatic compression is skipped if the number of tokens after folding is below the automatic compression threshold.

[0112] 2.4.2 Context folding based on dialogue state graph.

[0113] This case study employs a context folding method based on dialogue state graphs to convert linear dialogue sequences into graph structures and perform selective compression based on topological relationships.

[0114] The implementation steps are as follows. Step 1: Node feature extraction. For each message, extract the following features as node attributes: learning intent labels (an intent recognition model extracts the child's learning intent, such as "requesting explanation", "helping with code", or "confirming understanding"); key knowledge points (key programming concepts in the message, such as variables, loops, and functions); and comprehension level (the child's comprehension status, classified as mastered, learning, or not understood). Step 2: Graph construction. Edges are constructed according to the following rules: time edges link adjacent messages in sequence; semantic edges link messages whose semantic similarity exceeds a threshold (e.g., 0.8); and reference edges link a message to earlier text that it explicitly references. Step 3: Convergence state identification. When multiple dialogue paths are detected to have reached semantically similar states, those states are marked as convergence states:
```python
convergence_threshold = 0.85
convergence_detected = semantic_similarity(state_i, state_j) > convergence_threshold
```

[0115] Step 4: Key node identification. Key learning nodes are identified using node centrality metrics: degree centrality (nodes whose number of connections exceeds the average), betweenness centrality (nodes that lie on many shortest paths), and eigenvector centrality (nodes connected to other important nodes). Step 5: Selective folding. Folding decisions are based on the graph structure: all key learning nodes and their neighborhoods are retained; convergence nodes (where multiple paths meet) are preserved; redundant paths (paths consisting of non-critical nodes) are compressed; and a graph traversal algorithm (such as DFS) determines the optimal folding range.
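A simplified sketch of Steps 2, 4, and 5 in plain Python. It assumes each message already carries a hypothetical `embedding` vector (non-zero) and optional `refs` list of earlier message indices; it uses degree centrality only (betweenness, eigenvector centrality, and convergence-state preservation are omitted) and finds foldable ranges with a chronological scan rather than DFS.

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def build_dialogue_graph(messages, sim_threshold=0.8):
    """Undirected adjacency sets with time, semantic, and reference edges."""
    adj = {i: set() for i in range(len(messages))}

    def link(i, j):
        adj[i].add(j)
        adj[j].add(i)

    for i in range(len(messages) - 1):                   # time edges
        link(i, i + 1)
    for i, j in combinations(range(len(messages)), 2):   # semantic edges
        if cosine(messages[i]["embedding"], messages[j]["embedding"]) > sim_threshold:
            link(i, j)
    for i, msg in enumerate(messages):                   # explicit reference edges
        for j in msg.get("refs", []):
            link(i, j)
    return adj

def key_nodes(adj):
    """Degree centrality only: nodes with more connections than the average."""
    avg = sum(len(n) for n in adj.values()) / len(adj)
    return {i for i, n in adj.items() if len(n) > avg}

def foldable_ranges(adj, keep):
    """Contiguous runs of nodes outside the key nodes and their neighbourhoods."""
    protected = set(keep)
    for k in keep:
        protected |= adj[k]
    ranges, run = [], []
    for i in sorted(adj):
        if i in protected:
            if run:
                ranges.append((run[0], run[-1]))
            run = []
        else:
            run.append(i)
    if run:
        ranges.append((run[0], run[-1]))
    return ranges
```

For example, in an 8-message chain where message 6 is semantically similar to (and references) message 1, nodes 1 and 6 become key nodes and only messages 3-4 are foldable.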

[0116] 2.4.3 Incremental folding strategy.

[0117] Context folding supports an incremental strategy that gradually widens the folding range until the condition is met. Initial attempt: fold dialogue rounds 3-5 and generate a folded summary containing the core information of those rounds, such as learning intentions, key knowledge points, and important conclusions.

[0118] Check the condition: calculate the total number of tokens after folding. If it is below the automatic compression threshold (100,000 tokens in this case), stop folding and skip the automatic compression level.

[0119] Expand the fold: if the total is still above the threshold, widen the folding range to rounds 3-7, regenerate the summary, and check the condition again. Repeat this process until the condition is met or the maximum folding range is reached.
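The incremental loop can be sketched as follows, assuming per-round token counts are known and that a folded range is replaced by one summary of a fixed, assumed size (300 tokens here); the widening step of 2 rounds matches the 3-5 then 3-7 progression above.

```python
def incremental_fold(round_tokens, threshold, start=3, first_end=5, step=2,
                     max_end=None, summary_tokens=300):
    """Widen the folded range [start, end] until the total drops below threshold.

    round_tokens maps round number -> tokens. Returns
    ((start, end), total_tokens, below_threshold).
    """
    max_end = max_end if max_end is not None else max(round_tokens)
    end = min(first_end, max_end)
    total_before = sum(round_tokens.values())
    while True:
        folded = sum(t for r, t in round_tokens.items() if start <= r <= end)
        total = total_before - folded + summary_tokens  # folded rounds -> one summary
        if total < threshold or end >= max_end:
            return (start, end), total, total < threshold
        end = min(end + step, max_end)                   # expand the fold
```

With ten rounds of 15,000 tokens and a 100,000-token threshold, folding rounds 3-5 leaves 105,300 tokens, so the fold widens to rounds 3-7 and stops at 75,300 tokens.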

[0120] 2.4.4 Folded summary format.

[0121]
```
[Learning Summary]
Round range: Round X to Round Y
Learning Intent: [Extracted Summary of Learning Intent]
Key knowledge points: [Important knowledge points in the dialogue]
Key conclusions: [Understandings or conclusions reached]
Learning Status: [Learning context information that needs to be retained]
```

[0122] 2.4.5 Folding and emptying recovery mechanism.

[0123] Context folding supports recovery of folded content. When the system detects that folded content needs to be recovered, it can re-submit the temporarily stored folded summary or regenerate detailed content from that summary.

[0124] 2.4.6 Implementation results.

[0125] In this case, the context folding level turned approximately 58% of potential automatic compression operations into no-ops (automatic compression did not need to run), releasing an average of 13,000-22,000 tokens per fold while preserving the key learning context.

[0126] 2.5 Level 5: Automatic compression.

[0127] 2.5.1 Purpose of implementation.

[0128] Automatic compression is the fifth and final level of compression. It is triggered when the number of tokens exceeds a threshold, executing a complete learning session summary.

[0129] 2.5.2 Trigger condition check.

[0130] Automatic compression is triggered by the following conditions: the current number of tokens exceeds the preset threshold (100,000 tokens in this case); the number of tokens still exceeds the threshold after the preceding five levels of compression (Levels 0-4); system resource utilization has exceeded the warning threshold.

2.5.3 Specific implementation steps.

[0131] Step 1: Capture the pre-compact context window. Before performing compression, the system records:
```python
pre_compact_token_count = calculate_total_tokens(current_messages)
pre_compact_message_count = len(current_messages)
pre_compact_timestamp = get_current_time()
context_snapshot = create_shallow_copy(current_messages)
```

[0132] Step 2: Subtract the pre-compact value from taskBudgetRemaining. This is crucial for budget tracking across compression boundaries:
```python
# Ensure the server can still accurately calculate budget consumption after compression.
task_budget_remaining -= pre_compact_token_count
```

[0133] Step 3: Generate a complete learning session summary. The summary is generated with an LLM as follows. Build the summary prompt: take all current messages as input and append summary-generation instructions, which require the summary to include the learning topic, key questions, solutions, outstanding issues, and the child's learning preferences.

[0134] Generate the summary: call the LLM to generate the summary, with a target length of 2,000-5,000 tokens.

[0135] Verify summary quality: check whether the summary contains the key information. If it does not meet the standard, regenerate it or retain the original messages.
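A sketch of the prompt-build, generate, and verify loop. The `generate` callable stands in for the LLM call, the quality check is a crude keyword-and-length test (an assumption, as is whitespace token counting), and returning `None` signals "retain the original messages".

```python
REQUIRED_SECTIONS = ["learning topic", "key questions", "solutions",
                     "outstanding issues", "learning preferences"]

def build_summary_prompt(messages):
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    instructions = ("Summarize this learning session. Include: "
                    + "; ".join(REQUIRED_SECTIONS) + ".")
    return transcript + "\n\n" + instructions

def summary_ok(summary, min_tokens=2000, max_tokens=5000):
    n = len(summary.split())  # whitespace approximation of the token count
    text = summary.lower()
    return min_tokens <= n <= max_tokens and all(s in text for s in REQUIRED_SECTIONS)

def summarize_with_check(messages, generate, max_attempts=2):
    """Return a verified summary, or None to keep the original messages."""
    prompt = build_summary_prompt(messages)
    for _ in range(max_attempts):
        summary = generate(prompt)
        if summary_ok(summary):
            return summary
    return None
```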

[0136] Step 4: Reset tracking with a new turn ID to distinguish learning rounds before and after compression:
```python
new_turn_id = generate_new_turn_id()
compressed_message = {
    "role": "system",
    "content": "[Learning Summary] " + summary_text,
    "metadata": {
        "compressed_from": str(pre_compact_message_count) + " messages",
        "original_tokens": pre_compact_token_count,
        "compressed_at": pre_compact_timestamp,
        "turn_id": new_turn_id,
    },
}
```

[0137] Step 5: Post-compact message replacement. Replace the original message array with the summary message: retain system messages (persona settings, etc.); add the summary message as a new system message; retain the most recent 3-5 rounds of dialogue to ensure continuity; and discard the remaining original messages.
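The replacement step can be sketched as a pure function over the message list. It assumes a "round" is one child message plus one tutor reply, so keeping the last `keep_rounds` rounds means keeping the last `2 * keep_rounds` non-system messages; the function name and role names are hypothetical.

```python
def replace_after_compact(messages, summary_text, compressed_meta, keep_rounds=4):
    """Rebuild the message array: system messages, the summary, then recent rounds."""
    system_msgs = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]
    recent = dialogue[-2 * keep_rounds:] if keep_rounds > 0 else []
    summary_msg = {"role": "system",
                   "content": "[Learning Summary] " + summary_text,
                   "metadata": compressed_meta}
    # Everything between the system messages and the recent rounds is discarded.
    return system_msgs + [summary_msg] + recent
```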

[0138] 2.5.4 Compression event log.

Each automatic compression records detailed event information:
```json
{
  "event_type": "auto_compact",
  "timestamp": "2024-01-15T10:30:00Z",
  "pre_compact_tokens": 125000,
  "post_compact_tokens": 4500,
  "compression_ratio": "96.4%",
  "messages_removed": 87,
  "turn_id": "turn_20240115_103000",
  "recovery_available": true
}
```
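A sketch of how such a record could be produced. It assumes `compression_ratio` means the fraction of tokens removed, which matches the example: (125,000 - 4,500) / 125,000 = 96.4%. The function name is hypothetical.

```python
from datetime import datetime, timezone

def make_compact_event(pre_tokens, post_tokens, messages_removed, turn_id,
                       recovery_available=True, when=None):
    when = when or datetime.now(timezone.utc)
    ratio = (pre_tokens - post_tokens) / pre_tokens  # share of tokens removed
    return {
        "event_type": "auto_compact",
        "timestamp": when.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "pre_compact_tokens": pre_tokens,
        "post_compact_tokens": post_tokens,
        "compression_ratio": f"{ratio:.1%}",
        "messages_removed": messages_removed,
        "turn_id": turn_id,
        "recovery_available": recovery_available,
    }
```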

[0139] 2.5.5 Implementation results.

In this case, the automatic compression level achieved an average compression rate of over 94%, compressing contexts of 100,000-140,000 tokens to 3,500-5,500 tokens while retaining the core information of the learning session in the summary.

[0140] 2.6 Level 6: Reinforcement Learning Optimization.

[0141] 2.6.1 Purpose of implementation.

[0142] Reinforcement learning optimization is an advanced strategy that dynamically adjusts compression parameters based on task completion rate feedback to continuously optimize compression performance.

[0143] 2.6.2 Specific implementation steps.

[0144] Step 1: Define the state space. The state vector contains the following dimensions: current token usage (normalized to 0-1); learning-type encoding (code practice / concept learning / project practice / Q&A, etc.); historical compression success rate (average success rate over the past 10 compression attempts); and child satisfaction index (based on feedback ratings).

[0145] Step 2: Define the action space. Executable actions include: Adjust the thresholds at each level: ±10%, ±20%, unchanged; Choose the summary model: Lightweight model / Standard model / High-quality model; Switch compression strategies: Conservative mode / Balanced mode / Aggressive mode.
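A sketch of the state encoding and a joint action space. The learning-type labels, the normalization of satisfaction by the 5-point scale, and treating each action as one combination of the three choices (5 × 3 × 3 = 45 actions) are all assumptions.

```python
from itertools import product

LEARNING_TYPES = ["code_practice", "concept_learning", "project_practice", "qa"]
THRESHOLD_ADJUSTMENTS = [-0.2, -0.1, 0.0, 0.1, 0.2]
SUMMARY_MODELS = ["lightweight", "standard", "high_quality"]
STRATEGY_MODES = ["conservative", "balanced", "aggressive"]

def encode_state(tokens_used, token_limit, learning_type, recent_outcomes, satisfaction):
    """State vector: usage in [0, 1], one-hot learning type, success rate, satisfaction."""
    usage = min(tokens_used / token_limit, 1.0)
    one_hot = [1.0 if t == learning_type else 0.0 for t in LEARNING_TYPES]
    last10 = recent_outcomes[-10:]  # past 10 compression attempts (1 = success)
    success_rate = sum(last10) / len(last10) if last10 else 0.0
    return [usage] + one_hot + [success_rate, satisfaction / 5.0]

# Each action is one (threshold change, summary model, strategy mode) combination.
ACTIONS = list(product(THRESHOLD_ADJUSTMENTS, SUMMARY_MODELS, STRATEGY_MODES))
```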

[0146] Step 3: Define the reward function. The reward balances multiple objectives:
```python
reward = (w1 * resource_saving_rate
          + w2 * info_retention_rate
          + w3 * task_success_rate
          - w4 * response_latency)
```

[0147] w1, w2, w3, and w4 are weight parameters that can be adjusted according to the application scenario.

[0148] Step 4: Policy network training. Train the policy network with the PPO (Proximal Policy Optimization) algorithm: network structure: input layer (state dimension) → hidden layer (128 units) → output layer (action probabilities); training data: collected records pairing compression decisions with task outcomes; update frequency: the policy is updated once every 100 episodes.
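The core of a PPO update is the clipped surrogate objective. The sketch below computes it for a batch of decisions, given probability ratios and advantage estimates; it is not the full training loop, and the clip range `eps=0.2` is an assumed, commonly used default.

```python
def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Mean clipped surrogate L = E[min(r * A, clip(r, 1 - eps, 1 + eps) * A)].

    ratios: pi_new(a|s) / pi_old(a|s) for each sampled compression decision.
    advantages: advantage estimates for the same decisions.
    The policy is updated to maximize this value (minimize its negative).
    """
    terms = []
    for r, a in zip(ratios, advantages):
        clipped = max(1.0 - eps, min(r, 1.0 + eps))
        terms.append(min(r * a, clipped * a))  # pessimistic of the two estimates
    return sum(terms) / len(terms)
```

The clipping stops a single update from moving the policy far from the one that collected the data, which matters when each episode is an expensive learning session.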

[0149] Step 5: Online learning. The system continuously collects feedback and optimizes its policy: decision parameters and task results are recorded after each compression; the policy network is fine-tuned periodically (e.g., daily) on the new data; and A/B testing is supported to compare the effects of different policies.

[0150] III. Implementation of budget tracking across compression boundaries.

[0151] 3.1 Tracking mechanism design: Budget tracking across compression boundaries is one of the key innovations of this invention. Its core objective is to keep budget tracking continuous before and after compression, so that the server can accurately calculate token consumption after compression.

[0152] 3.2 Specific implementation steps: Step 1: Capture the number of pre-compact context window tokens before active automatic compression:
```python
def capture_pre_compact_window():
    token_count = calculate_total_tokens(current_messages)
    pre_compact_record = {
        "count": token_count,
        "timestamp": get_current_time(),
        "message_snapshot": create_shallow_copy(current_messages),
    }
    return token_count
```

[0153] Step 2: Subtract the pre-compact value from taskBudgetRemaining:
```python
def deduct_pre_compact_budget(token_count):
    # Key: deduct the pre-compact token count before compression.
    global task_budget_remaining
    task_budget_remaining -= token_count
    # Record the deduction operation
    budget_log.append({
        "operation": "pre_compact_deduction",
        "amount": token_count,
        "remaining": task_budget_remaining,
        "timestamp": get_current_time(),
    })
```

[0154] Step 3: The server calculates the post-compact cost after compression.
```python
# After compression, the server only sees the post-compact messages.
post_compact_tokens = 4000                      # calculated by the server
task_budget_remaining -= post_compact_tokens    # deducts 4,000

# Final budget calculation:
#   Initial budget:          200,000
#   Pre-compact deduction:  -125,000
#   Post-compact deduction:   -4,000
#   Remaining budget:         71,000 (exact)
# Comparison, without the pre-compact deduction:
#   the server would calculate 125000 + 4000 = 129000 (incorrect).
```

[0155] 3.3 Continuity guarantee.

[0156] The key to maintaining budget continuity across compression boundaries lies in the fact that pre-compact deductions occur before compression, while post-compact deductions occur after compression; these two events are separate in time but logically continuous. The server always sees a consistent budget state without being aware of the compression operation.
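The arithmetic of Step 3 can be checked with a small, illustrative tracker; the class and method names are hypothetical. Without the pre-compact deduction, only the 4,000 post-compact tokens are charged, so the remaining budget is overstated by the 125,000 tokens that were summarized away.

```python
class BudgetTracker:
    """Tracks taskBudgetRemaining across a compression boundary (illustrative)."""

    def __init__(self, initial_budget):
        self.remaining = initial_budget
        self.log = []

    def _deduct(self, operation, amount):
        self.remaining -= amount
        self.log.append({"operation": operation, "amount": amount,
                         "remaining": self.remaining})

    def on_before_compact(self, pre_compact_tokens):
        # Client side: charge the context that is about to be summarized away.
        self._deduct("pre_compact_deduction", pre_compact_tokens)

    def on_server_usage(self, post_compact_tokens):
        # Server side: it only ever sees the post-compact messages.
        self._deduct("post_compact_deduction", post_compact_tokens)

with_pre = BudgetTracker(200_000)
with_pre.on_before_compact(125_000)
with_pre.on_server_usage(4_000)

without_pre = BudgetTracker(200_000)
without_pre.on_server_usage(4_000)
print(with_pre.remaining, without_pre.remaining)  # 71000 196000
```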

[0157] 3.4 Verification of implementation effects.

[0158] In this case, the accuracy of budget tracking across compression boundaries was verified through comparative testing. The test scenario included 100 consecutive learning sessions, with an average of 2.1 automatic compressions triggered per session. The results show that after adopting the tracking mechanism of this invention, the budget calculation error was reduced from an average of 8.2% to less than 0.18%, fully meeting the accuracy requirements of the production environment.

[0159] IV. Implementation of multiple recovery paths.

[0160] 4.1 Overview of recovery path.

[0161] When compression results in insufficient information, the system selects the optimal recovery path based on the specific circumstances. This case study implements four recovery paths, covering different recovery scenarios.

[0162] 4.2 Folding and emptying restoration.

[0163] 4.2.1 Triggering conditions.

[0164] Folding and emptying restoration is selected when context folding can release enough tokens: the number of tokens after folding is below the automatic compression threshold; a temporarily stored folded summary exists; the child or the system requests that the folded content be restored.

[0165] 4.2.2 Implementation steps.

[0166] Step 1: Submit the temporarily stored folded summary. The system re-injects the previously generated folded summary into the learning session context.

[0167] Step 2: Reconstruct the context of the folded region. Based on the folded summary, the system can choose: Use the summary content directly (quick recovery); Regenerate detailed content based on the summary (complete recovery); Retrieve the original message from persistent storage (precise recovery).

[0168] 4.2.3 Implementation example.

[0169] Before restoration: 42 messages (including folded placeholders); 95,000 tokens; folded area: rounds 10-20 (folded as placeholders).
Folding and emptying restoration: 1. extract the folded summary of rounds 10-20; 2. replace the placeholders with the summary content; 3. update the token count.
After restoration: 42 messages (including summaries); 84,500 tokens; key information has been recovered.

4.3 Reactive compression recovery.

[0170] 4.3.1 Triggering conditions.

[0171] When token usage is detected to be nearing its limit and cannot be resolved through folding, reactive compression recovery is selected: The token usage rate exceeds 95%; Context folding has been performed, but the effect is insufficient; The new information is expected to cause a budget overrun.

[0172] 4.3.2 Implementation steps.

[0173] Step 1: Perform full summary compression. Unlike active automatic compression, reactive compression is triggered passively:
```python
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=1)

def check_reactive_compact_trigger():
    if current_tokens > budget_threshold * 0.95 and not can_fold_more():
        trigger_reactive_compact()

def trigger_reactive_compact():
    # 1. Save the current state immediately
    emergency_snapshot = capture_context()
    # 2. Quickly generate a summary (using a simplified prompt)
    quick_summary = generate_quick_summary(current_messages)
    # 3. Immediately replace the message array
    replace_messages_with_summary(quick_summary)
    # 4. Generate the complete summary asynchronously and swap it in when done
    future = _executor.submit(generate_full_summary, emergency_snapshot)
    future.add_done_callback(lambda f: update_summary(f.result()))
```

[0174] 4.3.3 Rapid summarization strategy.

[0175] Reactive compression uses a fast summarization strategy that prioritizes response speed: it uses a lightweight summarization model rather than the full LLM; the summary is limited to 1,000 tokens; the most recent 5 rounds of dialogue are retained uncompressed; and the full summary is generated asynchronously in the background and then swapped in.
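The quick-then-full pattern can be demonstrated with a thread pool. In this sketch the two summarizer callables are placeholders for the lightweight model and the full LLM, and using `ThreadPoolExecutor` for the background job is an assumption. The full summary is swapped in inside the background task itself, so the returned `Future` resolves only after the replacement.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class ReactiveCompactor:
    """Quick summary now, full summary swapped in when the background job finishes."""

    def __init__(self, quick_summarize, full_summarize):
        self.quick_summarize = quick_summarize
        self.full_summarize = full_summarize
        self.summary = None
        self._lock = threading.Lock()
        self._executor = ThreadPoolExecutor(max_workers=1)

    def _set_summary(self, text):
        with self._lock:
            self.summary = text

    def _full_then_swap(self, snapshot):
        # Runs in the background thread: build the full summary, then replace.
        text = self.full_summarize(snapshot)
        self._set_summary(text)
        return text

    def compact(self, messages):
        snapshot = list(messages)                          # 1. snapshot the state
        self._set_summary(self.quick_summarize(snapshot))  # 2-3. quick summary in use now
        # 4. full summary generated asynchronously
        return self._executor.submit(self._full_then_swap, snapshot)
```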

[0176] 4.4 Token limit upgrade recovery.

[0177] 4.4.1 Triggering conditions.

[0178] If the token limit is insufficient to complete the task, token limit upgrade recovery is selected: the current learning task involves a large amount of code (such as analyzing a large project); token consumption is expected to exceed the current limit; the child or the system explicitly requests a budget increase.

[0179] 4.4.2 Implementation steps.

[0180] Step 1: Assess upgrade needs. The system analyzes the current task and the remaining budget:
```python
def evaluate_upgrade_need():
    estimated_need = estimate_token_need(current_task)
    current_budget = get_current_budget()
    remaining_budget = get_remaining_budget()
    if estimated_need > remaining_budget:
        # Shortfall plus a safety buffer
        upgrade_amount = estimated_need - remaining_budget + buffer_size
        return {"need_upgrade": True, "amount": upgrade_amount}
    return {"need_upgrade": False}
```

[0181] Step 2: Execute the cap upgrade. The token budget is increased according to the upgrade strategy: temporary upgrade: the budget is raised for the current learning task only and restored after the task ends; permanent upgrade: the child's token budget cap is raised permanently; tiered upgrade: the upgrade amount is adjusted according to the learning level.
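A sketch combining the Step 1 shortfall calculation with tiered upgrades, using the per-tier caps of 400,000, 1,200,000, and 6,000,000 tokens given in the upgrade strategy of this case. The tier keys, the name of the middle tier, and the 10,000-token buffer are assumptions.

```python
UPGRADE_CAPS = {
    "beginner": 400_000,
    "intermediate": 1_200_000,   # middle-tier name is assumed
    "advanced": 6_000_000,
}

def plan_upgrade(estimated_need, remaining_budget, level, buffer_size=10_000):
    """Return the token amount to add, capped by the learner's tier (0 if none needed)."""
    if estimated_need <= remaining_budget:
        return 0
    wanted = estimated_need - remaining_budget + buffer_size  # shortfall + buffer
    return min(wanted, UPGRADE_CAPS[level])
```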

[0182] 4.4.3 Upgrade Strategy.

[0183] The upgrade strategy adopted in this case is: beginner learners: up to 400,000 tokens per upgrade; intermediate learners: up to 1,200,000 tokens per upgrade; advanced learners: up to 6,000,000 tokens per upgrade.

4.5 Token recovery.

[0184] 4.5.1 Triggering conditions.

[0185] When reserved token quota is detected to be available, token recovery is selected: the child has purchased additional tokens; the system has reserved an emergency token pool; other sessions have released reclaimable tokens.

[0186] 4.5.2 Implementation steps.

[0187] Step 1: Check the available recovery quota. The system queries the available token sources (pseudocode):
```python
def check_recoverable_tokens():
    sources = []
    # Check the extra tokens purchased by the child
    purchased_tokens = get_purchased_token_balance(child_id)
    if purchased_tokens > 0:
        sources.append({"type": "purchased", "amount": purchased_tokens})
    # Check the system emergency pool
    emergency_tokens = get_emergency_pool_balance()
    if emergency_tokens > 0:
        sources.append({"type": "emergency", "amount": emergency_tokens})
    # Check recyclable tokens
    recyclable_tokens = calculate_recyclable_tokens()
    if recyclable_tokens > 0:
        sources.append({"type": "recyclable", "amount": recyclable_tokens})
    return sources
```

[0188] Step 2: Restore using the reserved quota. Available quota is consumed in priority order: first recyclable tokens (no cost), then emergency pool tokens (system cost), and finally purchased tokens (user cost).
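A sketch of the priority-ordered draw, consuming the source list in the format returned by `check_recoverable_tokens()` above. The function name and the (allocations, shortfall) return shape are assumptions.

```python
PRIORITY = ["recyclable", "emergency", "purchased"]  # cheapest source first

def allocate_recovery(sources, need):
    """Draw `need` tokens from sources in priority order.

    Returns (allocations, shortfall); shortfall > 0 means the sources ran out.
    """
    available = {s["type"]: s["amount"] for s in sources}
    allocations, remaining = [], need
    for kind in PRIORITY:
        if remaining <= 0:
            break
        take = min(available.get(kind, 0), remaining)
        if take > 0:
            allocations.append({"type": kind, "amount": take})
            remaining -= take
    return allocations, max(remaining, 0)
```

For example, a 45,000-token need against 10,000 recyclable, 20,000 emergency, and 50,000 purchased tokens takes all recyclable and emergency tokens and only 15,000 purchased ones.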

[0189] 4.6 Recovery path selection logic.

[0190] The system automatically selects the optimal recovery path based on the current and target states (pseudocode):
```python
def select_recovery_path(current_state, target_state):
    # Priority 1: folding and emptying restoration (lowest cost)
    if can_fold_release_enough_tokens(current_state, target_state):
        return "fold_drain"
    # Priority 2: token recovery (no compression loss)
    recoverable_amount = sum(s["amount"] for s in check_recoverable_tokens())
    if has_recoverable_tokens() and recoverable_amount >= target_state["need"]:
        return "token_recovery"
    # Priority 3: token cap upgrade (expands capacity)
    if can_upgrade_token_limit(target_state["need"]):
        return "limit_upgrade"
    # Priority 4: reactive compression (last resort)
    return "reactive_compact"
```

[0191] V. Execution order of multi-level compression.

[0192] This implementation case strictly follows the order below, running six levels of compression followed by reinforcement learning optimization: 1. Level 0: semantic pre-screening uses a lightweight embedding model to evaluate message importance; 2. Level 1: tool result budget control enforces a maximum size for the aggregated tool results of each message; 3. Level 2: fragment compression selectively removes historical content; 4. Level 3: micro-compression performs lightweight tool result compression; 5. Level 4: context folding performs fine-grained, selective context compression; 6. Level 5: automatic compression executes a complete learning session summary; 7. Level 6: reinforcement learning optimization dynamically adjusts compression parameters based on feedback.

[0193] Each compression level has a condition check, and it is only executed if the condition is met. If resource usage is below a threshold after a certain compression level, subsequent levels may be skipped.

[0194] VI. Evaluation of Implementation Effectiveness.

[0195] 6.1 Performance Indicators

[0196] Following implementation, the system achieved significant improvements across several key performance indicators. The average learning session length increased from 32 rounds to 62 rounds, a 94% increase. The context interruption rate decreased from 14% to 1.8%, a relative reduction of 87% (12.2 percentage points). Budget calculation accuracy improved greatly, with the error rate dropping from 8.2% to 0.18%, a 98% reduction. Compression response time fell from 820 milliseconds to 305 milliseconds, a 63% improvement. Children's satisfaction scores rose from 4.1 to 4.6 out of 5, an increase of 12%.

[0197] 6.2 Compression effect statistics.

[0198] During 30 days of actual operation, the levels of the system's multi-level compression mechanism were triggered as follows: semantic pre-screening processed 150,000 messages with an average computational cost of 15 milliseconds. Tool result budget compression was triggered 118,000 times, releasing an average of 8,000 tokens per instance, with an execution time of only 4 milliseconds. Fragment compression was triggered 42,000 times, releasing an average of 9,500 tokens per instance, with an execution time of 7 milliseconds. Micro-compression was triggered 35,000 times, releasing an average of 3,800 tokens, with an execution time of 14 milliseconds. Context folding was triggered 11,000 times, releasing an average of 19,000 tokens, with an execution time of approximately 115 milliseconds. Automatic compression was triggered 7,800 times, releasing an average of 112,000 tokens, with an execution time of 2,400 milliseconds. Reinforcement learning optimization updated the policy network once a day, improving compression efficiency by an average of 3%.

[0199] 6.3 Improved user experience.

[0200] Based on children's feedback and system monitoring, the user experience improved significantly after implementation in this case. Improved learning coherence: because of the fine-grained-first compression strategy, the "memory gaps" children perceive were reduced by 76%; reduced response latency: the average compression time decreased from 820 ms to 305 ms; improved task completion rate: the learning failure rate due to insufficient context decreased from 11% to 1.8%; improved child satisfaction: the satisfaction score increased from 4.1 to 4.6 (out of 5).

[0201] 6.4 Resource utilization optimization.

[0202] The implementation of this invention significantly optimizes the use of system resources: token usage efficiency improved by 33%, supporting longer learning sessions within the same budget; API call costs fell by 26% because compression avoids transmitting unnecessary tokens; and server load is better balanced because compression work is distributed across different levels.

[0203] VII. Summary of Implementation Cases

[0204] This implementation case demonstrates in detail the complete process of applying the multi-level context compression method to an AI-assisted programming system for children.

[0205] By strictly following the multi-level compression sequence (semantic pre-screening → tool result budgeting → fragment compression → micro-compression → context folding → automatic compression → reinforcement learning optimization), the system achieves its optimization goal of prioritizing fine-grained compression over coarse-grained summarization.

[0206] The cross-compression boundary budget tracking mechanism ensures the accuracy of token calculation before and after compression, keeping the budget error within 0.18%. Multiple recovery paths provide a flexible recovery mechanism, automatically selecting the optimal recovery strategy based on specific circumstances.

[0207] Actual operational data shows that the implementation of this invention significantly improves system performance and user experience, providing a complete, efficient, and reliable solution for context management in this AI-assisted programming system for children.

Claims

1. A multi-level context compression method, characterized in that it comprises the following steps: Level 0, semantic pre-screening, uses a lightweight model to evaluate the information entropy and semantic importance of each message and performs context-aware dynamic budget allocation based on the importance score; Level 1, content budget control, enforces a maximum size for the aggregated content of each message; Level 2, fragment compression, selectively removes historical content under feature gating; Level 3, lightweight compression, performs lightweight content compression; Level 4, context folding, performs fine-grained context-selective compression before the complete summary, and if the compressed content is below the summary threshold, the complete summary is skipped; and Level 5, the complete summary, is triggered when resource usage exceeds a threshold, wherein, during compression, the pre-compression context state is captured, the tracking flag is reset, and the original content is replaced with the compressed content.

2. The method according to claim 1, characterized in that it further comprises the following refined steps: in the Level 1 content budget control, the content replacement operation is not visible to the cache path, ensuring cache consistency; in the Level 5 complete summary, the pre-compression context state is captured and the pre-compression resource usage is deducted from the resource budget, achieving resource tracking across compression boundaries; wherein the amount of released resources is used to adjust the threshold of subsequent compression stages so as to delay triggering of the complete summary.

3. The method according to claim 2, characterized in that it further comprises context folding based on a dialogue state graph: constructing a dialogue state transition graph in which messages are nodes and the dialogue flow forms the edges; identifying key decision nodes, namely dialogue branch points and information convergence states; performing selective folding based on the graph topology, retaining key decision paths and compressing only redundant paths; and using a graph traversal algorithm to determine the optimal folding range so as to maximize information retention.

4. The method according to claim 2, characterized in that the complete summary comprises: checking the trigger conditions and triggering when resource usage exceeds a preset threshold; capturing the pre-compression context state and recording the pre-compression resource usage; generating a content summary; resetting the tracking identifier to distinguish content rounds before and after compression; and replacing the original content array with the compressed content.

5. The method according to claim 2, characterized in that it further comprises a compression strategy optimized by reinforcement learning: deploying a lightweight reinforcement learning agent that dynamically adjusts the compression parameters of each level based on task completion rate feedback; defining the state space as current token usage, task type, historical compression success rate, and user satisfaction metrics; defining the action space as adjusting the thresholds of each level, selecting among summary models, and switching compression strategy modes; and using a reward function that balances resource saving and information retention to optimize the long-term task success rate.

6. The method according to claim 1, characterized in that the execution logic of each compression level comprises: the fragment compression is enabled or disabled via a feature switch and, when enabled, deletes earlier message content while retaining system messages and recent context content; the lightweight compression uses cache information to defer processing of messages near the context window boundary, and each compressed message retains its original structure; the context folding supports an incremental folding mode that gradually widens the folding range until resource usage meets the condition; and if resource usage after context folding is below the summary threshold, the Level 5 complete summary is skipped.

7. A multi-level context compression system, characterized in that it comprises: a semantic pre-screening module configured to use a lightweight model to evaluate the information entropy and semantic importance of each message, and to perform context-aware dynamic budget allocation based on the importance scores; a content budget control module configured to enforce a maximum size limit on the aggregate content of each message; a fragment compression module configured to selectively remove historical content under feature gating; a lightweight compression module configured to perform lightweight content compression; a context folding module configured to perform fine-grained, selective context compression before the full summary, the full summary being skipped if the folded content is below the summary threshold; a full summary module configured to be triggered when resource usage exceeds the threshold and, during compression, to capture the pre-compression context state, reset the tracking identifier, and replace the original content with the compressed content; and a cross-boundary resource tracking module configured to deduct the pre-compression resource usage from the resource budget when the full summary module executes, thereby achieving resource tracking across compression boundaries.
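The point of the cross-boundary resource tracking module is that a full summary shrinks the live context, so a tracker that only measures current context size would appear to regain budget it had already spent. A minimal sketch, assuming integer token counts and a single overall budget; the class and method names are hypothetical.

```python
class ResourceTracker:
    """Hypothetical budget ledger that stays continuous across full summaries."""

    def __init__(self, budget):
        self.remaining = budget      # budget left after all completed compression boundaries
        self.pre_boundary_usage = 0  # total usage deducted at earlier boundaries

    def on_full_summary(self, pre_compression_usage):
        # Deduct what the context had consumed before it was compressed.
        self.remaining -= pre_compression_usage
        self.pre_boundary_usage += pre_compression_usage

    def remaining_after(self, current_context_usage):
        # Remaining budget given what the current (post-compression) context uses.
        return self.remaining - current_context_usage

    def total_consumed(self, current_context_usage):
        return self.pre_boundary_usage + current_context_usage
```

With a 1,000-token budget and a summary taken at 600 tokens that leaves a 50-token context, `remaining_after(50)` reports 350 rather than the 950 that a context-size-only tracker would show.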

8. The system according to claim 7, characterized in that the system further includes a multi-level compression engine comprising: a sequential executor, which ensures that the compression levels are performed in strict order; a condition checker, which checks the trigger conditions of each compression level; a feature gating manager, which manages the feature switch status of each level; and a compression result aggregator, which summarizes the compression results and status of each level. The content budget control module includes: a content size calculator, which calculates the aggregate content size of each message; an upper limit comparator, which compares that size with a preset upper limit; a content processor, which truncates or replaces content that exceeds the upper limit; and a cache consistency guarantor, which ensures that content replacement is not visible to the cache path. The fragment compression module includes: a feature switch checker, which checks whether the fragment compression feature is enabled; a history message selector, which selects early messages for removal; a context retainer, which retains system messages and the most recent context; and a threshold adjuster, which uses the freed resources to adjust subsequent compression thresholds. The lightweight compression module includes: a lightweight compressor, which compresses specified content portions; a boundary message processor, which processes messages near the context window boundary; and a cache optimizer, which uses cached information for deferred processing.
The cross-boundary resource tracking module includes: a pre-compression state capturer, which captures the context's resource usage before a complete summary; a budget calculator, which deducts the pre-compression resource usage from the remaining resource budget; a continuity guarantor, which ensures that resource tracking remains continuous across compression boundaries; and a system synchronizer, which ensures that the system can accurately calculate post-compression resource consumption.
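The content budget control chain in claim 8 (size calculator, upper-limit comparator, content processor) can be sketched as a single pass over the messages. This assumes character counts as the size measure, `content` fields that are either a string or a list of strings, and truncation with a hypothetical `[truncated]` marker. Returning new dicts instead of mutating the input is one possible way to keep replacements invisible to a cache holding the originals, not necessarily the claimed mechanism.

```python
def enforce_content_budget(messages, max_chars, marker="[truncated]"):
    """Cap each message's aggregate content at `max_chars` characters."""
    out = []
    for m in messages:
        parts = m["content"] if isinstance(m["content"], list) else [m["content"]]
        size = sum(len(p) for p in parts)          # content size calculator
        if size <= max_chars:                      # upper limit comparator
            out.append(m)                          # unchanged, same object
            continue
        joined = "".join(parts)
        keep = max(max_chars - len(marker), 0)
        # Content processor: truncate and mark; slice again in case the marker is too long.
        # A new dict is built so the original (possibly cached) message is never mutated.
        out.append({**m, "content": (joined[:keep] + marker)[:max_chars]})
    return out
```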

9. The system according to claim 7, characterized in that it further includes: a dialogue state graph construction module configured to build a dialogue state transition graph and identify key decision nodes and convergence states; a graph structure folding module configured to perform selective context folding based on the graph topology; and a reinforcement learning optimization module configured to dynamically adjust compression strategy parameters based on task completion rate feedback; the multi-level compression engine being further configured to coordinate the dialogue state graph construction module and the reinforcement learning optimization module.

10. The system according to claim 9, characterized in that the semantic pre-screening module includes: an information entropy calculation unit, which uses a lightweight model to calculate the information entropy of each message; an importance scoring unit, which generates a score combining message position, content type, and historical relevance; a dynamic budget allocation unit, which allocates differentiated budget quotas to messages of different importance levels; and a threshold adjustment unit, which adjusts the compression threshold of each level according to the dynamic budget mapping. The dialogue state graph construction module includes: a feature extraction unit, which extracts the intent label, entity information, and sentiment polarity of each message; a graph construction unit, which connects nodes based on time windows and semantic similarity; a convergence state identification unit, which identifies similar semantic states reached by multiple dialogue paths; and a key node identification unit, which calculates node centrality indices to identify key decision points. The reinforcement learning optimization module includes: a state monitoring unit, which collects current token usage, task type, and historical success rate; a policy network unit, which uses a neural network to learn the optimal compression policy; an action execution unit, which executes threshold adjustment, model selection, and policy switching actions; and a feedback collection unit, which collects task completion rates for policy optimization.
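The entropy, scoring, and allocation units of the semantic pre-screening module can be approximated without a model. As an assumption, the sketch below substitutes Shannon entropy over each message's word distribution for the claimed lightweight-model entropy estimate, scores importance as a weighted sum of position, a per-role weight (standing in for content type), and Jaccard word overlap with the latest message (standing in for historical relevance), and splits a budget in proportion to the scores. `ROLE_WEIGHT` and the coefficients `w_pos`, `w_type`, `w_rel` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical per-role weights standing in for "content type".
ROLE_WEIGHT = {"system": 1.0, "user": 0.8, "assistant": 0.6, "tool": 0.4}

def token_entropy(text):
    """Shannon entropy (bits) of the message's word distribution."""
    words = text.lower().split()
    if not words:
        return 0.0
    n = len(words)
    return -sum((c / n) * math.log2(c / n) for c in Counter(words).values())

def importance(messages, w_pos=0.4, w_type=0.3, w_rel=0.3):
    """Score each message by recency, role weight, and overlap with the latest message."""
    n = len(messages)
    latest = set(messages[-1]["content"].lower().split()) if messages else set()
    scores = []
    for i, m in enumerate(messages):
        words = set(m["content"].lower().split())
        union = words | latest
        relevance = len(words & latest) / len(union) if union else 0.0
        position = (i + 1) / n  # later messages score higher
        scores.append(w_pos * position
                      + w_type * ROLE_WEIGHT.get(m["role"], 0.5)
                      + w_rel * relevance)
    return scores

def allocate_budget(scores, total):
    """Split `total` tokens in proportion to importance (floored to ints)."""
    if not scores:
        return []
    s = sum(scores)
    if s == 0:
        return [total // len(scores)] * len(scores)
    return [int(total * x / s) for x in scores]
```

Flooring means the allocations can sum to slightly less than `total`; the leftover could be handed to the highest-scoring message if an exact split matters.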