Memory management method and apparatus for an agent
By employing a multi-layered memory architecture and a memory management decision-making model trained through reinforcement learning, the problem of insufficient memory management in large language model agents is solved. This enables agents to accumulate online experience and continuously evolve, thereby improving task execution efficiency and the ability to adapt to complex tasks.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ZHEJIANG LAB
- Filing Date
- 2026-05-14
- Publication Date
- 2026-06-19
AI Technical Summary
Existing memory management schemes for large language model agents suffer from problems such as limited memory capacity, inability to accumulate knowledge across sessions, static updates to the knowledge base, non-real-time memory updates, high randomness in memory writing decisions, and ineffective storage and reuse of trial and error experience. These issues result in low efficiency and limited application value of the agent in task execution.
It adopts a multi-layer memory architecture, including the current context layer, the external knowledge base layer, the optimizer momentum state layer, and the large model parameter layer. It trains the memory management decision model through reinforcement learning to realize intelligent reading and writing of memory, cross-level transfer and full life cycle management, and structured storage of trial and error experience for cross-session reuse.
It improves the memory utilization and task execution efficiency of the agent, enables online experience accumulation and continuous evolution, reduces the recurrence of the same errors, and enhances the agent's adaptability in complex task scenarios.
Figure CN122242560A_ABST
Description
Technical Field
[0001] This specification relates to the field of artificial intelligence technology, and more particularly to memory management technology for large language model intelligent agents, specifically a memory management method and apparatus for an intelligent agent.
Background Technology
[0002] In recent years, Large Language Models (LLMs) have made groundbreaking progress in the field of natural language processing. Intelligent agents built upon LLMs, with their powerful language understanding, logical reasoning, and task execution capabilities, have demonstrated enormous application potential in numerous scenarios such as automated task processing, code generation, intelligent information retrieval, human-computer interaction, and intelligent decision-making. The memory management system, as a core component of the intelligent agent, is the foundation for achieving continuous task execution, experience accumulation, and capability evolution. Its rational design and efficiency directly determine the actual efficiency of the intelligent agent, the quality of task completion, and the user experience.
[0003] However, existing memory management schemes for large language model agents still have fundamental flaws that severely restrict the agent's ability to evolve continuously during actual deployment and use. These flaws manifest in the following five aspects. First, traditional context-window memory schemes rely solely on the large language model's own context window for memory storage. While such a scheme updates quickly, its memory capacity is strictly limited by the model's context window length, and the memory is only temporary: it is cleared after the session ends, making cross-session knowledge accumulation and experience reuse impossible. Second, memory management schemes based on static retrieval-augmented generation (RAG) use a fixed knowledge base to provide information support for the agent. After deployment, the knowledge base cannot be dynamically updated according to the agent's actual usage experience, so the agent lacks the ability to "learn by doing". Third, although memory-injection schemes can integrate memory information into the model structure, memory updates occur only during the offline training phase of the model. During the inference and deployment phase, the memory cannot be updated in real time according to new experience, so the memory system is disconnected from the actual task execution process. Fourth, some memory management schemes rely on prompt-based decisions of the large language model itself to read and write memory. They lack explicit learning signals and clear optimization goals, so memory writing decisions are highly random and their accuracy and effectiveness are low. Fifth, existing trial-and-error experience management schemes only reflect on and summarize the trial-and-error process; they lose the complete trial-and-error trajectory and its structured details and cannot support accurate matching and reuse in similar scenarios, so the same errors recur.
[0004] More importantly, the trial-and-error experiences generated by the agent during task execution, which are the most valuable learning signals, are mostly discarded directly in existing solutions, failing to be effectively stored and reused. This not only reduces the agent's task completion efficiency but also significantly impacts its practical application value. Therefore, a systematic agent memory management method is urgently needed, capable of intelligent decision-making for memory writing, structured storage and cross-session reuse of trial-and-error experiences, and adaptive transfer between multi-level memories. This would enable the agent to accumulate online experience, learn through trial and error, and continuously evolve during the deployment and use phase, addressing the technical shortcomings of existing solutions.
Summary of the Invention
[0005] This specification addresses the aforementioned problems in intelligent agent memory management by providing an intelligent agent memory management method, device, and computer-readable storage medium. It solves the technical problem that existing intelligent agents cannot achieve online experience accumulation, trial-and-error learning, and continuous evolution during the deployment and use phase, thereby improving the memory utilization rate, task execution efficiency, and continuous evolution capability of intelligent agents.
[0006] To achieve the above objectives, according to a first aspect of one or more embodiments of this specification, a memory management method for an intelligent agent is proposed. The memory system corresponding to the intelligent agent includes a current context layer of the current task and an external knowledge base layer accessible to the intelligent agent. The method includes: obtaining the current memory state vector, wherein the memory state vector is encoded based on the state information of the current task and the state information of the memory system; and, based on the current memory state vector, invoking a pre-trained memory management decision model to perform memory read/write management on the agent, wherein performing the memory read/write management includes at least one of the following steps A1 to A4:
Step A1: Based on the information in the current context layer, summarize the information into knowledge-based memory entries and write them into the external knowledge base layer;
Step A2: Identify the trial-and-error trajectory of the agent from failure to success during task execution from the current context layer, extract and encode the trial-and-error trajectory into trial-and-error memory entries, and write them into the external knowledge base layer;
Step A3: Do not perform a write operation;
Step A4: Read knowledge-based or trial-and-error memory entries related to the current task from the external knowledge base layer and inject them into the current context layer.
[0007] More preferably, step A4, which involves reading trial-and-error memory entries related to the current task from the external knowledge base layer and injecting them into the current context layer, includes: encoding the current task description into a vector; retrieving trial-and-error memory entries related to the current task from the external knowledge base layer based on similarity; and injecting the successful strategy summary of the matched trial-and-error memory entries into the current context of the current task when the similarity exceeds a preset threshold.
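The retrieval step of [0007] can be sketched as follows. This is a minimal illustration, assuming task descriptions have already been encoded into vectors by some encoder; the names `TrialEntry` and `retrieve_strategy`, the use of cosine similarity, and the threshold 0.8 are hypothetical choices, not taken from the specification.

```python
import math
from dataclasses import dataclass


@dataclass
class TrialEntry:
    embedding: list[float]   # vector encoding of the task that produced the trajectory
    strategy_summary: str    # summary of the strategy that finally succeeded


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_strategy(task_vec, entries, threshold=0.8):
    """Return the best-matching strategy summary, or None if no entry exceeds the threshold."""
    best = max(entries, key=lambda e: cosine(task_vec, e.embedding), default=None)
    if best is not None and cosine(task_vec, best.embedding) > threshold:
        return best.strategy_summary
    return None


entries = [
    TrialEntry([1.0, 0.0], "retry with pagination"),
    TrialEntry([0.0, 1.0], "escape quotes before parsing"),
]
context = ["Task: parse quoted CSV"]       # stands in for the current context layer
hint = retrieve_strategy([0.1, 0.9], entries)
if hint is not None:
    context.append(f"Known successful strategy: {hint}")
```

Only the strategy summary is injected, not the full trajectory, which keeps the added context short.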
[0008] More preferably, the method further includes: after the session of the current task ends, invoking the memory management decision model to extract session information from the current context layer, and integrating the extracted information into knowledge-based memory entries or trial-and-error-based memory entries for archiving to the external knowledge base layer.
[0009] More preferably, the method further includes: when the number of similar memory entries in the external knowledge base layer exceeds the merging threshold, merging multiple similar memory entries into a strategic memory entry and then writing it into the external knowledge base layer.
[0010] More preferably, the memory system further includes an optimizer momentum state layer for optimizing the training of a large model of the agent; The method further includes: filtering target memory entries in the external knowledge base layer whose access frequency is higher than a set threshold and whose content change rate is lower than a set threshold; constructing a fine-tuning training set based on the target memory entries when the preset migration triggering conditions are met; and performing lightweight parameter fine-tuning on the parameter layer of the large model through the optimizer momentum state layer.
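The screening in [0010] can be illustrated with a short sketch. It assumes that an access frequency and a content change rate have already been measured for each entry; the class `EntryStats`, the threshold values, and the trigger rule in `should_migrate` are hypothetical, since the specification leaves the thresholds and the triggering conditions as presets.

```python
from dataclasses import dataclass


@dataclass
class EntryStats:
    entry_id: str
    accesses_per_day: float   # measured access frequency
    change_rate: float        # fraction of recent scans in which the content changed


def select_migration_candidates(entries, min_freq=5.0, max_change=0.1):
    """Keep entries that are read often AND rarely modified."""
    return [e for e in entries
            if e.accesses_per_day > min_freq and e.change_rate < max_change]


def should_migrate(candidates, min_candidates=2):
    """Hypothetical trigger: enough stable, frequently used entries have accumulated."""
    return len(candidates) >= min_candidates


stats = [
    EntryStats("unit-conversion", 12.0, 0.00),
    EntryStats("api-retry-rule", 9.0, 0.05),
    EntryStats("draft-note", 20.0, 0.60),      # popular but still changing
    EntryStats("rare-fact", 0.2, 0.00),        # stable but rarely used
]
candidates = select_migration_candidates(stats)
build_finetune_set = should_migrate(candidates)
```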
[0011] More preferably, the memory entries in the external knowledge base layer include memory type tags, content text, creation time, last access time, access count, TTL (time-to-live) value, and associated metadata; the method further includes performing TTL expiration eviction or LRU (least recently used) ordered eviction on the memory entries of the external knowledge base layer. Performing TTL expiration eviction on the memory entries of the external knowledge base layer includes: scanning the external knowledge base layer at preset intervals and obtaining the access frequency of each memory entry based on its last access time and access count; extending the TTL value of the memory entry based on the access frequency; and evicting the memory entry when the extended TTL value expires. Performing LRU-ordered eviction on the memory entries of the external knowledge base layer includes: when the storage utilization rate of the external knowledge base layer exceeds a first threshold, sorting the memory entries of the external knowledge base layer by an LRU index, and evicting the memory entries that have gone unaccessed the longest, in sequence according to the sorting result, until the storage utilization rate drops below a second threshold.
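The two eviction policies of [0011] can be sketched as below. This is an illustration under stated assumptions: the specification does not give the TTL extension formula, so `effective_ttl` uses a hypothetical rule (base TTL scaled by accesses per second), and the thresholds 0.9 and 0.7 are placeholder values for the first and second thresholds.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    created_at: float    # seconds
    last_access: float   # seconds
    access_count: int
    base_ttl: float      # seconds


def effective_ttl(e: Entry, k: float = 1.0) -> float:
    """Base TTL stretched by access frequency (assumed rule, not from the source)."""
    active = max(e.last_access - e.created_at, 1.0)
    return e.base_ttl * (1.0 + k * e.access_count / active)


def ttl_scan(store: dict, now: float) -> list:
    """Periodic scan: evict entries whose extended TTL has expired."""
    expired = [key for key, e in store.items()
               if now > e.created_at + effective_ttl(e)]
    for key in expired:
        del store[key]
    return expired


def lru_evict(store: dict, capacity: int, high: float = 0.9, low: float = 0.7) -> list:
    """If utilization exceeds `high`, evict least recently accessed entries
    until utilization drops below `low`."""
    evicted = []
    if len(store) / capacity > high:
        for key in sorted(store, key=lambda k: store[k].last_access):
            if len(store) / capacity < low:
                break
            evicted.append(key)
            del store[key]
    return evicted
```

Using two thresholds rather than one gives hysteresis: a single pass frees enough room that the next few writes do not immediately trigger another eviction pass.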
[0012] More preferably, the memory management decision model is obtained through reinforcement learning training including the following steps: The read and write operations of the memory system corresponding to steps A1, A2, A3, and A4 are modeled as discrete actions in reinforcement learning. First, a cold start is completed by supervised fine-tuning using manually labeled datasets. Then, a proximal policy optimization algorithm is used, with a weighted combination of task success rate, task completion efficiency, and memory hit rate as the reward function, to train and optimize the optimal memory writing decision strategy online.
[0013] More preferably, the discrete actions include writing knowledge-based memory, writing trial-and-error trajectories, not writing to memory, and reading memory. The state space of reinforcement learning consists of one or more of the agent's current task context, existing memory bank content summary, and current interaction round information.
[0014] More preferably, the method further includes: after any of steps A1 to A4 is completed, calculating the reward corresponding to that step and, based on the reward, constructing a data pool used to update the memory management decision model; and, after the current task is completed and when the preset triggering conditions are met, updating the memory management decision model using a reinforcement learning algorithm based on the data in the data pool.
[0015] According to a second aspect of one or more embodiments of this specification, a memory management device for an intelligent agent is also provided, wherein the memory system corresponding to the intelligent agent includes a current context layer of the current task and an external knowledge base layer accessible to the intelligent agent; the device includes: an acquisition unit, configured to acquire a current memory state vector, wherein the memory state vector is encoded based on the state information of the current task and the state information of the memory system; and a memory read/write unit, configured to invoke a pre-trained memory management decision model based on the current memory state vector to perform memory read/write management of the agent, wherein performing the memory read/write management includes at least one of the following steps A1 to A4:
Step A1: Based on the information in the current context layer, summarize the information into knowledge-based memory entries and write them into the external knowledge base layer;
Step A2: Identify the trial-and-error trajectory of the agent from failure to success during task execution from the current context layer, extract and encode the trial-and-error trajectory into trial-and-error memory entries, and write them into the external knowledge base layer;
Step A3: Do not perform a write operation;
Step A4: Read knowledge-based or trial-and-error memory entries related to the current task from the external knowledge base layer and inject them into the current context layer.
[0016] With the technical solutions provided by one or more of the above embodiments, the trial-and-error experience of the intelligent agent during task execution is extracted in full and encoded in structured form. In subsequent new tasks, the trial-and-error experience can be reused across sessions through semantic similarity matching. This solves the problem of discarded or unreusable trial-and-error experience, effectively reduces the recurrence rate of the same errors, and significantly improves task completion efficiency. By constructing a positive cycle of "trial and error → memory writing → cross-session reuse → reduced trial and error", the intelligent agent can continuously accumulate task execution experience during the deployment and use phase, continuously optimize its task execution strategy, achieve online learning and continuous evolution, and better adapt to complex and ever-changing task scenarios.
Attached Figure Description
[0017] Figure 1 is a structural diagram of the memory system of an intelligent agent provided in an exemplary embodiment.
[0018] Figure 2 is a flowchart of a memory management method for an intelligent agent provided in an exemplary embodiment.
[0019] Figure 3 is a flowchart of reinforcement-learning-based memory read/write management provided in an exemplary embodiment.
[0020] Figure 4 is a data migration flowchart of a memory system provided in an exemplary embodiment.
[0021] Figure 5 is a flowchart of lifecycle management of a memory system provided in an exemplary embodiment.
[0022] Figure 6 is a schematic diagram of the structure of a device provided in an exemplary embodiment.
[0023] Figure 7 is a structural block diagram of a memory management device for an intelligent agent provided in an exemplary embodiment.
Detailed Implementation
[0024] To enable those skilled in the art to better understand the technical solutions in this specification, the technical solutions in the embodiments of this specification will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this specification, and not all embodiments. Based on the embodiments in this specification, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of this specification.
[0025] It should be noted that the "large model" mentioned in this specification may include a large language model (LLM), which serves as the foundational model for building intelligent agents. The text encoder, Transformer encoder, and multilayer perceptron (MLP) of the large model can adopt existing general model structures, with parameters adjusted according to the application scenario. The "intelligent agent" mentioned in this specification includes a computer-executable program built based on the large language model, possessing autonomous task execution, information interaction, and memory management capabilities. An intelligent agent typically includes the following modules: a perception module—responsible for environmental information acquisition (including multimodal sensors and user-end task command input) and preprocessing; a cognition module—usually driven by the large language model, responsible for reasoning, task decomposition, and memory management; a memory module—responsible for maintaining the agent's historical experience and contextual continuity, typically including the current context of the large model's session window and an external knowledge base providing static retrieval for the large model; the vector database used can be mature products such as FAISS, Milvus, and Redis; and an execution module—executes specific application operations according to the task path output by the cognition module.
[0026] The core of this specification is to build a memory system with a multi-layered memory architecture for intelligent agents. A memory management decision model is trained through reinforcement learning, and based on this model, intelligent reading and writing of memory, cross-level transfer, and full lifecycle management of the external knowledge base layer are realized, ultimately enabling the intelligent agent to accumulate online experience and continuously evolve.
[0027] Figure 1 shows the memory system structure of an intelligent agent provided in an exemplary embodiment. It comprises a four-layer memory architecture: a current context layer (hereinafter sometimes referred to as the L1 layer), an external knowledge base layer (hereinafter sometimes referred to as the L2 layer), an optimizer momentum state layer (hereinafter sometimes referred to as the L3 layer), and a large model parameter layer (hereinafter sometimes referred to as the L4 layer). These layers differ in carrier, update frequency, and persistence, and form a clear information flow path. The specific design is as follows:
(i) L1 layer: current context layer
The L1 layer is the agent's temporary memory system, simulating human working memory. It is the core carrier for direct interaction between the agent and the user/task environment, and real-time task information is first stored in the L1 layer.
[0028] Key features: The carrier is the context window of the large language model itself, which has a maximum supported context length measured in tokens; the update frequency is the fastest among the four layers, updating with every token; persistence is temporary: the layer is cleared after the session ends, releasing memory resources.
Storage content: Stores real-time interaction information of the current task/session, including the task description, user instructions, agent output results, dialogue content, real-time status of task execution, etc.
Storage rules: Information is concatenated in chronological order as a text sequence to maintain timeliness and integrity; when the capacity limit is reached, a sliding window mechanism deletes the oldest low-value information and retains the latest core interaction information.
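The sliding window rule for the L1 layer can be sketched as follows. This is a minimal illustration: whitespace splitting stands in for a real tokenizer, and the `pinned` flag is a hypothetical way to mark core information (such as the task description) that should survive trimming; the specification does not say how low-value information is identified.

```python
from collections import deque


class ContextWindow:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.items = deque()      # (text, n_tokens, pinned), oldest first
        self.used = 0

    def append(self, text: str, pinned: bool = False) -> None:
        n = len(text.split())     # crude stand-in for a real tokenizer
        self.items.append((text, n, pinned))
        self.used += n
        self._trim()

    def _trim(self) -> None:
        """Drop the oldest unpinned items until the token budget is met."""
        kept = deque()
        while self.used > self.max_tokens and self.items:
            text, n, pinned = self.items.popleft()
            if pinned:
                kept.append((text, n, pinned))   # set aside, restore afterwards
            else:
                self.used -= n
        # Restore pinned items at the front in their original order.
        self.items.extendleft(reversed(kept))


window = ContextWindow(max_tokens=8)
window.append("task: sort list", pinned=True)
window.append("one two three")
window.append("four five six")    # exceeds the budget, oldest unpinned item is dropped
```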
[0029] (ii) L2 layer: external knowledge base layer
The L2 layer is the core persistent memory system of the agent, simulating human note-taking and memos. It is the core carrier for the agent to accumulate experience and reuse memories across sessions.
[0030] Key features: The carrier is usually a vector database (such as FAISS or Milvus), which supports the storage of millions of memory entries and can be scaled horizontally as needed; the update frequency is moderate, with reads and writes performed on demand (based on the decisions of the memory management decision model); persistence is configurable, and the retention time of memory entries is controlled by the TTL value.
Memory entry structure: Each memory entry in the L2 layer is standardized structured data to ensure the integrity, retrievability, and reusability of information. Specific fields can be designed as follows: memory type tag, content text, creation time, last access time, access count, TTL value, and associated metadata.
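One possible in-code representation of an L2 memory entry is sketched below, using the fields enumerated in [0011]. The Python names, the enum values, and the default TTL of seven days are hypothetical; the specification does not prescribe types or defaults.

```python
import time
from dataclasses import dataclass, field
from enum import Enum


class MemoryType(Enum):
    KNOWLEDGE = "knowledge"
    TRIAL_AND_ERROR = "trial_and_error"


@dataclass
class MemoryEntry:
    memory_type: MemoryType                  # memory type tag
    content: str                             # content text
    created_at: float = field(default_factory=time.time)
    last_access: float = field(default_factory=time.time)
    access_count: int = 0
    ttl_seconds: float = 7 * 24 * 3600.0     # illustrative default, not from the source
    metadata: dict = field(default_factory=dict)

    def touch(self) -> None:
        """Record one read of this entry."""
        self.access_count += 1
        self.last_access = time.time()


entry = MemoryEntry(
    MemoryType.TRIAL_AND_ERROR,
    "Parsing failed on quoted commas; succeeded after switching to a CSV parser.",
    metadata={"task_type": "code generation"},
)
entry.touch()
```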
[0031] Storage content: Stores structured knowledge-based memories and trial-and-error-based memories. Knowledge-based memories are general knowledge such as objective facts, formulas and theorems, while trial-and-error-based memories are structured trial-and-error trajectories of "failure → success". Storage rules: Similar memory entries are categorized and stored according to memory type tags and task type; all memory entries must undergo semantic similarity retrieval before being written to avoid duplicate storage.
[0032] (iii) L3 layer: optimizer momentum state layer
The L3 layer is an auxiliary memory system for the training phase of the agent. It is an important bridge connecting the L2 and L4 layers, providing optimizer state support for memory solidification and transfer from L2 to L4, and does not directly store task-related memory information.
[0033] Key features: The carrier is the momentum state maintained by the optimizer (e.g., the Adam optimizer) during the training of a large language model, requiring no additional storage medium; the update frequency is relatively slow, updating once per gradient step (only during fine-tuning of the L4 layer); the state persists during training and can be saved or released after fine-tuning.
Core content: The core is the first-order momentum m_t and second-order momentum v_t maintained by the optimizer. Both have the same dimensions as the weight parameters of the large language model: (1) first-order momentum m_t records the exponential moving average of the gradient of the model parameters, reflects the trend direction of the gradient, and gives parameter updates "inertia" to avoid oscillation during updating; (2) second-order momentum v_t records the exponential moving average of the squared gradient of the model parameters, which adaptively adjusts the learning rate of each parameter to achieve fine-grained parameter updates.
Usage rules: The state is updated and used during the L2→L4 memory solidification fine-tuning phase. Before fine-tuning, m_t and v_t are initialized to 0 or loaded from historical state values. After fine-tuning, the state can be saved or its resources released as needed.
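For reference, the roles of m_t and v_t can be seen in a plain implementation of one Adam update step. This sketch follows the standard Adam rule with bias correction; the hyperparameter defaults are the commonly used ones and the toy objective is chosen only for illustration, neither comes from the specification.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns new (theta, m, v). `t` is the 1-based step index."""
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # first-order momentum m_t
        v_i = beta2 * v_i + (1 - beta2) * g * g    # second-order momentum v_t
        m_hat = m_i / (1 - beta1 ** t)             # bias correction for zero init
        v_hat = v_i / (1 - beta2 ** t)
        new_theta.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v


theta, m, v = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0]   # m and v start at 0, as in the text
for t in range(1, 4):
    grad = [2 * p for p in theta]                   # gradient of sum(p^2), a toy objective
    theta, m, v = adam_step(theta, grad, m, v, t)
```

Because m and v have one value per parameter, saving them after fine-tuning costs roughly twice the storage of the tuned parameters themselves.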
[0034] (iv) L4 layer: large model parameter layer
The L4 layer is the long-term fixed memory system of the agent and the carrier of the agent's core capabilities. The stored memory information becomes an inherent capability of the large language model and can be used directly without additional retrieval.
[0035] Key features: The carrier is the weight parameters θ of a large language model (including weights and bias parameters of all layers such as embedding layer, attention layer, and fully connected layer); the update frequency is the slowest among the four layers, and full parameter tuning or lightweight fine-tuning can be performed as needed; persistence is long-term, and parameter updates are permanently saved (unless new parameter tuning is performed). Storage content: Stores high-frequency, stable core memory that has been validated over a long period of time and reused frequently, derived from high-quality memory in the L2 layer through fine-tuning and solidification; Update rules: Full parameter tuning is possible, or lightweight fine-tuning (such as LoRA) techniques can be used for parameter fine-tuning; fine-tuning can be done only on the attention layer of the model, setting a small rank to minimize computation and storage overhead while ensuring memory retention; after fine-tuning, the effect needs to be verified, and the updated parameters are saved only after verification.
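To make the "small rank" point concrete, the following sketch shows the arithmetic behind a LoRA update in plain Python: the frozen weight W is left untouched and a low-rank product B·A, scaled by alpha/r, is added to it. The matrix sizes and the value of `alpha` are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """W' = W + (alpha / r) * B @ A, with B of shape d x r and A of shape r x k."""
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_params(d, k, r):
    """Number of trainable values in the low-rank factors B (d x r) and A (r x k)."""
    return r * (d + k)


w = [[1.0, 0.0], [0.0, 1.0]]     # frozen pretrained weight (d = k = 2)
a = [[1.0, 2.0]]                 # A: r x k with r = 1
b_zero = [[0.0], [0.0]]          # B starts at zero, so W' equals W before training
b_trained = [[1.0], [0.0]]       # a made-up value of B after some updates

before = lora_effective_weight(w, a, b_zero, alpha=2.0)
after = lora_effective_weight(w, a, b_trained, alpha=2.0)
```

For a 4096 × 4096 attention projection with rank 8, `trainable_params(4096, 4096, 8)` gives 65,536 trainable values against 16,777,216 in the full matrix.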
[0036] The four memory layers described above are not independent of one another; they form a bidirectional information flow path that enables adaptive migration and reuse of memory. All operations can be executed automatically in the background, transparent to the user/task environment. The intelligent agent's management operations on the four-layer memory system (such as reading and writing at each layer, data migration between layers, and data lifecycle management within layers) can all be decided and executed using the memory management decision model.
[0037] This specification first details the reinforcement learning training process of the memory management decision model. This process is divided into two stages: a supervised fine-tuning (SFT) cold start and proximal policy optimization (PPO) online optimization. The core of reinforcement learning is the interaction between the model and the environment: the model performs actions, obtains rewards, and optimizes its policy accordingly. Before training, three key elements must be clearly defined: the state space S, the action space A, and the reward function R. The state space S describes the real-time task state and memory system state of the agent and serves as the input for model decision-making. In one illustrative implementation, the state space S may include a description of the current task, the history of the most recent K rounds of dialogue, the memory bank capacity utilization rate, and summaries of the most recent N memory entries, where K and N are both preset positive integers. After processing by a state encoder, the state is mapped to a fixed-dimensional current memory state vector. The state encoder can employ a lightweight Transformer encoder to fuse and extract multi-dimensional state information, ensuring the integrity and relevance of the state representation.
[0038] The action space A is a discrete space, and memory read and write operations can be modeled as four core discrete actions, corresponding to the core operations of memory management. At each decision step, the model selects the action with the highest probability to execute:
A1: Write knowledge-based memory: summarize knowledge-based memory information from the L1 layer, encode it as knowledge-based memory entries, and write them into the L2 layer;
A2: Write the trial-and-error trajectory: extract the "failure → success" trial-and-error trajectory from the L1 layer, structure it into trial-and-error memory entries, and write them into the L2 layer;
A3: No write: no memory read/write operation is performed, and the agent continues to perform its task;
A4: Read memory: retrieve memory related to the current task from the L2 layer and inject it into the L1 layer to assist task execution.
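The four-action space and the greedy selection rule can be sketched as follows. The enum and function names are hypothetical; the ordering of the probability vector (A1, A2, A3, A4) is an assumption of this sketch.

```python
from enum import Enum


class MemoryAction(Enum):
    WRITE_KNOWLEDGE = "A1"   # summarize from L1 and write a knowledge entry to L2
    WRITE_TRIAL = "A2"       # write a failure-to-success trajectory to L2
    NO_WRITE = "A3"          # do nothing, continue the task
    READ = "A4"              # retrieve from L2 and inject into L1


def select_action(probs):
    """Greedy selection: pick the action with the highest probability."""
    actions = list(MemoryAction)
    if len(probs) != len(actions):
        raise ValueError("expected one probability per action")
    best_index = max(range(len(probs)), key=probs.__getitem__)
    return actions[best_index]


action = select_action([0.1, 0.6, 0.2, 0.1])
```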
[0039] The reward function R provides explicit learning signals for model policy optimization. It can be implemented as a multi-dimensional weighted combination, with task success rate as the core factor, supplemented by task completion efficiency and memory hit rate. The formula is as follows:
$$R = \alpha R_{\text{task}} + \beta R_{\text{eff}} + \gamma R_{\text{mem}}$$
[0040] Where: $\alpha$, $\beta$, $\gamma$ are weighting coefficients, satisfying [constraint not recoverable from the source] and $\alpha > \beta$, $\alpha > \gamma$; in this embodiment, the coefficients are assumed to take [values not recoverable from the source]. $R_{\text{task}}$ is the task reward: +1 for a successful task and -1 for a failure. $R_{\text{eff}}$ is the efficiency reward: it is inversely proportional to the number of interaction rounds with the user upon task completion, as given by [formula not recoverable from the source] ($T$ is the actual number of interaction rounds), and is 0 for failure. $R_{\text{mem}}$ is the memory reward, calculated based on the quality of the written memory entry and its subsequent hit count; it is 0 if no entry is written. The reward function is calculated after each task episode ends, and an intermediate reward of +0.1 is given at key steps (such as the first reuse of a trial-and-error entry or a successful memory transfer) to guide the model to learn the optimal strategy quickly.
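The weighted reward of [0039] and [0040] can be sketched as below. Several details are assumptions made only for illustration: the weights 0.6, 0.2, 0.2 are chosen merely to satisfy α > β and α > γ, the efficiency term uses 1/T as one form that is inversely proportional to the number of rounds, and the memory term is passed in as a precomputed number because its exact computation is not given.

```python
def episode_reward(success: bool, rounds: int, memory_reward: float,
                   alpha: float = 0.6, beta: float = 0.2, gamma: float = 0.2) -> float:
    """Weighted sum of task, efficiency, and memory rewards for one episode."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    r_task = 1.0 if success else -1.0          # +1 on success, -1 on failure
    r_eff = 1.0 / rounds if success else 0.0   # assumed form; 0 on failure
    return alpha * r_task + beta * r_eff + gamma * memory_reward


INTERMEDIATE_BONUS = 0.1   # given at key steps, e.g. first trial-and-error reuse

quick_win = episode_reward(success=True, rounds=4, memory_reward=0.5)
failure = episode_reward(success=False, rounds=4, memory_reward=0.0)
```

With these illustrative weights, a success in fewer rounds always scores higher than the same success in more rounds, while the task term still dominates the total.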
[0041] To achieve a cold start for the model, a manually labeled memory-write decision dataset needs to be constructed. The steps are as follows: Dataset collection: Collect historical interaction trajectories of intelligent agents in typical scenarios such as text classification, machine translation, code generation, and intelligent question answering, covering three levels of task difficulty (simple, medium, and hard) and four levels of knowledge base capacity (empty, low, medium, and high). Each trajectory contains complete information such as the task description, dialogue history, memory base status, task result, and trial-and-error trajectory. The total sample size is no less than 100,000.
[0042] Annotation content and rules: Annotators mark the optimal memory operation that should be performed at each step based on the historical interaction context of the agent: if there is general knowledge, mark it A1; if there is an effective trial and error trajectory, mark it A2; if there is no valuable information and no need to retrieve it, mark it A3; if the task encounters a bottleneck, mark it A4.
[0043] Data preprocessing: Remove samples with incorrect labeling or missing information, encode the state information to generate the current memory state vector, divide it into training set, validation set and test set, and perform data augmentation such as synonym replacement and word order adjustment on the training set to improve the model's generalization ability.
[0044] Next, supervised fine-tuning (SFT) cold start training can be performed. The goal of supervised fine-tuning is to train the initial parameters of the model using the manually labeled dataset, giving the model a basic memory decision-making capability. The steps are as follows: Determine the model network structure and training hyperparameters: The core of the memory management decision model is a multilayer perceptron (MLP) with the following structure: input layer → hidden layer → ReLU activation + LayerNorm + Dropout → output layer (Softmax activation), which outputs the probability distribution over the actions; a mini-batch gradient descent optimizer and a cross-entropy loss function can be used. After setting parameters such as the batch size, initial learning rate, number of training epochs, maximum norm for gradient clipping, and optimizer momentum, the training process can be started.
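The forward pass of the described MLP can be sketched in plain Python. This is an inference-time illustration only: dropout is the identity at inference and is therefore omitted, LayerNorm is shown without learnable scale and shift, and the layer sizes and random initialization are hypothetical.

```python
import math
import random


def linear(x, w, b):
    """y = W x + b, with W given as a list of rows (one row per output unit)."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + bj for row, bj in zip(w, b)]


def relu(x):
    return [max(0.0, v) for v in x]


def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def softmax(z):
    m = max(z)                                   # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def policy_forward(state, params):
    """input -> hidden -> ReLU -> LayerNorm -> output -> Softmax."""
    h = layer_norm(relu(linear(state, params["w1"], params["b1"])))
    return softmax(linear(h, params["w2"], params["b2"]))


def init_params(n_in, n_hidden, n_out, seed=0):
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {"w1": mat(n_hidden, n_in), "b1": [0.0] * n_hidden,
            "w2": mat(n_out, n_hidden), "b2": [0.0] * n_out}


# A 3-dimensional toy state vector mapped to probabilities over the four actions.
probs = policy_forward([0.2, -0.1, 0.7], init_params(3, 8, 4))
```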
[0045] Training execution: Initialize the model parameters, load the pre-trained parameters of the state encoder, iterate over the training set and calculate the loss, and update the parameters through backpropagation; evaluate the action prediction accuracy on the validation set after each training round, and trigger early stopping if the accuracy does not improve for 3 consecutive rounds; save the parameters with the highest validation accuracy as the cold-start model parameters, and require the accuracy on the test set to be no less than 80%.
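The early stopping rule (stop after 3 consecutive rounds without improvement, keep the best checkpoint) can be sketched as follows. The function takes a precomputed list of validation accuracies instead of running real training, and the accuracy values in the example are made up for illustration.

```python
def early_stopping(val_accuracies, patience=3):
    """Return (best_epoch, best_accuracy, stopped_epoch) for a run of
    per-epoch validation accuracies, using 0-based epoch indices."""
    best_acc, best_epoch, stale = float("-inf"), -1, 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_acc, best_epoch, stale = acc, epoch, 0   # new best: save checkpoint here
        else:
            stale += 1
            if stale >= patience:
                return best_epoch, best_acc, epoch        # stop training early
    return best_epoch, best_acc, len(val_accuracies) - 1


result = early_stopping([0.70, 0.78, 0.82, 0.81, 0.82, 0.80, 0.79])
```

Note that a tie with the best accuracy does not reset the counter in this sketch; only a strict improvement does.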
[0046] The cold-start model only has manually labeled decision logic. Next, it needs to be optimized online in a real-world environment using the PPO algorithm. In one illustrated embodiment, the PPO-Clip algorithm can be used to address the instability issue in traditional policy gradient training. The steps are as follows: Training environment setup: Build an intelligent agent task execution and memory system environment that is consistent with the real scene, with six major functions: task generation, intelligent agent interaction, memory system simulation, state perception, reward calculation, and data recording, to ensure that the model's learning strategy can be directly transferred to the actual scene.
[0047] The dual-network structure consists of a policy network (Actor) which is a memory management decision-making model with initial parameters that are cold-start parameters, and outputs action probabilities; and a value network (Critic) with the same structure as the policy network, which outputs state value estimates and evaluates the effects of actions.
[0048] Hyperparameter settings: Set parameters such as the batch size, the policy network learning rate, the value network learning rate (linear decay), the discount factor, the advantage function coefficient, the experience pool size, the number of samples per update, and the clip threshold that limits the policy update magnitude.
[0049] Online interactive training execution: Initialize the network parameters and the experience pool; generate initial tasks in the environment; collect state data to generate the current memory state vector; sample an action from the action probability distribution output by the policy network and execute it; the environment feeds back the reward and the next state; store the tuple "state, action, reward, next state, task completion flag" in the experience pool until the number of stored samples reaches the number of samples per update; extract samples from the experience pool; calculate state values using the value network; calculate action advantage values using generalized advantage estimation (GAE); calculate the cumulative reward using the discount factor; input the samples into the dual networks in batches; update the policy network (PPO-Clip loss + entropy loss) and the value network (mean squared error loss); train on each batch for a preset number of rounds; evaluate the model using a test task set by calculating the average cumulative reward, task success rate, memory hit rate, and average number of steps to completion; save the optimal parameters; and repeat the above steps. If the average cumulative reward does not improve for a preset number of rounds and the task success rate and memory hit rate reach preset standards, the policy is considered to have converged and training stops. Through the above steps, a trained memory management decision model is obtained, which can make optimal memory management decisions based on the agent's real-time state.
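Two of the calculations named above, generalized advantage estimation (GAE) and the PPO-Clip objective, can be sketched for a single trajectory as follows. The default values of `gamma`, `lam`, and `clip_eps` are common illustrative choices and are not taken from this specification; the entropy term and the value loss are omitted.

```python
import math

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one trajectory.

    `values` must have len(rewards) + 1 entries; the last entry is the
    bootstrap value of the state reached after the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    # Targets for the value network's mean squared error loss.
    returns = [a + v for a, v in zip(advantages, values[:-1])]
    return advantages, returns

def ppo_clip_objective(new_logp, old_logp, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate objective (to be maximized)."""
    ratio = math.exp(new_logp - old_logp)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

The clipping means that once the probability ratio leaves the interval [1 - clip_eps, 1 + clip_eps] in the direction favored by the advantage, the sample contributes no further gradient, which is how the policy update magnitude is limited.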
[0050] It is worth noting that the above management decision model can be used as another model independent of the large model of the agent, as part of the execution module of the agent, to manage the above four memory systems; of course, the above management decision model can also be the large model of the agent itself, which is not limited in this specification.
[0051] When the aforementioned management decision model is the large model of the agent itself, there is no need to train a separate, independent PPO model during the training phase. Instead, the reward signals for memory management (task success rate, memory hit rate, execution efficiency) are integrated into the large model's RLHF training process. In the cold start phase, manually labeled "state → optimal action" samples are collected, and the large model learns the basic memory decision-making logic through supervised fine-tuning (SFT). In the online optimization phase, after the large model performs a memory operation, the system calculates the reward value according to the reward function and updates the large model parameters through a PPO algorithm (as used in reinforcement learning from human feedback), optimizing its decision preferences so that its decisions are better aligned with the "high reward" goal. Reward signal injection: the reward value is either converted into natural language feedback (such as "choose to write knowledge-based memory, improve subsequent task hit rate, reward +0.2") or passed directly through model gradient updates to guide the large model to learn the optimal decision strategy. Instead of designing a separate state encoder, the system organizes state information such as "current task description, most recent K rounds of dialogue, and memory bank status (capacity utilization, recent memory summary)" into natural language text in a fixed format and inputs it into the large model as part of the prompt. After the large model outputs the action number, the system parses it and executes the corresponding memory operation, connecting decision-making with execution. This naturally links the L1 layer context (the model's own context window), L2 layer retrieval (through tool calls to the interface), and L2→L4 fine-tuning (updating the model's own parameters), forming an end-to-end memory management closed loop.
[0052] Once the trained memory management decision model is deployed to the agent memory management system (or the agent's large model itself is trained through RLHF to make the above memory management decisions), it can execute core management decisions such as current memory state vector acquisition, memory read and write management, cross-level memory transfer, full lifecycle management of the external knowledge base layer, and cross-session reuse of trial-and-error experience, thereby realizing full-process memory management of the agent. Figure 2 is a flowchart of a memory management method for an intelligent agent provided in an exemplary embodiment, wherein the executing entity of the method is a computing device on which the aforementioned intelligent agent is deployed. The method specifically includes: Step 202: Obtain the current memory state vector, wherein the memory state vector is a fixed-dimensional state vector obtained by the state encoder by concatenating the current task state information and the state information of the memory system and mapping them through the encoder. The current memory state vector is the core input of the memory management decision model and is generated in each interaction round (decision step) of the agent. To ensure that the information is current and complete, the system collects in real time the task state information (such as the current task description, task progress, and execution status) and the memory system state information (such as the history of the most recent K rounds of dialogue, the memory bank capacity utilization rate, a summary of the most recent N memory entries, the current interaction round, the memory bank hit rate, and the proportion of knowledge-based / trial-and-error memory).
[0053] Next, the memory state vector is constructed: the collected state information of the current task and the state information of the memory system are input into the state encoder to generate a feature vector of a preset dimension. After the numerical information (such as capacity utilization) is normalized, the state information is passed through an embedding layer and concatenated to obtain the original feature vector. The original feature vector is then input into a lightweight Transformer encoder, where features are fused using a multi-head self-attention mechanism. After CLS pooling, a fixed-dimensional current memory state vector is generated and input into the memory management decision model.
[0054] Step 204: Based on the current memory state vector, a pre-trained memory management decision model is invoked to perform memory read / write management on the agent. The current memory state vector is input into the memory management decision model, which outputs the probabilities of each action. The action with the highest probability is selected for execution, thereby achieving intelligent memory read / write. Specifically, the above memory read / write management includes at least one of the following steps 2041 to 2044: Step 2041: Based on the information in the current context layer, summarize it into knowledge-based memory and write it into the external knowledge base layer.
[0055] First, core knowledge with universal value, such as objective facts, formulas, theorems, and fixed processes, is extracted from the L1 layer, excluding temporary and personalized information. Next, the extracted knowledge is summarized and organized, redundancy is removed, and errors are corrected to generate knowledge-based memory entries conforming to the L2 layer storage format. For example, a UUID memory ID is generated, the content of the memory entry is encoded as a content vector, the memory type is set to "knowledge-based," the creation time / last access time is set to the current time, the access count is set to 0, the default TTL value is set to 30 days, and the associated metadata is filled in. Optionally, before the above knowledge-based memory entries are written to the L2 layer, a semantic similarity search can be performed against the existing memory entries in the L2 layer. A newly generated knowledge-based memory entry is written to the L2 layer only if its similarity to the existing entries is below a preset threshold, which prevents redundant storage of synonymous information in the L2 layer and saves storage space in the knowledge base. After writing, the memory bank statistics should also be updated and the operation information recorded for subsequent reward calculations.
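The entry fields and the optional pre-write similarity check can be sketched as follows. The sketch assumes that an entry counts as a duplicate when its cosine similarity to an existing entry of the same type is at or above the threshold, consistent with the rule stated for trial-and-error entries in [0060]; the class name `MemoryEntry`, the function `write_if_novel`, and the threshold value 0.9 are hypothetical.

```python
import math
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class MemoryEntry:
    content_vector: list
    text: str
    memory_type: str = "knowledge-based"
    ttl_days: int = 30  # default TTL for knowledge-based entries
    memory_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: float = field(default_factory=time.time)
    last_access: float = field(default_factory=time.time)
    access_count: int = 0
    metadata: dict = field(default_factory=dict)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def write_if_novel(store, entry, dup_threshold=0.9):
    """Append `entry` to `store` unless a same-type entry is too similar.

    Returns True if the entry was written, False if it was skipped.
    """
    for existing in store:
        if (existing.memory_type == entry.memory_type
                and cosine(existing.content_vector, entry.content_vector) >= dup_threshold):
            return False
    store.append(entry)
    return True
```

A production L2 layer would typically use a vector index rather than a linear scan; the list here only illustrates the decision rule.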
[0056] Step 2042: Identify the "failure → success" trial-and-error trajectory during the execution of the agent's task from the current context layer, extract and encode the trial-and-error trajectory into trial-and-error memory entries, and write them into the external knowledge base layer.
[0057] The extracted and encoded data are used to create structured memory entries containing scenario description vectors, failure action sequences, error type labels, success strategy summaries, and applicable condition constraints, which are then written into the external knowledge base layer.
[0058] Specifically, the process first detects the "failure → success" transition point in the L1 layer. Failure is defined as output not meeting requirements or a significant error occurring, while success is defined as completing the task objective and achieving the desired result. Next, the complete trial-and-error trajectory is extracted and encoded. This trajectory can include a sequence of failed actions and a summary of successful strategies. It can also include error type labels (such as logical errors or parameter errors) and task scenario information. After extracting the textual information from the trial-and-error trajectory, structured encoding is performed: scenario information is encoded as a scenario description vector, associated with failed actions and error labels, a summary of successful strategies is summarized, applicable conditions and constraints are defined and labeled. Then, trial-and-error type memory entries can be constructed: for example, a UUID memory ID is generated, scenario information and successful strategies are concatenated and encoded as a content vector, the memory type is set to "trial-and-error," the creation time / last access time is the current time, the access count is 0, the default TTL value is 90 days, and associated metadata is filled in.
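A minimal sketch of detecting the "failure → success" transition point and assembling the structured fields is given below. The step format (dictionaries with `action`, `status`, and `error_type` keys) and the function name are assumptions made for illustration; vector encoding and applicable-condition labeling are omitted.

```python
def extract_trial_and_error(steps):
    """Return the first failure→success trajectory found in `steps`, or None.

    Each step is a dict with keys 'action' and 'status' ('fail' or
    'success'), plus an optional 'error_type' label on failed steps.
    """
    failed = []
    for step in steps:
        if step["status"] == "fail":
            failed.append(step)
        elif step["status"] == "success" and failed:
            return {
                "memory_type": "trial-and-error",
                "failed_actions": [s["action"] for s in failed],
                "error_types": sorted({s.get("error_type", "unknown") for s in failed}),
                "success_strategy": step["action"],
                "ttl_days": 90,  # default TTL for trial-and-error entries
            }
    # No completed failure→success transition yet.
    return None
```

Returning `None` for a trajectory that has failures but no subsequent success leaves room for the caller to hold the incomplete path until a successful path is available.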
[0059] For the verified "failure → success" complete trial-and-error trajectory in the current task, after being structured and encoded into the L2 layer, the agent can directly retrieve and reuse this trial-and-error memory when performing the same scenario / type task in the future, without having to repeatedly explore the failed path (such as logical errors, parameter errors, missing steps, etc.), greatly reducing the number of invalid actions. Trial-and-error memory entries are stored in the L2 layer with a preset standardized structure (scenario vector, error type label, success strategy summary, applicable conditions, etc.), getting rid of the temporary memory characteristic of the L1 layer that is "lost when the session ends". This achieves the persistence, retrieval, and cross-session reuse of trial-and-error experience, which is one of the core advantages of this specification, ultimately forming a positive cycle of "trial and error → memory writing → cross-session reuse → reduced trial and error". Writing trial-and-error memory entries eliminates the need for manual annotation of trial-and-error experiences. The system automatically identifies the "failure → success" trajectory, structures and encodes it, and writes it into storage. This eliminates the reliance on manual summarization of trial-and-error experiences and enables end-to-end autonomous learning for intelligent agents in actual deployment scenarios, allowing them to "execute tasks, accumulate experience, and optimize capabilities simultaneously." This significantly improves the practicality and cost-effectiveness of deployment.
[0060] Similarly, before writing the aforementioned trial-and-error memory entries into the L2 layer, semantic similarity retrieval can be performed in the L2 layer. Only when the similarity with the existing trial-and-error memory entries in the L2 layer is less than a preset threshold can the entries be written into the L2 layer. After writing, the memory bank statistics should be updated and the operation information should be recorded for subsequent reward calculation.
[0061] It is worth noting that for failure paths that are not fully resolved in the current task, the write operation can temporarily store the incomplete path, and then complete it after the successful path is obtained later before writing it. This avoids erroneous decisions caused by "fragmented failure information" and ensures the accuracy and robustness of the memory content.
[0062] Step 2043: Do not perform a write operation, continue task execution.
[0063] A no-write operation is performed when the L1 layer contains only temporary information and no valid knowledge or trial-and-error trajectory, when the agent's task is executing normally and does not require memory retrieval, or when the memory bank capacity utilization is too high. In this case, the memory system state remains unchanged, the agent continues to execute the task, and only the state of the current decision step and the model action are recorded for subsequent policy evaluation. For example, if no-write is performed for 5 consecutive rounds, the agent's task state is re-evaluated.
[0064] Step 2044: Read the memory entries related to the current task from the external knowledge base layer and inject them into the current context layer.
[0065] Specifically, the current task description can be input into the text encoder to generate a task retrieval vector; in the L2 layer, a cosine similarity algorithm is used to calculate the similarity between the retrieval vector and the content vectors of all memory entries; candidate memory entries with similarity greater than a preset threshold are filtered and sorted by weighted "similarity + access frequency"; one or more entries with the highest weighted scores are selected, and all or part of their original text information is injected into the L1 layer; in the L2 layer, the last access time of the memory entry is updated to the current time, the access count is incremented by 1, and the TTL value is dynamically adjusted according to the access frequency; operation information is recorded for subsequent reward calculation.
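The read path described above (cosine similarity filtering, ranking by a weighted combination of similarity and access frequency, and updating access statistics) can be sketched as follows. The particular weighting, `similarity + freq_weight * log(1 + access_count)`, and all default values are assumptions, since this specification does not fix the weighting formula; the dynamic TTL adjustment is omitted.

```python
import math
import time

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, store, sim_threshold=0.5, top_k=2, freq_weight=0.1, now=None):
    """Return the top_k entries whose similarity exceeds sim_threshold.

    `store` is a list of dicts with keys 'id', 'vector', 'access_count',
    and 'last_access'. Access statistics of returned entries are updated.
    """
    now = time.time() if now is None else now
    scored = []
    for entry in store:
        sim = cosine(query_vec, entry["vector"])
        if sim > sim_threshold:
            # Assumed weighting of "similarity + access frequency".
            score = sim + freq_weight * math.log1p(entry["access_count"])
            scored.append((score, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    hits = [entry for _, entry in scored[:top_k]]
    for entry in hits:
        entry["last_access"] = now
        entry["access_count"] += 1
    return hits
```

The logarithm damps the influence of very frequently accessed entries so that access frequency cannot completely override semantic similarity; other weightings would fit the description equally well.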
[0066] This implementation does not limit whether the memory entries injected into the L1 layer are knowledge-based or trial-and-error-based. When injecting knowledge-based memory entries, for example, objective knowledge verified in step 2041 during previous task execution is summarized and written into L2. During the current task execution, the objective knowledge verified in the L2 layer is accurately injected into the current context. The agent does not need to re-derive / verify basic knowledge (such as formulas, rules, industry standards) and can directly reuse mature knowledge to complete the current task, avoiding task failure due to knowledge blind spots or derivation errors. It also avoids L1 layer context redundancy and reduces the computational overhead of context processing in large models. During the memory entry reading process, high-frequency accessed high-quality memories are prioritized to ensure that the knowledge injected into the current task is a newer and more accurate version, avoiding the use of outdated / incorrect knowledge in the original parameters of large models (such as outdated industry policies or invalid calculation formulas).
[0067] When trial-and-error memory entries are injected, the agent reads these entries from the L2 layer. These entries contain trial-and-error experiences from past tasks related to the current task. They include both failed action sequences from past tasks and successful strategies for task execution, thus saving the agent the cost and time of trial and error and directly providing reusable successful experiences, thereby improving the efficiency of task execution. Furthermore, the trial-and-error memory entries include standardized error type labels (e.g., logical errors, parameter errors, missing steps, etc.). After being written, these labels can help the agent establish an association mapping of "error type - scenario - solution". By reading the trial-and-error memory entries as described in step 2044, the agent can quickly identify similar error symptoms in the current task and avoid risks in advance, thereby helping the agent reduce trial and error costs when executing the current task.
[0068] The action execution of the memory management decision-making model is not a single closed loop, but rather a continuous iteration of "action execution → reward calculation → policy update" to dynamically optimize the model's decision-making capabilities. After the model executes any of the actions A1 (writing knowledge-based memory), A2 (writing trial-and-error trajectory), A3 (not writing), or A4 (reading memory), the system generates a quantified reward signal based on the action execution effect. Then, it uses a reinforcement learning algorithm to update the model parameters in reverse, allowing the model to gradually learn "high-reward action preferences," improving the accuracy and effectiveness of memory management decisions, and ultimately adapting to the agent's actual task execution scenarios. The retraining and updating of the memory management decision-making model is implemented based on the PPO (Proximal Policy Optimization) algorithm. The core is to use reward signals to adjust the model's action selection probability, gradually increasing the execution probability of high-reward actions and gradually decreasing the execution probability of low-reward actions.
[0069] As described above, after each action (A1, A2, A3, A4) is completed, the system can immediately calculate an instant reward related to the effectiveness of the action execution. The instant reward combines three components: a task success reward, a task efficiency reward, and a memory value reward. The task success reward takes one value if the current task succeeds and another if it fails. The task efficiency reward is usually expressed as 1/T, the reciprocal of the number of interaction rounds T between the large model and the user during task completion. The memory value reward is usually expressed as the value contribution of the memory entries affected by actions A1 to A4. For example, for action A1 or A2, it can be expressed as the number of times H that the memory entry is retrieved and hit in the 30 days after it is stored in the L2 layer. For action A3, it can be expressed in terms of the number of times subsequent tasks fail to retrieve a needed memory entry because of the no-write decision (for example, no relevant memory is available to call), and is 0 otherwise. For action A4, it can be expressed as the "effective utilization rate" of the read memory entries in the current task.
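One possible way to combine the reward components discussed above is a weighted sum, sketched below. The weighted-sum form, the weights, and the success and failure values (1.0 and 0.0) are assumptions chosen for illustration only; the memory value term is supplied by the caller because its definition depends on the action.

```python
def immediate_reward(task_success, rounds, memory_value,
                     weights=(1.0, 0.5, 0.5),
                     success_reward=1.0, failure_reward=0.0):
    """Combine task success, task efficiency (1/T), and memory value.

    All numeric defaults are illustrative assumptions.
    """
    if rounds < 1:
        raise ValueError("rounds (T) must be at least 1")
    r_task = success_reward if task_success else failure_reward
    r_eff = 1.0 / rounds          # reciprocal of interaction rounds T
    w_task, w_eff, w_mem = weights
    return w_task * r_task + w_eff * r_eff + w_mem * memory_value

# Example: a successful task finished in 4 rounds with memory value 0.6.
example = immediate_reward(True, 4, 0.6)
```

Because some memory value terms (such as the 30-day hit count H) are only observable later, a practical system would likely fill in or correct this component after the fact.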
[0070] After the preset reward calculation phase ends, the total reward value of each action in this reward calculation phase is calculated based on the above reward details, serving as the core basis for model updates. Model retraining can adopt a combination of "batch triggering + real-time triggering" to ensure a balance between update efficiency and model stability. For example, as the agent's task is executed and the amount of action-reward data collected in the data pool reaches a preset threshold, batch retraining is automatically triggered; or, if the reward value of a single action-reward data reaches a preset high reward or a preset high penalty, a single real-time retraining is immediately triggered to quickly strengthen or correct the model's decisions; or, if the amount of data in the data pool has not reached the batch trigger threshold, but more than a preset time has passed since the last retraining, supplementary retraining is triggered to avoid the backlog of experience data.
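The three retraining triggers described above can be sketched as a single decision function. The threshold values and the order in which the conditions are checked (real-time first, then batch, then supplementary) are assumptions; this specification does not state a priority among them.

```python
def retraining_trigger(pool_size, latest_reward, seconds_since_last,
                       batch_threshold=256, high_reward=1.0,
                       high_penalty=-1.0, max_interval=86400.0):
    """Return 'realtime', 'batch', 'supplementary', or None.

    All threshold defaults are illustrative assumptions.
    """
    # A single very good or very bad outcome triggers an immediate update.
    if latest_reward >= high_reward or latest_reward <= high_penalty:
        return "realtime"
    # Enough action-reward data has accumulated in the data pool.
    if pool_size >= batch_threshold:
        return "batch"
    # Too much time has passed and there is unused experience data.
    if pool_size > 0 and seconds_since_last > max_interval:
        return "supplementary"
    return None
```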
[0071] The training and updating process of the memory management strategy model based on the PPO algorithm in this specification is similar to the training process described above and will not be repeated here. Figure 3 is a flowchart illustrating a reinforcement learning-based memory read / write management process in an exemplary implementation. As shown in Figure 3, through closed-loop optimization of reward feedback, the policy model gradually learns the optimal action selection in different scenarios and reduces the execution frequency of low-value actions such as invalid writing and erroneous reading, thereby increasing the value that the memory system brings to the agent in performing tasks and allowing the agent's memory management efficiency and task execution capability to improve together.
[0072] Figure 4 illustrates the data migration process of a memory system provided in an illustrative embodiment of this specification. In addition to summarizing and extracting information from the current context layer into knowledge-based or trial-and-error memory entries and writing them into the external knowledge base layer in steps 2041 and 2042, the aforementioned agent memory management method further includes step 206: after the current task's session ends, a pre-trained memory management decision model is invoked to extract session information from the current context layer, and the extracted information is integrated into knowledge-based or trial-and-error memory entries for archiving in the external knowledge base layer.
[0073] Specifically, processing strategies such as session archiving and recursive merging can be used to ensure that valuable information from layer L1 is migrated to layer L2. Session archiving strategy: It can be triggered after the session ends. It scans the session interaction history of L1 layer. The memory management decision model can filter valuable information, compress and structure it into text, construct corresponding memory entries and write them into L2 layer, and finally clear L1 layer; if there is no valuable information, only L1 layer is cleared.
[0074] Recursive merging strategy: This strategy is triggered when the number of memory entries of the same type and task scenario in the L2 layer exceeds a preset merging threshold. For example, similar entries can be clustered using the K-Means algorithm to abstract common patterns and high-level strategies layer by layer, merging multiple specific entries into one strategic memory entry (TTL value of 180 days). The original entry is then deleted, and the new entry is written to the L2 layer, achieving information compression and redundancy elimination. The aforementioned strategic memory entry is an abstract and universal strategy formed by extracting common patterns and general logic from multiple specific memories (which may be multiple knowledge-based memories, multiple trial-and-error memories, or a mixture of both), rather than a specific trial-and-error trajectory or isolated knowledge. Its structure can follow the standardized requirements of knowledge-based memory entries (including content vectors, original text, applicable condition constraints, etc.), supplementing the source information of "abstracted from N specific memories" in the associated metadata, without the need for fields such as "failure action sequence" and "error type label" specific to trial-and-error memories. The core function of the trial-and-error memory entries described in this specification is to avoid known failure paths by relying on the strong association between "specific scenario - failed action - successful strategy"; while the core function of the strategic memory entries is to provide general solutions, such as extracting general strategies for improving marketing campaign retention from 10 trial-and-error memory entries from different e-commerce activities.
[0075] Furthermore, the above-mentioned memory management method for intelligent agents also includes step 208: filtering target memory entries in the external knowledge base layer whose access frequency is higher than a set threshold and whose content change rate is lower than a set threshold, and performing lightweight parameter fine-tuning of the large model based on the target memory entries when the preset triggering conditions are met.
[0076] The dynamic updating of the "access count and last access time" of memory entries during the L2 layer reading process described in step 2044 provides the core basis for the migration of memory information from the external knowledge base layer to the large model parameter layer—memory entries with high frequency of access and strong stability will be preferentially selected as migration candidates and solidified into the inherent capabilities of the large model parameter layer; the triggering timing of the above-mentioned L2 layer and L4 layer memory migration is determined by the optimal strategy of reinforcement learning training, comprehensively balancing the migration benefits and computational costs, and the specific steps are as follows: Step 2081: The core logic for determining the trigger timing.
[0077] The triggering of L2→L4 transfer usually requires two prerequisites: First, the memory itself has solidification value: the memory entries have been verified over a long period of time, are frequently reused, and have stable content (without frequent modifications), and solidification can significantly improve the model's native capabilities; Second, the transfer execution has "cost-effectiveness": the benefits of transfer (task success rate / efficiency improvement) are higher than the computational cost (fine-tuning computing power / time cost), and it does not affect the normal task execution of the agent.
[0078] Based on this, the determination of trigger timing adopts a two-step method of "quantitative index screening of candidate memories + decision migration timing" to avoid false triggering or missed triggering caused by a single threshold. First, by monitoring the L2 layer, all memory entries are periodically scanned to screen out memory entries that meet the following two core quantitative indicators as target memory entries for candidate migration: access frequency is higher than a set threshold and content change rate is lower than a set threshold (the above two indicators can be defined according to the specific computing resource configuration); access frequency is used to verify the reuse value of memory entries. A high access frequency indicates that the task performed by the agent urgently needs the entry. For example, the access frequency can be defined as "the number of times it has been retrieved and injected into the L1 layer in the past 30 days"; content change rate is used to verify the stability of memory. Low change indicates that the content is mature. The content change rate can be defined as "the number of times the core content of the memory entry has been modified in the past 30 days ÷ the original text length".
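The two quantitative screening indicators can be sketched as follows, using the example definitions given above: access frequency as the number of retrievals in the past 30 days, and content change rate as the number of modifications in the past 30 days divided by the original text length. The field names and both threshold values are hypothetical.

```python
def select_migration_candidates(entries, min_access=20, max_change_rate=0.01):
    """Return IDs of entries that pass both screening indicators.

    Each entry is a dict with keys 'id', 'text', 'accesses_30d', and
    'modifications_30d'. Thresholds are illustrative assumptions.
    """
    candidates = []
    for entry in entries:
        text_length = max(len(entry["text"]), 1)  # guard against empty text
        change_rate = entry["modifications_30d"] / text_length
        if entry["accesses_30d"] > min_access and change_rate < max_change_rate:
            candidates.append(entry["id"])
    return candidates
```

Further indicators, such as retention time or a confidence score, could be added as extra conditions in the same loop.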
[0079] In addition to the two core quantitative indicators mentioned above, those skilled in the art can also set other types of screening indicators according to specific task requirements to obtain the memories that need to be solidified into the capabilities of the large model itself from the L2 layer. For example, the retention time of memory items (avoiding short-term, high-frequency memories with no long-term value), the number of memory items of the same category (avoiding model bias caused by solidifying a single memory), and the confidence score used to verify the effectiveness of the memory (the proportion of tasks that succeed after injecting the memory). A high confidence score indicates that the average success rate of the tasks associated with the memory item is high, and so on.
[0080] After target memory entries are selected as transfer candidates, the L2→L4 memory transfer process can be triggered when preset transfer trigger conditions are met. The preset transfer trigger conditions usually need to balance transfer benefits against computational costs. For example, this balance can be achieved by setting a transfer cycle (such as every 30 days), or by using a policy model trained by reinforcement learning (such as the memory management decision model) to determine whether to transfer immediately.
[0081] The specific logic of the strategy model's judgment is as follows. First, the input of the memory management decision model is obtained from the state space. This input can integrate one or more of the following items of state information: the value of the target memory entries (such as the average access frequency, confidence score, and number of covered task scenarios), the system resource status (such as the current hardware computing load and the peak periods of agent task execution), the migration history effects (the improvement in model task success rate and execution efficiency after the last migration), and the computational cost estimate (such as the estimated LoRA fine-tuning time based on the number of candidate memories and the model size). This state information is used to determine whether it is suitable to perform the migration. Next, based on the input information obtained above, the strategy model is invoked to produce its output. For example, the output can be to trigger the migration immediately (current resources are sufficient, the candidate value is high, and it is an off-peak period); to delay the trigger by one scan cycle (current resources are tight or it is a peak period, so the decision is re-evaluated in the next scan cycle); or to not trigger the migration for the time being (the value of the target memory entries is insufficient or the effect of the last migration did not meet the standard, so observation for another 1-2 scan cycles is required). The training objective of the strategy model is to maximize the "net transfer benefit". The reward function can be defined as the weighted difference between the benefit and the cost, but this specification does not impose a particular limitation.
[0082] Step 2082: Construct a fine-tuned training set. Extract the core information of the target memory entries from the above candidate transfers, such as constructing a supervised training sample of <task description, main text of the target memory entry, target output>. After cleaning, divide it into a training set and a validation set, convert it into JSONL standard format, and the total number of samples should not be less than 1000.
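A minimal sketch of building the fine-tuning set is shown below: samples with empty fields are removed, the remainder is split into training and validation sets, and each set is serialized as JSONL (one JSON object per line). The key names `task`, `memory`, and `target`, the 10% validation fraction, and the function name are assumptions; only the minimum of 1000 samples comes from the text above.

```python
import json
import random

def build_finetune_sets(samples, val_fraction=0.1, min_samples=1000, seed=0):
    """Clean, split, and serialize <task, memory, target> samples as JSONL.

    Returns (train_jsonl, val_jsonl) as strings.
    """
    required = ("task", "memory", "target")
    # Cleaning step: drop samples with a missing or blank field.
    cleaned = [s for s in samples if all(s.get(k, "").strip() for k in required)]
    if len(cleaned) < min_samples:
        raise ValueError(f"need at least {min_samples} samples, got {len(cleaned)}")
    rng = random.Random(seed)
    rng.shuffle(cleaned)
    n_val = max(1, int(len(cleaned) * val_fraction))
    val, train = cleaned[:n_val], cleaned[n_val:]

    def to_jsonl(rows):
        return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)

    return to_jsonl(train), to_jsonl(val)
```

Real cleaning would likely also deduplicate samples and check the target outputs for correctness, which is outside the scope of this sketch.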
[0083] Step 2083: Based on the fine-tuning training set, perform lightweight parameter fine-tuning on the parameter layer of the large model through the momentum state layer of the optimizer.
[0084] The L3 layer (optimizer momentum state layer) is the "training aid core" for the L2→L4 memory solidification transfer. In a specific embodiment, the momentum state of the Adam optimizer can be used in conjunction with LoRA (Low-Rank Adaptation) technology to achieve lightweight fine-tuning of the L4 layer (large model parameter layer), which ensures the memory solidification effect while avoiding the high cost and risk of full fine-tuning. The specific process is as follows: 1. Preparations before fine-tuning: Initialization of the L3 layer state, and parameter locking and fine-tuning configuration of the L4 layer.
[0085] L3 layer momentum state loading: Before fine-tuning begins, the core states of the L3 layer—the first-order momentum mt and second-order momentum vt of the Adam optimizer—are initialized or loaded. If it is the first fine-tuning, both mt and vt are set to 0; if it is incremental fine-tuning (supplementing new memories based on historical fixed memories), the mt and vt saved at the end of the last fine-tuning are loaded to ensure the continuity of the training state and avoid parameter update oscillations.
[0086] L4 Layer Parameter Locking and LoRA Configuration: Lock the core parameters of the L4 layer model (such as the weights of the embedding layer and fully connected layer), and only open the attention layer as the target for fine-tuning; configure the core parameters of LoRA—for example, the rank r of the low-rank matrix and the scaling factor. Insert two low-rank matrices into the attention layer (matrix A: dimension d×r, matrix B: dimension r×d). Fine-tuning only updates the parameters of matrices A and B, while the core core parameters remain unchanged.
[0087] 2. Fine-tuning execution: L3 layer momentum guidance parameter update.
[0088] Mini-batch iterative training: The training set is input into the large model in batches. The model generates output based on the current parameters, and the loss (cross-entropy loss) is calculated against the "target output" of the training set. Backpropagation is used to obtain the gradient g of the attention layer parameters.
Dynamic update and gradient adjustment of the momentum state in the L3 layer: At each gradient step (one parameter update), the mt and vt of the L3 layer are updated with the gradient. The first-order momentum mt records the exponential moving average of the gradient, giving the parameter update "inertia" and avoiding oscillations in the update direction caused by noise in a single batch of data, which makes the parameter update more stable. The second-order momentum vt records the exponential moving average of the squared gradient and adaptively adjusts the learning rate of each parameter: the learning rate of parameters with large gradients (large update fluctuations) is automatically reduced, and the learning rate of parameters with small gradients (requiring fine adjustment) is automatically increased, achieving fine-grained parameter optimization.
[0089] LoRA low-rank parameter update: Using the adjusted gradient (with the adaptive corrections from mt and vt applied), only the parameters of the LoRA low-rank matrices A and B inserted in the attention layer are updated, while the core parameters remain locked, ensuring the "lightweight" nature of fine-tuning.
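The momentum-guided update described above corresponds to the standard Adam rule, sketched below in plain Python over a flat list of trainable values (standing in for the entries of LoRA matrices A and B). The class name `AdamState`, the hyperparameter defaults, and the decision to persist the step counter `t` together with mt and vt are assumptions; the specification only states that mt and vt are initialized to 0 or loaded from the previous fine-tuning.

```python
import math


class AdamState:
    """L3-layer state: first-order momentum m, second-order momentum v, step count t."""

    def __init__(self, size):
        self.m = [0.0] * size  # first fine-tuning: m_t starts at 0
        self.v = [0.0] * size  # first fine-tuning: v_t starts at 0
        self.t = 0             # step counter used for bias correction (assumed to be saved too)


def adam_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update in place to the trainable (LoRA) parameters only."""
    state.t += 1
    for i, g in enumerate(grads):
        # Exponential moving average of the gradient ("inertia").
        state.m[i] = beta1 * state.m[i] + (1 - beta1) * g
        # Exponential moving average of the squared gradient (per-parameter scale).
        state.v[i] = beta2 * state.v[i] + (1 - beta2) * g * g
        # Bias correction for the zero initialization.
        m_hat = state.m[i] / (1 - beta1 ** state.t)
        v_hat = state.v[i] / (1 - beta2 ** state.t)
        # Large recent gradients shrink the effective step; small ones enlarge it.
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params
```

For incremental fine-tuning, the same `AdamState` object would be reloaded rather than recreated, so that the next run continues from the saved momentum values.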
[0090] 3. Fine-tuning termination and parameter fusion.
[0091] Early stopping mechanism trigger: After each training round, the model performance is evaluated using a validation set (accuracy of solidified memory mastery, task success rate). If the validation set performance does not improve for three consecutive rounds, or the accuracy reaches a preset threshold (≥95%), early stopping is triggered to stop fine-tuning and avoid overfitting.
[0092] L3 layer state saving: After fine-tuning, save the current mt and vt of the L3 layer to provide state support for the next incremental fine-tuning.
[0093] LoRA parameters are fused with L4 backbone parameters: The trained LoRA low-rank matrices A and B are fused with the original parameters of the L4 attention layer to obtain the updated complete parameters of the L4 layer. At this point, the candidate memory of the L2 layer has been solidified into the inherent capability of the model and no longer needs to be retrieved.
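The parameter fusion described above can be written as W' = W + s·(A·B), where A is d×r and B is r×d as defined earlier. The sketch below uses plain nested lists; the scale s = alpha / r follows the common LoRA convention and is an assumption here, since the specification mentions a scaling factor without defining it. The function names are hypothetical.

```python
def matmul(x, y):
    """Multiply two matrices given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(w, a, b, alpha):
    """Fuse LoRA matrices into the frozen weight: W' = W + (alpha / r) * (A @ B)."""
    r = len(b)            # rank r = number of rows of B (r x d)
    scale = alpha / r     # assumed scaling convention
    delta = matmul(a, b)  # d x d low-rank update
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]
```

After merging, inference uses only the fused matrix, so the low-rank matrices add no extra computation at run time.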
[0094] 4. Redundancy cleanup: Delete L2 layer memory entries, update memory statistics, optimize the vector database index, and improve retrieval efficiency.
[0095] After the fine-tuning verification is passed, delete the candidate memory entries in the L2 layer that have been successfully solidified to the L4 layer, release storage resources, and avoid redundancy where the same memory is solidified in the L4 layer and stored in the L2 layer.
[0096] Compared to full fine-tuning, the lightweight fine-tuning described in this embodiment only updates the LoRA low-rank matrix parameters, significantly reducing the computational resource consumption during training and noticeably lowering the hardware threshold and time cost. Full fine-tuning may cause the model to forget the general knowledge learned during native training (such as language understanding and logical reasoning), while this solution ensures that the model's original core capabilities are not affected by locking the backbone parameters and only fine-tuning the low-rank matrix. The high-frequency stable memories of the L2 layer are solidified into the native capabilities of the L4 layer after fine-tuning. When the agent performs similar tasks subsequently, it does not need to retrieve memories from the L2 layer and can directly output accurate results, improving task response speed and knowledge application accuracy.
[0097] Furthermore, the momentum state in the L3 layer avoids parameter update oscillations, eliminating the need to increase batch size or extend training epochs to ensure stability, thus further reducing training costs. The mt and vt values stored in the L3 layer record the historical parameter update trends, allowing for incremental fine-tuning based on this state. This avoids capability fluctuations caused by parameter resets, enabling the model's capabilities to gradually accumulate on the original basis, forming a closed loop of continuous evolution.
[0098] In addition to the memory transfer management of the memory system in one or more of the above embodiments, as shown in Figure 5, in another embodiment of this specification, initial attribute setting, semantic deduplication and merging, TTL expiration eviction, and LRU access frequency eviction can also be implemented for memory entries in the external knowledge base layer, covering the entire lifecycle management of memory entries from writing to eviction, to ensure the efficient and low-redundancy operation of the L2 layer memory.
[0099] Initial attribute settings can be executed synchronously when a memory entry is first written to the L2 layer, generating a unique memory ID, setting a memory type label, recording the creation time, initializing the last access time to the creation time and the access count to 0, setting a default TTL value according to the type (30 days for knowledge-based entries and 90 days for trial-and-error entries), filling in associated metadata, partitioning and storing by type, and updating the vector index.
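A minimal sketch of the initial attribute settings, assuming a Python dataclass as the entry record. The field names, the type labels `"knowledge"` and `"trial_and_error"`, and the use of `uuid4` for the memory ID are illustrative choices; the default TTL values (30 and 90 days) come from the text. Partitioned storage and vector index updates are omitted.

```python
import time
import uuid
from dataclasses import dataclass, field

# Default TTL per memory type, in days (values from the description above).
DEFAULT_TTL_DAYS = {"knowledge": 30, "trial_and_error": 90}


@dataclass
class MemoryEntry:
    memory_type: str
    content_text: str
    metadata: dict = field(default_factory=dict)
    memory_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    created_at: float = field(default_factory=time.time)
    last_access_at: float = 0.0
    access_count: int = 0
    ttl_days: float = 0.0

    def __post_init__(self):
        if self.memory_type not in DEFAULT_TTL_DAYS:
            raise ValueError(f"unknown memory type: {self.memory_type}")
        if not self.last_access_at:
            # The last access time is initialized to the creation time.
            self.last_access_at = self.created_at
        if not self.ttl_days:
            self.ttl_days = DEFAULT_TTL_DAYS[self.memory_type]
```

Creating `MemoryEntry("trial_and_error", "...")` therefore yields an entry with an access count of 0 and a 90-day TTL.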
[0100] Semantic deduplication and merging can be performed during periodic scanning of the L2 layer. A cosine similarity algorithm is used to calculate the semantic similarity between memory entries. For multiple memory entries with a similarity greater than the merging threshold, versions with higher access frequency, richer information, and higher confidence can be retained, while other versions are deleted. Alternatively, valid information from other versions can be added to versions with higher access frequency, richer information, and higher confidence, and then other versions are deleted. After merging, the associated metadata of the retained entries is updated, merging information is recorded, and the vector database index is optimized.
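The first variant above (retain the best version, delete the others) can be sketched as a greedy pass over entries ranked by access count and confidence. The entry layout (dicts with `embedding`, `access_count`, `confidence`), the 0.9 merging threshold, and the ranking key are assumptions; merging of valid information from dropped versions and index maintenance are not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def deduplicate(entries, merge_threshold=0.9):
    """Keep the best-ranked entry of each group of near-duplicates."""
    ranked = sorted(entries,
                    key=lambda e: (e["access_count"], e["confidence"]),
                    reverse=True)
    kept = []
    for entry in ranked:
        # Retain an entry only if it is not too similar to any already-kept entry.
        if all(cosine_similarity(entry["embedding"], k["embedding"]) <= merge_threshold
               for k in kept):
            kept.append(entry)
    return kept
```

The pairwise comparison is quadratic in the number of entries; a production system would more likely query the vector database for nearest neighbours instead.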
[0101] The system employs a Time-To-Live (TTL) expiration and eviction mechanism. The TTL value represents the lifespan of a memory entry. During task execution, whenever a memory entry is accessed, in addition to updating its access attributes, the TTL value can be dynamically extended based on the access frequency f: for example, if f < 1 access/day, no extension; if 1 ≤ f < 5, extension by 30%; if 5 ≤ f < 10, extension by 50%; if f ≥ 10, the TTL value is doubled, with no upper limit. The agent system can periodically inspect the L2 layer to check whether the TTL value of a memory entry has expired. If it has expired, the entry is automatically deleted and the memory bank statistics are updated.
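The example extension tiers above can be expressed as a small lookup. The sketch reads the top tier as multiplying the TTL by 2, which is an interpretation of the text, and the function names and the use of seconds-since-epoch timestamps are assumptions.

```python
SECONDS_PER_DAY = 86400


def ttl_extension_factor(accesses_per_day):
    """Multiplier applied to the TTL for a given access frequency f (accesses/day)."""
    if accesses_per_day < 1:
        return 1.0   # no extension
    if accesses_per_day < 5:
        return 1.3   # extend by 30%
    if accesses_per_day < 10:
        return 1.5   # extend by 50%
    return 2.0       # TTL doubled (interpretation of the top tier)


def extend_ttl(ttl_days, accesses_per_day):
    """Return the extended TTL in days."""
    return ttl_days * ttl_extension_factor(accesses_per_day)


def is_expired(created_at, ttl_days, now):
    """True once the entry has outlived its TTL (timestamps in seconds)."""
    return now >= created_at + ttl_days * SECONDS_PER_DAY
```

A periodic scan would call `is_expired` for each entry and delete those for which it returns true.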
[0102] The eviction process is based on LRU (Least Recently Used). When the L2 layer storage utilization exceeds a first threshold, an LRU eviction mechanism can be triggered to save storage space: all memory entries are sorted based on LRU metrics (e.g., last access time), and the least recently accessed entries are evicted one by one until the storage utilization falls below a second threshold. Specifically, frequently accessed and unexpired core entries can be retained during the eviction process to avoid mistakenly deleting high-value memories.
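A minimal sketch of the two-threshold LRU eviction, assuming each entry is a dict with a `size` and a `last_access_at` timestamp. The threshold defaults and the `is_protected` callback, which stands in for the rule that frequently accessed, unexpired core entries are retained, are hypothetical.

```python
def lru_evict(entries, capacity, first_threshold=0.9, second_threshold=0.7,
              is_protected=lambda entry: False):
    """Evict least recently used entries once utilization exceeds first_threshold.

    Returns (kept, evicted). Eviction stops when utilization falls below
    second_threshold or when no unprotected entries remain.
    """
    used = sum(e["size"] for e in entries)
    if used / capacity <= first_threshold:
        return list(entries), []  # below the trigger: nothing to do
    evicted = []
    # Oldest last-access time first.
    for entry in sorted(entries, key=lambda e: e["last_access_at"]):
        if used / capacity < second_threshold:
            break
        if is_protected(entry):
            continue  # keep high-value core entries
        used -= entry["size"]
        evicted.append(entry)
    evicted_ids = {id(e) for e in evicted}
    kept = [e for e in entries if id(e) not in evicted_ids]
    return kept, evicted
```

Using a second threshold lower than the first gives hysteresis, so a single eviction pass frees enough space that the next write does not immediately trigger another pass.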
[0103] Figure 6 is a schematic structural diagram of an electronic device provided in an exemplary embodiment. Referring to Figure 6, at the hardware level, the device includes a processor 402, an internal bus 404, a network interface 406, memory 408, and non-volatile memory 410, and may also include other necessary hardware. One or more embodiments of this specification can be implemented in software; for example, the processor 402 reads the corresponding computer program from the non-volatile memory 410 into memory 408 and then runs it. Of course, in addition to software implementation, one or more embodiments of this specification do not exclude other implementation methods, such as logic devices or a combination of hardware and software. That is to say, the execution subject of the following processing flow is not limited to logic units, but can also be hardware or logic devices.
[0104] Figure 7 illustrates a memory management device 70 for an intelligent agent according to an exemplary embodiment of this specification. The device 70 can be applied to, for example, the electronic device shown in Figure 6 to implement the technical solution described in this specification. The device 70 includes: an acquisition unit 702, used to acquire the current memory state vector, wherein the memory state vector is encoded based on the state information of the current task and the state information of the memory system; and a memory read / write unit 704, used to invoke a pre-trained memory management decision model based on the current memory state vector to perform memory read / write management of the agent, wherein performing the memory read / write management includes at least one of the following steps A1 to A4: Step A1: Based on the information in the current context layer, summarize the information into knowledge-based memory entries and write them into the external knowledge base layer; Step A2: Identify the trial-and-error trajectory of the agent from failure to success during task execution from the current context layer, extract and encode the trial-and-error trajectory into trial-and-error memory entries, and write them into the external knowledge base layer; Step A3: Do not perform a write operation; Step A4: Read knowledge-based or trial-and-error memory entries related to the current task from the external knowledge base layer and inject them into the current context layer.
[0105] More preferably, step A4, which involves reading trial-and-error memory entries related to the current task from the external knowledge base layer and injecting them into the current context layer, includes: encoding the current task description into a vector; retrieving trial-and-error memory entries related to the current task from the external knowledge base layer based on similarity; and injecting the successful strategy summary of the matched trial-and-error memory entries into the current context of the current task when the similarity exceeds a preset threshold.
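The retrieval and injection in step A4 can be sketched as follows, assuming the task description has already been encoded into a vector by some embedding model (not shown). The entry fields `embedding`, `memory_type`, and `success_strategy_summary`, the 0.8 threshold, the `top_k` limit, and the injected text format are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_strategy_summaries(task_vector, entries, threshold=0.8, top_k=3):
    """Return success-strategy summaries of trial-and-error entries above the threshold."""
    scored = [(cosine_similarity(task_vector, e["embedding"]), e)
              for e in entries if e["memory_type"] == "trial_and_error"]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [e["success_strategy_summary"] for score, e in scored[:top_k]
            if score > threshold]


def inject_into_context(context, summaries):
    """Append matched summaries to the current context; unchanged if none matched."""
    if not summaries:
        return context
    hints = "\n".join(f"- {s}" for s in summaries)
    return f"{context}\n\nRelevant past strategies:\n{hints}"
```

When no entry exceeds the threshold, the context is returned unchanged, so unrelated experience is not injected.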
[0106] More preferably, the device 70 further includes a memory transfer unit 706, which is used to invoke the memory management decision model after the current task session ends, extract session information from the current context layer, and integrate the extracted information into knowledge-based memory entries or trial-and-error-based memory entries for archiving to the external knowledge base layer.
[0107] More preferably, the memory transfer unit 706 is further configured to merge multiple similar memory entries into a strategic memory entry and write it into the external knowledge base layer when the number of similar memory entries in the external knowledge base layer exceeds the merging threshold.
[0108] More preferably, the memory system further includes an optimizer momentum state layer for optimizing the training of a large model of the agent; The memory transfer unit 706 is further configured to: filter target memory entries in the external knowledge base layer whose access frequency is higher than a set threshold and whose content change rate is lower than a set threshold; construct a fine-tuning training set based on the target memory entries when the preset transfer triggering conditions are met; and perform lightweight parameter fine-tuning on the parameter layer of the large model through the optimizer momentum state layer.
[0109] More preferably, the memory entries in the external knowledge base layer include memory type tags, content text, creation time, last access time, access count, TTL value, and associated metadata; the device 70 further includes a memory eviction unit 708, used to scan the external knowledge base layer at preset intervals and obtain the access frequency of the memory entries based on the last access time and access count; extend the TTL value corresponding to the memory entries based on the access frequency; and evict the memory entries when the extended TTL value expires; and when the storage utilization rate of the external knowledge base layer exceeds a first threshold, sort the memory entries of the external knowledge base layer based on the LRU representation index, and evict the memory entries that have not been accessed for the longest time in order of sorting results, until the storage utilization rate drops below a second threshold.
[0110] More preferably, the device 70 further includes a model training unit 710, which is used to model the read and write operations of the memory system corresponding to steps A1, A2, A3, and A4 as discrete actions in reinforcement learning. First, a cold start is completed by supervised fine-tuning through manually labeled datasets. Then, a proximal policy optimization algorithm is used, with a weighted combination of task success rate, task completion efficiency, and memory hit rate as the reward function, to train and optimize the optimal memory writing decision strategy online.
[0111] More preferably, the discrete actions include writing knowledge-based memory, writing trial-and-error trajectories, not writing to memory, and reading memory. The state space of reinforcement learning consists of one or more of the agent's current task context, existing memory bank content summary, and current interaction round information.
[0112] More preferably, the model training unit 710 is further configured to calculate, after any of steps A1 to A4 is completed, the reward corresponding to that step, and to construct, based on the reward, a data pool used to update and train the memory management decision model; after the current task is completed, when a preset trigger condition is met, the memory management decision model is updated using a reinforcement learning algorithm based on the data in the data pool.
[0113] For ease of description, the above device 70 is described by dividing it into various modules or units according to their functions. Of course, when implementing one or more embodiments of this specification, the functions of each module or unit can be implemented in one or more pieces of software and / or hardware, or a module that implements the same function can be implemented by a combination of multiple sub-modules or sub-units, etc. The device embodiments described above are merely illustrative. For example, the division of units is only a logical functional division, and there may be other division methods in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
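The action space, the weighted reward, and the data pool described for the model training unit can be illustrated with the sketch below. It does not implement proximal policy optimization; it only shows how per-step transitions might be collected until a trigger condition is met. The weights, the batch-size trigger, and the names `MemoryAction`, `combined_reward`, and `DataPool` are assumptions.

```python
from enum import Enum


class MemoryAction(Enum):
    WRITE_KNOWLEDGE = "A1"        # write a knowledge-based memory entry
    WRITE_TRIAL_AND_ERROR = "A2"  # write a trial-and-error trajectory
    NO_WRITE = "A3"               # do not write
    READ = "A4"                   # read memory into the current context


def combined_reward(task_success, efficiency, memory_hit_rate,
                    weights=(0.6, 0.2, 0.2)):
    """Weighted combination of the three reward terms; the weights are illustrative."""
    w_success, w_efficiency, w_hit = weights
    return (w_success * task_success
            + w_efficiency * efficiency
            + w_hit * memory_hit_rate)


class DataPool:
    """Collects (state, action, reward) transitions for a later policy update."""

    def __init__(self, update_batch_size=4):
        self.transitions = []
        self.update_batch_size = update_batch_size

    def add(self, state, action, reward):
        self.transitions.append((state, action, reward))

    def ready_for_update(self):
        # Assumed trigger condition: enough transitions have been collected.
        return len(self.transitions) >= self.update_batch_size

    def drain(self):
        """Hand the collected batch to the RL update and clear the pool."""
        batch, self.transitions = self.transitions, []
        return batch
```

After a task completes, a caller would check `ready_for_update()` and, if true, pass `drain()` to whatever reinforcement learning update routine is in use.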
[0114] Based on the same concept as the above methods, this specification also provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor implements the steps of the memory management method for an intelligent agent as described in any of the above embodiments by running the executable instructions.
[0115] Based on the same concept as the methods described above, this specification also provides a computer-readable storage medium having computer instructions stored thereon that, when executed by a processor, implement the steps of the methods as described in any of the above embodiments.
[0116] Based on the same concept as the methods described above, this specification also provides a computer program product, including a computer program / instructions that, when executed by a processor, implement the steps of the methods as described in any of the above embodiments.
[0117] Those skilled in the art will understand that, in this specification, the terms "comprising," "including," or any other variations thereof are intended to cover a non-exclusive inclusion, such that a process, method, product, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, product, or apparatus. Without further limitation, the presence of additional identical or equivalent elements in a process, method, product, or apparatus that includes said elements is not excluded.
[0118] In this specification, “a,” “an,” and “the” do not specifically refer to the singular, but may also include the plural.
[0119] In this specification, ordinal numbers such as "first," "second," etc., do not necessarily indicate order; they are often used to distinguish between objects. For example, "first server" and "second server" usually refer to two servers. To differentiate between these two servers, they are described as "first server" and "second server." Of course, sometimes these two servers may be the same server.
[0120] In this specification, unless explicitly stated otherwise, "receiving and sending data" does not necessarily mean direct receiving and sending; it can also mean indirect receiving and sending. For example, A receiving data sent by B can be understood as A directly receiving the data sent by B, or it can be understood as A indirectly receiving the data sent by B through other entities such as C. Similarly, B sending data to A can be understood as B sending the data directly to A, or it can be understood as B indirectly sending the data to A through other entities such as C. Here, C can be one entity, or it can be two or more entities.
[0121] In this specification, unless explicitly stated otherwise, the relationships between structures can be direct or indirect. For example, when describing "A is connected to B," unless it is explicitly stated that A and B are directly connected, it should be understood that A can be directly connected to B or indirectly connected to B. Similarly, when describing "A is on top of B," unless it is explicitly stated that A is directly above B (A and B are adjacent and A is above B), it should be understood that A can be directly above B or indirectly above B (A and B are separated by other elements, and A is above B). And so on.
[0122] This specification uses specific terms to describe embodiments thereof. For example, "one embodiment" and / or "some embodiments" refer to a particular feature, structure, or characteristic related to at least one embodiment of this specification. Therefore, it should be emphasized and noted that references to "one embodiment" or "an alternative embodiment" in different locations throughout this specification do not necessarily refer to the same embodiment. Furthermore, those skilled in the art can combine and integrate the different embodiments or examples described herein, as well as the features of those different embodiments or examples, without contradiction.
[0123] Although one or more embodiments of this specification provide method steps as described in the embodiments or flowcharts, it is understood that the order of steps listed in the embodiments or flowcharts is only one of many possible execution orders and does not represent the only execution order. Therefore, when the claims involve method steps, any changes or adjustments to the order of such steps, or the parallelism between steps, are also within the scope of protection of the claims.
Claims
1. A memory management method for an intelligent agent, characterized in that, The memory system corresponding to the agent includes the current context layer of the current task and an external knowledge base layer accessible to the agent; the method includes: Obtain the current memory state vector, wherein the memory state vector is encoded based on the state information of the current task and the state information of the memory system; Based on the current memory state vector, a pre-trained memory management decision model is invoked to perform memory read / write management on the agent, wherein performing the memory read / write management includes at least one of the following steps A1 to A4: Step A1: Based on the information in the current context layer, summarize it into knowledge-based memory entries and write them into the external knowledge base layer; Step A2: Identify the trial-and-error trajectory of the agent from failure to success during task execution from the current context layer, extract and encode the trial-and-error trajectory into trial-and-error memory entries, and write them into the external knowledge base layer; Step A3: Do not perform a write operation; Step A4: Read knowledge-based or trial-and-error memory entries related to the current task from the external knowledge base layer and inject them into the current context layer.
2. The method according to claim 1, characterized in that, Step A4, which involves reading trial-and-error memory entries related to the current task from the external knowledge base layer and injecting them into the current context layer, includes: encoding the current task description into a vector; retrieving trial-and-error memory entries related to the current task from the external knowledge base layer based on similarity; and injecting a summary of the successful strategies of the matched trial-and-error memory entries into the current context of the current task when the similarity exceeds a preset threshold.
3. The method according to claim 1 or 2, characterized in that, The method further includes: after the session of the current task ends, invoking the memory management decision model to extract session information from the current context layer, and integrating the extracted information into knowledge-based memory entries or trial-and-error-based memory entries for archiving to the external knowledge base layer.
4. The method according to claim 3, characterized in that, The method further includes: when the number of similar memory entries in the external knowledge base layer exceeds the merging threshold, merging multiple similar memory entries into a strategic memory entry and then writing it into the external knowledge base layer.
5. The method according to claim 3, characterized in that, The memory system also includes an optimizer momentum state layer for optimizing the training of a large model of the agent; The method further includes: filtering target memory entries in the external knowledge base layer whose access frequency is higher than a set threshold and whose content change rate is lower than a set threshold; constructing a fine-tuning training set based on the target memory entries when the preset migration triggering conditions are met; and performing lightweight parameter fine-tuning on the parameter layer of the large model through the optimizer momentum state layer.
6. The method according to claim 1 or 2, characterized in that, The memory entries in the external knowledge base layer include memory type tags, content text, creation time, last access time, access count, TTL value, and associated metadata; the method also includes performing TTL expiration eviction or LRU sorting and eviction on the memory entries in the external knowledge base layer. Performing TTL expiration eviction on the memory entries in the external knowledge base layer includes: scanning the external knowledge base layer at preset intervals and obtaining the access frequency of memory entries based on the last access time and access count of the memory entries; extending the TTL value corresponding to the memory entries based on the access frequency; and evicting the memory entries when the extended TTL value expires. Performing LRU sorting and eviction on the memory entries of the external knowledge base layer includes: when the storage utilization rate of the external knowledge base layer exceeds a first threshold, sorting the memory entries of the external knowledge base layer based on the LRU representation index, and evicting the memory entries that have not been accessed for the longest time in sequence according to the sorting results, until the storage utilization rate drops below a second threshold.
7. The method according to claim 1, characterized in that, The memory management decision model is obtained through reinforcement learning training including the following steps: The read and write operations of the memory system corresponding to steps A1, A2, A3, and A4 are modeled as discrete actions in reinforcement learning. First, a cold start is completed by supervised fine-tuning using manually labeled datasets. Then, a proximal policy optimization algorithm is used, with a weighted combination of task success rate, task completion efficiency, and memory hit rate as the reward function, to train and optimize the optimal memory writing decision strategy online.
8. The method according to claim 7, characterized in that, The discrete actions include writing knowledge-based memory, writing trial-and-error trajectories, not writing, and reading memory. The state space of reinforcement learning consists of one or more of the agent's current task context, existing memory bank content summary, and current interaction round information.
9. The method according to claim 7 or 8, characterized in that, The method further includes: After any of steps A1 to A4 is completed, the reward corresponding to that step is calculated, and a data pool used to update and train the memory management decision model is constructed based on the reward. After the current task is completed, when a preset triggering condition is met, the memory management decision model is updated using a reinforcement learning algorithm based on the data in the data pool.
10. A memory management device for an intelligent agent, characterized in that, The memory system corresponding to the agent includes the current context layer of the current task and an external knowledge base layer accessible to the agent; the device includes: An acquisition unit is used to acquire a current memory state vector, wherein the memory state vector is encoded based on the state information of the current task and the state information of the memory system; The memory read / write unit is used to invoke a pre-trained memory management decision model based on the current memory state vector to perform memory read / write management of the agent, wherein performing the memory read / write management includes at least one of the following steps A1 to A4: Step A1: Based on the information in the current context layer, summarize it into knowledge-based memory entries and write them into the external knowledge base layer; Step A2: Identify the trial-and-error trajectory of the agent from failure to success during task execution from the current context layer, extract and encode the trial-and-error trajectory into trial-and-error memory entries, and write them into the external knowledge base layer; Step A3: Do not perform a write operation; Step A4: Read knowledge-based or trial-and-error memory entries related to the current task from the external knowledge base layer and inject them into the current context layer.