Embodied exploration method based on reinforcement learning and long-term memory active retrieval
An embodied exploration method based on reinforcement learning and long-range memory active retrieval addresses the low memory utilization and the disconnect between cognition and decision-making that embodied agents show in long-horizon tasks. It enables agents to actively construct memories and optimize decisions in unknown environments, improving perception and planning capabilities, and in particular exploration efficiency in complex dynamic environments.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- EAST CHINA NORMAL UNIV
- Filing Date
- 2026-01-08
- Publication Date
- 2026-06-16
AI Technical Summary
Existing embodied intelligent agents suffer from passive memory utilization, limited context windows, and a disconnect between cognition and decision-making in long-endurance missions. They are unable to proactively utilize historical contextual memories for route planning and decision-making during exploration, and reinforcement learning lacks refined reward guidance, making it difficult to achieve multi-task collaborative optimization in complex and dynamic environments.
We employ a reinforcement learning and long-range memory-based active retrieval approach to construct a unified cognitive and decision-making framework. By combining a multimodal large language model with a multi-task reward function, we enable the agent to actively construct and retrieve memories in unknown environments and optimize decision-making. This includes multi-target navigation and memory-based question-and-answer tasks. We utilize the multimodal large language model for decision training and design a multi-task reward function to enhance the agent's decision-making and cognitive abilities.
Intelligent agents can proactively build and utilize memories in unknown environments, improving their perception, reasoning, and planning capabilities in complex and dynamic environments. This achieves simultaneous improvement in exploration efficiency and environmental understanding, significantly enhancing success rates and path efficiency in long-term tasks.
Figure CN121882273B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the fields of memory agents and reinforcement learning technology, and in particular to an embodied exploration method based on reinforcement learning and long-range memory active retrieval.
Background Technology
[0002] Embodied intelligence is an important research direction in the field of artificial intelligence, aiming to develop intelligent agents that can autonomously operate, learn, and evolve in complex, dynamic, and unknown physical environments. With the development of multimodal large language models, embodied intelligent agents have made significant progress in scene understanding, logical reasoning, and task planning. However, in handling long-endurance tasks and complex scenarios requiring decision-making based on long-range memory, current technologies still face the following bottlenecks:
[0003] 1) Limitations of the Embodied Exploration Task Paradigm: Current research on embodied tasks (such as target navigation and embodied question answering) mostly falls under the category of "one-off tasks." These tasks primarily focus on the final outcome of task completion (such as whether the target has been reached or whether the answer is correct), while neglecting the exploration process itself. Existing benchmark tests typically only require agents to find a specific target within a single round, lacking a comprehensive evaluation of the agent's ability to accumulate, maintain, and utilize contextual memory across time steps during continuous exploration. This makes it difficult for agents to cope with long-duration tasks requiring multi-stage planning and a deep understanding of the environment.
[0004] 2) Passivity of Memory Utilization Mode: Existing embodied intelligence methods typically employ a "passive reception" mode when processing memory. For example, imitation learning methods indirectly utilize historical information by replicating expert trajectories, lacking initiative in autonomous exploration, while some visual language-based methods rely on simple historical snapshot filtering mechanisms. This passive mode limits the agent's reasoning ability, preventing it from actively and accurately recalling specific spatiotemporal segments in the memory bank based on the progress of the current task. Especially when dealing with long-term time-series data, the limitation of the model's context window means that passively inputting a large amount of irrelevant memory can lead to model perceptual overload or "hallucinations."
[0005] 3) Disconnect between cognitive understanding and decision-making: Under the current technological framework, the perception and cognition of an intelligent agent (such as answering questions about the environment) are often separated from action decision-making (such as path planning). Existing intelligent agents typically only perform post-exploration reasoning after the exploration task is completed, or they cannot utilize stored memories to guide the selection of exploration boundaries during navigation. This "sensing without acting" or "acting without thinking" situation makes it impossible for models to effectively utilize information from explored areas to avoid repeated exploration or optimize navigation paths in complex and dynamic environments (such as large home or industrial scenarios).
[0006] 4) Reinforcement learning lacks refined reward guidance in embodied tasks: Although reinforcement learning has been applied to embodied tasks, existing reward functions are mostly based on sparse success signals, making it difficult to establish an effective correlation between long-range memory retrieval and multi-task collaborative execution (such as simultaneous navigation and online question answering). This makes it difficult for models to evolve efficient tool-calling logic and proactive exploration strategies through self-play or environmental interaction.
[0007] In summary, existing embodied intelligent agents suffer from drawbacks and problems in long-endurance missions, such as passive memory utilization, limited context windows, and a disconnect between cognition and decision-making. They are unable to proactively utilize historical contextual memories for route planning and decision-making during exploration. Developing an embodied exploration framework that can unify embodied cognition and decision-making behaviors, support proactive retrieval of long-term contextual memories, and possess multi-task collaborative optimization capabilities is a key issue that urgently needs to be addressed to achieve general embodied intelligence.
Summary of the Invention
[0008] The purpose of this invention is to address the shortcomings of existing technologies by providing an embodied exploration method based on reinforcement learning and long-range memory active retrieval. This method employs a unified cognitive and decision-making framework to enable an agent to proactively construct and retrieve memories in unknown environments, thereby optimizing decision-making and achieving lifelong learning. The method first constructs a long-range memory embodied exploration data acquisition framework, combining multi-target navigation with memory-based question answering and a training framework based on a multimodal large language model. Using the multimodal large language model as the decision-making core, reinforcement learning is used to fine-tune the model, enabling it to proactively invoke memory retrieval tools. Then, a multi-task reward function is designed to comprehensively evaluate action prediction, frontier image selection, and question-answering accuracy. Finally, reinforcement learning is used to fine-tune the agent's decision-making and cognitive abilities in unknown environments. This invention effectively solves the problems of low memory utilization and lack of proactive exploration strategies in embodied agents during long-endurance missions, significantly improving the agent's perception, reasoning, and planning capabilities in complex dynamic environments, demonstrating promising application prospects and commercial development value.
[0009] The specific technical solution for achieving the objective of this invention is: an embodied exploration method based on reinforcement learning and long-range memory active retrieval. Its characteristic is the use of a unified cognitive and decision-making framework, enabling the agent to actively construct and retrieve memories in unknown environments and optimize decisions accordingly, thereby achieving lifelong learning for the agent. This method specifically includes:
[0010] Step 1: Constructing a data acquisition framework for long-range memory embodied exploration
[0011] Step 1.1: Based on the objects in the scene and their corresponding room types, a multi-target navigation task is generated using a large language model. Using the navigation algorithm provided by the Habitat-sim simulator, multiple target objects are searched for in sequence in the unknown environment. During exploration, a long-range episodic memory bank containing images, object categories, and spatial coordinates is dynamically constructed. Highly similar memories are filtered out using feature similarity, which is calculated by the following formula:
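[0012] Feature similarity = $\lambda_t\,\mathrm{sim}(f_t, f_t^{m}) + \lambda_v\,\mathrm{sim}(f_v, f_v^{m}) + \lambda_p\,\phi(p, p^{m})$.
[0013] where $f_t$, $f_v$, and $p$ are the object category text features, observed image features, and location information in the current state; $(f_t^{m}, f_v^{m}, p^{m})$ is an entry in the current memory; $\phi$ is an exponential function of the Euclidean distance between $p$ and $p^{m}$; and $\lambda_t$, $\lambda_v$, $\lambda_p$ are the feature similarity weight coefficients for text features, image features, and location, respectively.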
[0014] To effectively filter out similar memories with excessive repetition, the similarity scores of the k most recent samples are aggregated, their mean and standard deviation are calculated, and contextual memories are dynamically filtered at each step based on an adaptive similarity threshold, resulting in the final filtered long-range memory bank.
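The filtering described in Step 1.1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the text and image terms are taken to be cosine similarities, the location term is assumed to decay as exp(-distance), and the threshold rule (mean plus `alpha` times the standard deviation of the k most recent scores) is one plausible reading of the adaptive threshold. The function names, the weights, `alpha`, and `fallback` are hypothetical.

```python
import math
from statistics import mean, pstdev


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def feature_similarity(cur, mem, w_text=0.4, w_img=0.4, w_pos=0.2):
    """Weighted text, image, and location similarity (weights are placeholders)."""
    # Assumption: the location term decays as exp(-Euclidean distance).
    pos_term = math.exp(-math.dist(cur["pos"], mem["pos"]))
    return (w_text * cosine(cur["text"], mem["text"])
            + w_img * cosine(cur["img"], mem["img"])
            + w_pos * pos_term)


def is_redundant(cur, memory, recent_scores, k=5, alpha=1.0, fallback=0.9):
    """Return (redundant, score) for the current observation.

    score is the highest similarity to any stored entry. The threshold is
    mean + alpha * std over the k most recent scores (an assumed rule);
    a fixed fallback is used until two scores are available.
    """
    if not memory:
        return False, 0.0
    score = max(feature_similarity(cur, m) for m in memory)
    window = recent_scores[-k:]
    if len(window) >= 2:
        threshold = mean(window) + alpha * pstdev(window)
    else:
        threshold = fallback
    return score > threshold, score


# Toy data: one stored entry, a repeated view, and a clearly new view.
memory = [{"text": [1.0, 0.0], "img": [0.0, 1.0], "pos": (0.0, 0.0)}]
recent = [0.50, 0.60, 0.55]
same_view = {"text": [1.0, 0.0], "img": [0.0, 1.0], "pos": (0.0, 0.0)}
new_view = {"text": [0.0, 1.0], "img": [1.0, 0.0], "pos": (10.0, 0.0)}
```

In this sketch an observation flagged as redundant would simply be skipped, while a non-redundant one would be appended to the memory bank and its score appended to `recent`.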
[0015] Step 1.2: The multimodal large language model generates questions and answers related to environmental cognition based on the constructed long-range contextual memory bank. The question types include open-ended and multiple-choice questions.
[0016] Step 2: Embodied Decision Training Process Based on Multimodal Large Language Model
[0017] Step 2.1: The current task instruction, the current multi-view observation images, the target-related question, and the externally stored episodic memories are input into a decision model based on a multimodal large language model. The model generates a reasoning response that analyzes the current task instruction, observations, and question, invokes the external memory retrieval tool, and generates the query text.
[0018] Step 2.2: Based on the query text generated by the multimodal large language model, the retrieval tool uses the cosine similarity algorithm to extract the top k relevant memory fragments from the long-range episodic memory bank. These fragments are then fed to the multimodal large language model as additional prompt input. The multimodal large language model combines the observations with historical memory to output a single-step action prediction S, a frontier exploration image selection F, and an answer to the question A. The retrieved memory fragments $M_k$ can be expressed as follows:
[0019] $M_k = \operatorname{TopK}_{i}\,\cos\left(f_q, f_i^{m}\right)$.
[0020] where $f_q$ is the text feature obtained by the text encoder for the query text; $f_i^{m}$ denotes the features of the i-th entry stored in the long-range memory, namely the text features obtained by the encoder from the stored object category text information and the observed image features; and $\operatorname{TopK}\cos(\cdot,\cdot)$ indicates that cosine similarity is calculated and the k memory pairs with the highest similarity are selected as the retrieved memories and input into the multimodal large language model for decision-making.
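The top-k retrieval in Step 2.2 can be illustrated with a short sketch. It assumes that query and memory features live in one shared embedding space, and that an entry is scored by the better of its text and image match; the source does not state how the two are combined, so that rule, the `retrieve_top_k` name, and the toy vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query_feat, memory, k=3):
    """Return the k memory entries most similar to the query features."""
    scored = []
    for entry in memory:
        # Assumption: score an entry by the better of its text and image match.
        score = max(cosine(query_feat, entry["text"]),
                    cosine(query_feat, entry["img"]))
        scored.append((score, entry))
    # Sort by score only; the sort is stable, so ties keep insertion order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:k]]


# Toy memory bank with two-dimensional placeholder features.
memory = [
    {"id": "sofa", "text": [1.0, 0.0], "img": [0.0, 1.0]},
    {"id": "sink", "text": [0.0, 1.0], "img": [0.0, 1.0]},
    {"id": "tv", "text": [0.6, 0.8], "img": [0.0, 1.0]},
]
top = retrieve_top_k([1.0, 0.0], memory, k=2)
```

With these toy vectors the query is closest to the "sofa" entry, followed by "tv"; in practice the features would come from the encoders named in the description.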
[0021] Step 3: Reinforcement Fine-tuning of the Model
[0022] Step 3.1: Use the constructed multi-task reward function to fine-tune the multimodal large language model with reinforcement learning, so that the model can accurately call the memory retrieval tool and take the retrieved memory as additional input for final decision-making. The final output is the single-step action prediction S, the frontier exploration image selection F, and the question answer A. The multi-task reward function is a weighted sum of the action prediction accuracy reward, the frontier exploration image selection correctness reward, the answer accuracy reward, and the output format integrity reward.
[0023] Each entry in the long-range episodic memory bank contains an image, object category text information, and location coordinates; features are extracted using a vision-language model, and spatiotemporal consistency filtering is performed.
[0024] The reward function introduces a logical consistency coefficient and a scaling factor. If the direction of the action predicted by the model is inconsistent with the direction of the selected frontier exploration image, the logical consistency coefficient reduces the corresponding reward score. The scaling factor sets differentiated reward multipliers for "tool call successful" and "tool call failed" to adjust the sub-reward weights in scenarios with or without tool calls.
[0025] The action accuracy reward is given by the optimal action matching degree between the multimodal large language model's predicted action and the expert path; the frontier exploration image selection correctness reward is given by whether the next exploration direction selected by the multimodal large language model is consistent with the true direction of the expert path; the answer accuracy reward is given by the correctness of the multimodal large language model's answer to a multiple-choice question based on memory; and the output format integrity reward is given by the multimodal large language model's generation of a response that conforms to a specific reasoning format.
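A minimal sketch of such a composite reward is given below. The source specifies a weighted sum of four sub-rewards, a logical consistency coefficient, and a tool-call scaling factor, but not their values or exactly where the coefficients apply. The weights, the coefficient values, the choice to attenuate only the navigation terms on inconsistency, and the names `StepOutcome` and `multi_task_reward` are therefore assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class StepOutcome:
    action_match: float         # match with the expert action, in [0, 1]
    frontier_correct: bool      # selected frontier agrees with the expert path
    answer_correct: bool        # multiple-choice answer is correct
    format_ok: bool             # response follows the required format
    direction_consistent: bool  # action direction agrees with chosen frontier
    tool_call_ok: bool          # memory retrieval tool call succeeded


def multi_task_reward(o, weights=(0.4, 0.3, 0.2, 0.1),
                      consistency_coef=0.5,
                      tool_ok_scale=1.0, tool_fail_scale=0.5):
    """Weighted sum of four sub-rewards with two modifiers (all values hypothetical)."""
    w_action, w_frontier, w_answer, w_format = weights
    nav = w_action * o.action_match + w_frontier * float(o.frontier_correct)
    # Assumption: an inconsistent action/frontier pair attenuates the navigation terms.
    if not o.direction_consistent:
        nav *= consistency_coef
    # Assumption: the tool-call multiplier scales the task rewards, not the format reward.
    scale = tool_ok_scale if o.tool_call_ok else tool_fail_scale
    task = scale * (nav + w_answer * float(o.answer_correct))
    return task + w_format * float(o.format_ok)


perfect = StepOutcome(1.0, True, True, True, True, True)
inconsistent = StepOutcome(1.0, True, True, True, False, True)
tool_failed = StepOutcome(1.0, True, True, True, True, False)
```

Keeping the format reward outside both modifiers is a design choice of this sketch: it lets a well-formed response earn some reward even when the tool call fails.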
[0026] Compared with the prior art, the present invention has the following beneficial technical effects and significant technical progress:
[0027] 1) It replaces the passive reception of historical information in previous models, enabling the agent to actively recall specific memories based on task requirements, as a human would, which greatly improves the reasoning success rate in long-horizon tasks.
[0028] 2) By integrating navigation actions and question-and-answer tasks into a unified reward function, the agent can focus not only on "how to get there" but also on "what is in the environment" during the exploration process, thus achieving a simultaneous improvement in exploration efficiency and environmental understanding.
[0029] 3) The introduction of external memory retrieval tools effectively solves the problems of context window limitations in multimodal models and difficulty in maintaining perceptual consistency in long sequences, enabling agents to handle complex tasks with long time sequences.
[0030] 4) Experiments have shown that the success rate and path efficiency of this invention are significantly better than existing benchmark methods for embodied exploration of multimodal large language models in unknown environments.
Attached Figure Description
[0031] Figure 1 is a flowchart of the present invention;
[0032] Figure 2 is a flowchart of the training process of the present invention.
Detailed Implementation
[0033] The present invention specifically includes the following steps:
[0034] Step 1: Data Construction for Long-Term Memory Embodied Exploration Task
[0035] A unified evaluation system is constructed, comprising two interconnected core sub-tasks: multi-objective navigation and memory-based question-answering. The specific construction includes the following steps:
[0036] Step 1.1: Based on the objects in the scene, such as sofas, TVs, tables, etc., and their corresponding room types, such as kitchens, bedrooms, and living rooms, generate multi-target navigation instructions using a large language model;
[0037] Step 1.2: Following the multi-target navigation instructions, the navigation algorithm in the Habitat-sim simulator searches for the target objects in the scene in sequence. During movement, multi-view observation images are acquired from the simulator in real time, visual and text features are extracted using CLIP, and these are combined with spatial coordinate information to dynamically construct a long-range episodic memory bank.
[0038] Step 1.3: After the trajectory is collected, use a multimodal large language model to generate queries about environmental details (such as object attributes, quantity, positional relationships or status), and generate the corresponding correct answers to build a memory-based question-and-answer task.
[0039] Step 2: Embodied Decision Training Process Based on Multimodal Large Language Model
[0040] The decision-making model proposed in this invention is based on a multimodal large language model. The model input includes the current task instruction, the current multi-view observation images, the target-related question, and the externally stored episodic memories. The model achieves active exploration through two-stage training, specifically including:
[0041] Step 2.1: Proactive Retrieval Decision
[0042] The multimodal large language model analyzes the current task instruction, observations, and question to form textual reasoning, autonomously generates a piece of code that invokes the external memory retrieval tool, and generates a targeted query text accordingly.
[0043] Step 2.2: Memory Enhancement and Multi-Task Decision Making
[0044] a. The episodic memory at the current step consists of image-text pairs filtered by image, text, and location similarity. The similarity calculation can be expressed as: $\lambda_t\,\mathrm{sim}(f_t, f_t^{m}) + \lambda_v\,\mathrm{sim}(f_v, f_v^{m}) + \lambda_p\,\phi(p, p^{m})$, where $f_t$, $f_v$, and $p$ represent the object category text features, observed image features, and location information in the current state, respectively; $(f_t^{m}, f_v^{m}, p^{m})$ represents an entry in the current memory; $\phi$ is the exponential function of the Euclidean distance; and $\lambda_t$, $\lambda_v$, $\lambda_p$ represent the weighting coefficients for the similarity of the different features;
[0045] b. Based on the query text, the retrieval tool uses a cosine similarity algorithm to extract the top k relevant memory fragments $M_k$ from the long-range episodic memory bank, which can be represented as: $M_k = \operatorname{TopK}_{i}\,\cos\left(f_q, f_i^{m}\right)$, where $f_q$ is the text feature of the query text and $f_i^{m}$ denotes the stored features of the i-th memory entry. The retrieved memories are then fed back to the multimodal large language model as additional prompt input. The model combines observations with historical memory to output a single-step action prediction S, a frontier exploration image selection F, and an answer to the question A.
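The two passes of Step 2 (first decide what to retrieve, then decide with the retrieved memories) can be sketched as a small control loop. Here `model` and `retrieve` are hypothetical callables standing in for the multimodal large language model and the memory retrieval tool; the dictionary keys and the stubs are illustrative assumptions, not an interface defined by the source.

```python
def decide_step(model, retrieve, instruction, observations, question, k=3):
    """Run one exploration step: query generation, retrieval, final decision."""
    base = {"instruction": instruction,
            "observations": observations,
            "question": question}
    # Pass 1: no memories are supplied yet, so the model proposes a query text.
    first = model({**base, "memories": None})
    query = first.get("query")
    # Retrieve only if the model actually asked for memories.
    memories = retrieve(query, k) if query else []
    # Pass 2: the model decides with the retrieved memories as extra input.
    final = model({**base, "memories": memories})
    return {"query": query, "memories": memories,
            "action": final["action"],
            "frontier": final["frontier"],
            "answer": final["answer"]}


def stub_model(inputs):
    """Stand-in for the decision model (returns fixed outputs)."""
    if inputs["memories"] is None:
        return {"query": "where is the sofa"}
    return {"action": "turn_left", "frontier": 1, "answer": "B"}


def stub_retrieve(query, k):
    """Stand-in for the retrieval tool (keyword match on a toy bank)."""
    bank = ["sofa seen in living room", "sink seen in kitchen"]
    return [m for m in bank if "sofa" in m and "sofa" in query][:k]


result = decide_step(stub_model, stub_retrieve,
                     "find the sofa, then the sink",
                     ["front.png"], "What color is the sofa?")
```

Separating the two passes keeps irrelevant memories out of the model input, which is the motivation the description gives for active rather than passive retrieval.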
[0046] Step 3: Design of Multi-Task Reward Function
[0047] Step 3.1: To jointly optimize navigation efficiency and memory accuracy, this invention designs a composite reward function that includes the following four dimensions:
[0048] a. Action accuracy reward: A reward is given based on the optimal action matching degree between the agent's predicted action and the expert path;
[0049] b. Frontier Exploration Image Selection Accuracy Reward: Evaluates whether the agent's chosen next exploration direction is consistent with the true direction of the expert path;
[0050] c. Question-answering accuracy reward: Provide feedback on the correctness of the agent's answers to multiple-choice questions based on memory;
[0051] d. Format integrity reward: A reward is given when the model generates a response that conforms to a specific reasoning format.
[0052] In addition, a consistency penalty coefficient is introduced. If the predicted action direction by the model is inconsistent with the direction of the selected frontier exploration image, the corresponding reward score is reduced. A scaling factor is also introduced. Differentiated reward multipliers are set for "successful tool call" and "failed tool call" to incentivize agents to call memory retrieval tools more efficiently and accurately.
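The format integrity reward can be illustrated with a simple pattern check. The source does not specify the reasoning format, so the `<think>`, `<tool_call>`, and `<answer>` tags below are purely hypothetical placeholders, as is the `format_reward` name.

```python
import re

# Hypothetical format: reasoning, an optional tool call, then the final answer.
RESPONSE_PATTERN = re.compile(
    r"^<think>(?P<think>.+?)</think>\s*"
    r"(?:<tool_call>(?P<tool>.+?)</tool_call>\s*)?"
    r"<answer>(?P<answer>.+?)</answer>\s*$",
    re.DOTALL,
)


def format_reward(response: str) -> float:
    """Return 1.0 if the response matches the assumed format, else 0.0."""
    return 1.0 if RESPONSE_PATTERN.match(response.strip()) else 0.0


ok_plain = "<think>The sofa is to the left.</think><answer>turn_left</answer>"
ok_tool = ("<think>I need past views.</think>"
           "<tool_call>retrieve('sofa')</tool_call>"
           "<answer>turn_left</answer>")
bad = "turn_left"
```

A binary score like this is the simplest option; a graded score that credits partially correct structure would be an equally valid reading of the description.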
[0053] To facilitate understanding of the present invention, the present invention will be described in detail below with reference to the accompanying drawings and embodiments.
[0054] Example 1
[0055] Referring to Figure 1, an embodied exploration method based on reinforcement learning and long-range memory active retrieval specifically includes the following steps:
[0056] S100: Using the labels of objects in the HM3DSem dataset, obtain the objects included in each scene and the corresponding room categories as prompt information.
[0057] S110: Input the information prompts from S100 into the multimodal large language model to generate multi-target navigation task instructions.
[0058] S120: Utilizes multiple object instances corresponding to the generated task instructions to execute navigation paths in the simulation environment, generates exploration trajectories, and records observation images, text descriptions (generated by the image tagging model in the multimodal large language model), and multi-dimensional information such as location for each step of the action.
[0059] S130: Using the information obtained in S120, extract the corresponding image and text features based on CLIP, and use them and location information to calculate similarity scores to dynamically filter similar information, reduce the storage burden of the memory bank, and thus construct a long-range contextual memory bank.
[0060] S140: Using the collected memory bank, the episodic memory images are input into the visual multimodal large language model to generate five types of questions: attribute, count, state, relation, and location, to evaluate the model's cognitive and decision-making abilities regarding long-term memory.
[0061] Referring to Figure 2, the specific training process of the multimodal large language model is as follows:
[0062] S200: The generated task instructions, observed images, and memory-based questions are used as inputs to train a multimodal large language model for active exploration and memory retrieval.
[0063] S210: Combining multimodal input, the model generates a thought analysis of the current task and situation, then generates text queries and executable code, and actively retrieves and makes decisions from the long-term contextual memory bank of the current step by calling the memory retrieval tool.
[0064] S220: Use the retrieved memory information as additional input to guide the model to think, analyze, and output the final decision.
[0065] S230: The model's final decision output includes predictions of the next action, selection of frontier exploration images, and question answers. A multi-task reward function is used to train the model, namely, an action accuracy reward, a frontier exploration image selection accuracy reward, a question answering accuracy reward, and a format completeness reward. Furthermore, logical consistency penalties and tool call penalties are introduced to supervise the consistency between the model's predicted action direction and the selected frontier exploration image direction, and to incentivize the model to call memory retrieval tools more efficiently and accurately.
[0066] The above are merely preferred embodiments of the present invention. The scope of protection of the present invention is not limited to the above embodiments. All technical solutions falling within the scope of the present invention's concept are within the scope of protection of the present invention. It should be noted that for those skilled in the art, any improvements made without departing from the principle of the present invention should be considered within the scope of protection of the present invention.
Claims
1. An embodied exploration method based on reinforcement learning and long-range memory active retrieval, characterized in that the method specifically includes:
Step 1: Constructing a data acquisition framework for long-range memory embodied exploration
Step 1.1: Based on the objects in the scene and their corresponding room types, a multi-target navigation task is generated using a large language model. Using the navigation algorithm provided by the Habitat-sim simulator, multiple target objects are searched for in sequence in the unknown environment. During exploration, a long-range episodic memory bank containing images, object categories, and spatial coordinates is dynamically constructed. Highly similar memories are filtered out using feature similarity, which is calculated by the following formula: Feature similarity = $\lambda_t\,\mathrm{sim}(f_t, f_t^{m}) + \lambda_v\,\mathrm{sim}(f_v, f_v^{m}) + \lambda_p\,\phi(p, p^{m})$; where $f_t$, $f_v$, and $p$ are the object category text features, observed image features, and location information in the current state; $(f_t^{m}, f_v^{m}, p^{m})$ is an entry in the current memory bank; $\phi$ is an exponential function of the Euclidean distance; and $\lambda_t$, $\lambda_v$, $\lambda_p$ are the feature similarity weight coefficients for text features, image features, and location features, respectively.
Step 1.2: The multimodal large language model generates questions and answers related to environmental cognition based on the constructed long-range episodic memory bank. The types of questions include open-ended and multiple-choice questions.
Step 2: Embodied Decision Training Process Based on Multimodal Large Language Model
Step 2.1: The current task instruction, the current multi-view observation images, the target-related question, and the externally stored episodic memories are input into a decision model based on a multimodal large language model, which generates a reasoning response that analyzes the current task instruction, observations, and question, invokes the external memory retrieval tool, and generates the query text;
Step 2.2: Based on the query text generated by the multimodal large language model, the retrieval tool uses the cosine similarity algorithm to extract the top k relevant memory fragments from the long-range episodic memory bank. These fragments are then fed to the multimodal large language model as additional prompt input. The multimodal large language model combines observations with historical memory to output a single-step action prediction S, a frontier exploration image selection F, and an answer to the question A. The retrieved memory fragments $M_k$ can be expressed as follows: $M_k = \operatorname{TopK}_{i}\,\cos\left(f_q, f_i^{m}\right)$; where $f_q$ is the text feature obtained by the text encoder for the query text; $f_i^{m}$ denotes the features of the i-th entry stored in the long-range memory, namely the text features obtained by the encoder from the stored object category text information and the observed image features; and $\operatorname{TopK}\cos(\cdot,\cdot)$ indicates that cosine similarity is calculated and the top k memory pairs are selected as the retrieved memories and input into the multimodal large language model for decision-making;
Step 3: Reinforcement Fine-tuning of the Model
Step 3.1: Use the constructed multi-task reward function to fine-tune the multimodal large language model with reinforcement learning, so that the model can accurately call the memory retrieval tool and take the retrieved memory as additional input for final decision-making. The final output is the single-step action prediction S, the frontier exploration image selection F, and the question answer A. The multi-task reward function is a weighted sum of the action prediction accuracy reward, the frontier exploration image selection correctness reward, the answer accuracy reward, and the output format integrity reward.
2. The embodied exploration method based on reinforcement learning and long-range memory active retrieval according to claim 1, characterized in that each entry in the long-range episodic memory bank contains an image, object category text information, and location coordinates; features are extracted using a vision-language model, and spatiotemporal consistency filtering is performed.
3. The embodied exploration method based on reinforcement learning and long-range memory active retrieval according to claim 1, characterized in that the reward function introduces a logical consistency coefficient and a scaling factor; if the action direction predicted by the model is inconsistent with the direction of the selected frontier exploration image, the logical consistency coefficient attenuates the corresponding reward score; the scaling factor sets differentiated reward multipliers for "tool call successful" and "tool call failed" to adjust the sub-reward weights in scenarios with or without tool calls.
4. The embodied exploration method based on reinforcement learning and long-range memory active retrieval according to claim 1, characterized in that, The action prediction accuracy reward is given by the optimal action matching degree between the multimodal large language model's predicted action and the expert path; the frontier exploration image selection correctness reward is given by whether the next exploration direction selected by the multimodal large language model is consistent with the true direction of the expert path; the answer accuracy reward is given by the correctness of the multimodal large language model's answer to multiple-choice questions based on memory; and the output format integrity reward is given by the multimodal large language model generating a response that conforms to a specific reasoning format.
5. The embodied exploration method based on reinforcement learning and long-range memory active retrieval according to claim 1, characterized in that, In step 1.1, the high-similarity memory is filtered using feature similarity. The similarity scores of the k most recent samples are aggregated, their mean and standard deviation are calculated, and the context memory is dynamically filtered at each step according to the adaptive similarity threshold to obtain the filtered long-range memory bank, so as to effectively screen similar memories with excessive repetition.