A Reinforcement Learning Approach for Intelligent Agents Based on Value Feedback Shaping from Large Language Models
By constructing a value guidance system based on a large language model, and using it to provide heuristic value feedback to initialize and shape the agent's value network, the problem of reward sparsity in deep reinforcement learning is solved, and the agent achieves rapid convergence and stable policy performance in complex tasks.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- OCEAN UNIV OF CHINA
- Filing Date
- 2026-03-02
- Publication Date
- 2026-06-30
AI Technical Summary
In existing deep reinforcement learning, the environment often only provides rewards at the end of the task, resulting in interactive samples becoming inefficient data due to a lack of feedback. Furthermore, existing value shaping methods rely on manual rules, which are costly and have poor adaptability, making it difficult to adapt to the dynamic changes of complex environments.
We adopt an agent reinforcement learning method based on value feedback shaping using a large language model. By constructing a value guidance system, we use the large language model to provide heuristic value feedback to initialize the agent's value network and continuously shape it during training. We also combine temporal difference optimization to optimize the value network.
It significantly improves sample utilization efficiency, enabling agents to converge faster and obtain stable strategies in complex tasks. It has good adaptability and cross-scenario generalization ability, supporting task planning in virtual environments and operation of real robots.
Figure CN121745211B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of artificial intelligence and robot control technology, and in particular relates to an agent reinforcement learning method based on value shaping of large language models.
Background Technology
[0002] In recent years, deep reinforcement learning (DRL) has been gradually applied in robot control and automated decision-making tasks. Through the interactive learning mechanism between the agent and the environment, robots can adaptively perform complex skills in dynamic environments, demonstrating the potential to handle unstructured tasks. However, existing reinforcement learning still faces significant bottlenecks in the initialization and shaping of the value function, which are limited by the reward structure and sample utilization efficiency, directly affecting the learning stability and policy performance of the agent.
[0003] In traditional reinforcement learning frameworks, the rationality of the value function determines the agent's exploration direction and convergence speed; the quality of the reward signal and the efficiency of sample utilization are key factors for successful training. In many real-world tasks (such as robotic arm grasping, navigation, and multi-step decision-making), the environment often provides rewards only at the end of the task, leading to a severe "reward sparsity" problem in reinforcement learning. The agent struggles to quickly locate effective exploration paths amidst a large number of invalid states, and interactive samples become inefficient data due to a lack of feedback, resulting in a sharp decline in sample efficiency.
[0004] To alleviate the problem of reward sparsity, value shaping methods are widely used to narrow the exploration scope with prior knowledge. However, existing value shaping methods still face the following limitations: random initialization of the value function can easily lead to chaotic exploration in the early stages of training; insufficient coverage of small-sample supervised pre-training makes it difficult to adapt to diverse scenarios; and manually designed reward functions and heuristic rules not only rely on expert experience and have high development costs, but also struggle to match the dynamic changes and multi-constraint requirements of complex environments, resulting in insufficient adaptability and generality.
[0005] Furthermore, traditional value shaping relies on manual rules, which are costly and have poor adaptability. Existing methods require experts to manually design reward functions or heuristic rules, which are costly to develop and difficult to adapt to dynamic environmental changes. At the same time, the rules lack self-updating capabilities and struggle to remain effective in complex tasks.
Summary of the Invention
[0006] To address the technical problem that in deep reinforcement learning of intelligent agents, the environment often only provides rewards at the end of the task, resulting in interactive samples becoming inefficient data due to a lack of feedback, this invention proposes an intelligent agent reinforcement learning method based on value feedback shaping of a large language model, which can solve the above problem.
[0007] To solve the above-mentioned technical problems, the present invention adopts the following technical solution:
[0008] An agent reinforcement learning method based on value feedback shaping from a large language model includes:
[0009] Constructing a value guidance system based on a large language model;
[0010] The intelligent agent value network initialization step includes generating a state or state-action pair dataset based on the task scenario, converting it into a natural language description and inputting it into the large language model, the large language model outputting heuristic values, and initializing the intelligent agent value network based on the heuristic values;
[0011] The initialization steps for the agent reinforcement learning basic network are as follows: the agent reinforcement learning basic network is initialized based on the initialized agent value network.
[0012] The training steps of the agent are as follows: while the agent interacts with the environment according to the current strategy, every M time steps the current observation state and the action to be performed are converted into a natural language description and input into the large language model, which outputs a heuristic value; the interaction data is stored in the experience replay buffer.
[0013] The agent value network adaptive shaping step integrates heuristic value with the value generated by the value network itself, optimizes the output value of the agent value network through temporal difference, and updates the parameters of the reinforcement learning base network based on the optimized value.
[0014] Determine whether the agent has learned a stable policy. If so, output the optimal policy and end the training; otherwise, return to the agent training step.
[0015] In some embodiments, the agent value network includes a state value network and an action value network. In the agent value network initialization step, the large language model outputs heuristic state values based on the input states and heuristic action values based on the input state-action pairs.
[0016] In the initialization step of the agent reinforcement learning basic network, the agent reinforcement learning basic network is initialized according to the heuristic action value.
[0017] In some embodiments, the agent value network initialization step initializes the action-value network by minimizing the pre-training loss function:
[0018] $$L_{\mathrm{pre}}(\theta) = \mathbb{E}_{(o,a)\sim D}\left[\left(Q_\theta(o,a) - f(o,a)\right)^2\right];$$
[0019] where $(o, a)$ is a state-action pair input to the large language model, $f(o, a)$ is the heuristic action value output by the large language model based on the input information, $Q_\theta(o, a)$ is the action value output by the action value network for the state-action pair $(o, a)$, $D$ is the initialization dataset, and $\theta$ represents the action value network parameters.
[0020] In some embodiments, the interaction data in the agent training step is stored in the experience replay buffer in the format $(o, a, o', a', r)$, where $o$ is the current observation state, $a$ is the current action to be performed, $o'$ is the next state, $a'$ is an action that can be performed in the next state $o'$, and $r$ is the environmental reward fed back for the state-action pair $(o, a)$. The adaptive shaping step of the agent value network includes:
[0021] Calculate the temporal difference error TD:
[0022] $$\mathrm{TD} = r + \gamma\, Q_\theta(o', a') - Q_\theta(o, a);$$
[0023] where $Q_\theta(o', a')$ is the action value output by the action value network for the state-action pair $(o', a')$, and $Q_\theta(o, a)$ is the action value output by the action value network for the state-action pair $(o, a)$;
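[0024] Optimize the action value $Q_\theta(o, a)$ output by the action value network to obtain $Q'(o, a)$:
[0025] $$Q'(o, a) = Q_\theta(o, a) + \alpha\,\mathrm{TD} + \beta\, f(o, a);$$
[0026] where $\gamma$ is the discount factor, $\alpha$ is the learning rate, and $\beta$ is the shaping factor.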
[0027] In some embodiments, the agent value network adaptive shaping step further includes:
[0028] Optimize the state value output by the state value network to obtain $V'(o)$:
[0029] $$V'(o) = V(o) + \beta\, f(o);$$
[0030] where $V(o)$ is the state value output by the initial agent state value network, and $f(o)$ is the heuristic state value output by the large language model based on the input state $o$.
[0031] In some embodiments, the adaptive shaping step of the agent value network further includes dynamically decaying the shaping factor β until it decays to 0.
[0032] In some embodiments, the method further includes advantage assessment based on the optimized state value, including:
[0033] Based on the collected trajectory data, the temporal difference error term for each time step is calculated:
[0034] $$\delta_t = r_t + \gamma\,(1 - d_t)\,V'(o_{t+1}) - V'(o_t);$$
[0035] where $\delta_t$ is the temporal difference error at time step t, $r_t$ is the immediate environmental reward at time step t, and $d_t$ is the termination flag;
[0036] Based on $\delta_t$, the generalized advantage estimation method is used to calculate the generalized advantage estimate at time step t.
[0037] In some embodiments, the heuristic value output by the large language model is also normalized to the range [0,1].
[0038] In some embodiments, constructing a value guidance system based on a large language model includes:
[0039] Design prompt templates that include task scenario descriptions, state-action pair examples, and value definitions, and calibrate them using chain-of-thought reasoning and few-shot examples.
[0040] In some embodiments, the large language model is any one of GPT, Gemini, or Claude.
[0041] Compared with existing technologies, the advantages and positive effects of this invention are as follows: The agent reinforcement learning method based on large language model value shaping constructs a value guidance system based on a large language model. This system introduces a large language model to provide external evaluation feedback on the value of agent states or state-action pairs, providing heuristic value estimation for these states or pairs. This initialization of the value network before agent training reduces the inefficiency of early exploration. Furthermore, heuristic value feedback is continuously generated during agent training, dynamically shaping the value function. Unlike traditional methods that rely on sparse reward propagation, this invention provides immediate semantic guidance through language heuristics, significantly improving sample utilization efficiency and enabling agents to converge faster and obtain stable policies in complex tasks. This method can be widely applied to virtual environment task planning and real-world robot operation scenarios, exhibiting good adaptability and transferability.
[0042] This embodiment optimizes reinforcement learning strategies through two stages: value initialization and adaptive shaping. It combines heuristic values generated by a large language model with environmental feedback signals. This method significantly improves sample efficiency, alleviates training instability caused by reward sparsity, and achieves faster convergence and better policy performance in complex tasks. Compared to existing technologies, this embodiment introduces semantic guidance in the early exploration stage and integrates semantics and environmental feedback in the later training stage, resulting in stronger adaptability, stability, and cross-scenario generalization capabilities.
[0043] Other features and advantages of the present invention will become clearer after reading the detailed description of the embodiments of the present invention in conjunction with the accompanying drawings.
Description of the Drawings
[0044] Figure 1 is a flowchart of an embodiment of the intelligent agent reinforcement learning method proposed in this invention;
[0045] Figure 2 compares the success rate curves of TDQN, R Shaping, Q Shaping, and VIAS in Task 1, "Setting the Table," of Embodiment 2 of the intelligent agent reinforcement learning method proposed in this invention;
[0046] Figure 3 compares the success rate curves of TDQN, R Shaping, Q Shaping, and VIAS in Task 2, "Preparing Food," of Embodiment 2 of the intelligent agent reinforcement learning method proposed in this invention;
[0047] Figure 4 compares the success rate curves of TDQN, R Shaping, Q Shaping, and VIAS in Task 3, "Turning on the TV," of Embodiment 2 of the intelligent agent reinforcement learning method proposed in this invention;
[0048] Figure 5 shows the ablation experiment results for Task 1, "Setting the Table," of Embodiment 2 of the intelligent agent reinforcement learning method proposed in this invention, comparing the success rate curves of VIAS, AS, and VI;
[0049] Figure 6 shows the ablation experiment results for Task 2, "Preparing Food," of Embodiment 2 of the intelligent agent reinforcement learning method proposed in this invention, comparing the success rate curves of VIAS, AS, and VI;
[0050] Figure 7 shows the ablation experiment results for Task 3, "Turning on the TV," of Embodiment 2 of the intelligent agent reinforcement learning method proposed in this invention, comparing the success rate curves of VIAS, AS, and VI;
[0051] Figure 8 compares the cumulative reward curves during training of the AS′ method and the PPO method in Task 1 of Embodiment 4 of the intelligent agent reinforcement learning method proposed in this invention;
[0052] Figure 9 compares the cumulative reward curves during training of the AS′ method and the PPO method in Task 2 of Embodiment 4 of the intelligent agent reinforcement learning method proposed in this invention.
Detailed Implementation
[0053] The specific embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
[0054] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0055] Large Language Models (LLMs) are considered a potential source of value priors due to their advantages in semantic understanding, reasoning, and knowledge organization. However, there are key shortcomings when integrating LLMs with reinforcement learning: (1) The integration of LLMs and reinforcement learning lacks an effective adaptation mechanism, making it difficult to map language model knowledge to the agent's state space. (2) Some methods rely directly on LLMs to generate actions, causing the agent to lose its adaptability to the real environment. (3) Another approach uses an LLM as a fixed reward generator, which does not fully utilize its knowledge structure and has reliability defects of its own, making it difficult to guarantee training stability. (4) It is difficult for LLMs to provide long-term cumulative reward estimates for observation-action pairs, and they lack a mechanism for dynamic adjustment as training progresses, so they are hard to use as reliable value guidance.
[0056] To address the technical problem in existing deep reinforcement learning processes where the environment often only provides rewards at the end of the task, leading to inefficient data from interaction samples due to a lack of feedback, this invention proposes an agent reinforcement learning method based on value feedback shaping using a large language model. This method leverages the advantages of large language models in semantic understanding, reasoning, and knowledge organization as potential sources of value priors. However, applying large language models to agent reinforcement learning introduces the aforementioned new problems. The agent reinforcement learning method based on value feedback shaping using a large language model in this invention is designed to solve all of these problems simultaneously.
[0057] Embodiment 1: as shown in Figure 1, this invention proposes an agent reinforcement learning method based on value feedback shaping of a large language model, comprising:
[0058] A value guidance system based on a large language model is constructed. This system utilizes the large language model to provide heuristic value feedback for reinforcement learning algorithms. It can receive natural language descriptions converted from states or state-action pairs and evaluate the input information, i.e., output heuristic values. The generation of these heuristic values is based on the powerful information search capabilities of the large language model itself, and these heuristic values serve as expert evaluations to initialize the agent's value network.
[0059] The foundational network for reinforcement learning can be an action-value network (Q-network) or a state-value network (V-network), used to fit the agent's long-term rewards in the environment. Through this system design, the heuristic values generated by the large language model can be used for initialization before training or continuously shaped during training, thereby improving sample efficiency, policy stability, and semantic consistency.
[0060] The intelligent agent value network initialization steps include generating a state or state-action pair dataset based on the task scenario, converting it into a natural language description, inputting it into a large language model, outputting heuristic values from the large language model, and initializing the intelligent agent value network based on these heuristic values.
[0061] Before training begins, this process uses semantic prior information provided by the language model to help the agent identify high-value state-action pairs in the early stages of training, significantly reducing sample waste and policy bias caused by random exploration. Compared with traditional reinforcement learning, this invention introduces semantic guidance in the initialization phase, enabling the agent to have stronger target perception and policy directionality, thereby shortening the convergence time. A large language model is used to generate heuristic values for pre-collected states or state-action pairs, initializing the agent's value network to reduce ineffective exploration in the early stages of training.
[0062] The initialization steps for the agent reinforcement learning basic network are as follows: the agent reinforcement learning basic network is initialized based on the initialized agent value network.
[0063] In the agent training step, the agent interacts with the environment according to the current policy. Every M time steps, the current observation state and the action to be performed are converted into a natural language description and input into the large language model, which outputs heuristic values; the interaction data is stored in the experience replay buffer. This process ensures that the agent continuously obtains semantic-level auxiliary information during training, improving the stability and semantic consistency of policy updates.
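The periodic query described above can be sketched as follows. This is a minimal illustration, not the patented implementation: `heuristic_value` is a hypothetical stand-in for the large language model call, `describe` is a hypothetical text conversion, and storing the heuristic value alongside each transition is an assumption made here for convenience.

```python
from collections import deque

def heuristic_value(description: str) -> float:
    """Hypothetical stand-in for the LLM query; a real system would call a model here."""
    return 0.5

def describe(obs, action) -> str:
    """Hypothetical conversion of an observation-action pair into natural language."""
    return f"Observation: {obs}; Action: {action}"

def collect(env_steps, M, buffer):
    """Store every transition; attach a heuristic value only on every M-th time step."""
    for t, (o, a, o_next, a_next, r) in enumerate(env_steps):
        f = heuristic_value(describe(o, a)) if t % M == 0 else None
        buffer.append((o, a, o_next, a_next, r, f))
    return buffer
```

Querying only every M steps bounds the number of model calls per episode; transitions stored without a heuristic value can still be used for ordinary temporal difference updates.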
[0064] The current strategy refers to the control strategy formulated by the agent during normal training. The innovation of this approach lies in proposing an agent reinforcement learning method based on value feedback shaping from a large language model. That is, within the existing theoretical framework of agent reinforcement learning, it acts primarily on the value network: by improving and optimizing the value network, it leverages the intrinsic connection between the value network and the reinforcement learning base network to optimize the entire base network. The agent's control strategy can be implemented using existing methods. This approach focuses on the interaction between the large language model and agent reinforcement learning, namely how value feedback improves and shapes the agent's value network before and during training.
[0065] The agent value network adaptive shaping step integrates heuristic value with the value generated by the value network itself, optimizes the output value of the agent value network through temporal difference, and updates the basic network parameters of reinforcement learning based on the optimized value.
[0066] Determine whether the agent has learned a stable policy. If so, output the optimal policy and end the training; otherwise, return to the agent training step.
[0067] In reinforcement learning, an agent sets a flag for successful task completion within the environment. The success of the task can be assessed by considering the rationality of the trajectory composed of executed state-action pairs, and whether the curve plotted using reward values converges to a stable value as learning progresses. Methods for determining whether the agent has learned a stable policy can be implemented using existing technologies and will not be elaborated upon here.
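One simple way to test for the reward-curve convergence mentioned above is to compare the mean return over two consecutive windows of episodes. The sketch below is only an illustration of that idea; the function name, the window size, and the tolerance are assumptions, not values from this document.

```python
def is_stable(returns, window=20, tol=0.01):
    """Return True when the mean return of the last two windows differs by at most tol."""
    if len(returns) < 2 * window:
        return False  # not enough episodes to compare two windows
    recent = sum(returns[-window:]) / window
    previous = sum(returns[-2 * window:-window]) / window
    return abs(recent - previous) <= tol
```

In practice such a check would usually be combined with the task-success flag, since a reward curve can also plateau at a poor policy.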
[0068] This embodiment of the agent reinforcement learning method based on large language model value shaping constructs a value guidance system based on a large language model. The large language model provides external evaluation feedback on the value of agent states or state-action pairs, offering heuristic value estimates for these states or pairs. This initialization of the value network before agent training reduces the inefficiency of early exploration. Heuristic value feedback is continuously generated during agent training, dynamically shaping the value function. Unlike traditional methods relying on sparse reward propagation, this invention provides immediate semantic guidance through language heuristics, significantly improving sample utilization efficiency and enabling the agent to converge faster and obtain stable policies in complex tasks. This method can be widely applied to virtual environment task planning and real-world robot operation scenarios, exhibiting good adaptability and transferability.
[0069] This embodiment constructs a value guidance system based on a large language model, converting the entire task scenario and state-action pairs into natural language descriptions. The large language model, through its powerful semantic understanding, reasoning, and knowledge organization capabilities, can analyze the input information to derive its corresponding value. This achieves effective integration and adaptation between the large language model and reinforcement learning, enabling the effective mapping of language model knowledge to the agent's state space.
[0070] This approach uses a large language model to generate values that evaluate states or state-action pairs and feeds these values back to the agent's value network as external knowledge to aid in value initialization and shaping. Because the large language model does not directly generate actions, the agent does not lose its adaptability to the real environment. The approach also avoids using the large language model as a fixed reward generator, which underuses its knowledge structure, has inherent reliability flaws, and makes training stability difficult to ensure.
[0071] This scheme periodically collects states or state-action pairs during agent training, converts them into natural language descriptions, and inputs them into a large language model. The large language model outputs heuristic values, which are then combined with environmental rewards to continuously shape the value function. This enables the large language model to provide long-term cumulative reward estimates for observation-action pairs and to dynamically adjust as training progresses, serving as a reliable long-term value guide for agent training.
[0072] This embodiment optimizes reinforcement learning strategies through two stages: value initialization and adaptive shaping. It combines heuristic values generated by a large language model with environmental feedback signals. This method significantly improves sample efficiency, alleviates training instability caused by reward sparsity, and achieves faster convergence and better policy performance in complex tasks. Compared to existing technologies, this embodiment introduces semantic guidance in the early exploration stage and integrates semantics and environmental feedback in the later training stage, resulting in stronger adaptability, stability, and cross-scenario generalization capabilities.
[0073] In some embodiments, constructing a value guidance system based on a large language model includes:
[0074] Design prompt templates that include task scenario descriptions, state-action pair examples, and value definitions, and calibrate them using chain-of-thought reasoning and few-shot examples.
[0075] The abstract state of the environment is transformed into a natural language description, which is then input as a prompt into a large language model to generate a heuristic value. The natural language description is dynamically generated by filling in predefined text templates.
[0076] The prompt template includes task scenario descriptions, state-action pair examples, and value definitions. It is calibrated using chain-of-thought reasoning and few-shot examples to ensure the stability and reliability of the heuristic value output by the large language model. In experiments that evaluated VIAS on three progressively more complex tasks, each testing a different aspect of the agent's ability to parse language, understand object manipulability, and perform action sequences, this calibration significantly improved the consistency and numerical stability of the model output.
[0077] To improve system stability and computational efficiency, this embodiment employs a templated structure in the prompt design. The prompts include task background, observation description, action candidate set, and value output requirements, and are calibrated using few-shot examples. During training, the feedback results of the language model are cached; repeated queries of the same state-action pair directly reuse the cached results, significantly reducing computational cost and interaction latency.
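The caching behavior described above can be sketched with Python's standard `functools.lru_cache`. The function `query_llm` is a hypothetical placeholder that returns a constant so the example runs offline, and the `calls` counter exists only to make the cache hits observable.

```python
from functools import lru_cache

calls = {"n": 0}  # counts how often the (hypothetical) model is actually queried

def query_llm(prompt: str) -> float:
    """Hypothetical LLM call; replaced by a constant so the sketch runs offline."""
    calls["n"] += 1
    return 0.7

@lru_cache(maxsize=None)
def cached_value(observation: str, action: str) -> float:
    """Return the heuristic value for a state-action pair, querying the model at most once per pair."""
    prompt = f"Observation: {observation}; Action: {action}"
    return query_llm(prompt)
```

Keying the cache on the observation and action text rather than on the full prompt keeps hits stable even if the surrounding template wording is adjusted, at the cost of needing a cache reset when the template changes meaningfully.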
[0078] The agent reinforcement learning method in this embodiment guides reinforcement learning by incorporating semantic priors from a Large Language Model (LLM). Abstract states are converted into natural language by means of text templates, which are predefined text structures containing several state variable placeholders. When a state stored in memory is sampled, its key elements, including location, object, action, and spatial relationships, are used as variables to fill the text template, generating a natural language description. This natural language description is then input as a prompt into the Large Language Model to generate heuristic value feedback related to that state. The text template can be dynamically adjusted according to the task scenario to adapt to the state expression needs of different environments, thereby keeping the value feedback generated by the language model consistent with the actual task scenario.
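Template filling of this kind can be sketched with the standard library's `string.Template`. The template wording and the field names (`location`, `objects`, `relations`, `action`) are illustrative assumptions chosen to mirror the key elements listed above, not the templates used in the invention.

```python
from string import Template

# Hypothetical template; a real one would be tuned per task scenario.
STATE_TEMPLATE = Template(
    "Location: $location. Visible objects: $objects. "
    "Spatial relations: $relations. Proposed action: $action."
)

def describe_state(location, objects, relations, action):
    """Fill the placeholders with the key elements of a sampled state."""
    return STATE_TEMPLATE.substitute(
        location=location,
        objects=", ".join(objects),
        relations="; ".join(relations),
        action=action,
    )
```

For example, `describe_state("kitchen", ["plate", "table"], ["plate on counter"], "place the plate on the table")` yields a single sentence-style description that can be appended to the task background in the prompt.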
[0079] In some embodiments, the large language model can be any of, but is not limited to, GPT, Gemini, and Claude.
[0080] In some embodiments, the agent value network includes a state value network and an action value network. In the agent value network initialization step, the large language model outputs heuristic state values based on the input states and heuristic action values based on the input state-action pairs.
[0081] In the initialization step of the agent reinforcement learning basic network, the agent reinforcement learning basic network is initialized according to the value of heuristic actions to avoid the waste of samples caused by random exploration in the early stage of training.
[0082] In some embodiments, the heuristic value output by the large language model is also normalized to the range of [0,1].
[0083] In some embodiments, the specific implementation process of value initialization includes: the agent randomly interacts with the environment to generate N state-action pairs, from which an initialization dataset is constructed. Each observation-action pair is converted into a natural language description (e.g., "Observation: There are plates and a table in the kitchen; Action: Place the plates on the table") and input into a large language model; the output is calibrated with few-shot prompts that require the model to output a quantitative heuristic value in the range [0,1].
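A minimal sketch of building such an initialization dataset follows. The `score` argument is a hypothetical callable standing in for the large language model, and the min-max rescaling is an assumption: the text only requires that the values lie in [0,1], which could equally be enforced by the prompt itself.

```python
def normalize(values):
    """Min-max scale raw scores into [0, 1]; a constant list maps to 0.5."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def build_init_dataset(pairs, score):
    """pairs: list of (observation, action); score: callable returning a raw heuristic value."""
    descriptions = [f"Observation: {o}; Action: {a}" for o, a in pairs]
    raw = [score(d) for d in descriptions]
    return list(zip(pairs, normalize(raw)))
```

Each element of the returned list pairs a state-action pair with its normalized heuristic value, which is the form a supervised pre-training step on the action value network would consume.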
[0084] In some embodiments, the agent value network initialization step initializes the action-value network by minimizing the pre-training loss function:
[0085] $$L_{\mathrm{pre}}(\theta) = \mathbb{E}_{(o,a)\sim D}\left[\left(Q_\theta(o,a) - f(o,a)\right)^2\right].$$
[0086] where $(o, a)$ is a state-action pair input to the large language model, $f(o, a)$ is the heuristic action value output by the large language model based on the input information, $Q_\theta(o, a)$ is the action value output by the action value network for the state-action pair $(o, a)$, $D$ is the initialization dataset, and $\theta$ represents the parameters of the action value network. This initialization process can significantly improve the success rate in the early training phase and reduce exploration failures.
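As an illustration of pre-training a value function toward heuristic targets, the sketch below fits a linear action value function by gradient descent. It assumes a mean squared error loss and a hand-built feature vector per state-action pair; both are simplifying assumptions, and a real implementation would use a neural network and an optimizer from a deep learning library.

```python
def pretrain_q(features, targets, lr=0.1, epochs=500):
    """Fit Q(o, a) = theta . x to heuristic targets f by gradient descent on the MSE.

    features: one feature vector x per (o, a) pair (assumed encoding).
    targets:  the heuristic action value f for each pair.
    """
    n, d = len(features), len(features[0])
    theta = [0.0] * d
    for _ in range(epochs):
        grad = [0.0] * d
        for x, f in zip(features, targets):
            error = sum(w * xi for w, xi in zip(theta, x)) - f  # Q(o, a) - f
            for j in range(d):
                grad[j] += 2.0 * error * x[j] / n  # d(MSE)/d(theta_j)
        theta = [w - lr * g for w, g in zip(theta, grad)]
    return theta
```

After this step the value function already ranks state-action pairs roughly as the language model does, which is what lets early exploration favor promising actions.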
[0087] In some embodiments, the interaction data in the agent training step is stored in the experience replay buffer in the format $(o, a, o', a', r)$, where $o$ is the current observation state, $a$ is the current action to be performed, $o'$ is the next state, $a'$ is an action that can be performed in the next state $o'$, and $r$ is the environmental reward fed back for the state-action pair $(o, a)$. The action value network is optimized using a combination of temporal difference updates and heuristic value shaping. Specifically, the adaptive shaping step of the agent's value network includes:
[0088] Calculate the temporal difference error TD:
[0089] $$\mathrm{TD} = r + \gamma\, Q_\theta(o', a') - Q_\theta(o, a);$$
[0090] where $Q_\theta(o', a')$ is the action value output by the action value network for the state-action pair $(o', a')$, and $Q_\theta(o, a)$ is the action value output by the action value network for the state-action pair $(o, a)$.
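[0091] Optimize the action value $Q_\theta(o, a)$ output by the action value network to obtain $Q'(o, a)$:
[0092] $$Q'(o, a) = Q_\theta(o, a) + \alpha\,\mathrm{TD} + \beta\, f(o, a).$$
[0093] where $\gamma$ is the discount factor, $\alpha$ is the learning rate, and $\beta$ is the shaping factor.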
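The combination of a temporal difference update with heuristic shaping can be illustrated on a tabular value function. The sketch assumes that the shaping term enters additively, weighted by the shaping factor, and that transitions without a heuristic value receive a plain temporal difference update; the exact combination rule used in the invention may differ.

```python
def shaped_q_update(Q, transition, f, alpha=0.1, gamma=0.99, beta=0.5):
    """Apply one shaped update to a tabular Q stored as a dict keyed by (o, a).

    transition: (o, a, o_next, a_next, r) as stored in the replay buffer.
    f: heuristic action value for (o, a), or None when no query was made.
    """
    o, a, o_next, a_next, r = transition
    q_sa = Q.get((o, a), 0.0)
    td = r + gamma * Q.get((o_next, a_next), 0.0) - q_sa  # temporal difference error
    shaping = beta * f if f is not None else 0.0          # assumed additive shaping term
    Q[(o, a)] = q_sa + alpha * td + shaping
    return Q[(o, a)]
```

With the shaping factor set to 0 the function reduces to an ordinary on-policy temporal difference update, which is the behavior the method converges to once the shaping factor has decayed.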
[0094] Experiments have shown that using either the initialization or shaping module alone can improve performance, but the best results are achieved when the two are combined, enabling faster convergence in complex tasks.
[0095] In some embodiments, the agent value network adaptive shaping step further includes:
[0096] Optimize the state value output by the state value network to obtain $V'(o)$:
[0097] $$V'(o) = V(o) + \beta\, f(o).$$
[0098] where $V(o)$ is the state value output by the initial agent state value network, and $f(o)$ is the heuristic state value output by the large language model based on the input state $o$.
[0099] In some embodiments, the adaptive shaping step of the agent value network further includes dynamically decaying the shaping factor β until it decays to 0.
[0100] By dynamically decaying the shaping factor β, the impact of large language model feedback on agent training is reduced, allowing the agent to rely more on environmental signals to achieve policy convergence. The shaping factor gradually decays during training according to a pre-defined strategy, maintaining a high weight initially to fully utilize the semantic guidance of the language model, and then gradually decreasing it later to enhance the dominant role of environmental feedback. This design avoids excessive interference from the language model on the policy in the later stages of training, ensuring the stability and generalization ability of the policy.
[0101] Action value shaping is particularly suitable for reinforcement learning algorithms that require explicit computation and iterative updates of action values. It stores the state-action pairs generated by the agent's interaction with the environment and uses a large language model to generate heuristic action values, based on their potential long-term rewards, for shaping. State value shaping is particularly suitable for reinforcement learning algorithms that implicitly estimate action values through state value functions and advantage functions. It collects only the states generated by the agent's interaction with the environment and uses a large language model to generate heuristic state values, based on the potential long-term rewards of those states, for shaping. In this way, different types of algorithms can all benefit from language priors.
[0102] In some embodiments, the method further includes advantage assessment based on the optimized state value, including:
[0103] Based on the collected trajectory data, the temporal difference error term for each time step is calculated:
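[0104] $\delta_t = r_t + \gamma (1 - d_t)\, V'(o_{t+1}) - V'(o_t)$.
[0105] where $\delta_t$ is the temporal difference error at time step $t$, $r_t$ is the immediate environmental reward at time step $t$, $V'(\cdot)$ is the optimized state value, and $d_t$ is a termination flag: it is 1 if the current trajectory has ended, and 0 otherwise.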
[0106] The advantage is computed with the Generalized Advantage Estimation (GAE) method, using the aforementioned fused state values $V'$ for advantage estimation, which improves the stability of low-level policy updates.
[0107] From $\delta_t$, the generalized advantage estimate for time step $t$ is calculated according to the GAE method, specifically:
[0108] $\hat{A}_t = \delta_t + \gamma \lambda (1 - d_t)\, \hat{A}_{t+1}$;
[0109] where $\hat{A}_t$ is the generalized advantage estimate for time step $t$; $\lambda$ is the bias-variance adjustment parameter in GAE, used to balance the accuracy and stability of advantage estimation; the product $\gamma\lambda$ constitutes the decay coefficient in GAE; $d_t$ is the termination flag at time step $t$; and $\hat{A}_{t+1}$ is the advantage value for the next time step (calculated sequentially in the recursion).
[0110] Based on the advantage value $\hat{A}_t$ and the fused state value function $V'(o_t)$, the expected return of the current state-action pair can be estimated:
[0111] $Q(o_t, a_t) = \hat{A}_t + V'(o_t)$.
[0112] where $Q(o_t, a_t)$ is the expected return estimate at time step $t$, that is, the Q value calculated from the state value (the Q value need not be output explicitly by a neural network; it is approximated from the V value and the advantage value A).
[0113] This leads to the direct expression of the advantage function:
[0114] $A(o_t, a_t) = Q(o_t, a_t) - V'(o_t)$.
[0115] where $A(o_t, a_t)$ is the advantage function value of action $a_t$ under observation $o_t$.
[0116] Through the above calculation process, advantage estimation is carried out using state values fused with the language prior. This can effectively alleviate the unstable value estimation caused by sparse or delayed rewards in long-horizon tasks, making the gradient of the low-level policy network more stable and lower in variance during updates, thereby improving the convergence speed and the final policy performance.
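The advantage calculation above can be sketched with a backward pass over one trajectory. This is a minimal sketch under stated assumptions: the state values passed in are taken to be the already fused values, the termination flag is applied in both the temporal difference term and the recursion (a common convention), and the function name `gae`, the default `lam=0.95`, and the sample numbers are hypothetical.

```python
import numpy as np

def gae(rewards, values, next_values, dones, gamma=0.99, lam=0.95):
    """Return (deltas, advantages, q_estimates) for one trajectory.

    values[t] and next_values[t] stand for the fused state values of the
    observations at steps t and t+1; dones[t] is 1.0 if the trajectory
    ends at step t, else 0.0.
    """
    rewards, values, next_values, dones = (
        np.asarray(x, dtype=float) for x in (rewards, values, next_values, dones)
    )
    # Per-step temporal difference error; the bootstrap term is dropped at termination.
    deltas = rewards + gamma * (1.0 - dones) * next_values - values
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        # Recursion: A_t = delta_t + gamma * lambda * (1 - d_t) * A_{t+1}
        running = deltas[t] + gamma * lam * (1.0 - dones[t]) * running
        advantages[t] = running
    q_estimates = advantages + values        # Q is approximated as A + V
    return deltas, advantages, q_estimates

deltas, adv, q_est = gae(rewards=[0.0, 0.0, 1.0],
                         values=[0.5, 0.6, 0.8],
                         next_values=[0.6, 0.8, 0.0],
                         dones=[0.0, 0.0, 1.0])
```

Because the Q estimate is formed as the advantage plus the state value, no separate Q network is needed in this setting.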
[0117] The method in this embodiment supports two categories of reinforcement learning algorithms: off-policy and on-policy. For off-policy algorithms (such as DQN), an experience replay mechanism is used to randomly sample batches of data, and the state-action value network parameters are adjusted by combining temporal difference updates and heuristic value shaping.
[0118] For on-policy algorithms (such as PPO), a state value shaping approach is adopted. Heuristic state values are obtained through a large language model and introduced into the generalized advantage estimation to optimize the policy network parameters. The estimation of the advantage function depends not only on the environmental reward and the value function but also on the V-value generated by the language model, thereby improving the policy's perception of long-term goals. This design is applicable to continuous action spaces and complex policy optimization scenarios, ensuring that the method can operate effectively under various reinforcement learning frameworks. Experiments in robotic arm "grasping" and "placing" tasks demonstrate that state value shaping significantly improves policy stability and final reward.
[0119] Preferably, the method supports multi-scenario adaptation: in virtual environment tasks (such as "setting the table," "preparing food," and "turning on the TV" tasks in VirtualHome), the natural language descriptions of observations and actions are directly generated based on the built-in objects and spatial relationships in the environment; in real-world robot tasks (such as "grasping" and "placing" tasks by a robotic arm), the robotic arm's state information needs to be converted into natural language descriptions, input into a large language model to obtain targeted heuristic value, and assist in policy correction. Experimental results show that the policies trained in the simulation environment can be directly transferred to the real-world robot platform, and the task success rate remains above 90% after the transfer.
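For the real-robot case, the conversion of state information into a natural language query might look like the sketch below. Everything in it is hypothetical: the state fields (`gripper_open`, `holding`, `distance_to_target_cm`), the function names, and the prompt wording are invented for illustration, and no language model API is called. The 0 to 1 range in the final instruction follows the normalization of heuristic values mentioned in the claims.

```python
def describe_arm_state(gripper_open, holding, distance_to_target_cm):
    """Render a (hypothetical) robotic-arm state as a short natural-language description."""
    gripper = "open" if gripper_open else "closed"
    held = f"holding the {holding}" if holding else "holding nothing"
    return (f"The gripper is {gripper} and {held}. "
            f"The end effector is {distance_to_target_cm:.0f} cm from the target.")

def build_value_prompt(task, observation_text, action_text=None):
    """Assemble a value query: pass an action for action values, omit it for state values."""
    lines = [f"Task: {task}", f"Observation: {observation_text}"]
    if action_text is not None:
        lines.append(f"Proposed action: {action_text}")
    lines.append("Rate the long-term value of this situation as a number between 0 and 1.")
    return "\n".join(lines)

prompt = build_value_prompt("grasp the cup",
                            describe_arm_state(True, None, 12.0),
                            "close the gripper")
```

In a virtual environment the observation text would instead come from the environment's built-in object and spatial descriptions, so only `build_value_prompt` would be needed.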
[0120] The agent-based reinforcement learning method in this embodiment is insensitive to the choice of large language model. Experimental results show that using GPT-4o, Gemini-1.5, or Claude-3.5 can improve sample efficiency and success rate. Different models only affect the convergence speed, but all can ensure the stability and applicability of the method.
[0121] Compared with the prior art, the present invention has the following beneficial effects:
[0122] 1. Significantly improves sample efficiency. The value initialization module provides a "warm start" advantage, avoiding ineffective exploration in the early stages of reinforcement learning, resulting in a significant improvement in sample efficiency for complex tasks. Experiments show that in the "turn on the TV" task, the VIAS method improves sample efficiency by more than 40% compared to the traditional TDQN.
[0123] 2. Strong dynamic adaptability. The gradual decay of the shaping factor β balances the semantic priors of the language model against real feedback from the environment, gradually reducing the influence of the language model so that the agent eventually relies on environmental signals to complete policy convergence.
[0124] 3. Excellent generalization ability. It supports cross-domain transfer between virtual and real-world environments. Policies pre-trained in the simulation environment can be directly transferred to real robots without large-scale retraining. Experiments show that in the robotic arm's "setting the table" task, the transferred policy achieved a 100% success rate.
[0125] 4. Wide adaptability. It can be flexibly integrated into off-policy and on-policy algorithms, supports the replacement of different large language models, and is applicable to various reinforcement learning tasks ranging from low-dimensional discrete actions to high-dimensional continuous actions.
[0126] Example 2: To further verify the effectiveness of the method of the present invention, this example conducts experiments in the VirtualHome simulation environment. The VirtualHome environment includes multiple rooms such as the kitchen and living room, as well as interactive objects (such as plates, dining tables, refrigerators, televisions, etc.). The task scenarios are all described in natural language, requiring the agent to parse semantics and complete multi-step actions in a complex environment. This environment can simulate multi-step tasks in real home scenarios, has semantic complexity and action sequence dependencies, and is suitable as a verification platform.
[0127] This embodiment designs three representative tasks: (1) Setting up the dining table task: requiring the agent to place objects such as plates and wine glasses on the designated dining table surface, examining the agent's understanding of object recognition and placement actions; (2) Preparing food task: requiring the agent to take out food and tableware from the refrigerator or container and complete the placement operation, examining the agent's performance in multi-object interaction and sequential actions; (3) Turning on the TV task: requiring the agent to find the TV and turn it on, while preparing snacks, examining the agent's multi-step reasoning and action execution capabilities in complex scenarios.
[0128] In the experimental setup, several baseline methods were selected for comparison, including the traditional Templated Deep Action Value Network (TDQN), Reward Shaping (R Shaping), Action Value Shaping (Q Shaping), and the agent reinforcement learning method based on large language model value feedback shaping (VIAS) proposed in this invention. All methods used the same network structure and parameter settings to ensure fairness in the comparison. Specific hyperparameter settings are as follows: discount factor γ was set to 0.99; the initial exploration rate ε was 0.1, gradually decreasing to 0.01 during training; the shaping factor β was initially 0.5, decreasing by 0.05 every 10,000 steps until it approached 0 during the convergence phase.
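The two schedules in this setup can be written down directly. In this sketch the stepwise decay of β uses the stated values (0.5 initially, minus 0.05 every 10,000 steps, floored at 0). The linear shape of the ε decay and the `total_steps` parameter are assumptions, since the text only says that ε decreases gradually from 0.1 to 0.01; the function names are hypothetical.

```python
def shaping_factor(step, beta0=0.5, drop=0.05, interval=10_000):
    """Stepwise decay of the shaping factor beta, floored at 0."""
    return max(0.0, beta0 - drop * (step // interval))

def exploration_rate(step, total_steps, eps_start=0.1, eps_end=0.01):
    """Linear decay of epsilon from eps_start to eps_end (assumed schedule shape)."""
    frac = min(1.0, step / total_steps)
    return eps_start + frac * (eps_end - eps_start)

# Shaping factor at a few checkpoints during training.
checkpoints = [0, 25_000, 100_000]
betas = [shaping_factor(s) for s in checkpoints]
```

Under these numbers β reaches 0 after ten decrements, that is, at 100,000 steps, after which the update is driven by environmental feedback alone.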
[0129] As shown in Figures 2 to 4, the method in this embodiment exhibits higher success rates and faster convergence across all three tasks. Particularly in the complex task of "turning on the TV," the method of this invention can quickly identify the optimal policy, significantly outperforming TDQN and reward shaping methods, demonstrating the complementary advantages of value initialization and adaptive shaping. Experimental data show that the VIAS method achieves a high success rate early in training and significantly outperforms the baseline methods in convergence speed, demonstrating the guiding role of the heuristic value provided by the language model in complex tasks.
[0130] As shown in Figures 5 to 7, the combined method using value initialization and adaptive shaping (VIAS) is compared with value initialization alone (VI) and adaptive shaping alone (AS). Experimental results show that VI, as the initialization method, significantly improves convergence speed in the early stages of training, reflected in the rapid rise of the success rate curve. This mechanism effectively reduces the agent's ineffective exploration in the initial stage by providing high-quality prior behavioral guidance. In contrast, AS, as a continuous training aid, does not significantly reduce the proportion of ineffective exploration, but it plays a crucial accelerating role in the middle stages of training, helping the agent optimize the policy more efficiently and thereby improving the overall efficiency and stability of the training process. Therefore, the combined method VIAS, compared with either component alone, both reduces the agent's ineffective exploration in the initial stage and improves sample efficiency during training, resulting in lower training costs and faster policy convergence.
[0131] This embodiment verifies the effectiveness of the method of the present invention in a simulation environment. Compared with directly using the language model as the policy, the method of the present invention can make full use of the semantic prior of the language model, and combine environmental feedback for value shaping, thereby achieving higher sample efficiency and better policy performance in complex tasks. Experimental results further demonstrate that the method of the present invention has stronger robustness and generalization ability in multi-step tasks, and can stably converge and output high-quality policies in complex semantic environments.
[0132] Example 3, as shown in Table 1, further verifies the performance of the method of the present invention under different large language models in the VirtualHome simulation environment. To ensure the fairness of the experiment, except for replacing the language model, all other training parameters and network structures are kept consistent. The hyperparameter settings are the same as in Example 2, including discount factor γ=0.99, exploration rate ε initial value of 0.1 and gradually decaying to 0.01, and shaping factor β initial value of 0.5 and gradually decaying during training.
[0133] Table 1: Success rates of the VIAS framework combined with GPT-4o, Gemini 1.5, and Claude 3.5 respectively.
[0134]
Task | GPT-4o | Gemini 1.5 | Claude 3.5
Task 1 (Setting the Table) | 1.00±0.00 | 1.00±0.00 | 1.00±0.00
Task 2 (Prepare Food) | 0.99±0.01 | 1.00±0.00 | 0.95±0.06
Task 3 (Turn on the TV) | 0.91±0.09 | 0.96±0.06 | 0.94±0.07
[0135] In the experiment, three large language models—GPT-4o, Gemini-1.5, and Claude-3.5—were selected as heuristic value generators. While they differ in architecture and inference mechanisms, all possess strong natural language understanding and reasoning capabilities. By comparing the training results under different models, the sensitivity and robustness of the proposed method in language model selection can be evaluated.
[0136] Experimental results show that all three language models significantly improve the learning efficiency and task success rate of the agent. Gemini-1.5 performs best in the complex task of "turning on the TV," achieving a success rate of 96% and relatively fast convergence. GPT-4o and Claude-3.5 achieve success rates of 91% and 94%, respectively, slightly lower than Gemini-1.5, but their overall performance is still significantly better than the baseline method that does not use a language model. The results indicate that different language models exhibit some differences in performance on specific tasks, but none lead to a significant performance decrease, demonstrating that the method of this invention is insensitive to the choice of language model and possesses good robustness and universality.
[0137] In summary, this embodiment verifies the effectiveness of the method of the present invention under different language models, demonstrating its good robustness and generalization ability. Regardless of the large language model used, the method of the present invention can improve sample efficiency and success rate in complex tasks, ensuring stable convergence of the strategy. This result shows that the method of the present invention does not depend on a specific language model in practical applications, and can flexibly adapt to different models, further enhancing the versatility and practical value of the method.
[0138] Example 4: As shown in Figures 8 and 9, this embodiment verifies the effectiveness of the method of the present invention on a physical robot platform. The experiment consists of two parts: a navigation robot experiment and a robotic arm experiment. By testing on different types of robot platforms, this embodiment further demonstrates the applicability and stability of the method of the present invention in the transfer from simulation to reality.
[0139] The first part is the navigation robot experiment. This navigation robot has movement and positioning capabilities, enabling it to perform path planning and object interaction in an indoor environment. In this embodiment, the policy trained in the VirtualHome simulation environment is directly transferred to the navigation robot, and a table-setting task is performed in a real-world scenario. The specific task requires the robot to start from its initial position and sequentially complete the action sequence of "move to plate—grab plate—move to table—place plate—move to cup—grab cup—move to table—place cup". Experimental results show that the transferred policy can successfully complete the table-setting task in a real-world environment, with the action execution order consistent with the simulation environment. Furthermore, no disconnect between actions and semantics occurred in multiple repeated experiments, demonstrating the effectiveness and stability of the proposed method in transferring from simulation to reality. Compared to traditional reinforcement learning methods, the proposed method does not require large-scale retraining during the transfer process, significantly reducing the cost of real-world deployment.
[0140] The second part is the robotic arm experiment. The experimental platform is a 6-DOF robotic arm equipped with a two-finger gripper and an RGB camera. The tasks include "Grab" and "Put". In the Grab task, the robotic arm starts from a fixed initial posture, drives the end effector to contact the target object and closes the gripper to complete the grasp. In the Put task, the robotic arm moves the grasped object to a designated position and opens the gripper to complete the placement. This embodiment uses the PPO algorithm as the baseline method and introduces a State Value Shaping Module (AS') into the method of this invention, that is, querying the heuristic V value corresponding to the state and introducing this heuristic feedback into the advantage function calculation. The training parameters are set as follows: learning rate 0.0003, discount factor γ=0.99, batch size 64, maximum training steps 12500, policy entropy weight α=0.01.
[0141] Experimental results show that in the "grasping" and "placement" tasks, the method of this invention converges faster and the final reward is significantly higher than the baseline PPO algorithm. After training, the robotic arm can stably execute complete action sequences with a 100% success rate, while the baseline method still experiences failures with the same number of training steps, manifested as gripper position deviation or object placement errors. Further analysis shows that the method of this invention can effectively utilize the heuristic state value provided by the language model during training, reducing training instability caused by reward sparsity, and maintaining policy consistency and robustness in complex action sequences.
[0142] In summary, this embodiment verifies the effectiveness of the method of the present invention on physical robots. By directly transferring the simulation strategy to the navigation robot and introducing state value shaping into the robotic arm, the method of the present invention not only performs excellently in the simulation environment but also can be transferred to real-world scenarios, ensuring the stability of the strategy and the task completion rate. These results demonstrate that the method of the present invention has good adaptability and promotional value in cross-platform and cross-task applications, providing a new technical path for the deployment of reinforcement learning in practical robotic systems.
[0143] Of course, the above description is not intended to limit the present invention, and the present invention is not limited to the examples given above. Any changes, modifications, additions or substitutions made by those skilled in the art within the scope of the present invention should also fall within the protection scope of the present invention.
Claims
1. A reinforcement learning method for intelligent agents based on value feedback shaping from large language models, characterized in that it comprises:
constructing a value guidance system based on a large language model;
an agent value network initialization step: generating a dataset of states or state-action pairs based on the task scenario, converting it into natural language descriptions, and inputting them into the large language model; the large language model outputs heuristic values, and the agent value network is initialized based on the heuristic values;
an agent reinforcement learning base network initialization step: initializing the agent reinforcement learning base network based on the initialized agent value network;
an agent training step: while the agent interacts with the environment according to the current policy, every M time steps the agent converts the current observation state and the action to be performed into a natural language description and inputs it into the large language model; the large language model outputs a heuristic value, and the interaction data are stored in the experience replay buffer;
an agent value network adaptive shaping step: fusing the heuristic value with the value generated by the value network itself, optimizing the output value of the agent value network through temporal difference, and updating the parameters of the reinforcement learning base network based on the optimized value;
determining whether the agent has learned a stable policy; if so, outputting the optimal policy and ending the training; otherwise, returning to the agent training step;
wherein the agent value network includes a state value network and an action value network;
in the agent value network initialization step, the large language model outputs heuristic state values based on input states and heuristic action values based on input state-action pairs;
in the agent reinforcement learning base network initialization step, the agent reinforcement learning base network is initialized according to the heuristic action value;
in the agent training step, the interaction data are stored in the experience replay buffer in the format $(o, a, o', a', r)$, where $o$ is the current observation state, $a$ is the current action to be performed, $o'$ is the next state, $a'$ denotes the actions that can be performed in the next state $o'$, and $r$ is the environmental reward fed back after the agent reinforcement learning base network acts on the state-action pair $(o, a)$;
the agent value network adaptive shaping step includes:
calculating the temporal difference error TD: $\mathrm{TD} = r + \gamma \max_{a'} Q_\theta(o', a') - Q_\theta(o, a)$; where $Q_\theta(o', a')$ is the action value output by the action value network for the state-action pair $(o', a')$, and $Q_\theta(o, a)$ is the action value output by the action value network for the state-action pair $(o, a)$;
optimizing the action value $Q_\theta(o, a)$ output by the action value network to obtain $Q'(o, a)$: ; where $\gamma$ is the discount factor, $\alpha$ is the learning rate, and $\beta$ is the shaping factor.
2. The agent reinforcement learning method according to claim 1, characterized in that, in the agent value network initialization step, the action value network is initialized by minimizing a pre-training loss function: ; where $(o, a)$ is the state-action pair input to the large language model, $f(o, a)$ is the heuristic action value output by the large language model based on the input information, $Q_\theta(o, a)$ is the action value output by the action value network for the state-action pair $(o, a)$, $\mathcal{D}$ is the initialization dataset, and $\theta$ denotes the action value network parameters.
3. The agent reinforcement learning method according to claim 1, characterized in that the agent value network adaptive shaping step further includes: optimizing the state value $V(o)$ output by the state value network to obtain $V'(o)$: ; where $V(o)$ is the state value output by the initial agent state value network, and $V_{\mathrm{LLM}}(o)$ is the heuristic state value output by the large language model based on the input state $o$.
4. The agent reinforcement learning method according to claim 1, characterized in that the agent value network adaptive shaping step further includes dynamically decaying the shaping factor β until it decays to 0.
5. The agent reinforcement learning method according to claim 3, characterized in that it further includes advantage estimation based on the optimized state value, including: based on the collected trajectory data, calculating the temporal difference error term for each time step: $\delta_t = r_t + \gamma (1 - d_t)\, V'(o_{t+1}) - V'(o_t)$; where $\delta_t$ is the temporal difference error at time step $t$, $r_t$ is the immediate environmental reward at time step $t$, $V'(\cdot)$ is the optimized state value, and $d_t$ is the termination flag; and, from $\delta_t$, calculating the generalized advantage estimate at time step $t$ using the generalized advantage estimation method.
6. The agent reinforcement learning method according to claim 1, characterized in that it further includes normalizing the heuristic values output by the large language model to the range [0, 1].
7. The agent reinforcement learning method according to any one of claims 1-6, characterized in that constructing a value guidance system based on a large language model includes: designing prompt templates that include task scenario descriptions, state-action pair examples, and value definitions, and calibrating them using chain-of-thought prompting and few-shot examples.
8. The agent reinforcement learning method according to any one of claims 1-6, characterized in that the large language model is any one of GPT, Gemini, or Claude.