Embodied agent training and reasoning method based on conditional diffusion world model and thought chain
By constructing a conditional diffusion world model and an embodied agent training method based on thought chains, the problems of low sample efficiency, inconsistent action generation, and resource-constrained deployment in existing technologies are solved, achieving efficient and reliable decision-making for complex tasks and lightweight deployment.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ZHEJIANG UNIV OF TECH
- Filing Date
- 2026-01-30
- Publication Date
- 2026-06-19
AI Technical Summary
Existing reinforcement learning algorithms rely on a large number of real interactions, resulting in low sample efficiency and high training costs. Visual-language-action models exhibit inconsistent action generation and opaque logical reasoning in complex tasks, have weak generalization ability in new scenarios, and large models are difficult to deploy on resource-constrained terminals.
An integrated action-world model is constructed, employing a conditional diffusion strategy and a two-layer retrieval-enhanced thought chain reasoning module. Cascaded retrieval is performed through a logical reasoning library and a physical execution library. The conditional diffusion strategy generates logically consistent and physically feasible action sequences, and the teacher model is transferred to a lightweight student model through joint knowledge distillation.
It achieves high-sample-efficiency strategy optimization, improves decision reliability and success rate in complex tasks, is suitable for efficient deployment in resource-constrained scenarios, and has unified multimodal understanding and action generation capabilities.
[Figure: abstract drawing, CN122242564A]
Description
Technical Field
[0001] This invention belongs to the field of artificial intelligence technology, and specifically relates to a training and reasoning method for embodied agents based on a conditional diffusion world model and a thought chain.
Background Technology
[0002] In reinforcement learning and multimodal modeling research, embodied agents acquire observations, perform actions, and receive reward signals by interacting with the environment, with the goal of learning optimal policies to maximize cumulative rewards. However, traditional reinforcement learning algorithms, such as proximal policy optimization (PPO) and asynchronous advantage actor-critic (A3C) based on policy gradients, or deep Q-networks (DQN) and soft actor-critic (SAC) based on value functions, all rely on a large number of real interactions, resulting in low sample efficiency and high training costs.
[0003] To overcome this problem, researchers proposed the concept of a World Model. World Models enable embodied agents to undergo "imagination training" within their internal latent space by learning environmental dynamics and reward functions, thereby reducing reliance on interactions with the real environment. Representative works include PlaNet and the DreamerV1-V4 series models. This paper employs a core training strategy combining World Models and Interactive Dynamics, enabling the model to predict future states with shorter sampling steps while maintaining high-accuracy reward estimation. The latent space of the World Model utilizes a differentiable tokenizer and a Transformer dynamics network, supporting long-sequence predictions of both vision and action.
[0004] Despite significant progress in world modeling, existing solutions still have several limitations: First, in terms of reasoning ability, traditional Vision-Language-Action (VLA) models are prone to problems such as disordered task decomposition, accumulation of errors in preceding actions, and lack of logical decision-making in complex scenarios when handling long-duration / multi-step tasks. Second, the core requirement of VLA models is to generate actions that "fit multimodal inputs and conform to physical laws," but traditional action models are prone to problems such as discontinuous action sequences, violation of physical constraints (such as collisions and loss of force control), and poor spatial adaptation of continuous actions. Third, VLA models rely on large-scale data training, but real-world scenarios present new environments, niche tasks, and scarce samples, resulting in poor model generalization ability and the inability to reuse historical high-quality experience. Fourth, VLA models typically employ a large number of parameters to support multimodal understanding and complex reasoning, but the robot's local end suffers from limited memory, weak computing power, and energy sensitivity, making large models difficult to deploy in practice.
[0005] Furthermore, existing autoregressive strategies mostly employ discrete action prediction and lack the ability to model continuous control, making it difficult to maintain stability in embodied applications such as robotics and complex control scenarios. Regarding interpretability, the lack of an explicit reasoning mechanism makes the model's decision-making process opaque, further limiting its application in safety-sensitive scenarios.
Summary of the Invention
[0006] In view of the above, the purpose of this invention is to provide an embodied agent training and reasoning method based on a conditional diffusion world model and a thought chain. By constructing an integrated action-world model, a collaboratively trained conditional diffusion strategy, and a thought chain reasoning module with dual-layer retrieval enhancement, an embodied agent system with unified multimodal understanding, world imagination, action generation, and knowledge reasoning capabilities is built. This enables high-sample-efficiency strategy optimization, high-reliability decision-making for complex tasks, and efficient deployment in resource-constrained scenarios.
[0007] To achieve the above objectives, the present invention provides the following technical solution:
In a first aspect, embodiments of the present invention provide an embodied agent training and reasoning method based on a conditional diffusion world model and a thought chain, comprising the following steps:
An action-world model is constructed and pre-trained. It includes an action model that predicts actions from the image observation history and language instructions, and a world model that predicts future states from the image observation history and action sequences.
A conditional diffusion strategy model is constructed, which takes latent visual features, language instructions, and thought chains as composite conditions and generates accurate action sequences through a reverse denoising process.
A two-layer retrieval knowledge base containing a logical reasoning library and a physical execution library is constructed, and cascaded retrieval is performed. The first stage retrieves reference thought chains based on latent visual features and language instructions. The second stage combines the reference thought chains with the latent visual features to retrieve reference physical trajectories.
The reference thought chain and the reference physical trajectory are fed as composite conditions into the conditional diffusion strategy model to generate actions consistent with logical reasoning and physical constraints, and imagination training is performed in the latent space constructed by the pre-trained action-world model to complete the policy optimization training of the embodied agent.
The action-world model fused with the conditional diffusion strategy model serves as the teacher model and is transferred to a lightweight student model through joint knowledge distillation, forming the final deployable embodied agent for efficient inference.
[0008] Preferably, the pre-training of the action-world model includes:
A masked autoencoder framework is used to train the tokenizer that processes visual input in the action-world model, so that image observations are encoded into compact latent visual features. The tokenizer is trained in a mask-reconstruction self-supervised manner, and its loss function is a weighted sum of mean squared error loss and LPIPS perceptual loss, balancing pixel-level reconstruction accuracy and visual perception consistency.
An efficient Transformer framework is used to train the dynamics model that simulates environmental dynamics in the action-world model, so as to learn robust state transition rules in the latent space. The dynamics model is trained by adding controllable noise to the target state and reconstructing the state through denoising. The training loss function is adjusted dynamically according to the prediction step size, and a ramp weight function is introduced to strengthen the guiding role of high-quality data, improving the model's robustness to noise and interference.
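[0009] Preferably, in the first stage, when retrieving the reference thought chain based on latent visual features and language instructions, a retrieval accuracy loss is used for training, as follows:
$$\mathcal{L}_{\text{ret}} = -\log \frac{\exp\left(\operatorname{sim}(q_t, k^{+})/\tau\right)}{\exp\left(\operatorname{sim}(q_t, k^{+})/\tau\right) + \sum_{i} \exp\left(\operatorname{sim}(q_t, k_i^{-})/\tau\right)}$$
where $\mathcal{L}_{\text{ret}}$ is the first-stage retrieval accuracy loss, $q_t$ is the query vector at the current time step $t$, $k^{+}$ is the positive sample key vector in the knowledge base, $k_i^{-}$ is the $i$-th negative sample candidate key vector, $\operatorname{sim}(\cdot,\cdot)$ is the similarity function, and $\tau$ is the temperature parameter.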
[0010] Preferably, in the second stage, when retrieving the reference physical trajectory by combining the reference thought chain and latent visual features, a retrieval accuracy loss combined with a logic-action consistency penalty is used for training, as follows:
$$\mathcal{L}_{\text{stage2}} = \mathcal{L}_{\text{ret}} + \lambda\,\mathcal{L}_{\text{cons}}, \qquad \mathcal{L}_{\text{cons}} = \left\lVert f_{\text{param}}(\xi^{\text{ref}}) - g(r_t) \right\rVert_2$$
where $\mathcal{L}_{\text{ret}}$ is the same retrieval accuracy loss as in the first stage, $\mathcal{L}_{\text{cons}}$ is the logic-action consistency penalty term, $\lambda$ is the balancing weight, $\xi^{\text{ref}}$ is the retrieved reference physical trajectory, $f_{\text{param}}$ is the parameter extraction function, $r_t$ is the reasoning text generated for the current step of the thought chain, $g$ is the constraint mapping function, and $\lVert\cdot\rVert_2$ is the Euclidean distance between the actual action parameters and the logical constraint parameters.
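The second-stage objective described above can be illustrated with a minimal Python sketch. The parameter extraction function and the constraint mapping function are not specified in the text, so `extract_params` (final waypoint of the trajectory) and `constraint_from_reasoning` (a `target` field of a structured reasoning step) are hypothetical stand-ins, as is the default weight `lam=0.5`.

```python
import math

def extract_params(trajectory):
    """Hypothetical parameter extraction: use the final waypoint of the trajectory."""
    return trajectory[-1]

def constraint_from_reasoning(reasoning_step):
    """Hypothetical constraint mapping: read the target position named by the reasoning step."""
    return reasoning_step["target"]

def consistency_penalty(trajectory, reasoning_step):
    """Euclidean distance between actual action parameters and logical constraint parameters."""
    return math.dist(extract_params(trajectory), constraint_from_reasoning(reasoning_step))

def stage2_loss(retrieval_loss, trajectory, reasoning_step, lam=0.5):
    """Retrieval accuracy loss plus the weighted logic-action consistency penalty."""
    return retrieval_loss + lam * consistency_penalty(trajectory, reasoning_step)

trajectory = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.2), (0.3, 0.4, 0.2)]
matching_step = {"text": "move above the cup", "target": (0.3, 0.4, 0.2)}
conflicting_step = {"text": "move above the plate", "target": (0.3, 0.0, 0.2)}

loss_match = stage2_loss(0.2, trajectory, matching_step)
loss_conflict = stage2_loss(0.2, trajectory, conflicting_step)
```

In this toy case a trajectory that ends where the reasoning step says it should adds no penalty, while a conflicting one raises the loss in proportion to the distance.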
[0011] Preferably, the thought chain is a structured text sequence for step-by-step task reasoning, the content of which includes a description of the task environment state, semantic understanding based on potential visual features and language instructions, intermediate logical judgment steps, and the final execution action intention.
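[0012] Preferably, a policy loss function is used during the training process, expressed as:
$$\mathcal{L}(\theta) = \frac{1-\alpha}{|\mathcal{D}^{-}|}\sum_{i\in\mathcal{D}^{-}}\ln\pi_\theta(a_i\mid s_i) \;-\; \frac{\alpha}{|\mathcal{D}^{+}|}\sum_{i\in\mathcal{D}^{+}}\ln\pi_\theta(a_i\mid s_i) \;+\; \frac{\beta}{N}\sum_{i=1}^{N}\mathrm{KL}\left[\pi_\theta(\cdot\mid s_i)\,\Vert\,\pi_{\text{prior}}(\cdot\mid s_i)\right]$$
The first term reduces the policy's selection probability for poor state-action pairs, the second term increases the policy's selection probability for good state-action pairs, and the third term constrains the magnitude of the policy update. Here $\pi_\theta$ is the policy to be optimized, $a_i$ and $s_i$ are the $i$-th action and state, $\mathcal{D}^{+}$ and $\mathcal{D}^{-}$ are the sets of state-action pairs with positive and negative advantages, respectively, $\alpha$ is the coefficient balancing the loss weights of positive and negative samples, $\beta$ is the weight of the prior constraint, $N$ is the total number of samples, $\mathrm{KL}$ is the KL divergence, and $\pi_{\text{prior}}$ is the prior policy.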
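As a rough illustration of the three terms just described, the following Python sketch evaluates such a loss on a toy batch with two discrete actions. The discrete action space, the default weights `alpha` and `beta`, the direction of the KL term (policy relative to prior), and the helper `make_batch` are all assumptions made for the example.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def policy_loss(samples, alpha=0.5, beta=0.3):
    """samples: dicts with 'policy' and 'prior' distributions, an 'action' index and an 'advantage'."""
    positives = [s for s in samples if s["advantage"] >= 0.0]
    negatives = [s for s in samples if s["advantage"] < 0.0]

    def mean_log_prob(group):
        if not group:
            return 0.0
        return sum(math.log(s["policy"][s["action"]]) for s in group) / len(group)

    prior_term = sum(kl_divergence(s["policy"], s["prior"]) for s in samples) / len(samples)
    return ((1.0 - alpha) * mean_log_prob(negatives)   # push down poor state-action pairs
            - alpha * mean_log_prob(positives)         # push up good state-action pairs
            + beta * prior_term)                       # stay close to the prior policy

def make_batch(policy_probs):
    """Toy batch: action 0 had positive advantage, action 1 had negative advantage."""
    prior = [0.5, 0.5]
    return [
        {"policy": policy_probs, "prior": prior, "action": 0, "advantage": 1.0},
        {"policy": policy_probs, "prior": prior, "action": 1, "advantage": -1.0},
    ]

loss_aligned = policy_loss(make_batch([0.8, 0.2]))
loss_misaligned = policy_loss(make_batch([0.2, 0.8]))
loss_uniform = policy_loss(make_batch([0.5, 0.5]))
```

A policy that favors the positively rated action yields a lower loss than one that favors the negatively rated action, and a policy equal to the prior incurs no prior penalty.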
[0013] Preferably, the loss function for joint knowledge distillation is expressed as:
$$\mathcal{L}_{\text{distill}} = \mathrm{KL}\left[\pi_T(a\mid s)\,\Vert\,\pi_S(a\mid s)\right] + \lambda_1\left\lVert z_T - z_S \right\rVert_2 + \lambda_2\,\mathcal{L}_{\text{CoT}}$$
The first term maintains consistency in the distribution of behavioral strategies, the second term constrains the latent space structure to ensure that the world model captures environmental dynamics, and the third term ensures consistency of the inference paths. Here $\pi_T$ and $\pi_S$ are the policy distributions of the teacher model and the student model, respectively, $a$ and $s$ are the action and the input state, $\mathrm{KL}$ is the KL divergence, $z_T$ and $z_S$ are the latent visual features of the teacher model and the student model, respectively, $\lVert\cdot\rVert_2$ is the Euclidean distance, $\lambda_1$ and $\lambda_2$ are the balancing weights of the loss terms, and $\mathcal{L}_{\text{CoT}}$ is the distillation loss for thought chain reasoning ability:
$$\mathcal{L}_{\text{CoT}} = \mu\,\mathrm{CE}(y_S, y_T) + (1-\mu)\,\mathrm{MSE}(p_S, p_T)$$
where $\mu$ is the balancing weight, $\mathrm{CE}$ is the cross-entropy loss, $\mathrm{MSE}$ is the mean squared error loss, $y_S$ and $y_T$ are the hard labels of the student model and the teacher model, respectively, representing the final, definitive output, and $p_S$ and $p_T$ are the soft labels of the student model and the teacher model, respectively, representing the complete output probability distribution.
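The joint distillation objective combines a policy term, a latent term, and a thought chain term. The sketch below shows one way these pieces could be combined on toy data; the dictionary layout, the weights `lam_latent`, `lam_cot` and `mu`, and the choice to leave the policy term unweighted are assumptions for illustration only.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def cot_distill_loss(student_probs, teacher_probs, mu=0.5):
    """Cross-entropy against the teacher's hard label plus MSE between soft labels."""
    hard_label = max(range(len(teacher_probs)), key=lambda i: teacher_probs[i])
    cross_entropy = -math.log(student_probs[hard_label])
    mse = sum((s - t) ** 2 for s, t in zip(student_probs, teacher_probs)) / len(teacher_probs)
    return mu * cross_entropy + (1.0 - mu) * mse

def joint_distill_loss(teacher, student, lam_latent=1.0, lam_cot=1.0):
    policy_term = kl_divergence(teacher["policy"], student["policy"])  # behavior consistency
    latent_term = math.dist(teacher["latent"], student["latent"])      # latent structure
    cot_term = cot_distill_loss(student["cot_probs"], teacher["cot_probs"])  # reasoning path
    return policy_term + lam_latent * latent_term + lam_cot * cot_term

teacher = {"policy": [0.7, 0.3], "latent": [1.0, 2.0], "cot_probs": [1.0, 0.0, 0.0]}
matched_student = {"policy": [0.7, 0.3], "latent": [1.0, 2.0], "cot_probs": [1.0, 0.0, 0.0]}
weak_student = {"policy": [0.4, 0.6], "latent": [0.0, 2.0], "cot_probs": [0.5, 0.3, 0.2]}

loss_matched = joint_distill_loss(teacher, matched_student)
loss_weak = joint_distill_loss(teacher, weak_student)
```

A student that reproduces the teacher exactly drives all three terms to zero in this toy setup, while any mismatch in policy, latent features, or reasoning distribution increases the loss.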
[0014] In a second aspect, embodiments of the present invention also provide an embodied agent training and reasoning system based on a conditional diffusion world model and a thought chain, implemented using the aforementioned method and including a model building module, a conditional diffusion module, a retrieval enhancement module, a policy training module, and a distillation deployment module.
The model building module is used to construct and pre-train an action-world model, which includes an action model that predicts actions from the image observation history and language instructions, and a world model that predicts future states from the image observation history and action sequences.
The conditional diffusion module is used to construct a conditional diffusion strategy model, which takes latent visual features, language instructions, and thought chains as composite conditions and generates accurate action sequences through a reverse denoising process.
The retrieval enhancement module is used to construct a two-layer retrieval knowledge base containing a logical reasoning library and a physical execution library and to perform cascaded retrieval. The first stage retrieves reference thought chains based on latent visual features and language instructions, and the second stage retrieves reference physical trajectories by combining the reference thought chains with the latent visual features.
The policy training module is used to feed the reference thought chain and the reference physical trajectory as composite conditions into the conditional diffusion strategy model to generate actions consistent with logical reasoning and physical constraints, and to perform imagination training in the latent space constructed by the pre-trained action-world model to complete the policy optimization training of the embodied agent.
The distillation deployment module is used to take the action-world model fused with the conditional diffusion strategy model as the teacher model and to transfer it to a lightweight student model through joint knowledge distillation, forming the final deployable embodied agent for efficient inference.
[0015] In a third aspect, embodiments of the present invention also provide an electronic device, including a memory and one or more processors, wherein the memory is used to store a computer program, and the processor is used to implement the above-described embodied agent training and reasoning method based on the conditional diffusion world model and thought chain when executing the computer program.
[0016] In a fourth aspect, embodiments of the present invention also provide a computer-readable storage medium storing a computer program which, when executed by a computer, implements the above-described embodied agent training and reasoning method based on the conditional diffusion world model and thought chain.
[0017] Compared with the prior art, the beneficial effects of the present invention include at least the following: (1) By constructing an integrated action-world model and adopting a multi-stage collaborative training method, this invention achieves deep fusion and representation of vision, language, action and internal state in a unified latent space, which significantly improves the consistency of multimodal data and cross-modal reasoning ability.
[0018] (2) By introducing two-layer retrieval-enhanced thought chain reasoning and a conditional diffusion action generation strategy, this invention effectively combines logical planning with physical constraints. The generated decision sequence is semantically coherent and in line with the task objectives, as well as physically feasible and precise in action, which improves the reliability and success rate of decision-making in complex tasks.
[0019] (3) Through a joint knowledge distillation framework covering the policy, the latent representation, and the thought chain, this invention can efficiently compress a fully trained large-scale teacher model into a lightweight student model while maintaining core decision-making and reasoning performance, which facilitates deployment on edge devices such as robots with limited computing and storage resources.
Attached Figure Description
[0020] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0021] Figure 1 is a flowchart of the embodied agent training and reasoning method based on the conditional diffusion world model and thought chain provided in an embodiment of the present invention;
Figure 2 is a schematic diagram of the framework of the embodied agent training and reasoning method based on the conditional diffusion world model and thought chain provided in an embodiment of the present invention;
Figure 3 is a network structure diagram of the world model fused with the conditional diffusion strategy provided in an embodiment of the present invention;
Figure 4 is a schematic structural diagram of the embodied agent training and reasoning system based on the conditional diffusion world model and thought chain provided in an embodiment of the present invention.
Detailed Implementation
[0022] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the scope of protection of this invention.
[0023] The inventive concept of this invention is as follows: Addressing the problems of inconsistent action generation, opaque logical reasoning, weak generalization ability in new scenarios, and difficulty in deploying large models on resource-constrained terminals in existing visual-language-action models when handling complex tasks, this invention provides a training and reasoning method for embodied agents based on conditional diffusion world models and thought chains. First, a conditional diffusion strategy model is adopted, using visual features, language instructions, and thought chains as composite conditions to generate high-quality, accurate continuous action sequences through a reverse denoising process. Second, a retrieval-enhanced generation mechanism is introduced. By constructing a two-layer retrieval knowledge base and performing cascading retrieval, reference thought chains and physical trajectories are obtained and injected as conditions into the action generation process to enhance the model's generalization ability to new tasks and low-sample scenarios. Third, interpretable task decomposition logic is generated through thought chain reasoning, and logical consistency constraints are introduced in the retrieval and action generation stages to ensure strict alignment between action and reasoning steps, thereby enhancing the transparency and reliability of decision-making. Finally, through a joint knowledge distillation strategy, the strategies, latent representations, and thought chain reasoning capabilities of the teacher model are simultaneously transferred to the lightweight student model, achieving efficient deployment of the agent in resource-constrained environments.
[0024] As shown in Figure 1 and Figure 2, the embodiment provides an embodied agent training and reasoning method based on a conditional diffusion world model and thought chain, including the following steps: S1. Construct an action-world model, which includes an action model that predicts actions based on the image observation history and language instructions, and a world model that predicts future states based on the image observation history and action sequences. The action-world model is then pre-trained.
[0025] S1.1, Action-World Model Construction.
[0026] In this embodiment, an image-based reinforcement learning environment is initialized. To enable the model to perform action prediction and world state prediction at the same time, two main components are defined: an action model (or policy model) $\pi_\theta$ and a world model $f_\phi$. The action model $\pi_\theta$ is responsible for generating an action $a_t$ conditioned on the image observation history $o_{t-h:t}$ and the language instruction $l$, which can be formally expressed as:
$$a_t \sim \pi_\theta(a_t \mid o_{t-h:t},\, l)$$
where $\theta$ and $\phi$ are the parameters of the action model and the world model, respectively, and $o_{t-h:t}$ is the image observation data captured over the $h$ time steps up to the current time $t$. Meanwhile, the world model $f_\phi$ uses the historical observations $o_{t-h:t}$ and the corresponding action sequence $a_{t-h:t}$ to predict the next frame of image observation data $o_{t+1}$, that is, the future state. This relationship can be expressed as:
$$o_{t+1} \sim f_\phi(o_{t+1} \mid o_{t-h:t},\, a_{t-h:t})$$
[0027] An integrated action-world model $M_\psi$ is developed as a new vision-language-action (VLA) model to unify these two functions, where $\psi$ denotes the parameters of the action-world model. The action-world model should be able to predict actions as a policy model and predict future states as a world model, represented as:
$$a_t \sim M_\psi^{\text{policy}}(a_t \mid o_{t-h:t},\, l), \qquad o_{t+1} \sim M_\psi^{\text{world}}(o_{t+1} \mid o_{t-h:t},\, a_{t-h:t})$$
where $M_\psi^{\text{policy}}$ represents action generation and $M_\psi^{\text{world}}$ represents world state prediction.
[0028] S1.2, Action-World Model Pre-training.
[0029] In this embodiment, a dual-module architecture, including a masked autoencoder (MAE) framework and an efficient Transformer framework, is used to pre-train the action-world model, which mainly includes the following sub-steps.
[0030] S1.2.1. A masked autoencoder framework is used to train the tokenizer that processes visual input in the action-world model. The core task is to compress high-dimensional, redundant video frame data into low-dimensional, compact latent visual features with semantic representation capabilities. With its self-supervised mask-reconstruction logic, MAE enables the tokenizer to capture key structural information and feature patterns in video frames and avoid learning irrelevant noise.
[0031] To balance pixel-level reconstruction accuracy with visual perception consistency, the loss function is a weighted combination of mean squared error (MSE) loss and LPIPS perceptual loss. The MSE loss focuses on pixel-level reconstruction errors, ensuring that the latent representation accurately reproduces the numerical features of the original frame. The LPIPS perceptual loss computes differences in the feature space of a pre-trained visual model, so that the reconstruction conforms to human visual perception and avoids results that are correct at the pixel level but semantically distorted. The final loss function is a linear combination of the two:
$$\mathcal{L}_{\text{tok}} = \mathcal{L}_{\text{MSE}} + 0.2\,\mathcal{L}_{\text{LPIPS}}$$
That is, the MSE loss is added to the LPIPS loss weighted by a coefficient of 0.2 to form the total loss for training the tokenizer, which guides the tokenizer to learn a latent representation that combines accuracy and semantics.
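The weighting can be made concrete with a small Python sketch. Computing LPIPS requires a pre-trained perceptual network, so here the perceptual distance is simply passed in as a precomputed number; the function names and the flattened-pixel representation are assumptions made for illustration.

```python
def mean_squared_error(frame, reconstruction):
    """Pixel-level reconstruction error over flattened pixel values."""
    return sum((a - b) ** 2 for a, b in zip(frame, reconstruction)) / len(frame)

def tokenizer_loss(frame, reconstruction, lpips_distance, lpips_weight=0.2):
    """MSE loss plus the LPIPS perceptual distance weighted by 0.2.

    lpips_distance is assumed to be computed elsewhere by a pre-trained
    perceptual model; it is treated as a plain float in this sketch.
    """
    return mean_squared_error(frame, reconstruction) + lpips_weight * lpips_distance

# A pixel-perfect reconstruction still pays for perceptual mismatch, and vice versa.
loss_perceptual_only = tokenizer_loss([0.0, 1.0], [0.0, 1.0], lpips_distance=0.5)
loss_pixel_only = tokenizer_loss([0.0, 0.0], [1.0, 1.0], lpips_distance=0.0)
```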
[0032] S1.2.2. An efficient Transformer framework is used to train the dynamics model that simulates environmental dynamics within the action-world model. As the core driving component of the world model, the dynamics model must accurately learn the state transition rules in the latent space and act as a virtual engine capable of simulating dynamic environmental changes. To overcome the continuously growing accumulated error of traditional step-by-step prediction, this stage skips redundant intermediate prediction steps and predicts directly from the initial latent state to the clean target latent state, which avoids error propagation and improves both the accuracy of state transition prediction and inference efficiency. To adapt the model to noise in real environments and enhance its robustness, a parameter $\sigma$ (with values in $[0,1]$) is used to linearly interpolate between the initial state $z_0$ and the target state $z_1$, generating corrupted latent samples that simulate environmental conditions under different noise intensities:
$$\tilde{z} = (1-\sigma)\,z_0 + \sigma\,z_1$$
where $z_0$ is the noise and $z_1$ is the clean target state. The model needs to recover $z_1$ from $\tilde{z}$.
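The linear interpolation used to build corrupted latent samples can be sketched in a few lines of Python. The use of standard Gaussian noise as the initial state and the names `corrupt_latent` and `signal_level` are assumptions for this example.

```python
import random

def corrupt_latent(clean_latent, signal_level, rng):
    """Blend Gaussian noise with the clean target latent.

    signal_level = 0 gives pure noise, signal_level = 1 gives the clean target.
    Returns the corrupted latent and the noise that was drawn.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in clean_latent]
    corrupted = [(1.0 - signal_level) * n + signal_level * c
                 for n, c in zip(noise, clean_latent)]
    return corrupted, noise

rng = random.Random(0)
clean = [0.5, -1.0, 2.0]
fully_clean, _ = corrupt_latent(clean, 1.0, rng)
pure_noise, drawn_noise = corrupt_latent(clean, 0.0, rng)
half_mixed, half_noise = corrupt_latent(clean, 0.5, rng)
```

During training the dynamics model would be asked to recover the clean latent from samples such as `half_mixed`, with the signal level varied across the batch.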
[0033] The loss function is adjusted dynamically according to the generation step size $d$. When the step size equals its minimum value $d_{\min}$, the prediction distance is short and the influence of noise is small, so an L2 loss is used to directly constrain the deviation between the predicted value $\hat{z}_1$ and the actual target state $z_1$, ensuring the accuracy of basic predictions. When the step size is greater than $d_{\min}$, a bootstrap loss is employed to reduce the accumulated error of multi-step prediction by distilling the knowledge from the two-step prediction, and a stop-gradient (sg) operation is applied to the intermediate predictions to avoid distortion during gradient propagation. Furthermore, to strengthen the guiding role of high-quality data in model training, a ramp loss weight function $w(\sigma)$ is designed, where $\sigma \in [0,1]$ is the interpolation parameter used to generate the noisy samples: the larger $\sigma$ is (that is, the closer the sample is to the clean target state and the higher the signal quality), the higher the weight, so that the model prioritizes learning the feature patterns of high-quality data while still retaining low-quality data. By making full use of the training value of noisy samples and balancing robustness with prediction accuracy, an efficient and interference-resistant latent state transition simulator, that is, the dynamics model, is ultimately constructed.
[0034] S1.2.3, In addition to the above-mentioned action-conditional video prediction pre-training on the action-world model, the second training phase includes learning the task-conditional policy and reward model. A multi-token prediction (MTP) loss function is used to achieve joint prediction of actions and rewards, improving the consistency between agent decision-making and reward evaluation. The training input is the task embedding. This embedding integrates the current environmental state and task objective information, providing contextual support for action and reward prediction.
[0035] The loss function takes the form of a negative log-likelihood and simultaneously constrains the prediction accuracy of both the action sequence and the reward sequence. Specifically, the log-probabilities predicted for the action and the reward at each of the future prediction steps are computed and summed, and the negative of this sum is taken as the total loss, forcing the model to accurately predict subsequent action choices and the corresponding reward feedback starting from the current task embedding. This joint training approach allows the policy head and the reward head to work together, prevents a disconnect between action decision-making and reward evaluation, and lays the foundation for the subsequent reinforcement learning phase.
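The joint negative log-likelihood can be illustrated with a short Python sketch. It assumes that the policy head and reward head have already produced, for each future step, the probability assigned to the action and reward that actually occurred; the function name `mtp_loss` and the toy probabilities are hypothetical.

```python
import math

def mtp_loss(action_probs, reward_probs):
    """Negative sum of log-probabilities over the predicted future steps.

    action_probs[n] and reward_probs[n] are the probabilities the two heads
    assign to the realized action and reward n steps ahead.
    """
    return -sum(math.log(pa) + math.log(pr)
                for pa, pr in zip(action_probs, reward_probs))

confident_loss = mtp_loss([0.9, 0.8, 0.7], [0.95, 0.9, 0.85])
uncertain_loss = mtp_loss([0.3, 0.2, 0.1], [0.4, 0.3, 0.2])
single_step_loss = mtp_loss([0.5], [0.5])
```

Because both heads contribute to one scalar, a confident action prediction paired with a poor reward prediction is still penalized, which reflects the coupling between decision-making and reward evaluation described above.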
[0036] S2. Construct a conditional diffusion strategy model, which uses latent visual features, language instructions, and thought chains as composite conditions to generate accurate action sequences through a reverse denoising process.
[0037] The basic idea of diffusion probabilistic models originates from stochastic differential equations and the theory of denoising distribution transformation. The core idea is as follows: in the forward process, Gaussian noise is progressively added to the data until it approximates a standard normal distribution; in the reverse process, a denoising network is learned so that the model can gradually reconstruct the data distribution from the noise. Assuming that the network predicts a noise term $\epsilon_\theta$ that approximates the real noise $\epsilon$, minimizing the KL divergence of the reverse process is equivalent to minimizing the noise reconstruction error:
$$\mathcal{L}_{\text{diff}} = \mathbb{E}\left[\left\lVert \epsilon - \epsilon_\theta(a_t^{k}, k, z_t, c_t) \right\rVert^2\right]$$
where $a_t^{k}$ is the noisy action sequence at time $t$, the superscript $k$ denotes the time step of the diffusion process, $z_t$ is the latent feature representation at time $t$, and $c_t$ is the conditional information at time $t$. Unlike simple Gaussian policies, this section trains a more expressive diffusion action decoder, that is, a conditional diffusion policy model. As shown in Figure 3, the decoder uses visual features, language instructions, thought chains, and retrieved physical reference trajectories as composite conditions, and the precise action sequence is gradually restored from Gaussian noise through a reverse diffusion process. The denoising process is as follows:
$$a_t^{k-1} = \frac{1}{\sqrt{\alpha_k}}\left(a_t^{k} - \frac{1-\alpha_k}{\sqrt{1-\bar{\alpha}_k}}\,\epsilon_\theta(a_t^{k}, k, z_t, c_t)\right)$$
where $a_t^{k-1}$ is the new action after denoising, $\alpha_k$ and $\bar{\alpha}_k$ are noise scheduling parameters that control the proportion of noise removed at each step, $\epsilon_\theta$ is the denoising neural network, $k$ is the current denoising step, and $c_t$ is the aforementioned condition information.
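To make the reverse denoising loop concrete, here is a minimal Python sketch of a deterministic DDPM-style update. The linear noise schedule, the omission of a per-step noise injection, and in particular the `make_oracle` noise predictor (which stands in for the trained conditional denoising network by using a known target action) are assumptions chosen so the example runs without any learned model.

```python
import math

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Assumed linear schedule; returns per-step alphas and their cumulative products."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    return alphas, alpha_bars

def reverse_denoise(noisy_action, condition, predict_noise, alphas, alpha_bars):
    """Iteratively remove predicted noise, from the last diffusion step down to the first."""
    action = list(noisy_action)
    for k in reversed(range(len(alphas))):
        eps = predict_noise(action, k, condition)
        scale = 1.0 / math.sqrt(alphas[k])
        step = (1.0 - alphas[k]) / math.sqrt(1.0 - alpha_bars[k])
        action = [scale * (a - step * e) for a, e in zip(action, eps)]
    return action

def make_oracle(target_action, alpha_bars):
    """Hypothetical stand-in for the trained network: exact noise implied by a known target."""
    def predict_noise(action, k, condition):
        # The condition is ignored here; a real denoiser would consume it.
        ab = alpha_bars[k]
        return [(a - math.sqrt(ab) * t) / math.sqrt(1.0 - ab)
                for a, t in zip(action, target_action)]
    return predict_noise

alphas, alpha_bars = make_schedule(num_steps=20)
target_action = [0.3, -0.5, 0.1]
start = [1.2, 0.7, -0.9]
denoised = reverse_denoise(start, None, make_oracle(target_action, alpha_bars),
                           alphas, alpha_bars)
```

With a perfect noise predictor the loop lands on the target action, which shows that the quality of the generated action depends entirely on how well the conditioned denoising network predicts the noise.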
[0038] This decoupled design allows the action-world model training in step S1 to focus on broad environmental understanding and exploration, while the diffusion strategy in step S2 focuses on precise execution under multimodal constraints, effectively solving the contradiction between the traditional single strategy and the difficulty in balancing the breadth of exploration and the accuracy of control.
[0039] Traditional diffusion models are typically based on a single modal input (such as image or speech), while this invention introduces a three-modal conditional input:
$$c_t = (z_v,\; e_l,\; e_{\text{cot}})$$
where $z_v$ denotes the visual latent features, $e_l$ is the language semantic embedding, and $e_{\text{cot}}$ is the sequence of reasoning tokens generated by the thought chain. This design gives the diffusion process task awareness, so that action generation relies on visual information while also following the language instructions and the logical reasoning sequence. The generated actions are therefore both physically plausible and semantically consistent with the task objective.
[0040] S3. Construct a two-layer retrieval knowledge base containing a logical reasoning library and a physical execution library, and perform cascaded retrieval. The first stage retrieves reference thought chains based on latent visual features and language instructions, and the second stage combines the reference thought chains and latent visual features to retrieve reference physical trajectories.
[0041] S3.1, Construction of a two-layer retrieval knowledge base.
[0042] In this embodiment, the new architecture introduces the chain-of-thought (CoT) and conditional diffusion modules, upgrading retrieval-augmented generation (RAG) from simple action memory to dual alignment of logic and physics, and reorganizing it into a two-layer heterogeneous dedicated knowledge base.
[0043] The logic reasoning library is designed as a thought chain generation head. Its data structure adopts a dual-feature composite index design, with the index key... By visual latent features With language instruction embedding Generated through cross-modal fusion, it can uniquely represent the combination of "specific scene + specific task": , in, For splicing operations, For visual encoders, For text encoders.
[0044] Similar historical success cases are retrieved from the logical reasoning library to obtain a set of reference thought chains. These reference texts are not used directly; they serve as prompts for in-context learning and are fed into the subsequent CoT generator.
[0045] The retrieved value is a structured thought chain text in the form "condition-judgment-decision-execution intent". A unified format is used so that the thought chain module can parse and reference it quickly. To ensure that the retrieved logic matches the current scenario, training uses an InfoNCE (retrieval accuracy) loss $\mathcal{L}_{\text{InfoNCE}}$: $$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp\big(\mathrm{sim}(q_t, k^{+})/\tau\big)}{\sum_{i=1}^{N} \exp\big(\mathrm{sim}(q_t, k_i)/\tau\big)}$$ where $q_t$ is the logical query vector at the current time step $t$, $k^{+}$ is the positive sample key vector, $k_i$ is the $i$-th candidate key vector, $\mathrm{sim}(\cdot,\cdot)$ is the similarity function, and $\tau$ is the temperature parameter.
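As a rough illustration of the retrieval accuracy loss, the sketch below computes an InfoNCE-style loss for one query. It assumes cosine similarity as the similarity function and a candidate list that contains the positive key; both are choices made for this example only, and the vectors are invented.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def info_nce(query, positive_key, candidate_keys, temperature=0.1):
    """-log( exp(sim(q, k+)/t) / sum_i exp(sim(q, k_i)/t) ).

    Assumption: candidate_keys includes the positive key, so the loss is >= 0.
    """
    positive_logit = cosine(query, positive_key) / temperature
    logits = [cosine(query, key) / temperature for key in candidate_keys]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - positive_logit

candidates = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = info_nce([1.0, 0.0], [1.0, 0.0], candidates)     # query agrees with its positive
loss_mismatched = info_nce([1.0, 0.0], [0.0, 1.0], candidates)  # query disagrees with its positive
```

A lower temperature sharpens the softmax over candidates, so a mismatched positive is penalized more heavily.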
[0046] The physical execution library is designed for the conditional diffusion strategy. It stores a precise mapping between "inference intent" and "physical trajectory", providing a closely matched denoising reference for the conditional diffusion model. Its index key $k^{\text{phys}}$ is generated jointly from the visual latent features $z_t$, the language instruction embedding $e_l$, and the thought chain embedding $e_r$: $k^{\text{phys}} = z_t \oplus e_l \oplus e_r$, where $e_r$ is the thought chain embedding. The core improvement is the integration of the thought chain embedding, which accurately anchors the current decision-making intent.
[0047] The retrieved value is a standardized trajectory fragment that can be directly input into the diffusion model without additional preprocessing.
[0048] In addition to the basic InfoNCE retrieval loss, the training of the physical execution library must also satisfy the inference-action consistency constraint described below. If the retrieved physical trajectory (and its corresponding action sequence) conflicts with the current thought chain, a penalty is applied. The penalty is gated by an indicator function of whether the retrieved physical trajectory conflicts with the current thought chain.
[0049] S3.2, Two-stage cascaded retrieval.
[0050] In this embodiment, after the construction of the two-layer heterogeneous library is completed, a two-stage cascaded search is initiated.
[0051] The first stage is auxiliary reasoning and serves as a pre-decision stage. Its core objective is to provide a small number of example references for CoT generation, completing logical anchoring before the model outputs any action command. Its query vector $q_t^{(1)}$ is composed of the visual latent features $z_t$ and the language instruction embedding $e_l$, and the key vector $k^{(1)}$ corresponds to the historical states and instructions stored in the knowledge base. To ensure that the retrieved logic is semantically consistent with the current scenario, an InfoNCE loss is also required: $$\mathcal{L}_{\text{ret}}^{(1)} = -\log \frac{\exp\big(\mathrm{sim}(q_t^{(1)}, k^{(1)+})/\tau\big)}{\sum_{i=1}^{N} \exp\big(\mathrm{sim}(q_t^{(1)}, k_i^{(1)})/\tau\big)}$$ where $\mathcal{L}_{\text{ret}}^{(1)}$ is the retrieval accuracy loss of the first stage, $\mathrm{sim}(\cdot,\cdot)$ is the similarity function, and $\tau$ is the temperature parameter.
[0052] The second stage is physical retrieval. Its goal is to find physical trajectories that conform to the current logical plan, assisting the diffusion model in generating actions. The query vector for this stage, $q_t^{(2)}$, adds the thought chain $r_t$ generated in the first stage to the query conditions as a cascaded input, and the key vector $k^{(2)}$ changes accordingly, corresponding to the thought chain records stored in the knowledge base at the time the historical trajectory occurred. In addition to the InfoNCE loss, the loss here, $\mathcal{L}_{\text{ret}}^{(2)}$, must include a logic-action consistency penalty to prevent retrieved actions from conflicting with the thought chain: $\mathcal{L}_{\text{ret}}^{(2)} = \mathcal{L}_{\text{InfoNCE}} + \lambda_c\, \mathcal{L}_{\text{cons}}$, where $\lambda_c$ is the balancing weight.
[0053] $\mathcal{L}_{\text{cons}}$ is the logic-action consistency penalty term, namely $\mathcal{L}_{\text{cons}} = \lVert \phi(\xi_{\text{ref}}) - \psi(r_t) \rVert_2$, where $\xi_{\text{ref}}$ is the retrieved reference physical trajectory, $\phi(\cdot)$ is the parameter extraction function that reduces the trajectory to a vector of action parameters, $r_t$ is the thought chain, and $\psi(r_t)$ gives the constraints on action parameters implied by the thought chain (such as the speed threshold corresponding to "deceleration").
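The second-stage loss can be sketched as a retrieval loss plus a weighted Euclidean penalty between the action parameters of the retrieved trajectory and the constraint implied by the thought chain. Everything concrete below is an assumption made for illustration: the trajectory is a list of speeds, `extract_params` takes its peak speed, `constraint_from_chain` maps keywords to a speed limit, and the weight 0.5 is arbitrary.

```python
import math

def extract_params(trajectory):
    """Hypothetical parameter extraction: summarize a speed trajectory by its peak speed."""
    return [max(trajectory)]

def constraint_from_chain(chain_text):
    """Hypothetical constraint mapping: keywords in the thought chain imply a speed limit."""
    limits = {"decelerate": 0.2, "cruise": 1.0}
    for keyword, limit in limits.items():
        if keyword in chain_text:
            return [limit]
    return [1.0]  # default when no keyword is present

def consistency_penalty(trajectory, chain_text):
    """Euclidean distance between extracted action parameters and chain constraints."""
    params = extract_params(trajectory)
    constraint = constraint_from_chain(chain_text)
    return math.sqrt(sum((p - c) ** 2 for p, c in zip(params, constraint)))

def stage_two_loss(retrieval_loss, trajectory, chain_text, weight=0.5):
    """Retrieval accuracy loss plus the weighted logic-action consistency penalty."""
    return retrieval_loss + weight * consistency_penalty(trajectory, chain_text)

consistent = consistency_penalty([0.1, 0.2], "obstacle ahead, decelerate")
conflicting = stage_two_loss(0.3, [0.5, 0.8], "obstacle ahead, decelerate")
```

Because the penalty is a plain distance, a trajectory slower than the threshold is penalized as well; a one-sided penalty would be a different design choice.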
[0054] S4 feeds the reference thought chains and reference physical trajectories as composite conditions into the conditional diffusion strategy model, generating actions consistent with logical reasoning and physical constraints. It then performs imagination training in the latent space constructed by the pre-trained action-world model to complete the policy optimization training of the embodied agent.
[0055] S4.1, Chain of Thought (CoT) Reasoning.
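[0056] For each time step $t$, the conditional probability of generating a thought chain can be expressed as $$p_\theta(r_t \mid z_t, e_l, \mathcal{C}_{\text{ref}}) = \prod_{j=1}^{T} p_\theta(r_{t,j} \mid r_{t,<j}, z_t, e_l, \mathcal{C}_{\text{ref}})$$ where $r_t = (r_{t,1}, \dots, r_{t,T})$ is the reasoning token sequence, generated by concatenating the visual latent features $z_t$, the language instruction embedding $e_l$, and the retrieved reference thought chains $\mathcal{C}_{\text{ref}}$ and feeding them into the CoT decoder. This formula originates from the autoregressive assumption in language modeling. Based on the principle of maximum likelihood estimation (MLE), the training objective is $$\mathcal{L}_{\text{CoT}} = -\sum_{j=1}^{T} \log p_\theta(r_{t,j} \mid r_{t,<j}, z_t, e_l, \mathcal{C}_{\text{ref}})$$ where $r_{t,<j}$ denotes all tokens generated before step $j$. By minimizing the negative log-likelihood, the model learns to generate logically ordered reasoning processes.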
[0057] During the deployment phase, to reduce inference time and memory overhead, this invention employs a teacher-student distillation strategy: the teacher model generates high-quality inference sequences, and the student model learns their distribution, $\mathcal{L}_{\text{KD}} = D_{\mathrm{KL}}(p_T \,\|\, p_S)$, where $p_T$ and $p_S$ are the conditional probabilities output by the teacher model and the student model, respectively. The KL divergence term originates from the distribution matching principle in information theory, and minimizing it is equivalent to "compressing" the logical path output by the teacher into the student model. To prevent erroneous reasoning from misleading subsequent actions, this step introduces a confidence gating and fallback mechanism. On the one hand, the average log probability of the generated sequence is calculated as the confidence score, $s_{\text{conf}} = \frac{1}{T}\sum_{j=1}^{T} \log p(r_{t,j} \mid r_{t,<j})$, where $r_{t,j}$ is the $j$-th generated token and $T$ is the total length of the generated thought chain. On the other hand, the semantic similarity between the generated text and the reference text is calculated to assess consistency. If the confidence score or the consistency score is lower than a preset safety threshold, the system determines that the generated result is unreliable and automatically triggers a fallback strategy: it either replaces the generated result with the best reference thought chain retrieved in step S3.2, or outputs a conservative-mode marker, so that the final thought chain output always has basic logical credibility.
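The confidence gating and fallback logic can be sketched as follows. The sketch assumes per-token log probabilities are available from the generator, and it substitutes a simple word-overlap (Jaccard) score for the semantic similarity measure; the thresholds and the function names are hypothetical.

```python
def confidence_score(token_log_probs):
    """Average log probability of the generated tokens."""
    if not token_log_probs:
        return float("-inf")
    return sum(token_log_probs) / len(token_log_probs)

def token_overlap(generated, reference):
    """Stand-in for semantic similarity: Jaccard overlap of the two word sets."""
    a, b = set(generated.split()), set(reference.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def gate_thought_chain(generated, token_log_probs, reference,
                       conf_threshold=-1.0, sim_threshold=0.3):
    """Keep the generated chain only if both scores clear their thresholds."""
    confidence = confidence_score(token_log_probs)
    consistency = token_overlap(generated, reference)
    if confidence < conf_threshold or consistency < sim_threshold:
        return reference, "fallback"  # reuse the best retrieved reference chain
    return generated, "generated"

reference = "slow down and grasp cup"
accepted = gate_thought_chain("slow down then grasp cup",
                              [-0.1, -0.2, -0.1, -0.3, -0.1], reference)
rejected = gate_thought_chain("spin quickly", [-3.0, -2.5], reference)
```

Here the fallback returns the retrieved reference chain; emitting a conservative-mode marker instead would only change the value returned in the fallback branch.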
[0058] Unlike traditional natural language generation, this invention uses the generated thought chain as a conditional input to the diffusion strategy: $a_t \sim \pi_\theta(\cdot \mid z_t, e_l, r_t)$, where $z_t$ denotes the visual latent features, $e_l$ the language instruction embedding, and $r_t$ the generated thought chain. This structure achieves closed-loop control across three layers (logic, semantics, and action), so that the embodied agent's actions not only complete tasks correctly but also conform to logic that humans can interpret.
[0059] S4.2, Imagination Training Phase.
[0060] The imagination training phase, as the core reinforcement learning stage of the world model training algorithm, optimizes the policy only within the imagined trajectories generated by the world model. It relies on the collaborative iteration of two main modules: value estimation and policy optimization. First, value estimation trains the reward head through temporal-difference learning. After training, the reward head predicts the λ-return, which allows the policy to maximize rewards beyond the imagined horizon: $$R_t^{\lambda} = r_t + \gamma\, c_t \big( (1-\lambda)\, v_t + \lambda\, R_{t+1}^{\lambda} \big)$$ where $\gamma$ is the discount factor, $c_t$ indicates a non-terminal state, $r_t$ is the immediate reward obtained at time $t$, and $v_t$ is the value estimate for the current state.
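A minimal sketch of the λ-return recursion over one imagined rollout is given below. The index convention (bootstrapping the last step from its value estimate and blending with the value at the current index) and the default values of `gamma` and `lam` are assumptions of this example, not values stated in the text.

```python
def lambda_returns(rewards, values, continues, gamma=0.997, lam=0.95):
    """Backward recursion R_t = r_t + gamma * c_t * ((1 - lam) * v_t + lam * R_{t+1}).

    rewards, values, continues are lists of equal length over one imagined rollout;
    continues[t] is 1 for a non-terminal state and 0 otherwise.
    Assumption: the final return is bootstrapped from the final value estimate.
    """
    last = len(values) - 1
    returns = [0.0] * len(values)
    returns[last] = values[last]
    for t in range(last - 1, -1, -1):
        blended = (1.0 - lam) * values[t] + lam * returns[t + 1]
        returns[t] = rewards[t] + gamma * continues[t] * blended
    return returns

# With gamma = lam = 1 the recursion reduces to summed rewards plus the final value.
full = lambda_returns([1.0, 1.0, 0.0], [0.0, 0.0, 2.0], [1, 1, 1], gamma=1.0, lam=1.0)
# A terminal flag at step 0 cuts off everything after the immediate reward.
cut = lambda_returns([1.0, 1.0, 0.0], [0.0, 0.0, 2.0], [0, 1, 1], gamma=1.0, lam=1.0)
```

The bootstrap at the last step is what lets the target account for rewards past the end of the imagined rollout.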
[0061] The policy head uses PMPO (Probability-based Multi-step Policy Optimization) for learning, a robust reinforcement learning objective that uses only the sign of the advantage $A_t = R_t^{\lambda} - v_t$, regardless of its magnitude. This property reduces the need to normalize rewards or advantages and ensures that all tasks receive equal attention despite potentially different reward scales. PMPO balances the attention to positive and negative feedback by averaging simple maximum likelihood losses over the states with positive and negative advantages separately. All imagined states across the batch and time dimensions are divided into a positive set $\mathcal{D}^{+} = \{ s_i \mid A_i \ge 0 \}$ and a negative set $\mathcal{D}^{-} = \{ s_i \mid A_i < 0 \}$, and the following policy loss is applied: $$\mathcal{L}(\theta) = \frac{1-\alpha}{|\mathcal{D}^{-}|} \sum_{i \in \mathcal{D}^{-}} \ln \pi_\theta(a_i \mid s_i) - \frac{\alpha}{|\mathcal{D}^{+}|} \sum_{i \in \mathcal{D}^{+}} \ln \pi_\theta(a_i \mid s_i) + \frac{\beta}{N} \sum_{i=1}^{N} \mathrm{KL}\big[ \pi_\theta(a_i \mid s_i) \,\|\, \pi_{\text{prior}} \big]$$ where the first term reduces the selection probability of the policy for inferior state-action pairs, the second term increases the selection probability of the policy for superior state-action pairs, and the third term constrains the update magnitude of the policy. Here $\pi_\theta$ is the policy to be optimized, $a_i$ and $s_i$ are the $i$-th action and state respectively, $\mathcal{D}^{+}$ and $\mathcal{D}^{-}$ are the sets of state-action pairs with positive and negative advantages respectively, $\alpha$ is the coefficient that balances the loss weights of positive and negative samples, $\beta$ is the prior scale, $N$ is the total number of samples, $\mathrm{KL}$ is the KL divergence, and $\pi_{\text{prior}}$ is the prior policy. Different from the original PMPO objective, the reverse direction is used for the prior KL to better constrain the policy within the range of reasonable behaviors. Fine-tuning the Transformer brings a small additional benefit but at a higher computational cost; in that case, the dynamics, policy prior, and reward losses also need to be applied during imagination training to maintain their functionality.
[0062] In this embodiment, the coefficient $\alpha$ that balances the positive and negative sample sets takes values in the range of 0.3 to 0.7, preferably 0.5; the behavioral prior scale $\beta$ takes values in the range of 0.1 to 0.5, preferably 0.3.
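The sign-based policy loss can be sketched as follows, assuming that per-sample log probabilities, advantages, and KL values to the prior have already been computed elsewhere. Treating a zero advantage as positive is an assumption of this example; the defaults 0.5 and 0.3 are the preferred values stated above.

```python
def pmpo_loss(log_probs, advantages, kl_to_prior, alpha=0.5, beta=0.3):
    """Sign-based policy loss over a flat batch of imagined state-action pairs.

    Only the sign of each advantage is used: samples are split into a positive
    and a negative set, and the mean log-probability of each set is weighted
    separately. A per-sample KL term to the prior regularizes the update.
    """
    positive = [lp for lp, adv in zip(log_probs, advantages) if adv >= 0]
    negative = [lp for lp, adv in zip(log_probs, advantages) if adv < 0]
    negative_term = (1.0 - alpha) * sum(negative) / len(negative) if negative else 0.0
    positive_term = alpha * sum(positive) / len(positive) if positive else 0.0
    prior_term = beta * sum(kl_to_prior) / len(kl_to_prior)
    return negative_term - positive_term + prior_term

log_probs = [-1.0, -2.0, -3.0, -4.0]
no_kl = [0.0, 0.0, 0.0, 0.0]
loss = pmpo_loss(log_probs, [5.0, 0.1, -0.2, -7.0], no_kl)
# Rescaling the advantages leaves the loss unchanged, since only the sign matters.
loss_rescaled = pmpo_loss(log_probs, [500.0, 10.0, -20.0, -700.0], no_kl)
```

Minimizing this quantity lowers the log-probability of negative-advantage actions and raises that of positive-advantage actions, independent of the reward scale.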
[0063] S5. Use the action-world model fused with the conditional diffusion policy model as the teacher model, and transfer the teacher model to a lightweight student model through joint knowledge distillation, forming the final deployable embodied agent for efficient inference.
[0064] In this embodiment, joint knowledge distillation is performed for efficient deployment. The matching between the teacher policy distribution $\pi_T$ and the student policy distribution $\pi_S$ can be expressed as a KL divergence: $\mathcal{L}_{\text{policy}} = D_{\mathrm{KL}}\big( \pi_T(a \mid s) \,\|\, \pi_S(a \mid s) \big)$. This formula can be understood as minimizing the difference between the two distributions in the action probability space, so that the student model mimics the teacher's decision-making strategy.
[0065] Because this invention includes a multimodal latent space and a reasoning module, distilling the policy alone is insufficient to maintain overall performance; the objective is therefore extended to a joint form: $$\mathcal{L}_{\text{joint}} = \lambda_1\, D_{\mathrm{KL}}\big( \pi_T(a \mid s) \,\|\, \pi_S(a \mid s) \big) + \lambda_2\, \lVert z_T - z_S \rVert_2 + \lambda_3\, \mathcal{L}_{\text{CoT-KD}}$$ The first term maintains consistency in the distribution of behavioral strategies; the second term constrains the latent space structure to ensure that the world model captures environmental dynamics; and the third term ensures consistency of the inference paths. Here $\pi_T$ and $\pi_S$ are the policy distributions of the teacher model and the student model, $a$ and $s$ denote the action and the input state, respectively, $D_{\mathrm{KL}}$ is the KL divergence, $z_T$ and $z_S$ are the latent visual features corresponding to the teacher model and the student model, respectively, $\lVert \cdot \rVert_2$ is the Euclidean distance, $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the balancing weights for each loss term, and $\mathcal{L}_{\text{CoT-KD}}$ is the distillation loss of reasoning ability in the thought chain. The student model is trained using soft labels (such as the probability distribution of inference steps and action constraint weights) output by the teacher model. The loss $\mathcal{L}_{\text{CoT-KD}}$ is: $$\mathcal{L}_{\text{CoT-KD}} = \eta\, \mathrm{CE}(y_S, y_T) + (1-\eta)\, \mathrm{MSE}(p_S, p_T)$$ where $\eta$ is the balancing weight, set to 0.6, $\mathrm{CE}$ is the cross-entropy loss, $\mathrm{MSE}$ is the mean squared error, $y_S$ and $y_T$ are the hard labels for the student model and the teacher model, respectively, representing the final, definitive output, and $p_S$ and $p_T$ are the soft labels for the student model and the teacher model, respectively, representing the complete probability distribution output by the model. Ultimately, this ensures that the student model replicates the teacher's reasoning logic, predicts future states, and accurately outputs the robot's next action.
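The joint distillation objective can be sketched on toy discrete distributions. Several details below are assumptions made for illustration: the KL direction (teacher to student), the way the cross-entropy and mean squared error terms share a single weight (0.6 and its complement), the use of the student's output probabilities against the teacher's one-hot hard label, and all numeric inputs.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def euclidean(u, v):
    """Euclidean distance between two latent feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy of predicted probabilities against a (one-hot) target."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def cot_distill_loss(student_probs, teacher_hard, teacher_soft, eta=0.6):
    """Weighted mix of a hard-label term and a soft-label term (assumed split)."""
    return (eta * cross_entropy(teacher_hard, student_probs)
            + (1.0 - eta) * mse(student_probs, teacher_soft))

def joint_distill_loss(teacher_policy, student_policy, z_teacher, z_student,
                       cot_loss, weights=(1.0, 1.0, 1.0)):
    """Policy KL + latent feature distance + thought chain distillation loss."""
    w_policy, w_latent, w_cot = weights
    return (w_policy * kl_divergence(teacher_policy, student_policy)
            + w_latent * euclidean(z_teacher, z_student)
            + w_cot * cot_loss)

teacher_policy, z_teacher = [0.7, 0.2, 0.1], [1.0, 0.0]
perfect = joint_distill_loss(teacher_policy, teacher_policy, z_teacher, z_teacher, 0.0)
cot = cot_distill_loss([0.6, 0.4], teacher_hard=[1.0, 0.0], teacher_soft=[0.9, 0.1])
imperfect = joint_distill_loss(teacher_policy, [0.4, 0.4, 0.2], z_teacher, [0.8, 0.1], cot)
```

A student that matches the teacher in policy, latent features, and reasoning outputs drives every term to zero, which is why the three terms can be weighted and summed into one objective.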
[0066] In summary, the embodied agent training and reasoning method provided by this invention, based on a conditional diffusion world model and thought chain, constructs a collaborative training framework that integrates conditional diffusion action generation, retrieval-enhanced reasoning, and joint knowledge distillation. This framework combines high-fidelity continuous action generation, interpretable task logic decomposition, dynamic external knowledge fusion, and efficient model compression, thereby systematically improving the overall performance of embodied agents in terms of sample efficiency, task generalization, decision reliability, and deployment feasibility.
[0067] Based on the same inventive concept, as shown in Figure 4, this embodiment of the invention also provides an embodied agent training and reasoning system 400 based on a conditional diffusion world model and a thought chain, including: a model building module 410, a conditional diffusion module 420, a retrieval enhancement module 430, a policy training module 440, and a distillation deployment module 450.
[0068] The model building module 410 is used to build an action-world model, which includes an action model that predicts actions based on image observation history and language instructions, and a world model that predicts future states based on image observation history and action sequences, and pre-trains the action-world model.
[0069] The conditional diffusion module 420 is used to construct a conditional diffusion strategy model, which uses latent visual features, language instructions and thought chains as composite conditions to generate accurate action sequences through a reverse denoising process.
[0070] The retrieval enhancement module 430 is used to construct a two-layer retrieval knowledge base containing a logical reasoning library and a physical execution library and to perform cascading retrieval. The first stage is to retrieve reference thought chains based on latent visual features and language instructions, and the second stage is to retrieve reference physical trajectories by combining reference thought chains and latent visual features.
[0071] The policy training module 440 is used to input the reference thought chain and reference physical trajectory as composite conditions into the conditional diffusion strategy model to generate actions consistent with logical reasoning and physical constraints, and to perform imagination training in the latent space constructed by the pre-trained action-world model to complete the policy optimization training of the embodied agent.
[0072] The distillation deployment module 450 is used to take the action-world model fused with the conditional diffusion policy model as the teacher model, and to transfer the teacher model to the lightweight student model through joint knowledge distillation, forming the final deployable embodied agent for efficient reasoning.
[0073] Based on the same inventive concept, embodiments of the present invention also provide an electronic device, including a memory and one or more processors, wherein the memory is used to store a computer program, and the processor is used to implement the above-described embodied agent training and reasoning method based on the conditional diffusion world model and thought chain when executing the computer program.
[0074] Based on the same inventive concept, embodiments of the present invention also provide a computer-readable storage medium storing a computer program, which, when executed by a computer, implements the above-described method for training and reasoning embodied intelligent agents based on the conditional diffusion world model and thought chain.
[0075] It should be noted that the embodied agent training and reasoning system, electronic device, and computer-readable storage medium based on the conditional diffusion world model and thought chain provided in the above embodiments all belong to the same inventive concept as the embodied agent training and reasoning method based on the conditional diffusion world model and thought chain. For details of its specific implementation process, please refer to the embodiments of the embodied agent training and reasoning method based on the conditional diffusion world model and thought chain, which will not be repeated here.
[0076] The specific embodiments described above illustrate the technical solution and beneficial effects of the present invention in detail. It should be understood that the above description is only the most preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, additions, and equivalent substitutions made within the scope of the principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A method for training and reasoning embodied intelligent agents based on a conditional diffusion world model and thought chain, characterized in that it includes the following steps: constructing an action-world model, which includes an action model that predicts actions based on image observation history and language instructions and a world model that predicts future states based on image observation history and action sequences, and pre-training the action-world model; constructing a conditional diffusion strategy model, which uses latent visual features, language instructions and thought chains as composite conditions and generates accurate action sequences through a reverse denoising process; constructing a two-layer retrieval knowledge base containing a logical reasoning library and a physical execution library and performing cascaded retrieval, in which the first stage retrieves reference thought chains based on latent visual features and language instructions, and the second stage combines the reference thought chains and latent visual features to retrieve reference physical trajectories; inputting the reference thought chains and reference physical trajectories as composite conditions into the conditional diffusion strategy model to generate actions consistent with logical reasoning and physical constraints, and performing imagination training in the latent space constructed by the pre-trained action-world model to complete the policy optimization training of the embodied agent; and using the action-world model fused with the conditional diffusion strategy model as the teacher model, and transferring the teacher model to a lightweight student model through joint knowledge distillation to form the final deployable embodied agent for efficient reasoning.
2. The embodied agent training and reasoning method based on the conditional diffusion world model and thought chain as described in claim 1, characterized in that the pre-training of the action-world model includes: using a masked autoencoder framework to train the tokenizer that processes visual input in the action-world model, so as to encode image observations into compact latent visual features, wherein the tokenizer is trained through a mask-reconstruction self-supervised approach and its loss function is a weighted sum of mean squared error loss and LPIPS perceptual loss, balancing pixel-level reconstruction accuracy and visual perception consistency; and using an efficient Transformer framework to train the dynamics model that simulates the dynamic changes of the environment in the action-world model, so as to learn robust state transition rules in the latent space, wherein the dynamics model is trained by adding controllable noise to the target state and reconstructing it through denoising, the training loss function is dynamically adjusted according to the prediction step size, and a ramp weight function is introduced to strengthen the guiding role of high-quality data in model training and improve the robustness of the model to noise and interference.
3. The embodied agent training and reasoning method based on the conditional diffusion world model and thought chain as described in claim 1, characterized in that, in the first stage, when retrieving the reference thought chain based on latent visual features and language instructions, a retrieval accuracy loss is used for training, represented as follows: $$\mathcal{L}_{\text{ret}}^{(1)} = -\log \frac{\exp\big(\mathrm{sim}(q_t, k^{+})/\tau\big)}{\sum_{i=1}^{N} \exp\big(\mathrm{sim}(q_t, k_i)/\tau\big)}$$ where $\mathcal{L}_{\text{ret}}^{(1)}$ is the retrieval accuracy loss of the first stage, $q_t$ is the query vector at the current time step $t$, $k^{+}$ is the positive sample key vector in the knowledge base, $k_i$ is the $i$-th negative sample candidate key vector, $\mathrm{sim}(\cdot,\cdot)$ is the similarity function, and $\tau$ is the temperature parameter.
4. The embodied agent training and reasoning method based on the conditional diffusion world model and thought chain as described in claim 3, characterized in that, in the second stage, when retrieving reference physical trajectories by combining reference thought chains and latent visual features, a retrieval accuracy loss combined with a logic-action consistency penalty is used for training, as follows: $\mathcal{L}_{\text{ret}}^{(2)} = \mathcal{L}_{\text{InfoNCE}} + \lambda_c\, \mathcal{L}_{\text{cons}}$, $\mathcal{L}_{\text{cons}} = \lVert \phi(\xi_{\text{ref}}) - \psi(r_t) \rVert_2$, where $\mathcal{L}_{\text{InfoNCE}}$ is the same retrieval accuracy loss as used in the first stage, $\mathcal{L}_{\text{cons}}$ is the logic-action consistency penalty term, $\lambda_c$ is the balancing weight, $\xi_{\text{ref}}$ is the retrieved reference physical trajectory, $\phi(\cdot)$ is the parameter extraction function, $r_t$ is the reasoning text generated for the current step of the thought chain, $\psi(\cdot)$ is the constraint mapping function, and $\lVert \cdot \rVert_2$ is the Euclidean distance between the actual action parameters and the logical constraint parameters.
5. The embodied agent training and reasoning method based on the conditional diffusion world model and thought chain according to claim 1, characterized in that the thought chain is a structured text sequence used for step-by-step task reasoning. Its content includes a description of the task environment state, semantic understanding based on latent visual features and language instructions, intermediate logical judgment steps, and the final execution action intention.
6. The embodied agent training and reasoning method based on the conditional diffusion world model and thought chain according to claim 1, characterized in that the imagination training uses a policy loss function, expressed as: $$\mathcal{L}(\theta) = \frac{1-\alpha}{|\mathcal{D}^{-}|} \sum_{i \in \mathcal{D}^{-}} \ln \pi_\theta(a_i \mid s_i) - \frac{\alpha}{|\mathcal{D}^{+}|} \sum_{i \in \mathcal{D}^{+}} \ln \pi_\theta(a_i \mid s_i) + \frac{\beta}{N} \sum_{i=1}^{N} \mathrm{KL}\big[ \pi_\theta(a_i \mid s_i) \,\|\, \pi_{\text{prior}} \big]$$ The first term reduces the policy's selection probability under poor state-action pairs, the second term increases the policy's selection probability under good state-action pairs, and the third term constrains the policy's update magnitude. Here $\pi_\theta$ is the policy to be optimized, $a_i$ and $s_i$ are the $i$-th action and state, $\mathcal{D}^{+}$ and $\mathcal{D}^{-}$ are the sets of state-action pairs with positive and negative advantages, respectively, $\alpha$ is the coefficient that balances the loss weights for positive and negative samples, $\beta$ is the prior scale, $N$ is the total number of samples, $\mathrm{KL}$ is the KL divergence, and $\pi_{\text{prior}}$ is the prior policy.
7. The embodied agent training and reasoning method based on the conditional diffusion world model and thought chain according to claim 1, characterized in that the loss function for joint knowledge distillation is expressed as: $$\mathcal{L}_{\text{joint}} = \lambda_1\, D_{\mathrm{KL}}\big( \pi_T(a \mid s) \,\|\, \pi_S(a \mid s) \big) + \lambda_2\, \lVert z_T - z_S \rVert_2 + \lambda_3\, \mathcal{L}_{\text{CoT-KD}}$$ The first term is used to maintain consistency in the distribution of behavioral strategies; the second term is used to constrain the latent space structure to ensure that the world model captures environmental dynamics; and the third term is used to ensure consistency of the inference paths. Here $\pi_T$ and $\pi_S$ are the policy distributions of the teacher model and the student model, respectively, $a$ and $s$ denote the action and the input state, respectively, $D_{\mathrm{KL}}$ is the KL divergence, $z_T$ and $z_S$ are the latent visual features corresponding to the teacher model and the student model, respectively, $\lVert \cdot \rVert_2$ is the Euclidean distance, $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the balancing weights for each loss term, and $\mathcal{L}_{\text{CoT-KD}}$ is the distillation loss of reasoning ability in the thought chain: $$\mathcal{L}_{\text{CoT-KD}} = \eta\, \mathrm{CE}(y_S, y_T) + (1-\eta)\, \mathrm{MSE}(p_S, p_T)$$ where $\eta$ is the balancing weight, $\mathrm{CE}$ is the cross-entropy loss, $\mathrm{MSE}$ is the mean squared error loss, $y_S$ and $y_T$ are the hard labels for the student model and the teacher model, respectively, representing the final, definitive output, and $p_S$ and $p_T$ are the soft labels for the student model and the teacher model, respectively, representing the complete probability distribution output by the model.
8. An embodied agent training and reasoning system based on a conditional diffusion world model and thought chain, implemented using the embodied agent training and reasoning method based on a conditional diffusion world model and thought chain as described in any one of claims 1 to 7, characterized in that it includes a model building module, a conditional diffusion module, a retrieval enhancement module, a policy training module, and a distillation deployment module. The model building module is used to build an action-world model, which includes an action model that predicts actions based on image observation history and language instructions and a world model that predicts future states based on image observation history and action sequences, and to pre-train the action-world model. The conditional diffusion module is used to construct a conditional diffusion strategy model, which uses latent visual features, language instructions and thought chains as composite conditions and generates accurate action sequences through a reverse denoising process. The retrieval enhancement module is used to construct a two-layer retrieval knowledge base containing a logical reasoning library and a physical execution library and to perform cascaded retrieval: the first stage retrieves reference thought chains based on latent visual features and language instructions, and the second stage retrieves reference physical trajectories by combining reference thought chains and latent visual features. The policy training module is used to input the reference thought chain and reference physical trajectory as composite conditions into the conditional diffusion strategy model to generate actions consistent with logical reasoning and physical constraints, and to perform imagination training in the latent space constructed by the pre-trained action-world model to complete the policy optimization training of the embodied agent.
The distillation deployment module is used to take the action-world model fused with the conditional diffusion strategy model as the teacher model, and to transfer the teacher model to the lightweight student model through joint knowledge distillation, forming the final deployable embodied agent for efficient reasoning.
9. An electronic device comprising a memory and one or more processors, the memory being used to store a computer program, characterized in that the processor, when executing the computer program, implements the embodied agent training and reasoning method based on the conditional diffusion world model and thought chain as described in any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a computer, it implements the embodied agent training and reasoning method based on the conditional diffusion world model and thought chain as described in any one of claims 1 to 7.