A robot long-sequence task learning and planning method based on partial visual observation

By employing a robot long-sequence task learning method based on partial visual observation, and utilizing POMDP and neural networks for end-to-end training, the problem of robot long-sequence task planning in dynamic and complex scenarios is solved, and generalized learning for different perspectives and operating subjects is achieved.

CN117565032B · Active · Publication Date: 2026-06-16 · TONGJI UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: TONGJI UNIV
Filing Date: 2023-09-20
Publication Date: 2026-06-16

AI Technical Summary

Technical Problem

Existing technologies struggle to effectively plan and learn long sequences of robot tasks in dynamically changing, unstructured, and complex scenarios. In particular, the application scenarios of visual demonstration data are fixed, making it impossible to cope with long sequence task planning in real-world, complex scenarios.

Method used

A robot long-sequence task learning method based on partial visual observation is adopted. It models the problem as a partially observable Markov decision process (POMDP) and uses neural networks to implement the state transition, observation, and reward functions. Through end-to-end training, the robot learns to imitate and plan long-sequence tasks.

Benefits of technology

It enables the learning of high-level task sequences from expert video demonstration data, exhibits strong generalization ability, and can handle long-sequence task planning from different perspectives and operators, thus making up for the shortcomings of existing technologies in the application of real long-sequence tasks.

✦ Generated by Eureka AI based on patent content.


Abstract

The application provides a robot long-sequence task learning and planning method based on partial visual observation, comprising the following steps: collecting a series of expert demonstration data, segmenting and recognizing the demonstration data with a meta-action recognition model, and arranging the results into a series of historical trajectories for training; modeling the long-sequence task learning problem as a partially observable Markov process, defining the parameter space, designing the state transition function, the observation function, and the confidence update method, designing a reward function, and approximately solving the model with a QMDP network; and, taking the historical trajectories of the expert demonstrations as supervision, designing a loss function and training the model end to end. The application takes video recordings of expert demonstrations as input; such demonstration data is widely available and easy to collect. Neural networks parameterize the partially observable Markov process, which enables end-to-end parameter learning and strong generalization, so that long-sequence imitation and planning can be learned across different demonstration viewpoints and operating subjects.

Description

Technical Field

[0001] This invention relates to the field of robot learning technology, and in particular to a method for learning and planning long sequence tasks for robots based on partial visual observation.

Background Technology

[0002] In recent years, with the rapid development of artificial intelligence technology and intelligent robots, robots have appeared more frequently in people's lives and are playing an increasingly important role. Developing robotics technology and industry has become a major strategic need for China. Initiatives such as "Made in China 2025" and the "Outline of the New Generation Artificial Intelligence Development Plan" aim to promote the development of intelligent robots. As a mechanical device capable of automatically performing operations or tasks, a robot possesses functions such as perception, planning, decision-making, and control. Among these, task planning and decision-making require the robot to identify its own state based on sensor data input, understand the task execution stage, and rationally plan actions to complete the task; this is the core of the robot's "intelligence."

[0003] Traditional industrial robots typically use expert programming for task planning. However, programming requires expert knowledge and necessitates robots operating in structured environments (such as assembly lines), resulting in poor generalization and the need for reprogramming when the environment changes. In contrast, service robots, increasingly prevalent in human life, are often used in dynamic, unstructured, and complex scenarios, rendering programming inapplicable. To address these issues, vision-based task planning methods have emerged. These methods use expert visual demonstration data (such as demonstration images and videos) as learning objects, learning and planning reasonable action sequences based on expected rewards while imitating expert trajectories to complete the task. Visual input-based task sequence learning and planning methods offer the following advantages: visual demonstration data is relatively inexpensive to acquire and easy to collect; deep learning technology has a certain degree of generalization for different scenarios, and the range of objects it can perceive is broader (target objects, human key points, semantic segmentation maps, etc.).

[0004] Currently, methods for learning and planning visual task sequences are not widely available. Invention patent CN111203878A proposes a robot sequence task learning method based on visual imitation, used to guide robots to mimic human actions from videos containing human movements. However, the application scenarios of that patent (object grasping and movement) are relatively fixed, and its task sequences are short and simple, so it cannot handle the planning of long sequence tasks in real-world complex scenarios. This invention proposes a robot long sequence task learning and planning method based on partial visual observation. It models the problem as a partially observable Markov decision process (POMDP), using the complete set of objects and object relationships as the state space and visual images as observations. A QMDP planner is used to approximate the solution of the model. The state transition function, observation function, and reward function are all designed using neural networks, and the entire model is trained end-to-end using historical trajectory data composed of actions and observations.

Summary of the Invention

[0005] The purpose of this invention is to propose a robot long sequence task learning and planning method based on partial visual observation, which can learn to imitate and plan task sequences from a series of expert visual demonstration data.

[0006] To achieve the above objectives, this invention proposes a robot long-sequence task learning and planning method based on partial visual observation, comprising the following steps:

[0007] Step 1: Collect a series of expert demonstration data consisting of images, use the meta-action recognition model to segment and recognize meta-actions in the demonstration data, and organize them into a series of historical trajectories for training;

[0008] Step 2: Model the long sequence task learning problem as a partially observable Markov process, defining the state space, observation space, and action space;

[0009] Step 3: Design the state transition function and observation function using a neural network, and design the confidence update method;

[0010] Step 4: Design a reward function and a QMDP planner using a neural network, and use the QMDP network to approximate the solution of the model;

[0011] Step 5: Using the demonstration trajectory obtained in Step 1 as supervision, design a loss function and perform end-to-end training on the model designed in Step 2 and Step 3 until the model converges.

[0012] Furthermore, in step 1, the dataset is evenly divided into M segments, each segment consisting of T / M observation images; based on the computational cost of the model, K images are evenly selected for action recognition;
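
The segmentation and frame selection described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes that T is divisible by M and that K frames are kept per segment (the text does not say whether K applies per segment or per demonstration). The function name `split_and_sample` is hypothetical.

```python
def split_and_sample(num_frames, num_segments, frames_per_segment):
    """Return, for each of the M segments, the indices of the K frames kept.

    Assumption: num_frames (T) is divisible by num_segments (M), and
    frames_per_segment (K) frames are taken from every segment.
    """
    seg_len = num_frames // num_segments  # T / M frames per segment
    if frames_per_segment > seg_len:
        raise ValueError("cannot select more frames than a segment contains")
    step = seg_len / frames_per_segment   # even spacing inside a segment
    sampled = []
    for m in range(num_segments):
        start = m * seg_len
        sampled.append([start + int(i * step) for i in range(frames_per_segment)])
    return sampled


# Example: T = 120 frames, M = 4 segments, K = 5 frames per segment.
indices = split_and_sample(120, 4, 5)
print(indices[0])  # [0, 6, 12, 18, 24]
```

Raising K trades recognition cost for temporal resolution inside each segment.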

[0013] The meta-action recognition model consists of two parts: ResNet50 and LSTM.

[0014] Furthermore, the ResNet50 uses the ImageNet pre-trained model. The original model requires an input image size of 224×224, and the vector dimension of the output of the last fully connected layer of the model is 1000, which is the number of categories. The action recognition model uses the input of the last fully connected layer of ResNet50 as the feature of each frame image, and the vector dimension is 2048.

[0015] The LSTM consists of two hidden layers. It takes the feature vector extracted by ResNet50 as input and outputs a vector with the same dimension as the number of actions for action recognition. The action model is trained with labeled data. The pre-trained ResNet50 does not participate in gradient backpropagation and parameter updates. Finally, the action recognition results are processed by fusion and other post-processing operations to form an action sequence.
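
The text does not specify the fusion step, so the sketch below shows one plausible post-processing rule as an assumption: consecutive segments with the same predicted label are merged into a single meta-action with frame boundaries. The function name `fuse_segment_labels` and the tuple layout are hypothetical.

```python
from itertools import groupby


def fuse_segment_labels(labels, seg_len):
    """Merge runs of identical per-segment labels.

    Returns a list of (action, start_frame, end_frame) tuples.
    Assumption: every segment holds seg_len frames and fusion simply
    merges neighbouring segments that share a label.
    """
    actions = []
    seg_index = 0
    for label, run in groupby(labels):
        run_len = len(list(run))
        start = seg_index * seg_len
        end = (seg_index + run_len) * seg_len - 1
        actions.append((label, start, end))
        seg_index += run_len
    return actions


# Five segments of 30 frames each, with per-segment predictions.
sequence = fuse_segment_labels(["pick", "pick", "move", "place", "place"], 30)
print(sequence)  # [('pick', 0, 59), ('move', 60, 89), ('place', 90, 149)]
```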

[0016] Furthermore, in step 2, the problem is modeled using partially observable Markov processes, specifically as follows:

[0017] Long-sequence task planning and learning based on partial visual observations can be modeled as a partially observable Markov process, described by the six-tuple $\langle S, A, O, T, Z, R \rangle$;

[0018] In the formula, S represents a finite set of states, A represents a finite set of actions, O represents the observation space, T(s,a,s')=Pr(s'|s,a) represents the state transition function, Z(s',a,o)=Pr(o|s',a) represents the observation function, R(s,a) represents the reward function, and γ is the discount factor of the reward function;

[0019] The interaction process between the robot and the environment is as follows: when the robot performs an action $a \in A$ in a certain state, the environmental state changes from $s \in S$ to $s' \in S$. The robot obtains the current observation $o \in O$ and the reward $r \in R$ through sensors, forming a series of historical trajectories $h_t = \{a_1, o_1, \ldots, a_{t-1}, o_{t-1}\}$;

[0020] Because of its limited observation of the environment, the robot cannot directly obtain the current true state. Instead, it can only estimate the current state distribution through action execution and observation. This estimate of the state distribution is called the confidence level $b_t(s) = \Pr(s \mid h_t)$. The robot continuously updates its confidence level regarding the state based on the actions it performs and the observations it receives. The update formula is as follows:

[0021]

[0022] Where η is the normalization factor.
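
For reference, the textbook POMDP belief update that matches the definitions of T, Z, and η given above is written out below. That the patent's formula takes exactly this form, including the time indexing (chosen here so that $h_{t+1}$ extends $h_t$ by $a_t$ and $o_t$), is an assumption.

```latex
b_{t+1}(s') = \eta \, Z(s', a_t, o_t) \sum_{s \in S} T(s, a_t, s') \, b_t(s),
\qquad
\eta = \Bigl[ \sum_{s' \in S} Z(s', a_t, o_t) \sum_{s \in S} T(s, a_t, s') \, b_t(s) \Bigr]^{-1}
```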

[0023] The goal of robot imitation and planning learning is to find a policy $\pi(b_t) \to a \in A$ that maximizes the expected value of the future discounted reward, as follows:

[0024]
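
A standard way to write this objective, using the reward function R and discount factor γ defined above, is sketched below. It is the usual discounted-return form and is an assumption about the exact expression used in the patent.

```latex
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}\Bigl[ \sum_{t=0}^{\infty} \gamma^{t} \, R\bigl(s_t, \pi(b_t)\bigr) \Bigr]
```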

[0025] Furthermore, in step 2, the partially observable Markov process definition involves organizing the action sequence obtained from action recognition in step 1, along with the start and end frames (action segmentation boundaries), into a historical trajectory, i.e., $h_t = \{a_1, o_1, \ldots, a_{t-1}, o_{t-1}\}$, used for subsequent model training.

[0026] Furthermore, in step 3, the method for modeling the state transition function, observation function, and reward function based on a neural network is as follows:

[0027] Both the state transition function and the observation function employ an encoder-decoder structure. The state transition function uses the initial confidence level as the encoder input, concatenates the encoder's output features with the one-hot action vector, and uses this as the decoder input. The decoder outputs the confidence level $b'_t(s)$, which is the state distribution estimate after the action is performed. The observation function takes the observed image as input and outputs the confidence $Z(s; o_t)$, which is the state distribution estimate given the current observation $o_t$.

[0028] Furthermore, based on the above modeling method, the confidence update formula in step 3 is as follows:

[0029] $b_{t+1}(s) = \eta \, Z(s; o_t) \, b'_t(s)$

[0030] Where η is the normalization factor.
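
The update above is an element-wise product followed by normalization. A minimal sketch, assuming both $b'_t$ and $Z(\cdot; o_t)$ are available as plain lists over a small discrete state set (in the patent they are network outputs); the function name `update_belief` is hypothetical.

```python
def update_belief(predicted_belief, observation_likelihood):
    """Compute b_{t+1}(s) = eta * Z(s; o_t) * b'_t(s) for every state s."""
    unnormalized = [z * b for z, b in zip(observation_likelihood, predicted_belief)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation is incompatible with the predicted belief")
    eta = 1.0 / total  # normalization factor
    return [eta * u for u in unnormalized]


# Three states: the observation favours state 1, so its belief grows.
new_belief = update_belief([0.5, 0.3, 0.2], [0.1, 0.6, 0.3])
print(new_belief)
```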

[0031] Furthermore, the reward function described in step 4 employs a convolutional neural network, with the current confidence level as the input, predicting the reward for each action, and the dimension being equal to the number of action types.

[0032] Furthermore, the QMDP planner described in step 4 uses the Bellman equation to continuously perform value iteration, and the value iteration formula is as follows:

[0033] $Q_{k+1}(s, a) = R(s) + \gamma \sum_{s' \in S} T(s, a, s') \, V_k(s')$

[0034] Here, $V_k(s) = \max_a Q_k(s, a)$, T is the state transition function, and γ is the discount factor. The specific planning method of the QMDP planner is as follows: input the current confidence level and Q-value (initially 0), predict the reward using the reward function, update the Q-value and confidence level, iterate I times, and obtain the final Q-value. Finally, the current confidence level $b_t(s)$ is multiplied by the iterated $Q_I$ to obtain the value of each action under the current state confidence, and the final planned action vector a is output after a softmax operation.
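
The planning loop can be sketched on a small tabular problem. This is an illustration under stated assumptions, not the patent's network: T and R are given as explicit tables rather than learned modules, $V_k(s)$ is taken as the maximum of $Q_k(s, a)$ over actions (the usual Bellman backup), and the belief is held fixed during the iterations. The name `qmdp_plan` is hypothetical.

```python
import math


def qmdp_plan(belief, reward, transition, gamma, iterations):
    """Return a softmax distribution over actions for the given belief.

    belief[s]            : current confidence level b_t(s)
    reward[s]            : R(s), as in the value iteration formula
    transition[a][s][s2] : T(s, a, s2) = Pr(s2 | s, a)
    """
    num_states = len(belief)
    num_actions = len(transition)
    q = [[0.0] * num_actions for _ in range(num_states)]  # Q_0 = 0
    for _ in range(iterations):
        v = [max(q[s]) for s in range(num_states)]  # V_k(s) = max_a Q_k(s, a)
        q = [
            [
                reward[s]
                + gamma * sum(transition[a][s][s2] * v[s2] for s2 in range(num_states))
                for a in range(num_actions)
            ]
            for s in range(num_states)
        ]
    # Weight the Q-values by the belief to get one value per action.
    action_values = [
        sum(belief[s] * q[s][a] for s in range(num_states)) for a in range(num_actions)
    ]
    peak = max(action_values)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in action_values]
    total = sum(exps)
    return [e / total for e in exps]


# Two states (state 1 is rewarding); action 0 stays put, action 1 moves to state 1.
T = [
    [[1.0, 0.0], [0.0, 1.0]],  # action 0
    [[0.0, 1.0], [0.0, 1.0]],  # action 1
]
probs = qmdp_plan(belief=[1.0, 0.0], reward=[0.0, 1.0], transition=T, gamma=0.9, iterations=3)
print(probs)  # action 1 receives the higher probability
```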

[0035] Furthermore, the loss function design described in step 5 is calculated as follows:

[0036]

[0037] Where N is the number of demonstrated trajectories, the expert action sequence serves as the supervision target, a is the action sequence output by the planner, and CE denotes the cross-entropy loss function. The entire training process is end-to-end, continuing until the model converges.
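
A minimal sketch of such a loss is given below. The exact aggregation is not stated in this text, so averaging the cross-entropy over the steps of each trajectory and then over the N trajectories is an assumption; the names `cross_entropy` and `imitation_loss` are hypothetical.

```python
import math


def cross_entropy(probs, target_index):
    """Cross-entropy between a predicted distribution and a one-hot target."""
    return -math.log(max(probs[target_index], 1e-12))  # clamp to avoid log(0)


def imitation_loss(planned, expert):
    """Mean cross-entropy over N demonstration trajectories.

    planned[n][t] : planner output, a probability vector over actions
    expert[n][t]  : index of the expert action at the same step
    """
    total = 0.0
    for plan_traj, expert_traj in zip(planned, expert):
        step_losses = [cross_entropy(p, a) for p, a in zip(plan_traj, expert_traj)]
        total += sum(step_losses) / len(step_losses)
    return total / len(planned)


# One trajectory with two steps: an uncertain step and a confident, correct step.
loss = imitation_loss([[[0.5, 0.5], [1.0, 0.0]]], [[0, 0]])
print(loss)
```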

[0038] Compared with the prior art, the advantages of the present invention are:

[0039] 1. This invention can learn high-level task sequence imitation and planning directly from expert video demonstration data. Compared with expert programming and methods that use robot joint angles and other states as demonstration data, video demonstration data is widely available and easy to collect.

[0040] 2. This invention, based on neural network modeling and training, generalizes well across different demonstration perspectives and operating subjects. Existing POMDP techniques are mostly designed for simple planning tasks such as maze solving, with limited action and state spaces. This invention extends them to imitation learning and planning for real-world long-sequence tasks. Its neural network design allows end-to-end training, so that long-sequence imitation and planning can be learned for different demonstration perspectives and operating subjects, overcoming the shortcomings of existing technologies when applied to real-world long-sequence tasks.

Attached Figure Description

[0041] Figure 1 is a flowchart of a robot long sequence task learning method based on partial visual observation in an embodiment of the present invention;

[0042] Figure 2 is a schematic diagram of the action recognition model in an embodiment of the present invention;

[0043] Figure 3 is a schematic diagram of the state transition function, observation function, and confidence update model in an embodiment of the present invention;

[0044] Figure 4 is a schematic diagram of the reward function and QMDP planner in an embodiment of the present invention.

Detailed Implementation

[0045] To make the objectives, technical solutions, and advantages of the present invention clearer, the technical solutions of the present invention will be further described below.

[0046] As shown in Figure 1, this invention proposes a robot long sequence task learning and planning method based on partial visual observation, the specific steps of which are as follows:

[0047] Step 1: Collect a series of expert demonstration data consisting of images. The expert demonstration trajectory data contains N demonstration samples, i.e., $D = \{d_i \mid i = 1, \ldots, N\}$, where each sample consists of a series of images obtained by visual sensors, i.e., $d_i = \{o_1, \ldots, o_T\}$.

[0048] Step 2: The meta-action recognition model shown in Figure 2 performs meta-action segmentation and recognition on the demonstration data. The dataset is uniformly divided into M segments, each consisting of T / M observed images. Considering the computational cost, K images are uniformly selected for action recognition. The action recognition model consists of two parts: ResNet50 and LSTM. The first part, ResNet50, uses an ImageNet pre-trained model. The original model requires an input image size of 224×224, and the output of its last fully connected layer has dimension 1000 (i.e., the number of categories). The action recognition model uses the input of the last fully connected layer of ResNet50 as the feature vector for each frame, with a vector dimension of 2048. The second part, LSTM, consists of two hidden layers. It takes the feature vector extracted by ResNet50 as input and outputs a vector with the same dimension as the number of actions for action recognition. The action recognition model is trained using labeled data, and the pre-trained ResNet50 does not participate in gradient backpropagation or parameter updates. Finally, the action recognition results undergo post-processing operations such as fusion to form an action sequence.

[0049] Furthermore, the action sequences obtained from action recognition, along with the start and end frames of the actions (action segmentation boundaries), are organized into a historical trajectory format, i.e., $h_t = \{a_1, o_1, \ldots, a_{t-1}, o_{t-1}\}$, used for subsequent model training.
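
One way to assemble such a trajectory is sketched below. The text does not say which frame is paired with each action, so using the frame at the action's end boundary as the observation is an assumption; the function name `build_history` and the tuple layout are hypothetical.

```python
def build_history(actions, frames):
    """Organize recognized actions into a history of (action, observation) pairs.

    actions : list of (label, start_frame, end_frame) tuples
    frames  : list of observation images (any objects), indexed by frame number
    Assumption: the observation o_i paired with action a_i is the frame at
    the end boundary of that action.
    """
    history = []
    for label, _start, end in actions:
        history.append((label, frames[end]))
    return history


# Placeholder strings stand in for images.
frames = [f"frame{i}" for i in range(6)]
history = build_history([("pick", 0, 2), ("place", 3, 5)], frames)
print(history)  # [('pick', 'frame2'), ('place', 'frame5')]
```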

[0050] Step 3: Model the long sequence task learning problem as a partially observable Markov process, defining the state space, observation space, and action space.

[0051] Long-sequence task planning and learning based on partial visual observations can be modeled as a partially observable Markov process, described by the six-tuple $\langle S, A, O, T, Z, R \rangle$. Here, S represents a finite set of states, A represents a finite set of actions, O represents the observation space, $T(s, a, s') = \Pr(s' \mid s, a)$ represents the state transition function, $Z(s', a, o) = \Pr(o \mid s', a)$ represents the observation function, $R(s, a)$ represents the reward function, and γ is the discount factor for the reward function. The robot's interaction with the environment is as follows: when the robot performs an action $a \in A$ in a certain state, the environmental state changes from $s \in S$ to $s' \in S$. The robot then obtains the current observation $o \in O$ and the reward $r \in R$ through sensors, forming a series of historical trajectories $h_t = \{a_1, o_1, \ldots, a_{t-1}, o_{t-1}\}$.

[0052] Because of its limited observation of the environment, the robot cannot directly obtain the current true state. Instead, it can only estimate the current state distribution through action execution and observation. This estimate of the state distribution is called the confidence level $b_t(s) = \Pr(s \mid h_t)$. The robot continuously updates its confidence level regarding the state based on the actions it performs and the observations it receives. The update formula is as follows:

[0053]

[0054] Where η is the normalization factor.

[0055] The goal of robot imitation and planning learning is to find a policy $\pi(b_t) \to a \in A$ that maximizes the expected value of the future discounted reward, as follows:

[0056]

[0057] Furthermore, the state space is spanned by the objects in the scene and the relationships between them, with state space dimension $d_S = N_O \times N_O \times R_O$, where $N_O = 15$ is the number of object types and $R_O = 4$ is the number of relation types. The action space dimension equals the number of action types, which is 30. The observation space is spanned by the images acquired by the visual sensor, with dimension $d_O = 3 \times 224 \times 224$. The confidence level is the distribution estimate of the true state.
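
The sizes stated above work out as follows. The constants come from the text; the flattening order used by `state_index` is a hypothetical layout chosen only to show how an (object, object, relation) triple could address one entry of the state space.

```python
NUM_OBJECT_TYPES = 15    # N_O
NUM_RELATION_TYPES = 4   # R_O
NUM_ACTIONS = 30         # number of action types

# d_S = N_O x N_O x R_O and d_O = 3 x 224 x 224
state_dim = NUM_OBJECT_TYPES * NUM_OBJECT_TYPES * NUM_RELATION_TYPES
obs_dim = 3 * 224 * 224


def state_index(obj_a, obj_b, relation):
    """Flatten an (object, object, relation) triple into one index.

    Assumption: row-major layout over (obj_a, obj_b, relation).
    """
    return (obj_a * NUM_OBJECT_TYPES + obj_b) * NUM_RELATION_TYPES + relation


print(state_dim)  # 900
print(obs_dim)    # 150528
```

The belief is therefore a distribution over 900 entries, far smaller than the 150528-dimensional observation.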

[0058] Step 4: Design the state transition function and observation function using a neural network, and design the confidence update method, as shown in Figure 3.

[0059] Both the state transition function and the observation function employ an encoder-decoder structure. The state transition function uses the initial confidence level as the encoder input, concatenates the encoder's output features with the one-hot action vector, and uses this as the decoder input. The decoder outputs the confidence level $b'_t(s)$, which is the state distribution estimate after the action is performed. The observation function takes the observed image as input and outputs the confidence $Z(s; o_t)$, which is the state distribution estimate given the current observation $o_t$.
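
The data flow of the state transition function can be sketched with a toy model. Everything beyond the encoder, one-hot concatenation, and decoder order is an assumption: the layers here are single untrained linear maps with random weights, the encoder uses a tanh activation, and a softmax is applied so that the output is a distribution. The names `transition_step`, `linear`, `softmax`, and `random_matrix` are hypothetical.

```python
import math
import random


def linear(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def transition_step(belief, action, num_actions, enc_w, dec_w):
    """Predict b'_t(s) from the current belief and the chosen action."""
    features = [math.tanh(h) for h in linear(enc_w, belief)]      # encoder
    one_hot = [1.0 if i == action else 0.0 for i in range(num_actions)]
    decoder_input = features + one_hot                            # concatenation
    return softmax(linear(dec_w, decoder_input))                  # decoder output


# Toy sizes: 6 states, 4 hidden features, 3 actions.
rng = random.Random(0)
enc_w = random_matrix(4, 6, rng)
dec_w = random_matrix(6, 4 + 3, rng)
predicted = transition_step([1.0 / 6] * 6, action=1, num_actions=3, enc_w=enc_w, dec_w=dec_w)
print(len(predicted), sum(predicted))
```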

[0060] Preferably, based on the above modeling method, the confidence update formula is:

[0061] $b_{t+1}(s) = \eta \, Z(s; o_t) \, b'_t(s)$

[0062] Where η is the normalization factor.

[0063] Step 5: As shown in Figure 4, a reward function and a QMDP planner are designed using a neural network, and the QMDP network is used to approximate the solution of the model. Specifically, the QMDP planner uses the Bellman equation to continuously perform value iteration, and the value iteration formula is as follows:

[0064] $Q_{k+1}(s, a) = R(s) + \gamma \sum_{s' \in S} T(s, a, s') \, V_k(s')$

[0065] Here, $V_k(s) = \max_a Q_k(s, a)$, T is the state transition function, and γ is the discount factor. The specific planning method of the QMDP planner is as follows: input the current confidence level and Q-value (initially 0), predict the reward using the reward function, update the Q-value and confidence level, iterate I times, and obtain the final Q-value. Finally, the current confidence level $b_t(s)$ is multiplied by the iterated $Q_I$ to obtain the value of each action under the current state confidence, and the final planned action vector a is output after a softmax operation.

[0066] Step 6: Using the demonstration trajectory obtained in Step 1 as supervision, design the loss function, which is calculated as follows:

[0067]

[0068] Where N is the number of demonstrated trajectories, the expert action sequence serves as the supervision target, a is the action sequence output by the planner, and CE denotes the cross-entropy loss function. The entire training process is end-to-end, continuing until the model converges.

[0069] The above are merely preferred embodiments of the present invention and do not constitute any limitation on the present invention. Any equivalent substitutions or modifications made by those skilled in the art to the technical solutions and content disclosed in the present invention without departing from the scope of the present invention shall be deemed to have remained within the protection scope of the present invention.

Claims

1. A robot long-sequence task learning and planning method based on partial visual observation, characterized in that it includes the following steps:

Step 1: Collect a series of expert demonstration data consisting of images, use the meta-action recognition model to segment and recognize meta-actions in the demonstration data, and organize them into a series of historical trajectories for training;

Step 2: Model the long sequence task learning problem as a partially observable Markov process, defining the state space, observation space, and action space;

Step 3: Design the state transition function and observation function using a neural network, and design the confidence update method;

Step 4: Design a reward function and a QMDP planner using a neural network, and use the QMDP network to approximate the solution of the model designed in Step 3;

Step 5: Using the demonstration trajectory obtained in Step 1 as supervision, design a loss function and perform end-to-end training on the model designed in Step 2 and Step 3 until the model converges.

In step 1, the dataset is evenly divided into M segments, each segment consisting of T / M observation images; based on the computational cost of the model, K images are uniformly selected for action recognition. The meta-action recognition model consists of two parts: ResNet50 and LSTM. The ResNet50 uses an ImageNet pre-trained model. The original model requires an input image size of 224×224, and the vector dimension of the output of the last fully connected layer of the model is 1000, which is the number of categories. The action recognition model uses the input of the last fully connected layer of ResNet50 as the feature of each frame image, and the vector dimension is 2048. The LSTM consists of two hidden layers. It takes the feature vector extracted by ResNet50 as input and outputs a vector with the same dimension as the number of actions for action recognition. The action recognition model is trained with labeled data. The pre-trained ResNet50 does not participate in gradient backpropagation and parameter updates. Finally, the action recognition results are processed by fusion and other post-processing operations to form an action sequence.

In step 2, the problem is modeled using partially observable Markov processes, specifically as follows: long-sequence task planning and learning based on partial visual observations can be modeled as a partially observable Markov process, described by the six-tuple $\langle S, A, O, T, Z, R \rangle$. In the formula, S represents a finite set of states, A represents a finite set of actions, O represents the observation space, $T(s, a, s') = \Pr(s' \mid s, a)$ represents the state transition function, $Z(s', a, o) = \Pr(o \mid s', a)$ represents the observation function, $R(s, a)$ represents the reward function, and γ is the discount factor for the reward function. The interaction process between the robot and the environment is as follows: when the robot performs an action $a \in A$ in a certain state, the environmental state changes from $s \in S$ to $s' \in S$; the robot acquires the current observation $o \in O$ and the reward $r \in R$ through sensors, forming a series of historical trajectories $h_t = \{a_1, o_1, \ldots, a_{t-1}, o_{t-1}\}$. Because of its limited observation of the environment, the robot cannot directly obtain the current true state. Instead, it can only estimate the current state distribution through action execution and observation. This estimate of the state distribution is called the confidence level $b_t(s) = \Pr(s \mid h_t)$. The robot continuously updates its confidence level regarding the state based on the actions it performs and the observations it receives. The update formula is as follows: [formula]; where η is the normalization factor. The goal of robot imitation and planning learning is to find a policy $\pi(b_t) \to a \in A$ that maximizes the expected value of the future discounted reward, as shown in the following formula: [formula].

In step 3, the specific method for modeling the state transition function, observation function, and reward function based on the neural network is as follows: both the state transition function and the observation function employ an encoder-decoder structure. The state transition function uses the initial confidence level as the encoder input, and the encoder's output features are concatenated with the one-hot action vector as the decoder input. The decoder outputs the confidence level $b'_t(s)$, which is the state distribution estimate after the action is performed. The observation function takes the observed image as input and outputs the confidence $Z(s; o_t)$, which is the state distribution estimate given the current observation $o_t$.

The QMDP planner described in step 4 uses the Bellman equation to continuously perform value iterations. The value iteration formula is as follows: $Q_{k+1}(s, a) = R(s) + \gamma \sum_{s' \in S} T(s, a, s') \, V_k(s')$; where $V_k(s) = \max_a Q_k(s, a)$, T is the state transition function, and γ is the discount factor. The specific planning method of the QMDP planner is as follows: input the current confidence level and Q-value (initially 0), predict the reward using the reward function, update the Q-value and confidence level, iterate I times, and obtain the final Q-value. Finally, the current confidence level $b_t(s)$ is multiplied by the iterated $Q_I$ to obtain the value of each action under the current state confidence, and a softmax operation is then applied to output the final planned action vector a.

The loss function described in step 5 is calculated as follows: [formula]; where N is the number of demonstrated trajectories, the expert action sequence serves as the supervision target, a is the action sequence output by the planner, and CE denotes the cross-entropy loss function. The entire training process is end-to-end, continuing until the model converges.

2. The robot long-sequence task learning and planning method based on partial visual observation according to claim 1, characterized in that, in step 2, the partially observable Markov process is defined as follows: the action sequence obtained from action recognition in step 1, along with the start and end frames (action segmentation boundaries), is organized into a historical trajectory $h_t = \{a_1, o_1, \ldots, a_{t-1}, o_{t-1}\}$, which is used for subsequent model training.

3. The robot long-sequence task learning and planning method based on partial visual observation according to claim 1, characterized in that, based on the above modeling method, the confidence update formula in step 3 is as follows: $b_{t+1}(s) = \eta \, Z(s; o_t) \, b'_t(s)$; where η is the normalization factor.

4. The robot long-sequence task learning and planning method based on partial visual observation according to claim 1, characterized in that, The reward function described in step 4 uses a convolutional neural network structure. The input is the current confidence level, and the function predicts the reward for each action. The dimension is the same as the number of action types.