Robot action generation method and device based on generative supervision enhancement

By enhancing the pre-trained visual language backbone network through generative supervision and extracting features using scene depth maps, the problem of inaccurate robot movements is solved, enabling more efficient spatial understanding and action execution in real-world environments, while reducing system complexity and cost.

CN122244759A · Pending · Publication Date: 2026-06-19 · TSINGHUA UNIVERSITY


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: TSINGHUA UNIVERSITY
Filing Date: 2026-03-20
Publication Date: 2026-06-19

Patent Timeline

20 Mar 2026: Application
19 Jun 2026: Publication (CN122244759A)

IPC: G06V20/40; B25J9/16; G06V20/52; G06V20/70; G06V10/774; G06V10/764; G06V10/40; G06N3/09; G06N5/04; G06N3/008

AI Tagging

Application Domain

Programme-controlled manipulator; Character and pattern recognition


AI Technical Summary

Technical Problem

The robot motions generated by existing technologies are not accurate enough, especially due to insufficient understanding of fine-grained spatial structures and geometric relationships in real environments. Furthermore, the introduction of 3D input leads to high computational costs and difficulties in data acquisition.

Method used

A visual language backbone network is pre-trained using a generative supervised augmentation method. Feature extraction is performed using scene depth maps, and joint features are generated by combining visual images and task instruction text. These features are then used to generate robot actions, reducing the reliance on additional 3D sensors.

Benefits of technology

It improves the robot's spatial understanding and action execution capabilities in real-world environments, resulting in more accurate and reasonable generated actions, while reducing system complexity and cost.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention provides a method and apparatus for generating robot actions based on generative supervised augmentation. The method includes: acquiring visual images and task instruction text for a target scene and inputting them into a visual language backbone network for feature extraction processing to determine joint features; the joint features are used to characterize the spatial and semantic features of the target scene; the visual language backbone network is trained using generative supervised augmentation based on a first sample set, the first sample set including multiple sets of first sample data and ground truth depth maps corresponding to each set of first sample data, each set of first sample data including a sample visual image and sample task instruction text for the sample scene; inputting the joint features, visual images, and task instruction text into an action generation network for action generation processing to determine the sequence of actions that the robot needs to perform in the target scene. Using the technical solution of this invention can improve the accuracy of the generated robot actions.


Description

Technical Field

[0001] This invention relates to the field of artificial intelligence technology, and in particular to a method and apparatus for generating robot motion based on generative supervision enhancement.

Background Technology

[0002] Embodied intelligent brains typically need to understand objects, spatial relationships, and operational constraints in a real or simulated environment based on visual observation, and generate corresponding actions by combining these with language commands. Related technologies usually involve multiple aspects, including visual perception, language understanding, spatial reasoning, and motion control.

[0003] In related technologies, when applying Spatial-Aware Vision-Language-Action Models (VLA) to embodied intelligent tasks, the process begins by collecting images, videos, language commands, and corresponding motion trajectory data of the robot during task execution to construct joint training samples for vision, language, and action. Next, a visual encoder extracts environmental observation features, and a language encoder extracts task command features, generating a unified semantic representation through a fusion module. This semantic representation is then input into an action prediction module to output robotic arm control quantities, end effector actions, or task-level action sequences. Finally, training is completed through supervised learning, imitation learning, or policy optimization, and the trained model is deployed in the robot system to generate robot actions to perform embodied tasks such as grasping, handling, assembly, and interaction.

[0004] However, the above-mentioned technology has the problem that the generated robot movements are not accurate enough.

Summary of the Invention

[0005] This invention provides a method and apparatus for generating robot actions based on generative supervision enhancement, which addresses the shortcomings of inaccurate robot actions generated in existing technologies. It enables a visual language backbone network pre-trained with generative supervision enhancement using scene depth maps to generate joint features representing the spatial and semantic features of the scene, and accurately generates robot actions based on these joint features.

[0006] This invention provides a robot motion generation method based on generative supervision enhancement, comprising: Acquire visual images and task instruction text for the target scene; Visual images and task instruction texts are input into a visual language backbone network for feature extraction to determine joint features corresponding to the visual images and task instruction texts. These joint features are used to characterize the spatial and semantic features of the target scene. The visual language backbone network is obtained through generative supervised augmentation training based on a first sample set. The first sample set includes multiple sets of first sample data and ground truth depth maps corresponding to each set of first sample data. Each set of first sample data includes sample visual images and sample task instruction texts for the sample scene. The joint features, visual images, and task instruction text are input into the action generation network for action generation processing to determine the sequence of actions that the robot needs to perform in the target scene; the action sequence includes at least one action.
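The two-step flow described above (image and instruction into the backbone, then joint features plus the same inputs into the action generation network) can be sketched as follows. This is a minimal illustration, not the patented implementation: `ActionPipeline`, `toy_backbone`, and `toy_action_net` are hypothetical names, and the toy functions stand in for trained networks.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# All names below are hypothetical placeholders, not identifiers from the patent.
Image = Sequence[Sequence[float]]   # a 2D visual image (grayscale for brevity)
JointFeatures = List[float]         # spatial and semantic features of the scene
Action = List[float]                # one robot action, e.g. an end-effector delta


@dataclass
class ActionPipeline:
    backbone: Callable[[Image, str], JointFeatures]
    action_net: Callable[[JointFeatures, Image, str], List[Action]]

    def generate(self, image: Image, instruction: str) -> List[Action]:
        # Step 1: image + instruction -> joint features (no depth map or point cloud input).
        joint = self.backbone(image, instruction)
        # Step 2: joint features, image, and instruction -> action sequence.
        actions = self.action_net(joint, image, instruction)
        if len(actions) < 1:
            raise ValueError("the action sequence must contain at least one action")
        return actions


def toy_backbone(image: Image, instruction: str) -> JointFeatures:
    # Stand-in for the trained backbone: mean pixel value and instruction word count.
    pixels = [p for row in image for p in row]
    return [sum(pixels) / len(pixels), float(len(instruction.split()))]


def toy_action_net(joint: JointFeatures, image: Image, instruction: str) -> List[Action]:
    # Stand-in for the action generation network: one 3-value action per instruction word.
    steps = int(joint[1])
    return [[joint[0], 0.0, 0.0] for _ in range(steps)]


pipeline = ActionPipeline(toy_backbone, toy_action_net)
actions = pipeline.generate([[0.0, 1.0], [1.0, 0.0]], "assemble A onto B")
```

Note that only a 2D image and text enter the pipeline at inference time; the depth maps appear only as training supervision.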

[0007] According to the robot motion generation method based on generative supervision enhancement provided by the present invention, the training process of the aforementioned visual language backbone network includes: Each first sample data is input into the initial visual language backbone network for feature extraction processing to determine the predictive joint features corresponding to each first sample data. The predicted joint features corresponding to each first sample data are input into the connection network for feature mapping processing to determine the predicted condition features corresponding to each first sample data. The prediction condition features corresponding to each first sample data are input into a generative deep network for depth image generation processing to determine the predicted depth map corresponding to each first sample data; the dimension of the prediction condition features corresponding to the first sample data is consistent with the dimension of the input space of the generative deep network. Based on the predicted depth map and the corresponding ground truth depth map corresponding to each first sample data, the initial visual language backbone network is subjected to generative supervised augmentation training to obtain the visual language backbone network.
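The chain described above (joint features, then condition features, then a predicted depth map compared with the ground truth) can be sketched with plain linear maps. This is an assumption-laden toy: the sizes `JOINT_DIM`, `COND_DIM`, and `DEPTH_PIXELS`, the fixed weights, and the mean-squared-error loss are all illustrative choices that the patent text does not specify. The point it shows is the dimension constraint: the connection network's output must match the generative network's input space.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]

# Hypothetical sizes chosen for illustration only.
JOINT_DIM, COND_DIM, DEPTH_PIXELS = 4, 3, 2


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m by vector v, rejecting mismatched dimensions."""
    if any(len(row) != len(v) for row in m):
        raise ValueError("input dimension does not match the layer's input space")
    return [sum(w * x for w, x in zip(row, v)) for row in m]


# Connection network stand-in: joint features (JOINT_DIM) -> condition features (COND_DIM).
connector: Matrix = [[0.5] * JOINT_DIM for _ in range(COND_DIM)]
# Depth generator stand-in: its input space has dimension COND_DIM.
depth_generator: Matrix = [[1.0] * COND_DIM for _ in range(DEPTH_PIXELS)]


def predict_depth(joint_features: Vector) -> Vector:
    condition = matvec(connector, joint_features)   # feature mapping
    return matvec(depth_generator, condition)       # depth image generation


def depth_loss(predicted: Vector, ground_truth: Vector) -> float:
    # Mean squared error, used here as an assumed reconstruction loss.
    return sum((p - g) ** 2 for p, g in zip(predicted, ground_truth)) / len(predicted)


predicted = predict_depth([1.0, 2.0, 3.0, 4.0])
loss = depth_loss(predicted, [15.0, 13.0])
```

In a real system the gradient of this loss would flow back through the generator and connector into the backbone, which is how the depth supervision reaches the joint features.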

[0008] According to the robot motion generation method based on generative supervision enhancement provided by the present invention, before obtaining the visual language backbone network by performing generative supervision enhancement training on the initial visual language backbone network based on the predicted depth map and the ground truth depth map corresponding to each first sample data, the method further includes: freezing the network parameters of the initial visual language backbone network and of the generative deep network; and, based on the predicted depth map and the ground truth depth map corresponding to each first sample data, performing first-stage generative supervised training on the initial connection network to obtain the connection network.

[0009] According to the robot motion generation method based on generative supervision enhancement provided by the present invention, before obtaining the visual language backbone network by performing generative supervision enhancement training on the initial visual language backbone network based on the predicted depth map and the corresponding ground truth depth map corresponding to each first sample data, the method further includes: Freeze the network parameters of the initial visual language backbone network; Based on the predicted depth map and the corresponding ground truth depth map corresponding to each first sample data, a second-stage generative supervised training is performed on the first-stage trained connection network and the initial generative deep network to obtain the connection network and the generative deep network.

[0010] According to the robot action generation method based on generative supervision enhancement provided by the present invention, performing generative supervision enhancement training on the initial visual language backbone network based on the predicted depth map and the ground truth depth map corresponding to each first sample data to obtain the visual language backbone network includes: based on the predicted depth map and the ground truth depth map corresponding to each first sample data, performing third-stage generative supervision enhancement training on the second-stage trained connection network, the second-stage trained generative deep network, and the initial visual language backbone network to obtain the connection network, the generative deep network, and the visual language backbone network.
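The freezing schedule across the three training stages described in paragraphs [0008] to [0010] can be summarized as a small lookup table. The module names `backbone`, `connector`, and `depth_generator` are hypothetical labels for the visual language backbone network, the connection network, and the generative network that produces depth maps.

```python
from typing import Dict, FrozenSet, List

# Hypothetical module labels for the three networks named in the text.
MODULES = ("backbone", "connector", "depth_generator")

# Which modules receive gradient updates in each stage, per paragraphs [0008]-[0010].
STAGE_TRAINABLE: Dict[int, FrozenSet[str]] = {
    1: frozenset({"connector"}),                                 # backbone and generator frozen
    2: frozenset({"connector", "depth_generator"}),              # backbone frozen
    3: frozenset({"backbone", "connector", "depth_generator"}),  # everything trained jointly
}


def frozen_modules(stage: int) -> List[str]:
    """Return the modules whose parameters stay fixed during the given stage."""
    return sorted(set(MODULES) - STAGE_TRAINABLE[stage])
```

Each stage unfreezes one more module, so the randomly initialized connection network is aligned first and the pre-trained backbone is only updated once the depth pathway is already producing useful gradients.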

[0011] According to the robot action generation method based on generative supervision enhancement provided by the present invention, the sample task instruction text of the first sample data includes sample answers. Performing the third-stage generative supervision enhancement training on the second-stage trained connection network, the second-stage trained generative deep network, and the initial visual language backbone network, based on the predicted depth map and the ground truth depth map corresponding to each first sample data, to obtain the connection network, the generative deep network, and the visual language backbone network, includes: decoding each predicted joint feature to determine the predicted answer corresponding to each first sample data; calculating the language task loss based on the predicted answer and the corresponding sample answer for each first sample data; calculating the depth reconstruction loss based on the predicted depth map and the corresponding ground truth depth map for each first sample data; and, based on the language task loss and the depth reconstruction loss, performing the third-stage generative supervision enhancement training on the second-stage trained connection network, the second-stage trained generative deep network, and the initial visual language backbone network to obtain the connection network, the generative deep network, and the visual language backbone network.
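A joint objective of this kind can be sketched as below. The text says only that training uses both the language task loss and the depth reconstruction loss; the specific forms here (token-level negative log-likelihood, mean squared error, and a weighted sum with a hypothetical `depth_weight`) are assumptions made for illustration.

```python
import math
from typing import List


def language_task_loss(answer_token_probs: List[float]) -> float:
    """Mean negative log-likelihood the model assigns to the sample-answer tokens."""
    return -sum(math.log(p) for p in answer_token_probs) / len(answer_token_probs)


def depth_reconstruction_loss(predicted: List[float], ground_truth: List[float]) -> float:
    """Mean squared error between predicted and ground truth depth values."""
    return sum((p - g) ** 2 for p, g in zip(predicted, ground_truth)) / len(predicted)


def total_loss(
    answer_token_probs: List[float],
    predicted_depth: List[float],
    ground_truth_depth: List[float],
    depth_weight: float = 1.0,  # hypothetical balancing factor, not given in the source
) -> float:
    return language_task_loss(answer_token_probs) + depth_weight * depth_reconstruction_loss(
        predicted_depth, ground_truth_depth
    )
```

Keeping the language term in the objective is what prevents the backbone from trading away its question-answering ability while it learns geometry from the depth term.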

[0012] According to the robot motion generation method based on generative supervised augmentation provided by the present invention, the ground truth depth map corresponding to each first sample data is a ground truth depth map with added standard noise. The above-mentioned calculation of depth reconstruction loss based on the predicted depth map corresponding to each first sample data and the corresponding ground truth depth map includes: Based on each predicted depth map, determine the prediction noise corresponding to each predicted depth map; The depth reconstruction loss is calculated based on each predicted noise and the corresponding standard noise.
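One way to read this noise-based loss is in the style of a denoising diffusion objective. The sketch below assumes a DDPM-like forward process with a single mixing coefficient `alpha_bar`, and assumes the predicted noise is recovered from the predicted depth map by inverting that process; neither detail is stated in the source, and all function names are hypothetical.

```python
import math
import random
from typing import List


def add_standard_noise(depth: List[float], noise: List[float], alpha_bar: float) -> List[float]:
    """Assumed forward process: mix clean depth with standard Gaussian noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * d + b * n for d, n in zip(depth, noise)]


def implied_noise(noisy_depth: List[float], predicted_depth: List[float], alpha_bar: float) -> List[float]:
    """Noise implied by a predicted clean depth map under the same forward process."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - a * d) / b for x, d in zip(noisy_depth, predicted_depth)]


def noise_loss(predicted_noise: List[float], standard_noise: List[float]) -> float:
    """Mean squared error between predicted noise and the noise that was added."""
    return sum((p - s) ** 2 for p, s in zip(predicted_noise, standard_noise)) / len(standard_noise)


rng = random.Random(0)
clean_depth = [1.0, 2.0, 3.0]
standard_noise = [rng.gauss(0.0, 1.0) for _ in clean_depth]
noisy_depth = add_standard_noise(clean_depth, standard_noise, alpha_bar=0.64)

# A perfect depth prediction implies exactly the added noise, so the loss vanishes.
perfect = noise_loss(implied_noise(noisy_depth, clean_depth, 0.64), standard_noise)
# A depth prediction that is uniformly 0.6 too deep yields a strictly positive loss.
biased_depth = [d + 0.6 for d in clean_depth]
biased = noise_loss(implied_noise(noisy_depth, biased_depth, 0.64), standard_noise)
```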

[0013] According to the robot motion generation method based on generative supervision enhancement provided by the present invention, the aforementioned first sample data includes at least one of the following types of first sample data: Sample visual images and sample task instruction text for the sample scene; the sample task instruction text includes questions and answers regarding the spatial location and / or referential relationships of objects in the sample visual images; Sample visual images and sample task instruction text for the sample scene; the sample task instruction text includes questions and answers regarding the geometric constraints and / or physical relationships of objects in the sample scene; The sample visual images and sample task instruction texts are for the sample scene; the sample task instruction texts include questions and answers on spatiotemporal planning for the robot to operate objects in the sample scene.
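The three kinds of first sample data listed above could be represented with a small record type. The field names, file paths, and the example question are hypothetical; only the three question-and-answer categories and the pairing with a ground truth depth map come from the text.

```python
from dataclasses import dataclass
from enum import Enum


class QACategory(Enum):
    SPATIAL_REFERENCE = "spatial location and/or referential relationships"
    GEOMETRIC_PHYSICAL = "geometric constraints and/or physical relationships"
    SPATIOTEMPORAL_PLANNING = "spatiotemporal planning for object manipulation"


@dataclass(frozen=True)
class FirstSample:
    image_path: str        # sample visual image of the sample scene
    question: str          # question part of the sample task instruction text
    answer: str            # sample answer used for the language task loss
    category: QACategory   # which of the three question types this sample belongs to
    depth_map_path: str    # ground truth depth map used only as training supervision


sample = FirstSample(
    image_path="scenes/0001.png",          # hypothetical path
    question="Which object is closer to the gripper, the cup or the box?",
    answer="The cup.",
    category=QACategory.SPATIAL_REFERENCE,
    depth_map_path="depth/0001.png",       # hypothetical path
)
```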

[0014] According to the robot motion generation method based on generative supervision enhancement provided by the present invention, the training of the above-mentioned action generation network includes: obtaining a second sample set, where the second sample set includes multiple sets of second sample data and the ground truth action sequence corresponding to each set of second sample data, and each set of second sample data includes a sample visual image and sample task instruction text for the sample scene; inputting each second sample data into the third-stage trained visual language backbone network for feature extraction processing to determine the predicted joint features corresponding to each second sample data; inputting the predicted joint features corresponding to each second sample data into the initial action generation network for action generation processing to determine the predicted action sequence that the robot needs to execute in the sample scene of the second sample data; and, based on the predicted action sequence and the ground truth action sequence corresponding to each second sample data, performing generative supervised training on the initial action generation network to obtain the action generation network.
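The supervision pattern in this paragraph (features from an already trained backbone, an action network fitted to ground truth actions) can be illustrated with a deliberately simplified stand-in. The sketch swaps the action generation network for a linear head trained by stochastic gradient descent on squared error, and uses a one-dimensional action; `frozen_backbone` and `train_action_head` are hypothetical names, and none of this reflects the actual network or loss used in the patent.

```python
from typing import List, Tuple

Features = List[float]


def frozen_backbone(image_mean: float, instruction: str) -> Features:
    # Stand-in for the trained backbone; it has no parameters that this loop updates.
    return [image_mean, float(len(instruction.split()))]


def train_action_head(
    samples: List[Tuple[Features, float]], lr: float = 0.05, epochs: int = 300
) -> List[float]:
    """Fit a linear action head to (joint features, ground truth action) pairs."""
    weights = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        for features, true_action in samples:
            predicted = sum(w * f for w, f in zip(weights, features))
            error = predicted - true_action
            # Gradient of the squared error with respect to each weight.
            weights = [w - lr * 2.0 * error * f for w, f in zip(weights, features)]
    return weights


samples = [
    (frozen_backbone(1.0, ""), 2.0),        # features [1.0, 0.0]
    (frozen_backbone(0.0, "grasp"), -1.0),  # features [0.0, 1.0]
]
weights = train_action_head(samples)
```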

[0015] According to the robot motion generation method based on generative supervision enhancement provided by the present invention, performing generative supervision enhancement training on the initial action generation network based on the predicted action sequence and the ground truth action sequence corresponding to each second sample data to obtain the action generation network includes: inputting the predicted joint features corresponding to each second sample data into the third-stage trained connection network for feature mapping processing to determine the predicted condition features corresponding to each second sample data; inputting the predicted condition features corresponding to each second sample data into the third-stage trained generative deep network for depth image generation processing to determine the predicted depth map corresponding to each second sample data; and, based on the predicted depth map and the ground truth depth map corresponding to each second sample data, as well as the predicted action sequence and the ground truth action sequence corresponding to each second sample data, performing generative supervision enhancement training on the initial action generation network to obtain the action generation network.

[0016] According to the present invention, a robot motion generation method based on generative supervision enhancement is provided, wherein the visual language backbone network is an autoregressive network and the motion generation network is a diffusion network.

[0017] The present invention also provides a robot motion generation device based on generative supervision enhancement, comprising the following modules: The acquisition module is used to acquire visual images and task instruction text for the target scene; The joint feature generation module is used to input visual images and task instruction text into the visual language backbone network for feature extraction processing to determine the joint features corresponding to the visual images and task instruction text. The joint features are used to characterize the spatial and semantic features of the target scene. The visual language backbone network is obtained by generative supervised augmentation training based on the first sample set. The first sample set includes multiple sets of first sample data and ground truth depth maps corresponding to each set of first sample data. Each set of first sample data includes sample visual images and sample task instruction text for the sample scene. The action generation module is used to input joint features, visual images, and task instruction text into the action generation network for action generation processing, and to determine the action sequence that the robot needs to perform in the target scene; the action sequence includes at least one action.

[0018] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the robot motion generation method based on generative supervision enhancement as described above.

[0019] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the robot motion generation method based on generative supervision enhancement as described above.

[0020] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the robot motion generation method based on generative supervision enhancement as described above.

[0021] The present invention provides a robot motion generation method and apparatus based on generative supervision enhancement. This method acquires visual images and task instruction text for a target scene and inputs them into a visual-language backbone network for feature extraction processing. It determines the joint features corresponding to the visual images and task instruction text, and then inputs the joint features, visual images, and task instruction text into a motion generation network for motion generation processing to determine the motion sequence, including at least one action, that the robot needs to perform in the target scene. The joint features characterize the spatial and semantic features of the target scene. The visual-language backbone network is trained using generative supervision enhancement based on a first sample set. The first sample set includes multiple sets of first sample data and a ground truth depth map corresponding to each set of first sample data. Each set of first sample data includes a sample visual image and a sample task instruction text for the sample scene. In this method, the feature extraction process of the visual language backbone network can be pre-trained with generative supervision using scene depth maps. This introduces spatial geometric supervision information of the scene into the feature extraction process, enabling the visual language backbone network to understand the scene not only through semantic information but also by explicitly considering depth levels, boundaries, and spatial layout. This multi-dimensional understanding of the scene, combining semantic and spatial geometric information, allows the subsequent action generation network to reference this scene understanding when generating actions, resulting in more accurate and reasonable robot actions. Furthermore, this method only requires visual images and task command text as input to the visual language backbone network during action generation. 
It can directly learn the scene's spatial structure information from ordinary 2D visual images and task command text without needing to input point clouds, depth maps, or other additional 3D sensor data. This reduces the overall system complexity and application cost.

Description of the Drawings

[0022] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0023] Figure 1 is one of the flowcharts of the robot motion generation method based on generative supervision enhancement provided by the present invention.

[0024] Figure 2 is the second flowchart of the robot motion generation method based on generative supervision enhancement provided by the present invention.

[0025] Figure 3 is a network architecture block diagram of the robot motion generation method based on generative supervision enhancement provided by the present invention.

[0026] Figure 4 is a schematic diagram of the robot motion generation device based on generative supervision enhancement provided by the present invention.

[0027] Figure 5 is a schematic diagram of the structure of the electronic device provided by the present invention.

Detailed Implementation

[0028] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.

[0029] Embodied intelligence technologies typically involve multiple aspects, including visual perception, language understanding, spatial reasoning, and action control. Currently, Vision-Language Models (VLMs) and Vision-Language-Action Models (VLAs) are relevant to embodied tasks. When applying VLMs to embodied intelligence tasks, large-scale VLM pre-training techniques are primarily used to establish the correspondence between image content and natural language semantics, thereby improving the model's perception, understanding, and reasoning abilities in open environments. Methods such as RoboBrain, Vlaser, and Mimo-Embodied train models by constructing diverse datasets containing object, action, attribute, and planning information to enhance their ability to describe and understand complex scenes. Simultaneously, labeled data related to spatial relationships is introduced, enabling the model to identify the relative positions, orientations, and interaction states between objects in the scene, thus improving the alignment between visual information and linguistic semantics. The implementation process of this visual-language model-based approach generally involves: first, collecting images, videos, and corresponding text descriptions, question-answer pairs, or instruction data to construct visual-language training samples; second, using a visual encoder and a language encoder to jointly model multimodal information, and through pre-training or instruction fine-tuning, enabling the model to acquire the ability to map from visual input to linguistic semantic representation; third, in some schemes, tasks such as spatial question answering, scene reasoning, and target relationship judgment are further added to enhance the model's understanding of spatial structure and task semantics; finally, the trained model is used for environmental understanding, instruction parsing, task reasoning, or interactive decision-making in embodied intelligence scenarios.

[0030] When applying spatially perceptive vision-language-action models to embodied intelligence tasks, this type of approach adds an action output module to the vision-language model, enabling the model to directly generate robot control actions based on visual input and language commands, thus achieving integrated processing from "perception-understanding-decision-execution". This type of technology typically uses a pre-trained vision-language model as the backbone network and integrates an action prediction head on top of it to leverage the existing perception and language understanding capabilities of the pre-trained model, improving the robot's adaptability to human commands and zero-shot generalization ability. To enhance the spatial perception capabilities of this type of model, some techniques, such as PointVLA and 3D-VLA, attempt to introduce three-dimensional or 2.5-dimensional inputs into the vision-language-action model, such as depth maps, point clouds, or multi-view geometric information, to compensate for the shortcomings of two-dimensional images in spatial representation. Other techniques, such as SpatialVLA and TraceVLA, explore implicitly extracting the global spatial context of the environment from two-dimensional observations and injecting geometric information into semantic representations to improve the model's understanding of spatial relationships. Other technologies, such as DreamVLA, introduce generative world models to assist in action planning by predicting future states or simulating environmental evolution, aiming to improve the robot's decision-making performance in complex tasks.

[0031] However, for vision-language modeling methods, their training objectives typically focus on semantic alignment and language understanding, neglecting the crucial fine-grained spatial structure information essential for embodied tasks. This makes it difficult to accurately represent the relative distances, 3D positional relationships, contact states, and manipulable constraints between objects. For vision-language-action modeling methods, most current approaches still rely primarily on 2D observation input, exhibiting limited ability to accurately perceive and model physical constraints in real 3D scenes. Directly introducing 3D or 2.5D input often faces high computational costs, complex data acquisition, and significant training expenses. While improvements through feature fusion, implicit geometry enhancement, or generative world models have improved planning or semantic understanding to some extent, they generally fail to significantly enhance the stable encoding of the current scene's geometry, thus limiting the model's spatial understanding accuracy and action execution reliability in real-world complex environments. Therefore, current related technologies suffer from problems such as reliance on semantic information, a lack of effective modeling of real-world scene geometry and fine-grained spatial relationships, and the high computational cost and data acquisition difficulties resulting from introducing additional 3D input to enhance spatial perception.

[0032] Based on this, embodiments of the present invention provide a robot motion generation method and apparatus based on generative supervision enhancement, which can solve the above-mentioned technical problems, that is, improve the robot's spatial understanding, task reasoning and motion execution capabilities in real environment without adding additional 3D input.

[0033] It should be noted that the execution subject of the embodiments of the present invention may be a robot motion generation device based on generative supervision enhancement, or it may be an electronic device including the device, or it may be a robot including the electronic device, or it may be other devices or systems, etc. There are no specific limitations here. The following embodiments will use an electronic device as the execution subject for illustration.

[0034] Figure 1 is one of the flowcharts illustrating the robot motion generation method based on generative supervision enhancement provided by the present invention. As shown in Figure 1, the method includes the following steps:

Step 102: Obtain the visual image and task instruction text for the target scene.

[0035] The target scenario can be the environment in which the robot needs to perform operations. It can be a real-world scenario, such as a factory workshop, room, office, or industrial park. Alternatively, this target scenario can be called an embodied environment / scenario. Within this embodied environment / scenario, the robot / embodied agent achieves autonomous learning and evolution through dynamic interaction between its body and the environment, deeply integrating perception, action, and cognition. When controlling the robot's actions within the target scenario, a visual image of the target scenario can be acquired first. This visual image can be obtained through cameras deployed in the target scenario or through cameras installed on the robot. This visual image is a two-dimensional visual image. Simultaneously, task instruction text can be acquired. This task instruction text can be obtained through user input and is used to instruct the robot to perform the desired action, such as controlling the robot to assemble A onto B.

[0036] Step 104: Input the visual image and task instruction text into the visual language backbone network for feature extraction processing to determine the joint features corresponding to the visual image and task instruction text; the joint features are used to characterize the spatial and semantic features of the target scene. The visual language backbone network is obtained by generative supervised augmentation training based on the first sample set. The first sample set includes multiple sets of first sample data and ground truth depth maps corresponding to each set of first sample data. Each set of first sample data includes sample visual images and sample task instruction text for the sample scene.

[0037] In this step, a visual language backbone network can be pre-trained using generative supervised augmentation. This visual language backbone network can be a backbone network based on a vision-language model (VLM). During training, multiple sets of sample visual images and sample task instruction texts for the sample scene are first acquired, along with ground truth depth maps corresponding to each set of sample visual images and task instruction texts. These ground truth depth maps reflect spatial geometric information such as the depth level, boundaries, and spatial layout of the sample scene. Then, each set of sample visual images and sample task instruction texts is used as the input to the initial visual language backbone network, and the ground truth depth maps corresponding to each set of sample visual images and sample task instruction texts are used as the supervision information for the output of the initial visual language backbone network. This allows for generative supervised augmentation training of the initial visual language backbone network, resulting in a well-trained visual language backbone network.
Specifically, the training can involve inputting each set of sample visual images and sample task instruction text into the initial visual language backbone network for feature extraction, determining the sample joint features corresponding to each set of sample visual images and sample task instruction text; then mapping each sample joint feature to each predicted depth map, and using the loss between each predicted depth map and the corresponding ground truth depth map to perform generative supervised augmentation training on the initial visual language backbone network; this enables the visual language backbone network to learn the spatial geometric information and semantic / linguistic information of the scene from the sample visual images and sample task instruction text and reflect it in the generated joint features, so that the generated joint features can reflect the spatial features and semantic / linguistic features of the scene.

[0038] Once trained, the visual language backbone network is able to extract scene spatial geometric information and semantic information from input images and text. At this point, the visual image of the target scene and the task instruction text can be input into the trained visual language backbone network for feature extraction. This network includes a visual encoder, a language encoder, and a connection module (such as a lightweight neural network, for example a multilayer perceptron, MLP). The visual encoder first encodes the input visual image to obtain multiple image tokens, and the language encoder encodes the input task instruction text to obtain multiple text tokens. The connection module then aligns the feature space of the image tokens with the embedding space of the text tokens, obtaining the joint features corresponding to the visual image and task instruction text of the target scene as a whole. These joint features include the spatial features (i.e., spatial geometric information) and semantic features (i.e., semantic / language information) of the target scene, so that the scene's spatial geometric and semantic information can subsequently be combined to generate more accurate robot actions.
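The role of the connection module described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that alignment means projecting each image token into the text embedding dimension with a two-layer MLP and then concatenating the projected image tokens with the text tokens. The token counts, dimensions, random weights, and the names `make_mlp` and `connect` are hypothetical and are not specified by the text.

```python
import math
import random


def linear(x, weight, bias):
    """Apply y = W x + b to one token vector (weight is a list of rows)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def gelu(x):
    """Element-wise GELU activation."""
    return [0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x]


def make_mlp(in_dim, hidden_dim, out_dim, rng):
    """Create a two-layer MLP with small random weights (untrained, illustrative)."""
    def init(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        "w1": init(hidden_dim, in_dim), "b1": [0.0] * hidden_dim,
        "w2": init(out_dim, hidden_dim), "b2": [0.0] * out_dim,
    }


def connect(image_tokens, text_tokens, mlp):
    """Project image tokens into the text embedding space, then concatenate.

    Concatenation is an assumption of this sketch; the text only states that
    the two spaces are aligned to obtain joint features.
    """
    projected = [
        linear(gelu(linear(tok, mlp["w1"], mlp["b1"])), mlp["w2"], mlp["b2"])
        for tok in image_tokens
    ]
    return projected + text_tokens


rng = random.Random(0)
image_tokens = [[rng.random() for _ in range(4)] for _ in range(6)]  # 6 tokens, dim 4
text_tokens = [[rng.random() for _ in range(3)] for _ in range(5)]   # 5 tokens, dim 3
mlp = make_mlp(in_dim=4, hidden_dim=8, out_dim=3, rng=rng)
joint_features = connect(image_tokens, text_tokens, mlp)
```

After projection, all tokens share one dimension, so a downstream network can treat image and text tokens uniformly.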

[0039] Optionally, the aforementioned visual language backbone network is an autoregressive network. Because the visual language backbone network is trained continuously on multiple sets of sample visual images and sample task instruction texts for sample scenes together with their ground truth depth maps, the training process can be relatively simple and stable, and the quality of the joint features generated by the trained visual language backbone network can be improved.

[0040] Step 106: Input the joint features, visual image and task instruction text into the action generation network for action generation processing to determine the action sequence that the robot needs to perform in the target scene; the action sequence includes at least one action.

[0041] In this step, after obtaining the joint features of the target scene through the visual-language backbone network, these joint features (i.e., key-value tokens), the visual image of the target scene, and the task text instructions are input into the action generation network. The action generation network, guided by these joint features, visual image, and task text instructions, combines the spatial geometric and semantic information of the target scene to generate a sequence of actions the robot needs to perform in the target scene. This sequence includes one or more actions that the robot needs to perform, where each action can be represented by the three-dimensional coordinates of the robot's motion actuator (e.g., a robotic arm). This action sequence allows the robot to accurately execute the corresponding actions in the target scene and precisely complete the task instructions.

[0042] Optionally, the above-mentioned action generation network is a network with a diffusion structure. The diffusion model / diffusion structure starts with random noise (action noise) and iteratively refines the entire action sequence by progressive denoising. Instead of predicting the next point step by step, it corrects the global action trajectory block by block, which makes the generated actions generally more coherent and smoother over time, with less jitter.
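The iterative refinement described above can be sketched with a toy loop. This is not a trained diffusion model: `toy_denoiser` is a hypothetical stand-in for the learned noise predictor, and the conditioning is collapsed into a per-step target trajectory (`target_from_condition`) so that the loop is runnable. The step count and step size are likewise illustrative assumptions.

```python
import random


def toy_denoiser(actions, target_from_condition):
    """Hypothetical noise predictor: returns each action's offset from a
    condition-derived target. A real network would learn this mapping."""
    return [
        [a - t for a, t in zip(action, target)]
        for action, target in zip(actions, target_from_condition)
    ]


def generate_actions(target_from_condition, steps=50, step_size=0.2, seed=0):
    """Start from Gaussian action noise and refine the whole sequence at once."""
    rng = random.Random(seed)
    actions = [[rng.gauss(0.0, 1.0) for _ in target] for target in target_from_condition]
    for _ in range(steps):
        predicted_noise = toy_denoiser(actions, target_from_condition)
        # Every action in the sequence is corrected in the same iteration,
        # rather than predicting one action after another.
        actions = [
            [a - step_size * n for a, n in zip(action, noise)]
            for action, noise in zip(actions, predicted_noise)
        ]
    return actions


# Three actions, each a 3D coordinate of the motion actuator (illustrative values).
target_from_condition = [[0.1, 0.2, 0.3], [0.2, 0.2, 0.3], [0.3, 0.2, 0.25]]
sequence = generate_actions(target_from_condition)
```

Because each iteration updates the full trajectory, neighbouring actions are refined together, which is the property the text associates with smoother output.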

[0043] In this embodiment, the visual image and task instruction text for the target scene are acquired and input into the visual language backbone network for feature extraction processing to determine the corresponding joint features. The joint features, visual image, and task instruction text are then input into the action generation network for action generation processing to determine the action sequence that the robot needs to perform in the target scene, which includes at least one action. The joint features are used to characterize the spatial and semantic features of the target scene. The visual language backbone network is obtained by generative supervised augmentation training based on a first sample set. The first sample set includes multiple sets of first sample data and the ground truth depth map corresponding to each set of first sample data. Each set of first sample data includes sample visual images and sample task instruction texts for the sample scene. In this method, the feature extraction process of the visual language backbone network can be pre-trained with generative supervision using scene depth maps. This introduces spatial geometric supervision information of the scene into the feature extraction process, enabling the visual language backbone network to understand the scene not only through semantic information but also by explicitly considering depth levels, boundaries, and spatial layout. This multi-dimensional understanding of the scene, combining semantic and spatial geometric information, allows the subsequent action generation network to reference this scene understanding when generating actions, resulting in more accurate and reasonable robot actions. Furthermore, during action generation this method only requires the visual image and task instruction text as input to the visual language backbone network. The network can directly learn the scene's spatial structure information from ordinary 2D visual images and task instruction text without needing point clouds, depth maps, or other additional 3D sensor data as input. This reduces the overall system complexity and application cost.
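The inference flow summarized above (steps 104 and 106) can be sketched as a thin wrapper around two callables. The function name `generate_robot_actions` and both stub networks are hypothetical placeholders; they only show which inputs each stage receives and that no depth map or point cloud is passed at inference time.

```python
from typing import Callable, List, Sequence

Action = Sequence[float]  # e.g. (x, y, z) of the robot's motion actuator


def generate_robot_actions(
    image,
    instruction: str,
    backbone: Callable,
    action_network: Callable,
) -> List[Action]:
    """Run the two inference stages using only a 2D image and text."""
    joint_features = backbone(image, instruction)                 # step 104
    actions = action_network(joint_features, image, instruction)  # step 106
    if len(actions) < 1:
        raise ValueError("the action sequence must contain at least one action")
    return list(actions)


def stub_backbone(image, instruction):
    """Placeholder for the trained visual language backbone network."""
    return [float(len(instruction)), float(len(image))]


def stub_action_network(joint_features, image, instruction):
    """Placeholder for the action generation network."""
    return [(0.0, 0.0, 0.1), (0.1, 0.0, 0.1)]


plan = generate_robot_actions(
    [[0, 0], [0, 0]], "assemble A onto B", stub_backbone, stub_action_network
)
```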

[0044] The above embodiments briefly illustrate the training process of the visual language backbone network. The following embodiments will describe its specific training process.

[0045] Figure 2 is the second flowchart of the robot motion generation method based on generative supervision enhancement provided by the present invention. As shown in Figure 2, the training process of the aforementioned visual language backbone network includes: Step 202: Input each first sample data into the initial visual language backbone network for feature extraction processing to determine the predicted joint features corresponding to each first sample data.

[0046] In order to enhance the visual language backbone network’s understanding of spatial relationships, physical attributes and operational processes in the embodied environment, this embodiment of the invention needs to first construct a large-scale embodied pre-training dataset. That is, different sample visual images and sample task instruction texts can be constructed to form different first sample data. At the same time, ground truth depth maps under different first sample data are collected, and each group of first sample data and its corresponding ground truth depth map are bound together to form a first training set.

[0047] Optionally, the aforementioned first sample data includes at least one of the following types of first sample data: The first type of sample data consists of sample visual images and sample task instruction text for sample scenes. The sample task instruction text includes questions and answers regarding the spatial positioning and / or referential relationships of objects in the sample visual images within the sample scenes.

[0048] The sample scenes in this type of sample data may be the same or different. This type focuses on embodied spatial localization and referential analysis, and is primarily used to train the visual language backbone network's ability to understand the location, attributes, referential relationships, and local spatial relationships of target objects. Through this sample data, the visual language backbone network can learn basic spatial abilities such as answering "Where is the target?" and "What is the relationship between the target and other objects?". Specifically, sample visual images of the sample scene, each including one or more objects, can be collected. Then, the questions and answers / data responses regarding the spatial localization and / or referential relationships of the objects that the network needs to learn can be annotated as sample task instruction text. These answers / data responses can be annotated using bounding boxes, point annotations, etc.

[0049] The second type of sample data consists of sample visual images and sample task instruction text for the sample scene; the sample task instruction text includes questions and answers regarding the geometric constraints and / or physical relationships of objects in the sample scene.

[0050] This type of sample data includes sample scenarios that may be the same or different. It is specifically designed for physical and spatial reasoning, primarily used to train a visual language backbone network to understand geometric constraints and physical relationships in real-world environments, such as size, front / back, occlusion, contact, accessibility, stability, support relationships, and containment relationships. This sample data can originate from two-dimensional projections of real robot scenes, simulation environments, or publicly available 3D scene data, and can be constructed into visual question-and-answer, judgment, or reasoning samples. Specifically, it involves collecting two-dimensional projections of sample scenes as visual images, simultaneously labeling questions and answers regarding the geometric constraints and / or physical relationships of objects, and using these as sample task instruction text.

[0051] The third type of sample data consists of sample visual images and sample task instruction texts for the sample scene; the sample task instruction texts include questions and answers regarding spatiotemporal planning when the robot operates objects in the sample scene.

[0052] The sample scenes in this type of sample data may be the same or different. This type is specifically designed for spatiotemporal planning and is primarily used to train the visual language backbone network to understand operational steps, action sequences, and temporal dependencies. This sample data can contain atomic action sequences, keyframe trajectories, subtask descriptions, and corresponding language descriptions, enabling the visual language backbone network to learn "what to do first, what to do next, and how to complete the entire task" within an embodied environment. Specifically, sample visual images of the sample scenes can be collected, and the questions and answers regarding spatiotemporal planning when the robot manipulates objects within the sample scenes can be annotated at the same time and used as sample task instruction text.

[0053] Each of the above three types of sample data may include multiple first sample data. After these are obtained, each first sample data can be input into the untrained initial visual language backbone network for feature extraction processing to obtain its corresponding joint features, which are recorded as the predicted joint features of that first sample data.
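One possible in-memory layout for a first sample data record and its bound ground truth depth map is sketched below. The class and field names are hypothetical, and the check that the depth map matches the image resolution is an assumption of this sketch rather than a requirement stated in the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class SampleKind(Enum):
    SPATIAL_GROUNDING = 1        # spatial positioning / referential relationships
    PHYSICAL_REASONING = 2       # geometric constraints / physical relationships
    SPATIOTEMPORAL_PLANNING = 3  # ordering of operations on objects


@dataclass
class FirstSample:
    kind: SampleKind
    image: List[List[float]]      # 2D sample visual image (grayscale for brevity)
    question: str                 # question part of the sample task instruction text
    answer: str                   # answer part of the sample task instruction text
    depth_map: List[List[float]]  # ground truth depth map bound to this sample

    def __post_init__(self):
        # Assumed constraint: one depth value per image pixel.
        same_rows = len(self.depth_map) == len(self.image)
        same_cols = all(len(d) == len(r) for d, r in zip(self.depth_map, self.image))
        if not (same_rows and same_cols):
            raise ValueError("depth map must match the image resolution")


sample = FirstSample(
    kind=SampleKind.PHYSICAL_REASONING,
    image=[[0.1, 0.2], [0.3, 0.4]],
    question="Is the cup supported by the table?",
    answer="Yes",
    depth_map=[[1.2, 1.2], [0.9, 0.9]],
)
```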

[0054] Step 204: Input the predicted joint features corresponding to each first sample data into the connection network for feature mapping processing to determine the predicted condition features corresponding to each first sample data.

[0055] In this step, a connection network can be set up to align the joint features output by the visual language backbone network with the input space of the generative deep network. This connection network can be a neural network, such as one consisting of two convolutional layers. The generative deep network here is a network that generates a depth map of the scene based on the features output by the connection network. It provides structured spatial supervision information for the training of the visual language backbone network, thereby enhancing the network's ability to extract spatial geometric information of scenes from two-dimensional visual images.

[0056] Specifically, after the initial visual language backbone network obtains the predicted joint features corresponding to each first sample data, it can input them into the connection network for feature mapping processing. The connection network maps the predicted joint features corresponding to each first sample data to the input space of the generative deep network, and the mapped features are all denoted as predicted condition features.

[0057] Step 206: Input the predicted condition features corresponding to each first sample data into the generative deep network for depth image generation processing to determine the predicted depth map corresponding to each first sample data; the dimension of the predicted condition features corresponding to the first sample data is consistent with the dimension of the input space of the generative deep network.

[0058] In this step, after obtaining the predicted condition features corresponding to each first sample data through the connection network, these features can be input into the generative deep network for depth image generation processing, generating a depth map corresponding to each first sample data, which is denoted as the predicted depth map. Here, the dimension of the input space of the generative deep network is the same as the dimension of the predicted condition features corresponding to each first sample data, thus ensuring that the generative deep network can quickly and accurately generate each predicted depth map.
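The dimension requirement between the connection network and the generative deep network can be sketched as follows. For brevity the connection network is replaced by a single linear map (the text mentions, for example, two convolutional layers), and `toy_depth_network` is a hypothetical placeholder that only checks its input dimension and returns a constant map. All sizes and weights are illustrative assumptions.

```python
def connection_network(joint_features, weight):
    """Map each joint feature vector into the depth network's input space.

    `weight` has one row per output dimension; a linear map stands in for
    the connection network in this sketch.
    """
    return [
        [sum(w * v for w, v in zip(row, feature)) for row in weight]
        for feature in joint_features
    ]


def toy_depth_network(condition_features, input_dim, height, width):
    """Placeholder generator: validates the condition dimension, then emits
    a constant depth map derived from the pooled condition features."""
    for feature in condition_features:
        if len(feature) != input_dim:
            raise ValueError(
                f"condition feature has dim {len(feature)}, expected {input_dim}"
            )
    pooled = sum(sum(f) for f in condition_features) / (len(condition_features) * input_dim)
    return [[pooled for _ in range(width)] for _ in range(height)]


predicted_joint_features = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.5]]  # 2 tokens, dim 3
weight = [[0.1, 0.2, 0.3], [0.0, 1.0, 0.0]]                    # maps dim 3 -> dim 2
predicted_condition_features = connection_network(predicted_joint_features, weight)
predicted_depth = toy_depth_network(
    predicted_condition_features, input_dim=2, height=2, width=3
)
```

Passing the unmapped joint features (dimension 3) directly to `toy_depth_network` raises a `ValueError`, which is the mismatch the connection network exists to remove.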

[0059] Optionally, the generative deep network (also known as a generative depth head or depth module) can employ a diffusion model / diffusion structure network. When generating a predicted depth map, the generative deep network can start with random noise (i.e., depth noise) and iteratively refine the entire depth map by progressive denoising, guided by the predicted condition features of each first sample data, to obtain the final predicted depth map.

[0060] Step 208: Based on the predicted depth map corresponding to each first sample data and its corresponding ground truth depth map, perform generative supervised augmentation training on the initial visual language backbone network to obtain the visual language backbone network.

[0061] In this step, after obtaining the predicted depth map corresponding to each first sample data, the initial visual language backbone network can be subjected to generative supervised augmentation training by calculating the loss between the predicted depth map and the corresponding ground truth depth map for each first sample data, thus obtaining the visual language backbone network. Specifically, when training the initial visual language backbone network, one can adjust only the network parameters of the initial visual language backbone network; or one can adjust the network parameters of at least one of the connection network and the generative deep network while adjusting the network parameters of the initial visual language backbone network; or one can adjust the network parameters of at least one of the connection network, the generative deep network, and the action generation network while adjusting the network parameters of the initial visual language backbone network.

[0062] Optionally, because the output space of the visual language backbone network differs from the input space of the generative deep network, direct end-to-end joint training from the input of the visual language backbone network to the output of the generative deep network can easily prevent the connection network between them from converging and can even make training of the network model unstable. For this reason, this embodiment of the invention proposes a progressive training scheme that divides the training of the visual language backbone network into three progressive stages to improve training stability. This will be explained below.

[0063] For the first stage of the training process, optionally, before the generative supervised augmentation training is performed on the initial visual language backbone network in step 208 above (based on the predicted depth map and the corresponding ground truth depth map for each first sample data) to obtain the visual language backbone network, the method further includes: freezing the network parameters of the initial visual language backbone network and the network parameters of the generative deep network; and, based on the predicted depth map and the corresponding ground truth depth map for each first sample data, performing first-stage generative supervised training on the initial connection network to obtain the connection network.

[0064] Figure 3 illustrates the network architecture of the robot motion generation method based on generative supervision enhancement provided by this invention. It depicts the three-stage training and inference process of the visual language backbone network, where N / A indicates none, meaning that the action generation network may not be included in the network model architecture during the three stages of training the visual language backbone network. In the first stage of training, the network parameters of the initial visual language backbone network and the generative deep network can be frozen, and the connection network is initialized / trained. Specifically, the depth reconstruction loss can be calculated using the predicted depth map and the corresponding ground truth depth map for each first sample data, and the network parameters of the initial connection network can be adjusted using the depth reconstruction loss to obtain the first-stage trained connection network.

[0065] The purpose of training the connection network in this first stage is to initially project the output features of the visual language backbone network into the input / conditional space required by the generative deep network, complete the initial alignment at the feature level, and establish a stable representation foundation for subsequent training.

[0066] The loss calculated in the first stage is mainly the depth reconstruction loss / depth supervision loss. Let the visual input of the visual language backbone network be $o$, the language text input be $l$, the output representation of the visual language backbone network be $h$, and the conditional feature output by the connection network be $c$. The generative deep network then outputs a predicted depth map $\hat{D}$ guided by the conditional feature $c$. The corresponding depth reconstruction loss can be represented as: $\mathcal{L}_{\text{depth}} = \mathcal{L}(\hat{D}, D)$, where $D$ is the ground truth depth map (or a pseudo ground truth depth map) and $\mathcal{L}$ denotes the depth reconstruction loss function, which can be computed, for example, as a mean squared error or a diffusion matching error.
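As a minimal sketch of the mean squared error variant of the depth reconstruction loss, the function below averages the squared per-pixel difference between a predicted depth map and its ground truth depth map. The function name and the example values are illustrative assumptions; the diffusion matching variant is not shown here.

```python
def depth_reconstruction_loss(predicted, ground_truth):
    """Mean squared error over all pixels of two equally sized depth maps."""
    if len(predicted) != len(ground_truth) or any(
        len(p_row) != len(g_row) for p_row, g_row in zip(predicted, ground_truth)
    ):
        raise ValueError("depth maps must have the same shape")
    squared_errors = [
        (p - g) ** 2
        for p_row, g_row in zip(predicted, ground_truth)
        for p, g in zip(p_row, g_row)
    ]
    return sum(squared_errors) / len(squared_errors)


predicted_depth_map = [[1.0, 2.0], [3.0, 4.0]]
ground_truth_depth_map = [[1.0, 2.5], [2.0, 4.0]]
# Squared errors are 0, 0.25, 1 and 0, so the mean is 1.25 / 4 = 0.3125.
loss = depth_reconstruction_loss(predicted_depth_map, ground_truth_depth_map)
```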

[0067] After the connection network is trained in the first stage, the second stage of training can be performed. Optionally, before the generative supervised augmentation training is performed on the initial visual language backbone network in step 208 above (based on the predicted depth map and the corresponding ground truth depth map for each first sample data) to obtain the visual language backbone network, the method further includes: freezing the network parameters of the initial visual language backbone network; and, based on the predicted depth map and the corresponding ground truth depth map for each first sample data, performing second-stage generative supervised training on the first-stage trained connection network and the initial generative deep network to obtain the connection network and the generative deep network.

[0068] In the second stage of training, the network parameters of the initial visual language backbone network can be frozen first. The main purpose is to initialize / train the generative deep network. Specifically, the first-stage trained connection network and the initial generative deep network can be jointly trained. The depth reconstruction loss can be calculated using the predicted depth map and the corresponding ground truth depth map for each first sample data. The network parameters of the first-stage trained connection network and the initial generative deep network can then be adjusted using the depth reconstruction loss to obtain the second-stage trained connection network and the second-stage trained generative deep network.

[0069] The purpose of training the generative deep network in this second stage is to enable it to gradually adapt to the conditional features provided by the visual language backbone network and the connection network, and to establish a mapping relationship between high-level semantic representations and fine-grained depth structures. The loss calculated in this second stage is mainly the depth reconstruction loss / depth supervision loss; the specific calculation method is the same as described above and will not be repeated here.

[0070] After the connection network and the generative deep network are trained in the first and second stages, the third stage of training can be performed. Optionally, performing generative supervised augmentation training on the initial visual language backbone network in step 208 above, based on the predicted depth map and the corresponding ground truth depth map for each first sample data, to obtain the visual language backbone network includes: based on the predicted depth map and the corresponding ground truth depth map for each first sample data, performing third-stage generative supervised augmentation training on the second-stage trained connection network, the second-stage trained generative deep network, and the initial visual language backbone network to obtain the connection network, the generative deep network, and the visual language backbone network.

[0071] In the third stage of training, the second-stage trained connection network, the second-stage trained generative deep network, and the initial visual language backbone network can be jointly trained. The depth reconstruction loss can be calculated using the predicted depth map and the corresponding ground truth depth map for each first sample data. The network parameters of the second-stage trained connection network, the second-stage trained generative deep network, and the initial visual language backbone network can then be adjusted using the depth reconstruction loss to obtain the third-stage trained connection network, the third-stage trained generative deep network, and the third-stage trained visual language backbone network.
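The three-stage progressive schedule can be summarized as a table of which networks are trainable in each stage. The sketch below encodes that table; the module keys and the function name `requires_grad` are hypothetical labels, and in a deep learning framework the returned flags would typically be applied to each network's parameters before optimization.

```python
# Networks whose parameters are updated in each stage; all others are frozen.
STAGE_TRAINABLE = {
    1: {"connection"},                       # backbone and depth network frozen
    2: {"connection", "depth"},              # backbone frozen
    3: {"connection", "depth", "backbone"},  # joint end-to-end training
}
ALL_MODULES = ("backbone", "connection", "depth")


def requires_grad(stage):
    """Return a flag per network: True if it is trained in the given stage."""
    if stage not in STAGE_TRAINABLE:
        raise ValueError(f"unknown training stage: {stage}")
    return {name: name in STAGE_TRAINABLE[stage] for name in ALL_MODULES}


schedule = {stage: requires_grad(stage) for stage in (1, 2, 3)}
```

The set of trainable networks only grows from one stage to the next, so each stage starts from weights that the previous stage has already aligned.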

[0072] Optionally, when performing third-stage training using the depth reconstruction loss, other losses can be combined for joint training. Optionally, the sample task instruction text of the first sample data includes sample questions and sample answers. Performing third-stage generative supervised augmentation training on the second-stage trained connection network, the second-stage trained generative deep network, and the initial visual language backbone network, based on the predicted depth map and the corresponding ground truth depth map for each first sample data, to obtain the connection network, the generative deep network, and the visual language backbone network, includes: decoding each predicted joint feature to determine the predicted answer corresponding to each first sample data; calculating the language task loss based on the predicted answer and the corresponding sample answer for each first sample data; calculating the depth reconstruction loss based on the predicted depth map and the corresponding ground truth depth map for each first sample data; and, based on the language task loss and the depth reconstruction loss, performing third-stage generative supervised augmentation training on the second-stage trained connection network, the second-stage trained generative deep network, and the initial visual language backbone network to obtain the connection network, the generative deep network, and the visual language backbone network.

[0073] The visual language backbone network can also include a decoder. After the connection module in the visual language backbone network obtains the predicted joint features corresponding to each first sample data, the decoder can decode these predicted joint features. That is, it can understand and reason about the semantics of the sample visual images and sample task instruction texts in each first sample data and predict the answer to the question in each sample task instruction text, which is recorded as the predicted answer, thus obtaining the predicted answer corresponding to each first sample data. Then, based on the predicted answer and the corresponding sample answer for each first sample data, the language task loss / text supervision loss / language modeling loss is calculated. For example, the cross-entropy loss function can be used to calculate the language task loss. At the same time, the depth reconstruction loss can be calculated using the predicted depth map and the corresponding ground truth depth map for each first sample data. Then, the depth reconstruction loss and the language task loss are weighted and summed to obtain the total loss. The network parameters of the second-stage trained connection network, the second-stage trained generative deep network, and the initial visual language backbone network are adjusted using the total loss to obtain the third-stage trained connection network, the third-stage trained generative deep network, and the third-stage trained visual language backbone network. The total loss here can be expressed as: $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{lang}} + \lambda \mathcal{L}_{\text{depth}}$, where $\mathcal{L}_{\text{total}}$ indicates the total loss, $\mathcal{L}_{\text{lang}}$ indicates the language task loss, $\mathcal{L}_{\text{depth}}$ indicates the depth reconstruction loss, and $\lambda$ represents the balance coefficient, ranging from 0 to 1.
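The weighted combination of the two losses can be sketched as below. The text states only that the losses are weighted and summed with a balance coefficient between 0 and 1; the specific form used here (language loss plus the coefficient times the depth loss) is an assumption of this sketch, as are the function names and the example probabilities.

```python
import math


def cross_entropy(token_probabilities, target_indices):
    """Mean negative log-likelihood of the sample answer tokens."""
    losses = [
        -math.log(probs[target])
        for probs, target in zip(token_probabilities, target_indices)
    ]
    return sum(losses) / len(losses)


def total_loss(language_loss, depth_loss, balance):
    """Assumed form: language loss + balance * depth reconstruction loss."""
    if not 0.0 <= balance <= 1.0:
        raise ValueError("balance coefficient must lie in [0, 1]")
    return language_loss + balance * depth_loss


# Predicted distributions over a 3-token vocabulary for a 2-token answer.
token_probabilities = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
sample_answer = [0, 1]
language_loss = cross_entropy(token_probabilities, sample_answer)
combined = total_loss(language_loss, depth_loss=0.3125, balance=0.5)
```

With a balance coefficient of 0 the depth term vanishes and training reduces to the language task alone, which makes the coefficient a direct control on how strongly geometry constrains the backbone.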

[0074] The third stage of training introduces the language task loss and the depth reconstruction loss simultaneously to optimize the network model end-to-end. The depth reconstruction loss continuously provides geometric structural constraints for the visual language backbone network, while the language task loss ensures that the network model retains its semantic understanding and instruction reasoning capabilities.

[0075] Further, optionally, the ground truth depth map corresponding to each first sample data is a ground truth depth map with added standard noise. The above calculation of the depth reconstruction loss based on the predicted depth map and the corresponding ground truth depth map for each first sample data includes: determining, based on each predicted depth map, the predicted noise corresponding to that predicted depth map; and calculating the depth reconstruction loss based on each predicted noise and the corresponding standard noise.

[0076] In this approach, the generative deep network can employ a diffusion structure. When outputting the predicted depth map, it iteratively refines the entire depth map by progressive denoising, guided by the predicted condition features of each first sample data, to obtain the final predicted depth map. At the same time, the total noise removed during the denoising process is recorded as the predicted noise. Then, the loss (e.g., mean squared error) between the predicted noise corresponding to each predicted depth map and the standard noise added to the corresponding ground truth depth map can be calculated, and this loss is used to characterize the depth reconstruction loss. Characterizing the depth reconstruction loss through the loss between the two noises simplifies the implementation.
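The noise-based form of the depth reconstruction loss can be sketched as follows. The sketch uses the common single-step formulation from denoising diffusion models, in which noise is mixed into the ground truth depth map with a coefficient `alpha_bar`; this mixing rule, the value of `alpha_bar`, and the `oracle_noise_predictor` (which recovers the noise exactly from the clean map and stands in for a trained network) are assumptions and are not taken from the text.

```python
import math
import random


def add_standard_noise(depth_map, noise, alpha_bar):
    """Mix Gaussian noise into a ground truth depth map (assumed DDPM-style rule)."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [
        [a * d + b * n for d, n in zip(d_row, n_row)]
        for d_row, n_row in zip(depth_map, noise)
    ]


def noise_mse(predicted_noise, standard_noise):
    """Mean squared error between predicted noise and the noise that was added."""
    squared = [
        (p - s) ** 2
        for p_row, s_row in zip(predicted_noise, standard_noise)
        for p, s in zip(p_row, s_row)
    ]
    return sum(squared) / len(squared)


def oracle_noise_predictor(noisy, clean, alpha_bar):
    """Hypothetical perfect predictor; a real network would not see `clean`."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [
        [(x - a * d) / b for x, d in zip(x_row, d_row)]
        for x_row, d_row in zip(noisy, clean)
    ]


rng = random.Random(0)
ground_truth = [[0.5, 1.0], [1.5, 2.0]]
standard_noise = [[rng.gauss(0.0, 1.0) for _ in row] for row in ground_truth]
noisy_depth = add_standard_noise(ground_truth, standard_noise, alpha_bar=0.9)

perfect = oracle_noise_predictor(noisy_depth, ground_truth, alpha_bar=0.9)
untrained = [[0.0 for _ in row] for row in ground_truth]  # predicts no noise at all
loss_perfect = noise_mse(perfect, standard_noise)
loss_untrained = noise_mse(untrained, standard_noise)
```

The loss is computed entirely in noise space, so no separate comparison of rendered depth maps is needed during training.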

[0077] In this embodiment, joint features of sample data are extracted through a visual language backbone network. These joint features are then mapped to conditional features via a connection network and input into a generative deep network to generate a predicted depth map. The visual language backbone network is trained based on the predicted and ground truth depth maps. This approach first establishes a global structural prior in the depth generation process (i.e., generating the predicted depth map) using an autoregressive or conditional modeling mechanism. Then, a diffusion-based denoising mechanism is used to gradually restore the details of the depth map. This allows the output depth map to maintain overall consistency while more accurately representing edges, contours, layers, and local geometric relationships in the scene. This constrains the visual language backbone network to learn the real geometric structure in the scene, improving the accuracy of subsequent robot action generation. Furthermore, the three-stage training mechanism for the visual language backbone network effectively avoids the training instability problems caused by direct joint training and gradually establishes a collaborative representation of "semantics-geometry-reasoning."

[0078] Furthermore, the trained visual-language backbone network and the action generation network can be combined to form a vision-language-action model to realize robot operation tasks. The action generation network can also be trained on the basis of the three-stage training of the visual-language backbone network to improve the accuracy of the final generated robot actions. The following embodiments will illustrate the training process of the action generation network.

[0079] In one embodiment, the training method of the above-mentioned action generation network includes: Step A1: Obtain the second sample set; the second sample set includes multiple sets of second sample data and the true action sequence corresponding to each set of second sample data. Each set of second sample data includes sample visual images and sample task instruction text for the sample scene.

[0080] First, multiple sets of second sample data can be collected. Each set of second sample data includes sample visual images and sample task instruction text for the sample scene. Simultaneously, ground truth depth maps for each set of second sample data within the sample scene are collected, along with action sequences of robot actions performed under each set of second sample data, denoted as ground truth action sequences. Each ground truth action sequence includes one or more ground truth actions. Then, each set of second sample data, along with its ground truth depth map and ground truth action sequence, is bound together to obtain the second sample set.

[0081] Step A2: Input each second sample data into the three-stage trained visual language backbone network for feature extraction processing to determine the predicted joint features corresponding to each second sample data.

[0082] In this step, after obtaining the second sample data for each group, it can be input into the three-stage trained visual language backbone network for feature extraction processing to obtain the joint features corresponding to each group of second sample data, which are all recorded as the predicted joint features of the second sample data.

[0083] Step A3: Input the predicted joint features corresponding to each second sample data into the initial action generation network for action generation processing to determine the predicted action sequence that the robot needs to execute in the sample scene of the second sample data.

[0084] In this step, after obtaining the predicted joint features of each second sample data, they can be input into the initial action generation network. The initial action generation network generates the action sequence that the robot needs to perform in the sample scene of each second sample data, which is denoted as the predicted action sequence. Each predicted action sequence includes one or more predicted actions.

[0085] Step A4: Based on the predicted action sequence and the corresponding ground truth action sequence for each second sample data, perform generative supervised training on the initial action generation network to obtain the action generation network.

[0086] In this step, after obtaining the predicted action sequence corresponding to each second sample data, the action loss between the predicted action sequence and the corresponding ground truth action sequence can be calculated. The action loss is then used to adjust the network parameters of the initial action generation network to obtain a trained action generation network. Alternatively, the action loss can be used to adjust the network parameters of the initial action generation network together with the network parameters of at least one of the three-stage trained visual language backbone network, the three-stage trained connection network, and the three-stage trained generative deep network, ultimately obtaining a trained action generation network. Alternatively, when training the action generation network using the action loss, other losses can be combined with it.
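As a minimal sketch of adjusting the action generation network with an action loss, the following toy example fits a linear action head to fixed joint features by gradient descent. The mean squared error loss, the linear head, the learning rate, and all names are assumptions chosen for illustration; the text above does not specify the loss form or the network architecture, and the backbone here is treated as frozen.

```python
import random


def mse(pred, target):
    """Mean squared error between a predicted and a ground truth action."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def predict_actions(weights, features):
    """Toy action head: each action value is a weighted sum of joint features."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def train_action_head(samples, n_actions, lr=0.1, epochs=200, seed=0):
    """Fit the toy head on (joint_features, ground_truth_action) pairs."""
    rng = random.Random(seed)
    dim = len(samples[0][0])
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
               for _ in range(n_actions)]
    for _ in range(epochs):
        for features, truth in samples:
            pred = predict_actions(weights, features)
            for i in range(n_actions):
                err = pred[i] - truth[i]
                for j in range(dim):
                    # d(MSE)/d(weights[i][j]) = 2 * err * features[j] / n_actions
                    weights[i][j] -= lr * 2.0 * err * features[j] / n_actions
    return weights


# Hypothetical joint features paired with ground truth actions.
samples = [([1.0, 0.0], [0.5, -0.5]), ([0.0, 1.0], [0.2, 0.8])]
weights = train_action_head(samples, n_actions=2)
final_loss = sum(
    mse(predict_actions(weights, f), t) for f, t in samples
) / len(samples)
```

In the alternative described above, the same loss would also update the backbone, connection network, and generative deep network rather than the action head alone.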

[0087] It can be understood that the inference phase of the action generation network may not require the connection network and the generative deep network, while in the training phase the action generation network can be trained jointly with the visual language backbone network, the connection network and the generative deep network.
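The difference between the training path and the inference path can be sketched as follows. The class, its field names, and the stand-in callables are hypothetical; for brevity the action head here receives only the joint features, which is a simplification of the described inputs.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Vector = List[float]


@dataclass
class VisionLanguageActionModel:
    """Illustrative container; the depth branch is optional at inference."""
    backbone: Callable[[Vector, str], Vector]
    action_head: Callable[[Vector], Vector]
    connector: Optional[Callable[[Vector], Vector]] = None
    depth_generator: Optional[Callable[[Vector], Vector]] = None

    def infer(self, image: Vector, instruction: str) -> Vector:
        # Inference uses only the backbone and the action head.
        return self.action_head(self.backbone(image, instruction))

    def training_forward(self, image: Vector, instruction: str) -> Tuple[Vector, Vector]:
        # Joint training additionally needs the depth branch.
        if self.connector is None or self.depth_generator is None:
            raise RuntimeError("depth branch is required for joint training")
        joint = self.backbone(image, instruction)
        return self.action_head(joint), self.depth_generator(self.connector(joint))


# Stand-in callables, used only to exercise the two paths.
def toy_backbone(image, text):
    return [sum(image), float(len(text))]


def toy_action_head(joint):
    return [0.5 * joint[0]]


deploy_model = VisionLanguageActionModel(toy_backbone, toy_action_head)
train_model = VisionLanguageActionModel(
    toy_backbone,
    toy_action_head,
    connector=lambda joint: [2.0 * x for x in joint],
    depth_generator=lambda cond: [sum(cond)],
)
```

A deployment built this way can drop the connection network and generative deep network entirely, which is consistent with the stated goal of avoiding extra cost at inference.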

[0088] Furthermore, as noted above, the action loss can be combined with other losses when training the action generation network. Optionally, in step A4 above, performing generative supervised enhancement training on the initial action generation network based on the predicted action sequence and the corresponding ground truth action sequence for each second sample data to obtain the action generation network includes: inputting the predicted joint features corresponding to each second sample data into the three-stage trained connection network for feature mapping processing to determine the predicted conditional features corresponding to each second sample data; inputting the predicted conditional features corresponding to each second sample data into the three-stage trained generative deep network for depth image generation processing to determine the predicted depth map corresponding to each second sample data; and performing generative supervised enhancement training on the initial action generation network based on the predicted depth map and the corresponding ground truth depth map, as well as the predicted action sequence and the corresponding ground truth action sequence, for each second sample data, to obtain the action generation network.

[0089] Referring to Figure 3, after obtaining the predicted joint features of each second sample data, these features can be input into the three-stage trained connection network for feature mapping to obtain the conditional features corresponding to each second sample data, which are denoted as the predicted conditional features of the second sample data. The predicted conditional features of each second sample data are then input into the three-stage trained generative deep network for depth image generation to obtain the depth map corresponding to each second sample data, which is denoted as the predicted depth map of the second sample data. The depth reconstruction loss can then be calculated using the predicted depth map and the corresponding ground truth depth map for each second sample data. Combined with the aforementioned action loss, a total action loss is formed. This total action loss is then used to adjust the network parameters of the initial action generation network, as well as the network parameters of the three-stage trained visual language backbone network, the three-stage trained connection network, and the three-stage trained generative deep network. Finally, the trained action generation network, the trained visual language backbone network, the trained connection network, and the trained generative deep network are obtained (i.e., phase 4 of the training process in Figure 3).

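[0090] The total action loss here can be expressed as a weighted combination of the action loss and the depth reconstruction loss, for example: $L_{total} = L_{action} + \lambda L_{depth}$, where $L_{total}$ denotes the total action loss, $L_{action}$ denotes the action loss, $L_{depth}$ denotes the depth reconstruction loss, and $\lambda$ denotes the balance coefficient, ranging from 0 to 1.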
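A minimal sketch of this weighting is given below. It assumes that the balance coefficient scales the depth reconstruction term and that the two losses are simply added; the function and parameter names are illustrative.

```python
def total_action_loss(action_loss, depth_loss, balance=0.5):
    """Combine the action loss with a weighted depth reconstruction loss.

    Assumption: the balance coefficient multiplies the depth term only.
    """
    if not 0.0 <= balance <= 1.0:
        raise ValueError("balance coefficient must lie in [0, 1]")
    return action_loss + balance * depth_loss


example_total = total_action_loss(action_loss=2.0, depth_loss=4.0, balance=0.25)
```

With a balance coefficient of 0 the depth branch contributes nothing and training reduces to the action loss alone; larger values give the geometric supervision more influence.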

[0091] In this embodiment, after the three-stage training of the visual language backbone network is completed, the joint features generated by the three-stage trained visual language backbone network are input into the action generation network to obtain the predicted action sequence. The action generation network is then trained by the action loss between the predicted action sequence and the ground truth action sequence. This allows for rapid and accurate training of the action generation network, improving the efficiency and accuracy of generating robot actions. Furthermore, by inputting the joint features generated by the three-stage trained visual language backbone network into the three-stage trained connection network and generative deep network to obtain the predicted depth map, the action generation network is jointly trained by combining the depth reconstruction loss between the predicted depth map and the ground truth depth map with the action loss. Here, the generative deep network continuously provides scene geometry and spatial relationship information, enabling the action generation network to not only rely on semantic features when generating robot actions, but also to utilize the scene's depth hierarchy, structural boundaries, and spatial layout information to make action decisions. This improves the robot's perception and manipulation capabilities of the real environment, thereby enhancing the accuracy of generated robot actions. In other words, the enhanced representation formed in spatial understanding can be transferred to robot operation tasks, improving the accuracy, stability, and generalization ability of action prediction, and thus improving the robot's grasping accuracy, interaction stability, and task completion success rate in complex environments.

[0092] As can be seen from the above embodiments, the network architecture of this invention includes a visual language backbone network, a connection network, a generative deep network, and an action generation network. The visual language backbone network extracts joint visual and linguistic features; the connection network aligns the output features of the backbone network with the generative conditional space of the generative deep network; the generative deep network predicts scene depth and enhances spatial understanding under 2D visual input; and the action generation network combines semantic and spatial information to output robot actions, thus forming a visual-language-action collaborative reasoning system. During the training of the visual language backbone network, depth map prediction is used as an intrinsic generation objective, i.e., a depth map generation task is introduced, providing structured spatial supervision for the visual language backbone network. This allows the visual language backbone network to learn not only the semantic alignment objective but also the real geometric structure of the scene. In addition to learning "what the scene is," the visual language backbone network also learns "how the scene is organized in space," improving its spatial perception of the scene and thus enhancing the accuracy of subsequent robot action generation. Furthermore, this invention does not require input of point clouds, depth maps, or other additional 3D sensor data during inference; instead, it directly learns spatial structure information from ordinary 2D visual input, thus reducing system complexity and application costs. To address the inconsistency in representation space between the visual language backbone network and the generative deep network, a three-stage progressive training scheme is proposed to improve training stability and final performance. 
Moreover, the visual language backbone network trained by this invention not only enhances spatial understanding but also transfers this capability to the robot action generation stage, achieving unified optimization of perception, reasoning, and execution in embodied intelligence tasks.

[0093] Experimental results demonstrate that the technical solution of this invention achieves superior performance in both spatially relevant benchmark tests and real-world robotic tasks, reaching a state-of-the-art level. This invention achieves the highest overall score on the challenging spatially relevant benchmarks CV-Bench, VSI-Bench, MMSI-Bench, and EmbSpatial, and shows a significant improvement over its initial visual language backbone network. For example, on VSI-Bench, the score of the 2B model improves from 50.4 to 62.8, and the score of the 8B model improves from 57.9 to 70.6. On benchmarks requiring fine-grained spatial localization, such as RefSpatial, Where2Place, and RoboSpatial, the performance of this invention is 10% higher than that of the strong proprietary baseline model Gemini-3-Pro. In robotic manipulation tasks, the vision-language-action model extended based on this invention achieves an average success rate of 96.1% on the LIBERO benchmark and an average success rate of 43% in real-world testing, significantly better than the 28.7% of existing methods. Therefore, the technical solution of the present invention can effectively improve the robot's spatial understanding, task reasoning and action execution performance in real environments.

[0094] The robot motion generation device based on generative supervision enhancement provided by the present invention is described below. The device described below and the robot motion generation method based on generative supervision enhancement described above may be referred to in correspondence with each other.

[0095] Figure 4 is a schematic diagram of the robot motion generation device based on generative supervision enhancement provided by the present invention. As shown in Figure 4, the device may include: an acquisition module 410, used to acquire visual images and task instruction text for the target scene; a joint feature generation module 420, used to input the visual images and task instruction text into the visual language backbone network for feature extraction processing to determine the joint features corresponding to the visual images and task instruction text, where the joint features are used to characterize the spatial and semantic features of the target scene, the visual language backbone network is obtained by generative supervised enhancement training based on the first sample set, the first sample set includes multiple sets of first sample data and ground truth depth maps corresponding to each set of first sample data, and each set of first sample data includes sample visual images and sample task instruction text for the sample scene; and an action generation module 430, used to input the joint features, visual images and task instruction text into the action generation network for action generation processing to determine the action sequence that the robot needs to perform in the target scene, where the action sequence includes at least one action.
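The cooperation of modules 410, 420 and 430 can be sketched as a small pipeline. The class name, method names, and the stand-in callables are hypothetical and serve only to show the data flow, in particular that the action generation step receives the joint features together with the visual image and the task instruction text.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]
Features = List[float]
Action = List[float]


class RobotActionDevice:
    """Illustrative wiring of the three modules (names are assumptions)."""

    def __init__(
        self,
        source: Callable[[], Tuple[Image, str]],
        backbone: Callable[[Image, str], Features],
        action_network: Callable[[Features, Image, str], List[Action]],
    ):
        self._source = source
        self._backbone = backbone
        self._action_network = action_network

    def acquire(self) -> Tuple[Image, str]:
        # Role of acquisition module 410.
        return self._source()

    def joint_features(self, image: Image, text: str) -> Features:
        # Role of joint feature generation module 420.
        return self._backbone(image, text)

    def generate_actions(self, features: Features, image: Image, text: str) -> List[Action]:
        # Role of action generation module 430.
        actions = self._action_network(features, image, text)
        if not actions:
            raise ValueError("the action sequence must include at least one action")
        return actions

    def run(self) -> List[Action]:
        image, text = self.acquire()
        features = self.joint_features(image, text)
        return self.generate_actions(features, image, text)


# Stand-in callables used only to exercise the pipeline.
device = RobotActionDevice(
    source=lambda: ([[0.0, 1.0]], "push the cup"),
    backbone=lambda img, txt: [sum(map(sum, img)), float(len(txt.split()))],
    action_network=lambda feats, img, txt: [[feats[0], feats[1]]],
)
planned_actions = device.run()
```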

[0096] In one embodiment, the apparatus further includes a first training module, which is configured to: input each first sample data into an initial visual language backbone network for feature extraction processing to determine the predicted joint features corresponding to each first sample data; input the predicted joint features corresponding to each first sample data into a connection network for feature mapping processing to determine the predicted conditional features corresponding to each first sample data; input the predicted conditional features corresponding to each first sample data into a generative deep network for depth image generation processing to determine the predicted depth map corresponding to each first sample data; wherein the dimension of the predicted conditional features corresponding to the first sample data is consistent with the dimension of the input space of the generative deep network; and perform generative supervised enhancement training on the initial visual language backbone network based on the predicted depth map corresponding to each first sample data and the corresponding ground truth depth map to obtain the visual language backbone network.
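The dimension constraint on the predicted conditional features can be sketched with a single linear projection. A linear layer is only one possible form of the connection network and is an assumption here, as are the dimensions and names used.

```python
import random


def make_connection_network(joint_dim, cond_dim, seed=0):
    """Return a projection from joint features to conditional features."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 0.02) for _ in range(joint_dim)]
               for _ in range(cond_dim)]
    bias = [0.0] * cond_dim

    def project(joint):
        if len(joint) != joint_dim:
            raise ValueError("unexpected joint feature dimension")
        return [
            sum(w * x for w, x in zip(row, joint)) + b
            for row, b in zip(weights, bias)
        ]

    return project


GENERATOR_INPUT_DIM = 4  # hypothetical input dimension of the generative deep network
connect = make_connection_network(joint_dim=6, cond_dim=GENERATOR_INPUT_DIM)
conditional = connect([0.5] * 6)
```

Whatever form the connection network takes, its output width is fixed by the generative deep network's input space rather than by the backbone.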

[0097] Optionally, the first training module is specifically used to freeze the network parameters of the initial visual language backbone network and the network parameters of the generative deep network; and to perform first-stage generative supervised enhancement training on the initial connection network based on the predicted depth map and the corresponding ground truth depth map for each first sample data, thereby obtaining the connection network.

[0098] Optionally, the first training module is specifically used to freeze the network parameters of the initial visual language backbone network; and to perform second-stage generative supervision enhancement training on the first-stage trained connection network and the initial generative deep network based on the predicted depth map and the corresponding ground truth depth map of each first sample data, so as to obtain the connection network and the generative deep network.

[0099] Optionally, the first training module is specifically used to perform three-stage generative supervision enhancement training on the two-stage trained connection network, the two-stage trained generative deep network, and the initial visual language backbone network based on the predicted depth map and the corresponding ground truth depth map corresponding to each first sample data, so as to obtain the connection network, the generative deep network, and the visual language backbone network.
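The three training stages handled by the first training module differ in which networks are frozen. The schedule can be sketched as a lookup table; the module keys and function names are illustrative.

```python
ALL_MODULES = frozenset({"backbone", "connection", "depth_generator"})

# Which networks receive parameter updates in each stage.
TRAINABLE_BY_STAGE = {
    1: frozenset({"connection"}),
    2: frozenset({"connection", "depth_generator"}),
    3: ALL_MODULES,
}


def frozen_modules(stage):
    """Return the set of networks whose parameters stay fixed in a stage."""
    if stage not in TRAINABLE_BY_STAGE:
        raise ValueError("stage must be 1, 2 or 3")
    return set(ALL_MODULES - TRAINABLE_BY_STAGE[stage])


def trainable_flags(stage):
    """Map each network name to True if it is updated in the given stage."""
    trainable = TRAINABLE_BY_STAGE[stage]
    return {name: name in trainable for name in sorted(ALL_MODULES)}
```

In a deep learning framework, flags such as those returned by `trainable_flags` would typically be applied by enabling or disabling gradient computation for each network's parameters.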

[0100] Optionally, the sample task instruction text of the first sample data includes sample answers. The first training module is specifically used to decode each predicted joint feature to determine the predicted answer corresponding to each first sample data; calculate the language task loss based on the predicted answer and the corresponding sample answer for each first sample data; calculate the depth reconstruction loss based on the predicted depth map and the corresponding ground truth depth map for each first sample data; and perform three-stage generative supervised reinforcement training on the two-stage trained connection network, the two-stage trained generative deep network, and the initial visual language backbone network based on the language task loss and the depth reconstruction loss to obtain the connection network, the generative deep network, and the visual language backbone network.
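The combination of language task loss and depth reconstruction loss can be sketched as follows. The cross-entropy form of the language loss, the mean squared error form of the depth loss, the weighted sum, and all names are assumptions made for illustration.

```python
import math


def language_task_loss(token_probs, answer_ids):
    """Mean negative log-likelihood of the sample answer tokens."""
    nll = [-math.log(probs[idx]) for probs, idx in zip(token_probs, answer_ids)]
    return sum(nll) / len(nll)


def depth_reconstruction_loss(pred_depth, true_depth):
    """Mean squared error over all pixels of two equally sized depth maps."""
    pred = [v for row in pred_depth for v in row]
    true = [v for row in true_depth for v in row]
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)


def stage_three_loss(token_probs, answer_ids, pred_depth, true_depth, depth_weight=1.0):
    """Weighted sum of the two losses (the weighting scheme is an assumption)."""
    return (
        language_task_loss(token_probs, answer_ids)
        + depth_weight * depth_reconstruction_loss(pred_depth, true_depth)
    )


# Toy inputs: two answer tokens over a two-word vocabulary, a 1x2 depth map.
loss = stage_three_loss(
    token_probs=[[0.5, 0.5], [0.25, 0.75]],
    answer_ids=[0, 1],
    pred_depth=[[1.0, 2.0]],
    true_depth=[[1.0, 4.0]],
    depth_weight=0.5,
)
```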

[0101] Optionally, the ground truth depth map corresponding to each first sample data is a ground truth depth map with added standard noise. The first training module is specifically used to determine the predicted noise corresponding to each predicted depth map based on that predicted depth map; and to calculate the depth reconstruction loss based on each predicted noise and the corresponding standard noise.
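A noise-based depth reconstruction loss of this kind can be sketched as follows. The sketch assumes a DDPM-style noising rule, in which the noisy depth map is a weighted mix of the clean depth map and Gaussian noise controlled by a hypothetical `alpha_bar` parameter, and it recovers the predicted noise by inverting that rule; the text does not state which noising scheme is actually used.

```python
import math
import random


def add_standard_noise(depth, noise, alpha_bar):
    """Mix a clean depth map with standard noise (assumed DDPM-style rule)."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * d + b * e for d, e in zip(depth, noise)]


def noise_from_prediction(noisy_depth, predicted_depth, alpha_bar):
    """Noise implied by a predicted depth map, obtained by inverting the mix."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - a * d) / b for x, d in zip(noisy_depth, predicted_depth)]


def noise_loss(predicted_noise, standard_noise):
    """Mean squared error between predicted noise and standard noise."""
    return sum(
        (p - e) ** 2 for p, e in zip(predicted_noise, standard_noise)
    ) / len(standard_noise)


rng = random.Random(0)
true_depth = [1.0, 2.0, 3.0]  # flattened toy depth map
standard_noise = [rng.gauss(0.0, 1.0) for _ in true_depth]
noisy = add_standard_noise(true_depth, standard_noise, alpha_bar=0.6)

# A perfect depth prediction implies exactly the injected noise.
perfect = noise_loss(noise_from_prediction(noisy, true_depth, 0.6), standard_noise)
# A depth error in one pixel shows up as a noise error.
biased = noise_loss(noise_from_prediction(noisy, [1.5, 2.0, 3.0], 0.6), standard_noise)
```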

[0102] Optionally, the aforementioned first sample data includes at least one of the following types of first sample data: sample visual images and sample task instruction text for the sample scene, where the sample task instruction text includes questions and answers regarding the spatial location and / or referential relationships of objects in the sample visual images; sample visual images and sample task instruction text for the sample scene, where the sample task instruction text includes questions and answers regarding the geometric constraints and / or physical relationships of objects in the sample scene; and sample visual images and sample task instruction text for the sample scene, where the sample task instruction text includes questions and answers regarding spatiotemporal planning for the robot operating objects in the sample scene.

[0103] In one embodiment, the apparatus further includes a second training module, which is configured to: acquire a second sample set, the second sample set including multiple sets of second sample data and ground truth action sequences corresponding to each set of second sample data, each set of second sample data including a sample visual image and sample task instruction text for a sample scene; input each set of second sample data into the three-stage trained visual language backbone network for feature extraction processing to determine the predicted joint features corresponding to each set of second sample data; input the predicted joint features corresponding to each set of second sample data into an initial action generation network for action generation processing to determine the predicted action sequence that the robot needs to execute in the sample scene of the second sample data; and perform generative supervised enhancement training on the initial action generation network based on the predicted action sequence corresponding to each set of second sample data and the corresponding ground truth action sequence, to obtain the action generation network.

[0104] Optionally, the second training module is specifically used to input the predicted joint features corresponding to each second sample data into the three-stage trained connection network for feature mapping processing to determine the predicted conditional features corresponding to each second sample data; input the predicted conditional features corresponding to each second sample data into the three-stage trained generative deep network for depth image generation processing to determine the predicted depth map corresponding to each second sample data; and perform generative supervised enhancement training on the initial action generation network based on the predicted depth map and the corresponding ground truth depth map, as well as the predicted action sequence and the corresponding ground truth action sequence corresponding to each second sample data, to obtain the action generation network.

[0105] In one embodiment, the visual language backbone network is an autoregressive network, and the action generation network is a diffusion network.

[0106] Figure 5 is a schematic diagram of the physical structure of the electronic device provided by the present invention. As shown in Figure 5, the electronic device may include: a processor 510, a communications interface 520, a memory 530, and a communications bus 540, wherein the processor 510, the communications interface 520, and the memory 530 communicate with each other through the communications bus 540. The processor 510 can call logic instructions in the memory 530 to execute a robot motion generation method based on generative supervision enhancement. This method includes: acquiring a visual image and task instruction text for a target scene; inputting the visual image and task instruction text into a visual language backbone network for feature extraction processing to determine joint features corresponding to the visual image and task instruction text; the joint features are used to characterize the spatial and semantic features of the target scene; the visual language backbone network is trained using generative supervision enhancement based on a first sample set, the first sample set including multiple sets of first sample data and ground truth depth maps corresponding to each set of first sample data, each set of first sample data including a sample visual image and sample task instruction text for the sample scene; inputting the joint features, visual image, and task instruction text into an action generation network for action generation processing to determine the action sequence that the robot needs to execute in the target scene; the action sequence includes at least one action.

[0107] Furthermore, the logical instructions in the aforementioned memory 530 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0108] In another aspect, the present invention also provides a computer program product, which includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the robot motion generation method based on generative supervision enhancement provided above. The method includes: acquiring a visual image and task instruction text for a target scene; inputting the visual image and task instruction text into a visual language backbone network for feature extraction processing to determine the joint features corresponding to the visual image and task instruction text; the joint features are used to characterize the spatial and semantic features of the target scene, and the visual language backbone network is obtained by generative supervision enhancement training based on a first sample set, the first sample set including multiple sets of first sample data and ground truth depth maps corresponding to each set of first sample data, each set of first sample data including a sample visual image and sample task instruction text for the sample scene; inputting the joint features, visual image, and task instruction text into an action generation network for action generation processing to determine the action sequence that the robot needs to perform in the target scene; the action sequence includes at least one action.

[0109] In another aspect, the present invention also provides a non-transitory computer-readable storage medium storing a computer program thereon. When executed by a processor, the computer program implements the robot motion generation method based on generative supervision enhancement provided above. The method includes: acquiring a visual image and task instruction text for a target scene; inputting the visual image and task instruction text into a visual language backbone network for feature extraction processing to determine joint features corresponding to the visual image and task instruction text; the joint features are used to characterize the spatial and semantic features of the target scene, and the visual language backbone network is obtained by generative supervision enhancement training based on a first sample set, the first sample set including multiple sets of first sample data and ground truth depth maps corresponding to each set of first sample data, each set of first sample data including a sample visual image and sample task instruction text for the sample scene; inputting the joint features, visual image, and task instruction text into an action generation network for action generation processing to determine the action sequence to be performed by the robot in the target scene; the action sequence includes at least one action.

[0110] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0111] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0112] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A robot motion generation method based on generative supervision enhancement, characterized in that the method includes: acquiring a visual image and task instruction text for a target scene; inputting the visual image and the task instruction text into a visual language backbone network for feature extraction processing to determine the joint features corresponding to the visual image and the task instruction text, wherein the joint features are used to characterize the spatial and semantic features of the target scene, the visual language backbone network is obtained by generative supervised enhancement training based on a first sample set, the first sample set includes multiple sets of first sample data and ground truth depth maps corresponding to each set of first sample data, and each set of first sample data includes a sample visual image and a sample task instruction text for the sample scene; and inputting the joint features, the visual image, and the task instruction text into an action generation network for action generation processing to determine the action sequence that the robot needs to perform in the target scene, wherein the action sequence includes at least one action.

2. The robot motion generation method based on generative supervision enhancement according to claim 1, characterized in that the training process of the visual language backbone network includes: inputting each of the first sample data into an initial visual language backbone network for feature extraction processing to determine the predicted joint features corresponding to each of the first sample data; inputting the predicted joint features corresponding to each of the first sample data into a connection network for feature mapping processing to determine the predicted conditional features corresponding to each of the first sample data; inputting the predicted conditional features corresponding to each of the first sample data into a generative deep network for depth image generation processing to determine the predicted depth map corresponding to each of the first sample data, wherein the dimension of the predicted conditional features corresponding to the first sample data is consistent with the dimension of the input space of the generative deep network; and performing generative supervised enhancement training on the initial visual language backbone network based on the predicted depth map corresponding to each of the first sample data and the corresponding ground truth depth map, to obtain the visual language backbone network.

3. The robot motion generation method based on generative supervision enhancement according to claim 2, characterized in that, Before obtaining the visual language backbone network by performing generative supervised augmentation training on the initial visual language backbone network based on the predicted depth map and the corresponding ground truth depth map corresponding to each of the first sample data, the method further includes: Freeze the network parameters of the initial visual language backbone network and the network parameters of the generative deep network; Based on the predicted depth map and the corresponding ground truth depth map corresponding to each of the first sample data, the initial connection network is subjected to one-stage generative supervised training to obtain the connection network.

4. The robot motion generation method based on generative supervision enhancement according to claim 2, characterized in that, Before obtaining the visual language backbone network by performing generative supervised augmentation training on the initial visual language backbone network based on the predicted depth map and the corresponding ground truth depth map corresponding to each of the first sample data, the method further includes: Freeze the network parameters of the initial visual language backbone network; Based on the predicted depth map and the corresponding ground truth depth map corresponding to each of the first sample data, a second-stage generative supervised training is performed on the first-stage trained connection network and the initial generative deep network to obtain the connection network and the generative deep network.

5. The robot motion generation method based on generative supervision enhancement according to claim 2, characterized in that, The step of performing generative supervised augmentation training on the initial visual language backbone network based on the predicted depth map and the corresponding ground truth depth map corresponding to each of the first sample data to obtain the visual language backbone network includes: Based on the predicted depth map and the corresponding ground truth depth map corresponding to each of the first sample data, a three-stage generative supervised enhancement training is performed on the two-stage trained connection network, the two-stage trained generative deep network, and the initial visual language backbone network to obtain the connection network, the generative deep network, and the visual language backbone network.

6. The robot motion generation method based on generative supervision enhancement according to claim 5, characterized in that the sample task instruction text of the first sample data includes a sample answer, and the step of performing third-stage generative supervised enhancement training on the second-stage-trained connection network, the second-stage-trained generative depth network, and the initial visual language backbone network based on the predicted depth map and the corresponding ground truth depth map of each of the first sample data, to obtain the connection network, the generative depth network, and the visual language backbone network, includes: decoding each of the predicted joint features to determine the predicted answer corresponding to each of the first sample data; calculating the language task loss based on the predicted answer and the corresponding sample answer of each of the first sample data; calculating the depth reconstruction loss based on the predicted depth map and the corresponding ground truth depth map of each of the first sample data; and performing, based on the language task loss and the depth reconstruction loss, third-stage generative supervised enhancement training on the second-stage-trained connection network, the second-stage-trained generative depth network, and the initial visual language backbone network to obtain the connection network, the generative depth network, and the visual language backbone network.
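
Claim 6 combines a language task loss with a depth reconstruction loss but does not fix the form of either term or how they are weighted. The sketch below assumes, purely for illustration, a mean negative log-likelihood for the answer tokens, a mean absolute error for the depth map, and a weighted sum with a hypothetical `depth_weight` parameter.

```python
import math


def language_task_loss(token_probs):
    """Mean negative log-likelihood of the sample answer tokens.

    token_probs[i] is the probability the model assigned to the i-th
    token of the sample answer (assumed form, not given in the claim).
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def depth_reconstruction_loss(pred_depth, true_depth):
    """Mean absolute error between flattened depth maps (assumed form)."""
    if len(pred_depth) != len(true_depth):
        raise ValueError("depth maps must have the same number of pixels")
    return sum(abs(p - t) for p, t in zip(pred_depth, true_depth)) / len(pred_depth)


def stage3_loss(token_probs, pred_depth, true_depth, depth_weight=1.0):
    """Weighted sum of the two terms; depth_weight is a hypothetical knob."""
    return (language_task_loss(token_probs)
            + depth_weight * depth_reconstruction_loss(pred_depth, true_depth))
```

For example, `stage3_loss([0.9, 0.8], [1.0, 2.0], [1.5, 1.0])` adds the answer term for two tokens to a depth error of 0.75.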

7. The robot motion generation method based on generative supervision enhancement according to any one of claims 1 to 6, characterized in that the first sample data includes at least one of the following categories: sample visual images and sample task instruction text for the sample scene, wherein the sample task instruction text includes questions and answers regarding the spatial location and/or referential relationships of objects in the sample visual images of the sample scene; sample visual images and sample task instruction text for the sample scene, wherein the sample task instruction text includes questions and answers regarding the geometric constraints and/or physical relationships of objects in the sample scene; and sample visual images and sample task instruction text for the sample scene, wherein the sample task instruction text includes questions and answers regarding spatiotemporal planning when the robot manipulates objects in the sample scene.

8. The robot motion generation method based on generative supervision enhancement according to any one of claims 1 to 6, characterized in that the training method for the action generation network includes: obtaining a second sample set, wherein the second sample set includes multiple sets of second sample data and the ground truth action sequence corresponding to each set of second sample data, and each set of second sample data includes sample visual images and sample task instruction text for the sample scene; inputting each of the second sample data into the third-stage-trained visual language backbone network for feature extraction processing to determine the predicted joint features corresponding to each of the second sample data; inputting the predicted joint features corresponding to each of the second sample data into the initial action generation network for action generation processing to determine the predicted action sequence that the robot needs to perform in the sample scene of the second sample data; and performing, based on the predicted action sequence and the corresponding ground truth action sequence of each of the second sample data, generative supervised training on the initial action generation network to obtain the action generation network.
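
The data flow in claim 8 (trained backbone, then action generation network, then comparison with the ground truth action sequence) can be illustrated with a toy example. Everything below is a stand-in assumed for illustration: `backbone` is a fixed function rather than a neural network, the action generator has a single scalar weight `w`, and the mean squared error and plain gradient descent are choices the claim does not specify.

```python
def backbone(image, text):
    """Stand-in for the trained visual language backbone: one joint feature."""
    return sum(image) / len(image) + 0.1 * len(text.split())


def action_net(feature, w):
    """Toy action generator: a two-step action sequence scaled by weight w."""
    return [w * feature, 2.0 * w * feature]


def action_loss(pred, truth):
    """Mean squared error between predicted and ground truth actions."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred)


def train_action_net(samples, w=0.0, lr=0.05, epochs=200):
    """Update only the action generator weight; the backbone is not changed."""
    for _ in range(epochs):
        for image, text, truth in samples:
            f = backbone(image, text)
            pred = action_net(f, w)
            # Analytic gradient of the mean squared error with respect to w.
            grad = (2.0 * (pred[0] - truth[0]) * f
                    + 2.0 * (pred[1] - truth[1]) * 2.0 * f) / 2.0
            w -= lr * grad
    return w


# One synthetic sample whose ground truth was generated with w = 0.5.
feature = backbone([1.0, 1.0], "pick up cup")
samples = [([1.0, 1.0], "pick up cup", [0.5 * feature, 1.0 * feature])]
trained_w = train_action_net(samples)
```

After training, `trained_w` should be close to the value 0.5 used to generate the synthetic ground truth, and `action_loss` on that sample should be close to zero.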

9. The robot motion generation method based on generative supervision enhancement according to claim 8, characterized in that the step of performing generative supervised enhancement training on the initial action generation network based on the predicted action sequence and the corresponding ground truth action sequence of each of the second sample data to obtain the action generation network includes: inputting the predicted joint features corresponding to each of the second sample data into the third-stage-trained connection network for feature mapping processing to determine the predicted condition features corresponding to each of the second sample data; inputting the predicted condition features corresponding to each of the second sample data into the third-stage-trained generative depth network for depth image generation processing to determine the predicted depth map corresponding to each of the second sample data; and performing, based on the predicted depth map and the corresponding ground truth depth map of each of the second sample data, as well as the predicted action sequence and the corresponding ground truth action sequence of each of the second sample data, generative supervised enhancement training on the initial action generation network to obtain the action generation network.
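
The forward path in claim 9 (joint features to condition features to predicted depth map, alongside the predicted action sequence) can be sketched as a single loss function. This is an assumption-laden illustration: the `Networks` container, the use of mean absolute error for both terms, and the `depth_weight` parameter are hypothetical, since the claim only states that both the depth maps and the action sequences are used.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class Networks:
    connection: Callable[[Vector], Vector]       # joint features -> condition features
    depth_generator: Callable[[Vector], Vector]  # condition features -> flattened depth map
    action_net: Callable[[Vector], Vector]       # joint features -> flattened action sequence


def mean_abs_error(pred: Vector, truth: Vector) -> float:
    if len(pred) != len(truth):
        raise ValueError("prediction and ground truth must have the same length")
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)


def joint_loss(nets: Networks, joint_features: Vector, true_depth: Vector,
               true_actions: Vector, depth_weight: float = 0.5) -> float:
    """Action error plus weighted depth error (assumed combination)."""
    condition = nets.connection(joint_features)
    pred_depth = nets.depth_generator(condition)
    pred_actions = nets.action_net(joint_features)
    return (mean_abs_error(pred_actions, true_actions)
            + depth_weight * mean_abs_error(pred_depth, true_depth))


# Stand-in networks built from simple arithmetic, for demonstration only.
demo_nets = Networks(
    connection=lambda f: [2.0 * x for x in f],
    depth_generator=lambda c: [x + 1.0 for x in c],
    action_net=lambda f: list(f),
)
demo_loss = joint_loss(demo_nets, [1.0, 2.0], [3.0, 4.0], [1.0, 3.0])
```

With these stand-ins, the depth error is 0.5 and the action error is 0.5, so `demo_loss` is 0.75 at the default weight.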

10. A robot motion generation device based on generative supervision enhancement, characterized in that it includes: an acquisition module, used to acquire a visual image and task instruction text for the target scene; a joint feature generation module, used to input the visual image and the task instruction text into a visual language backbone network for feature extraction processing and to determine the joint features corresponding to the visual image and the task instruction text, wherein the joint features are used to characterize the spatial and semantic features of the target scene, the visual language backbone network is obtained by generative supervised enhancement training based on a first sample set, the first sample set includes multiple sets of first sample data and the ground truth depth map corresponding to each set of first sample data, and each set of first sample data includes a sample visual image and sample task instruction text for the sample scene; and an action generation module, used to input the joint features, the visual image, and the task instruction text into the action generation network for action generation processing and to determine the action sequence that the robot needs to perform in the target scene, wherein the action sequence includes at least one action.
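
The three modules of claim 10 can be pictured as a small pipeline object. The sketch below is illustrative only: the class name `RobotActionDevice`, its field names, and the stand-in callables are hypothetical, and real modules would wrap a camera interface and trained networks.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[float]


@dataclass
class RobotActionDevice:
    # Acquisition module: returns (visual image, task instruction text).
    acquire: Callable[[], Tuple[Image, str]]
    # Joint feature generation module: wraps the visual language backbone.
    joint_features: Callable[[Image, str], List[float]]
    # Action generation module: takes joint features, image, and text.
    generate_actions: Callable[[List[float], Image, str], List[str]]

    def run(self) -> List[str]:
        image, instruction = self.acquire()
        features = self.joint_features(image, instruction)
        actions = self.generate_actions(features, image, instruction)
        if not actions:
            raise ValueError("the action sequence must include at least one action")
        return actions


# Stand-in modules for demonstration only.
device = RobotActionDevice(
    acquire=lambda: ([0.2, 0.4], "pick up the cup"),
    joint_features=lambda img, txt: [sum(img), float(len(txt.split()))],
    generate_actions=lambda feats, img, txt: ["reach", "grasp", "lift"],
)
demo_actions = device.run()
```

Note that the action generation module receives the image and instruction text in addition to the joint features, mirroring the inputs listed in the claim.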