Robot operation control method and apparatus based on multi-view four-dimensional world model
By adopting a robot operation control method based on a multi-view four-dimensional world model, the problems of geometric appearance constraint conflict and depth drift in the existing technology are solved, and more accurate robot operation control is achieved, reducing deployment costs and improving operation accuracy and robustness.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING HUMANOID ROBOTICS INNOVATION CENTER CO LTD
- Filing Date
- 2026-05-28
- Publication Date
- 2026-06-30
AI Technical Summary
In existing robot operation and control methods, pure image prediction is prone to geometric appearance constraint conflicts, single-view RGBD prediction has geometric appearance defects and depth drift problems, while three-dimensional representation-based methods have weak appearance and semantic information and insufficient detail fidelity, resulting in unsmooth generated actions that cannot meet the requirements of robot fine operation.
A four-dimensional world model based on multiple perspectives is adopted. By acquiring single-frame observation images and text operation instructions from a preset reference perspective in the target scene, a four-dimensional world model of the target scene is constructed. Multi-view geometric information and depth data are fused to generate a consistent sequence of appearance and depth features from multiple perspectives for robot operation control.
It improves the accuracy and robustness of robot operation control, reduces application deployment costs, enhances the precision and generalization ability of operation control, and avoids dependence on multi-view hardware layout or multi-frame time sequence data acquisition.
Figure CN122299673A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of robot control technology, and more specifically, to a robot operation control method and device based on a multi-view four-dimensional world model.
Background Technology
[0002] With the rapid development of autonomous robot operation and embodied intelligence technologies, the "imagine first, then execute" paradigm based on a world model is gradually becoming the mainstream approach for precise robot manipulation. This approach takes a single color-depth (RGB-D) observation image as input, uses a prediction model to predict the scene changes over the next several frames, and then infers executable actions from the prediction results. It can complete fine operation tasks such as grasping, placing, and switching in real environments with complex occlusion and limited viewpoints, and is in wide demand in home services, industrial automation, flexible operation, and other fields.
[0003] Currently, prediction models fall into two groups. Some are based on video generation in pure image space, relying on existing video generation priors to predict appearance without fully modeling the three-dimensional geometric structure. Others are oriented towards three-dimensional or four-dimensional world models, either extending video generation to RGB-D data or modeling dynamics directly on representations such as point clouds and particles. Inverse dynamics models, pose estimation, or additional motion prediction heads are then used to convert the generated future scene into robot control actions.
[0004] However, in existing technologies, pure image prediction is prone to geometric appearance constraint conflicts, and single-view RGB-D prediction suffers from geometric incompleteness and depth drift; methods based on 3D representation suffer from weak appearance and semantic information and insufficient detail fidelity. In action reasoning, the temporal structure at the trajectory level is ignored, resulting in unsmooth generated actions. This leads to a significant discrepancy between the predicted results and the executable actions, failing to meet the requirements for precise robot manipulation.
Summary of the Invention
[0005] The purpose of this application is to address the shortcomings of the prior art by providing a robot operation control method and device based on a multi-view four-dimensional world model, so as to improve the accuracy of robot control and meet the robot's control requirements.
[0006] To achieve the above objectives, the technical solutions adopted in the embodiments of this application are as follows: In a first aspect, one embodiment of this application provides a robot operation control method based on a multi-view four-dimensional world model, the method comprising: Acquire a single-frame observation image from a preset reference viewpoint in the target scene and text operation instructions for the target operation task; the single-frame observation image includes: a single-frame appearance image and a single-frame depth image; Based on the single-frame observation image from the preset reference viewpoint, the camera parameters from the preset reference viewpoint, and the camera parameters from the target viewpoint, the appearance feature sequence and geometric feature sequence of the target scene are obtained. Using a preset multi-view four-dimensional world model, a target appearance feature sequence and a target depth feature sequence are generated based on the appearance feature sequence and the geometric feature sequence. Based on the target appearance feature sequence and the target depth feature sequence, generate a target appearance image sequence and a target depth image sequence; A four-dimensional world model of the target scene is generated based on the target appearance image sequence and the target depth image sequence. The robot is operated and controlled according to the four-dimensional world model.
[0007] Optionally, the step of using a preset multi-view four-dimensional world model to generate a target appearance feature sequence and a target depth feature sequence based on the appearance feature sequence and the geometric feature sequence includes: Using the cross-modal fusion module in the preset multi-view four-dimensional world model, local cross-modal feature fusion is performed on the appearance feature sequence and the geometric feature sequence to generate a fused appearance feature sequence and a fused geometric feature sequence. The camera poses from each viewpoint are obtained, and the fused appearance feature sequence and the fused geometric feature sequence are embedded based on the camera poses from each viewpoint to obtain the embedded appearance feature sequence and the embedded geometric feature sequence. Using the cross-view attention module in the preset multi-view four-dimensional world model, a target appearance feature sequence and a target depth feature sequence are generated based on the embedded appearance feature sequence, the embedded geometric feature sequence, the candidate action sequence of the target operation task, and the text operation instructions.
[0008] Optionally, obtaining the appearance feature sequence and geometric feature sequence of the target scene based on the single-frame observation image of the preset reference viewpoint, the camera parameters of the preset reference viewpoint, and the camera parameters of the target viewpoint includes: Based on the single-frame appearance image of the preset reference viewpoint, the camera parameters of the preset reference viewpoint, and the camera parameters of the target viewpoint, a region appearance feature sequence of the target scene at the preset reference viewpoint and a region appearance feature sequence at the target viewpoint are generated. Based on the single-frame depth image of the preset reference viewpoint, the camera parameters of the preset reference viewpoint, and the camera parameters of the target viewpoint, a region geometric feature sequence of the target scene at the preset reference viewpoint and a region geometric feature sequence at the target viewpoint are generated. The appearance feature sequence is obtained by splicing the region appearance feature sequence of the preset reference viewpoint and the region appearance feature sequence of the target viewpoint. The geometric feature sequence is obtained by splicing the region geometric feature sequence of the preset reference viewpoint and the region geometric feature sequence of the target viewpoint.
[0009] Optionally, the step of using a cross-modal fusion module in a preset multi-view four-dimensional world model to perform local cross-modal feature fusion on the appearance feature sequence and the geometric feature sequence to generate a fused appearance feature sequence and a fused geometric feature sequence includes: Add appearance modality identifiers and geometric modality identifiers to the appearance feature sequence and the geometric feature sequence, respectively; Using the cross-modal fusion module, based on the appearance feature sequence with added appearance modality identifier and the geometric feature sequence with added geometric modality identifier, the first cross-attention parameter from appearance modality to geometric modality and the second cross-attention parameter from geometry to appearance modality are calculated respectively. Using the cross-modal fusion module, the fused appearance feature sequence is generated based on the first cross-attention parameter and the appearance feature sequence, and the fused geometric feature sequence is generated based on the second cross-attention parameter and the geometric feature sequence.
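The bidirectional fusion described in this paragraph can be illustrated with a minimal Python sketch. It is not the implementation of this application: the function names (`cross_attention`, `add_modality_id`, `fuse`), the use of fixed vectors as modality identifiers, the absence of learned query/key/value projections, the residual addition, and the use of global rather than local attention are all simplifying assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Scaled dot-product attention; keys and values are the same tokens.

    Assumption: no learned projections, which a real module would have.
    """
    dim = len(queries[0])
    contexts = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        contexts.append([sum(w * kv[d] for w, kv in zip(weights, keys_values))
                         for d in range(dim)])
    return contexts


def add_modality_id(tokens, modality_embedding):
    """Tag every token with its modality by adding a fixed embedding."""
    return [[t + e for t, e in zip(tok, modality_embedding)] for tok in tokens]


def fuse(appearance, geometry, app_id, geo_id):
    """Return (fused_appearance, fused_geometry) via two cross-attentions."""
    app = add_modality_id(appearance, app_id)
    geo = add_modality_id(geometry, geo_id)
    app_to_geo = cross_attention(app, geo)  # appearance queries geometry
    geo_to_app = cross_attention(geo, app)  # geometry queries appearance
    # Assumption: each result is added residually to its own original sequence.
    fused_app = [[a + c for a, c in zip(tok, ctx)]
                 for tok, ctx in zip(appearance, app_to_geo)]
    fused_geo = [[g + c for g, c in zip(tok, ctx)]
                 for tok, ctx in zip(geometry, geo_to_app)]
    return fused_app, fused_geo
```

In this sketch each modality keeps its own sequence length and feature dimension after fusion, so the two fused sequences can be passed on separately to the camera pose embedding step.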
[0010] Optionally, the cross-view attention module includes: a deformable cross-view attention unit, a self-attention unit, and a cross-attention unit; the step of using the cross-view attention module in the preset multi-view four-dimensional world model to generate a target appearance feature sequence and a target depth feature sequence based on the embedded appearance feature sequence, the embedded geometric feature sequence, the action sequence of the target operation task, and the text operation instructions includes: Using the deformable cross-view attention unit, cross-view geometric offset perception is performed on the embedded appearance feature sequence and the embedded geometric feature sequence to obtain the offset appearance feature sequence and the offset geometric feature sequence. The self-attention unit is used to perform attention processing on the offset appearance feature sequence and the offset geometric feature sequence to obtain a self-attention appearance feature sequence and a self-attention geometric feature sequence. Using the cross-attention unit, the target appearance feature sequence and the target depth feature sequence are generated based on the action sequence, the text operation instructions, the self-attention appearance feature sequence, and the self-attention geometric feature sequence.
[0011] Optionally, the step of using the cross-attention unit to generate the target appearance feature sequence and the target depth feature sequence based on the action sequence, the text operation instructions, the self-attention appearance feature sequence, and the self-attention geometric feature sequence includes: The action sequence is encoded using a preset action trajectory encoder to obtain action latent features; Using the cross-attention unit, the target appearance feature sequence and the target depth feature sequence are generated based on the action latent features, the text operation instructions, the self-attention appearance feature sequence, and the self-attention geometric feature sequence.
[0012] Optionally, the step of controlling the robot based on the four-dimensional world model includes: Based on the four-dimensional world model and the text operation instructions, dynamic scene features are generated; The candidate action sequence is processed using the dynamic scene features to obtain a prior action sequence; The prior action sequence is corrected to obtain an executable action sequence; The robot is operated and controlled according to the executable action sequence.
[0013] Optionally, the step of correcting the prior action sequence to obtain an executable action sequence includes: Based on at least two consecutive 3D point sets in the dynamic scene features and the prior action sequence, a correction amount is generated using the residual inverse dynamics module. The prior action sequence is corrected according to the correction amount to obtain an executable action sequence.
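The residual structure of this correction step (executable action = prior action + correction amount) can be sketched as follows. The text does not specify the internals of the residual inverse dynamics module, so `residual_correction` below is a purely hypothetical stand-in: it assumes each action is a 3-D translation, compares it with the centroid displacement between two consecutive 3D point sets, and scales the difference by an assumed `gain`. A real module would presumably be learned.

```python
def centroid(points):
    """Mean position of a non-empty list of 3-D points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(3)]


def residual_correction(points_t, points_t1, prior_action, gain=0.5):
    """Hypothetical correction: scaled gap between scene motion and action.

    points_t, points_t1: consecutive 3-D point sets from the dynamic scene.
    prior_action: assumed to be a 3-D translation [dx, dy, dz].
    """
    start, end = centroid(points_t), centroid(points_t1)
    observed = [b - a for a, b in zip(start, end)]
    return [gain * (o - a) for o, a in zip(observed, prior_action)]


def correct_sequence(point_sets, prior_actions, gain=0.5):
    """Add a per-step correction to every prior action.

    Expects len(point_sets) == len(prior_actions) + 1.
    """
    executable = []
    for t, action in enumerate(prior_actions):
        delta = residual_correction(point_sets[t], point_sets[t + 1],
                                    action, gain)
        executable.append([a + d for a, d in zip(action, delta)])
    return executable
```

The point of the sketch is only the data flow: the prior action sequence is kept as the baseline and the module contributes a small additive term, rather than predicting the action from scratch.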
[0014] In a second aspect, another embodiment of this application provides a robot operation control device based on a multi-view four-dimensional world model, the device comprising: A first acquisition module, used to acquire a single-frame observation image from a preset reference viewpoint in the target scene and text operation instructions for the target operation task; the single-frame observation image includes: a single-frame appearance image and a single-frame depth image; A second acquisition module, used to acquire the appearance feature sequence and geometric feature sequence of the target scene based on the single-frame observation image of the preset reference viewpoint, the camera parameters of the preset reference viewpoint, and the camera parameters of the target viewpoint; A first generation module, used to generate a target appearance feature sequence and a target depth feature sequence based on the appearance feature sequence and the geometric feature sequence using a preset multi-view four-dimensional world model; A second generation module, used to generate a target appearance image sequence and a target depth image sequence based on the target appearance feature sequence and the target depth feature sequence; A third generation module, used to generate a four-dimensional world model of the target scene based on the target appearance image sequence and the target depth image sequence; A control module, used to operate and control the robot according to the four-dimensional world model.
[0015] In a third aspect, another embodiment of this application provides an electronic device, including: a processor, a memory, and a bus, wherein the memory stores machine-readable instructions executable by the processor, and when the electronic device is running, the processor communicates with the memory via the bus, and the processor executes the machine-readable instructions to perform the steps of the robot operation control method based on a multi-view four-dimensional world model as described in any implementation of the first aspect above.
[0016] In a fourth aspect, another embodiment of this application provides a storage medium storing a computer program, which, when run by a processor, executes the steps of the robot operation control method based on a multi-view four-dimensional world model as described in any implementation of the first aspect above.
[0017] The beneficial effects of this application are: This application provides a robot operation control method and device based on a multi-view four-dimensional world model. A single-frame observation image from a preset reference viewpoint in the target scene and text operation instructions for the target operation task are acquired. Based on the single-frame observation image from the preset reference viewpoint, the camera parameters of the preset reference viewpoint, and the camera parameters of the target viewpoint, the appearance feature sequence and geometric feature sequence of the target scene are acquired. Using the preset multi-view four-dimensional world model, a target appearance feature sequence and a target depth feature sequence are generated based on the appearance feature sequence and the geometric feature sequence. Based on the target appearance feature sequence and the target depth feature sequence, a target appearance image sequence and a target depth image sequence are generated. Based on the target appearance image sequence and the target depth image sequence, a four-dimensional world model of the target scene is generated. The robot is then operated and controlled based on the four-dimensional world model. This application requires only a single-frame observation image and text operation instructions as input. It directly generates multi-view, time-consistent appearance and depth feature sequences using the preset multi-view four-dimensional world model, thereby constructing a complete four-dimensional world model for robot control. This avoids the dependence of traditional methods on multi-view hardware layout or multi-frame time-series data acquisition, reducing application deployment costs. At the same time, since the generated four-dimensional world model integrates multi-view geometric information and depth data, it can provide the robot with more accurate spatial geometric relationships, thereby improving the accuracy and robustness of operation control.
Moreover, the entire process does not require manual design of complex feature engineering and has good generalization ability.
Attached Figure Description
To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of this application and should not be regarded as a limitation of the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.
[0018] Figure 1 is a flowchart of a robot operation control method based on a multi-view four-dimensional world model provided in an embodiment of this application;
Figure 2 is a flowchart of determining a feature sequence in a robot operation control method based on a multi-view four-dimensional world model provided in an embodiment of this application;
Figure 3 is a flowchart of determining feature sequences in another robot operation control method based on a multi-view four-dimensional world model provided in an embodiment of this application;
Figure 4 is a flowchart of determining a fused feature sequence in a robot operation control method based on a multi-view four-dimensional world model provided in an embodiment of this application;
Figure 5 is a flowchart of determining a target feature sequence in a robot operation control method based on a multi-view four-dimensional world model provided in an embodiment of this application;
Figure 6 is a flowchart of determining a self-attention feature sequence in a robot operation control method based on a multi-view four-dimensional world model provided in an embodiment of this application;
Figure 7 is a schematic diagram of the operation control process in a robot operation control method based on a multi-view four-dimensional world model provided in an embodiment of this application;
Figure 8 is a flowchart of determining an executable action sequence in a robot operation control method based on a multi-view four-dimensional world model provided in an embodiment of this application;
Figure 9 is a schematic diagram of the structure of a robot operation control device based on a multi-view four-dimensional world model provided in an embodiment of this application;
Figure 10 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0019] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. It should be understood that the accompanying drawings in this application are for illustrative and descriptive purposes only and are not intended to limit the scope of protection of this application. Furthermore, it should be understood that the schematic drawings are not drawn to scale. The flowcharts used in this application illustrate operations implemented according to some embodiments of this application. It should be understood that the operations in the flowcharts may not be implemented in sequence, and steps without logical contextual relationships may be reversed or implemented simultaneously. In addition, those skilled in the art, guided by the content of this application, may add one or more other operations to the flowcharts, or remove one or more operations from the flowcharts.
[0020] Furthermore, the described embodiments are merely some, not all, of the embodiments of this application. The components of the embodiments of this application described and illustrated herein can typically be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of this application provided in the accompanying drawings is not intended to limit the scope of the claimed application, but merely to illustrate selected embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of this application without inventive effort are within the scope of protection of this application.
[0021] It should be noted that the term "comprising" will be used in the embodiments of this application to indicate the presence of the features declared thereafter, but does not exclude the addition of other features.
[0022] Currently, image-space-based video generative world models rely solely on single-view red, green, and blue (RGB) image observations. While they can generate visually plausible video sequences, the lack of three-dimensional geometric constraints often results in violations of physical spatial relationships, leading to significant discrepancies between predicted scenes and executable actions. Although single-view red, green, and blue-depth (RGB-D) video generation methods introduce depth modalities to supplement geometric information, they are still limited by the inherent occlusion problem of single-view perspectives. Furthermore, monocular depth estimation is susceptible to scale drift and temporal inconsistencies, resulting in insufficient reliability in complex operational scenarios.
[0023] To address this, this application provides a robot operation control method based on a multi-view four-dimensional world model. By acquiring single-frame observation images from a preset reference viewpoint in the target scene and text operation instructions for the target operation task, a four-dimensional world model of the target scene is constructed. The method in this application generates a four-dimensional dynamic scene that is consistent across multiple views and geometrically reliable from single-frame observation images and text operation instructions. This solves the problems of lack of geometric constraints in pure image generation, incomplete depth information in single-view, and lack of appearance cues in three-dimensional representation in existing methods, thereby improving the realism, reliability, and executability of robot operation scene simulation and task planning.
[0024] To clearly describe the robot operation control method based on a multi-view four-dimensional world model provided in this application, the method is described below with reference to the accompanying drawings. Figure 1 is a flowchart of a robot operation control method based on a multi-view four-dimensional world model provided in an embodiment of this application. As shown in Figure 1, the method includes:
Step 101: Obtain a single-frame observation image of the target scene from a preset reference viewpoint and the text operation instructions for the target operation task.
[0025] The target scene refers to the robot's current environment, including the object to be manipulated. This object can be any manipulable object such as a drawer, cup, or block. The robot can be a humanoid robot, bipedal robot, quadrupedal robot, animal-like robot, etc., and this embodiment does not limit this. The preset reference viewpoint can be the initial camera's observation angle and can be set at any position in the target scene; this embodiment does not limit this. A single-frame observation image is a single-frame RGB-D image, which can include a single-frame appearance image and a single-frame depth image. The single-frame appearance image is an RGB color image used to provide visual information such as color and texture for each pixel. The single-frame depth image records the distance of each pixel in the image to the camera, used to provide geometric information for each pixel. The text operation instruction is the natural language description of the target operation task, such as "open the drawer" or "put the phone on the table."
[0026] Optionally, a single-frame image is captured using an RGB-D camera at a preset reference viewpoint in the target scene to obtain a single-frame appearance image and a single-frame depth image. Text operation instructions for the target operation task are obtained using either a voice acquisition device or a text acquisition device. If a voice acquisition device is used, the captured voice is converted into text, which serves as the text operation instruction; if a text acquisition device is used, the captured text is used directly as the text operation instruction. Alternatively, multiple text operation instructions can be preset, and the corresponding text operation instruction can be matched directly according to the target operation task.
[0027] Step 102: Based on the single-frame observation image of the preset reference viewpoint, the camera parameters of the preset reference viewpoint, and the camera parameters of the target viewpoint, obtain the appearance feature sequence and geometric feature sequence of the target scene.
[0028] The camera parameters of the preset reference viewpoint are the parameters of the camera that captures the single-frame observation image, and include the intrinsic and extrinsic camera parameters of the reference viewpoint. The target viewpoints are the other viewpoints required to generate the four-dimensional world model. There can be one or more target viewpoints; this embodiment does not limit this. In other words, a single-frame observation image from the target viewpoint needs to be generated based on the single-frame observation image from the preset reference viewpoint and the camera parameters of the target viewpoint. The camera parameters of the target viewpoint likewise include the intrinsic and extrinsic camera parameters of the target viewpoint.
[0029] The appearance feature sequence is a vector sequence obtained by encoding a single-frame appearance image and contains only color and texture information. It is the set of appearance features of the single-frame appearance images corresponding to the camera parameters of the preset reference viewpoint and the camera parameters of the target viewpoint in the target scene.
[0030] The geometric feature sequence is a feature sequence obtained by encoding a single-frame depth image and contains only information such as 3D shape and distance. It is the set of geometric features of the single-frame depth images corresponding to the camera parameters of the preset reference viewpoint and the camera parameters of the target viewpoint in the target scene.
[0031] Optionally, a variational autoencoder (VAE) is used to spatially downsample a single-frame appearance image from a preset reference viewpoint to obtain an appearance feature map of the preset reference viewpoint. Based on the appearance feature map of the preset reference viewpoint and the camera parameters of the target viewpoint, the appearance feature map of the target viewpoint is determined. Based on the appearance feature maps of the preset reference viewpoint and the target viewpoint, the appearance feature sequence of the target scene is determined. Here, the appearance feature map of the target viewpoint is an initial noise sequence, at which point it contains unknown content.
[0032] Optionally, spatial downsampling is performed on a single-frame depth image from a preset reference viewpoint to obtain a geometric feature map of the preset reference viewpoint. Based on the geometric feature map of the preset reference viewpoint and the camera parameters of the target viewpoint, a geometric feature map of the target viewpoint is determined. Based on the geometric feature map of the preset reference viewpoint and the geometric feature map of the target viewpoint, a geometric feature sequence of the target scene is determined. Here, the geometric feature map of the target viewpoint is an initial noise sequence, at which point the geometric feature map of the target viewpoint contains unknown content.
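As a rough illustration of the two paragraphs above, the following Python sketch assembles a geometric feature sequence from one reference-view depth map: the reference view is spatially downsampled into tokens, and each target view starts as a block of Gaussian noise of the same size, since its content is still unknown. Average pooling stands in for the encoder here, and the names `average_pool` and `build_geometric_sequence`, the `factor` and `seed` parameters, and the flat concatenation order are assumptions for illustration. The conditioning on target-view camera parameters described in the text is omitted.

```python
import random


def average_pool(image, factor):
    """Downsample a 2-D map (list of rows) by averaging factor x factor blocks."""
    height, width = len(image), len(image[0])
    pooled = []
    for r in range(0, height - height % factor, factor):
        row = []
        for c in range(0, width - width % factor, factor):
            block = [image[r + i][c + j]
                     for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def build_geometric_sequence(depth, num_target_views, factor=2, seed=0):
    """Concatenate reference-view tokens with noise tokens for target views."""
    rng = random.Random(seed)
    ref_map = average_pool(depth, factor)
    ref_tokens = [value for row in ref_map for value in row]
    sequence = list(ref_tokens)
    for _ in range(num_target_views):
        # Target views are unknown at this point, so they start as noise.
        sequence.extend(rng.gauss(0.0, 1.0) for _ in ref_tokens)
    return sequence
```

The same pattern would apply to the appearance branch, with a VAE encoder in place of the pooling step.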
[0033] Step 103: Using a preset multi-view four-dimensional world model, generate the target appearance feature sequence and the target depth feature sequence based on the appearance feature sequence and the geometric feature sequence.
[0034] The preset multi-view four-dimensional world model is a diffusion-based Transformer model used to generate multi-view, time-evolving four-dimensional world scenes based on single-frame observation images and text operation instructions. The target appearance feature sequence is the model output representing the generated future RGB features, covering all future time steps for both the preset reference viewpoint and the target viewpoint. The target depth feature sequence represents the generated future depth features, covering all future time steps for both the preset reference viewpoint and the target viewpoint.
[0035] Optionally, a preset multi-view four-dimensional world model is used to generate a target appearance feature sequence and a target depth feature sequence based on the appearance feature sequence, the geometric feature sequence, and the camera parameters of the target viewpoint.
[0036] Step 104: Generate a target appearance image sequence and a target depth image sequence based on the target appearance feature sequence and the target depth feature sequence.
[0037] The target appearance image sequence is a multi-frame color image sequence covering the preset reference viewpoint and the target viewpoint, and the target depth image sequence is a multi-frame depth image sequence covering the preset reference viewpoint and the target viewpoint.
[0038] Optionally, a VAE decoder is used to spatially upsample the target appearance feature sequence to obtain the target appearance image sequence, and a VAE decoder is used to spatially upsample the target depth feature sequence to obtain the target depth image sequence.
[0039] Step 105: Generate a four-dimensional world model of the target scene based on the target appearance image sequence and the target depth image sequence.
[0040] The four-dimensional world model is a representation of a three-dimensional scene that changes over time.
[0041] Optionally, based on the target appearance image sequence and the target depth image sequence, backprojection is performed for each frame and each viewpoint: the pixels of the target appearance image are lifted into three-dimensional space using the corresponding target depth image, yielding a local point cloud for each viewpoint. The local point cloud of each viewpoint is transformed into the world coordinate system using the camera parameters of that viewpoint, and duplicate points are merged to obtain the three-dimensional point cloud of each frame. The three-dimensional point clouds of all frames are then combined in time order to obtain the four-dimensional world model of the target scene.
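The backprojection and merging described above can be sketched in plain Python under a standard pinhole camera model. The function names, the `(fx, fy, cx, cy)` intrinsics tuple, the 4x4 camera-to-world matrix layout, and the voxel-rounding rule for merging duplicate points are assumptions for illustration; per-point colors from the appearance images are omitted for brevity.

```python
def backproject(depth, fx, fy, cx, cy, cam_to_world):
    """Lift each valid depth pixel to a 3-D world point (pinhole model).

    depth: 2-D list of metric depths; non-positive values are skipped.
    cam_to_world: 4x4 row-major camera-to-world transform (assumed layout).
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            cam = ((u - cx) * z / fx, (v - cy) * z / fy, z, 1.0)
            world = tuple(sum(cam_to_world[i][k] * cam[k] for k in range(4))
                          for i in range(3))
            points.append(world)
    return points


def merge_views(per_view_points, voxel=0.01):
    """Merge per-view clouds, keeping one point per voxel to drop duplicates."""
    kept = {}
    for points in per_view_points:
        for p in points:
            key = tuple(round(c / voxel) for c in p)
            kept.setdefault(key, p)
    return list(kept.values())


def build_4d_model(frames, voxel=0.01):
    """Build a time-ordered list of merged point clouds.

    frames: list over time; each frame is a list of per-view tuples
    (depth, (fx, fy, cx, cy), cam_to_world).
    """
    model = []
    for views in frames:
        clouds = [backproject(depth, *intrinsics, pose)
                  for depth, intrinsics, pose in views]
        model.append(merge_views(clouds, voxel))
    return model
```

Representing the four-dimensional model as a list of per-frame point clouds keeps time as the outer index, which matches the "three-dimensional scene that changes over time" description; other containers would work equally well.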
[0042] Step 106: Control the robot based on the four-dimensional world model.
[0043] Optionally, the robot's actions in the four-dimensional world model are determined based on the four-dimensional world model, and the robot is operated and controlled based on those actions.
[0044] Optionally, the robot's initial control actions are optimized based on the four-dimensional world model until the robot's actions are consistent with the actions in the four-dimensional world model, and the robot is operated and controlled based on the optimized initial control actions.
[0045] In this embodiment, a single-frame observation image from a preset reference viewpoint and text operation instructions for the target operation task are obtained in the target scene; based on the single-frame observation image from the preset reference viewpoint, the camera parameters of the preset reference viewpoint, and the camera parameters of the target viewpoint, an appearance feature sequence and a geometric feature sequence of the target scene are obtained; using a preset multi-view four-dimensional world model, a target appearance feature sequence and a target depth feature sequence are generated based on the appearance feature sequence and the geometric feature sequence; a target appearance image sequence and a target depth image sequence are generated based on the target appearance feature sequence and the target depth feature sequence; a four-dimensional world model of the target scene is generated based on the target appearance image sequence and the target depth image sequence; and the robot is operated and controlled based on the four-dimensional world model. This application requires only a single-frame observation image and text operation instructions as input. It directly generates multi-view, time-consistent appearance and depth feature sequences using the preset multi-view four-dimensional world model, thereby constructing a complete four-dimensional world model for robot control. This avoids the dependence of traditional methods on multi-view hardware layout or multi-frame time-series data acquisition, reducing application deployment costs. At the same time, since the generated four-dimensional world model integrates multi-view geometric information and depth data, it can provide the robot with more accurate spatial geometric relationships, thereby improving the accuracy and robustness of operation control. Moreover, the entire process does not require manually designed feature engineering and generalizes well.
[0046] Based on the above embodiments, this application also provides a process for determining feature sequences in a robot operation control method based on a multi-view four-dimensional world model. Figure 2 is a flowchart of determining feature sequences in a robot operation control method based on a multi-view four-dimensional world model provided in an embodiment of this application. As shown in Figure 2, in step 103 above, using the preset multi-view four-dimensional world model to generate the target appearance feature sequence and the target depth feature sequence based on the appearance feature sequence and the geometric feature sequence includes: Step 201: Using the cross-modal fusion module in the preset multi-view four-dimensional world model, perform local cross-modal feature fusion on the appearance feature sequence and the geometric feature sequence to generate the fused appearance feature sequence and the fused geometric feature sequence.
[0047] The cross-modal fusion module is used to fuse appearance features and geometric features, that is, to incorporate geometric features into the appearance feature sequence and appearance features into the geometric feature sequence. The fused appearance feature sequence is the appearance feature sequence with geometric features incorporated, and the fused geometric feature sequence is the geometric feature sequence with appearance features incorporated.
[0048] Optionally, a cross-modal fusion module in a pre-defined multi-view four-dimensional world model is used to integrate geometric features into the appearance feature sequence, resulting in a fused appearance feature sequence. Alternatively, a cross-modal fusion module in a pre-defined multi-view four-dimensional world model is used to integrate appearance features into the geometric feature sequence, resulting in a fused geometric feature sequence.
[0049] Step 202: Obtain the camera poses from each viewpoint, and embed the fused appearance feature sequence and the fused geometric feature sequence according to the camera poses from each viewpoint to obtain the embedded appearance feature sequence and the embedded geometric feature sequence.
[0050] The viewpoints include the preset reference viewpoint and the target viewpoint. Camera pose refers to the camera's position and orientation in 3D space; it is represented by a rotation matrix and a translation vector and, for the purpose of embedding, is expressed in the camera's spherical coordinates. The embedded appearance feature sequence is the fused appearance feature sequence with camera pose information added: for each camera viewpoint in the appearance feature sequence, the corresponding camera pose information is added. Similarly, the embedded geometric feature sequence is obtained by adding the corresponding camera pose information for each camera viewpoint in the geometric feature sequence.
[0051] Optionally, based on the camera parameters of each viewpoint, the least-squares intersection of the optical axes of all cameras is computed. Taking this target point as the center, each camera is converted to spherical coordinates, and its yaw angle, pitch angle, roll angle and logarithmic distance are calculated. Fourier feature encoding is applied to these quantities to generate a 13-dimensional compact embedding, which serves as the camera pose. In this embedding, the geometric pose embedding vector of each camera viewpoint is obtained through a learnable constraint mapping function that normalizes and compresses the angles, limiting the range of angle values; its inputs are the yaw angle, the pitch angle and the roll angle of that camera viewpoint together with the logarithmic distance, and the resulting embedding vector has a dimension of 13.
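As a non-limiting sketch of the geometry described above, the following code finds the point closest to all optical axes in the least-squares sense, expresses a camera position in spherical coordinates about that point, and applies a generic sine/cosine Fourier encoding. The roll angle, the learnable constraint mapping function and the exact 13-dimensional layout are not reproduced here, since their precise form is not given in the text; the function names and the choice of `num_freqs` are assumptions made for illustration.

```python
import numpy as np

def least_squares_intersection(origins, directions):
    """Point minimizing the summed squared distance to all optical axes."""
    d = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    # Projector onto the plane orthogonal to each axis: I - d d^T
    P = np.eye(3)[None] - d[:, :, None] * d[:, None, :]
    A = P.sum(axis=0)
    b = np.einsum('nij,nj->i', P, origins)
    return np.linalg.lstsq(A, b, rcond=None)[0]

def spherical_pose(origin, center):
    """Yaw, pitch and log-distance of a camera position relative to the center."""
    rel = origin - center
    dist = np.linalg.norm(rel)
    yaw = np.arctan2(rel[1], rel[0])
    pitch = np.arcsin(np.clip(rel[2] / dist, -1.0, 1.0))
    return yaw, pitch, np.log(dist)

def fourier_features(x, num_freqs=2):
    """Generic sin/cos encoding at frequencies 1, 2, 4, ... (assumed form)."""
    freqs = 2.0 ** np.arange(num_freqs)
    scaled = np.atleast_1d(x)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(scaled), np.cos(scaled)], axis=1).ravel()
```

For example, `least_squares_intersection` can be called with one row per camera, where `origins` holds the camera centers and `directions` holds the viewing directions in world coordinates.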
[0052] Optionally, the camera pose of each viewpoint is determined based on the camera parameters of the preset reference viewpoint and the target viewpoint, and the viewpoint corresponding to each feature in the fused appearance feature sequence and the fused geometric feature sequence is determined. Based on the camera pose of each viewpoint, the fused appearance feature sequence and the fused geometric feature sequence are then embedded to obtain the embedded appearance feature sequence and the embedded geometric feature sequence.
[0053] Step 203: Using the cross-view attention module in the preset multi-view four-dimensional world model, generate the target appearance feature sequence and the target depth feature sequence based on the embedded appearance feature sequence, the embedded geometric feature sequence, the candidate action sequence of the target operation task, and the text operation instructions.
[0054] The candidate action sequence for the target operation task is obtained by extracting the robot's actions from a pre-set real training set, resulting in multiple action sequences as candidate action sequences. A cross-view attention module is used to exchange information between different viewpoints while maintaining geometric consistency.
[0055] Optionally, a cross-view attention module is used to restore the embedded appearance feature sequence and the embedded geometric feature sequence into a grid structure with view dimension. At the same time, the text operation instructions are encoded into text embedding vectors, and the candidate action sequence is compressed into trajectory latent variables through a temporal convolutional network encoder. The clean features are gradually recovered from the noise through an iterative denoising process. The final output includes the target appearance feature sequence and the target depth feature sequence containing all views and all future time steps.
[0056] In this embodiment, a cross-modal fusion module is used to achieve bidirectional enhancement of appearance and geometric features in the local neighborhood, thereby generating a geometrically consistent and appearance-rich fusion representation. Then, a position identifier is added to each viewpoint, enabling the model to distinguish the geometric relationships between different viewpoints. Finally, a cross-view attention module is used to achieve accurate geometric alignment between multiple viewpoints while maintaining computational efficiency. Combined with the joint guidance of text instructions and candidate action sequences, a temporally continuous, multi-view consistent target appearance and depth feature sequence that conforms to the task semantics is generated, improving the prediction accuracy and geometric consistency of the four-dimensional world model in robot operation and control.
[0057] Based on the above embodiments, this application also provides another process for determining feature sequences in a robot operation control method based on a multi-view four-dimensional world model. Figure 3 is a flowchart of determining feature sequences in another robot operation control method based on a multi-view four-dimensional world model provided in an embodiment of this application. As shown in Figure 3, in step 102 above, obtaining the appearance feature sequence and the geometric feature sequence of the target scene based on the single-frame observation image from the preset reference viewpoint, the camera parameters of the preset reference viewpoint, and the camera parameters of the target viewpoint includes: Step 301: Based on the single-frame appearance image of the preset reference viewpoint, the camera parameters of the preset reference viewpoint, and the camera parameters of the target viewpoint, generate the appearance feature sequence of the target scene at the preset reference viewpoint and the appearance feature sequence of the target scene at the target viewpoint.
[0058] Optionally, a preset VAE encoder is used to encode a single frame appearance image of a preset reference viewpoint to obtain an appearance feature sequence of the preset reference viewpoint. Based on the camera parameters of the preset reference viewpoint and the camera parameters of the target viewpoint, the corresponding position of each pixel in the reference viewpoint in the target viewpoint image is determined, thereby generating an appearance feature sequence of the target viewpoint.
[0059] Step 302: Based on the single-frame depth image of the preset reference viewpoint, the camera parameters of the preset reference viewpoint, and the camera parameters of the target viewpoint, generate the regional geometric feature sequence of the target scene at the preset reference viewpoint and the regional geometric feature sequence of the target scene at the target viewpoint.
[0060] Optionally, a preset VAE encoder is used to encode the single-frame depth image of the preset reference viewpoint to obtain the geometric feature sequence of the preset reference viewpoint. Based on the camera parameters of the preset reference viewpoint and the camera parameters of the target viewpoint, the position in the target viewpoint image corresponding to each pixel of the reference viewpoint is determined, thereby generating the geometric feature sequence of the target viewpoint.
[0061] Step 303: The appearance feature sequence of the region from the preset reference viewpoint and the appearance feature sequence of the region from the target viewpoint are spliced together to obtain the appearance feature sequence.
[0062] Optionally, intramodal fusion is achieved by concatenating the appearance feature sequences of regions from the same viewpoint along the width direction, thus marking the appearance feature sequences of regions from the same viewpoint as adjacent. For appearance feature sequences of regions from different viewpoints, viewpoint fusion is achieved by concatenating them along the height direction, and the appearance features of the regions after width concatenation and height concatenation are used as the appearance feature sequences.
[0063] Step 304: The geometric feature sequence of the region from the preset reference viewpoint and the geometric feature sequence of the region from the target viewpoint are spliced together to obtain the geometric feature sequence.
[0064] Optionally, intramodal fusion is achieved by concatenating the geometric feature sequences of regions from the same viewpoint along the width direction, thus marking the geometric feature sequences of regions from the same viewpoint as adjacent. For geometric feature sequences of regions from different viewpoints, viewpoint fusion is achieved by concatenating them along the height direction, and the geometric features of the regions after width concatenation and height concatenation are taken as the geometric feature sequence.
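The width-wise and height-wise concatenation described for the appearance and geometric feature sequences can be illustrated with the short sketch below. It assumes each regional feature map is an array of shape (height, width, channels) and that `features[v][r]` holds region `r` of viewpoint `v`; this layout and the function name `stitch` are hypothetical.

```python
import numpy as np

def stitch(features):
    """Concatenate regions of one viewpoint along width, then viewpoints along height."""
    rows = [np.concatenate(view_regions, axis=1) for view_regions in features]
    return np.concatenate(rows, axis=0)
```

Applying the same function to the appearance features and to the geometric features keeps the two stitched sequences identical in length and ordering.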
[0065] In this embodiment, appearance features and geometric features are extracted by region, and the regional features of the reference view and the target view are encoded separately and then stitched together in sequence in combination with camera parameters. This ensures that the generated appearance feature sequence and geometric feature sequence are strictly consistent in sequence length and arrangement order, establishing a clear correspondence for subsequent cross-modal fusion. This avoids the need for real-time deployment of multi-view hardware, reduces data acquisition costs, and enhances the adaptability and scalability of the four-dimensional world model for multi-view scene generation.
[0066] Based on the above embodiments, this application also provides a process for determining the fused feature sequence in a robot operation control method based on a multi-view four-dimensional world model. Figure 4 is a flowchart of determining the fused feature sequence in a robot operation control method based on a multi-view four-dimensional world model provided in an embodiment of this application. As shown in Figure 4, in step 201 above, using the cross-modal fusion module in the preset multi-view four-dimensional world model to perform local cross-modal feature fusion on the appearance feature sequence and the geometric feature sequence, and generating the fused appearance feature sequence and the fused geometric feature sequence, includes: Step 401: Add an appearance modality identifier and a geometric modality identifier to the appearance feature sequence and the geometric feature sequence, respectively.
[0067] In this model, the appearance modality identifier is a learnable vector used to mark features belonging to the appearance modality. This vector is added to each appearance feature, allowing the model to know that the feature comes from a color image. Similarly, the geometric modality identifier is a learnable vector used to mark features belonging to the depth modality. This vector is added to each geometric feature, allowing the model to know that the feature comes from a depth image.
[0068] Optionally, the appearance modality identifier is added to the appearance feature sequence to obtain the appearance feature sequence with the appearance modality identifier added.
[0069] Optionally, the geometric modality identifier is added to the geometric feature sequence to obtain the geometric feature sequence with the geometric modality identifier added.
[0070] Step 402: Using the cross-modal fusion module, calculate the first cross-attention parameter from appearance modality to geometric modality and the second cross-attention parameter from geometry to appearance modality based on the appearance feature sequence with added appearance modality identifier and the geometric feature sequence with added geometric modality identifier.
[0071] The first cross-attention parameter, calculated from the attention weights and aggregation results when querying the geometric modality from the appearance modality, represents what information each appearance location obtains from the geometric side. The second cross-attention parameter, calculated from the attention weights and aggregation results when querying the appearance modality from the geometric modality, represents what information each geometric location obtains from the appearance side.
[0072] Optionally, the cross-modal fusion module is used as follows. Based on the appearance feature sequence with the appearance modality identifier added, the spatial location of each feature in the appearance feature sequence is determined. On the geometric feature grid with the geometric modality identifier added, a local window $\mathcal{N}_r(i)$ of radius $r$ centered on that spatial location is defined, and the first cross-attention parameter from appearance to geometry is calculated as $c_i^{a \to g} = \mathrm{Attn}\left(W_Q^{a}\,\tilde{a}_i,\; W_K^{g}\,\tilde{G}_{\mathcal{N}_r(i)},\; W_V^{g}\,\tilde{G}_{\mathcal{N}_r(i)}\right)$, where $c_i^{a \to g}$ is the first cross-attention parameter, that is, the appearance cross-modal fusion output feature aggregated at position $i$; $\mathrm{Attn}$ is the attention calculation function; $W_Q^{a}$ is the learnable query weight of the appearance branch; $W_K^{g}$ is the learnable key weight of the geometric branch; $W_V^{g}$ is the learnable value weight of the geometric branch; $\tilde{a}_i$ is the appearance feature vector at the $i$-th spatial location with the appearance modality identifier added; and $\tilde{G}_{\mathcal{N}_r(i)}$ is the set of all geometric features, with the geometric modality identifier added, inside the local window centered on spatial location $i$.
[0073] Optionally, the cross-modal fusion module is used as follows. Based on the geometric feature sequence with the geometric modality identifier added, the spatial location of each feature in the geometric feature sequence is determined. On the appearance feature grid with the appearance modality identifier added, a local window $\mathcal{N}_r(i)$ of radius $r$ centered on that spatial location is defined, and the second cross-attention parameter from geometry to appearance is calculated as $c_i^{g \to a} = \mathrm{Attn}\left(W_Q^{g}\,\tilde{g}_i,\; W_K^{a}\,\tilde{A}_{\mathcal{N}_r(i)},\; W_V^{a}\,\tilde{A}_{\mathcal{N}_r(i)}\right)$, where $c_i^{g \to a}$ is the second cross-attention parameter, that is, the geometric cross-modal fusion output feature aggregated at position $i$; $\mathrm{Attn}$ is the attention calculation function; $W_Q^{g}$ is the learnable query weight of the geometric branch; $W_K^{a}$ is the learnable key weight of the appearance branch; $W_V^{a}$ is the learnable value weight of the appearance branch; $\tilde{g}_i$ is the geometric feature vector at the $i$-th spatial location with the geometric modality identifier added; and $\tilde{A}_{\mathcal{N}_r(i)}$ is the set of all appearance features, with the appearance modality identifier added, inside the local window centered on spatial location $i$.
[0074] Step 403: Using a cross-modal fusion module, generate a fused appearance feature sequence based on the first cross-attention parameter and the appearance feature sequence, and generate a fused geometric feature sequence based on the second cross-attention parameter and the geometric feature sequence.
[0075] Optionally, the cross-modal fusion module uses a gated residual connection to generate the fused appearance feature sequence from the first cross-attention parameter and the appearance feature sequence. The gate is a learnable appearance gating weight that controls the degree to which geometric information is incorporated into the appearance features.
[0076] Optionally, the cross-modal fusion module uses a gated residual connection to generate the fused geometric feature sequence from the second cross-attention parameter and the geometric feature sequence. The gate is a learnable geometric gating weight that controls the degree to which appearance information is incorporated into the geometric features.
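Steps 401 to 403 can be sketched, purely by way of example, as follows. The sketch adds a modality identifier vector to every feature, runs a local-window cross-attention in both directions, and combines the result with the original features through a gated residual. The softmax attention, the `tanh` form of the gate, the dictionary layout of `weights` and all function names are assumptions made for illustration; the application does not specify them.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_cross_attention(query_map, context_map, Wq, Wk, Wv, radius=1):
    """Each query location attends to context features in a (2r+1) x (2r+1) window."""
    H, W, C = query_map.shape
    out = np.zeros_like(query_map)
    for i in range(H):
        for j in range(W):
            win = context_map[max(0, i - radius): i + radius + 1,
                              max(0, j - radius): j + radius + 1].reshape(-1, C)
            q = query_map[i, j] @ Wq
            k, v = win @ Wk, win @ Wv
            w = softmax(q @ k.T / np.sqrt(C))
            out[i, j] = w @ v
    return out

def fuse(app, geo, m_app, m_geo, weights, gate_app, gate_geo, radius=1):
    """Bidirectional local fusion with gated residuals (assumed tanh gate)."""
    a, g = app + m_app, geo + m_geo                  # step 401: modality identifiers
    a2g = local_cross_attention(a, g, *weights['a2g'], radius=radius)  # step 402
    g2a = local_cross_attention(g, a, *weights['g2a'], radius=radius)
    fused_app = app + np.tanh(gate_app) * a2g        # step 403: gated residual
    fused_geo = geo + np.tanh(gate_geo) * g2a
    return fused_app, fused_geo
```

With the gates at zero the fused sequences equal the inputs, so under this assumed form the gate directly sets how much of the other modality is mixed in.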
[0077] In this embodiment, by adding exclusive modal identifiers to the appearance feature sequence and the geometric feature sequence respectively, the feature attributes of different modalities can be clearly distinguished, reducing the representational ambiguity caused by heterogeneous input information. Then, relying on the cross-modal fusion module, the cross-attention parameters between appearance and geometry are calculated bidirectionally to realize the local complementary interaction and deep association modeling of the two modal information. Finally, the fused appearance and geometric feature sequences are generated by iteratively updating the attention output parameters, which enhances the alignment consistency and expressive power of multimodal features, alleviates the problems of missing single-modal information and weak feature association, and provides more accurate and robust basic feature support for subsequent multi-view feature alignment and scene dynamic modeling.
[0078] Based on the above embodiments, the cross-view attention module includes: a deformable cross-view attention unit, a self-attention unit, and a cross-attention unit. This application also provides a process for determining target feature sequences in a robot operation control method based on a multi-view four-dimensional world model. Figure 5 is a flowchart of determining a target feature sequence in a robot operation control method based on a multi-view four-dimensional world model provided in an embodiment of this application. As shown in Figure 5, in step 203 above, using the cross-view attention module in the preset multi-view four-dimensional world model to generate the target appearance feature sequence and the target depth feature sequence based on the embedded appearance feature sequence, the embedded geometric feature sequence, the candidate action sequence of the target operation task, and the text operation instructions includes: Step 501: Using the deformable cross-view attention unit, perform cross-view geometric offset perception on the embedded appearance feature sequence and the embedded geometric feature sequence to obtain the offset appearance feature sequence and the offset geometric feature sequence.
[0079] The deformable cross-view attention unit is used to establish correspondences between different viewpoints. Cross-view geometric offset sensing samples candidate points along the epipolar line in other viewpoints for each query location and predicts the spatial offset of each candidate point.
[0080] Optionally, a deformable cross-view attention unit is used to restore the embedded appearance feature sequence and the embedded geometric feature sequence into a multi-view grid structure. For each query position in the current view, the corresponding epipolar line in each other view is calculated using the known camera parameters, and multiple candidate positions are uniformly sampled along the epipolar line. The two-dimensional spatial offset of each candidate point is predicted using the query features, the initial features of the candidate point, and the cosine similarity between the two as inputs. The offset is limited to a maximum magnitude and compensates for the position error caused by the discretization of the latent space. Finally, the aggregated cross-view information is superimposed on the original features through a residual connection to obtain the offset appearance feature sequence and the offset geometric feature sequence, in which the geometric correspondences from the other views are incorporated at each position.
[0081] Optionally, using the known camera parameters, the epipolar line induced in another viewpoint $v'$ by a query token of viewpoint $v$ is computed, and 10 candidate key-value positions are uniformly sampled along the epipolar line, generating a total of 100 candidate key-value positions. To compensate for the coarse resolution of the latent space, deformable refinement is introduced: a multilayer perceptron predicts small offsets based on the query features, the initially sampled key features, and their cosine similarity, while the maximum offset magnitude is limited: $\Delta p_{i,j} = \mathrm{clip}\left(\mathrm{MLP}_{\Delta}\left(\left[q_i^{v};\, k_{j}^{v'};\, \cos\left(q_i^{v}, k_{j}^{v'}\right)\right]\right),\, \delta_{\max}\right)$, where $\mathrm{clip}$ is the clipping function; $\mathrm{MLP}_{\Delta}$ is the offset prediction multilayer perceptron; $q_i^{v}$ is the query feature at the $i$-th spatial location of viewpoint $v$; $k_{j}^{v'}$ is the initial feature of the $j$-th candidate point sampled along the epipolar line in the target viewpoint $v'$; $\cos\left(q_i^{v}, k_{j}^{v'}\right)$ is the cosine similarity between the query feature and the initial candidate feature; and $\delta_{\max}$ is the maximum offset, a preset hyperparameter that the offset is not allowed to exceed.
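A minimal sketch of the epipolar sampling and offset clipping is given below. It assumes pinhole intrinsics `K1`, `K2` and a relative pose `(R, t)` mapping points from the query camera frame to the other camera frame, samples uniformly in the horizontal image coordinate, and takes the raw offsets as an input rather than modeling the offset prediction multilayer perceptron. All names are hypothetical.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_matrix(K1, K2, R, t):
    """F with x2^T F x1 = 0, where X2 = R X1 + t relates the two camera frames."""
    return np.linalg.inv(K2).T @ skew(t) @ R @ np.linalg.inv(K1)

def sample_epipolar(F, query_xy, width, num_samples):
    """Uniformly sample candidate positions on the epipolar line of one query pixel."""
    a, b, c = F @ np.array([query_xy[0], query_xy[1], 1.0])
    xs = np.linspace(0.0, width - 1.0, num_samples)
    ys = -(a * xs + c) / b          # assumes the line is not vertical (b != 0)
    return np.stack([xs, ys], axis=1)

def refine(points, raw_offsets, max_offset):
    """Apply predicted 2D offsets after clipping them to the maximum magnitude."""
    return points + np.clip(raw_offsets, -max_offset, max_offset)
```

In a full model the `raw_offsets` would come from the offset predictor fed with the query feature, the candidate feature and their cosine similarity; here they are left as a plain array so that the geometric part stays self-contained.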
[0082] Optionally, to compensate for the positional error caused by the discretization of the latent space, the aggregated cross-view information is superimposed on the original features through a residual connection, so as to obtain the offset appearance feature sequence and the offset geometric feature sequence in which the geometric correspondences from the other views are incorporated at each position.
[0083] Step 502: Using self-attention units, attention processing is performed on the offset appearance feature sequence and the offset geometric feature sequence to obtain self-attention appearance feature sequence and self-attention geometric feature sequence.
[0084] The self-attention unit is used to interact with all other positions in the sequence, propagating information globally. The attention processing is used to calculate the relevance weights between each position and other positions in the sequence, and aggregates the features of all positions based on the weights, so that each position obtains global contextual information.
[0085] Optionally, when using self-attention units for processing, the offset appearance feature sequence and the offset geometric feature sequence are treated as two independent input sequences. For each sequence, three vectors corresponding to the query, key, and value are generated for each position in the sequence. The relevance score between each position and all other positions in the sequence is calculated, usually obtained by the dot product of the query vector and the key vector. After scaling and normalization, the relevance score is converted into attention weights. The weights are used to sum the value vectors of all positions, so that the output features of each position can aggregate the global context information of the entire sequence. The residual connection adds the self-attention output to the original input and performs layer normalization to obtain the self-attention appearance feature sequence and the self-attention geometric feature sequence that incorporate global context information.
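The self-attention processing described above corresponds to standard scaled dot-product attention followed by a residual connection and layer normalization. The following single-head sketch, applied independently to each of the two sequences, is offered only as an illustration; the projection matrices `Wq`, `Wk`, `Wv` and the parameter-free layer normalization are simplifying assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position over its feature dimension (no learned scale or shift)."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(x, Wq, Wk, Wv):
    """x: (sequence_length, dim). Returns features enriched with global context."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])        # relevance between all positions
    scores = scores - scores.max(-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(-1, keepdims=True)               # attention weights per position
    return layer_norm(x + w @ v)                   # residual connection, then norm
```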
[0086] Step 503: Using cross-attention units, generate target appearance feature sequence and target depth feature sequence based on action sequence, text operation instructions, self-attention appearance feature sequence and self-attention geometric feature sequence.
[0087] Among them, the cross-attention unit can be a cross-attention module, which allows the current feature sequence to query another condition sequence.
[0088] Optionally, a cross-attention unit is employed. Text manipulation instructions are input into a pre-trained text encoder to extract fixed-dimensional text embedding vectors. The action sequence is then compressed into low-dimensional trajectory latent variables using a temporal convolutional network encoder. Using both the self-attention appearance feature sequence and the self-attention geometric feature sequence as queries, and the text embedding vector and trajectory latent variables as keys and values respectively, two cross-attention paths are computed. One path allows each spatial location's features to extract task-related semantic information from the text embeddings, while the other path allows each location's features to extract motion-style-related information from the trajectory latent variables. The outputs of the two cross-attention paths are fused with the original features through weighted summation or concatenation, ensuring that the generated features simultaneously contain task intent and motion guidance. After multi-layer cross-attention processing and multi-step iterative denoising, the system gradually recovers clean features from the noise, outputting a target appearance feature sequence and a target depth feature sequence encompassing all viewpoints and all future time steps.
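The two cross-attention paths and their fusion by weighted summation can be sketched as follows. The text encoder, the trajectory encoder and the iterative denoising loop are outside the scope of this sketch: `text_emb` and `traj_latent` are assumed to be already-computed token matrices, and the fusion weights `w_text` and `w_traj` are hypothetical parameters.

```python
import numpy as np

def attend(queries, context, Wq, Wk, Wv):
    """Single-head cross-attention: queries read from a condition sequence."""
    q, k, v = queries @ Wq, context @ Wk, context @ Wv
    s = q @ k.T / np.sqrt(q.shape[-1])
    s = s - s.max(-1, keepdims=True)
    w = np.exp(s)
    w = w / w.sum(-1, keepdims=True)
    return w @ v

def condition(features, text_emb, traj_latent, params, w_text=0.5, w_traj=0.5):
    """Inject task semantics (text) and motion guidance (trajectory) into features."""
    semantic = attend(features, text_emb, *params['text'])
    motion = attend(features, traj_latent, *params['traj'])
    return features + w_text * semantic + w_traj * motion
```

The same function would be applied to the self-attention appearance feature sequence and to the self-attention geometric feature sequence; concatenation could replace the weighted sum, as the text allows either.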
[0089] In this embodiment, a deformable cross-view attention unit is first used to perform sparse sampling and predict learnable offsets under epipolar geometry constraints, achieving sub-pixel-level precise localization of corresponding points across viewpoints and improving the geometric alignment accuracy between multiple viewpoints. Subsequently, a self-attention unit is used to propagate contextual information globally, enabling features at each location to perceive the relevant content of the entire sequence, enhancing the completeness of feature representation. Finally, text instructions and action sequences are injected as conditions through cross-attention units, allowing the generation process to be guided by both task semantics and motion intent. This application ensures the accuracy of cross-view geometric modeling, achieves efficient propagation of global information, and supports flexible external condition injection, thereby generating temporally coherent, multi-view consistent target appearance feature sequences and depth feature sequences that strictly conform to instructions and action intents.
[0090] Based on the above embodiments, this application also provides a process for determining the self-attention feature sequence in a robot operation control method based on a multi-view four-dimensional world model. Figure 6 is a flowchart of determining a self-attention feature sequence in a robot operation control method based on a multi-view four-dimensional world model provided in an embodiment of this application. As shown in Figure 6, in step 503 above, using the cross-attention unit to generate the target appearance feature sequence and the target depth feature sequence based on the action sequence, the text operation instructions, the self-attention appearance feature sequence, and the self-attention geometric feature sequence includes: Step 601: Use a preset action trajectory encoder to encode the action sequence to obtain the action latent features.
[0091] The preset action trajectory encoder is a pre-trained encoder network, which can be based on a Temporal Convolutional Network (TCN) or a Recurrent Neural Network (RNN), and is specifically designed to compress variable-length robot action sequences into fixed-length low-dimensional latent vectors. The action sequence is a series of action instructions performed by the robot during task execution, such as joint angles, end effector positions, and gripper opening / closing states at each time step, typically arranged chronologically. The action sequence can also be a series of action instructions for multiple different tasks. The action latent features are low-dimensional vectors output by the preset action trajectory encoder and represent the core motion characteristics of the action sequence, such as direction of motion, speed rhythm, and motion amplitude. This embodiment does not limit these aspects.
[0092] Optionally, the action sequence is preprocessed, and the preset action trajectory encoder extracts action features from the preprocessed action sequence and outputs a fixed-length low-dimensional vector; this vector is the action latent feature. Preprocessing may include normalization to ensure consistent numerical ranges across dimensions, and padding or truncation so that all sequences have a uniform length.
[0093] Optionally, a preset action trajectory decoder is used to decode the action latent features to obtain a predicted action sequence. Based on the action sequence and the predicted action sequence, the first loss of the preset action trajectory encoder is obtained, so that the preset action trajectory encoder can be optimized based on this loss. The first loss is expressed as a mathematical expectation and is minimized over the learnable parameters of the encoder; it combines a reconstruction term between the action sequence and the predicted action sequence with a divergence term, weighted by a balancing coefficient, between the distribution of the action latent features and a predefined prior distribution.
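One loss that is consistent with the terms listed above is the familiar variational autoencoder objective. The sketch below assumes, for illustration only, a mean squared reconstruction error, a diagonal Gaussian latent distribution parameterized by `mu` and `log_var`, a standard normal prior, and a Kullback-Leibler divergence; none of these choices is confirmed by the application, and the default value of `beta` is arbitrary.

```python
import numpy as np

def first_loss(actions, predicted, mu, log_var, beta=1e-3):
    """Reconstruction error plus beta-weighted KL divergence to N(0, I) (assumed form)."""
    recon = np.mean((actions - predicted) ** 2)
    # Closed-form KL between N(mu, diag(exp(log_var))) and N(0, I), averaged over the batch.
    kl = -0.5 * np.mean(np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1))
    return recon + beta * kl
```

Under these assumptions the loss is zero when the decoder reproduces the action sequence exactly and the latent distribution coincides with the prior.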
[0094] Optionally, based on a preset lightweight latent consistency head, the cross-attention unit is used to generate reconstructed conditional latent tokens from the action latent features, the text operation instructions, the self-attention appearance feature sequence, and the self-attention geometric feature sequence. A second loss of the preset action trajectory encoder is then determined from the reconstructed conditional latent tokens and the action latent features. The quantities involved in the second loss include the length of the spatial sequence and the number of feature channels, also known as the feature dimension.
[0095] Step 602: Using cross-attention units, generate target appearance feature sequence and target depth feature sequence based on action latent features, text operation instructions, self-attention appearance feature sequence and self-attention geometric feature sequence.
[0096] Optionally, the text operation instructions are encoded as text embeddings, and the action latent features serve as the trajectory condition. The self-attention appearance feature sequence and the self-attention geometric feature sequence are used as queries, and the text embeddings and action latent features are used as keys. Through the cross-attention calculation, the sequence features acquire task semantics from the text and motion style from the action. After fusion and iterative denoising, the target appearance feature sequence and the target depth feature sequence are output.
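The conditioning step above can be illustrated with a single-head scaled dot-product cross-attention. This is a simplified sketch under stated assumptions: the learned query, key, and value projections are omitted, the condition tokens serve as both keys and values, the result is added back to the queries as a residual, and the iterative denoising loop is not shown. The function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: List[Vector], conditions: List[Vector]) -> List[Vector]:
    """Let each query token attend over the condition tokens.

    queries:    appearance or geometric feature tokens.
    conditions: text embedding tokens concatenated with action latent tokens.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in conditions]
        weights = softmax(scores)
        context = [sum(w * c[i] for w, c in zip(weights, conditions)) for i in range(dim)]
        fused.append([qi + ci for qi, ci in zip(q, context)])  # residual connection
    return fused


if __name__ == "__main__":
    text_tokens = [[1.0, 0.0]]      # stand-in for a text embedding
    action_tokens = [[0.0, 1.0]]    # stand-in for an action latent feature
    features = [[0.0, 0.0], [2.0, 0.0]]
    print(cross_attention(features, text_tokens + action_tokens))
```

A query that is more similar to the text token draws more of its context from the text, and likewise for the action token, which is how both conditions are injected in one pass.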
[0097] In this embodiment, the original action sequence is compressed into low-dimensional action latent features by an action trajectory encoder, which effectively removes high-frequency noise and redundant details, making the conditional input more compact and easier for the model to learn. Then, cross-attention units are used to perform dual conditional injection with self-attention appearance and geometric feature sequences as queries and action latent features and text instructions as keys, so that the generation process is guided by both task semantics and motion intent, which enhances the ability of the generated result to follow user instructions and the precise control of action style.
[0098] Based on the above embodiments, this application also provides a process for determining operation control in a robot operation control method based on a multi-view four-dimensional world model. Figure 7 is a schematic diagram of this operation control process, provided by an embodiment of this application. As shown in Figure 7, in step 106 above, performing operation control on the robot according to the four-dimensional world model includes: Step 701: Generate dynamic scene features based on the four-dimensional world model and the text operation instructions.
[0099] The four-dimensional world model is a geometrically consistent point cloud sequence containing multiple time steps and multiple viewpoints, used to describe the changes in the state of a scene in three-dimensional space over time. Dynamic scene features are features extracted from the four-dimensional world model, representing the dynamic changes of the scene over time. Examples include the motion trajectories of objects, their deformation states, and the interaction between the robot and its environment.
[0100] Optionally, a point cloud sequence that changes over time is extracted from the four-dimensional world model. The positional changes of the point cloud, the movement trajectories of objects, and the interaction state between the robot and the environment are analyzed between adjacent time steps to obtain the spatiotemporal dynamic information of the scene. The text operation instructions are converted into semantic embedding vectors, and the temporal dynamic features and text semantic features are fused and aligned. The two are mutually enhanced through a temporal convolutional network or a cross-attention mechanism to obtain a dynamic scene feature vector.
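As a very small illustration of analyzing positional change between adjacent time steps, the sketch below tracks how the centroid of the point cloud moves from one step to the next. The centroid is a deliberately crude stand-in chosen for this example; the learned temporal features and the fusion with text embeddings described above are not modeled, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def centroid(points: Sequence[Point]) -> Point:
    """Mean position of one point cloud."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def centroid_displacements(cloud_sequence: Sequence[Sequence[Point]]) -> List[Point]:
    """Displacement of the cloud centroid between each pair of adjacent time steps."""
    centers = [centroid(cloud) for cloud in cloud_sequence]
    return [
        tuple(nxt[i] - cur[i] for i in range(3))
        for cur, nxt in zip(centers, centers[1:])
    ]


if __name__ == "__main__":
    clouds = [
        [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],  # time step 0
        [(1.0, 0.0, 0.0), (3.0, 0.0, 0.0)],  # time step 1: shifted along x
        [(1.0, 1.0, 0.0), (3.0, 1.0, 0.0)],  # time step 2: shifted along y
    ]
    print(centroid_displacements(clouds))
```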
[0101] Step 702: Using dynamic scene features, process the candidate action sequence to obtain the prior action sequence.
[0102] The candidate action sequence is the initial set of action candidates, which can be a randomly initialized sequence, a preset default action template, or a preliminary action trajectory obtained through optimization during testing. The prior action sequence is the preliminary action sequence obtained after dynamic scene feature guidance processing.
[0103] Optionally, the candidate action sequence is used as the initial action starting point. Then, the dynamic scene features and the candidate action sequence are aligned in the feature space. Through a cross-attention mechanism, the candidate action sequence is used as the query and the dynamic scene features are used as the key and value. The correlation between each action step and the dynamic changes of the scene is calculated, so that the candidate action sequence can obtain information about the evolution of the scene over time from the dynamic scene features. The aggregated scene information is used to update the candidate action sequence, and a prior action sequence that matches the current scene dynamics is output.
[0104] For example, the prior action sequence is obtained through a preset formula. The formula takes as the prior action sequence the value of the action variable at which an objective reaches its minimum. The objective consists of a reconstruction metric function, which measures the difference between two videos, and a regularization term. The first video is the output of a conditional video generator that takes the text operation instructions and a candidate action sequence as input; the second is the video generated by following the input text operation instructions. The regularization term is the sum of squares of the elements of the vector, which penalizes excessively large values, and it is scaled by a regularization coefficient that controls the strength of the regularization.
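One way to write an objective with these components is given below. The symbols are assumptions introduced for illustration ($a$ for a candidate action sequence, $\hat{a}$ for the prior action sequence, $c$ for the text operation instructions, $G$ for the conditional video generator, $V_c$ for the video generated by following $c$, $D$ for the reconstruction metric, $\lambda$ for the regularization coefficient); the exact form used in this application may differ.

```latex
\hat{a} = \arg\min_{a} \; D\left( G(c, a),\, V_{c} \right) + \lambda \, \lVert a \rVert_2^2
```

Read this way, the search favors action sequences whose rendered outcome matches the predicted video, while the quadratic penalty discourages unnecessarily large actions.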
[0105] Step 703: Correct the prior action sequence to obtain the executable action sequence.
[0106] Among them, the executable action sequence is the final action instruction sequence obtained after modification, which can be directly sent to the robot controller for execution.
[0107] Optionally, the prior action sequence is modified so that it can be executed by the robot, thereby obtaining an executable action sequence.
[0108] Step 704: Perform operation control on the robot according to the executable action sequence.
[0109] Optionally, the executable action sequence is converted into an instruction format that the robot controller can directly parse through a format conversion module. Then, time axis interpolation processing is performed on the action sequence to match the instruction frequency with the robot's control cycle, ensuring a smooth and continuous motion trajectory. The processed action instructions are sent to the robot controller one by one through the communication interface, driving the robotic arm and gripper to complete physical operations such as grasping, moving, and placing according to the preset trajectory, thereby achieving precise operation control of the robot.
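The time axis interpolation mentioned above can be sketched with linear interpolation. The sketch assumes that the planning period is an integer multiple `factor` of the robot's control period and that every action dimension is continuous; a discrete gripper open/close flag would need to be held rather than interpolated. The function name is hypothetical.

```python
from typing import List, Sequence


def upsample_actions(actions: Sequence[Sequence[float]], factor: int) -> List[List[float]]:
    """Insert `factor - 1` linearly interpolated steps between consecutive actions."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    if len(actions) < 2:
        return [list(a) for a in actions]
    dense = []
    for cur, nxt in zip(actions, actions[1:]):
        for j in range(factor):
            alpha = j / factor  # 0 at the current action, approaching 1 at the next
            dense.append([(1.0 - alpha) * c + alpha * n for c, n in zip(cur, nxt)])
    dense.append(list(actions[-1]))  # keep the final target exactly
    return dense


if __name__ == "__main__":
    # Two planned steps, two continuous dimensions, control loop twice as fast as planning.
    plan = [[0.0, 10.0], [1.0, 20.0]]
    for command in upsample_actions(plan, factor=2):
        print(command)
```

The output has `(len(actions) - 1) * factor + 1` commands, so the controller receives one command per control cycle and the original waypoints are preserved.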
[0110] In this embodiment, dynamic scene features are generated by a four-dimensional world model and text instructions, enabling action planning to fully utilize the spatiotemporal evolution information of the scene instead of relying solely on static images. Furthermore, the dynamic scene features are used to guide candidate action sequences, generating prior action sequences that match the scene dynamics. This avoids blindly searching the action space from scratch and ensures the consistency between actions and scene dynamics.
[0111] Based on the above embodiments, this application also provides a process for determining the executable action sequence in a robot operation control method based on a multi-view four-dimensional world model. Figure 8 is a flowchart of determining the executable action sequence, provided by an embodiment of this application. As shown in Figure 8, in step 703 above, correcting the prior action sequence to obtain the executable action sequence includes: Step 801: Based on at least two consecutive three-dimensional point sets in the dynamic scene features and the prior action sequence, use the residual inverse dynamics module to generate a correction amount.
[0112] Here, the two consecutive three-dimensional point sets are the three-dimensional point clouds of two adjacent time steps taken from the four-dimensional world model; they respectively represent the scene geometry at a given time step and at the next time step. The residual inverse dynamics module is built on a lightweight neural network and is used to analyze the geometric changes between the point clouds of two consecutive frames and to predict the correction amount in combination with the prior action.
[0113] Optionally, the residual inverse dynamics module performs feature extraction on the two consecutive three-dimensional point sets independently for each point, aggregates global features through max pooling to capture the overall geometric shape and structural information of the point cloud, and then generates the correction amount based on the prior action sequence.
[0114] Step 802: Correct the prior action sequence according to the correction amount to obtain the executable action sequence.
[0115] Optionally, the correction amount is superimposed on the prior action sequence to correct it, yielding the executable action sequence.
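Steps 801 and 802 can be sketched together as follows. This is a toy stand-in, not the module of this application: the per-point feature is simply the raw coordinate, the learned network is replaced by a fixed linear map `weights`, and a single action step is corrected rather than a full sequence. All names are hypothetical. What the sketch does preserve is the structure: order-invariant max pooling over points, a residual predicted from the geometric change together with the prior action, and addition of that residual to the prior action.

```python
from typing import List, Sequence

Point = Sequence[float]


def max_pool(points: Sequence[Point]) -> List[float]:
    """Channel-wise maximum over all points, giving an order-invariant global feature."""
    return [max(p[i] for p in points) for i in range(len(points[0]))]


def residual_correction(points_t: Sequence[Point],
                        points_next: Sequence[Point],
                        prior_action: Sequence[float],
                        weights: Sequence[Sequence[float]]) -> List[float]:
    """Predict a correction from the change in global geometry and the prior action.

    `weights` is a placeholder for trained parameters; it has one row per action
    dimension and one column per input (geometry change followed by prior action).
    """
    change = [b - a for a, b in zip(max_pool(points_t), max_pool(points_next))]
    inputs = change + list(prior_action)
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def apply_correction(prior_action: Sequence[float], delta: Sequence[float]) -> List[float]:
    """Superimpose the correction amount on the prior action."""
    return [a + d for a, d in zip(prior_action, delta)]


if __name__ == "__main__":
    cloud_t = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
    cloud_next = [(0.0, 0.0, 0.0), (1.5, 1.0, 1.0)]   # scene extent grew by 0.5 along x
    prior = [1.0, 2.0]
    toy_weights = [[1.0, 0.0, 0.0, 0.0, 0.0],          # first action follows the x change
                   [0.0, 0.0, 0.0, 0.0, 0.0]]          # second action left untouched
    delta = residual_correction(cloud_t, cloud_next, prior, toy_weights)
    print(apply_correction(prior, delta))
```

Because only the residual is predicted, an untrained or zero-weight module leaves the prior action unchanged, which is the property that makes the residual formulation easier to learn than predicting the full action.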
[0116] In this embodiment, the geometric changes between two consecutive frames of three-dimensional point sets are directly analyzed by the residual inverse dynamics module, so that the generation of correction amount is based on accurate spatial geometric information. At the same time, the residual prediction strategy is adopted to learn only the deviation between the prior action and the actual action rather than the complete action, which reduces the learning difficulty of the inverse dynamics problem, effectively compensates for the error caused by potential spatial discretization or dynamic complexity, and improves the accuracy and reliability of robot operation control.
[0117] Based on the same inventive concept, this application also provides a robot operation control device based on a multi-view four-dimensional world model, which corresponds to the robot operation control method based on a multi-view four-dimensional world model. Since the principle of the device in this application is similar to the robot operation control method based on a multi-view four-dimensional world model described above, the implementation of the device can refer to the implementation of the method, and the repeated parts will not be described again.
[0118] Figure 9 is a schematic diagram of a robot operation control device based on a multi-view four-dimensional world model, provided by an embodiment of this application. As shown in Figure 9, the device includes: a first acquisition module 901, a second acquisition module 902, a first generation module 903, a second generation module 904, a third generation module 905, and a control module 906; wherein: The first acquisition module 901 is used to acquire a single-frame observation image of a preset reference viewpoint in the target scene and text operation instructions for the target operation task; the single-frame observation image includes: a single-frame appearance image and a single-frame depth image; The second acquisition module 902 is used to acquire the appearance feature sequence and geometric feature sequence of the target scene based on the single-frame observation image of the preset reference viewpoint, the camera parameters of the preset reference viewpoint, and the camera parameters of the target viewpoint; The first generation module 903 is used to generate a target appearance feature sequence and a target depth feature sequence based on the appearance feature sequence and the geometric feature sequence, using a preset multi-view four-dimensional world model; The second generation module 904 is used to generate a target appearance image sequence and a target depth image sequence based on the target appearance feature sequence and the target depth feature sequence; The third generation module 905 is used to generate a four-dimensional world model of the target scene based on the target appearance image sequence and the target depth image sequence; The control module 906 is used to perform operation control on the robot based on the four-dimensional world model.
[0119] In one possible implementation, the first generation module 903 is specifically used to: use the cross-modal fusion module in the preset multi-view four-dimensional world model to perform local cross-modal feature fusion on the appearance feature sequence and the geometric feature sequence to generate the fused appearance feature sequence and the fused geometric feature sequence. The camera poses from each viewpoint are obtained, and the fused appearance feature sequence and fused geometric feature sequence are embedded based on the camera poses from each viewpoint to obtain the embedded appearance feature sequence and embedded geometric feature sequence. Using the cross-view attention module in the pre-set multi-view four-dimensional world model, the target appearance feature sequence and target depth feature sequence are generated based on the embedded appearance feature sequence, the embedded geometric feature sequence, the candidate action sequence of the target operation task, and the text operation instructions.
[0120] In one possible implementation, the second acquisition module 902 is specifically used to: generate a region appearance feature sequence of the target scene at the preset reference viewpoint and a region appearance feature sequence at the target viewpoint, based on the single-frame appearance image of the preset reference viewpoint, the camera parameters of the preset reference viewpoint, and the camera parameters of the target viewpoint; generate a region geometric feature sequence of the target scene at the preset reference viewpoint and a region geometric feature sequence at the target viewpoint, based on the single-frame depth image of the preset reference viewpoint, the camera parameters of the preset reference viewpoint, and the camera parameters of the target viewpoint; splice the region appearance feature sequence of the preset reference viewpoint and the region appearance feature sequence of the target viewpoint to obtain the appearance feature sequence; and splice the region geometric feature sequence of the preset reference viewpoint and the region geometric feature sequence of the target viewpoint to obtain the geometric feature sequence.
[0121] In one possible implementation, the first generation module 903 is specifically used to: add appearance modality identifiers and geometric modality identifiers to the appearance feature sequence and the geometric feature sequence, respectively; A cross-modal fusion module is used to calculate the first cross-attention parameter from appearance modality to geometric modality and the second cross-attention parameter from geometry to appearance modality, based on the appearance feature sequence with added appearance modality identifier and the geometric feature sequence with added geometric modality identifier. A cross-modal fusion module is used to generate a fused appearance feature sequence based on the first cross-attention parameter and the appearance feature sequence, and to generate a fused geometric feature sequence based on the second cross-attention parameter and the geometric feature sequence.
[0122] In one possible implementation, the cross-view attention module includes: a deformable cross-view attention unit, a self-attention unit, and a cross-attention unit; the first generation module 903 is specifically used to: use the deformable cross-view attention unit to perform cross-view geometric offset perception on the embedded appearance feature sequence and the embedded geometric feature sequence to obtain the offset appearance feature sequence and the offset geometric feature sequence. Self-attention units are used to perform attention processing on the offset appearance feature sequence and the offset geometric feature sequence to obtain self-attention appearance feature sequence and self-attention geometric feature sequence. A cross-attention unit is used to generate a target appearance feature sequence and a target depth feature sequence based on the action sequence, text operation instructions, self-attention appearance feature sequence, and self-attention geometric feature sequence.
[0123] In one possible implementation, the first generation module 903 is specifically used to: encode the action sequence using a preset action trajectory encoder to obtain action latent features; and use the cross-attention unit to generate a target appearance feature sequence and a target depth feature sequence based on the action latent features, the text operation instructions, the self-attention appearance feature sequence, and the self-attention geometric feature sequence.
[0124] In one possible implementation, the control module 906 is specifically used to: generate dynamic scene features based on the four-dimensional world model and text operation instructions; By using dynamic scene features, candidate action sequences are processed to obtain prior action sequences; The prior action sequence is modified to obtain the executable action sequence; The robot is controlled by a sequence of executable actions.
[0125] In one possible implementation, the control module 906 is specifically used to: generate a correction amount using a residual inverse dynamics module based on at least two consecutive three-dimensional point sets and a prior action sequence in the dynamic scene features; The prior action sequence is modified based on the correction amount to obtain the executable action sequence.
[0126] The processing flow of each module in the device and the interaction flow between each module can be referred to the relevant descriptions in the above method embodiments, and will not be detailed here.
[0127] This application also provides an electronic device. Figure 10 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. As shown in Figure 10, the device includes a processor 1001 and a memory 1002, and optionally, a bus 1003. The memory 1002 stores machine-readable instructions executable by the processor 1001. When the electronic device is running, the processor 1001 and the memory 1002 communicate via the bus 1003. When the machine-readable instructions are executed by the processor 1001, the steps of the robot operation control method based on the multi-view four-dimensional world model described above are performed. The electronic device can be a robot control device; in that case, it controls the robot based on the robot operation control method based on the multi-view four-dimensional world model. The electronic device can also be an external control device; in that case, it communicates with the robot's controller to execute the robot operation control method based on the multi-view four-dimensional world model, and communicates with the robot wirelessly to control the robot.
[0128] This application also provides a computer-readable storage medium storing a computer program, which, when executed by a processor, performs the steps of the robot operation control method based on a multi-view four-dimensional world model.
[0129] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems and devices described above can be referred to the corresponding processes in the method embodiments, and will not be repeated here. In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods can be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of modules is only a logical functional division, and in actual implementation, there may be other division methods. Furthermore, multiple modules or components can be combined or integrated into another system, or some features can be ignored or not executed. Another point is that the displayed or discussed mutual coupling or direct coupling or communication connection can be through some communication interfaces; the indirect coupling or communication connection of devices or modules can be electrical, mechanical, or other forms.
[0130] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. If the functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this invention. The aforementioned storage medium includes: USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and other media capable of storing program code.
[0131] The above are merely specific embodiments of this application, but the scope of protection of this application is not limited thereto. Any changes or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application.
Claims
1. A robot operation control method based on a multi-view four-dimensional world model, characterized in that, The method includes: Acquire a single-frame observation image from a preset reference viewpoint in the target scene and text operation instructions for the target operation task; the single-frame observation image includes: a single-frame appearance image and a single-frame depth image; Based on the single-frame observation image from the preset reference viewpoint, the camera parameters from the preset reference viewpoint, and the camera parameters from the target viewpoint, the appearance feature sequence and geometric feature sequence of the target scene are obtained. Using a preset multi-view four-dimensional world model, a target appearance feature sequence and a target depth feature sequence are generated based on the appearance feature sequence and the geometric feature sequence. Based on the target appearance feature sequence and the target depth feature sequence, generate a target appearance image sequence and a target depth image sequence; A four-dimensional world model of the target scene is generated based on the target appearance image sequence and the target depth image sequence. The robot is operated and controlled based on the four-dimensional world model.
2. The method according to claim 1, characterized in that, The method employs a preset multi-view four-dimensional world model, and generates a target appearance feature sequence and a target depth feature sequence based on the appearance feature sequence and the geometric feature sequence, including: Using the cross-modal fusion module in the preset multi-view four-dimensional world model, local cross-modal feature fusion is performed on the appearance feature sequence and the geometric feature sequence to generate a fused appearance feature sequence and a fused geometric feature sequence. The camera poses from each viewpoint are obtained, and the fused appearance feature sequence and the fused geometric feature sequence are embedded based on the camera poses from each viewpoint to obtain the embedded appearance feature sequence and the embedded geometric feature sequence. Using the cross-view attention module in the preset multi-view four-dimensional world model, a target appearance feature sequence and a target depth feature sequence are generated based on the embedded appearance feature sequence, the embedded geometric feature sequence, the candidate action sequence of the target operation task, and the text operation instructions.
3. The method according to claim 1, characterized in that, The step of obtaining the appearance feature sequence and geometric feature sequence of the target scene based on the single-frame observation image of the preset reference viewpoint, the camera parameters of the preset reference viewpoint, and the camera parameters of the target viewpoint includes: Based on the single-frame appearance image of the preset reference viewpoint, the camera parameters of the preset reference viewpoint, and the camera parameters of the target viewpoint, generate a region appearance feature sequence of the target scene at the preset reference viewpoint and a region appearance feature sequence at the target viewpoint; Based on the single-frame depth image of the preset reference viewpoint, the camera parameters of the preset reference viewpoint, and the camera parameters of the target viewpoint, generate a region geometric feature sequence of the target scene at the preset reference viewpoint and a region geometric feature sequence at the target viewpoint; The appearance feature sequence is obtained by splicing the region appearance feature sequence of the preset reference viewpoint and the region appearance feature sequence of the target viewpoint; The geometric feature sequence is obtained by splicing the region geometric feature sequence of the preset reference viewpoint and the region geometric feature sequence of the target viewpoint.
4. The method according to claim 2, characterized in that, The method employs a cross-modal fusion module within a pre-defined multi-view four-dimensional world model to perform local cross-modal feature fusion on the appearance feature sequence and the geometric feature sequence, generating fused appearance feature sequences and fused geometric feature sequences, including: Add appearance modality identifiers and geometric modality identifiers to the appearance feature sequence and the geometric feature sequence, respectively; Using the cross-modal fusion module, based on the appearance feature sequence with added appearance modality identifier and the geometric feature sequence with added geometric modality identifier, the first cross-attention parameter from appearance modality to geometric modality and the second cross-attention parameter from geometry to appearance modality are calculated respectively. Using the cross-modal fusion module, the fused appearance feature sequence is generated based on the first cross-attention parameter and the appearance feature sequence, and the fused geometric feature sequence is generated based on the second cross-attention parameter and the geometric feature sequence.
5. The method according to claim 2, characterized in that, The cross-view attention module includes: a deformable cross-view attention unit, a self-attention unit, and a cross-attention unit; The method employs the cross-view attention module in the preset multi-view four-dimensional world model to generate a target appearance feature sequence and a target depth feature sequence based on the embedded appearance feature sequence, the embedded geometric feature sequence, the action sequence of the target operation task, and the text operation instructions, including: Using the deformable cross-view attention unit, cross-view geometric offset perception is performed on the embedded appearance feature sequence and the embedded geometric feature sequence to obtain the offset appearance feature sequence and the offset geometric feature sequence. The self-attention unit is used to perform attention processing on the offset appearance feature sequence and the offset geometric feature sequence to obtain a self-attention appearance feature sequence and a self-attention geometric feature sequence. Using the cross-attention unit, the target appearance feature sequence and the target depth feature sequence are generated based on the action sequence, the text operation instruction, the self-attention appearance feature sequence, and the self-attention geometric feature sequence.
6. The method according to claim 5, characterized in that, The step of using the cross-attention unit to generate the target appearance feature sequence and the target depth feature sequence based on the action sequence, the text operation instruction, the self-attention appearance feature sequence, and the self-attention geometric feature sequence includes: The action sequence is encoded using a preset action trajectory encoder to obtain action latent features; Using the cross-attention unit, the target appearance feature sequence and the target depth feature sequence are generated based on the action latent features, the text operation instructions, the self-attention appearance feature sequence, and the self-attention geometric feature sequence.
7. The method according to claim 1, characterized in that, The operation and control of the robot based on the four-dimensional world model includes: Based on the four-dimensional world model and the text operation instructions, dynamic scene features are generated; The candidate action sequence is processed using the dynamic scene features to obtain a prior action sequence; The prior action sequence is corrected to obtain an executable action sequence; The robot is operated and controlled according to the executable action sequence.
8. The method according to claim 7, characterized in that, The step of correcting the prior action sequence to obtain an executable action sequence includes: Based on at least two consecutive three-dimensional point sets in the dynamic scene features and the prior action sequence, a correction amount is generated using a residual inverse dynamics module; The prior action sequence is corrected according to the correction amount to obtain the executable action sequence.
9. A robot operation control device based on a multi-view four-dimensional world model, characterized in that, The device includes: The first acquisition module is used to acquire a single-frame observation image from a preset reference viewpoint in the target scene and text operation instructions for the target operation task; the single-frame observation image includes: a single-frame appearance image and a single-frame depth image; The second acquisition module is used to acquire the appearance feature sequence and geometric feature sequence of the target scene based on the single-frame observation image of the preset reference view, the camera parameters of the preset reference view, and the camera parameters of the target view. The first generation module is used to generate a target appearance feature sequence and a target depth feature sequence based on the appearance feature sequence and the geometric feature sequence using a preset multi-view four-dimensional world model. The second generation module is used to generate a target appearance image sequence and a target depth image sequence based on the target appearance feature sequence and the target depth feature sequence. The third generation module is used to generate a four-dimensional world model of the target scene based on the target appearance image sequence and the target depth image sequence. The control module is used to operate and control the robot according to the four-dimensional world model.
10. An electronic device, characterized in that, it includes: a processor and a memory, the memory storing machine-readable instructions executable by the processor, wherein when the electronic device is running, the processor executes the machine-readable instructions to perform the steps of the robot operation control method based on a multi-view four-dimensional world model as described in any one of claims 1 to 8.