Image data generation method, mobile device, electronic device and medium

A historical image feature sequence and a noisy image feature sequence are input into an image feature prediction model to generate image data in which the state of things conforms to the development law of things. This addresses the poor image quality of robot world models and improves the task completion effect.

WO2026138009A1 · PCT designated stage · Publication Date: 2026-07-02 · AGIBOT INNOVATION (SHANGHAI) TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office: WO
Patent Type: Application
Current Assignee / Owner: AGIBOT INNOVATION (SHANGHAI) TECHNOLOGY CO LTD
Filing Date: 2025-09-15
Publication Date: 2026-07-02

AI Technical Summary

Technical Problem

The images or videos generated by existing robot world models are of poor quality, which may cause objects to be occluded or deformed, affecting the decision-making and planning process, and thus affecting the completion of the task.

Method used

A historical image feature sequence and a noisy image feature sequence are acquired and input into an image feature prediction model, which generates a predicted image feature sequence; image data is then generated from that sequence. The model processes the noisy image feature sequence based on the historical image feature sequence, so that the state of things represented by the predicted sequence conforms to the development law of things.

Benefits of technology

This improves the quality of the generated image data and reduces the occurrence of objects being occluded or deformed, thereby enhancing the accuracy of robot decision-making and planning and ultimately improving task completion.

✦ Generated by Eureka AI based on patent content.

Figure CN2025121414_02072026_PF_FP_ABST

Abstract

Provided in the present application are an image data generation method, a mobile device, an electronic device and a medium. The image data generation method comprises: acquiring a historical image feature sequence, wherein the historical image feature sequence comprises a plurality of image features respectively corresponding to a plurality of historical moments; acquiring a noise image feature sequence, wherein the noise image feature sequence is temporally subsequent to the historical image feature sequence, and the dimension of image features in the noise image feature sequence is consistent with the dimension of the image features in the historical image feature sequence; inputting the historical image feature sequence and the noise image feature sequence into an image feature prediction model to obtain a predicted image feature sequence, wherein each image feature in the predicted image feature sequence is used for representing image information in a three-dimensional space; and on the basis of the predicted image feature sequence, generating image data. The technical solution of the present application can improve the quality of generated image data.

Description

Image data generation method, mobile device, electronic device and medium

Technical Field

[0001] This application relates to the field of robot vision technology, specifically to an image data generation method, mobile device, electronic device, and medium.

[0002] Background of the Invention

[0003] With the continuous advancement of science and technology, people's lives and work are gradually moving towards intelligence. For example, for a specific task, the scene image at a future moment can be predicted based on the current scene image. This allows for the planning of the task's execution process based on the future scene image. However, the future scene images predicted by current methods suffer from poor quality.

Summary of the Invention

[0004] In view of this, embodiments of this application provide an image data generation method, a mobile device, an electronic device, and a medium, which can improve the quality of the generated image data.

[0005] In a first aspect, embodiments of this application provide an image data generation method, comprising: acquiring a historical image feature sequence, wherein the historical image feature sequence includes multiple image features corresponding to multiple historical moments respectively; acquiring a noisy image feature sequence, wherein the noisy image feature sequence is located after the historical image feature sequence in chronological order, and the dimension of the image features in the noisy image feature sequence is consistent with the dimension of the image features in the historical image feature sequence; inputting the historical image feature sequence and the noisy image feature sequence into an image feature prediction model to obtain a predicted image feature sequence, wherein each image feature in the predicted image feature sequence is used to characterize image information in three-dimensional space; and generating image data based on the predicted image feature sequence.

[0006] In a second aspect, embodiments of this application provide a model training method, comprising: selecting multiple image features from an initial image feature sequence as a sample historical image feature sequence; obtaining a sample noise image feature sequence, wherein the sample noise image feature sequence is located after the sample historical image feature sequence in chronological order, and the dimension of the image features in the sample noise image feature sequence is consistent with the dimension of the image features in the sample historical image feature sequence; inputting the sample historical image feature sequence and the sample noise image feature sequence into an image feature prediction model to obtain a predicted sample image feature sequence, wherein each image feature in the predicted sample image feature sequence is used to characterize image information in three-dimensional space; and training the image feature prediction model based on the difference between the predicted sample image feature sequence and the target image feature sequence in the initial image feature sequence to obtain a trained image feature prediction model, wherein the target image feature sequence is consistent with the predicted sample image feature sequence in the time dimension.

[0007] In a third aspect, embodiments of this application provide an image data generation apparatus, comprising: an acquisition module for acquiring a historical image feature sequence, wherein the historical image feature sequence includes multiple image features corresponding to multiple historical moments respectively; the acquisition module is further configured to acquire a noisy image feature sequence, wherein the noisy image feature sequence is located after the historical image feature sequence in chronological order, and the dimension of the image features in the noisy image feature sequence is consistent with the dimension of the image features in the historical image feature sequence; a prediction module for inputting the historical image feature sequence and the noisy image feature sequence into an image feature prediction model to obtain a predicted image feature sequence, wherein each image feature in the predicted image feature sequence is used to characterize image information in three-dimensional space; and a generation module for generating image data based on the predicted image feature sequence.

[0008] In a fourth aspect, embodiments of this application provide a model training apparatus, comprising: a selection module for selecting multiple image features from an initial image feature sequence as a sample historical image feature sequence; an acquisition module for acquiring a sample noise image feature sequence, wherein the sample noise image feature sequence is located after the sample historical image feature sequence in chronological order, and the dimension of the image features in the sample noise image feature sequence is consistent with the dimension of the image features in the sample historical image feature sequence; a prediction module for inputting the sample historical image feature sequence and the sample noise image feature sequence into an image feature prediction model to obtain a predicted sample image feature sequence, wherein each image feature in the predicted sample image feature sequence is used to characterize image information in three-dimensional space; and a training module for training the image feature prediction model based on the difference between the predicted sample image feature sequence and the target image feature sequence in the initial image feature sequence to obtain a trained image feature prediction model, wherein the target image feature sequence is consistent with the predicted sample image feature sequence in the time dimension.

[0009] In a fifth aspect, embodiments of this application provide a mobile device including a control module, the control module being configured to execute the image data generation method described in the first aspect or the model training method described in the second aspect.

[0010] In a sixth aspect, embodiments of this application provide an electronic device, including: a processor; and a memory for storing processor-executable instructions, wherein the processor is used to execute the image data generation method described in the first aspect or the model training method described in the second aspect.

[0011] In a seventh aspect, embodiments of this application provide a computer-readable storage medium storing a computer program for executing the image data generation method described in the first aspect or the model training method described in the second aspect.

[0012] In an eighth aspect, embodiments of this application provide a computer program product comprising a computer program that, when executed by a processor of a computer device, enables the computer device to perform the image data generation method described in the first aspect or the model training method described in the second aspect.

[0013] In a ninth aspect, embodiments of this application provide a chip, including: a processor; and a memory for storing processor-executable instructions, wherein the processor is used to execute the image data generation method described in the first aspect or the model training method described in the second aspect.

[0014] This application provides an image data generation method, mobile device, electronic device, and medium. By acquiring historical image feature sequences and noisy image feature sequences, and inputting these sequences into an image feature prediction model, a predicted image feature sequence can be obtained. Image data can then be generated based on this predicted sequence. In this application, the noisy image feature sequence follows the historical image feature sequence in chronological order. The dimension of the image features in the noisy image feature sequence is consistent with the dimension of the image features in the historical image feature sequence. The historical image feature sequence includes multiple image features corresponding to multiple historical moments. The image feature prediction model can process the noisy image feature sequence based on this historical sequence, ensuring that the state of things represented by the predicted image feature sequence conforms to the development law of things. Therefore, the image data generated based on this predicted image feature sequence is of high quality, which can improve the accuracy of the robot's decisions or plans based on the image data, thereby improving the task completion effect. Furthermore, since each image feature in the predicted image feature sequence is used to characterize image information in three-dimensional space, the predicted image feature sequence can be presented according to the required perspective, thereby further improving the quality of the final image or video, for example, minimizing the occurrence of objects being occluded or deformed in the image or video.

[0015] Brief description of the attached figures

[0016] Figure 1 shows a schematic diagram of the system architecture of an image data generation system provided in an exemplary embodiment of this application.

[0017] Figure 2 is a flowchart illustrating an exemplary embodiment of the image data generation method provided in this application.

[0018] Figure 3 is a flowchart illustrating an image data generation method provided in another exemplary embodiment of this application.

[0019] Figure 4 is a schematic diagram of an image data generation process provided in an exemplary embodiment of this application.

[0020] Figure 5 is a flowchart illustrating a model training method provided in an exemplary embodiment of this application.

[0021] Figure 6 is a schematic diagram of the training and inference phases of the model provided in an exemplary embodiment of this application.

[0022] Figure 7 shows a schematic diagram of the structure of an image data generation apparatus provided in an exemplary embodiment of this application.

[0023] Figure 8 is a schematic diagram of the structure of a model training device provided in an exemplary embodiment of this application.

[0024] Figure 9 is a block diagram of an electronic device for performing an image data generation method or a model training method according to an exemplary embodiment of this application.

[0025] Methods of implementing the present invention

[0026] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0027] Application Overview

[0028] With the development of intelligent technologies, an increasing number of tasks can be performed by robots. For example, some robots can complete tasks independently, while others can interact with their environment and collaborate with other robots to accomplish tasks. In the field of robotics, a robot world model can refer to the internal representation that a robot uses to understand and predict its surrounding environment. This model helps robots make decisions and plans when performing tasks, and it can also be used to build realistic robot simulation environments for closed-loop evaluation and reinforcement learning.

[0029] Current robot world models can understand the physical world encountered by robots and generate high-fidelity images or videos. However, the images or videos generated by current robot world models suffer from poor quality. For example, objects in the images or videos may not retain the original shape and color in the real world, especially when the original object is occluded or presented at an undesirable angle. In such cases, the appearance of the object in the image or video may be distorted, and sometimes the object may even disappear completely. Such poor-quality images or videos can affect the robot's decision-making or planning processes, thereby impacting the completion of tasks.

[0030] To address the aforementioned technical problems, this application provides an image data generation method. By acquiring historical image feature sequences and noisy image feature sequences, and inputting these sequences into an image feature prediction model, a predicted image feature sequence can be obtained. Image data can then be generated based on this predicted sequence. In this embodiment, the noisy image feature sequence follows the historical image feature sequence in chronological order. The dimension of the image features in the noisy image feature sequence is consistent with the dimension of the image features in the historical image feature sequence. The historical image feature sequence includes multiple image features corresponding to multiple historical moments. The image feature prediction model can process the noisy image feature sequence based on this historical sequence, ensuring that the state of things represented by the predicted image feature sequence conforms to the laws of development. Therefore, the image data generated based on this predicted image feature sequence is of high quality, which can improve the accuracy of the robot's decisions or plans based on the image data, thereby improving the task completion effect. Here, the image data can be an image or video, or point cloud data, which can be rendered to obtain an image or video. Furthermore, since each image feature in the predicted image feature sequence is used to characterize image information in three-dimensional space, the predicted image feature sequence can be presented according to the required perspective, thereby further improving the quality of the final image or video, for example, minimizing the occurrence of objects being occluded or deformed in the image or video.

[0031] Exemplary System

[0032] Figure 1 shows a schematic diagram of the system architecture of an image data generation system provided in an exemplary embodiment of this application. As shown in Figure 1, the image data generation system 100 may include a computer device 110 and a task execution device 120. The computer device 110 may be a server or a terminal device, such as a mobile phone or a laptop. The task execution device 120 may be a device for performing tasks, such as a vehicle or a robot.

[0033] The task execution device 120 may be equipped with a sensor 121, which can be used to acquire images of the surrounding environment. In one example, the task execution device 120 can send images acquired by the sensor 121 at multiple historical moments to a computer device 110. In some cases, each historical moment may correspond to multiple images, which may be acquired from multiple viewpoints. The computer device 110 can obtain a historical image feature sequence based on the images at multiple historical moments, generate a noisy image feature sequence according to the dimension of the image features in the historical image feature sequence, and input the historical image feature sequence and the noisy image feature sequence into an image feature prediction model. The image feature prediction model processes the noisy image feature sequence based on the historical image feature sequence to obtain a predicted image feature sequence. The noisy image feature sequence is ordered after the historical image feature sequence in time, and the dimension of the image features in the noisy image feature sequence is consistent with the dimension of the image features in the historical image feature sequence. Each image feature in the predicted image feature sequence is used to characterize image information in three-dimensional space. Further, the computer device 110 can generate image data based on the predicted image feature sequence, which may be an image or a video. The computer device 110 can send the image data to the task execution device 120, which can make decisions or plans based on the image data, such as adjusting the task execution strategy to improve the task completion effect.

[0034] For example, the task execution device 120 is a vehicle, and the task to be performed by the task execution device 120 is to drive to the target location. The task execution device 120 can adjust the driving route between the current location and the target location according to the image data; or, the task execution device 120 is a dual-arm robot, and the task to be performed by the task execution device 120 is to pick up the target object. The task execution device 120 can adjust the joint angle of the robotic arm according to the image data to pick up the target object.

[0035] In another example, computer device 110 may generate image data based on a predicted sequence of image features, make a decision or plan based on the image data, and then send the decision or plan result to task execution device 120, which may execute a task based on the decision or plan result.

[0036] In another example, the task execution device 120 can directly obtain historical image feature sequences based on images from multiple historical moments, generate a noisy image feature sequence according to the dimensions of image features in the historical image feature sequence, and input the historical image feature sequence and the noisy image feature sequence into an image feature prediction model. The image feature prediction model then processes the noisy image feature sequence based on the historical image feature sequence to obtain a predicted image feature sequence. Furthermore, the task execution device 120 can generate image data based on the predicted image feature sequence, make decisions or plans based on this image data, and execute tasks based on the decision or planning results.

[0037] For example, sensor 121 may be at least one of a lidar sensor, an ultrasonic sensor, a vision sensor, etc.

[0038] It should be understood that the above application scenario examples are only shown to facilitate understanding of the spirit and principles of this application, and the embodiments of this application are not limited thereto. Rather, the embodiments of this application can be applied to any applicable scenario.

[0039] Exemplary methods

[0040] Figure 2 is a schematic flowchart of an image data generation method provided in an exemplary embodiment of this application. The method in Figure 2 can be executed by the computer device 110 or the task execution device 120 in Figure 1. For ease of description, the following description uses the computer device 110 executing the method as an example. As shown in Figure 2, the image data generation method may include the following:

[0041] 210: Obtain a historical image feature sequence.

[0042] Specifically, the historical image feature sequence includes multiple image features corresponding to multiple historical moments. These multiple historical moments can include multiple moments before the current moment, or moments before the current moment and the current moment.

[0043] In one example, each historical moment can correspond to a set of image features. These image features can represent image information corresponding to an image acquired from a single viewpoint, or image information corresponding to multiple images acquired from multiple viewpoints. The images here can be images acquired using sensors to gather environmental information around the task execution device 120. For example, for a given historical moment, multiple images can be acquired at that moment. Each of these images can correspond to a viewpoint, and the image information corresponding to these multiple images can be referred to as RGB information or 3D visual information. Based on these multiple images at that historical moment, a set of image features can be obtained. The image features corresponding to multiple historical moments can constitute a historical image feature sequence.

[0044] In one example, each image feature can be a visual latent code. Specifically, multiple images at each historical moment can be reconstructed into 3D point cloud data, such as 3D Gaussian point cloud data, or a static 3D Gaussian sphere in the form of a point cloud, using a multi-view 3D reconstruction method. Further, a visual encoder can be used to encode the reconstructed 3D point cloud data into a visual latent code, thus obtaining the image features. The visual encoder can be an encoder based on a variational autoencoder (VAE) architecture or other suitable encoders.

[0045] In one example, if the initial time is the current time, the image features corresponding to the current time can be repeatedly stacked to obtain multiple image features, that is, to obtain the historical image feature sequence.
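The repeated stacking described in paragraph [0045] can be illustrated with a minimal sketch. It assumes, purely for illustration, that one image feature is a flat list of floats; the name `history_from_current` and the chosen length are hypothetical and do not come from the application.

```python
from typing import List, Sequence

Feature = List[float]  # hypothetical stand-in for one image feature (latent code)


def history_from_current(current: Sequence[float], length: int) -> List[Feature]:
    """Build a history of `length` entries by repeating the current feature."""
    if length < 1:
        raise ValueError("length must be at least 1")
    # Copy on every repetition so entries can later be replaced independently.
    return [list(current) for _ in range(length)]


history = history_from_current([0.2, -1.0, 0.5], length=4)
```

Each entry is an independent copy, so a later update of the history does not alter the other entries.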

[0046] 220: Obtain a noisy image feature sequence.

[0047] Specifically, the noisy image feature sequence is located after the historical image feature sequence in terms of time order, and the dimension of the image features in the noisy image feature sequence is consistent with the dimension of the image features in the historical image feature sequence.

[0048] In one example, a noisy image feature sequence can be generated based on the dimension of image features in a historical image feature sequence.

[0049] In one example, the specific parameters of each image feature in the noisy image feature sequence can be set according to actual needs. For example, they can be set according to at least one factor, such as the actual task scenario or the changing patterns of multiple image features over time in the historical image feature sequence. Optionally, in other examples, the specific parameters of each image feature in the noisy image feature sequence can all be set to the initial value of 0.

[0050] In one example, the historical moments corresponding to the historical image feature sequence can be evenly spaced at a first interval, and the interval between the moment corresponding to the first image feature in the noisy image feature sequence and the moment corresponding to the last image feature in the historical image feature sequence can be equal or unequal to the first interval. Optionally, in other examples, the first intervals between the historical moments corresponding to the historical image feature sequence can be equal or unequal to one another, the second intervals between the moments corresponding to the image features in the noisy image feature sequence can be equal or unequal to one another, and the first interval can be equal or unequal to the second interval.
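Step 220 can be sketched as follows. The sketch assumes that an image feature is a flat list of floats. The "zeros" mode follows the option of setting all parameters to the initial value of 0. The "gaussian" mode is an assumption added for illustration, since the application only says the parameters can be set according to actual needs. The names `make_noise_sequence`, `horizon`, `mode` and `seed` are hypothetical.

```python
import random
from typing import List, Optional

Feature = List[float]  # hypothetical stand-in for one image feature


def make_noise_sequence(history: List[Feature], horizon: int,
                        mode: str = "gaussian",
                        seed: Optional[int] = None) -> List[Feature]:
    """Create `horizon` noisy features whose dimension matches the history."""
    if not history or horizon < 1:
        raise ValueError("need a non-empty history and horizon >= 1")
    dim = len(history[0])
    if any(len(f) != dim for f in history):
        raise ValueError("all historical features must share one dimension")
    if mode == "zeros":
        return [[0.0] * dim for _ in range(horizon)]
    if mode == "gaussian":  # assumption: standard normal noise
        rng = random.Random(seed)
        return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(horizon)]
    raise ValueError(f"unknown mode: {mode}")


history = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
noise = make_noise_sequence(history, horizon=3, seed=0)
zeros = make_noise_sequence(history, horizon=3, mode="zeros")
```

Only the feature dimension is taken from the history; the number of noisy features (`horizon`) is chosen independently.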

[0051] 230: Input the historical image feature sequence and the noisy image feature sequence into the image feature prediction model to obtain the predicted image feature sequence.

[0052] In one example, an image feature prediction model can be used to predict future image feature sequences based on existing image feature sequences. Specifically, the input to the image feature prediction model may include historical image feature sequences and noisy image feature sequences. The model can process the noisy image feature sequences based on the historical image feature sequences, such as through denoising or other processing, to obtain the predicted image feature sequences, which can also be called denoised image feature sequences. The historical image feature sequences and noisy image feature sequences can be input into the image feature prediction model as two independent sequences, or they can be concatenated into a single sequence and input into the model.
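The two input layouts mentioned in paragraph [0052] can be sketched as below. The functions `toy_model_joint` and `toy_model_separate` are placeholders invented for this illustration and are not the image feature prediction model of the application: they simply overwrite each noisy slot with the last historical feature, whereas the actual model would process (for example, denoise) the noisy features based on the history.

```python
from typing import List, Tuple

Feature = List[float]  # hypothetical stand-in for one image feature


def concat_inputs(history: List[Feature],
                  noise: List[Feature]) -> Tuple[List[Feature], int]:
    """Join both sequences along time; also return where the noisy part starts."""
    return list(history) + list(noise), len(history)


def toy_model_joint(sequence: List[Feature], split: int) -> List[Feature]:
    """Placeholder model taking one concatenated sequence."""
    last = sequence[split - 1]
    # One output feature per noisy slot, so the output length equals the noise length.
    return [list(last) for _ in sequence[split:]]


def toy_model_separate(history: List[Feature],
                       noise: List[Feature]) -> List[Feature]:
    """Placeholder model taking two independent sequences."""
    sequence, split = concat_inputs(history, noise)
    return toy_model_joint(sequence, split)


history = [[0.0, 1.0], [2.0, 3.0]]
noise = [[0.0, 0.0] for _ in range(3)]
joined, split = concat_inputs(history, noise)
predicted = toy_model_joint(joined, split)
```

In both layouts the predicted sequence has one image feature per noisy input feature.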

[0053] Specifically, each image feature in the predicted image feature sequence is used to characterize image information in three-dimensional space.

[0054] In one example, each image feature in the predicted image feature sequence can represent image information from multiple perspectives. This image information from multiple perspectives can represent the environmental information around the task execution device 120 in the task scene at a future time, such as the position and shape of objects around the task execution device 120 from multiple perspectives.

[0055] 240: Generate image data based on the predicted image feature sequence.

[0056] Specifically, the image data can be an image or a video; or the image data can be point cloud data, which can be rendered to obtain an image or a video.

[0057] In one example, the predicted image feature sequence can be a visual latent code, which can be decoded into 3D point cloud data, such as 3D Gaussian point cloud data, using a visual decoder. The 3D point cloud data can then be rendered to obtain an image or video. The visual decoder can be a VAE-based visual decoder or other suitable decoder.

[0058] In one example, Gaussian splatting or other rendering methods can be used to render 3D point cloud data onto any preset virtual camera to obtain the image or video from that camera.
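As a simplified illustration of rendering onto a preset virtual camera, the sketch below projects 3D points to pixel coordinates with a pinhole camera model. This is not Gaussian splatting and is not taken from the application; it only shows the camera geometry. It assumes the points are already expressed in the camera frame (x right, y down, z forward), and the names `project_points` and `focal` are hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]
Pixel = Optional[Tuple[float, float]]


def project_points(points: List[Point], focal: float,
                   width: int, height: int) -> List[Pixel]:
    """Project camera-frame points to pixels; None marks points not visible."""
    cx, cy = width / 2.0, height / 2.0  # principal point at the image center
    pixels: List[Pixel] = []
    for x, y, z in points:
        if z <= 0:  # behind the camera or on the camera plane
            pixels.append(None)
            continue
        u = focal * x / z + cx
        v = focal * y / z + cy
        inside = 0 <= u < width and 0 <= v < height
        pixels.append((u, v) if inside else None)
    return pixels


pixels = project_points([(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 0.0, -1.0)],
                        focal=100.0, width=640, height=480)
```

Changing the virtual camera amounts to transforming the points into a different camera frame before projection, which is why a 3D representation can be presented from any required perspective.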

[0059] In one example, any one of the image feature sequences from the historical image feature sequence, the noisy image feature sequence, and the predicted image feature sequence can be a visual image feature sequence.

[0060] This application provides an image data generation method. By acquiring historical image feature sequences and noisy image feature sequences, and inputting these sequences into an image feature prediction model, a predicted image feature sequence can be obtained. Image data can then be generated based on this predicted sequence. In this embodiment, the noisy image feature sequence follows the historical image feature sequence in chronological order. The dimension of the image features in the noisy image feature sequence is consistent with the dimension of the image features in the historical image feature sequence. The historical image feature sequence includes multiple image features corresponding to multiple historical moments. The image feature prediction model can process the noisy image feature sequence based on this historical sequence, ensuring that the state of things represented by the predicted image feature sequence conforms to the development law of things. Therefore, the image data generated based on this predicted image feature sequence is of high quality, which can improve the accuracy of the robot's decisions or plans based on the image data, thereby improving the task completion effect. Furthermore, since each image feature in the predicted image feature sequence is used to characterize image information in three-dimensional space, the predicted image feature sequence can be presented according to the required perspective, thereby further improving the quality of the final image or video, for example, minimizing the occurrence of objects being occluded or deformed in the image or video.

[0061] According to an embodiment of this application, the image data generation method further includes: updating the historical image feature sequence using at least one image feature from the predicted image feature sequence to obtain an updated historical image feature sequence; and repeating step 230 using the updated historical image feature sequence as the historical image feature sequence.

[0062] Specifically, the predicted image feature sequence may include one or more image features. At least one image feature from the predicted image feature sequence can be used to update the historical image feature sequence. For example, at least one image feature from the predicted image feature sequence can be added to the original historical image feature sequence, and N image features can be removed from the original historical image feature sequence, thus obtaining the updated historical image feature sequence. Here, N can be equal to the number of image features added from the predicted image feature sequence, ensuring that the dimension of the updated historical image feature sequence is consistent with the dimension of the original historical image feature sequence. Thus, the updated historical image feature sequence can be further used as input to the image feature prediction model.
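The update in paragraph [0062] can be sketched as a fixed-size window. The sketch assumes that the first `k` predicted features are the ones added and that the `k` oldest historical features are the ones removed; the application leaves both choices open. The name `update_history` is hypothetical.

```python
from typing import List

Feature = List[float]  # hypothetical stand-in for one image feature


def update_history(history: List[Feature], predicted: List[Feature],
                   k: int = 1) -> List[Feature]:
    """Add k predicted features and remove N = k historical ones."""
    if not 1 <= k <= len(predicted):
        raise ValueError("k must be between 1 and len(predicted)")
    if k > len(history):
        raise ValueError("cannot remove more features than the history holds")
    # Removing as many features as are added keeps the sequence size unchanged.
    return history[k:] + predicted[:k]


updated = update_history([[1.0], [2.0], [3.0]], [[4.0], [5.0]], k=2)
```

Because the number of removed features equals the number of added ones, the updated sequence can be fed to the same model without reshaping.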

[0063] Furthermore, in one example, the updated historical image feature sequence and the noisy image feature sequence can be input into the image feature prediction model. The image feature prediction model processes the noisy image feature sequence based on the updated historical image feature sequence to obtain a new predicted image feature sequence. The new predicted image feature sequence can be located after the updated historical image feature sequence in time order. In the current input of the image feature prediction model, the noisy image feature sequence can be the same as the noisy image feature sequence in the first input of the image feature prediction model.

[0064] In another example, steps 220 and 230 can be repeated. Specifically, a new noisy image feature sequence can be generated for the updated historical image feature sequence. The dimension of the new noisy image feature sequence is the same as that of the updated historical image feature sequence. The specific parameters of each image feature in the new noisy image feature sequence can be related to the time-varying patterns of multiple image features in the updated historical image feature sequence or other factors, and can be set according to actual needs. The updated historical image feature sequence and the new noisy image feature sequence are input into the image feature prediction model. The image feature prediction model processes the new noisy image feature sequence based on the updated historical image feature sequence to obtain a new predicted image feature sequence.

[0065] Similarly, the current historical image feature sequence can be continuously updated using at least one image feature from the current predicted image feature sequence to obtain an updated historical image feature sequence. The updated historical image feature sequence is then used as the historical image feature sequence, and step 230 is repeated, or steps 220 and 230 are repeated, so as to obtain the desired predicted image feature sequence for future moments.

[0066] In this embodiment, by using at least one image feature from the predicted image feature sequence to update the historical image feature sequence, an updated historical image feature sequence is obtained. The predicted image feature sequence for subsequent time periods can then be obtained based on the updated historical image feature sequence. In this way, the updating process of the historical image feature sequence and the prediction process of the image feature prediction model can be repeatedly executed according to the actual needs of the predicted image features, thereby improving prediction efficiency.

[0067] According to one embodiment of this application, updating a historical image feature sequence using at least one image feature from a predicted image feature sequence to obtain an updated historical image feature sequence includes: updating the historical image feature sequence using the last image feature from the predicted image feature sequence to obtain an updated historical image feature sequence.

[0068] In one example, the predicted image feature sequence may include multiple image features. The last image feature in the predicted image feature sequence is added after the original historical image feature sequence, and one image feature is removed from the original historical image feature sequence, thus obtaining an updated historical image feature sequence. The image feature removed from the original historical image feature sequence can be an image feature located at the beginning, end, or middle of the original historical image feature sequence.

[0069] In this embodiment, by using the last image feature in the predicted image feature sequence to update the historical image feature sequence, instead of using all image features in the predicted image feature sequence to update the historical image feature sequence, the computational load of the prediction process can be reduced, and the sparsity of the historical image feature sequence in time can be improved. This sparsity can improve the quality of the predicted image data.

[0070] According to one embodiment of this application, updating a historical image feature sequence using at least one image feature from a predicted image feature sequence to obtain an updated historical image feature sequence includes: updating the historical image feature sequence using the last image feature from the predicted image feature sequence according to a first-in-first-out rule to obtain an updated historical image feature sequence.

[0071] In one example, when updating the existing historical image feature sequence, N image features can be removed from the original historical image feature sequence. These N image features can be located at any position in the original historical image feature sequence, such as at the beginning, end, or middle of the original historical image feature sequence. These N image features can be continuous or discontinuous in time.

[0072] For example, following the first-in-first-out rule, the last image feature in the predicted image feature sequence can be added after the original historical image feature sequence, and the first image feature in the original historical image feature sequence can be removed, thus obtaining an updated historical image feature sequence.

[0073] In this embodiment, updating the historical image feature sequence according to the first-in, first-out (FIFO) rule can improve the timeliness of the historical image feature sequence, thereby improving the quality of the predicted image data. Furthermore, updating the historical image feature sequence using the last image feature from the predicted image feature sequence can reduce the computational load of the prediction process and improve the temporal sparsity of the historical image feature sequence, which can further improve the quality of the predicted image data.
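
The first-in, first-out update with only the last predicted image feature can be sketched with a bounded queue. This is a minimal sketch under stated assumptions: the name `fifo_update` and the string placeholders standing in for image features are hypothetical, and `collections.deque` with `maxlen` is simply one convenient way to discard the oldest entry when a new one is appended.

```python
from collections import deque


def fifo_update(memory_queue: deque, predicted_sequence: list) -> deque:
    """Append the last predicted feature; a full bounded deque drops its oldest entry."""
    memory_queue.append(predicted_sequence[-1])
    return memory_queue


# Placeholders stand in for image features (hypothetical values).
memory_queue = deque(["z_a", "z_b", "z_c"], maxlen=3)
fifo_update(memory_queue, ["p_1", "p_2", "p_3", "p_4"])
```

Only `p_4` enters the queue, so the intermediate predictions `p_1` to `p_3` are skipped, which corresponds to the temporal sparsity discussed in this embodiment.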

[0074] According to one embodiment of this application, each image feature in the historical image feature sequence includes: features corresponding to the image under each of multiple viewpoints; and directional features corresponding to that viewpoint.

[0075] In one example, the viewpoint can characterize the angle at which objects around the task execution device 120 are observed in the task scene, and the directional feature corresponding to the viewpoint can characterize the line of sight direction corresponding to the viewpoint.

[0076] In one example, each image feature in the historical image feature sequence can characterize the features (i.e., image information) corresponding to multiple images acquired at a single moment, which may have been acquired from multiple different viewpoints. Therefore, each image feature in the historical image feature sequence can include features corresponding to images from each of multiple viewpoints.

[0077] In another example, each image feature in a historical image feature sequence may include features corresponding to the image from each of multiple viewpoints, as well as the directional features corresponding to that viewpoint. For example, the features of an image acquired at a certain historical moment, along with the directional features from multiple viewpoints used when acquiring the image at that historical moment, can be concatenated to obtain an image feature, i.e., a visual latent code Z<1, 0>. The visual latent codes from multiple historical moments can constitute a historical image feature sequence. A historical image feature sequence can also be called a memory queue.
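
The concatenation of per-viewpoint features with directional features can be illustrated as follows. This is a minimal sketch, assuming that each viewpoint contributes a feature vector and a three-component line-of-sight direction, and that the two are interleaved per viewpoint; the function name `build_image_feature`, the vector sizes, and the interleaving order are hypothetical and not specified by this application.

```python
from typing import List, Sequence


def build_image_feature(view_features: Sequence[Sequence[float]],
                        view_directions: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate each viewpoint's feature with that viewpoint's direction."""
    if len(view_features) != len(view_directions):
        raise ValueError("one direction is required per viewpoint")
    latent: List[float] = []
    for feature, direction in zip(view_features, view_directions):
        latent.extend(feature)     # features of the image from this viewpoint
        latent.extend(direction)   # line-of-sight direction of this viewpoint
    return latent


views = [[0.1, 0.2], [0.3, 0.4]]                 # two viewpoints, toy features
directions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # toy unit direction vectors
z_1_0 = build_image_feature(views, directions)
memory_queue = [z_1_0]                           # one entry per historical moment
```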

[0078] In this embodiment, by adding directional features corresponding to the viewpoint to the image features, the image feature prediction model can be explicitly informed of the direction or viewpoint from which the observation information (i.e., each image feature in the historical image feature sequence) was observed. This allows the generated image feature sequence for future moments (i.e., the predicted image feature sequence) to also contain certain directional features, thus achieving better spatial consistency.

[0079] According to an embodiment of this application, step 230 may include: using the image feature prediction model to process the noisy image feature sequence based on the historical image feature sequence, and simultaneously generating each image feature in the predicted image feature sequence based on a bidirectional attention mechanism.

[0080] Specifically, during the processing of noisy image feature sequences based on historical image feature sequences (such as denoising or other processing), image feature prediction models can simultaneously generate each image feature in the predicted image feature sequence based on a bidirectional attention mechanism. This takes into account the mutual influence between the various image features in the predicted image feature sequence, improving the quality of the final predicted image feature sequence. For example, the image feature prediction model can implement the bidirectional attention mechanism using a diffusion generation paradigm based on a bidirectional Transformer in the instantaneous space, thereby achieving the effect of simultaneously generating each image feature in the predicted image feature sequence.

[0081] Of course, in other embodiments, the image feature prediction model may generate each image feature in the predicted image feature sequence simultaneously or sequentially based on other mechanisms, and this application embodiment does not limit this.

[0082] In this embodiment, the image feature prediction model can simultaneously generate each image feature in the predicted image feature sequence based on a bidirectional attention mechanism. This takes into account the mutual influence between the various image features in the predicted image feature sequence, thereby improving the quality of the final predicted image feature sequence.

[0083] Furthermore, in some embodiments, the image feature prediction model can use a causal Transformer-based block autoregressive generation paradigm over a longer temporal period. This is equivalent to the predicted image feature sequence currently output by the model being generated based on a unidirectional attention mechanism relative to the previously output predicted image feature sequences. By using a causal Transformer-based block autoregressive generation paradigm over a longer temporal period and a bidirectional Transformer-based diffusion generation paradigm in the instantaneous space, the advantages of these two generation paradigms can be organically combined to achieve high-quality generation of four-dimensional spatial data. Here, four-dimensional spatial data may include a temporal dimension and three spatial dimensions.
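
One way to picture the combination of the two paradigms is as an attention mask that is bidirectional inside a block and causal across blocks. The following is a minimal sketch, assuming that image features are grouped into consecutive blocks (chunks) and that a feature may attend to every feature in its own block and in earlier blocks; the function name `block_causal_mask` and the block sizes are hypothetical, and the application does not state that the model is implemented with such a mask.

```python
from typing import List


def block_causal_mask(chunk_sizes: List[int]) -> List[List[bool]]:
    """mask[q][k] is True when the feature at position q may attend to position k."""
    # Map every position to the index of the block it belongs to.
    chunk_of = [idx for idx, size in enumerate(chunk_sizes) for _ in range(size)]
    n = len(chunk_of)
    # Same block: attention in both directions. Earlier block: visible. Later block: hidden.
    return [[chunk_of[k] <= chunk_of[q] for k in range(n)] for q in range(n)]


mask = block_causal_mask([2, 2, 2])  # three blocks of two features each
```

In this toy mask, the two features of the first block see each other but nothing later, while the features of the last block see all six positions.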

[0084] According to an embodiment of this application, step 230 may include: inputting historical image feature sequences and noisy image feature sequences into an image feature prediction model, and controlling the image feature prediction model to process the noisy image feature sequences based on the historical image feature sequences, based on task text encoding and / or action latent codes, to obtain predicted image feature sequences.

[0085] In one example, the input to the image feature prediction model may include control conditions in addition to historical image feature sequences and noisy image feature sequences. These control conditions can be used to control the image feature prediction model to generate image feature sequences that satisfy the control conditions, and then generate corresponding images based on these image feature sequences.

[0086] In one example, control conditions can be used to characterize the specific content of the task to be performed. For example, the specific content of the task to be performed could be: using a robotic arm to pick up a red ball from a table.

[0087] In one example, the control condition can be described by task text. For instance, the task text can describe the specific content of the task to be performed, and then a text encoder can encode the task text into a task text encoding. This task text encoding can then be used as the control condition input to the image feature prediction model. In this example, the task scenario corresponding to the task to be performed could be: planning the corresponding action scene based on textual instructions, i.e., visual planning. In one example, the task text encoding can be a text embedding vector, and the text encoder can be a text encoder based on a Contrastive Language-Image Pre-training (CLIP) architecture or another suitable encoder.

[0088] In one example, the control condition can be described by action instructions. For instance, the action instructions describe the specific content of the task to be performed, and then an action encoder encodes the action instructions into action latent codes. These action latent codes can then be input to the image feature prediction model as control conditions. In this example, the action instructions could be parameters such as the joint rotation angles of the robotic arm, and the task scenario corresponding to the task to be performed could be: planning the corresponding action scene based on the action instructions, i.e., simulation evaluation. The task to be performed can be used to test whether the action instructions are correct; that is, the image feature prediction model acts as an evaluator, used to test whether the action scene generated based on the output image feature sequence can execute the task to be performed. In one example, the action encoder could be an encoder based on a variational autoencoder (VAE) or another suitable encoder.

[0089] In one example, the task text encoding and action latent code can be input into the image feature prediction model simultaneously, and the image feature prediction model can be controlled to process the noisy image feature sequence based on the historical image feature sequence to obtain the predicted image feature sequence.

[0090] In this embodiment, by inputting control conditions into the image feature prediction model, the image feature prediction model can be applied to different task scenarios, thereby improving the flexibility of the image feature prediction model.

[0091] According to one embodiment of this application, each image feature in the historical image feature sequence is used to characterize three-dimensional Gaussian point cloud data, and the image feature prediction model includes a four-dimensional scene model. The image feature prediction model is used to predict the changes of three-dimensional Gaussian point cloud data over time.

[0092] In one example, the image feature prediction model includes a four-dimensional scene model, which can be used to predict a four-dimensional sequence of image features that can be used to represent a task scene. The four dimensions in the four-dimensional scene model can include a time dimension and three spatial dimensions.

[0093] In one example, both the historical image feature sequence and the noisy image feature sequence can be four-dimensional data. Each image feature in both sequences can be three-dimensional data, such as data used to characterize three-dimensional Gaussian point cloud data. The predicted image feature sequence output by the image feature prediction model can be used to predict the changes in three-dimensional Gaussian point cloud data over time.

[0094] In one example, the four-dimensional scene model can also be a four-dimensional world model, which can employ deep learning network architectures or other network architectures. For instance, this four-dimensional world model could be based on a multi-layer transformer architecture, used to generate four-dimensional spatiotemporal sequences, i.e., four-dimensional image feature sequences. In one example, this four-dimensional world model can be used in the embodied domain.

[0095] In this embodiment, existing 3D Gaussian point cloud data can be processed using a four-dimensional scene model to obtain predicted 3D Gaussian point cloud data. The image data generated based on the predicted 3D Gaussian point cloud data can have a certain logical consistency, meaning that the state of things presented in the image data can better conform to the laws of development. Furthermore, since the four-dimensional scene model is four-dimensional, the output image feature sequence is also four-dimensional. Therefore, image features corresponding to a suitable viewpoint can be selected from the output image feature sequence for decoding and rendering to obtain the image, or data corresponding to a suitable viewpoint can be selected from the decoded 3D Gaussian point cloud data for rendering to obtain the image. This can improve the quality of the final image (such as an image or video), for example, minimizing the occurrence of objects being occluded or distorted in the image.

[0096] Figure 3 is a flowchart illustrating an image data generation method provided in another exemplary embodiment of this application. The embodiment in Figure 3 is an example of the embodiment in Figure 2; for the parts they have in common, reference can be made to the descriptions in the above embodiments, which will not be repeated here. As shown in Figure 3, the image data generation method may include the following:

[0097] 310: Obtain historical image feature sequences.

[0098] In one example, for a historical moment where images need to be acquired, images can be captured from multiple viewpoints at that moment, resulting in multiple images, and the directional features corresponding to each viewpoint are recorded. Further, a multi-view 3D reconstruction method can be used to reconstruct 3D point cloud data from the multiple images at each historical moment, such as a static 3D Gaussian sphere in the shape of a point cloud. Then, a visual encoder can be used to encode the reconstructed 3D Gaussian sphere into a visual latent code, and the visual latent code is concatenated with the corresponding directional features to obtain an image feature Z<i,0>. Here, the 'i' in Z<i,0> can represent the sequence number; when i = 1, it indicates that the image feature is located in the initial historical image feature sequence. The 0 in Z<i,0> can indicate that no noise has been added.

[0099] In one example, an initial sequence of historical image features, or an initial memory queue, can be constructed by repeatedly stacking m copies of the image features (or visual latent codes).

[0100] 320: Generate a noisy image feature sequence based on the dimension of image features in the historical image feature sequence.

[0101] In one example, c Gaussian noises Z<i+1,T>, each with a dimension consistent with that of Z<i,0>, can be initialized to obtain the noisy image feature sequence, which is then concatenated to the end of the historical image feature sequence.
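
Steps 310 and 320 can be sketched together as building an initial memory queue from m copies of one image feature and appending c Gaussian noise features of the same dimension. This is a minimal sketch under stated assumptions: the function names `init_memory_queue` and `make_noise_sequence`, the use of standard normal noise (mean 0, standard deviation 1), and the toy values of m, c and the feature are hypothetical.

```python
import random
from typing import List


def init_memory_queue(z_initial: List[float], m: int) -> List[List[float]]:
    """Build the initial historical sequence by stacking m copies of one feature."""
    return [list(z_initial) for _ in range(m)]


def make_noise_sequence(feature_dim: int, c: int,
                        rng: random.Random) -> List[List[float]]:
    """Draw c noise features whose dimension matches the historical features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(feature_dim)] for _ in range(c)]


rng = random.Random(0)                  # fixed seed so the sketch is repeatable
z_1_0 = [0.5, -0.2, 0.1]                # toy image feature with no noise added
memory_queue = init_memory_queue(z_1_0, m=2)
noise = make_noise_sequence(len(z_1_0), c=4, rng=rng)
model_input = memory_queue + noise      # noise is concatenated after the history
```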

[0102] 330: Input the historical image feature sequence and the noisy image feature sequence into the image feature prediction model. Based on the task text encoding or action latent code, control the image feature prediction model to process the noisy image feature sequence based on the historical image feature sequence to obtain the predicted image feature sequence.

[0103] In one example, the concatenated historical image feature sequence and noisy image feature sequence can be input into the image feature prediction model, and the task text encoding or action latent code can be input into the image feature prediction model as a control condition. In one example, the image feature prediction model can use the reverse diffusion generation paradigm of a diffusion model and iterate for d rounds to denoise Z<i+1,T> in the noisy image feature sequence into Z<i+1,0>. This yields the predicted image feature sequence, which is equivalent to obtaining c noise-free 3D visual latent codes.
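
The d-round iteration can be pictured as a loop that repeatedly passes the current noisy features, the history and the control condition to the model. The following is a minimal sketch of that loop only. The `toy_denoiser` below is a hypothetical stand-in that pulls each noisy feature halfway toward the mean of the history; it is not a diffusion model and is not the image feature prediction model of this application.

```python
import random
from typing import Callable, List, Optional

Sequence2D = List[List[float]]


def toy_denoiser(history: Sequence2D, noisy: Sequence2D,
                 condition: Optional[object]) -> Sequence2D:
    """Hypothetical stand-in: move each noisy feature halfway to the history mean."""
    dim = len(history[0])
    mean = [sum(h[j] for h in history) / len(history) for j in range(dim)]
    return [[0.5 * x[j] + 0.5 * mean[j] for j in range(dim)] for x in noisy]


def denoise(history: Sequence2D, noisy: Sequence2D, d: int,
            denoiser: Callable[..., Sequence2D],
            condition: Optional[object] = None) -> Sequence2D:
    """Apply the denoiser for d rounds, conditioning every round on the history."""
    current = noisy
    for _ in range(d):
        current = denoiser(history, current, condition)
    return current


rng = random.Random(0)
history = [[1.0, 1.0], [1.0, 1.0]]
noisy = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(4)]
predicted = denoise(history, noisy, d=20, denoiser=toy_denoiser)
```

The number of output features equals the number of noise features, which matches the statement that denoising yields c noise-free latent codes.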

[0104] 340: Generate image data based on predicted image feature sequences.

[0105] In one example, the three-dimensional visual latent codes in Z<i+1,0> are decoded by a visual decoder to obtain three-dimensional point cloud data at each moment, such as a three-dimensional Gaussian sphere. The three-dimensional Gaussian sphere is then rendered onto any preset virtual camera using the Gaussian splatting method to obtain the image or video under that camera, thus obtaining the image data.

[0106] 350: Following the first-in-first-out rule, the historical image feature sequence is updated using the last image feature in the predicted image feature sequence to obtain the updated historical image feature sequence.

[0107] In one example, the last image feature in the predicted image feature sequence is added after the original historical image feature sequence, and the first image feature in the original historical image feature sequence is removed, thus obtaining an updated historical image feature sequence.

[0108] Furthermore, steps 320 to 340 can be repeated until a preset special end-of-sequence (EOS) latent code is generated. For example, during the repeated execution of step 320, c Gaussian noises Z<i+2,T>, each with a dimension consistent with that of Z<i+1,0>, can be initialized to obtain a new noisy image feature sequence, and the new noisy image feature sequence is concatenated to the end of the updated historical image feature sequence.

[0109] Figure 4 illustrates the image data generation process that can use a block autoregressive generation paradigm based on causal Transformer over a longer time period, and a diffusion generation paradigm based on bidirectional Transformer in a transient space.

[0110] As shown in Figure 4, Z<1, 0> can represent one image feature in the initial historical image feature sequence. Z<1, 0> can be reused to form a historical image feature sequence. One Z<1, 0> is omitted in Figure 4. Z<2, T> can represent Gaussian noise when i=2. Two Gaussian noises are shown in Figure 4, which can form a noisy image feature sequence. When Z<1, 0> and Z<2, T> are input into the generative model (i.e., the image feature prediction model), the generative model can denoise the noisy image feature sequence according to a certain step size. In one example, T steps of noise are added during the noise addition process, and the denoising process can accordingly be carried out in T steps. For example, after the denoising process in the t-th step (t is greater than or equal to 0 and less than or equal to T), the image feature Z<2, t> can be obtained. After the T-step denoising process, the image feature Z<2, 0> can be obtained, that is, the denoising is completed. The three-dimensional point cloud data represented by the two image features Z<2, t> in Figure 4 can be different.

[0111] Similarly, after the predicted image feature sequence is obtained, i.e., after two Z<2,0> features are obtained, Z<1,0>, Z<2,0>, and Z<3,T> can be input into the generative model to start a new round of image feature sequence prediction. Z<3,T> can represent Gaussian noise when i=3. Figure 4 shows two Gaussian noises, which can constitute a noisy image feature sequence. The generative model can denoise the noisy image feature sequence according to a certain step size. After the denoising process at step t (t is greater than or equal to 0 and less than or equal to T), the image feature Z<3,t> can be obtained. After the denoising process at step T, the image feature Z<3,0> can be obtained, i.e., denoising is complete. The three-dimensional point cloud data represented by the two image features Z<3,t> in Figure 4 can be different. In this way, multiple image feature sequence prediction processes can be performed until EOS is generated and the process terminates.
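
The chunk-by-chunk procedure, including termination on EOS, can be sketched as a loop. This is a minimal sketch under stated assumptions: image features are reduced to single floats, `toy_model` is a hypothetical stand-in for the generative model that counts upward and emits EOS on its third chunk, and the string `"EOS"` stands in for the special termination latent code.

```python
import random
from collections import deque

EOS = "EOS"  # hypothetical marker standing in for the termination latent code


def toy_model(history, noise, chunk_index):
    """Hypothetical stand-in: one output per noise entry, EOS added on the third chunk."""
    start = history[-1]
    chunk = [start + k + 1.0 for k in range(len(noise))]
    if chunk_index == 2:
        chunk.append(EOS)
    return chunk


def rollout(initial_feature, m, c, model, max_chunks=10, seed=0):
    """Generate chunks until EOS appears or max_chunks is reached."""
    rng = random.Random(seed)
    memory = deque([initial_feature] * m, maxlen=m)      # memory queue of m copies
    generated = []
    for chunk_index in range(max_chunks):
        noise = [rng.gauss(0.0, 1.0) for _ in range(c)]  # fresh noise for each chunk
        chunk = model(list(memory), noise, chunk_index)
        if EOS in chunk:
            generated.extend(chunk[:chunk.index(EOS)])   # keep features before EOS
            break
        generated.extend(chunk)
        memory.append(chunk[-1])  # first-in, first-out update with the last feature
    return generated


frames = rollout(0.0, m=2, c=2, model=toy_model)
```

The `max_chunks` bound is an added safeguard so that the loop ends even if the model never produces EOS; it is not part of the described method.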

[0112] In some embodiments, during the image data generation process, a key-value (K / V) caching method can be used to store data, such as storing historical image feature sequences. The K / V caching method can ensure that previously calculated data does not need to be calculated again, thereby improving computational efficiency.
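
The idea of not recomputing previously calculated data can be illustrated with a small cache. This is a minimal sketch, assuming that each image feature has a stable identifier and that a function `project` (hypothetical here) produces its key and value; an actual Transformer K / V cache would typically store such tensors per attention layer, which this sketch does not model.

```python
from typing import Callable, Dict, List, Tuple

KeyValue = Tuple[List[float], List[float]]


class KVCache:
    """Store the key and value computed for each feature so they are computed once."""

    def __init__(self, project: Callable[[List[float]], KeyValue]) -> None:
        self._project = project
        self._store: Dict[int, KeyValue] = {}
        self.computed = 0  # counts how many projections were actually calculated

    def get(self, feature_id: int, feature: List[float]) -> KeyValue:
        if feature_id not in self._store:
            self._store[feature_id] = self._project(feature)
            self.computed += 1
        return self._store[feature_id]


def toy_project(feature: List[float]) -> KeyValue:
    """Hypothetical projection: key doubles the feature, value shifts it by one."""
    return [2.0 * x for x in feature], [x + 1.0 for x in feature]


cache = KVCache(toy_project)
first = cache.get(0, [1.0, 2.0])   # calculated and stored
second = cache.get(0, [1.0, 2.0])  # served from the cache
```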

[0113] This application embodiment also provides a model training method, as shown in Figure 5, which may include the following:

[0114] 510: Select multiple image features from the initial image feature sequence as the sample historical image feature sequence.

[0115] In one example, the initial image feature sequence can be an existing complete image feature sequence, which can be used to train an image feature prediction model. For example, multiple image features can be selected from the beginning of the initial image feature sequence to form a sample historical image feature sequence. These multiple image features can be continuous or discontinuous in time, and the number of multiple image features can be set according to actual needs.

[0116] In one example, multiple image features can be selected as a sample historical image feature sequence according to certain rules or randomly from the initial image feature sequence.
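
Random selection of a sample historical image feature sequence can be sketched as drawing indices from the initial image feature sequence. This is a minimal sketch under stated assumptions: the function name `select_sample_history` is hypothetical, the selected indices are kept in temporal order, and the last `num_targets` positions are reserved as targets so that they follow the selected history in time.

```python
import random
from typing import List


def select_sample_history(sequence_length: int, k: int, num_targets: int,
                          rng: random.Random) -> List[int]:
    """Pick k distinct positions, in temporal order, before the reserved targets."""
    last_allowed = sequence_length - num_targets
    if k > last_allowed:
        raise ValueError("not enough features before the target positions")
    return sorted(rng.sample(range(last_allowed), k))


rng = random.Random(0)
history_idx = select_sample_history(sequence_length=10, k=2, num_targets=4, rng=rng)
target_idx = list(range(10 - 4, 10))  # assumption: targets are the last four features
```

Because the positions are drawn at random, the selected features may be discontinuous in time.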

[0117] 520: Obtain the sample noisy image feature sequence.

[0118] In one example, the sample noisy image feature sequence follows the sample historical image feature sequence in chronological order, and the dimension of the image features in the sample noisy image feature sequence is the same as the dimension of the image features in the sample historical image feature sequence. In another example, the sample noisy image feature sequence can be generated based on the dimension of the image features in the sample historical image feature sequence. The method for generating the sample noisy image feature sequence is similar to the method for generating the noisy image feature sequence described above, and will not be repeated here.

[0119] 530: Input the sample historical image feature sequence and the sample noisy image feature sequence into the image feature prediction model to obtain the predicted sample image feature sequence.

[0120] In one example, an image feature prediction model can be used to denoise or otherwise process the sample noisy image feature sequence based on the sample historical image feature sequence to obtain the predicted sample image feature sequence.

[0121] In one example, each image feature in the predicted sample image feature sequence is used to characterize image information in three-dimensional space.

[0122] In one example, the process by which the image feature prediction model processes the sample noisy image feature sequence based on the sample's historical image feature sequence to obtain the predicted sample image feature sequence is similar to the process described above, where the image feature prediction model processes the noisy image feature sequence based on the historical image feature sequence to obtain the predicted image feature sequence. To avoid repetition, it will not be described again here.

[0123] In one example, during training, the input to the image feature prediction model may include a sequence of historical image features and a sequence of noisy image features. Furthermore, the input to the image feature prediction model may also include control conditions, the details of which can be found in the descriptions of the above embodiments and will not be repeated here to avoid repetition.

[0124] 540: The image feature prediction model is trained based on the difference between the predicted sample image feature sequence and the target image feature sequence in the initial image feature sequence, resulting in the trained image feature prediction model.

[0125] In one example, the target image feature sequence and the predicted sample image feature sequence are consistent in the time dimension. Based on the difference between the predicted sample image feature sequence and the target image feature sequence, the parameters of the image feature prediction model can be adjusted to obtain the trained image feature prediction model. The trained image feature prediction model can be used to execute the image data generation method described above. In one example, the execution entity of this model training method can be the aforementioned computer device 110, task execution device 120, or other devices.
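
Adjusting model parameters based on the difference between the predicted and target sequences can be illustrated with a deliberately small example. This is a minimal sketch, assuming a mean squared error loss, plain gradient descent, and a hypothetical one-parameter model that moves each noisy value toward the history mean by a learned fraction `w`; none of these choices is stated in this application, which does not specify the loss or the optimizer.

```python
from typing import List, Tuple

Sequence2D = List[List[float]]


def predict(w: float, history_mean: List[float], noisy: Sequence2D) -> Sequence2D:
    """Hypothetical model: move each noisy value a fraction w toward the history mean."""
    return [[x + w * (h - x) for x, h in zip(feat, history_mean)] for feat in noisy]


def loss_and_grad(w: float, history_mean: List[float], noisy: Sequence2D,
                  target: Sequence2D) -> Tuple[float, float]:
    """Mean squared error against the target and its derivative with respect to w."""
    pred = predict(w, history_mean, noisy)
    count = sum(len(feat) for feat in noisy)
    loss, grad = 0.0, 0.0
    for p_feat, x_feat, y_feat in zip(pred, noisy, target):
        for p, x, y, h in zip(p_feat, x_feat, y_feat, history_mean):
            loss += (p - y) ** 2
            grad += 2.0 * (p - y) * (h - x)
    return loss / count, grad / count


history_mean = [1.0]
noisy = [[3.0], [-1.0]]    # toy sample noisy image feature sequence
target = [[1.0], [1.0]]    # toy target image feature sequence
w = 0.0
initial_loss, _ = loss_and_grad(w, history_mean, noisy, target)
for _ in range(50):
    _, grad = loss_and_grad(w, history_mean, noisy, target)
    w -= 0.05 * grad       # gradient descent step with a toy learning rate
final_loss, _ = loss_and_grad(w, history_mean, noisy, target)
```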

[0126] This application provides a model training method. By inputting historical image feature sequences and noisy image feature sequences into an image feature prediction model, a predicted sample image feature sequence is obtained. Based on the difference between the predicted sample image feature sequence and the target image feature sequence, a trained image feature prediction model is obtained. The predicted image feature sequence output by the image feature prediction model obtained by this model training method represents the state of things in accordance with the laws of development. Therefore, the image data generated based on this predicted image feature sequence is of high quality, which can improve the accuracy of the robot's decisions or plans based on the image data, thereby improving the task completion effect. Furthermore, since each image feature in the predicted image feature sequence is used to represent image information in three-dimensional space, the predicted image feature sequence can be presented according to the required perspective, thereby further improving the quality of the final image or video, for example, minimizing the occurrence of occlusion or deformation of objects in the image or video.

[0127] In some embodiments, multiple image features are selected from the initial image feature sequence to form a sample historical image feature sequence. These multiple image features may be discontinuous in time, i.e., the sample historical image feature sequence is sparse in time. Using the sparse sample historical image feature sequence to train the image feature prediction model can improve the model's generalization ability.

[0128] Figure 6 illustrates the training and inference phases of the model provided in an exemplary embodiment of this application. As shown in Figure 6, during the training phase, two image features can be randomly selected from the initial image feature sequence as the sample historical image feature sequence, corresponding to the random selection of historical latent codes in Figure 6. Based on the dimension of the image features in the sample historical image feature sequence, four Gaussian noises can be generated to obtain the sample noisy image feature sequence. The sample historical image feature sequence and the sample noisy image feature sequence are input into the image feature prediction model, which can also be called the generative model. This generative model outputs the predicted sample image feature sequence (also called the denoised sample image feature sequence). The predicted sample image feature sequence may include four image features: V1, V2, V3, and V4. The loss can be calculated based on the difference between the predicted sample image feature sequence and the target image feature sequence. The parameters of the generative model can be adjusted based on the loss.

[0129] During the inference phase, two image features can constitute a historical image feature sequence. Based on the dimension of the image features in the historical image feature sequence, four Gaussian noises can be generated to obtain the noisy image feature sequence. The historical image feature sequence and the noisy image feature sequence are input into the generative model, which can output a predicted image feature sequence that may include four image features. The time length corresponding to the predicted image feature sequence can be regarded as a time window, and this inference can be the i-th inference, so it can be represented as Chunk i. The last image feature in the predicted image feature sequence is used to update the original historical image feature sequence, yielding an updated historical image feature sequence that includes two image features. The updated historical image feature sequence can be used for the (i+1)-th inference, i.e., Chunk i+1. Similarly, four Gaussian noises can be generated for the updated historical image feature sequence to obtain the noisy image feature sequence. The updated historical image feature sequence and the noisy image feature sequence are input into the generative model to output the predicted image feature sequence, which may include four image features and EOS. EOS can be a predefined special termination latent code; generating EOS can indicate the termination of the inference iteration process.

[0130] Exemplary device

[0131] Figure 7 is a schematic diagram of the structure of an image data generation apparatus provided in an exemplary embodiment of this application. As shown in Figure 7, the image data generation apparatus 700 includes: an acquisition module 710, a prediction module 720, and a generation module 730.

[0132] The acquisition module 710 is used to acquire historical image feature sequences, wherein the historical image feature sequences include multiple image features corresponding to multiple historical moments. The acquisition module 710 is also used to acquire noisy image feature sequences, wherein the noisy image feature sequences are ordered after the historical image feature sequences in time, and the dimension of the image features in the noisy image feature sequences is consistent with the dimension of the image features in the historical image feature sequences. The prediction module 720 is used to input the historical image feature sequences and the noisy image feature sequences into an image feature prediction model to obtain a predicted image feature sequence, wherein each image feature in the predicted image feature sequence is used to represent image information in three-dimensional space. The generation module 730 is used to generate image data based on the predicted image feature sequences.

[0133] This application provides an image data generation apparatus. By acquiring historical image feature sequences and noisy image feature sequences, and inputting these sequences into an image feature prediction model, a predicted image feature sequence can be obtained. Image data can then be generated based on this predicted sequence. In this embodiment, the noisy image feature sequence follows the historical image feature sequence in chronological order. The dimension of the image features in the noisy image feature sequence is consistent with the dimension of the image features in the historical image feature sequence. The historical image feature sequence includes multiple image features corresponding to multiple historical moments. The image feature prediction model can process the noisy image feature sequence based on this historical sequence, ensuring that the state of things represented by the predicted image feature sequence conforms to the laws of development. Therefore, the image data generated based on this predicted image feature sequence is of high quality, which can improve the accuracy of decisions or plans made by the robot based on this image data, thereby improving the task completion effect. Furthermore, since each image feature in the predicted image feature sequence is used to characterize image information in three-dimensional space, the predicted image feature sequence can be presented according to the required perspective, thereby further improving the quality of the final image or video, for example, minimizing the occurrence of objects being occluded or deformed in the image or video.

[0134] According to one embodiment of this application, the image data generation apparatus 700 further includes an update module 740, configured to: update the historical image feature sequence using at least one image feature from the predicted image feature sequence to obtain an updated historical image feature sequence; repeatedly execute the steps of obtaining a noisy image feature sequence and inputting the historical image feature sequence and the noisy image feature sequence into an image feature prediction model to obtain a predicted image feature sequence, using the updated historical image feature sequence as the historical image feature sequence.

[0135] According to one embodiment of this application, the update module 740 is used to update the historical image feature sequence using the last image feature in the predicted image feature sequence, so as to obtain an updated historical image feature sequence.

[0136] According to one embodiment of this application, the update module 740 is used to: update the historical image feature sequence using the last image feature in the predicted image feature sequence according to the first-in-first-out rule, so as to obtain the updated historical image feature sequence.
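The update rules in the three embodiments above can be illustrated with a bounded queue. The sketch below is only an assumption about how such an update might be written: `update_history` and its `keep_last` parameter are hypothetical names, and the features are represented by placeholder strings. It relies on the standard behavior of Python's `collections.deque` with `maxlen` set, which discards items from the opposite end when new items are appended.

```python
from collections import deque
from typing import Deque, Sequence, TypeVar

T = TypeVar("T")


def update_history(history: Deque[T], predicted: Sequence[T], keep_last: int = 1) -> Deque[T]:
    """Push the last `keep_last` predicted features into a bounded FIFO history.

    keep_last=1 corresponds to updating with the last predicted feature only;
    larger values correspond to updating with "at least one" predicted feature.
    """
    if history.maxlen is None:
        raise ValueError("history must be a bounded deque (maxlen set)")
    if not 1 <= keep_last <= len(predicted):
        raise ValueError("keep_last must be between 1 and len(predicted)")
    # Appending to a full bounded deque evicts the oldest entries first (first in, first out).
    history.extend(predicted[-keep_last:])
    return history


history = deque(["h0", "h1"], maxlen=2)
update_history(history, ["p0", "p1", "p2", "p3"])
print(list(history))  # prints: ['h1', 'p3']
```

Keeping the history length fixed means the model input has the same shape at every iteration, which is one plausible reason for preferring a first-in-first-out update.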

[0137] According to one embodiment of this application, each image feature in the historical image feature sequence includes: features corresponding to the image under each of multiple viewpoints; and directional features corresponding to that viewpoint.
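One possible way to organize such a multi-view image feature is shown below. This is a hypothetical data layout for illustration only: the class names `ViewFeature` and `ImageFeature`, the viewpoint names, and the use of a three-component vector as the directional feature are all assumptions that are not specified in this application.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ViewFeature:
    image_latent: List[float]   # feature of the image seen from this viewpoint
    direction: List[float]      # directional feature of this viewpoint (assumed to be a vector)


@dataclass(frozen=True)
class ImageFeature:
    views: Dict[str, ViewFeature]   # one entry per viewpoint at a single moment

    def shape(self) -> Dict[str, Tuple[int, int]]:
        """Return (latent length, direction length) for every viewpoint."""
        return {name: (len(v.image_latent), len(v.direction)) for name, v in self.views.items()}


def dimensions_match(a: ImageFeature, b: ImageFeature) -> bool:
    """Check that two image features have the same viewpoints and per-view sizes."""
    return a.shape() == b.shape()


historical = ImageFeature(views={
    "head": ViewFeature([0.1, 0.2, 0.3, 0.4], [0.0, 0.0, 1.0]),
    "wrist": ViewFeature([0.5, 0.6, 0.7, 0.8], [1.0, 0.0, 0.0]),
})
print(historical.shape())  # prints: {'head': (4, 3), 'wrist': (4, 3)}
```

A check such as `dimensions_match` could be used to confirm that a noise image feature has the same dimension as the historical image features before both are passed to the prediction model.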

[0138] According to one embodiment of this application, the prediction module 720 is used to: process the noisy image feature sequence based on the historical image feature sequence, using an image feature prediction model based on a bidirectional attention mechanism, and simultaneously generate each image feature in the predicted image feature sequence.
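The difference between bidirectional and causal attention can be illustrated with a generic scaled dot-product attention sketch. The code below is not the model of this application: `attention_mask` and `attend` are hypothetical helpers, the tokens are toy two-dimensional vectors, and the same vectors are reused as queries, keys, and values for brevity. It only shows that, with a bidirectional mask, every position (historical or noisy) can attend to every other position, so all predicted features can be produced in one pass rather than one after another.

```python
import math
from typing import List

Vector = List[float]


def attention_mask(num_history: int, num_noisy: int, bidirectional: bool = True) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j."""
    n = num_history + num_noisy
    if bidirectional:
        return [[True] * n for _ in range(n)]
    return [[j <= i for j in range(n)] for i in range(n)]   # causal: no look-ahead


def attend(queries: List[Vector], keys: List[Vector], values: List[Vector],
           mask: List[List[bool]]) -> List[Vector]:
    """Masked scaled dot-product attention with a numerically stable softmax."""
    outputs = []
    for i, q in enumerate(queries):
        scale = math.sqrt(len(q))
        scores = [
            sum(a * b for a, b in zip(q, k)) / scale if mask[i][j] else -math.inf
            for j, k in enumerate(keys)
        ]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]   # masked positions get weight 0.0
        total = sum(weights)
        dim = len(values[0])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)])
    return outputs


# Two historical tokens followed by four noisy tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [2.0, 0.0], [0.0, 2.0]]
bidirectional_out = attend(tokens, tokens, tokens, attention_mask(2, 4, bidirectional=True))
causal_out = attend(tokens, tokens, tokens, attention_mask(2, 4, bidirectional=False))
```

Under the causal mask the first token can only see itself, whereas under the bidirectional mask its output already mixes information from all six positions.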

[0139] According to one embodiment of this application, the prediction module 720 is used to: input the historical image feature sequence and the noisy image feature sequence into the image feature prediction model and, based on a task text encoding and/or action latent codes, control the image feature prediction model to process the noisy image feature sequence based on the historical image feature sequence, to obtain the predicted image feature sequence.

[0140] According to one embodiment of this application, each image feature in the historical image feature sequence is used to characterize three-dimensional Gaussian point cloud data, and the image feature prediction model includes a four-dimensional scene model. The image feature prediction model is used to predict the changes of three-dimensional Gaussian point cloud data over time.

[0141] It should be understood that, for the operation and function of the acquisition module 710, prediction module 720, generation module 730, and update module 740 in the above embodiments, reference may be made to the description of the image data generation method provided in the embodiments of Figure 2 or Figure 3 above. To avoid repetition, they are not described again here.

[0142] Figure 8 is a schematic diagram of the structure of a model training device provided in an exemplary embodiment of this application. As shown in Figure 8, the model training device 800 includes: a selection module 810, an acquisition module 820, a prediction module 830, and a training module 840.

[0143] The selection module 810 is used to select multiple image features from the initial image feature sequence as sample historical image feature sequences. The acquisition module 820 is used to acquire sample noisy image feature sequences, wherein the sample noisy image feature sequences are ordered after the sample historical image feature sequences in time, and the dimension of the image features in the sample noisy image feature sequences is consistent with the dimension of the image features in the sample historical image feature sequences. The prediction module 830 is used to input the sample historical image feature sequences and the sample noisy image feature sequences into the image feature prediction model to obtain predicted sample image feature sequences, wherein each image feature in the predicted sample image feature sequences is used to represent image information in three-dimensional space. The training module 840 is used to train the image feature prediction model based on the difference between the predicted sample image feature sequences and the target image feature sequences in the initial image feature sequences, to obtain a trained image feature prediction model, wherein the target image feature sequences are consistent with the predicted sample image feature sequences in the time dimension.
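The construction of one training example and its loss can be sketched as follows. This is an illustration under stated assumptions and not the training procedure of this application: `make_training_example` and `mean_squared_error` are hypothetical names, the sample history is assumed to be a contiguous window of the initial sequence, the target is assumed to be the window that immediately follows it, and mean squared error is used as one possible measure of the "difference" because the text does not name a specific loss.

```python
import random
from typing import List, Tuple

Feature = List[float]


def make_training_example(
    initial_sequence: List[Feature], start: int, history_len: int,
    chunk_len: int, rng: random.Random,
) -> Tuple[List[Feature], List[Feature], List[Feature]]:
    """Return (sample history, sample noise, target) cut from one initial sequence."""
    end = start + history_len + chunk_len
    if history_len < 1 or chunk_len < 1 or start < 0 or end > len(initial_sequence):
        raise ValueError("window does not fit inside the initial sequence")
    history = initial_sequence[start:start + history_len]
    target = initial_sequence[start + history_len:end]   # same time span as the prediction
    dim = len(history[0])
    # One Gaussian noise feature per target moment, same dimension as the history features.
    noisy = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in target]
    return history, noisy, target


def mean_squared_error(predicted: List[Feature], target: List[Feature]) -> float:
    """Average squared difference over all features and all components."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must cover the same moments")
    total, count = 0.0, 0
    for p, t in zip(predicted, target):
        for a, b in zip(p, t):
            total += (a - b) ** 2
            count += 1
    return total / count


sequence = [[float(t)] * 3 for t in range(10)]   # toy initial image feature sequence
history, noisy, target = make_training_example(sequence, 1, 2, 4, random.Random(0))
print(mean_squared_error(noisy, target) > 0.0)   # an untrained "prediction" has nonzero loss
```

In an actual training loop, the output of the image feature prediction model would take the place of `noisy` in the loss call, and the loss would be used to update the model parameters.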

[0144] This application provides a model training apparatus that inputs a sample historical image feature sequence and a sample noisy image feature sequence into an image feature prediction model to obtain a predicted sample image feature sequence. Based on the difference between the predicted sample image feature sequence and the target image feature sequence, a trained image feature prediction model is obtained. The predicted image feature sequence output by the image feature prediction model obtained by this model training method represents the state of things in accordance with the laws of development. Therefore, the image data generated based on this predicted image feature sequence is of high quality, which can improve the accuracy of the robot's decisions or plans based on the image data, thereby improving the task completion effect. Furthermore, since each image feature in the predicted image feature sequence is used to represent image information in three-dimensional space, the predicted image feature sequence can be presented according to the required perspective, thereby further improving the quality of the final image or video, for example, minimizing the occurrence of occlusion or deformation of objects in the image or video.

[0145] It should be understood that, for the operation and function of the selection module 810, acquisition module 820, prediction module 830, and training module 840 in the above embodiments, reference may be made to the description of the model training method provided in the embodiment of Figure 5 above. To avoid repetition, they are not described again here.

[0146] This application also provides a mobile device, which includes a control module for executing the image data generation method provided in any of the above embodiments.

[0147] In some embodiments, the mobile device may be a robot, such as a dual-arm robot, or it may be a vehicle or other device that can be used to perform a specific task.

[0148] For the operation and functions of the mobile device provided in this embodiment, reference may be made to the description of the image data generation method provided in the embodiments of Figure 2 or Figure 3 above. To avoid repetition, they are not described again here.

[0149] Figure 9 is a block diagram of an electronic device 900 for performing an image data generation method or a model training method according to an exemplary embodiment of this application. The electronic device 900 may specifically be a server, a mobile device, a control device for a mobile device, a server interacting with a mobile device, or other devices.

[0150] Referring to Figure 9, the electronic device 900 includes a processing component 910, which further includes one or more processors, and memory resources represented by memory 920 for storing instructions executable by the processing component 910, such as application programs. The application programs stored in memory 920 may include one or more modules, each corresponding to a set of instructions. Furthermore, the processing component 910 is configured to execute instructions to perform the aforementioned image data generation method or model training method.

[0151] Electronic device 900 may also include a power supply component configured to perform power management of electronic device 900, a wired or wireless network interface configured to connect electronic device 900 to a network, and an input/output (I/O) interface. Electronic device 900 can be operated based on an operating system stored in memory 920, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or similar.

[0152] This application also provides a non-transitory computer-readable storage medium. When the instructions in the storage medium are executed by the processor of the aforementioned electronic device 900, the electronic device 900 is able to execute the image data generation method or the model training method.

[0153] A computer program product includes a computer program that, when executed by a processor of a computer device, enables the computer device to perform the image data generation method or model training method provided in any of the above embodiments.

[0154] All of the above-mentioned optional technical solutions can be combined in any way to form optional embodiments of this application, and will not be described in detail here.

[0155] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0156] Those skilled in the art will understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.

[0157] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or other forms.

[0158] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0159] In addition, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.

[0160] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program verification codes, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0161] It should be noted that in the description of this application, the terms "first," "second," "third," etc., are used for descriptive purposes only and should not be construed as indicating or implying relative importance. Furthermore, in the description of this application, unless otherwise stated, "a plurality of" means two or more.

[0162] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation portals are provided for users to choose to authorize or refuse.

[0163] The above description is merely a preferred embodiment of this application and is not intended to limit this application. Any modifications or equivalent substitutions made within the spirit and principles of this application should be included within the protection scope of this application.

Claims

1. An image data generation method, characterized by comprising: obtaining a historical image feature sequence, wherein the historical image feature sequence includes multiple image features respectively corresponding to multiple historical moments; obtaining a noisy image feature sequence, wherein the noisy image feature sequence is located after the historical image feature sequence in time order, and the dimension of the image features in the noisy image feature sequence is consistent with the dimension of the image features in the historical image feature sequence; inputting the historical image feature sequence and the noisy image feature sequence into an image feature prediction model to obtain a predicted image feature sequence, wherein each image feature in the predicted image feature sequence is used to characterize image information in three-dimensional space; and generating image data based on the predicted image feature sequence.

2. The image data generation method according to claim 1, characterized by further comprising: updating the historical image feature sequence using at least one image feature from the predicted image feature sequence to obtain an updated historical image feature sequence; and, using the updated historical image feature sequence as the historical image feature sequence, repeating the steps of obtaining the noisy image feature sequence and inputting the historical image feature sequence and the noisy image feature sequence into the image feature prediction model to obtain the predicted image feature sequence.

3. The image data generation method according to claim 2, characterized in that the step of updating the historical image feature sequence using at least one image feature from the predicted image feature sequence to obtain the updated historical image feature sequence includes: updating the historical image feature sequence using the last image feature in the predicted image feature sequence to obtain the updated historical image feature sequence.

4. The image data generation method according to claim 2, characterized in that the step of updating the historical image feature sequence using at least one image feature from the predicted image feature sequence to obtain the updated historical image feature sequence includes: updating, according to the first-in-first-out rule, the historical image feature sequence using the last image feature in the predicted image feature sequence to obtain the updated historical image feature sequence.

5. The image data generation method according to any one of claims 1 to 4, characterized in that each image feature in the historical image feature sequence includes: features corresponding to the image from each of multiple viewpoints; and directional features corresponding to that viewpoint.

6. The image data generation method according to any one of claims 1 to 5, characterized in that the step of inputting the historical image feature sequence and the noisy image feature sequence into the image feature prediction model to obtain the predicted image feature sequence includes: processing, by the image feature prediction model based on a bidirectional attention mechanism, the noisy image feature sequence based on the historical image feature sequence, thereby generating each image feature in the predicted image feature sequence.

7. The image data generation method according to any one of claims 1 to 5, characterized in that the step of inputting the historical image feature sequence and the noisy image feature sequence into the image feature prediction model to obtain the predicted image feature sequence includes: inputting the historical image feature sequence and the noisy image feature sequence into the image feature prediction model and, based on a task text encoding and/or an action latent code, controlling the image feature prediction model to process the noisy image feature sequence based on the historical image feature sequence to obtain the predicted image feature sequence.

8. The image data generation method according to any one of claims 1 to 7, characterized in that each image feature in the historical image feature sequence is used to characterize three-dimensional Gaussian point cloud data, and the image feature prediction model includes a four-dimensional scene model and is used to predict the changes of the three-dimensional Gaussian point cloud data over time.

9. A model training method, comprising: selecting multiple image features from an initial image feature sequence as a sample historical image feature sequence; obtaining a sample noisy image feature sequence, wherein the sample noisy image feature sequence is located after the sample historical image feature sequence in time order, and the dimension of the image features in the sample noisy image feature sequence is consistent with the dimension of the image features in the sample historical image feature sequence; inputting the sample historical image feature sequence and the sample noisy image feature sequence into an image feature prediction model to obtain a predicted sample image feature sequence, wherein each image feature in the predicted sample image feature sequence is used to characterize image information in three-dimensional space; and training the image feature prediction model based on the difference between the predicted sample image feature sequence and a target image feature sequence in the initial image feature sequence to obtain a trained image feature prediction model, wherein the target image feature sequence and the predicted sample image feature sequence are consistent in the time dimension.

10. A mobile device, comprising a control module, wherein the control module is used to execute the image data generation method according to any one of claims 1 to 8.

11. An electronic device, comprising: a processor; and a memory used to store instructions executable by the processor, wherein the processor is used to execute the image data generation method according to any one of claims 1 to 8 or the model training method according to claim 9.

12. A computer-readable storage medium, characterized in that the storage medium stores a computer program for executing the image data generation method of any one of claims 1 to 8 or the model training method of claim 9.

13. A computer program product, characterized in that the computer program product includes a computer program that, when executed by the processor of a computer device, enables the computer device to perform the image data generation method of any one of claims 1 to 8 or the model training method of claim 9.

14. An image data generation apparatus, characterized by comprising: an acquisition module used to acquire a historical image feature sequence and a noisy image feature sequence, wherein the historical image feature sequence includes multiple image features respectively corresponding to multiple historical moments, the noisy image feature sequence is located after the historical image feature sequence in time order, and the dimension of the image features in the noisy image feature sequence is consistent with the dimension of the image features in the historical image feature sequence; a prediction module used to input the historical image feature sequence and the noisy image feature sequence into an image feature prediction model to obtain a predicted image feature sequence, wherein each image feature in the predicted image feature sequence is used to characterize image information in three-dimensional space; and a generation module used to generate image data based on the predicted image feature sequence.

15. A model training apparatus, comprising: a selection module used to select multiple image features from an initial image feature sequence as a sample historical image feature sequence; an acquisition module used to acquire a sample noisy image feature sequence, wherein the sample noisy image feature sequence is located after the sample historical image feature sequence in time order, and the dimension of the image features in the sample noisy image feature sequence is consistent with the dimension of the image features in the sample historical image feature sequence; a prediction module used to input the sample historical image feature sequence and the sample noisy image feature sequence into an image feature prediction model to obtain a predicted sample image feature sequence, wherein each image feature in the predicted sample image feature sequence is used to characterize image information in three-dimensional space; and a training module used to train the image feature prediction model based on the difference between the predicted sample image feature sequence and a target image feature sequence in the initial image feature sequence, so as to obtain a trained image feature prediction model, wherein the target image feature sequence and the predicted sample image feature sequence are consistent in the time dimension.