Method and system for generating simulation training data
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ANHUI KAIYANG TECHNOLOGY CO LTD
- Filing Date
- 2026-04-07
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies struggle to achieve high-fidelity realistic visual effects while ensuring the accuracy of image geometry, resulting in a low success rate for embodied intelligence strategies in the process of migrating from simulation to reality.
The initial simulation image is acquired in the simulation environment together with its geometric structure and semantic constraint information, and these are input into the pre-trained generative model to generate the target image. This ensures that the intrinsic geometric structure of the target image is consistent with the geometric structure features while introducing the visual style features of the real scene.
It improves the visual perception accuracy and task execution success rate of embodied intelligence strategies in the process of migrating from simulation to reality, and realizes the elimination of visual domain differences between simulation and reality through the generated simulation training data.
Figure CN122244593A_ABST
Description
Technical Field
[0001] This application relates to the field of robot vision technology, and in particular to a method and system for generating simulation training data.
Background Technology
[0002] With the rapid development of embodied intelligence and robotics, large-scale reinforcement learning training in simulation environments has become a key means of acquiring complex operation strategies. It is widely used in task scenarios that require high-precision perception, such as robotic arm grasping and precision assembly.
[0003] Existing technologies primarily use domain randomization, GAN (Generative Adversarial Network)-based style transfer, or NeRF (Neural Radiance Field) rendering to bridge the visual domain gap between simulation and reality. However, domain randomization often struggles to cover the complex non-parametric noise of the real world, and GAN-based style transfer, although it improves image fidelity, easily loses the geometric accuracy of object edges, depth, and key grasping points, causing the strategy to fail in real-world scenarios due to localization errors.
[0004] In summary, existing technologies struggle to achieve high-fidelity realistic visual effects while ensuring the accuracy of image geometry, resulting in a low success rate for embodied intelligence strategies in the process of transferring from simulation to reality.
Summary of the Invention
[0005] In view of this, the purpose of this application is to provide a method and system for generating simulation training data. By acquiring geometric structural features that correspond pixel by pixel to the initial simulation image and using them as strong constraints, and by using a pre-trained generative model to inject the visual style features of the real scene into the target image, it is possible to improve the visual fidelity of the simulation training data while ensuring that the internal geometric structure of the target image is aligned with the physical information of the simulation environment. This improves the visual perception accuracy and task execution success rate of embodied intelligence strategies in the process of transferring from simulation to reality.
[0006] In a first aspect, the present invention provides a method for generating simulation training data, comprising: obtaining the initial simulation image rendered in the simulation environment and the constraint information corresponding to the initial simulation image; wherein the constraint information includes geometric structural features used to characterize the scene corresponding to the initial simulation image, and the geometric structural features include feature information used to characterize the spatial structural relationships of objects in the scene.
[0007] The initial simulation image and constraint information are input as constraints into a pre-trained generative model to generate a target image. The target image has visual style features corresponding to the real scene, and the internal geometric structure of the target image is consistent with the geometric structure features.
[0008] Simulation training data is constructed based on the target image.
[0009] In an optional implementation, the constraint information further includes semantic constraint information. The step of obtaining the initial simulation image rendered in the simulation environment and the constraint information corresponding to the initial simulation image includes: The target scene is rendered in the simulation environment to obtain the initial simulation image.
[0010] Based on the rendering results of the simulation environment, the geometric structural features corresponding to the initial simulation image are extracted.
[0011] Semantic annotation is performed on the initial simulation image based on the scene information of the simulation environment to obtain semantic constraint information; wherein the semantic constraint information is used to characterize the category information of different objects in the initial simulation image.
[0012] Constraint information is generated based on geometric structural features and semantic constraint information.
[0013] In an optional implementation, the step of inputting the initial simulation image and constraint information as constraint conditions into a pre-trained generative model to generate the target image includes: inputting the initial simulation image into the generative model.
[0014] The constraint information is processed by feature extraction to obtain constraint features.
[0015] The constraint features and the image features corresponding to the initial simulation image are fused together to obtain the fused features.
[0016] Based on fusion features, a generative model is used to generate the target image.
[0017] During the generation of the target image, a geometric consistency constraint is applied to the target image to ensure that the intrinsic geometric structure of the target image remains consistent with the geometric structure features.
[0018] In an optional implementation, the training method for the generative model includes: the sample simulation image and the corresponding sample constraint information are input into the generative model to be trained to obtain the predicted image.
[0019] The predicted image is input into a preset geometric estimation model to obtain the predicted geometric structure features.
[0020] The difference between the predicted geometric structure features and the geometric structure features in the sample constraint information is calculated to determine the geometric fidelity loss.
[0021] The parameters of the generative model are optimized and adjusted based on the geometric fidelity loss to ensure that the intrinsic geometric structure of the target image generated by the generative model remains consistent with the geometric structure features.
[0022] In an optional implementation, after obtaining the predicted image, the method further includes: The predicted image and the sample simulation image are respectively input into the preset feature extraction model to obtain the predicted geometric features corresponding to the predicted image and the sample geometric features corresponding to the sample simulation image.
[0023] The feature constraint loss is determined based on the difference between the predicted geometric features and the sample geometric features.
[0024] Based on feature constraint loss and geometric fidelity loss, the parameters of the generative model are jointly optimized and adjusted.
[0025] In an optional implementation, after obtaining the predicted image, the method further includes: The predicted image is input into a preset semantic recognition model to obtain predicted semantic information.
[0026] The semantic consistency loss is determined based on the difference between the predicted semantic information and the semantic constraint information in the sample constraint information.
[0027] Based on semantic consistency loss and geometric fidelity loss, the parameters of the generative model are jointly optimized and adjusted.
[0028] In an optional implementation, the method further includes: During the generation of the target image, the target image features of the previous frame corresponding to the current frame are obtained and used as temporal context information.
[0029] The temporal context information and the constraint information of the current frame are input into the generative model to generate target images corresponding to multiple consecutive frames; wherein the target images corresponding to multiple consecutive frames are consistent in time series.
[0030] In an optional implementation, after the step of generating the target image, the method further includes: Acquire real-world environmental images and extract their visual style features.
[0031] Visual style features are input as additional constraint information into the generative model to generate an updated target image; wherein the visual style features of the updated target image are consistent with the visual style features of the real environment image.
[0032] In an optional implementation, the step of constructing simulation training data based on the target image includes: Obtain the action information and / or state information corresponding to the initial simulation image.
[0033] Associate the target image with action information and / or state information.
[0034] Simulation training data is constructed based on target images, action information, and / or state information.
[0035] In a second aspect, the present invention provides a system for generating simulation training data, comprising: a data acquisition module, used to acquire the initial simulation image rendered in the simulation environment and the constraint information corresponding to the initial simulation image; wherein the constraint information includes geometric structural features used to characterize the scene corresponding to the initial simulation image, and the geometric structural features include feature information used to characterize the spatial structural relationships of objects in the scene.
[0036] The image generation module is used to input the initial simulation image and constraint information as constraints into the pre-trained generative model to generate the target image. The target image has visual style features corresponding to the real scene, and the internal geometric structure of the target image is consistent with the geometric structure features.
[0037] The data construction module is used to construct simulation training data based on the target image.
[0038] This application provides a method and system for generating simulation training data. By inputting the initial simulation image rendered in the simulation environment and its corresponding geometric constraint information into a pre-trained generative model, the visual style features of the real scene and the inherent geometric structure of the simulation scene can be decoupled and efficiently fused. This improves the visual fidelity of the simulation training image and simulates complex non-parametric visual phenomena in the real world, while ensuring that the geometric information of the target image is completely aligned with the simulation environment at the pixel level. This eliminates the positioning error caused by geometric distortion in traditional style transfer and improves the generalization ability and task execution success rate of embodied intelligence strategies in the process of transferring from simulation to reality.
[0039] Other features and advantages of this application will be set forth in the following description and will be apparent in part from the description or may be learned by practicing the application. The objectives and other advantages of this application are realized and obtained through the structures particularly pointed out in the description, claims and drawings.
[0040] To make the above-mentioned objectives, features and advantages of this application more apparent and understandable, preferred embodiments are described below in detail with reference to the accompanying drawings.
Attached Figure Description
[0041] To more clearly illustrate the technical solutions in the specific embodiments of this application or the prior art, the drawings used in the description of the specific embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0042] Figure 1 is a flowchart of the method for generating simulation training data provided in an embodiment of this application;
Figure 2 is a flowchart of the method for obtaining the initial simulation image and constraint information provided in an embodiment of this application;
Figure 3 is a flowchart of the target image generation method provided in an embodiment of this application;
Figure 4 is a flowchart of the method for constructing simulation training data provided in an embodiment of this application;
Figure 5 is a schematic diagram of the simulation training data generation system provided in an embodiment of this application.
[0043] Reference numerals: 1-Data acquisition module; 2-Image generation module; 3-Data construction module.
Detailed Implementation
[0044] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0045] To help those skilled in the art better understand this application, a brief introduction to its application scenarios and design concepts is provided.
[0046] In existing technologies, methods such as domain randomization, style transfer based on generative adversarial networks (GANs), and high-fidelity rendering are commonly used to address the visual differences that exist during the transfer from simulation to reality. Domain randomization improves the model's robustness to different visual changes by randomly perturbing the texture, lighting, and color in the simulation environment. However, this method struggles to cover the complex unstructured visual features of the real environment and requires a large number of low-quality samples, resulting in low training efficiency. While GAN-based style transfer methods can improve the visual realism of images to some extent, they easily destroy the geometric structure information of the image during the transfer process, such as blurring object edges or shifting key positions, thus affecting the accuracy of subsequent tasks. High-fidelity rendering methods can generate near-realistic images, but their computational complexity is high, generation efficiency is low, and they struggle to simulate complex visual factors such as dynamic lighting changes and sensor noise. Therefore, existing technologies generally suffer from the problem of failing to maintain geometric consistency while improving the visual realism of images, resulting in poor adaptability of models trained on simulation data to real-world scenes, and making it difficult to effectively eliminate the visual domain differences between simulation and reality.
[0047] Based on this, embodiments of this application provide a method and system for generating simulation training data. By introducing constraint information to characterize geometric structural features and controlling the generation process of the target image in the generative model, the generated target image possesses visual style features consistent with the real scene while maintaining the original simulation scene's geometric structure. This application can improve the visual realism of the image while ensuring structural accuracy, thereby avoiding the structural distortion problem caused by style transfer in the prior art. Furthermore, by constructing simulation training data based on the target image, the training data can simultaneously possess realistic visual distribution and accurate structural information, thereby improving the generalization ability of the control strategy model trained based on this training data in real scenes. Consequently, it can effectively reduce the visual difference between the simulation environment and the real environment, improving the stability and execution success rate of the model in real applications.
[0048] To facilitate understanding of this embodiment, the embodiments of this application will be described in detail below.
[0049] This application provides a method for generating simulation training data. Referring to Figure 1, the method includes: Step S101: Obtain the initial simulation image rendered in the simulation environment and the constraint information corresponding to the initial simulation image; wherein the constraint information includes geometric structural features used to characterize the scene corresponding to the initial simulation image, and the geometric structural features include feature information used to characterize the spatial structural relationships of objects in the scene.
[0050] Here, the initial simulation image rendered in the simulation environment and the corresponding constraint information are obtained. The initial simulation image is an RGB (red, green, blue) image generated in real time by the rendering engine in the simulation environment (such as MuJoCo, Isaac Gym, or a differentiable rendering pipeline based on 3DGS (3D Gaussian Splatting)). The constraint information includes at least the geometric structural features used to characterize the scene corresponding to the initial simulation image. Specifically, the geometric structural features can be expressed by one or more of depth maps and surface normal maps.
[0051] Geometric features define the three-dimensional position, spatial shape, and surface orientation of objects in the simulation scene. In addition to geometric features, constraint information in optional embodiments also includes semantic constraint information, such as semantic segmentation maps or instance segmentation maps, to distinguish between the background, the robot body, and the target object in the scene, ensuring that the category attributes and semantic layout of objects do not change during image generation.
[0052] During the acquisition process, the differentiable rendering pipeline in the simulation environment simultaneously renders and extracts the corresponding multi-channel constraint information map while outputting the initial simulation image. The geometric features and semantic constraint information are perfectly aligned with the initial simulation image in pixel space. This alignment relationship constitutes an immutable strong constraint condition when generating the target image subsequently.
[0053] Step S102: Input the initial simulation image and constraint information as constraint conditions into the pre-trained generative model to generate the target image; wherein, the target image has visual style features corresponding to the real scene, and the internal geometric structure of the target image is consistent with the geometric structure features.
[0054] Here, the generative model (such as the conditional diffusion model based on the ControlNet architecture or the architecture based on the generative adversarial network GAN) serves as the core visual alignment generator, responsible for converting the simulated visual domain into the real visual domain.
[0055] Specifically, the generative model receives an initial simulated image, geometric features, semantic features, and a random noise vector as input. The random noise vector is used to introduce unstructured visual variations from the real world into the target image, such as sensor noise, random light spots, or microscopic textures. The generative model encodes the geometric and semantic features through an internal conditional encoder and injects them into different layers of the network, forcing the generated target image to resemble the real scene in visual style, but strictly adhering to the input geometric features in its intrinsic geometry.
[0056] The target image possesses visual style features corresponding to the real scene, including complex dynamic lighting and shadows, environmental occlusion, and visual representations of specific materials. To keep its geometry consistent with the geometric structure features, the generative model is trained with dedicated loss terms. The geometric fidelity loss penalizes any generated result that distorts geometric information. It is calculated by inputting the target image into a pre-trained geometric estimation network (such as a monocular depth estimation network) and comparing the estimated geometry map with the original geometric structure features. Furthermore, the system extracts features from the target image and the initial simulation image using a pre-trained feature extraction network (such as DINOv2) and minimizes the geometrically related feature differences between the two in the feature space (the feature-space geometric contrast loss), thereby further enhancing geometric invariance.
[0057] In an embodiment for continuous task execution, the target image features generated in the previous frame and the constraint information of the current frame are input into the generative model, and a temporal attention mechanism is used to ensure a smooth transition of the image sequence at the pixel level, avoiding drastic changes in visual noise.
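The frame-by-frame flow described above can be sketched as a simple loop that carries the previous frame's features forward as temporal context. This is only an illustrative sketch: the names `generate_sequence` and `blend_generator` are hypothetical, and the stand-in generator merely averages values instead of applying the temporal attention mechanism of a real generative model.

```python
from typing import Callable, List, Optional, Sequence, Tuple

import numpy as np

# Hypothetical signature: (current constraint, previous features) -> (image, features)
FrameGenerator = Callable[
    [np.ndarray, Optional[np.ndarray]], Tuple[np.ndarray, np.ndarray]
]


def generate_sequence(
    constraints: Sequence[np.ndarray], generate_frame: FrameGenerator
) -> List[np.ndarray]:
    """Generate one target image per frame, feeding each frame's features to the next."""
    prev_feats: Optional[np.ndarray] = None  # no temporal context for the first frame
    frames: List[np.ndarray] = []
    for constraint in constraints:
        image, feats = generate_frame(constraint, prev_feats)
        frames.append(image)
        prev_feats = feats  # becomes the temporal context of the next frame
    return frames


def blend_generator(constraint, prev_feats):
    """Stand-in for a generative model (assumption): average with the previous features."""
    image = constraint if prev_feats is None else 0.5 * constraint + 0.5 * prev_feats
    return image, image


frames = generate_sequence(
    [np.array([0.0]), np.array([1.0]), np.array([1.0])], blend_generator
)
```

With the stand-in generator, an abrupt change in the constraint input is spread over several frames, which is the kind of smooth transition the temporal context is meant to encourage.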
[0058] Step S103: Construct simulation training data based on the target image.
[0059] Here, the generated target image is associated with the action information, state information and reward signal recorded in the simulation environment to form a training dataset corresponding to embodied intelligence.
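As a minimal sketch of this association step, each generated target image can be paired with the simulator record logged at the same time step. The record layout (`SimStep`, `TrainingSample`) and the function `build_dataset` are hypothetical names chosen for illustration; the application does not prescribe a data format.

```python
from dataclasses import dataclass
from typing import Any, List, Sequence


@dataclass(frozen=True)
class SimStep:
    """One step logged by the simulator (hypothetical layout)."""
    state: Sequence[float]
    action: Sequence[float]
    reward: float


@dataclass(frozen=True)
class TrainingSample:
    """A target image associated with the simulator data of the same time step."""
    image: Any
    state: Sequence[float]
    action: Sequence[float]
    reward: float


def build_dataset(
    target_images: Sequence[Any], sim_log: Sequence[SimStep]
) -> List[TrainingSample]:
    """Pair the i-th target image with the i-th logged simulator step."""
    if len(target_images) != len(sim_log):
        raise ValueError("each target image needs exactly one logged simulator step")
    return [
        TrainingSample(img, step.state, step.action, step.reward)
        for img, step in zip(target_images, sim_log)
    ]


log = [SimStep((0.0, 0.1), (1.0,), 0.0), SimStep((0.2, 0.3), (0.5,), 1.0)]
dataset = build_dataset(["frame_0", "frame_1"], log)
```

Because the target image is generated from the simulation frame of the same time step, the pairing is purely index-based and needs no extra alignment step.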
[0060] Because the target images combine the high visual fidelity of the real world with the geometric accuracy of the simulation environment, reinforcement learning strategies (such as vision-based PPO or SAC algorithms) can be trained directly on the target image dataset. Since the target images bridge the visual domain gap between simulation and reality, the feature representations learned by the embodied intelligence strategy can be directly generalized to images captured by real robot cameras, thus achieving a high success rate of Sim2Real (simulation-to-reality) transfer.
[0061] In one specific embodiment, an online visual calibration mode is also included. When a real camera captures images of an extreme environment, the visual style features of that environment are automatically extracted and fed back to the generative model. This dynamically generates target images that better fit the current real-world environment and adds them to the training dataset, enabling online evolution and continuous learning of the simulation training data.
[0062] In an optional implementation, the constraint information may also include semantic constraint information.
[0063] Referring to Figure 2, step S101 includes the following steps S201-S204.
[0064] Step S201: Render the target scene in the simulation environment to obtain the initial simulation image.
[0065] Here, the target scene typically includes specific task scenarios such as a robotic arm grasping an object. The simulation system uses the rendering engine of a physics simulator such as MuJoCo or Isaac Gym, or employs the 3DGS differentiable rendering pipeline proposed by the applicant, to render the scene under different viewpoints and lighting conditions. To achieve efficient image generation and training, the hardware configuration typically uses high-performance GPUs, such as the NVIDIA A100 or RTX 4090, to accelerate the generation process. The initial simulation image serves as the original simulation RGB image ($I_{sim}$), which provides the basic visual composition and initial visual appearance of the scene.
[0066] Step S202: Based on the rendering results of the simulation environment, extract the geometric structural features corresponding to the initial simulation image.
[0067] Here, geometric structural features are key data characterizing the three-dimensional physical properties of a scene. The geometric structural features ($C_{geo}$) specifically include a depth map ($C_{geo\_depth}$) and a surface normal map ($C_{geo\_normal}$). In the depth map, pixel values represent the physical distance between the object's surface and the camera, while the surface normal map uses RGB color encoding to represent the XYZ components of the object's surface normal vector. In a 3DGS-based differentiable rendering pipeline, these geometric feature maps can be rendered and extracted in real time, synchronously with the initial simulation image, ensuring pixel-level alignment between the geometric features and the initial simulation image. These geometric features define the 3D position and surface orientation of objects in the scene.
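The RGB encoding of a surface normal map can be illustrated with a minimal sketch. It assumes unit normals with components in [-1, 1] and the common mapping (n + 1) / 2 scaled to 8 bits; the application does not state which encoding it uses, and the function names `encode_normals_rgb` and `decode_normals_rgb` are hypothetical.

```python
import numpy as np


def encode_normals_rgb(normals: np.ndarray) -> np.ndarray:
    """Map normals of shape (H, W, 3) with components in [-1, 1] to uint8 RGB."""
    # Re-normalize defensively so that every vector has unit length.
    length = np.linalg.norm(normals, axis=-1, keepdims=True)
    unit = normals / np.clip(length, 1e-8, None)
    return np.round((unit + 1.0) * 0.5 * 255.0).astype(np.uint8)


def decode_normals_rgb(rgb: np.ndarray) -> np.ndarray:
    """Invert the encoding, up to 8-bit quantization error."""
    n = rgb.astype(np.float64) / 255.0 * 2.0 - 1.0
    return n / np.clip(np.linalg.norm(n, axis=-1, keepdims=True), 1e-8, None)


# A flat surface facing the camera: normal (0, 0, 1) at every pixel.
flat = np.zeros((4, 4, 3))
flat[..., 2] = 1.0
rgb = encode_normals_rgb(flat)
recovered = decode_normals_rgb(rgb)
```

Under this assumed mapping, a surface facing the camera appears as the light blue tone typical of normal maps, and decoding recovers the original direction to within quantization error.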
[0068] Step S203: Based on the scene information of the simulation environment, the initial simulation image is semantically annotated to obtain semantic constraint information; wherein, the semantic constraint information is used to characterize the category information of different objects in the initial simulation image.
[0069] Here, the semantic constraint information ($C_{sem}$) includes semantic segmentation maps or instance segmentation maps. Semantic constraint information uses different color codes to distinguish the category information of different objects in the initial simulation image, such as accurately distinguishing the background, the robot body, and the target object. Since the simulation environment possesses complete scene metadata, the simulation system can automatically extract and generate pixel-level accurate semantic masks directly from the underlying information of the simulation engine. The role of semantic constraint information is to ensure that the generative model maintains the semantic structure of the scene during the generation of the target image, preventing object category confusion or geometric semantic collapse in the generated image.
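How a pixel-accurate semantic mask might be derived from simulator metadata can be sketched as follows. The sketch assumes that the simulator exposes a per-pixel instance-id buffer and a mapping from instance ids to categories; the class ids, the color palette, and the function names are hypothetical.

```python
import numpy as np

# Hypothetical class ids for the three categories named in the text.
BACKGROUND, ROBOT, TARGET = 0, 1, 2

# Hypothetical color code per class (RGB), indexed by class id.
PALETTE = np.array([[0, 0, 0], [0, 0, 255], [255, 0, 0]], dtype=np.uint8)


def semantic_mask(instance_ids: np.ndarray, instance_to_class: dict) -> np.ndarray:
    """Convert a per-pixel instance-id buffer (H, W) into a class-id mask (H, W)."""
    mask = np.full(instance_ids.shape, BACKGROUND, dtype=np.uint8)
    for instance, cls in instance_to_class.items():
        mask[instance_ids == instance] = cls
    return mask


def colorize(mask: np.ndarray) -> np.ndarray:
    """Turn a class-id mask (H, W) into a color-coded image (H, W, 3)."""
    return PALETTE[mask]


# Instance 7 is a robot link and instance 12 is the object to grasp (made-up ids).
ids = np.array([[0, 7, 7], [0, 12, 12]])
mask = semantic_mask(ids, {7: ROBOT, 12: TARGET})
color = colorize(mask)
```

No learned segmentation model is involved at this stage, which is why the mask can be exact at the pixel level.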
[0070] Step S204: Generate constraint information based on geometric structural features and semantic constraint information.
[0071] Here, constraint information, acting as multi-channel strong constraints, integrates geometric structural features and semantic constraint information to provide pixel-level precise geometric and semantic guidance for subsequent generative models, ensuring the structural invariance of the generated image. The simulation system packages the initial simulation image ($I_{sim}$) rendered in the simulation environment, the geometric structural features ($C_{geo}$), and the semantic constraint information ($C_{sem}$) into a triplet. Combining this triplet with real-world images ($I_{real}$) acquired by a high-resolution RGB-D camera (e.g., Intel RealSense D455) as the style target, the simulation system constructs paired training data ($I_{sim}$, $C_{geo}$, $C_{sem}$, $I_{real}$).
[0072] During the generation phase, this constraint information is input into a pre-trained generative model (such as a conditional diffusion model G, or alternatives such as generative adversarial network (GAN) or variational autoencoder (VAE) architectures). Regardless of the generative architecture used, the geometric fidelity loss ($L_{geo}$) is retained as a core constraint to ensure the geometric accuracy of the generated image. The constraint information provides pixel-accurate guidance for the generative model, ensuring that the final generated target image is visually realistic while its internal geometric structure is aligned pixel by pixel with the simulation environment.
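One possible way to assemble such multi-channel constraint information is to stack the pixel-aligned maps along the channel axis. The sketch below is an assumption about the layout (one normalized depth channel, three normal channels, one one-hot channel per class); the function name `build_constraint_map` and the parameter `max_depth` are hypothetical.

```python
import numpy as np


def build_constraint_map(depth, normals, sem_mask, num_classes, max_depth):
    """Stack depth (H, W), normals (H, W, 3) and a class mask (H, W) into (H, W, C)."""
    if depth.shape != sem_mask.shape or depth.shape != normals.shape[:2]:
        raise ValueError("constraint maps must share the same H x W pixel grid")
    depth_n = np.clip(depth / max_depth, 0.0, 1.0)[..., None]  # 1 channel in [0, 1]
    one_hot = np.eye(num_classes, dtype=np.float32)[sem_mask]  # num_classes channels
    return np.concatenate([depth_n, normals, one_hot], axis=-1).astype(np.float32)


depth = np.full((2, 2), 1.5)              # metres (assumed unit)
normals = np.zeros((2, 2, 3))
normals[..., 2] = 1.0                     # every pixel faces the camera
sem = np.array([[0, 1], [2, 2]])          # integer class ids
c = build_constraint_map(depth, normals, sem, num_classes=3, max_depth=3.0)
```

The shape check makes the pixel-level alignment requirement explicit: maps rendered on different grids are rejected rather than silently resampled.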
[0073] In an optional implementation, referring to Figure 3, step S102 includes the following steps S301-S305.
[0074] Step S301: Input the initial simulation image into the generative model.
[0075] Here, the preferred underlying architecture for the generative model is a conditional diffusion model based on the ControlNet architecture. In other feasible embodiments, the generative model may also employ a Generative Adversarial Network (GAN) or a Variational Autoencoder (VAE) as the underlying generative architecture.
[0076] Step S302: Perform feature extraction processing on the constraint information to obtain constraint features.
[0077] Here, the constraint information includes geometric structural features representing the scene corresponding to the initial simulation image, as well as semantic constraint information. The generative model uses a conditional encoder to convert the depth map, surface normal map, and semantic segmentation map into constraint features that can match the image features.
[0078] Step S303: The constraint features and the image features corresponding to the initial simulation image are fused to obtain the fused features.
[0079] Here, the generative model injects constraint features into multiple different layers of the main network (such as the U-Net structure) of the generative model through a conditional encoder, so that the constraint features can guide the generation of image features at each layer.
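The multi-layer injection described above can be sketched with plain arrays. The sketch assumes that the constraint map is average-pooled to the resolution of each network level, projected with a 1x1 convolution, and added to the image features; the zero-initialized projection mirrors the zero-convolution idea of ControlNet. All function names are hypothetical, and a real conditional encoder would be a learned network rather than a pooling step.

```python
import numpy as np


def avg_pool(x: np.ndarray, factor: int) -> np.ndarray:
    """Average-pool a (C, H, W) map by an integer factor that divides H and W."""
    c, h, w = x.shape
    return x.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))


def inject(image_feats, constraint, projections):
    """Add projected constraint features to each level of an image feature pyramid.

    image_feats: list of (C_i, H_i, W_i) arrays, one per network level.
    constraint:  (C_c, H, W) constraint map at full resolution.
    projections: list of (C_i, C_c) matrices acting as 1x1 convolutions.
    """
    fused = []
    for feat, proj in zip(image_feats, projections):
        factor = constraint.shape[1] // feat.shape[1]
        pooled = avg_pool(constraint, factor)            # (C_c, H_i, W_i)
        cond = np.einsum("oc,chw->ohw", proj, pooled)    # (C_i, H_i, W_i)
        fused.append(feat + cond)
    return fused


rng = np.random.default_rng(0)
constraint = rng.random((7, 8, 8))
feats = [rng.random((4, 8, 8)), rng.random((8, 4, 4))]

# Zero-initialized projections leave the image features untouched.
untouched = inject(feats, constraint, [np.zeros((4, 7)), np.zeros((8, 7))])
# Non-zero projections let the constraints influence every level.
guided = inject(feats, constraint, [np.ones((4, 7)), np.ones((8, 7))])
```

Starting from zero projections means that training begins from the unmodified backbone and the influence of the constraints grows only as the projections are learned.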
[0080] Step S304: Based on the fusion features, generate the target image through a generative model.
[0081] Here, the generative model generates the target image based on fused features and combined with a random noise vector z. The random noise vector z is used to introduce unstructured visual variations from the real world, such as sensor noise or random light spots.
[0082] In alternative embodiments requiring real-time visual alignment feedback, knowledge distillation or model quantization techniques can be used to compress the trained large generative model into a lightweight model, such as the lightweight diffusion model LCM, SDXL-Turbo, or MobileDiffusion. Furthermore, target image generation can be extended to joint generation and alignment of multimodal data. Besides generating RGB images as target images, infrared temperature maps, LiDAR depth maps, or tactile sensor data can be injected into the generative model as additional constraints or generated targets.
[0083] Step S305: During the generation of the target image, a geometric consistency constraint is applied to the target image to ensure that the intrinsic geometric structure of the target image remains consistent with the geometric structure features.
[0084] Here, during the generation of the target image, the generative model imposes geometric consistency constraints on the target image by calculating a loss function, thereby forcing the intrinsic geometric structure of the target image to maintain pixel-level alignment with the geometric structure features. The formula for calculating the total loss function when training the generative model is as follows:
$$L_{total} = L_{diff} + \lambda_{geo} L_{geo} + \lambda_{sem} L_{sem}$$
[0085] where $L_{total}$ represents the total objective loss value of the generative model during the training process. Minimizing the total objective loss value drives the generative model to learn both visual style and structural constraints simultaneously. $L_{diff}$ is the standard diffusion model denoising loss, for which the L2 loss function is typically used, to learn how to recover image styles with real-world visual features from random noise. $L_{sem}$ is the semantic consistency loss, for which a pre-trained semantic segmentation network is typically used to calculate the cross-entropy difference between the generated target image and the original semantic constraint information, to ensure that the object category layout in the target image does not change. $L_{geo}$ is the geometric fidelity loss, which is used to force the generated target image to remain consistent in geometric structure with the geometric constraint information provided by the simulation environment, and is the core constraint term for solving the geometric distortion problem. $\lambda_{geo}$ and $\lambda_{sem}$ represent the geometric fidelity loss weight coefficient and the semantic consistency loss weight coefficient, respectively, which are used to balance visual realism against physical structural constraints during training.
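A weighted-sum reading of the loss terms described above can be sketched as follows. The additive form, the function names, and the default weights are assumptions made for illustration; the semantic term is written as a mean per-pixel cross-entropy between class logits (as a segmentation network might output) and the simulator's class mask.

```python
import numpy as np


def semantic_cross_entropy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Mean per-pixel cross-entropy; logits (H, W, K), integer labels (H, W)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, labels[..., None], axis=-1)
    return float(-picked.mean())


def total_loss(l_diff, l_geo, l_sem, lambda_geo=1.0, lambda_sem=0.5):
    """Denoising loss plus weighted geometric and semantic terms (assumed form)."""
    return l_diff + lambda_geo * l_geo + lambda_sem * l_sem


labels = np.array([[0, 1], [2, 2]])
uniform_logits = np.zeros((2, 2, 3))      # the network is undecided among 3 classes
confident_logits = np.eye(3)[labels] * 20.0  # the network strongly predicts the labels

l_sem_uniform = semantic_cross_entropy(uniform_logits, labels)
l_sem_confident = semantic_cross_entropy(confident_logits, labels)
example_total = total_loss(1.0, 2.0, 4.0, lambda_geo=0.5, lambda_sem=0.25)
```

Raising the geometric weight pushes training toward structural accuracy at the expense of stylistic freedom, which is the balance the two weight coefficients are meant to control.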
[0086] The geometric fidelity loss $\mathcal{L}_{geo}$ is calculated as:

$$\mathcal{L}_{geo} = \lVert E(\hat{x}) - G \rVert_{1} + \lambda_{feat}\,\mathcal{L}_{feat}$$

[0087] Where $E$ represents a pre-trained geometry estimation network, such as a monocular depth estimation network (e.g., ZoeDepth or MiDaS) or a surface normal estimation network, which is responsible for inferring geometric information from the generated target image; $\hat{x}$ represents the realistic target image generated by the generative model; $E(\hat{x})$ represents the estimated geometry map (such as the estimated depth map or the estimated normal map) predicted by the geometry estimation network for the target image; $G$ represents the original geometric constraint information directly output by the simulation environment, that is, the geometric structure feature map that serves as the ground-truth reference; $\lVert\cdot\rVert_{1}$ represents the L1 loss function, used to calculate the per-pixel absolute difference between the estimated geometry and the original geometric constraint information; $\mathcal{L}_{feat}$ represents the geometric contrast loss in the feature space, obtained by inputting the target image and the initial simulation image into a pre-trained feature extraction network (such as DINOv2 or CLIP) and comparing their geometrically relevant feature representations; and $\lambda_{feat}$ represents the weighting coefficient of the feature space loss, used to adjust the contribution of the feature level to the geometric constraint.
[0088] When calculating the geometric fidelity loss, the simulation training data generation system inputs the target image into a pre-trained geometric estimation network (such as ZoeDepth or MiDaS monocular depth estimation network), and achieves geometric alignment by minimizing the difference between the geometric image back-estimated by the geometric estimation network and the original geometric structure features.
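The geometric fidelity calculation described above can be sketched with NumPy as follows. This is a sketch under stated assumptions: the per-pixel L1 term is averaged over pixels, the feature-space term is passed in as a precomputed number, and the 2x2 depth maps stand in for the output of a real estimator such as ZoeDepth or MiDaS.

```python
import numpy as np


def geometric_fidelity_loss(est_geometry, sim_geometry, feat_loss=0.0, lam_feat=0.1):
    """Mean per-pixel L1 gap plus a weighted feature-space term."""
    l1 = np.mean(np.abs(est_geometry - sim_geometry))
    return float(l1 + lam_feat * feat_loss)


# Toy 2x2 depth maps in metres: the simulator's reference map and a
# hypothetical map back-estimated from the generated target image.
sim_depth = np.array([[1.0, 1.0], [2.0, 2.0]])
est_depth = np.array([[1.1, 0.9], [2.0, 2.2]])

loss_geo = geometric_fidelity_loss(est_depth, sim_depth, feat_loss=0.5, lam_feat=0.1)
print(round(loss_geo, 4))
```

Because the reference geometry comes directly from the simulator, no manual annotation is needed; the loss only requires that the estimator and the simulator express geometry in the same units and resolution.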
[0089] In an alternative embodiment, the geometric fidelity loss can be defined entirely in the feature space, using the feature differences extracted by a large vision model instead of an explicit geometric estimation network. This back-validation loss design ensures that the target image is visually realistic while its intrinsic geometry remains pixel-level aligned with the simulation environment.
[0090] In response to the continuous operation characteristics of embodied intelligence, the simulation training data generation system can also introduce temporal context constraints during the generation process. By using a time-dimensional attention mechanism, it can ensure that the continuously generated target image sequence has smoothness on the time axis and avoid visual noise jumps.
[0091] In an optional implementation, the training method for the generative model in step S102 includes the following steps S401-S404.
[0092] Step S401: Input the sample simulation image and the sample constraint information corresponding to the sample simulation image into the generative model to be trained to obtain the predicted image.
[0093] Here, the method for generating simulation training data acquires the initial simulation image rendered by the simulation environment and extracts the corresponding depth map, surface normal map, and semantic segmentation map as sample constraint information. Specifically, the method constructs a triplet-paired training dataset containing the initial simulation image, sample constraint information, and real-world images captured by a high-resolution RGB-D camera. The method inputs the sample simulation images and their corresponding sample constraint information into the generative model to be trained, thereby obtaining predicted images with realistic visual styles. The generative model preferably employs a conditional diffusion model based on the ControlNet architecture, using a conditional encoder to encode the geometric and semantic information in the sample constraint information and inject it into the network layers. In an alternative embodiment, the generative model can also use a Generative Adversarial Network (GAN) or a Variational Autoencoder (VAE) as the basic generative architecture.
[0094] Step S402: Input the predicted image into the preset geometric estimation model to obtain the predicted geometric structure features.
[0095] Here, the geometric estimation model employs a pre-trained monocular depth estimation network or a surface normal estimation network. Acting as a pre-trained geometry perception module, the geometric estimation model performs inverse geometric derivation on the generated predicted image, extracts the depth information or surface orientation information contained in the predicted image, and outputs it as the predicted geometric structure features.
[0096] Step S403: Calculate the difference between the predicted geometric structure features and the geometric structure features in the sample constraint information, and determine the geometric fidelity loss.
[0097] Here, by inputting the predicted image and the initial simulation image into the pre-trained feature extraction network, the difference in geometrically related features between the predicted image and the initial simulation image in the feature space is minimized, thereby further enhancing geometric invariance.
[0098] In an alternative embodiment, the geometric fidelity loss can be fully defined in the feature space, using a large visual model as a feature extractor to calculate the feature differences between the generated image and the original image, thereby replacing the reliance on a specific geometric estimation model.
[0099] Step S404: Optimize and adjust the parameters of the generative model based on the geometric fidelity loss so that the intrinsic geometric structure of the target image generated by the generative model is consistent with the geometric structure features.
[0100] Here, the method for generating simulation training data uses calculated gradients for backpropagation. This forces the generative model to ensure that the target image generated is visually realistic while maintaining pixel-level alignment with the underlying geometric features. Through this back-validation loss design, the generative model learns the ability to perform high-fidelity style transfer while strictly adhering to physical geometric constraints, ultimately achieving high-precision alignment between simulation training data and the real visual domain.
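Steps S401 to S404 can be illustrated with a deliberately tiny numeric sketch. Everything in it is a stand-in chosen for illustration: the "generative model" is a single scalar gain `theta`, `estimate_geometry` is the identity map rather than a real depth network, and the update is a plain subgradient step on the L1 geometric fidelity loss.

```python
import numpy as np


def generate(sim_image, theta):
    """S401: hypothetical stand-in for the generative model."""
    return theta * sim_image


def estimate_geometry(image):
    """S402: hypothetical stand-in for a pre-trained depth estimator."""
    return image


def geo_loss(pred_geom, ref_geom):
    """S403: mean absolute difference to the simulator's geometry."""
    return float(np.mean(np.abs(pred_geom - ref_geom)))


sim_image = np.array([[0.5, 1.0], [1.5, 2.0]])
ref_depth = 2.0 * sim_image          # reference geometry from the simulator
theta, lr = 0.0, 0.1
history = []

for _ in range(10):
    pred = generate(sim_image, theta)
    geom = estimate_geometry(pred)
    history.append(geo_loss(geom, ref_depth))
    # S404: subgradient of the L1 loss with respect to theta, then update.
    grad = float(np.mean(np.sign(geom - ref_depth) * sim_image))
    theta -= lr * grad

print(history[0], history[-1], theta)
```

In a real system the gradient would flow through the frozen estimator into the generative network via automatic differentiation; the sketch only shows the direction of the update and that the loss decreases.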
[0101] In an optional implementation, after obtaining the predicted image in step S401, the method further includes the following steps S501-S503.
[0102] Step S501: Input the predicted image and the sample simulation image into the preset feature extraction model respectively to obtain the predicted geometric features corresponding to the predicted image and the sample geometric features corresponding to the sample simulation image.
[0103] Here, the feature extraction model employs a pre-trained large visual model. In specific implementations, the DINOv2 network or CLIP network can be used as the feature extraction network. The feature extraction model is responsible for extracting the corresponding predicted geometric features from the predicted image and the corresponding sample geometric features from the sample simulation image. These geometric features represent the high-dimensional geometric properties of the image in the feature space, and can capture deeper geometric consistency correlations than simple pixel comparisons.
[0104] Step S502: Determine the feature constraint loss based on the difference between the predicted geometric features and the sample geometric features.
[0105] Here, the feature constraint loss is expressed as:

$$\mathcal{L}_{feat} = \lVert \phi(\hat{x}) - \phi(x_{sim}) \rVert$$

where $\phi$ denotes the feature extraction model, $\hat{x}$ the predicted image, $x_{sim}$ the sample simulation image, and $\lVert\cdot\rVert$ a distance measure in the feature space. The feature constraint loss is a geometric contrast loss in feature space. By minimizing the difference in geometrically relevant features between the predicted image and the sample simulation image in feature space, the simulation training data generation method can effectively enhance the geometric invariance of the generated image, ensuring that the intrinsic geometric structure of the object remains fixed when the visual style of the generated target image changes.
[0106] Step S503: Based on feature constraint loss and geometric fidelity loss, the parameters of the generative model are jointly optimized and adjusted.
[0107] Here, during the calculation process, the simulation training data generation system first incorporates the feature constraint loss into the geometric fidelity loss. The generation system then substitutes the obtained geometric fidelity loss into the total loss function. Finally, by jointly optimizing the total loss function, the generation system dynamically adjusts the network parameters of the generative model using the backpropagation algorithm.
[0108] As an alternative implementation, the geometric fidelity loss can be fully defined in the feature space, using the feature space geometric contrast loss calculated by the feature extraction model as the core constraint, thus replacing the dependence on a specific geometric estimation model. This alternative leverages the powerful feature representation capabilities of large visual models, making the system more consistent and flexible in handling geometric constraints of different modalities. Through the joint optimization at the feature level described above, the target image generated by the generative model can achieve visual realism while its inherent geometric structure is pixel-level aligned with the simulation environment.
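A feature-space comparison of the kind used in steps S501 to S503 might look like the following NumPy sketch. The choice of mean cosine distance is an assumption, since the text does not fix the distance measure, and the small arrays stand in for per-patch embeddings that a model such as DINOv2 or CLIP would produce.

```python
import numpy as np


def feature_constraint_loss(pred_feats, sim_feats):
    """Mean cosine distance between per-patch feature vectors (one per row)."""
    p = pred_feats / np.linalg.norm(pred_feats, axis=1, keepdims=True)
    s = sim_feats / np.linalg.norm(sim_feats, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(p * s, axis=1)))


# Hypothetical 2-patch, 2-dimensional embeddings.
sim_feats = np.array([[1.0, 0.0], [0.0, 2.0]])

# A pure change of magnitude leaves the loss at zero.
same = feature_constraint_loss(sim_feats * 3.0, sim_feats)

# Rotating the first patch's feature by 90 degrees is penalized.
rotated = feature_constraint_loss(np.array([[0.0, 1.0], [0.0, 2.0]]), sim_feats)
print(same, rotated)
```

A distance that ignores feature magnitude is one way to tolerate global appearance changes while still penalizing structural changes, although how well this separates style from geometry depends on the feature extractor.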
[0109] In an optional implementation, after obtaining the predicted image in step S401, the method further includes the following steps S601-S603.
[0110] Step S601: Input the predicted image into the preset semantic recognition model to obtain predicted semantic information.
[0111] Here, the semantic recognition model employs a pre-trained deep convolutional neural network or vision transformer network. In specific implementations, segmentation models such as DeepLabV3+, Mask R-CNN, PSPNet, or SegNet can be used. The semantic recognition model performs pixel-by-pixel category inference on the predicted image to obtain predicted semantic information, which is output in the form of a semantic mask or a pixel-level category probability map. The predicted semantic information reflects the spatial distribution and category of the background environment, robot body, work object, and other interfering objects in the predicted image.
[0112] Step S602: Determine the semantic consistency loss based on the difference between the predicted semantic information and the semantic constraint information in the sample constraint information.
[0113] Here, semantic constraint information refers to the ground-truth category labels obtained directly from the simulation environment. This semantic constraint information provides the reference semantic layout corresponding to the initial simulation image. The semantic consistency loss is expressed as:

$$\mathcal{L}_{sem} = \mathrm{CE}\big(S(\hat{x}),\, M\big)$$

where $S$ denotes the semantic recognition model, $\hat{x}$ the predicted image, $M$ the semantic constraint information, and $\mathrm{CE}$ the cross-entropy loss, which can be replaced by the Dice loss. The magnitude of the semantic consistency loss is thus determined by calculating the cross-entropy loss or Dice loss between the predicted semantic information and the semantic constraint information. The role of the semantic consistency loss is to supervise the generative model so that it strictly preserves the semantics of the original scene while converting simulated images into realistic-style images, preventing problems such as object category confusion, blurred edges, or loss of semantic features in the generated target image during visual enhancement.
[0114] Step S603: Based on semantic consistency loss and geometric fidelity loss, the parameters of the generative model are jointly optimized and adjusted.
[0115] Here, the gradients of the network parameters in each layer of the generative model are calculated using the total error generated by the total loss function and the backpropagation algorithm.
[0116] The simulation training data generation system dynamically updates the internal weight parameters of the generative model based on the calculated gradients, so that the target image generated by the generative model achieves real-world visual fidelity while its intrinsic geometric structure remains consistent with the geometric structure features and its category information remains consistent with the semantic constraint information.
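The two candidate forms of the semantic consistency loss named in step S602, cross-entropy and Dice, can be sketched with NumPy as below. The array layouts (an H x W x C probability map and an H x W integer label map) and the smoothing constant `eps` are assumptions for illustration.

```python
import numpy as np


def cross_entropy(probs, labels, eps=1e-8):
    """probs: (H, W, C) class probabilities; labels: (H, W) integer class ids."""
    h, w = labels.shape
    picked = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return float(-np.mean(np.log(picked + eps)))


def dice_loss(probs, labels, eps=1e-8):
    """One minus the mean soft Dice coefficient over classes."""
    one_hot = np.eye(probs.shape[-1])[labels]
    inter = np.sum(probs * one_hot, axis=(0, 1))
    denom = np.sum(probs, axis=(0, 1)) + np.sum(one_hot, axis=(0, 1))
    return float(1.0 - np.mean((2.0 * inter + eps) / (denom + eps)))


# Simulator labels for a 2x2 image with two classes.
labels = np.array([[0, 1], [1, 1]])
perfect = np.eye(2)[labels]           # prediction that matches the labels exactly
uniform = np.full((2, 2, 2), 0.5)     # maximally uncertain prediction

print(cross_entropy(perfect, labels), cross_entropy(uniform, labels))
print(dice_loss(perfect, labels), dice_loss(uniform, labels))
```

Cross-entropy penalizes each pixel independently, whereas Dice measures region overlap per class, so Dice is often less sensitive to class imbalance between small work objects and large backgrounds.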
[0117] In an alternative embodiment, the process of determining semantic consistency loss can introduce a multi-scale feature comparison mechanism, which helps to improve the constraint strength of semantic consistency by comparing the correlation between the predicted image feature map and the initial simulated image feature map at different semantic levels.
[0118] Another alternative is to use adversarial semantic loss, which utilizes a dedicated discriminator network to determine whether the semantic distribution of the generated target image conforms to the real physical logic, thereby further optimizing the simulation accuracy of the generative model for complex working environments.
[0119] In an optional implementation, the method further includes the following steps S701-S702.
[0120] Step S701: During the generation of the target image, the target image features of the previous frame corresponding to the current frame are obtained, and the target image features are used as temporal context information.
[0121] Here, the target image features acquired from the previous frame are used as temporal context information. These features contain not only the visual texture and lighting distribution of objects in the scene at the previous moment, but also a latent-space representation of each object's motion trend. By extracting and using this temporal context information, the simulation training data generation system can establish a historical visual reference for generating the current frame, thus providing a data foundation for solving the problem of inter-frame visual jumps.
[0122] Step S702: Input the temporal context information and the constraint information of the current frame into the generative model to generate target images corresponding to multiple consecutive frames; wherein the target images corresponding to multiple consecutive frames are consistent in time series.
[0123] Here, temporal context information and constraint information corresponding to the current frame are input into the generative model to generate target images corresponding to multiple consecutive frames. The constraint information corresponding to the current frame includes at least the geometric structural features (such as depth map and surface normal map) and semantic constraint information (such as semantic segmentation map) at the current moment, which are rendered and output by the simulation environment.
[0124] Specifically, the generative model utilizes a built-in temporal attention mechanism to fuse temporal context information with the constraint features of the current frame. This temporal attention mechanism calculates the correlation weights between the pixel features of the current frame and those of the previous frame in the feature space, achieving feature alignment and information transfer across time steps. Through this approach, the target images corresponding to multiple consecutive frames generated by the generative model maintain pixel-level smoothness and visual coherence over time. This effectively avoids discontinuities such as background flickering, surface texture jitter, and abrupt changes in lighting that may occur when generating individual frames independently.
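The fusion of current-frame features with previous-frame features through attention can be sketched as follows. This is a simplified illustration, not the architecture of the application: learned query, key, and value projections are omitted, the fusion is a plain residual addition, and the random features stand in for real network activations.

```python
import numpy as np


def temporal_attention(curr_feats, prev_feats):
    """curr_feats, prev_feats: (N, D) per-location features of two frames."""
    d = curr_feats.shape[1]
    scores = curr_feats @ prev_feats.T / np.sqrt(d)     # (N, N) correlations
    scores -= scores.max(axis=1, keepdims=True)         # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)       # softmax over previous frame
    context = weights @ prev_feats                      # aggregated history
    return curr_feats + context, weights                # residual fusion


rng = np.random.default_rng(0)
prev = rng.normal(size=(4, 8))                          # previous-frame features
curr = prev + 0.01 * rng.normal(size=(4, 8))            # slightly changed current frame
fused, attn = temporal_attention(curr, prev)
print(fused.shape, attn.sum(axis=1))
```

Each row of `attn` is a distribution over locations in the previous frame, which is what the text calls the correlation weights between the features of the current frame and those of the previous frame.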
[0125] In one alternative embodiment, the simulation training data generation system can employ a recurrent neural network architecture (e.g., a Convolutional Long Short-Term Memory network, ConvLSTM) to replace the temporal attention mechanism, storing and updating temporal context information through frame-by-frame transmission of hidden states. In another alternative embodiment, the simulation training data generation system can also introduce optical flow constraint loss, calculating the motion vector field between adjacent target images using a pre-trained optical flow estimation network and minimizing the deviation between the predicted displacement and the actual motion trajectory, thereby further enhancing temporal consistency at the geometric level.
[0126] Furthermore, to meet the requirements of real-time simulation training, a lightweight generative architecture (such as the Latent Consistency Model (LCM) or SDXL-Turbo) can be used in conjunction with temporal context information for rapid denoising. This ensures that the generated target image sequence maintains high-frequency visual coherence while also possessing high generation efficiency. Through the aforementioned temporal constraint mechanism, the generated simulation training dataset can more realistically simulate the continuous evolution of visual phenomena in the real physical world, thereby significantly improving the discrimination accuracy of robot strategies when handling time-related tasks such as dynamic obstacle avoidance and continuous grasping.
[0127] In an optional implementation, after the step of generating the target image in step S102, the method further includes the following steps S801-S802.
[0128] Step S801: Obtain real-world environmental images and extract visual style features from the real-world environmental images.
[0129] Here, high-resolution RGB-D (red-green-blue-depth) cameras (such as the Intel RealSense D455) mounted on the robot's end effector or fixed at the work site are used to capture real-time images of the actual work scene, thus obtaining real-world environment images. Subsequently, the simulation training data generation system uses a preset style extraction network to extract visual style features from the real-world environment images. These visual style features include not only the global illumination intensity, ambient light color temperature, and fine surface texture unique to the real-world environment, but also specific visual deviations introduced by that environment, such as oil stains on the camera lens, glare caused by backlighting, and electronic noise generated by specific sensors. The style extraction network can employ the image encoder of a pre-trained large vision model, such as the visual branch of the Contrastive Language-Image Pre-training (CLIP) model or a pre-trained Vision Transformer (ViT). The style extraction network maps the real-world environment images into a set of compact style embedding vectors that capture the geometry-independent visual distribution information in those images.
[0130] Step S802: Input the visual style features as additional constraint information into the generative model to generate an updated target image; wherein the visual style features of the updated target image are consistent with the visual style features of the real environment image.
[0131] Here, the simulation training data generation system inputs the acquired visual style features as additional constraint information into the generative model. Upon receiving the visual style features as a visual reference, the generative model combines them with the geometric structural features (e.g., depth map, surface normal map) and semantic constraint information (e.g., semantic segmentation map) corresponding to the initial simulation image to generate an updated target image. In practice, the generative model utilizes a cross-attention mechanism or adaptive instance normalization technique to inject style embedding vectors representing the real-world style into the hidden feature layer of the generative model, guiding it to redraw the pixels of the initial simulation image.
[0132] The updated target image maintains the same visual style features as the real-world environment image. This means that the updated target image highly replicates the features of the real-world environment image in terms of visual appearance, lighting logic, and noise distribution, while strictly adhering to the geometric structure features rendered from the simulation environment in terms of internal geometry. In this way, the simulation training data generation system can achieve online visual calibration and adaptive evolution of the simulation training data. For example, when the robot faces an extreme real-world lighting environment it has never encountered before, the simulation training data generation system can dynamically generate a large amount of simulation training data with the same style and geometric accuracy based on a small number of real-world environment images, and supplement the simulation training dataset with the updated target image.
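Of the two injection mechanisms mentioned for step S802, adaptive instance normalization is the simpler to show in isolation. The NumPy sketch below assumes a channel-first feature map and assumes the style embedding has already been reduced to a per-channel mean and standard deviation; both are illustrative choices.

```python
import numpy as np


def adain(content, style_mean, style_std, eps=1e-5):
    """content: (C, H, W) feature map; style_mean, style_std: (C,) target statistics."""
    mu = content.mean(axis=(1, 2), keepdims=True)
    sigma = content.std(axis=(1, 2), keepdims=True)
    normalized = (content - mu) / (sigma + eps)          # strip content statistics
    return normalized * style_std[:, None, None] + style_mean[:, None, None]


rng = np.random.default_rng(0)
content = rng.normal(loc=3.0, scale=2.0, size=(2, 4, 4))   # stand-in hidden features
styled = adain(content,
               style_mean=np.array([0.0, 1.0]),
               style_std=np.array([1.0, 0.5]))
print(styled.mean(axis=(1, 2)), styled.std(axis=(1, 2)))
```

Only the per-channel statistics change; the spatial arrangement of the normalized features is untouched, which is consistent with the stated aim of altering appearance without moving geometry.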
[0133] In an alternative embodiment, the simulation training data generation system can use low-rank adaptation (LoRA) to fine-tune the conditional encoder of the generative model online, enabling the generative model to fit the visual style of real-world images more quickly by updating only a small set of low-rank weight matrices while leaving the backbone network parameters unchanged.
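The idea behind LoRA-style online fine-tuning can be shown on a single linear layer. The layer sizes and rank below are hypothetical; the sketch only demonstrates that the pre-trained weight stays frozen while a low-rank product is added to it.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 6, 8, 2                 # hypothetical layer sizes and rank

W = rng.normal(size=(d_out, d_in))          # frozen pre-trained weight
A = rng.normal(size=(rank, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, rank))                 # trainable up-projection, zero-initialized


def lora_forward(x, W, A, B, scale=1.0):
    """Apply the frozen weight plus the low-rank update B @ A."""
    return x @ (W + scale * (B @ A)).T


x = rng.normal(size=(3, d_in))
y0 = lora_forward(x, W, A, B)               # equals the frozen layer at initialization
trainable = A.size + B.size                 # parameters touched by fine-tuning
print(trainable, W.size)
```

With `B` initialized to zero the adapted layer starts out identical to the pre-trained one, and only 28 of the 48 weights in this toy layer would be trained; for realistically sized layers the saving is far larger.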
[0134] In another alternative embodiment, the simulation training data generation system can use real-world environment images as the target domain for the style transfer algorithm, and utilize discriminator feedback in a generative adversarial network to further enhance the visual consistency between the updated target image and the real-world environment image.
[0135] By incorporating feedback from real-world environmental features, the method for generating simulation training data can effectively address the visual domain shift problem between the simulation environment and specific real-world scenarios. Embodied intelligent policies can continuously learn from simulation training datasets containing updated target images, rapidly adapting to new operational environments and significantly improving the robot's generalization ability and task execution success rate in diverse real-world scenarios.
[0136] In an optional implementation, referring to Figure 4, step S1030 includes the following steps S901-S903.
[0137] Step S901: Obtain the action information and / or state information corresponding to the initial simulation image.
[0138] Here, when robot tasks are simulated in the simulation environment, action information and state information whose timestamps exactly match that of the initial simulation image are acquired synchronously.
[0139] Action information includes control command parameters for the robot body, specifically the robot's joint angles, joint angular velocities, six-degree-of-freedom pose of the end effector, the end effector's movement speed, and the gripper's opening or closing state. State information includes physical data from the simulation environment, specifically the robot's base coordinate position in the simulation coordinate system, the target object's three-dimensional spatial coordinates, the target object's orientation quaternion, the target object's velocity, and collision contact force data fed back by sensors. The simulation training data generation system accesses the simulation engine's low-level API to bind this high-dimensional action and state information to the initial simulation image in real time.
[0140] Step S902: Associate the target image with action information and / or state information.
[0141] Here, the target image obtained after generative model transformation is correlated with the acquired action and state information through pixel-to-physical logic. Because the target image strictly adheres to the geometric features and semantic constraints provided by the simulation environment during generation, the visual content in the target image maintains an absolute correspondence with the action and state information in the simulation environment in physical space.
[0142] The target image is used as the visual input for the robot, the action information is used as the corresponding control label, and the state information is used as the ground-truth label. This association ensures that the target image with a real-world visual style not only achieves high visual fidelity but also inherits the precise physical laws provided by the simulation environment at the underlying logic level, thereby realizing a deep coupling between visual representation and physical behavior.
[0143] Step S903: Construct simulation training data based on the target image, action information, and / or state information.
[0144] Here, thousands of associated data pairs are stored to construct a large-scale simulation training dataset for training embodied intelligence policies.
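Steps S901 to S903 amount to joining three time-stamped streams into training records. The sketch below uses only the Python standard library; the `Sample` fields, the dictionary-keyed streams, and the string placeholders for images are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Sample:
    timestamp: float
    target_image: object        # generated image, used as the visual observation
    action: Sequence[float]     # control label, e.g. joint angles and gripper state
    state: Sequence[float]      # ground-truth label, e.g. object position


def build_dataset(images: Dict[float, object],
                  actions: Dict[float, Sequence[float]],
                  states: Dict[float, Sequence[float]]) -> List[Sample]:
    """Join the three streams on the timestamps they all share."""
    shared = sorted(set(images) & set(actions) & set(states))
    return [Sample(t, images[t], actions[t], states[t]) for t in shared]


images = {0.0: "img_0", 0.1: "img_1", 0.2: "img_2"}    # placeholders for image arrays
actions = {0.0: [0.1, 0.0], 0.1: [0.2, 1.0]}
states = {0.0: [0.5, 0.0, 0.2], 0.1: [0.5, 0.0, 0.3], 0.2: [0.5, 0.0, 0.4]}

dataset = build_dataset(images, actions, states)
print(len(dataset), dataset[1].action)
```

Frames without a matching action or state entry are dropped rather than padded, so every stored record carries a complete observation, control label, and ground-truth label.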
[0145] Specifically, the robot policy uses the constructed simulation training dataset for large-scale reinforcement learning or imitation learning training. For reinforcement learning tasks, the robot policy extracts state features from the target image and applies the cumulative reward formula:

$$R = \sum_{t} \gamma^{t}\, r_{t}$$

to continuously optimize the control strategy, where $r_{t}$ represents the instantaneous reward value at time $t$ and $\gamma$ represents the discount factor. The robot policy can be trained with pixel-input deep reinforcement learning algorithms, such as PPO (Proximal Policy Optimization) or SAC (Soft Actor-Critic), with the goal of learning an optimal policy $\pi^{*}(a \mid o)$. This policy outputs the optimal action command $a$ based on the input target image observation $o$.
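The discounted cumulative reward that reinforcement learning algorithms such as PPO or SAC optimize can be computed as in the following sketch. The reward sequence and discount factor are made-up values.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * r_t, accumulated from the last step backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Illustrative episode: rewards at t = 0, 1, 2 with discount factor 0.5.
episode_return = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
print(episode_return)  # 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
```

A discount factor closer to 1 makes the policy weigh later rewards, such as a successful grasp at the end of an episode, almost as heavily as immediate ones.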
[0146] Because the target images in the simulation training data mimic the complex visual noise and lighting effects of the real world while maintaining accurate geometric structures, the robot policy trained on the simulation training data possesses extremely strong visual generalization capabilities. After training is complete, the robot policy can be directly transferred to robot hardware in real-world scenarios, using images captured by real cameras as input to directly execute tasks without the need for secondary fine-tuning in the real environment, thus achieving efficient Sim2Real transfer.
[0147] In one alternative embodiment, the simulation training data generation system can further perform additional data augmentation processing on the target images during the construction of the simulation training data. This could include random cropping, color jittering, or adding Gaussian noise to further enhance the distribution diversity of the training dataset and improve the robustness of the robot's strategy. In another alternative embodiment, the simulation training data generation system can construct an offline dataset from the correlated data for training a large-scale vision-language-action (VLA) model. This enables the robot to extract semantic information from the target image based on natural language instructions and perform corresponding physical actions.
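The optional augmentations listed above (random cropping, color jittering, Gaussian noise) can be sketched with NumPy. The crop size, jitter range, and noise level are arbitrary illustrative values, and the image is assumed to be a float array in the range 0 to 1.

```python
import numpy as np


def augment(image, rng, crop=24, jitter=0.1, noise_std=0.02):
    """image: (H, W, 3) float array in [0, 1]."""
    h, w, _ = image.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    out = image[top:top + crop, left:left + crop].copy()    # random crop
    out *= 1.0 + rng.uniform(-jitter, jitter, size=3)       # per-channel color jitter
    out += rng.normal(0.0, noise_std, size=out.shape)       # additive Gaussian noise
    return np.clip(out, 0.0, 1.0)


rng = np.random.default_rng(0)
image = rng.uniform(size=(32, 32, 3))    # stand-in for a generated target image
augmented = augment(image, rng)
print(augmented.shape)
```

Color jitter and noise leave the geometry untouched, but cropping shifts pixel coordinates, so any label expressed in image coordinates would need the same crop offset applied.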
[0148] Based on the above embodiments, this application provides a simulation training data generation system. The simulation training data generation system is deployed in a computing unit containing a high-performance graphics processing unit (GPU) cluster, such as an NVIDIA A100 or RTX 4090 graphics card, to accelerate the training and inference process of deep learning models. The simulation training data generation system is connected to a high-resolution red-green-blue-depth (RGB-D) camera (e.g., an Intel RealSense D455) and a simulation environment server via a high-speed data interface.
[0149] Referring to Figure 5, the simulation training data generation system provided in this application includes: a data acquisition module 1, used to acquire the initial simulation image rendered in the simulation environment and the constraint information corresponding to the initial simulation image; wherein the constraint information includes geometric structural features used to characterize the scene corresponding to the initial simulation image, and the geometric structural features include feature information used to characterize the spatial structural relationship of objects in the scene.
[0150] Here, data acquisition module 1 is used to acquire the initial simulation image rendered in the simulation environment and the corresponding constraint information. Data acquisition module 1 has a built-in differentiable rendering pipeline, which can render the 3D simulation scene based on 3DGS technology or a traditional rasterization rendering engine. The rendering engine in data acquisition module 1 is responsible for outputting the initial simulation image, while simultaneously extracting constraint information using underlying metadata. The constraint information includes geometric structural features characterizing the scene corresponding to the initial simulation image, specifically represented by a depth map and a surface normal map. Furthermore, data acquisition module 1 is also used to perform semantic annotation on the initial simulation image based on scene information from the simulation environment to obtain semantic constraint information, which is represented as a semantic segmentation map or instance segmentation map used to distinguish between the robot body and the work object.
[0151] Image generation module 2 is used to input the initial simulation image and constraint information as constraint conditions into the pre-trained generative model to generate the target image; wherein, the target image has visual style features corresponding to the real scene, and the internal geometric structure of the target image is consistent with the geometric structure features.
[0152] Here, image generation module 2 is used to input the initial simulation image and constraint information as constraint conditions into a pre-trained generative model to generate the target image. The generative model in image generation module 2 is preferably a diffusion model based on a conditional control network architecture, but generative adversarial networks or variational autoencoders can also be used as alternative architectures as needed.
[0153] The image generation module 2 further integrates a feature extraction unit, a feature fusion unit, and a denoising generation unit. The feature extraction unit uses a conditional encoder to encode geometric structural features and semantic constraint information. The feature fusion unit injects the encoded features into the backbone network of the generative model (e.g., a U-Net structure). The denoising generation unit outputs a target image with realistic scene visual style features through multi-step denoising processing.
[0154] To ensure the geometric accuracy of the target image, the image generation module 2 is also associated with a geometric estimation model, a feature extraction network, and a semantic recognition model during the training phase. The geometric estimation model uses a pre-trained monocular depth estimation network (e.g., ZoeDepth or MiDaS) to provide feedback on the depth error of the predicted image. The feature extraction network uses a large-scale vision pre-trained model (e.g., DINOv2 or CLIP) to calculate the geometric contrast loss between the predicted image and the initial simulation image in the feature space. The semantic recognition model (e.g., DeepLabV3+ or Mask R-CNN) ensures the semantic layout consistency of the target image.
[0155] Image generation module 2 also includes a temporal consistency component and an online calibration component. The temporal consistency component utilizes a temporal attention mechanism and temporal context information to ensure pixel smoothness of the generated target image sequence in the video stream dimension. The online calibration component acquires real-world environmental images captured by a real camera and feeds their visual style features back to the generative model in real time, driving the generative model to generate target images that match the current real-world environment.
[0156] Data construction module 3 is used to construct simulation training data based on the target image.
[0157] Specifically, data construction module 3 obtains, through a data alignment unit, the robot's action and state information corresponding to the initial simulation image in the simulation environment. Data construction module 3 uses the target image as the robot's visual observation input and associates it physically and logically with the corresponding action and state information, thereby constructing a dataset for large-scale reinforcement learning or imitation learning training. The simulation training data output by data construction module 3 can support efficient training of algorithms such as PPO (Proximal Policy Optimization) or SAC (Soft Actor-Critic), with the aim of giving the trained robot policy high-precision Sim2Real transfer capability.
[0158] In an optional implementation, the constraint information further includes semantic constraint information, and data acquisition module 1 is further configured to: Render the target scene in the simulation environment to obtain the initial simulation image.
[0159] Extract, based on the rendering results of the simulation environment, the geometric structural features corresponding to the initial simulation image.
[0160] Semantically annotate the initial simulation image based on the scene information of the simulation environment to obtain the semantic constraint information, wherein the semantic constraint information characterizes the category information of the different objects in the initial simulation image.
[0161] Generate the constraint information based on the geometric structural features and the semantic constraint information.
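By way of non-limiting illustration, the assembly of constraint information described above can be sketched as follows. The sketch assumes that the geometric structural features are a per-pixel depth buffer and that the simulator exposes a per-pixel instance-ID buffer together with a table mapping IDs to categories; the names `ConstraintInfo`, `build_constraint_info`, and `id_to_category` are hypothetical and do not appear in this application.

```python
from dataclasses import dataclass
from typing import Dict, List

DepthMap = List[List[float]]   # assumed geometric structural feature: depth per pixel
LabelMap = List[List[str]]     # object category per pixel


@dataclass
class ConstraintInfo:
    """Constraint information paired with one initial simulation image."""
    depth: DepthMap
    semantics: LabelMap


def build_constraint_info(
    depth: DepthMap,
    instance_ids: List[List[int]],
    id_to_category: Dict[int, str],
) -> ConstraintInfo:
    """Combine simulator depth and per-pixel categories into one record."""
    same_shape = len(depth) == len(instance_ids) and all(
        len(d_row) == len(i_row) for d_row, i_row in zip(depth, instance_ids)
    )
    if not same_shape:
        raise ValueError("depth and instance-ID buffers must have the same shape")
    # Semantic annotation: look up each pixel's category in the scene table.
    # IDs missing from the table are treated as background (an assumption).
    semantics = [
        [id_to_category.get(i, "background") for i in row] for row in instance_ids
    ]
    return ConstraintInfo(depth=depth, semantics=semantics)
```

Because the labels are read from the simulator's own scene information rather than predicted, they are exact by construction, which is what allows them to serve as constraints.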
[0162] In an optional implementation, the image generation module 2 is further configured to: Input the initial simulation image into the generative model.
[0163] The constraint information is processed by feature extraction to obtain constraint features.
[0164] The constraint features and the image features corresponding to the initial simulation image are fused together to obtain the fused features.
[0165] Based on the fused features, the generative model is used to generate the target image.
[0166] During the generation of the target image, a geometric consistency constraint is applied to the target image to ensure that the intrinsic geometric structure of the target image is consistent with the geometric structural features.
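As a minimal sketch of the fusion step, the example below adds a scaled constraint feature to the image feature. This application does not specify the fusion operator; additive injection is assumed here only because it is a common choice for conditional control branches. The function name `fuse_features` and the `scale` parameter are hypothetical.

```python
from typing import List

Vector = List[float]


def fuse_features(image_feat: Vector, constraint_feat: Vector, scale: float = 1.0) -> Vector:
    """Additive fusion: image feature plus a scaled constraint feature.

    With scale = 0 the constraint has no effect and the image feature is
    returned unchanged; larger values strengthen the conditioning.
    """
    if len(image_feat) != len(constraint_feat):
        raise ValueError("feature vectors must have the same length")
    return [x + scale * c for x, c in zip(image_feat, constraint_feat)]
```

In a real backbone the same operation would be applied to feature maps at one or more layers rather than to a single vector.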
[0167] In an optional implementation, the image generation module 2 is further configured to: The sample simulation image and the corresponding sample constraint information are input into the generative model to be trained to obtain the predicted image.
[0168] The predicted image is input into a preset geometric estimation model to obtain the predicted geometric structural features.
[0169] The difference between the predicted geometric structural features and the geometric structural features in the sample constraint information is calculated to determine the geometric fidelity loss.
[0170] The parameters of the generative model are optimized and adjusted based on the geometric fidelity loss, so that the intrinsic geometric structure of the target image generated by the generative model is consistent with the geometric structural features.
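The geometric fidelity loss described above can be illustrated with the following sketch, which assumes that the geometric structural features are depth maps and that the difference is measured as a mean absolute error. Because monocular depth estimators may return depth only up to an unknown scale and shift, the sketch optionally aligns the prediction to the reference by least squares before comparing; this alignment step, like the names `align_scale_shift` and `geometric_fidelity_loss`, is an assumption and is not stated in this application.

```python
from typing import List, Tuple

DepthMap = List[List[float]]


def align_scale_shift(pred: List[float], ref: List[float]) -> Tuple[float, float]:
    """Least-squares scale s and shift t minimising sum((s * p + t - r) ** 2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_r = sum(ref) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0.0:
        # Constant prediction: the best fit is the mean of the reference.
        return 0.0, mean_r
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    scale = cov / var_p
    return scale, mean_r - scale * mean_p


def geometric_fidelity_loss(pred_depth: DepthMap, ref_depth: DepthMap, align: bool = True) -> float:
    """Mean absolute depth error, optionally after scale-and-shift alignment."""
    pred = [v for row in pred_depth for v in row]
    ref = [v for row in ref_depth for v in row]
    if not pred or len(pred) != len(ref):
        raise ValueError("depth maps must be non-empty and contain the same number of pixels")
    scale, shift = align_scale_shift(pred, ref) if align else (1.0, 0.0)
    return sum(abs(scale * p + shift - r) for p, r in zip(pred, ref)) / len(pred)
```

In actual training the same quantity would be computed with a differentiable tensor library so that its gradient can flow back into the generative model; the pure-Python form here only shows the arithmetic.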
[0171] In an optional implementation, the image generation module 2 is further configured to: The predicted image and the sample simulation image are respectively input into the preset feature extraction model to obtain the predicted geometric features corresponding to the predicted image and the sample geometric features corresponding to the sample simulation image.
[0172] The feature constraint loss is determined based on the difference between the predicted geometric features and the sample geometric features.
[0173] Based on feature constraint loss and geometric fidelity loss, the parameters of the generative model are jointly optimized and adjusted.
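One possible form of the feature constraint loss and of the joint objective is sketched below. The sketch assumes that the feature extraction model returns one feature vector per image, that the difference is measured as cosine distance, and that the joint objective is a weighted sum; the weights `w_geo` and `w_feat` and both function names are hypothetical.

```python
import math
from typing import List


def feature_constraint_loss(pred_feat: List[float], sample_feat: List[float]) -> float:
    """Cosine distance (1 - cosine similarity) between two feature vectors."""
    if len(pred_feat) != len(sample_feat):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(pred_feat, sample_feat))
    norm_p = math.sqrt(sum(a * a for a in pred_feat))
    norm_s = math.sqrt(sum(b * b for b in sample_feat))
    if norm_p == 0.0 or norm_s == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_p * norm_s)


def joint_loss(geo_loss: float, feat_loss: float, w_geo: float = 1.0, w_feat: float = 0.5) -> float:
    """Weighted sum of the geometric fidelity loss and the feature constraint loss."""
    return w_geo * geo_loss + w_feat * feat_loss
```

The relative weights control how strongly each term pulls on the generative model and would normally be tuned on validation data.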
[0174] In an optional implementation, the image generation module 2 is further configured to: The predicted image is input into a preset semantic recognition model to obtain predicted semantic information.
[0175] The semantic consistency loss is determined based on the difference between the predicted semantic information and the semantic constraint information in the sample constraint information.
[0176] Based on semantic consistency loss and geometric fidelity loss, the parameters of the generative model are jointly optimized and adjusted.
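The semantic consistency loss can be illustrated, under stated assumptions, as a per-pixel cross-entropy. The sketch assumes that the semantic recognition model yields class probabilities for every pixel and that the semantic constraint information is an integer class index per pixel; the function name and the `eps` clamp are hypothetical.

```python
import math
from typing import List, Sequence


def semantic_consistency_loss(
    pred_probs: List[List[Sequence[float]]],
    ref_labels: List[List[int]],
    eps: float = 1e-12,
) -> float:
    """Mean per-pixel cross-entropy between predicted class probabilities
    and the reference class indices taken from the constraint information."""
    if len(pred_probs) != len(ref_labels):
        raise ValueError("prediction and reference must have the same number of rows")
    total, count = 0.0, 0
    for prob_row, label_row in zip(pred_probs, ref_labels):
        if len(prob_row) != len(label_row):
            raise ValueError("prediction and reference rows must have the same length")
        for probs, label in zip(prob_row, label_row):
            # Clamp to eps so that a zero probability does not produce log(0).
            total -= math.log(max(probs[label], eps))
            count += 1
    if count == 0:
        raise ValueError("inputs must contain at least one pixel")
    return total / count
```

The loss is zero when every pixel assigns probability one to its reference category and grows as the predicted layout drifts from the simulated one.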
[0177] In an optional implementation, the image generation module 2 is further configured to: During the generation of the target image, the target image features of the previous frame corresponding to the current frame are obtained and used as temporal context information.
[0178] The temporal context information and the constraint information of the current frame are input into the generative model to generate target images corresponding to multiple consecutive frames; wherein the target images corresponding to multiple consecutive frames are consistent in time series.
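The frame-by-frame use of temporal context can be sketched as a loop that carries the previous frame's features forward. The callable `generate_frame` stands in for the generative model and is hypothetical, as is the toy implementation used to exercise the loop; only the control flow is intended to reflect the steps above.

```python
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

ConstraintT = TypeVar("ConstraintT")  # constraint information of one frame
FeatT = TypeVar("FeatT")              # target image features of one frame
ImageT = TypeVar("ImageT")            # generated target image


def generate_sequence(
    constraints: Sequence[ConstraintT],
    generate_frame: Callable[[ConstraintT, Optional[FeatT]], Tuple[ImageT, FeatT]],
) -> List[ImageT]:
    """Generate consecutive frames, feeding each frame the previous frame's features."""
    images: List[ImageT] = []
    context: Optional[FeatT] = None  # no temporal context exists for the first frame
    for constraint in constraints:
        image, features = generate_frame(constraint, context)
        images.append(image)
        context = features  # becomes the temporal context of the next frame
    return images


def toy_generate_frame(constraint: float, context: Optional[float]) -> Tuple[float, float]:
    """Toy stand-in for the model: blend the constraint with the previous features."""
    value = constraint if context is None else 0.5 * constraint + 0.5 * context
    return value, value
```

With the toy stand-in, `generate_sequence([0.0, 1.0, 1.0], toy_generate_frame)` returns `[0.0, 0.5, 0.75]`: the output approaches the new constraint gradually instead of jumping, which is the smoothing effect that temporal context is meant to provide.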
[0179] In an optional implementation, the image generation module 2 is further configured to: Acquire real-world environmental images and extract their visual style features.
[0180] Visual style features are input as additional constraint information into the generative model to generate an updated target image; wherein the visual style features of the updated target image are consistent with the visual style features of the real environment image.
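This application does not specify how visual style features are extracted. As a simple stand-in, the sketch below summarizes a real-world environment image by the mean and standard deviation of each color channel, computed directly on pixel values; the function name `style_statistics` and this choice of statistics are assumptions.

```python
import math
from typing import List, Tuple

Pixel = Tuple[float, float, float]  # one RGB pixel


def style_statistics(image: List[List[Pixel]]) -> Tuple[List[float], List[float]]:
    """Per-channel mean and (population) standard deviation of an RGB image."""
    pixels = [px for row in image for px in row]
    if not pixels:
        raise ValueError("image must contain at least one pixel")
    n = len(pixels)
    means = [sum(px[c] for px in pixels) / n for c in range(3)]
    stds = [
        math.sqrt(sum((px[c] - means[c]) ** 2 for px in pixels) / n) for c in range(3)
    ]
    return means, stds
```

The resulting six numbers could be supplied to the generative model as additional constraint information; a learned style encoder would capture far more than overall color and contrast, so this example only shows where such features enter the pipeline.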
[0181] In an optional implementation, data construction module 3 is further configured to: Obtain the action information and / or state information corresponding to the initial simulation image.
[0182] Associate the target image with action information and / or state information.
[0183] Simulation training data is constructed based on target images, action information, and / or state information.
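The association step can be sketched as follows, assuming that target images, action information, and state information are each indexed by simulation time step. The names `TrainingSample` and `build_training_data` are hypothetical. Either the action or the state may be absent for a given step, reflecting the "and / or" wording above.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Sequence


@dataclass
class TrainingSample:
    """One record of simulation training data."""
    observation: Any                   # generated target image
    action: Optional[Sequence[float]]  # action information, if available
    state: Optional[Sequence[float]]   # state information, if available


def build_training_data(
    target_images: Dict[int, Any],
    actions: Dict[int, Sequence[float]],
    states: Dict[int, Sequence[float]],
) -> List[TrainingSample]:
    """Pair each target image with the action and/or state of the same time step."""
    samples: List[TrainingSample] = []
    for step in sorted(target_images):
        action = actions.get(step)
        state = states.get(step)
        if action is None and state is None:
            continue  # nothing to associate with this frame, so skip it
        samples.append(TrainingSample(target_images[step], action, state))
    return samples
```

Keying on the time step of the initial simulation image is what keeps the generated observation aligned with the action and state the simulator recorded for that same step.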
[0184] The computer program product provided in this application includes a computer-readable storage medium storing program code. The instructions included in the program code can be used to execute the methods described in the preceding method embodiments. For specific implementation details, please refer to the method embodiments, which will not be repeated here.
[0185] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, reference may be made to the corresponding process in the foregoing method embodiments for the specific working process of the system and apparatus described above, which will not be repeated here.
[0186] Furthermore, in the description of the embodiments of this application, unless otherwise expressly specified and limited, the terms "installation," "connection," and "linking" should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral connection; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; and they can refer to the internal connection of two components. Those skilled in the art can understand the specific meaning of the above terms in this application based on the specific circumstances.
[0187] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0188] In the description of this application, it should be noted that the terms "center," "upper," "lower," "left," "right," "vertical," "horizontal," "inner," and "outer," etc., indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings. They are used only for the convenience of describing this application and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limitations on this application. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and should not be construed as indicating or implying relative importance.
[0189] Finally, it should be noted that the above-described embodiments are merely specific implementations of this application, used to illustrate its technical solutions rather than to limit them, and the scope of protection of this application is not limited thereto. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed in this application, they can still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and should all be covered by the scope of protection of this application. Therefore, the scope of protection of this application shall be determined by the scope of protection of the claims.
Claims
1. A method for generating simulation training data, characterized in that it comprises: obtaining an initial simulation image rendered in a simulation environment and constraint information corresponding to the initial simulation image, wherein the constraint information includes geometric structural features characterizing the scene corresponding to the initial simulation image, and the geometric structural features include feature information characterizing the spatial structural relationships of objects in the scene; inputting the initial simulation image and the constraint information as constraint conditions into a pre-trained generative model to generate a target image, wherein the target image has visual style features corresponding to a real scene, and the intrinsic geometric structure of the target image is consistent with the geometric structural features; and constructing simulation training data based on the target image.
2. The method for generating simulation training data according to claim 1, characterized in that the constraint information further includes semantic constraint information, and the step of obtaining the initial simulation image rendered in the simulation environment and the constraint information corresponding to the initial simulation image includes: rendering a target scene in the simulation environment to obtain the initial simulation image; extracting, based on the rendering results of the simulation environment, the geometric structural features corresponding to the initial simulation image; semantically annotating the initial simulation image based on scene information of the simulation environment to obtain the semantic constraint information, wherein the semantic constraint information characterizes the category information of different objects in the initial simulation image; and generating the constraint information based on the geometric structural features and the semantic constraint information.
3. The method for generating simulation training data according to claim 1, characterized in that the step of inputting the initial simulation image and the constraint information as constraint conditions into a pre-trained generative model to generate a target image includes: inputting the initial simulation image into the generative model; performing feature extraction on the constraint information to obtain constraint features; fusing the constraint features with the image features corresponding to the initial simulation image to obtain fused features; generating the target image with the generative model based on the fused features; and, during generation of the target image, applying a geometric consistency constraint to the target image so that the intrinsic geometric structure of the target image is consistent with the geometric structural features.
4. The method for generating simulation training data according to claim 1, characterized in that the generative model is trained by: inputting a sample simulation image and the sample constraint information corresponding to the sample simulation image into the generative model to be trained to obtain a predicted image; inputting the predicted image into a preset geometric estimation model to obtain predicted geometric structural features; calculating the difference between the predicted geometric structural features and the geometric structural features in the sample constraint information to determine a geometric fidelity loss; and optimizing and adjusting the parameters of the generative model based on the geometric fidelity loss, so that the intrinsic geometric structure of the target image generated by the generative model is consistent with the geometric structural features.
5. The method for generating simulation training data according to claim 4, characterized in that, after the step of obtaining the predicted image, the method further includes: inputting the predicted image and the sample simulation image respectively into a preset feature extraction model to obtain predicted geometric features corresponding to the predicted image and sample geometric features corresponding to the sample simulation image; determining a feature constraint loss based on the difference between the predicted geometric features and the sample geometric features; and jointly optimizing and adjusting the parameters of the generative model based on the feature constraint loss and the geometric fidelity loss.
6. The method for generating simulation training data according to claim 4, characterized in that, after the step of obtaining the predicted image, the method further includes: inputting the predicted image into a preset semantic recognition model to obtain predicted semantic information; determining a semantic consistency loss based on the difference between the predicted semantic information and the semantic constraint information in the sample constraint information; and jointly optimizing and adjusting the parameters of the generative model based on the semantic consistency loss and the geometric fidelity loss.
7. The method for generating simulation training data according to claim 3, characterized in that the method further includes: obtaining, during generation of the target image, the target image features of the frame preceding the current frame, and using those target image features as temporal context information; and inputting the temporal context information and the constraint information of the current frame into the generative model to generate target images corresponding to multiple consecutive frames, wherein the target images corresponding to the multiple consecutive frames are temporally consistent.
8. The method for generating simulation training data according to claim 1, characterized in that, after the step of generating the target image, the method further includes: acquiring a real-world environment image and extracting visual style features from the real-world environment image; and inputting the visual style features as additional constraint information into the generative model to generate an updated target image, wherein the visual style features of the updated target image are consistent with the visual style features of the real-world environment image.
9. The method for generating simulation training data according to claim 1, characterized in that the step of constructing simulation training data based on the target image includes: obtaining action information and/or state information corresponding to the initial simulation image; associating the target image with the action information and/or state information; and constructing the simulation training data based on the target image and the action information and/or state information.
10. A system for generating simulation training data, characterized in that it comprises: a data acquisition module, configured to acquire an initial simulation image rendered in a simulation environment and constraint information corresponding to the initial simulation image, wherein the constraint information includes geometric structural features characterizing the scene corresponding to the initial simulation image, and the geometric structural features include feature information characterizing the spatial structural relationships of objects in the scene; an image generation module, configured to input the initial simulation image and the constraint information as constraint conditions into a pre-trained generative model to generate a target image, wherein the target image has visual style features corresponding to a real scene, and the intrinsic geometric structure of the target image is consistent with the geometric structural features; and a data construction module, configured to construct simulation training data based on the target image.