Method for generating a simulated image
By performing image recognition and 3D reconstruction on a reference video stream, the method generates simulation images applicable to multiple scenarios. This addresses the difficulty that existing simulation systems have in adapting to diverse driving environments, and allows simulation images to be generated quickly and accurately across scenes.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GUANGZHOU XIAOPENG CONNECTIVITY TECH CO LTD
- Filing Date
- 2026-03-27
- Publication Date
- 2026-06-30
AI Technical Summary
Existing autonomous driving simulation systems struggle to quickly adapt to diverse driving environments, resulting in simulation image generation failing to meet training and evaluation requirements. Furthermore, existing methods suffer from visual artifacts or spatial structure distortions when applied across different scenarios.
By performing image recognition on the reference video stream, a first data stream containing static image regions and a second data stream containing 3D data of moving objects are generated. The target background image is rendered using a 3D reconstruction model, and a simulation image is generated by combining the target position information. This avoids scene-by-scene training and achieves cross-scene generalization.
The method enables rapid and accurate generation of simulation images in different scenarios, improves the generalization ability and applicability of simulation image generation, and meets the needs of real-time closed-loop simulation.
Description
Technical Field
[0001] This application relates to the field of intelligent driving, specifically to a method for generating simulation images.
Background Technology
[0002] Currently, with the rapid development of artificial intelligence and computer vision technologies, autonomous driving technology, as an important development direction in the field of intelligent transportation, plays an increasingly crucial role in improving travel efficiency and driving safety. In the research and development and deployment of autonomous driving systems, simulation environments are typically relied upon for model training and performance evaluation. The quality of the generated simulation images affects the iteration efficiency and reliability of autonomous driving technology. However, current autonomous driving simulation systems struggle to quickly adapt to diverse driving environments, resulting in simulation systems failing to fully meet the training and evaluation needs of autonomous driving models in practical applications.
Summary of the Invention
[0003] This application discloses a method for generating simulated images, which can quickly and accurately generate simulated images in different scenarios and improve the generalization ability of simulated image generation.
[0004] In a first aspect, embodiments of this application disclose a method for generating a simulated image, the method comprising:
obtaining a reference video stream, the reference video stream comprising images respectively corresponding to multiple timestamps;
performing image recognition on each frame of the reference video stream to obtain a first data stream and a second data stream, wherein the first data stream includes image data of static image regions in the images respectively corresponding to the multiple timestamps, the second data stream includes three-dimensional data of moving objects in the images respectively corresponding to the multiple timestamps, and the three-dimensional data includes the position information of the moving objects in three-dimensional space;
obtaining, from the first data stream, image data of static image regions in multiple frames of reference images within the time range of the target timestamp, and rendering a target background image based on the image data of the static image regions in the multiple frames of reference images and the target viewpoint; and
obtaining, from the second data stream, the target position information of the moving object corresponding to the target timestamp, and generating a simulated image corresponding to the target timestamp under the target viewpoint based on the target position information and the target background image.
[0005] As an optional implementation, the step of performing image recognition on each frame of the reference video stream to obtain a first data stream and a second data stream includes: Based on each frame of the reference video stream, a static image region in each frame is determined, and a first data stream is generated based on the static image region in each frame and the corresponding timestamp. The reference video stream is input into the target detection model, and the target detection model identifies moving objects in each frame of the reference video stream to obtain the three-dimensional data of the moving objects in each frame. Based on the three-dimensional data of the moving objects in each frame and the corresponding timestamp, a second data stream is generated.
[0006] As an optional implementation, the three-dimensional data includes the three-dimensional bounding box information of the moving object; The three-dimensional bounding box information includes the position information, size parameters, and orientation angle of the moving object; wherein, the position information includes the coordinates of the center point of the moving object in three-dimensional space, the size parameters include the length, width, and height of the three-dimensional bounding box, and the orientation angle includes the heading angle of the moving object on the horizontal plane.
[0007] As an optional implementation, the method further includes: In response to an editing operation on the 3D bounding box information of the target moving object corresponding to the target timestamp, the 3D bounding box information of the target moving object in the second data stream is updated; the target moving object is any moving object in the image corresponding to the target timestamp. Based on the updated 3D bounding box information, the steps of obtaining the target 3D bounding box information of the moving object corresponding to the target timestamp from the second data stream and generating a simulation image of the target timestamp from the target viewpoint based on the target 3D bounding box information and the target background image are re-executed to obtain the edited simulation image.
[0008] As an optional implementation, the step of rendering the target background image based on the image data of the static image region in the multi-frame reference images and the target viewpoint includes: The three-dimensional reconstruction model is used to perform three-dimensional structural reasoning on the image data of the static image region in the multi-frame reference image to obtain three-dimensional scene parameters. The three-dimensional scene parameters are used to characterize the geometric structure and appearance attributes of the static scene corresponding to the static image region of the multi-frame reference image in three-dimensional space. The static scene is rendered according to the target viewpoint and the three-dimensional scene parameters to obtain a target background image, which is a two-dimensional projection image of the static scene under the target viewpoint.
[0009] As an optional implementation, the 3D reconstruction model includes a feedforward 3D Gaussian splatting model; the 3D scene parameters include multiple Gaussian functions and the Gaussian parameters corresponding to each Gaussian function; the step of performing 3D structural reasoning on the image data of the static image regions in the multi-frame reference images through the 3D reconstruction model to obtain the 3D scene parameters includes: inputting the image data of the static image regions in the multi-frame reference images into the feedforward 3D Gaussian splatting model; extracting features from that image data through the forward propagation network of the feedforward 3D Gaussian splatting model to obtain static feature maps; and performing, through the prediction branches of the feedforward 3D Gaussian splatting model, parameter prediction on one or more Gaussian functions corresponding to each pixel position in the static feature maps to obtain the Gaussian parameters corresponding to each Gaussian function.
[0010] As an optional implementation, the three-dimensional scene parameters include the spatial position parameters of the static scene in three-dimensional space; before rendering the static scene according to the target viewpoint and the three-dimensional scene parameters to obtain the target background image, the method further includes: Obtain point cloud data corresponding to the multiple reference images, wherein the point cloud data includes coordinate information of multiple three-dimensional spatial points; Using the point cloud data as prior information, the spatial location parameters are constrained and adjusted to optimize the geometric structure of the static scene in three-dimensional space as represented by the three-dimensional scene parameters.
[0011] As an optional implementation, the point cloud data includes target depth values corresponding to each of the three-dimensional spatial points; the step of using the point cloud data as prior information to constrain and adjust the spatial position parameters includes: Based on the camera parameters corresponding to the first reference image, determine the target correspondence between multiple three-dimensional spatial points in the point cloud data corresponding to the first reference image and each pixel in the first reference image; the first reference image can be any reference image. Based on the three-dimensional scene parameters and the viewpoint corresponding to the first reference image, determine the rendering depth map corresponding to the first reference image; Based on the target correspondence and the rendering depth map, determine the rendering depth value corresponding to each of the three-dimensional space points; Calculate the depth deviation between the target depth value and the corresponding rendering depth value for each of the three-dimensional space points to obtain the depth deviation corresponding to the first reference image; Based on the depth deviations corresponding to the multiple reference images, a constraint loss is constructed, and the spatial position parameters are optimized through backpropagation to obtain the optimized spatial position parameters.
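The depth deviation described above can be illustrated for a single reference image with a minimal sketch. It assumes the point cloud points have already been projected to pixel coordinates using the camera parameters, and it uses a mean absolute deviation as the loss; the text does not specify the exact form of the constraint loss, so that choice and the function name `depth_constraint_loss` are assumptions. The sketch only evaluates the loss value; optimizing the spatial position parameters by backpropagation would require an automatic differentiation framework.

```python
from typing import List, Sequence, Tuple


def depth_constraint_loss(
    projected_points: Sequence[Tuple[float, float, float]],
    rendered_depth: List[List[float]],
) -> float:
    """Mean absolute deviation between point-cloud depth and rendered depth.

    projected_points: (u, v, target_depth) per 3D point, already projected
        into the reference image (assumed input format).
    rendered_depth: depth map rendered from the scene parameters at the
        viewpoint of the same reference image, indexed as [row][col].
    """
    height, width = len(rendered_depth), len(rendered_depth[0])
    deviations = []
    for u, v, target_depth in projected_points:
        col, row = int(round(u)), int(round(v))
        # Points that project outside the image have no rendered depth.
        if 0 <= row < height and 0 <= col < width:
            deviations.append(abs(rendered_depth[row][col] - target_depth))
    return sum(deviations) / len(deviations) if deviations else 0.0
```

Summing or averaging this value over the multiple reference images would give one possible form of the overall constraint loss.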
[0012] As an optional implementation, rendering the static scene based on the target viewpoint and the three-dimensional scene parameters to obtain the target background image includes: based on the target viewpoint, rendering the static scene represented by the three-dimensional scene parameters to obtain the target background image and a corresponding depth map, where the depth map includes the depth value, in the static scene, of each pixel of the target background image. The step of generating a simulated image corresponding to the target timestamp under the target viewpoint based on the target position information and the target background image includes: inputting the target position information, the target background image, and the depth map into a generation model, and generating, through the generation model, the simulated image corresponding to the target timestamp under the target viewpoint based on the target position information, the target background image, and the depth map.
[0013] As an optional implementation, the step of generating, through the generation model, the simulated image corresponding to the target timestamp under the target viewpoint based on the target position information, the target background image, and the depth map includes: determining, through the generation model and based on the target position information, the embedding region of each moving object in the target background image; generating, under the spatial constraints of the depth map, image content for each moving object that is aligned with the target background image within the corresponding embedding region; and fusing the image content corresponding to each moving object into the corresponding embedding region to obtain the simulated image corresponding to the target timestamp under the target viewpoint, where the simulated image includes the target background image and each moving object embedded in the target background image.
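As a purely geometric illustration of how position information can map to an image region, the sketch below projects the eight corners of a 3D bounding box through a pinhole camera and takes the enclosing rectangle. This is not the generation model described above, which determines the embedding region itself; the function name `embedding_region`, the camera axis convention (x right, y down, z forward), and the assignment of box length to the x axis at zero heading are all assumptions.

```python
import math
from typing import Optional, Tuple


def embedding_region(
    center: Tuple[float, float, float],
    size: Tuple[float, float, float],  # (length, width, height)
    yaw: float,                        # heading about the camera y axis, radians
    fx: float, fy: float, cx: float, cy: float,
) -> Optional[Tuple[float, float, float, float]]:
    """Return (u_min, v_min, u_max, v_max) of the projected box, or None
    if any corner lies at or behind the image plane."""
    xc, yc, zc = center
    length, width, height = size
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    us, vs = [], []
    for dx in (-length / 2, length / 2):
        for dy in (-height / 2, height / 2):
            for dz in (-width / 2, width / 2):
                # Rotate the corner offset about the vertical (y) axis.
                x = xc + dx * cos_y + dz * sin_y
                z = zc - dx * sin_y + dz * cos_y
                y = yc + dy
                if z <= 0:
                    return None
                us.append(fx * x / z + cx)
                vs.append(fy * y / z + cy)
    return min(us), min(vs), max(us), max(vs)
```

Comparing the box depth with the background depth map inside this rectangle is one way the occlusion relationship between a moving object and the static scene could be checked.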
[0014] In a second aspect, embodiments of this application disclose a device for generating simulated images, the device comprising: an acquisition module, used to acquire a reference video stream, which includes images respectively corresponding to multiple timestamps; a decoupling module, used to perform image recognition on each frame of the reference video stream to obtain a first data stream and a second data stream, wherein the first data stream includes image data of static image regions in the images respectively corresponding to the multiple timestamps, the second data stream includes three-dimensional data of moving objects in the images respectively corresponding to the multiple timestamps, and the three-dimensional data includes the position information of the moving objects in three-dimensional space; a rendering module, used to obtain image data of static image regions in multiple frames of reference images within the time range of the target timestamp from the first data stream, and to render the target background image based on the image data of the static image regions in the multiple frames of reference images and the target viewpoint; and a generation module, used to obtain target position information of the moving object corresponding to the target timestamp from the second data stream, and to generate a simulated image corresponding to the target timestamp under the target viewpoint based on the target position information and the target background image.
[0015] In a third aspect, embodiments of this application disclose an electronic device, including a memory and a processor. The memory stores a computer program which, when executed by the processor, causes the processor to implement the method described in any of the embodiments of the first aspect above.
[0016] In a fourth aspect, embodiments of this application disclose a computer-readable storage medium that stores a computer program which, when executed by a processor, implements the method described in any of the above embodiments.
[0017] In a fifth aspect, embodiments of this application disclose a computer program product, including a computer program which, when executed by a processor, implements the method described above.
[0018] Compared with related technologies, the embodiments of this application have the following beneficial effects: By performing image recognition on each frame of the reference video stream, a first data stream containing static image region data and a second data stream containing 3D data of moving objects are obtained. Then, during the generation of the simulated image, the target background image under the target viewpoint is rendered based on the static image region data of multiple frames from the first data stream. Since the static image region data in the first data stream is an abstract representation of the static background in the scene, this rendering process does not require scene-by-scene training for different static scenes. This allows the simulation system to directly adapt to the scene corresponding to any newly input reference video stream, giving it cross-scene generalization capability. At the same time, the target position information of the moving object is obtained from the second data stream and combined with the target background image to generate the simulated image. This makes the generation of dynamic objects independent of the image content of a specific scene; only 3D data is needed to flexibly embed them into different background scenes. Therefore, without additional training for each new scene, simulated images for different viewpoints and timestamps can be generated quickly and accurately, improving the generalization and applicability of simulated image generation.
Attached Figure Description
[0019] To more clearly illustrate the technical solutions in the embodiments of this application, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0020] Figure 1 is a flowchart of a method for generating a simulated image in one embodiment;
Figure 2 is a schematic diagram of any frame in a reference video stream in one embodiment;
Figure 3 is a flowchart of a method for generating a simulated image in one embodiment;
Figure 4 is a flowchart illustrating the process of obtaining a target background image in one embodiment;
Figure 5 is a flowchart illustrating the acquisition of 3D scene parameters in one embodiment;
Figure 6 is a flowchart illustrating the optimization of 3D scene parameters in one embodiment;
Figure 7 is a flowchart illustrating the generation process of a simulated image in one embodiment;
Figure 8 is a block diagram of a device for generating a simulated image in one embodiment;
Figure 9 is a structural block diagram of an electronic device in one embodiment.
Detailed Implementation
[0021] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0022] It is understood that the terms "first," "second," etc., used in this application may be used herein to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish one element from another. For example, without departing from the scope of this application, a first data stream may be referred to as a second data stream, and similarly, a second data stream may be referred to as a first data stream. The first and second data streams are obtained by processing the same video stream; the first data stream includes image data of static image regions in each frame of the video stream; the second data stream includes three-dimensional data of moving objects in each frame of the video stream.
[0023] In related technologies, simulation image generation based on 3D reconstruction and simulation image generation based on generative models are the two main ways for simulation systems to generate simulation images.
[0024] 3D reconstruction-based methods acquire multi-view images of real-world scenes, utilize 3D reconstruction technology to construct the 3D structure of the scene, and render simulated images from new perspectives, which can maintain good geometric consistency and spatial accuracy of the scene. However, this method usually requires scene-by-scene optimization training for each new scene, the reconstruction process is time-consuming, it is difficult to quickly adapt to diverse driving scenarios, and when the input viewpoints are sparse, the rendering of new perspectives is prone to visual artifacts, limiting its cross-scene generalization application.
[0025] Generative model-based approaches use input conditional information to predict and generate simulated images, producing highly realistic and detailed visuals without requiring separate training for each scene, thus exhibiting strong visual expressiveness. However, this method struggles to guarantee geometric consistency during generation, and issues such as spatial structure distortion and inaccurate occlusion relationships can easily arise as the generation process progresses. Furthermore, its slow generation speed makes it difficult to meet the requirements of real-time closed-loop simulation.
[0026] This application discloses a method for generating simulated images, which can quickly and accurately generate simulated images in different scenarios and improve the generalization ability of simulated image generation.
[0027] As shown in Figure 1, in one embodiment, a method for generating simulated images is provided, which can be applied to an electronic device equipped with a simulation system. The simulation system is used to generate simulated images required for model training and evaluation. The electronic device can be a server, personal computer, workstation, cloud server, or other device with computing capabilities, or it can be an in-vehicle computing platform mounted on a mobile device (such as a vehicle). The method may include the following steps 110 to 140.
[0028] Step 110: Obtain the reference video stream.
[0029] The reference video stream includes images corresponding to multiple timestamps.
[0030] Optionally, the reference video stream can be driving recording video captured by the onboard camera of the autonomous vehicle while it is driving on a real road, or it can be a pre-collected sequence of images with temporal relationships read from an existing public dataset. Each timestamp corresponds to one frame of image, which records the driving environment near the vehicle at a certain moment.
[0031] For example, Figure 2 is a schematic diagram of any frame in the reference video stream. As shown in Figure 2, the image illustrates a typical urban driving scenario, including static environmental elements such as lane lines, road signs, roadside buildings, and green belts, as well as dynamic moving objects such as other vehicles and pedestrians. The static environmental elements constitute the basic spatial structure of the scene, while the dynamic moving objects reflect traffic participation behaviors within the scene; together, they form a complete description of the driving environment.
[0032] In some embodiments, each frame of the reference video stream may carry corresponding acquisition parameter information, including but not limited to the timestamp of the acquisition time of each frame, the pose information of the acquisition device, and the intrinsic parameter information of the acquisition device.
[0033] The timestamp is used to establish the temporal correlation between images of different frames, providing a basis for temporal alignment in subsequent processing. The pose information of the acquisition device includes the camera's extrinsic parameter matrix, which describes the position and orientation of the acquisition device in the world coordinate system and is used to convert between pixel coordinates and 3D spatial coordinates. The intrinsic parameter information of the acquisition device may include focal length, principal point coordinates, distortion coefficients, etc. This intrinsic parameter information describes the geometric model of image imaging and is a key basis for 3D reconstruction and viewpoint transformation.
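The roles of the extrinsic and intrinsic parameters can be illustrated with a minimal pinhole projection sketch. It assumes a world-to-camera rotation and translation as the extrinsic convention and ignores lens distortion; the function name `project_point` and its parameter names are hypothetical and do not come from the application.

```python
from typing import Optional, Sequence, Tuple


def project_point(
    point_world: Sequence[float],
    rotation: Sequence[Sequence[float]],  # 3x3 world-to-camera rotation (assumed convention)
    translation: Sequence[float],         # world-to-camera translation
    fx: float, fy: float, cx: float, cy: float,
) -> Optional[Tuple[float, float, float]]:
    """Return (u, v, depth) for a world point, or None if it is behind the camera."""
    # Extrinsics: world coordinates to camera coordinates.
    xc, yc, zc = (
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    if zc <= 0:
        return None
    # Intrinsics: camera coordinates to pixel coordinates (distortion ignored).
    u = fx * xc / zc + cx
    v = fy * yc / zc + cy
    return u, v, zc
```

A fisheye or other wide-angle camera would need an additional distortion model between the two steps, using the distortion coefficients mentioned above.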
[0034] In some embodiments, the camera acquiring the reference video stream can be any of a variety of image acquisition devices. Specifically, the simulation system can accept input from standard pinhole cameras as well as large field-of-view, high-distortion input such as that from the fisheye cameras of an around view monitor (AVM) system. This allows for flexible application to driving devices with different sensor configurations.
[0035] Step 120: Perform image recognition on each frame of the reference video stream to obtain the first data stream and the second data stream.
[0036] The first data stream includes image data of static image regions in images corresponding to multiple timestamps; the second data stream includes three-dimensional data of moving objects in images corresponding to multiple timestamps, and the three-dimensional data includes the position information of the moving objects in three-dimensional space.
[0037] The static image region in an image refers to the static background part retained after removing dynamic objects in each frame of the image. It can include road structures (such as lane lines, road markings, curbs, pedestrian crossings, etc.), traffic facilities (such as traffic signs, traffic lights, guardrails, medians, etc.), buildings and structures (such as roadside buildings, bridges, tunnels, toll stations, etc.), natural environment (such as green belts, trees, sky, mountains, etc.), and fixed facilities (such as streetlights, utility poles, bus stops, etc.).
[0038] It should be noted that the division between static image regions and dynamic moving objects can be flexibly adjusted according to the actual application scenario. For example, in a street scene, parked vehicles can be considered static, while vehicles that have just started moving are classified as dynamic.
[0039] Optionally, the moving object may include, but is not limited to, motor vehicles (such as cars, trucks, buses, motorcycles, etc.), non-motor vehicles (such as bicycles, electric bicycles, etc.), pedestrians, animals, and other dynamic objects (such as moving cones, construction signs, etc.).
[0040] In some embodiments, the electronic device may determine the static image region in each frame of the reference video stream based on each frame image, and generate a first data stream based on the static image region in each frame image and the corresponding timestamp; input the reference video stream into the target detection model, identify the moving object in each frame image of the reference video stream through the target detection model, obtain the three-dimensional data of the moving object in each frame image, and generate a second data stream based on the three-dimensional data of the moving object in each frame image and the corresponding timestamp.
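One way to picture the two data streams is as two containers indexed by the same timestamps. The sketch below is an assumption about the data layout, not the structure used in the application: the class name `DecoupledStreams`, integer millisecond timestamps, and the `half_window` parameter for selecting reference frames around a target timestamp are all hypothetical.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class DecoupledStreams:
    """Two timestamp-indexed streams derived from one reference video stream."""
    timestamps: List[int] = field(default_factory=list)                 # ms, ascending
    static_frames: Dict[int, Any] = field(default_factory=dict)         # first data stream
    moving_objects: Dict[int, List[Any]] = field(default_factory=dict)  # second data stream

    def add_frame(self, timestamp: int, static_image: Any, objects_3d: List[Any]) -> None:
        # Frames are assumed to arrive in increasing timestamp order.
        self.timestamps.append(timestamp)
        self.static_frames[timestamp] = static_image
        self.moving_objects[timestamp] = list(objects_3d)

    def reference_frames(self, target_ts: int, half_window: int) -> List[Any]:
        """Static-region data of all frames within half_window of target_ts."""
        lo = bisect_left(self.timestamps, target_ts - half_window)
        hi = bisect_right(self.timestamps, target_ts + half_window)
        return [self.static_frames[t] for t in self.timestamps[lo:hi]]
```

Keeping the streams separate means the background rendering only reads `static_frames`, while object placement only reads `moving_objects` for the target timestamp.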
[0041] Specifically, electronic devices can use semantic segmentation networks to perform pixel-level classification of each frame of images to determine the static image regions of each frame. The semantic segmentation network can be a pre-trained deep learning model on an autonomous driving dataset, such as DeepLabV3+, SegFormer, or OCRNet.
[0042] Taking DeepLabV3+ as an example, this semantic segmentation network extracts multi-scale features from images through an encoder-decoder structure and expands the receptive field through dilated convolution, ultimately outputting pixel-level classification results of the same size as the input image. In the classification results, pixel regions belonging to categories such as roads, lane lines, roadside buildings, traffic signs, green belts, and sky are identified as static image regions; while pixel regions belonging to categories such as vehicles, pedestrians, and cyclists are identified as dynamic object regions. After the semantic segmentation network outputs, electronic devices can generate a binary mask image based on the classification results, retaining the pixel values corresponding to static regions and setting the pixel values corresponding to dynamic regions to preset values (such as 0 or 255), thus obtaining the image data of the static image regions.
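The masking step that follows segmentation can be sketched without any particular network. The example below assumes the per-pixel class labels are already available as a 2D list and that the set of dynamic class IDs is known; the function name `extract_static_region` and the specific class IDs in the usage note are illustrative assumptions.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int, int]


def extract_static_region(
    image: List[List[Pixel]],
    labels: List[List[int]],
    dynamic_classes: Set[int],
    fill: Pixel = (0, 0, 0),
) -> Tuple[List[List[Pixel]], List[List[int]]]:
    """Return (static_image, mask); mask is 1 for static pixels, 0 for dynamic."""
    mask = [[0 if label in dynamic_classes else 1 for label in row] for row in labels]
    # Keep static pixels unchanged and overwrite dynamic pixels with the preset value.
    static_image = [
        [pixel if keep else fill for pixel, keep in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]
    return static_image, mask
```

For instance, if classes 11 and 13 denoted pedestrians and vehicles in some label map, passing `dynamic_classes={11, 13}` would blank exactly those pixels and leave roads, buildings, and sky intact.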
[0043] Alternatively, the electronic device may also use optical flow or inter-frame difference methods for static region identification. For a series of consecutive frames of images, by calculating the pixel motion vectors between adjacent frames, pixel regions with motion vectors less than a preset threshold are identified as static image regions, while regions with larger motion vectors are identified as dynamic object regions.
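The inter-frame difference variant can be sketched as a simple threshold on intensity change between adjacent grayscale frames. This uses intensity difference as a crude proxy for motion rather than true optical-flow vectors, and it does not compensate for the motion of the camera itself; the function name and the default threshold are assumptions.

```python
from typing import List


def static_mask_by_frame_difference(
    prev_gray: List[List[int]],
    curr_gray: List[List[int]],
    threshold: int = 10,
) -> List[List[int]]:
    """Return a mask that is 1 where the intensity change is below threshold
    (treated as static) and 0 elsewhere (treated as dynamic)."""
    return [
        [1 if abs(curr - prev) < threshold else 0 for prev, curr in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_gray, curr_gray)
    ]
```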
[0044] In some embodiments, the target detection model may be a perception model, which may be a vision-based 3D target detection network or a multimodal fusion perception network, capable of directly or indirectly inferring the three-dimensional information of a moving object from a monocular image or a panoramic image.
[0045] In some embodiments, the three-dimensional data includes three-dimensional bounding box information of the moving object; the three-dimensional bounding box information includes the position information, size parameters, and orientation angle of the moving object; wherein, the position information includes the coordinates of the center point of the moving object in three-dimensional space, the size parameters include the length, width, and height of the three-dimensional bounding box, and the orientation angle includes the heading angle of the moving object on the horizontal plane.
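The three-dimensional bounding box information described above maps naturally onto a small record type. The sketch below is one possible layout, assuming a coordinate frame whose x-y plane is horizontal and a heading angle in radians; the class name `Box3D`, the field names, and the `footprint` helper are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box3D:
    center: Tuple[float, float, float]  # center point coordinates (x, y, z)
    length: float
    width: float
    height: float
    yaw: float  # heading angle on the horizontal plane, radians

    def footprint(self) -> List[Tuple[float, float]]:
        """Four corners of the box on the horizontal (x-y) plane."""
        cx, cy, _ = self.center
        cos_y, sin_y = math.cos(self.yaw), math.sin(self.yaw)
        half_l, half_w = self.length / 2, self.width / 2
        offsets = [(half_l, half_w), (half_l, -half_w), (-half_l, -half_w), (-half_l, half_w)]
        # Rotate each corner offset by the heading angle, then translate to the center.
        return [
            (cx + dx * cos_y - dy * sin_y, cy + dx * sin_y + dy * cos_y)
            for dx, dy in offsets
        ]
```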
[0046] Taking the FCOS3D (Fully Convolutional One-Stage 3D Object Detection) model as an example, the image is input into the model, and multi-scale feature maps of the image are extracted through the backbone network (such as ResNet). Then, at each location of the feature map, it is predicted whether the location belongs to the center point projection of a certain moving object. For locations that belong to the center point of the moving object, the network further regresses the depth, three-dimensional dimensions (length, width, height), orientation angle, and center point projection offset of the moving object. Finally, based on the camera intrinsic parameter matrix, the two-dimensional projection points and depth values are back-projected into three-dimensional space to obtain the three-dimensional center point coordinates of each moving object in the camera coordinate system. Combined with the regressed dimensions and orientation, the three-dimensional bounding box information of each moving object in the image is constructed.
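The final back-projection step reduces to inverting the pinhole model. The sketch below shows only that geometry, assuming the depth is measured along the optical axis and lens distortion is ignored; it is not the FCOS3D implementation, and the function name `back_project` is hypothetical.

```python
from typing import Tuple


def back_project(
    u: float, v: float, depth: float,
    fx: float, fy: float, cx: float, cy: float,
) -> Tuple[float, float, float]:
    """Map a projected center point (u, v) with its regressed depth to
    (X, Y, Z) in the camera coordinate system."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return x, y, depth
```

The resulting center point, together with the regressed dimensions and orientation angle, is what makes up the three-dimensional bounding box of each moving object.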
[0047] In some embodiments, the electronic device supports interactive editing of dynamic objects by the user to flexibly generate diverse simulation scenarios, meeting the diverse data requirements of driving model training. As shown in Figure 3, the method may include the following steps 302 to 304.
[0048] Step 302: In response to the editing operation of the 3D bounding box information of the target moving object corresponding to the target timestamp, update the 3D bounding box information of the target moving object in the second data stream.
[0049] The target moving object is any moving object in the image corresponding to the target timestamp. Users can adjust the 3D data of any one or more moving objects in the scene according to the generation requirements of the simulation scene, including but not limited to adjusting the position of the moving object and changing its orientation angle.
[0050] In some embodiments, the electronic device may include a graphical interface for displaying the three-dimensional bounding boxes of various moving objects in the current scene to the user and receiving editing operations from the user on the three-dimensional bounding box information.
[0051] For example, the graphical interface can be the front-end interactive module of a simulation system, running on the display screen of an electronic device, or presented on a user terminal device via a remote connection. Through the graphical interface, users can intuitively observe the spatial layout of various moving objects in the scene and edit and adjust the moving objects in a visual manner.
[0052] For example, in the graphical interface, the 3D bounding box of a moving object can be overlaid on the scene as a semi-transparent wireframe. Each bounding box corresponds to a moving object, and different colors can be used to distinguish the category of the moving object (e.g., blue for cars, green for pedestrians, orange for trucks, yellow for bicycles, etc.). The transparency of the bounding box can be adjusted according to the user's needs to balance the requirements of scene visibility and editing precision. When the user selects a moving object, its corresponding 3D bounding box is highlighted, and a detailed attribute information panel is displayed, including the moving object's unique identifier, category, position information, size parameters, and orientation angle.
[0053] In some embodiments, editing operations include, but are not limited to, position editing, size editing, orientation editing, and addition / deletion editing. Position editing refers to changing the coordinates of the center point of the moving object; size editing refers to modifying the length, width, and height of the 3D bounding box; orientation editing refers to modifying the heading angle of the moving object on the horizontal plane; and addition / deletion editing refers to adding and deleting moving objects.
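For illustration only, the four kinds of editing operations described above can be sketched as follows. The class name `BoundingBox3D`, its field names, and the dictionary layout used for the second data stream (timestamp mapped to object identifier mapped to bounding box) are assumptions made for this sketch and are not specified in this application.

```python
from dataclasses import dataclass, replace
import math


@dataclass(frozen=True)
class BoundingBox3D:
    """3D bounding box of one moving object (field names are hypothetical)."""
    object_id: str
    category: str
    center: tuple   # (x, y, z) coordinates of the center point
    size: tuple     # (length, width, height)
    heading: float  # heading angle on the horizontal plane, in radians


def edit_position(box, new_center):
    """Position editing: change the coordinates of the center point."""
    return replace(box, center=tuple(new_center))


def edit_size(box, new_size):
    """Size editing: change the length, width, and height."""
    if any(s <= 0 for s in new_size):
        raise ValueError("length, width, and height must be positive")
    return replace(box, size=tuple(new_size))


def edit_orientation(box, new_heading):
    """Orientation editing: change the heading angle, wrapped to [-pi, pi)."""
    wrapped = (new_heading + math.pi) % (2.0 * math.pi) - math.pi
    return replace(box, heading=wrapped)


def add_object(stream, timestamp, box):
    """Addition editing: insert a moving object at the given timestamp."""
    stream.setdefault(timestamp, {})[box.object_id] = box


def delete_object(stream, timestamp, object_id):
    """Deletion editing: remove a moving object at the given timestamp."""
    stream.get(timestamp, {}).pop(object_id, None)
```

Because each editing function returns a new box instead of modifying the existing one, the state before and after an edit can both be retained, which is convenient if an editing history is kept.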
[0054] For example, users can select the 3D bounding box of a target moving object in the graphical interface by dragging with a mouse, swiping on a touchscreen, or using gesture control, and then drag the 3D bounding box to a new position in the 3D scene. During the dragging process, the electronic device can display the changes in the coordinates of the center point of the bounding box in real time and provide a snapping function so that the moving object can be automatically aligned with the road surface or the center line of the lane to ensure the rationality of the position modification.
[0055] Step 304: Based on the updated 3D bounding box information, re-execute the steps of obtaining the target 3D bounding box information of the moving object corresponding to the target timestamp from the second data stream, and generating a simulation image of the target timestamp from the target perspective based on the target 3D bounding box information and the target background image, to obtain the edited simulation image.
[0056] After the user completes the editing of the 3D bounding box information of the target moving object, the electronic device re-executes the simulation image generation process based on the updated 3D bounding box information. Since the background image has already been pre-rendered, the electronic device only needs to regenerate based on the updated position information of the moving object and the background image, without repeating the time-consuming 3D reconstruction and background rendering process. This reduces computational overhead, allowing the edited simulation image to be presented in near real-time.
[0057] For example, after a user drags the 3D bounding box of a car from the left side of the road to the right side of the road in the graphical interface, the electronic device immediately re-renders the image of the car onto the corresponding area of the target background image based on the updated position information, and the user can instantly see the simulated image after the vehicle's position has changed.
[0058] In some embodiments, the graphical interface also supports real-time preview of editing operations. While the user adjusts the 3D bounding box information of the target moving object, the electronic device can update the image rendering results in real time and display the edited simulation image in the graphical interface, allowing the user to intuitively observe the changes brought about by each adjustment. This helps users quickly locate the optimal parameter configuration and improves editing efficiency.
[0059] In some embodiments, the graphical interface may also provide editing history and undo functionality. The electronic device can record the moving object, the timestamp, and the 3D bounding box information before and after each user editing operation, and supports undo and redo operations. Users can restore the simulation image corresponding to any earlier editing state using keyboard shortcuts (such as Ctrl+Z to undo, Ctrl+Y to redo) or the history panel, thus meeting the needs of different usage scenarios.
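One possible way to keep such an editing history is a pair of undo and redo stacks, sketched below. The class name `EditHistory`, the helper `_set_state`, and the dictionary layout of the second data stream are assumptions made for this sketch; this application does not prescribe a particular data structure.

```python
def _set_state(stream, timestamp, object_id, state):
    """Write one object's state into the stream; None means the object is absent."""
    frame = stream.setdefault(timestamp, {})
    if state is None:
        frame.pop(object_id, None)
    else:
        frame[object_id] = state


class EditHistory:
    """Records (timestamp, object, before, after) for each editing operation."""

    def __init__(self):
        self._undo_stack = []
        self._redo_stack = []

    def record(self, timestamp, object_id, before, after):
        """Store one edit; a new edit invalidates any pending redo entries."""
        self._undo_stack.append((timestamp, object_id, before, after))
        self._redo_stack.clear()

    def undo(self, stream):
        """Restore the state before the most recent edit. Returns False if none."""
        if not self._undo_stack:
            return False
        entry = self._undo_stack.pop()
        timestamp, object_id, before, _after = entry
        _set_state(stream, timestamp, object_id, before)
        self._redo_stack.append(entry)
        return True

    def redo(self, stream):
        """Re-apply the most recently undone edit. Returns False if none."""
        if not self._redo_stack:
            return False
        entry = self._redo_stack.pop()
        timestamp, object_id, _before, after = entry
        _set_state(stream, timestamp, object_id, after)
        self._undo_stack.append(entry)
        return True
```

After each undo or redo, the simulation image of the affected timestamp would be regenerated from the restored bounding box information and the already rendered background image.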
[0060] Step 130: Obtain image data of static image regions in multiple frames of reference images within the time range of the target timestamp from the first data stream, and render the target background image based on the image data of static image regions in multiple frames of reference images and the target viewpoint.
[0061] In some embodiments, the time range to which the target timestamp belongs may include a time interval centered on the target timestamp, the length of which can be set according to actual application requirements. For example, the time range may be set to an interval of 1 second before and after the target timestamp, or a time span of 5 frames before and after the target timestamp.
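As a minimal sketch of the two examples above, the reference frames can be selected either by a time interval or by a frame count around the target timestamp. The function names and default values below are illustrative assumptions; timestamps are assumed to be sorted and expressed in seconds.

```python
def frames_in_time_window(timestamps, target, half_width_s=1.0):
    """Indices of frames whose timestamp lies within +/- half_width_s of target."""
    return [i for i, t in enumerate(timestamps) if abs(t - target) <= half_width_s]


def frames_in_frame_window(timestamps, target, half_width_frames=5):
    """Indices of the frame nearest to target plus N frames before and after it."""
    if not timestamps:
        return []
    nearest = min(range(len(timestamps)), key=lambda i: abs(timestamps[i] - target))
    lo = max(0, nearest - half_width_frames)
    hi = min(len(timestamps), nearest + half_width_frames + 1)
    return list(range(lo, hi))
```

Near the start or end of the reference video stream the frame window is simply truncated, so fewer reference frames are returned there.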
[0062] The reason for acquiring multiple reference frames within the time range, rather than using only the single frame corresponding to the target timestamp, is that a single frame provides static background information from only one viewpoint, which is insufficient to support high-quality rendering from an arbitrary new viewpoint. By using multiple frames, observations of the same static scene from multiple viewpoints can be obtained, allowing the scene's 3D geometry to be reconstructed and a complete target background image to be rendered from any new viewpoint (the target viewpoint). The greater the viewpoint difference between the multiple reference frames, the more complete the reconstructed 3D structure and the higher the rendering quality.
[0063] The target viewpoint refers to the camera viewpoint that is expected to be rendered. The target viewpoint can be the same as the original acquisition viewpoint or any new viewpoint.
[0064] In some embodiments, as shown in Figure 4, the step of rendering the target background image based on the image data of the static image regions in the multi-frame reference images and the target viewpoint may include steps 402 to 404.
[0065] Step 402: Perform 3D structural reasoning on the image data of static image regions in multiple reference images using a 3D reconstruction model to obtain 3D scene parameters.
[0066] The 3D scene parameters characterize the geometric structure and appearance attributes, in 3D space, of the static scene corresponding to the static image regions of the multiple reference images.
[0067] In this context, the static scene corresponding to the static image region of a multi-frame reference image refers to the same physical environment described by all the reference images after removing dynamic objects. For example, in continuous video frames captured during vehicle movement, static elements such as roads, buildings, and traffic signs may appear at different viewpoints and distances in different frames, but they all correspond to the same physical entity in three-dimensional space. Through joint analysis of multiple frames, the true position and shape of these static elements in three-dimensional space can be recovered.
[0068] Specifically, 3D scene parameters are a mathematical expression of a static scene, capable of describing the geometry (such as point clouds, meshes, and depth values) and appearance features (such as color, texture, and reflectivity) of each spatial location within the static scene. They can integrate static image regions from multiple frames of 2D reference images into a unified 3D static scene representation, enabling the simulation system to render a projected image of the static scene from any new perspective, not just the original acquired viewpoint.
[0069] 3D reconstruction models refer to deep learning models or traditional geometric algorithms used to infer 3D structures from 2D images. 3D reconstruction models include feed-forward 3D Gaussian Splatting, which can quickly predict the 3D Gaussian representation of a scene from multi-view images in a single forward propagation, avoiding the high time consumption of traditional scene-by-scene optimization training.
[0070] The 3D scene parameters include multiple Gaussian functions and their corresponding Gaussian parameters. A Gaussian function is a probability distribution function in 3D space; each Gaussian function represents a small local region in the static scene. Gaussian parameters may include center position, covariance matrix, opacity, and spherical harmonic coefficients, etc.
[0071] The center position describes the location of the tiny local region corresponding to the Gaussian function in space; the covariance matrix describes the shape and size of this region; the opacity describes the degree to which this region affects light; and the spherical harmonic coefficient describes the color appearance of this region from different viewpoints. By combining a large number of Gaussian functions, the geometric structure and appearance attributes of complex scenes can be accurately fitted.
[0072] In some embodiments, because the static image regions of the multi-frame reference images contain a large amount of data and the 3D scene parameters must be inferred efficiently to support real-time rendering, as shown in Figure 5, the step of performing 3D structural reasoning on the image data of the static image regions in the multiple reference images using a 3D reconstruction model to obtain the 3D scene parameters may include steps 502 to 506.
[0073] Step 502: Input the image data of the static image regions in the multi-frame reference images into the feedforward 3D Gaussian Splatting model.
[0074] Because the feedforward 3D Gaussian Splatting model learns the mapping from 2D images to 3D Gaussian representations through a neural network, it can predict the 3D scene parameters in a single pass during the inference phase. In contrast, traditional methods (such as NeRF-based methods) require tens of minutes or even hours of iterative optimization for each new scene, which makes it difficult to meet the need of simulation systems to adapt rapidly to diverse scenes.
[0075] Furthermore, the feedforward 3D Gaussian Splatting model learns from a large amount of scene data during the training phase and acquires prior knowledge of the mapping from image features to 3D structures, which enables it to generalize quickly to unseen scenes during inference. Therefore, the electronic device can quickly obtain accurate 3D scene parameters by inputting the image data of the static image regions from the multiple reference images into the feedforward 3D Gaussian Splatting model.
[0076] Step 504: Using the forward propagation network of the feedforward 3D Gaussian Splatting model, perform feature extraction on the image data of the static image regions in the multi-frame reference images to obtain a static feature map.
[0077] Forward propagation networks typically employ convolutional neural network (CNN) or Vision Transformer (ViT) architectures to extract multi-scale visual features from input images.
[0078] Specifically, the electronic device can extract features from the multiple reference images using the forward propagation network of the feedforward 3D Gaussian Splatting model, and then fuse the features from the multiple frames using a cross-view attention mechanism or a cost volume to generate a unified static feature map.
[0079] For example, for the input multi-frame reference images (assuming N frames), the forward propagation network first extracts the features of each reference image using a feature extractor with shared weights (such as ResNet, Swin Transformer, etc.), resulting in N two-dimensional feature maps. Then, these two-dimensional feature maps are aggregated in three-dimensional space through plane sweeping or cost volume construction.
[0080] The plane-sweep method discretizes the 3D space into multiple depth planes. For each depth plane, the features of each frame are projected onto a reference viewpoint via that depth plane to form a feature volume. Then, a 3D convolutional network is used to aggregate the feature volumes to obtain a 3D feature volume. Finally, this 3D feature volume is compressed along the depth direction or fused through an attention mechanism to obtain the final static feature map.
[0081] Optionally, the static feature map can be a two-dimensional feature map (corresponding to a certain reference viewpoint) or a three-dimensional feature volume (corresponding to a grid in three-dimensional space). It should be noted that the feature vector at each location in the static feature map encodes the geometric and appearance information of that location in three-dimensional space, providing a basis for subsequent Gaussian parameter prediction.
[0082] Step 506: Through the prediction branches of the feedforward 3D Gaussian Splatting model, perform parameter prediction on one or more Gaussian functions corresponding to each pixel position in the static feature map to obtain the Gaussian parameters corresponding to each Gaussian function.
[0083] The prediction branches are multiple independent prediction heads in the feedforward 3D Gaussian Splatting model, each of which is responsible for predicting a different parameter of the Gaussian function. Since the parameters of the Gaussian function have high dimensionality and different physical meanings, using multiple independent branches for separate prediction can reduce the learning difficulty and improve the prediction accuracy.
[0084] In some embodiments, the prediction branches may include, but are not limited to, depth prediction branches, covariance prediction branches, opacity prediction branches, and color prediction branches.
[0085] Specifically, electronic devices can use a depth prediction branch to predict the depth value of the corresponding 3D point relative to the optical center of a reference camera for each pixel location in the static feature map. This depth value, combined with the pixel's 2D coordinates and camera intrinsic parameters, can be back-projected to obtain the point coordinates in 3D space, which are the center points of the Gaussian functions. To improve modeling capabilities, multiple depth values can be predicted for each pixel location, corresponding to multiple Gaussian functions, to represent scene structures at different depth levels (such as foreground objects and background walls).
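The back-projection described above can be sketched as follows, assuming a pinhole camera model and assuming that the predicted depth is measured along the camera's optical axis (the z-coordinate in the camera coordinate system). The function names and the intrinsic parameter names `fx`, `fy`, `cx`, `cy` are illustrative and are not taken from this application.

```python
def backproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with the given depth to a 3D point.

    The result is expressed in the reference camera's coordinate system and
    would serve as the center position of one Gaussian function.
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def gaussian_centers_for_pixel(u, v, depths, fx, fy, cx, cy):
    """One pixel may carry several predicted depths, one per Gaussian function."""
    return [backproject_pixel(u, v, d, fx, fy, cx, cy) for d in depths]
```

For example, a pixel with predicted depths of 5 and 20 yields two Gaussian centers on the same viewing ray, which could correspond to a foreground object and a background wall.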
[0086] Specifically, electronic devices can use the covariance prediction branch to predict the covariance matrix parameters for each Gaussian function corresponding to each pixel location in the static feature map. The covariance matrix describes the shape and orientation of the Gaussian function, typically in an anisotropic form, and is constructed by predicting rotation matrices and scaling vectors. The accuracy of the covariance matrix directly affects the geometric precision and texture sharpness of the rendered image.
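As a sketch of how a covariance matrix can be constructed from a rotation matrix R and a scaling vector s, one common factorization is Σ = (R·S)(R·S)ᵀ with S = diag(s), which keeps Σ symmetric and positive semi-definite. Using this particular factorization is an assumption of the sketch, and the helper names are hypothetical.

```python
import math


def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose3(a):
    """Transpose of a 3x3 matrix."""
    return [[a[j][i] for j in range(3)] for i in range(3)]


def covariance_from_rotation_scale(rotation, scale):
    """Anisotropic covariance: Sigma = (R S)(R S)^T with S = diag(scale)."""
    s = [[scale[0], 0.0, 0.0],
         [0.0, scale[1], 0.0],
         [0.0, 0.0, scale[2]]]
    m = matmul3(rotation, s)
    return matmul3(m, transpose3(m))


def rotation_about_z(angle):
    """Rotation matrix for a rotation of `angle` radians about the z-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
```

With the identity rotation and scale (1, 2, 3), the result is diag(1, 4, 9); rotating by 90 degrees about the z-axis exchanges the extents along x and y.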
[0087] Specifically, electronic devices can use an opacity prediction branch to predict the opacity value of each Gaussian function corresponding to each pixel location in the static feature map. The value typically ranges from 0 to 1. Opacity represents the degree of influence of the Gaussian function on light and determines its color contribution weight during rendering. Gaussian functions with higher opacity correspond to entities in a static scene, while those with lower opacity correspond to semi-transparent or empty areas.
[0088] Specifically, electronic devices can use a color prediction branch to predict the color parameters of each Gaussian function corresponding to each pixel position in the static feature map. To improve rendering efficiency, spherical harmonics are typically used to represent color, allowing the color changes of the Gaussian function to be calculated based on the viewing direction when rendering images from different perspectives, thus simulating visual effects such as specular reflection and gloss. The order of the spherical harmonics can be selected according to the accuracy requirements; for example, using third-order spherical harmonics can well represent diffuse reflection and slight specular reflection effects.
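To illustrate how spherical harmonic coefficients yield a view-dependent color, the sketch below evaluates only the degree-0 and degree-1 terms. The sign convention of the degree-1 terms, the +0.5 offset, and the clamping at zero follow a convention commonly seen in open-source Gaussian splatting renderers; they are assumptions of this sketch rather than details given in this application, and higher-order terms are omitted for brevity.

```python
import math

C0 = 0.5 * math.sqrt(1.0 / math.pi)     # real SH basis constant, degree 0
C1 = math.sqrt(3.0 / (4.0 * math.pi))   # real SH basis constant, degree 1


def sh_to_rgb(coeffs, view_dir):
    """Evaluate a color from spherical harmonic coefficients.

    coeffs:   four (r, g, b) triples: the degree-0 term followed by the three
              degree-1 terms (layout assumed for this sketch).
    view_dir: direction from the camera to the Gaussian center (any length).
    """
    norm = math.sqrt(sum(c * c for c in view_dir))
    if norm == 0.0:
        raise ValueError("view direction must be non-zero")
    x, y, z = (c / norm for c in view_dir)
    rgb = []
    for ch in range(3):
        value = C0 * coeffs[0][ch]
        value += (-C1 * y * coeffs[1][ch]
                  + C1 * z * coeffs[2][ch]
                  - C1 * x * coeffs[3][ch])
        rgb.append(max(0.0, value + 0.5))
    return tuple(rgb)
```

When all degree-1 coefficients are zero the color is identical from every direction, which corresponds to purely diffuse appearance; non-zero higher-degree coefficients make the color change with the viewing direction.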
[0089] In some embodiments, since the 3D scene parameters predicted by the feedforward 3D Gaussian Splatting model in a single forward propagation may contain certain geometric errors, especially when the input viewpoints are relatively sparse, the electronic device can optimize the 3D scene parameters after obtaining them and before rendering, in order to improve geometric accuracy. The 3D scene parameters include the spatial position parameters of the static scene in 3D space.
[0090] Specifically, as shown in Figure 6, the optimization process performed by the electronic device on the 3D scene parameters may include steps 602 to 604.
[0091] Step 602: Obtain the point cloud data corresponding to the multiple reference images.
[0092] In some embodiments, the electronic device may acquire lidar point cloud data synchronously recorded during the acquisition of a reference video stream. This point cloud data may be acquired by an onboard lidar sensor and contains high-precision three-dimensional spatial point coordinates.
[0093] Optionally, the point cloud data includes coordinate information for multiple 3D spatial points. Each 3D spatial point contains its x, y, and z coordinates in 3D space. These coordinates are typically in a world coordinate system or a vehicle coordinate system, sharing a unified coordinate system with the images in the reference video stream.
[0094] Optionally, the point cloud data includes the target depth value corresponding to each 3D spatial point. This depth value represents the distance of the 3D spatial point in the point cloud relative to a reference camera or reference plane. The depth value can be calculated from 3D coordinates through coordinate transformation. For example, by transforming a point in the point cloud to the camera coordinate system, its z-coordinate is the depth value of that point relative to the camera's optical center.
[0095] Step 604: Using point cloud data as prior information, constrain and adjust the spatial location parameters to optimize the geometric structure of the static scene in three-dimensional space as represented by the three-dimensional scene parameters.
[0096] Since the point cloud data is a geometric measurement of the real scene, obtained either by direct LiDAR acquisition or by accurate calculation with a structure-from-motion reconstruction algorithm, the coordinate information of the 3D spatial points it contains is highly accurate and reliable and reflects the actual position of object surfaces in the scene. Therefore, the point cloud data corresponding to the multiple reference images is used as prior data to constrain and correct the spatial position parameters in the 3D scene parameters, so that the optimized 3D scene parameters conform more closely to the actual physical structure. This can correct the geometric deviations that may occur in the feedforward 3D Gaussian Splatting model under sparse-view input, such as depth estimation errors and surface position offsets.
[0097] In some embodiments, the electronic device may determine the target correspondence between multiple three-dimensional spatial points in the point cloud data corresponding to the first reference image and each pixel in the first reference image based on the camera parameters corresponding to the first reference image; the first reference image can be any reference image; a rendering depth map corresponding to the first reference image is determined based on the three-dimensional scene parameters and the viewpoint corresponding to the first reference image; a rendering depth value corresponding to each three-dimensional spatial point is determined based on the target correspondence and the rendering depth map; the depth deviation between the target depth value and the corresponding rendering depth value corresponding to each three-dimensional spatial point is calculated to obtain the depth deviation corresponding to the first reference image; a constraint loss is constructed based on the depth deviations corresponding to multiple reference images, and the spatial position parameters are optimized through backpropagation to obtain the optimized spatial position parameters.
[0098] Optionally, the camera parameters include intrinsic parameters (focal length, principal point coordinates) and extrinsic parameters (rotation matrix, translation vector). The camera projection matrix can project three-dimensional spatial points onto the image plane, thereby establishing the correspondence between three-dimensional points and image pixels.
[0099] Specifically, for each 3D spatial point in the point cloud data, the electronic device can project it onto the first reference image to obtain the pixel coordinates corresponding to each 3D spatial point, thereby determining the target correspondence. It should be noted that since the point cloud and the image may not be perfectly aligned (e.g., the installation positions of the LiDAR and the camera are different), coordinate system calibration and extrinsic parameter calibration can be performed before establishing the correspondence to ensure the accuracy of the projection.
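The projection used to establish the target correspondence can be sketched as follows, assuming a pinhole camera whose extrinsic parameters map world coordinates to camera coordinates as X_cam = R·X_world + t. The function name and parameter names are illustrative. The returned depth is the z-coordinate of the point in the camera coordinate system, which can serve as the target depth value of that point.

```python
def project_point(point_world, rotation, translation, fx, fy, cx, cy, width, height):
    """Project a 3D point onto the image plane.

    Returns (u, v, depth), or None if the point lies behind the camera or
    falls outside the image bounds.
    """
    # Extrinsic parameters: world coordinates -> camera coordinates.
    xc, yc, zc = (
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    if zc <= 0:
        return None  # behind the camera, no valid correspondence
    # Intrinsic parameters: camera coordinates -> pixel coordinates.
    u = fx * xc / zc + cx
    v = fy * yc / zc + cy
    if not (0 <= u < width and 0 <= v < height):
        return None  # outside the field of view
    return (u, v, zc)
```

Points that return None have no corresponding pixel in the first reference image and would simply be excluded from the depth deviation for that image.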
[0100] Specifically, when rendering an image from the viewpoint of the first reference image, the electronic device calculates the contribution of each Gaussian function to each pixel it covers, performs alpha blending in order of depth from near to far, and retains, as the rendering depth value of that pixel, the depth value of the nearest Gaussian function at which the accumulated opacity reaches a threshold. This generates a rendering depth map with the same resolution as the first reference image, in which the rendering depth value of each pixel represents the estimated depth, relative to the camera, of the 3D scene point corresponding to that pixel.
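A per-pixel sketch of this rendering depth rule is given below. It assumes that the Gaussian functions covering the pixel have already been reduced to (depth, alpha) pairs, where alpha is the effective opacity of the Gaussian at that pixel; the function name and the threshold value of 0.5 are assumptions made for illustration.

```python
def render_pixel_depth(contributions, threshold=0.5):
    """Rendering depth of one pixel by front-to-back alpha blending.

    contributions: iterable of (depth, alpha) pairs for the Gaussian functions
                   covering this pixel, with alpha in [0, 1].
    Returns the depth of the nearest Gaussian at which the accumulated opacity
    reaches the threshold, or None if it never does (an empty region).
    """
    accumulated = 0.0
    transmittance = 1.0  # fraction of light not yet absorbed by nearer Gaussians
    for depth, alpha in sorted(contributions):  # near to far
        accumulated += transmittance * alpha
        transmittance *= 1.0 - alpha
        if accumulated >= threshold:
            return depth
    return None
```

Applying this rule to every pixel of the first reference image yields a rendering depth map of the same resolution; pixels for which None is returned would carry no depth constraint.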
[0101] It should be noted that the depth deviation can be an absolute deviation (|target depth - rendering depth|), a relative deviation (|target depth - rendering depth| / target depth) or a squared deviation ((target depth - rendering depth)²), and the specific form can be selected according to the actual needs.
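The three forms of depth deviation listed above can be written as one small function. The function name and the `kind` argument are illustrative.

```python
def depth_deviation(target_depth, rendering_depth, kind="absolute"):
    """Deviation between a target depth value and a rendering depth value."""
    diff = target_depth - rendering_depth
    if kind == "absolute":
        return abs(diff)
    if kind == "relative":
        if target_depth <= 0:
            raise ValueError("relative deviation requires a positive target depth")
        return abs(diff) / target_depth
    if kind == "squared":
        return diff * diff
    raise ValueError(f"unknown deviation kind: {kind}")
```

The relative form weights near and far points more evenly, since an error of a given size matters less at long range, whereas the squared form penalizes large errors more heavily.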
[0102] Specifically, the electronic device can summarize the depth deviations of multiple reference images to construct a total loss function, such as L1 loss or L2 loss; calculate the gradient of the total loss function with respect to the spatial location parameters of each Gaussian function, and update the spatial location parameters along the gradient descent direction.
[0103] Specifically, electronic devices can employ optimization algorithms such as stochastic gradient descent or Adam, with appropriate learning rates and iteration counts. Through multiple rounds of iterative optimization, the spatial position parameters gradually converge, minimizing the difference between the target depth value and the corresponding rendered depth value for each 3D spatial point, thereby improving the geometric accuracy of 3D reconstruction.
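The iterative update can be illustrated with a deliberately simplified sketch. Here each Gaussian function is reduced to a single depth parameter that is assumed to equal its rendering depth, so the gradient of the L2 loss can be written in closed form; in an actual system the gradient would instead be obtained by backpropagation through the differentiable renderer. The function names, learning rate, and iteration count are assumptions made for this sketch.

```python
def l2_depth_loss(depths, target_depths):
    """Mean squared deviation between rendering depths and target depths."""
    n = len(depths)
    return sum((d - t) ** 2 for d, t in zip(depths, target_depths)) / n


def optimize_depths(initial_depths, target_depths, learning_rate=0.1, iterations=200):
    """Plain gradient descent on the L2 depth loss (simplified stand-in)."""
    depths = list(initial_depths)
    n = len(depths)
    for _ in range(iterations):
        # d(loss)/d(depth_i) = 2 * (depth_i - target_i) / n
        grads = [2.0 * (d - t) / n for d, t in zip(depths, target_depths)]
        # Step against the gradient to reduce the loss.
        depths = [d - learning_rate * g for d, g in zip(depths, grads)]
    return depths
```

Each iteration shrinks the remaining depth deviation by a constant factor, so the spatial position parameters converge toward the values supported by the point cloud data.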
[0104] Step 404: Render the static scene according to the target viewpoint and 3D scene parameters to obtain the target background image.
[0105] The target background image is a two-dimensional projection image of the static scene from the target viewpoint.
[0106] In some embodiments, the electronic device can render the static scene characterized by the 3D scene parameters according to the target viewpoint to obtain a target background image and a corresponding depth map; the depth map includes the depth value, in the static scene, of each pixel in the background image. The target position information, the target background image, and the depth map are input into the generation model, and the generation model generates a simulation image corresponding to the target timestamp from the target viewpoint based on the target position information, the target background image, and the depth map.
[0107] Step 140: Obtain the target position information of the moving object corresponding to the target timestamp from the second data stream, and generate a simulation image of the target timestamp from the target viewpoint based on the target position information and the target background image.
[0108] The simulation image includes the target background image and various moving objects embedded in the target background image. By fusing the static background with the dynamic objects, the generated simulation image not only preserves the geometric structure and environmental texture of the scene, but also flexibly presents moving objects with different layouts, categories, and postures, thereby meeting the needs of autonomous driving model training for diverse scene data.
[0109] In some embodiments, the electronic device can use the generation model to determine the embedding region of each moving object in the target background image based on the target position information; generate, under the spatial constraints of the depth map, image content of each moving object aligned with the background image in the corresponding embedding region; and fuse the image content corresponding to each moving object into the corresponding embedding region to obtain a simulation image of the target viewpoint corresponding to the target timestamp.
[0110] Optionally, the generation model can be a lightweight neural network that integrates the moving object into the target background image naturally and realistically. The generation model can employ conditional generative adversarial networks (cGANs), diffusion models, or U-Net architectures, etc., and learns the light and shadow interaction between the moving object and the background environment during training, so that the generated moving object blends naturally into the background and produces correct shadow, reflection, and occlusion effects.
[0111] In one specific embodiment, as shown in Figure 7, the specific process for generating the simulation image is as follows.
[0112] The electronic device first acquires a reference video stream, which includes images corresponding to multiple timestamps. Then, it performs image recognition on each frame of the reference video stream to obtain a first data stream and a second data stream. The first data stream includes image data of static image regions from the images corresponding to the multiple timestamps. The second data stream includes three-dimensional data of moving objects from the images corresponding to the multiple timestamps.
[0113] When it is necessary to generate a simulation image corresponding to the target timestamp, the electronic device obtains image data of the static image region in the multi-frame reference images within the time range of the target timestamp from the first data stream; based on the image data of the static image region in the multi-frame reference images, the three-dimensional scene parameters are obtained through inference by the three-dimensional reconstruction model; based on the three-dimensional scene parameters and the target viewpoint, the target background image is rendered.
[0114] Simultaneously, the electronic device obtains the target position information of the moving objects corresponding to the target timestamp from the second data stream. Based on the target position information and the target background image, the electronic device uses the generation model to generate a simulation image from the target viewpoint corresponding to the target timestamp. The simulation image includes the target background image and the moving objects embedded in the target background image.
[0115] In this embodiment, the electronic device performs image recognition on each frame of a reference video stream to obtain a first data stream containing static image region data and a second data stream containing 3D data of moving objects. Then, during the generation of the simulated image, a target background image from the target perspective is rendered based on the static image region data from the first data stream. Since the static image region data in the first data stream is an abstract representation of the static background in the scene, this rendering process does not require scene-by-scene training for different static scenes. This allows the simulation system to directly adapt to the scene corresponding to any newly input reference video stream, possessing cross-scene generalization capabilities. Simultaneously, the target position information of the moving object is obtained from the second data stream and combined with the target background image to generate the simulated image. This ensures that the generation of dynamic objects does not depend on the image content of a specific scene; only 3D data is needed to flexibly embed it into different background scenes. Therefore, without additional training for each new scene, simulated images from different perspectives and timestamps can be generated quickly and accurately, improving the generalization and applicability of simulated image generation.
[0116] As shown in Figure 8, in one embodiment, a simulated image generation apparatus 800 is provided, which can be applied to the aforementioned electronic device. The simulated image generation apparatus 800 may include an acquisition module 810, a decoupling module 820, a rendering module 830, and a generation module 840.
[0117] The acquisition module 810 is used to acquire a reference video stream, which includes images corresponding to multiple timestamps.
[0118] The decoupling module 820 is used to perform image recognition on each frame of the reference video stream to obtain a first data stream and a second data stream. The first data stream includes image data of static image regions in images corresponding to multiple timestamps. The second data stream includes three-dimensional data of moving objects in images corresponding to multiple timestamps, and the three-dimensional data includes the position information of the moving objects in three-dimensional space.
[0119] The rendering module 830 is used to obtain image data of static image regions in multiple frame reference images within the time range of the target timestamp from the first data stream, and to render the target background image based on the image data of static image regions in multiple frame reference images and the target viewpoint.
[0120] The generation module 840 is used to obtain the target position information of the moving object corresponding to the target timestamp from the second data stream, and generate a simulation image from the target perspective corresponding to the target timestamp based on the target position information and the target background image.
[0121] In some embodiments, the decoupling module 820 is further configured to: determine the static image region in each frame image based on each frame image in the reference video stream; generate a first data stream based on the static image region in each frame image and the corresponding timestamp; input the reference video stream into the target detection model; identify the moving objects in each frame image in the reference video stream through the target detection model; obtain the three-dimensional data of the moving objects in each frame image; and generate a second data stream based on the three-dimensional data of the moving objects in each frame image and the corresponding timestamp.
[0122] Optionally, the three-dimensional data includes the three-dimensional bounding box information of the moving object; the three-dimensional bounding box information includes the position information, size parameters, and orientation angle of the moving object; wherein, the position information includes the coordinates of the center point of the moving object in three-dimensional space, the size parameters include the length, width, and height of the three-dimensional bounding box, and the orientation angle includes the heading angle of the moving object on the horizontal plane.
[0123] In some embodiments, the simulated image generation apparatus 800 further includes an editing module.
[0124] The editing module is used to update the 3D bounding box information of the target moving object in the second data stream in response to the editing operation of the 3D bounding box information of the target moving object corresponding to the target timestamp; the target moving object is any moving object in the image corresponding to the target timestamp.
[0125] Optionally, the generation module 840 is further configured to re-execute the steps of obtaining the target three-dimensional bounding box information of the moving object corresponding to the target timestamp from the second data stream based on the updated three-dimensional bounding box information, and generating a simulation image of the target timestamp from the target viewpoint based on the target three-dimensional bounding box information and the target background image, so as to obtain the edited simulation image.
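The edit-then-regenerate flow can be sketched as follows. The layout of the second data stream (timestamp, then object identifier, then box), the `Box3D` record, and the `generate` callback are hypothetical; `generate` merely stands in for the generation step performed by the generation module 840.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Box3D:
    center: Tuple[float, float, float]  # center point in three-dimensional space
    size: Tuple[float, float, float]    # length, width, height
    yaw: float                          # heading angle on the horizontal plane


SecondStream = Dict[float, Dict[str, Box3D]]  # timestamp -> object id -> box (assumed layout)


def edit_and_regenerate(
    stream: SecondStream,
    timestamp: float,
    object_id: str,
    generate: Callable[[float, List[Box3D]], object],
    **changes,
):
    """Apply an editing operation to one box, then re-run generation for that timestamp."""
    boxes = stream[timestamp]
    # Update the stored bounding box information with the edited fields.
    boxes[object_id] = replace(boxes[object_id], **changes)
    # Re-execute the generation step using the updated second data stream.
    return generate(timestamp, list(boxes.values()))
```

Only the second data stream is modified here; the target background image rendered from the first data stream can be reused unchanged, which is what makes this kind of editing inexpensive.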
[0126] In some embodiments, the rendering module 830 is further configured to perform three-dimensional structural reasoning on the image data of the static image region in the multi-frame reference images through a three-dimensional reconstruction model to obtain three-dimensional scene parameters. The three-dimensional scene parameters characterize the geometric structure and appearance attributes, in three-dimensional space, of the static scene corresponding to the static image region of the multi-frame reference images. The static scene is rendered according to the target viewpoint and the three-dimensional scene parameters to obtain a target background image. The target background image is a two-dimensional projection image of the static scene under the target viewpoint.
[0127] Optionally, the 3D reconstruction model includes a feedforward 3D Gaussian splatting model; the 3D scene parameters include multiple Gaussian functions and the Gaussian parameters corresponding to each Gaussian function; and the 3D scene parameters include the spatial position parameters of the static scene in 3D space.
[0128] Optionally, the rendering module 830 is further configured to input image data of static image regions in multi-frame reference images into a feedforward 3D Gaussian splatting model; extract features from the image data of static image regions in multi-frame reference images through the forward propagation network of the feedforward 3D Gaussian splatting model to obtain a static feature map; and predict the parameters of one or more Gaussian functions corresponding to each pixel position in the static feature map through each prediction branch of the feedforward 3D Gaussian splatting model to obtain the Gaussian parameters corresponding to each Gaussian function.
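A toy sketch of per-pixel parameter prediction is given below, under several assumptions not stated in the text: each prediction branch is reduced to a linear map over the pixel's feature vector, one Gaussian is predicted per pixel, the Gaussian center is obtained by unprojecting the pixel along its camera ray with a predicted depth, and the activations (exponential for depth and scale, sigmoid for opacity and color, normalization for the rotation quaternion) follow common practice rather than anything the application specifies. The `heads` dictionary and all function names are hypothetical.

```python
import math


def _dot(weights, feature):
    return sum(w * f for w, f in zip(weights, feature))


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_gaussian(feature, heads, col, row, fx, fy, cx, cy):
    """Map one pixel's feature vector to the parameters of one Gaussian function."""
    depth = math.exp(_dot(heads["depth"], feature))  # positive depth along the pixel ray
    # Unproject the pixel to a 3D point in camera coordinates (pinhole model).
    mean = ((col - cx) / fx * depth, (row - cy) / fy * depth, depth)
    scale = tuple(math.exp(_dot(w, feature)) for w in heads["scale"])  # 3 positive scales
    quat = [_dot(w, feature) for w in heads["rotation"]]               # 4 raw components
    norm = math.sqrt(sum(q * q for q in quat))
    # Normalize to a unit quaternion; fall back to the identity rotation if degenerate.
    rotation = tuple(q / norm for q in quat) if norm > 0 else (1.0, 0.0, 0.0, 0.0)
    opacity = _sigmoid(_dot(heads["opacity"], feature))
    color = tuple(_sigmoid(_dot(w, feature)) for w in heads["color"])  # RGB in (0, 1)
    return {"mean": mean, "scale": scale, "rotation": rotation,
            "opacity": opacity, "color": color}


def predict_gaussians(feature_map, heads, fx, fy, cx, cy):
    """Run every prediction branch at every pixel position of the static feature map."""
    return [predict_gaussian(feature, heads, col, row, fx, fy, cx, cy)
            for row, line in enumerate(feature_map)
            for col, feature in enumerate(line)]
```

Since all parameters come out of a single forward pass, no per-scene optimization loop is required, which is consistent with the cross-scene generalization described in the embodiments.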
[0129] Optionally, the rendering module 830 is also used to acquire point cloud data corresponding to multiple reference images, the point cloud data including the coordinate information of multiple three-dimensional spatial points; and to use the point cloud data as prior information to constrain and adjust the spatial position parameters in order to optimize the geometric structure of the static scene represented by the three-dimensional scene parameters in three-dimensional space.
[0130] Optionally, the point cloud data includes the target depth value corresponding to each 3D spatial point.
[0131] In some embodiments, the rendering module 830 is further configured to: determine the target correspondence between multiple three-dimensional spatial points in the point cloud data corresponding to the first reference image and each pixel in the first reference image, based on the camera parameters corresponding to the first reference image; the first reference image can be any reference image; determine the rendering depth map corresponding to the first reference image based on the three-dimensional scene parameters and the viewpoint corresponding to the first reference image; determine the rendering depth value corresponding to each three-dimensional spatial point based on the target correspondence and the rendering depth map; calculate the depth deviation between the target depth value and the corresponding rendering depth value corresponding to each three-dimensional spatial point to obtain the depth deviation corresponding to the first reference image; construct a constraint loss based on the depth deviations corresponding to multiple reference images respectively, and optimize the spatial position parameters through backpropagation to obtain the optimized spatial position parameters.
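The depth-deviation computation can be sketched as below. The sketch assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, points already expressed in the camera coordinate frame of the reference image, the z coordinate serving as the target depth value, nearest-pixel lookup in the rendering depth map, and a mean absolute deviation as the constraint loss. None of these choices is fixed by the text, and the backpropagation step is omitted because it would require an automatic differentiation framework.

```python
def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point to (column, row, depth)."""
    x, y, z = point
    return fx * x / z + cx, fy * y / z + cy, z


def depth_deviation(points, rendered_depth, fx, fy, cx, cy):
    """Mean absolute deviation between target and rendered depth for one reference image."""
    height, width = len(rendered_depth), len(rendered_depth[0])
    total, count = 0.0, 0
    for p in points:
        if p[2] <= 0:
            continue  # point lies behind the camera, no pixel corresponds to it
        u, v, target_depth = project(p, fx, fy, cx, cy)
        col, row = int(round(u)), int(round(v))  # target correspondence: point -> pixel
        if 0 <= row < height and 0 <= col < width:
            total += abs(target_depth - rendered_depth[row][col])
            count += 1
    return total / count if count else 0.0


def constraint_loss(per_image_inputs):
    """Average the depth deviations over all reference images."""
    deviations = [depth_deviation(*args) for args in per_image_inputs]
    return sum(deviations) / len(deviations) if deviations else 0.0
```

Minimizing this value with respect to the spatial position parameters pulls the rendered geometry toward the measured point cloud at exactly the pixels where measurements exist.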
[0132] Optionally, the rendering module 830 is also used to render the static scene represented by the three-dimensional scene parameters according to the target viewpoint to obtain the target background image and the corresponding depth map; the depth map includes the depth value of each pixel in the background image in the static scene; the target position information, the target background image and the depth map are input into the generation model, and the generation model generates a simulation image corresponding to the target timestamp under the target viewpoint according to the target position information, the target background image and the depth map.
[0133] In some embodiments, the generation module 840 is further configured to determine the embedding region of each moving object in the target background image based on the target position information using the generation model; generate image content of each moving object aligned with the background image in the corresponding embedding region under the spatial constraints of the depth map; fuse the image content corresponding to each moving object into the corresponding embedding region to obtain a simulation image from the target viewpoint corresponding to the target timestamp; the simulation image includes the target background image and each moving object embedded in the target background image.
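The geometric part of this step can be sketched as follows. The sketch assumes a pinhole camera and box corners already expressed in camera coordinates, takes the embedding region to be the image-space rectangle enclosing the projected corners, and replaces the generation model with a simple paste of pre-made object pixels at a single object depth. That substitution is made only to show how the depth map can act as a spatial constraint; it is not the generation model of the application.

```python
import math


def embedding_region(corners_cam, fx, fy, cx, cy, width, height):
    """Rectangle (col0, row0, col1, row1), end-exclusive, enclosing the projected box corners."""
    us, vs = [], []
    for x, y, z in corners_cam:
        if z <= 0:
            continue  # skip corners behind the camera
        us.append(fx * x / z + cx)
        vs.append(fy * y / z + cy)
    if not us:
        return None
    col0 = max(0, math.floor(min(us)))
    col1 = min(width, math.floor(max(us)) + 1)
    row0 = max(0, math.floor(min(vs)))
    row1 = min(height, math.floor(max(vs)) + 1)
    if col0 >= col1 or row0 >= row1:
        return None  # the box projects entirely outside the image
    return col0, row0, col1, row1


def fuse(background, depth_map, object_pixels, object_depth, region):
    """Paste object pixels into the region wherever the object is nearer than the static scene."""
    out = [row[:] for row in background]
    col0, row0, col1, row1 = region
    for r in range(row0, row1):
        for c in range(col0, col1):
            if object_depth < depth_map[r][c]:  # static scene does not occlude the object here
                out[r][c] = object_pixels[r - row0][c - col0]
    return out
```

Pixels where the static scene is closer to the camera than the object keep their background value, so a moving object placed behind a static structure is partly hidden rather than drawn over it.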
[0134] In this embodiment, the electronic device performs image recognition on each frame of a reference video stream to obtain a first data stream containing static image region data and a second data stream containing 3D data of moving objects. Then, during the generation of the simulated image, a target background image from the target perspective is rendered based on the static image region data from the first data stream. Since the static image region data in the first data stream is an abstract representation of the static background in the scene, this rendering process does not require scene-by-scene training for different static scenes. This allows the simulation system to directly adapt to the scene corresponding to any newly input reference video stream, possessing cross-scene generalization capabilities. Simultaneously, the target position information of the moving object is obtained from the second data stream and combined with the target background image to generate the simulated image. This ensures that the generation of dynamic objects does not depend on the image content of a specific scene; only 3D data is needed to flexibly embed it into different background scenes. Therefore, without additional training for each new scene, simulated images from different perspectives and timestamps can be generated quickly and accurately, improving the generalization and applicability of simulated image generation.
[0135] Figure 9 is a structural block diagram of an electronic device in one embodiment. As shown in Figure 9, the electronic device 900 may include one or more of the following components: a processor 910 and a memory 920 coupled to the processor 910, wherein the memory 920 may store one or more computer programs, which may be configured to implement the methods described in the above embodiments when executed by one or more processors 910.
[0136] The processor 910 may include one or more processing cores. The processor 910 uses various interfaces and lines to connect to various parts within the electronic device 900, and performs various functions and processes data of the electronic device 900 by running or executing instructions, programs, code sets or instruction sets stored in the memory 920, and by calling data stored in the memory 920.
[0137] The memory 920 may include random access memory (RAM) or read-only memory (ROM). The memory 920 can be used to store instructions, programs, code, code sets, or instruction sets. The memory 920 may include a program storage area and a data storage area. The program storage area may store instructions for implementing an operating system, instructions for implementing at least one function, instructions for implementing the various method embodiments described above, etc. The data storage area may also store data created by the electronic device 900 during use.
[0138] It can be understood that the electronic device 900 may include more or fewer structural elements than those shown in the block diagram above, such as a power supply, input buttons, a camera, a speaker, a screen, an RF (Radio Frequency) circuit, a Wi-Fi (Wireless Fidelity) module, a Bluetooth module, sensors, etc., which are not limited herein.
[0139] This application discloses a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the methods described in the above embodiments.
[0140] This application discloses a computer program product, including a computer program, which, when executed by a processor, implements the methods described in the above embodiments.
[0141] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. The storage medium can be a magnetic disk, optical disk, read-only memory (ROM), etc.
[0142] Any references to memory, storage, databases, or other media used herein may include non-volatile and / or volatile memory. Suitable non-volatile memory may include ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM), which is used as external cache memory.
[0143] The foregoing has provided a detailed description of a method for generating simulated images according to embodiments of this application. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the embodiments above are merely for the purpose of helping to understand the method and its core ideas. Furthermore, those skilled in the art will recognize that, based on the ideas of this application, there may be changes in the specific implementation methods and application scope. Therefore, the content of this specification should not be construed as a limitation of this application.
Claims
1. A method for generating a simulated image, characterized in that the method includes: obtaining a reference video stream, the reference video stream comprising images corresponding to multiple timestamps; performing image recognition on each frame of the reference video stream to obtain a first data stream and a second data stream, wherein the first data stream includes image data of static image regions in the images corresponding to the multiple timestamps respectively, the second data stream includes three-dimensional data of moving objects in the images corresponding to the multiple timestamps respectively, and the three-dimensional data includes the position information of the moving objects in three-dimensional space; obtaining, from the first data stream, image data of static image regions in multiple frames of reference images within the time range of the target timestamp, and rendering a target background image based on the image data of the static image regions in the multiple frames of reference images and the target viewpoint; and obtaining, from the second data stream, the target position information of the moving object corresponding to the target timestamp, and generating a simulation image corresponding to the target timestamp under the target viewpoint based on the target position information and the target background image.
2. The method according to claim 1, characterized in that the step of performing image recognition on each frame of the reference video stream to obtain a first data stream and a second data stream includes: determining, based on each frame of the reference video stream, a static image region in each frame, and generating the first data stream based on the static image region in each frame and the corresponding timestamp; inputting the reference video stream into a target detection model, and identifying, through the target detection model, moving objects in each frame of the reference video stream to obtain the three-dimensional data of the moving objects in each frame; and generating the second data stream based on the three-dimensional data of the moving objects in each frame and the corresponding timestamp.
3. The method according to claim 1, characterized in that the three-dimensional data includes the three-dimensional bounding box information of the moving object; the three-dimensional bounding box information includes the position information, size parameters, and orientation angle of the moving object; wherein the position information includes the coordinates of the center point of the moving object in three-dimensional space, the size parameters include the length, width, and height of the three-dimensional bounding box, and the orientation angle includes the heading angle of the moving object on the horizontal plane.
4. The method according to claim 3, characterized in that the method further includes: in response to an editing operation on the 3D bounding box information of the target moving object corresponding to the target timestamp, updating the 3D bounding box information of the target moving object in the second data stream, wherein the target moving object is any moving object in the image corresponding to the target timestamp; and, based on the updated 3D bounding box information, re-executing the steps of obtaining the target 3D bounding box information of the moving object corresponding to the target timestamp from the second data stream and generating a simulation image of the target timestamp from the target viewpoint based on the target 3D bounding box information and the target background image, so as to obtain the edited simulation image.
5. The method according to claim 1, characterized in that the step of rendering the target background image based on the image data of the static image region in the multi-frame reference images and the target viewpoint includes: performing three-dimensional structural reasoning on the image data of the static image region in the multi-frame reference images through a three-dimensional reconstruction model to obtain three-dimensional scene parameters, wherein the three-dimensional scene parameters are used to characterize the geometric structure and appearance attributes of the static scene corresponding to the static image region of the multi-frame reference images in three-dimensional space; and rendering the static scene according to the target viewpoint and the three-dimensional scene parameters to obtain the target background image, which is a two-dimensional projection image of the static scene under the target viewpoint.
6. The method according to claim 5, characterized in that the 3D reconstruction model includes a feedforward 3D Gaussian splatting model; the 3D scene parameters include multiple Gaussian functions and the Gaussian parameters corresponding to each Gaussian function; and the step of performing 3D structural reasoning on the image data of static image regions in the multi-frame reference images through the 3D reconstruction model to obtain the 3D scene parameters includes: inputting the image data of the static image regions in the multi-frame reference images into the feedforward 3D Gaussian splatting model; extracting features from the image data of the static image regions in the multi-frame reference images through the forward propagation network of the feedforward 3D Gaussian splatting model to obtain a static feature map; and predicting, through the prediction branches of the feedforward 3D Gaussian splatting model, the parameters of one or more Gaussian functions corresponding to each pixel position in the static feature map to obtain the Gaussian parameters corresponding to each Gaussian function.
7. The method according to claim 5, characterized in that the three-dimensional scene parameters include the spatial position parameters of the static scene in three-dimensional space; and before rendering the static scene according to the target viewpoint and the three-dimensional scene parameters to obtain the target background image, the method further includes: obtaining point cloud data corresponding to the multiple reference images, wherein the point cloud data includes coordinate information of multiple three-dimensional spatial points; and using the point cloud data as prior information to constrain and adjust the spatial position parameters, so as to optimize the geometric structure of the static scene in three-dimensional space as represented by the three-dimensional scene parameters.
8. The method according to claim 7, characterized in that the point cloud data includes the target depth value corresponding to each of the three-dimensional spatial points; and the step of using the point cloud data as prior information to constrain and adjust the spatial position parameters includes: determining, based on the camera parameters corresponding to a first reference image, the target correspondence between multiple three-dimensional spatial points in the point cloud data corresponding to the first reference image and each pixel in the first reference image, wherein the first reference image can be any reference image; determining the rendering depth map corresponding to the first reference image based on the three-dimensional scene parameters and the viewpoint corresponding to the first reference image; determining the rendering depth value corresponding to each of the three-dimensional spatial points based on the target correspondence and the rendering depth map; calculating the depth deviation between the target depth value and the corresponding rendering depth value for each of the three-dimensional spatial points to obtain the depth deviation corresponding to the first reference image; and constructing a constraint loss based on the depth deviations corresponding to the multiple reference images respectively, and optimizing the spatial position parameters through backpropagation to obtain the optimized spatial position parameters.
9. The method according to claim 5, characterized in that the step of rendering the static scene according to the target viewpoint and the three-dimensional scene parameters to obtain the target background image includes: rendering, according to the target viewpoint, the static scene represented by the three-dimensional scene parameters to obtain the target background image and a corresponding depth map, wherein the depth map includes the depth value, in the static scene, of each pixel in the background image; and the step of generating a simulation image corresponding to the target timestamp under the target viewpoint based on the target position information and the target background image includes: inputting the target position information, the target background image, and the depth map into a generation model, and generating, through the generation model, the simulation image corresponding to the target timestamp under the target viewpoint based on the target position information, the target background image, and the depth map.
10. The method according to claim 9, characterized in that the step of generating, through the generation model, the simulation image corresponding to the target timestamp under the target viewpoint based on the target position information, the target background image, and the depth map includes: determining, through the generation model, the embedding region of each moving object in the target background image based on the target position information; generating, under the spatial constraints of the depth map, image content of each of the moving objects aligned with the background image in the corresponding embedding region; and fusing the image content corresponding to each of the moving objects into the corresponding embedding region to obtain the simulation image corresponding to the target timestamp under the target viewpoint, wherein the simulation image includes the target background image and each of the moving objects embedded in the target background image.