NeRF-based intelligent flying car visual simulation scene generation method and device

By improving the NeRF network structure and introducing image inpainting technology, the problems of poor reconstruction quality, slow training and rendering speed, and limited viewpoint of NeRF in small sample scenarios are solved, realizing the efficient generation of multi-view flying car simulation scenes and supporting scene editing.

CN122244313A · Pending · Publication Date: 2026-06-19 · BEIHANG UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: BEIHANG UNIV
Filing Date: 2026-03-17
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing NeRF-based scene image generation methods suffer from poor reconstruction quality, slow training and rendering speed, limited viewpoints, and inability to effectively generate large-scale and motion scenes in small sample scenarios.

Method used

An improved neural radiance field network that combines CNN pre-trained layers with multi-resolution hash encoding generates high-quality novel viewpoint images, and an image inpainting model incorporating fast Fourier convolution repairs blank or deformed image areas caused by viewpoint shift.

Benefits of technology

It can quickly generate high-quality, novel perspective images from sparse input images, support scene editing and dataset expansion, reduce algorithm testing costs, and provide rich environmental perception and path planning verification for autonomous driving systems.

✦ Generated by Eureka AI based on patent content.


Abstract

This application relates to a method and apparatus for generating visual simulation scenes of intelligent flying cars based on NeRF, belonging to the fields of image processing and computer vision technology. Using an improved neural radiance field network, the method achieves efficient representation and rapid rendering of three-dimensional scenes from sparse input images, and can quickly generate an initial scene image for a specified target viewpoint from input images with limited viewpoints. An image inpainting model then repairs image defects caused by viewpoint changes, and the scene content can be selectively edited according to user needs, yielding high-quality, multi-view final scene images. This enables the rapid generation of high-quality, multi-view flying car simulation scenes under small-sample and limited-viewpoint conditions, and allows flexible adjustment and expansion of scene content.

Description

Technical Field

[0001] This application belongs to the field of image processing and computer vision technology, and in particular relates to a method and apparatus for generating visual simulation scenes of intelligent flying cars based on NeRF.

Background Technology

[0002] Flying cars, as a novel mode of transportation, can fully utilize the three-dimensional space of cities to achieve intelligent transportation, thereby effectively alleviating urban traffic pressure. However, due to the time-consuming, labor-intensive, and costly nature of testing in real-world scenarios, generating high-quality simulation scenarios to assist in the simulation testing and verification of flying cars has become an important research direction in this field. Neural Radiance Fields (NeRF) is a novel 3D reconstruction technology that can synthesize novel perspective images from sparse image sets, providing a new technical path for autonomous driving scenario simulation.

[0003] Currently, there is a wealth of research on vehicle scene simulation, especially generative models that have emerged in recent years with promising results. Among them, diffusion models (DM) excel in realistic image synthesis, but their adaptability to LiDAR scene generation faces significant challenges. This is mainly because diffusion models, operating in point space, need to preserve the curve patterns and 3D geometry of the LiDAR scene, consuming most of their representational power. Furthermore, although the training of diffusion models is relatively stable, the complex training process and numerous hyperparameters still present challenges, requiring careful tuning and optimization of hyperparameters, as well as the selection of appropriate noise levels and training strategies.

[0004] Scholars have proposed the DriveDreamer-2 model, described as the world's first model to generate diverse driving videos in a user-friendly manner. Specifically, this model decomposes the traffic simulation task into generating foreground conditions (trajectories of the ego vehicle and other agents) and background conditions (high-definition (HD) maps of lane boundaries, lane dividers, and pedestrian crossings). For foreground generation, a function library is built to fine-tune a large language model (LLM), enabling it to generate agent trajectories based on user text input. For background conditions, a diffusion model is used as an HD map generator that simulates road structures, with the previously generated agent trajectories serving as conditional input, allowing the HD map generator to learn the association between foreground and background conditions in the driving scene. Based on the generated traffic structure conditions, the DriveDreamer framework is used to generate multi-view driving videos. However, this type of method primarily generates multi-view driving videos from text input, rather than generating novel images based on existing image data.

[0005] NeRF is a high-quality scene reconstruction technique that combines 3D reconstruction and neural rendering. Its main task is to synthesize novel perspectives, that is, to capture a series of images from several known perspectives of a scene and synthesize images from an unknown perspective. One proposed technique, pixelNeRF, simulates continuous neural scenes based on a few input images. It introduces pre-trained layers of convolutional neural networks and bilinear interpolation on top of NeRF, extracting corresponding image features for each sampling point, fully utilizing the features of the input images, and passing the feature points, spatial locations, and viewpoint orientation to the NeRF network. This allows it to construct continuous static scenes with only a sparse set of images as input. However, pixelNeRF suffers from slow training and rendering, and is prone to failure when generating large-scale unbounded scenes.

[0006] Overall, achieving NeRF reconstruction in autonomous driving scenarios presents numerous challenges. First, NeRF is only suitable for small-scale, static scenes. Input scenes are captured within a very short timeframe, with constant lighting and no motion. The quality of NeRF synthesis significantly degrades once moving objects or changes in lighting occur. Second, NeRF involves substantial computation, resulting in slow training and rendering speeds. Third, data collected from autonomous driving sensors is characterized by single camera trajectories and limited viewpoints. This restricts the 3D scene reconstruction capabilities of related NeRF models in autonomous driving applications. This is because the NeRF network structure, MLP, uses independent neurons to represent each pixel of the input image and requires independent optimization for each scene. Scenes cannot share any knowledge, and no prior knowledge from the scene can be utilized for reconstruction. When the number of captured views is insufficient, reconstruction quality rapidly degrades or even fails.

[0007] No effective solution has yet been proposed for the technical problems of traditional NeRF-based scene image generation methods described above, namely poor reconstruction quality in small-sample scenes, slow training and rendering speed, limited viewpoints, and the inability to effectively generate large-scale and moving scenes.

Summary of the Invention

[0008] In view of the shortcomings of the prior art, the purpose of the invention is to provide a method and device for generating visual simulation scenes of intelligent flying cars based on NeRF. By using an improved neural radiance field network that combines CNN pre-trained layers with multi-resolution hash encoding, it can quickly generate high-quality novel perspective images from sparse input images. Furthermore, it uses an image inpainting model based on fast Fourier convolution to repair and edit blank or deformed areas in the image caused by perspective shift, thereby providing rich, realistic, and flexible scene data for the simulation testing of intelligent flying cars.

[0009] The first aspect of this application proposes a NeRF-based method for generating visual simulation scenes for intelligent flying cars, comprising: acquiring an input image of a target scene, wherein the target scene is a low-altitude flight scene or a ground driving scene of an intelligent flying car; using an improved neural radiance field network to generate an initial scene image from the target perspective based on the input image, wherein the improved neural radiance field network uses an improved CNN pre-trained layer to extract features from the input image and uses multi-resolution hash encoding to encode the spatial sampling points from the target perspective, inputs the extracted image features, the hash-encoded spatial sampling point features, and the viewing direction of the spatial sampling points into a multilayer perceptron network to predict the color and volume density of the spatial sampling points, and finally generates the initial scene image through volume rendering; and identifying the regions to be repaired in the initial scene image, generating corresponding mask images, and inputting the initial scene image and the mask images into a pre-trained image inpainting model to perform content repair on the regions to be repaired, generating the final target scene image as scene data for visual simulation of intelligent flying cars.

[0010] According to a second aspect of the present disclosure, a storage medium is provided, the storage medium including a stored program, wherein, when the program is executed, a processor performs the method described in any of the above embodiments.

[0011] According to a third aspect of the present disclosure, a NeRF-based intelligent flying car visual simulation scene generation device is provided, comprising: an image acquisition module for acquiring an input image of a target scene, wherein the target scene is a low-altitude flight scene or a ground driving scene of an intelligent flying car; an initial scene image generation module for generating an initial scene image from a target perspective based on the input image using an improved neural radiance field network, wherein the improved neural radiance field network uses an improved CNN pre-trained layer to extract features from the input image and uses multi-resolution hash encoding to encode the spatial sampling points from the target perspective, inputs the extracted image features, the hash-encoded spatial sampling point features, and the viewing direction of the spatial sampling points into a multilayer perceptron network to predict the color and volume density of the spatial sampling points, and finally generates the initial scene image through volume rendering; and a target scene image generation module for identifying the region to be repaired in the initial scene image, generating a corresponding mask image, and inputting the initial scene image and the mask image into a pre-trained image inpainting model to perform content repair on the region to be repaired, generating a final target scene image as scene data for the visual simulation of an intelligent flying car.

[0012] According to a fourth aspect of the present disclosure, a NeRF-based intelligent flying car visual simulation scene generation device is provided, comprising: a processor; and a memory connected to the processor, configured to provide the processor with instructions for processing the following steps: acquiring an input image of a target scene, wherein the target scene is a low-altitude flight scene or a ground driving scene of an intelligent flying car; generating an initial scene image from a target perspective based on the input image using an improved neural radiance field network, wherein the improved neural radiance field network uses an improved CNN pre-trained layer to extract features from the input image and uses multi-resolution hash encoding to encode spatial sampling points from the target perspective, inputs the extracted image features, the hash-encoded spatial sampling point features, and the viewing direction of the spatial sampling points into a multilayer perceptron network to predict the color and volume density of the spatial sampling points, and finally generates the initial scene image through volume rendering; and identifying regions to be repaired in the initial scene image, generating corresponding mask images, and inputting the initial scene image and the mask images into a pre-trained image inpainting model to perform content repair on the regions to be repaired, generating a final target scene image as scene data for intelligent flying car visual simulation.

[0013] The beneficial effects of this application are as follows: (1) This application combines an improved CNN pre-training layer with multi-resolution hash encoding. While retaining the original NeRF network's ability to extract features from sparse input images, it significantly reduces the number of network parameters (for example, by using the ResNet50 residual structure to reduce the number of parameters from about 250 million to about 20 million). Multi-resolution hash encoding also accelerates feature queries and MLP inference, thereby significantly improving training and rendering speed and making the generation of novel perspective images in small-sample scenarios more efficient.

[0014] (2) This application uses the LaMa image inpainting model, which is based on Fast Fourier Convolution (FFC), to repair blank or deformed areas at image edges caused by perspective shift. It can generate visually coherent and logically consistent novel perspective images, thereby effectively expanding the existing dataset. In particular, it can generate rare sample data for low-altitude and ground scenes of flying cars, providing richer scene materials for simulation testing.

[0015] (3) By generating mask images and combining them with image inpainting technology, this application enables flexible editing of existing flying car images, including target deletion, scene replacement, and attitude adjustment such as pitch angle and heading angle, so that test scenes can more closely match actual working conditions. At the same time, it greatly reduces the time and labor costs of algorithm testing, and lays a theoretical foundation for verifying the environmental perception and path planning of autonomous driving systems.

[0016] Therefore, this application improves the NeRF network structure and introduces image inpainting technology to achieve rapid generation of high-quality, multi-view visual simulation scenes of flying cars under sparse input images, and supports scene editing and dataset expansion. This solves the technical problems of traditional NeRF-based scene image generation methods, such as poor reconstruction quality, slow training and rendering speed, limited viewpoints, and inability to effectively generate large-scale and moving scenes in small-sample scenarios.

Attached Figure Description

[0017] The accompanying drawings are for illustrative purposes only and are not intended to limit the scope of this application. Throughout the drawings, the same reference numerals denote the same components. Obviously, the drawings described below are merely some embodiments described in this application, and those skilled in the art can obtain other drawings based on these drawings.

[0018] Figure 1 is a hardware structure block diagram of a computing device for implementing the method described in Embodiment 1 of this disclosure;
Figure 2 is a flowchart of the NeRF-based visual simulation scene generation method for intelligent flying cars according to Embodiment 1 of this application;
Figure 3 is an overall framework diagram of the NeRF-based intelligent flying car visual simulation scene generation method according to Embodiment 1 of this application;
Figure 4 is a schematic diagram of the pre-trained layer structure that incorporates CNN and bilinear interpolation according to Embodiment 1 of this application;
Figure 5 is a schematic diagram of the ResNet34 residual structure and the ResNet50 residual structure according to Embodiment 1 of this application;
Figure 6 is a schematic diagram of the multi-resolution hash encoding structure according to Embodiment 1 of this application;
Figure 7 is a schematic diagram of images generated for the flying car model vehicle according to Embodiment 1 of this application;
Figure 8 is a schematic diagram of editing the low-altitude scene of the flying car according to Embodiment 1 of this application;
Figure 9 is a schematic diagram of editing the ground scene of the flying car according to Embodiment 1 of this application;
Figure 10 is a schematic diagram of adjusting the pitch angle by 15° in the low-altitude scene of the flying car according to Embodiment 1 of this application;
Figure 11 is a schematic diagram of adjusting the heading angle by 15° in the ground scene of the flying car according to Embodiment 1 of this application;
Figure 12 is a schematic diagram of the NeRF-based intelligent flying car visual simulation scene generation device according to Embodiment 2 of this application;
Figure 13 is a schematic diagram of the NeRF-based intelligent flying car visual simulation scene generation device according to Embodiment 3 of this application.

Detailed Implementation

[0019] To enable those skilled in the art to better understand the technical solutions of this disclosure, the technical solutions of the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are merely some embodiments of this disclosure, and not all embodiments. Based on the embodiments of this disclosure, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of this disclosure.

[0020] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this disclosure are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this disclosure described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

Embodiment 1

[0021] According to this embodiment, a method for generating visual simulation scenes of intelligent flying cars based on NeRF is provided. It should be noted that the steps shown in the flowchart in the accompanying drawings can be executed in a computer system as a set of computer-executable instructions. Although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in a different order from that shown here.

[0022] The method embodiments provided in this example can be executed on a server or similar computing device. Figure 1 shows a hardware block diagram of a computing device for implementing the NeRF-based method for generating visual simulation scenes of intelligent flying cars. As shown in Figure 1, a computing device may include one or more processors (which may include, but are not limited to, microprocessors such as MCUs or programmable logic devices such as FPGAs), a memory for storing data, a transmission device for communication functions, and an input/output interface. The memory, transmission device, and input/output interface are connected to the processor via a bus. The computing device may also include a display, keyboard, and cursor control device connected to the input/output interface. Those skilled in the art will understand that the structure shown in Figure 1 is for illustrative purposes only and does not limit the structure of the aforementioned electronic device. For example, a computing device may include more or fewer components than shown in Figure 1, or have a configuration different from that shown in Figure 1.

[0023] It should be noted that the aforementioned one or more processors and / or other data processing circuits are generally referred to herein as "data processing circuits". These data processing circuits may be embodied, in whole or in part, in software, hardware, firmware, or any other combination thereof. Furthermore, the data processing circuits may be a single, independent processing module, or may be integrated, in whole or in part, into any other element in a computing device. As involved in the embodiments of this disclosure, the data processing circuits serve as processor control (e.g., selection of a variable resistor termination path connected to an interface).

[0024] The memory can be used to store software programs and modules of application software, such as the program instruction / data storage device corresponding to the NeRF-based intelligent flying car visual simulation scene generation method in this embodiment of the present disclosure. The processor executes various functional applications and data processing by running the software programs and modules stored in the memory, thereby realizing the aforementioned NeRF-based intelligent flying car visual simulation scene generation method. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory may further include memory remotely located relative to the processor, and these remote memories can be connected to the computing device via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.

[0025] The transmission device is used to receive or send data via a network. Specific examples of the network described above may include a wireless network provided by the computing device's communication provider. In one example, the transmission device includes a Network Interface Controller (NIC), which can connect to other network devices via a base station to communicate with the Internet. In another example, the transmission device may be a Radio Frequency (RF) module used for wireless communication with the Internet.

[0026] The display can be, for example, a touchscreen liquid crystal display (LCD), which allows users to interact with the user interface of the computing device.

[0027] It should be noted here that, in some optional embodiments, the computing device shown in Figure 1 may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both hardware and software elements. It should be noted that Figure 1 is only one specific example, intended to illustrate the types of components that may exist in the aforementioned computing device.

[0028] Under the above operating environment, according to the first aspect of this embodiment, a method for generating visual simulation scenes of intelligent flying cars based on NeRF is provided. Figure 2 shows a flowchart of the method. As shown in Figure 2, the method includes:

S202: Acquire an input image of the target scene, wherein the target scene is a low-altitude flight scene or a ground driving scene of an intelligent flying car;

S204: Use an improved neural radiance field network to generate an initial scene image from the target viewpoint based on the input image, wherein the improved neural radiance field network uses an improved CNN pre-trained layer to extract features from the input image and uses multi-resolution hash encoding to encode the spatial sampling points from the target viewpoint; the extracted image features, the hash-encoded spatial sampling point features, and the viewing direction of the spatial sampling points are input into a multilayer perceptron network to predict the color and volume density of the spatial sampling points, and the initial scene image is finally generated through volume rendering;

S206: Identify the region to be repaired in the initial scene image, generate a corresponding mask image, and input the initial scene image and the mask image into a pre-trained image inpainting model to perform content repair on the region to be repaired, generating the final target scene image as scene data for the visual simulation of the intelligent flying car.

[0029] In this embodiment, during the simulation scene construction stage, the input image of the target scene is first acquired. The target scene is either a low-altitude flight scene or a ground driving scene of the intelligent flying car (corresponding to step S202). Specifically, the input image can be acquired by a camera mounted on the flying car during actual flight or driving, or images from existing public datasets (such as urban low-altitude aerial photography datasets or autonomous driving ground scene datasets) can be used directly. Because this method aims to solve the scene generation problem under small-sample conditions, the input can be a limited number of multi-view images, for example a small number (e.g., 10 to 20) of images taken from different perspectives around the target scene. These images should cover typical visual elements of the flying car in low-altitude or ground operation scenes, including but not limited to city streets, buildings, traffic signs, and other vehicles, thereby providing basic visual information for the subsequent neural radiance field reconstruction.

[0030] Then, using the improved neural radiance field network, an initial scene image from the target viewpoint is generated based on the input image (corresponding to step S204). Figure 3 illustrates the overall framework of this method, showing the entire process from the input image to the final generated target scene image. Specifically, this step first uses a traditional method such as Structure from Motion (SfM) to estimate the camera parameters (intrinsic and extrinsic) of each input image, thereby constructing the 3D geometric constraints of the scene. Next, for any target viewpoint specified by the user (e.g., a new viewing angle of interest for the flying car), rays are cast into 3D space according to the camera parameters of that target viewpoint, and discrete sampling is performed along each ray to obtain a series of spatial sampling points.
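The ray casting and discrete sampling described in [0030] can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the pinhole model, the camera convention (camera looks along +z), the 3x4 camera-to-world matrix layout, the evenly spaced sampling, and the function names `pixel_ray` and `sample_along_ray` are all assumptions made for the example.

```python
import math


def pixel_ray(u, v, fx, fy, cx, cy, cam_to_world):
    """Return (origin, unit direction) in world space for pixel (u, v).

    cam_to_world is an assumed 3x4 row-major matrix [R | t]; the camera is
    assumed to look along +z with x to the right and y downward.
    """
    # Back-project the pixel through an ideal pinhole camera.
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate the direction into world space and normalise it.
    d_world = [sum(cam_to_world[r][c] * d_cam[c] for c in range(3)) for r in range(3)]
    norm = math.sqrt(sum(x * x for x in d_world))
    direction = [x / norm for x in d_world]
    # The ray starts at the camera centre (translation column).
    origin = [cam_to_world[r][3] for r in range(3)]
    return origin, direction


def sample_along_ray(origin, direction, near, far, n_samples):
    """Evenly spaced sample points between near and far (n_samples >= 2)."""
    step = (far - near) / (n_samples - 1)
    depths = [near + i * step for i in range(n_samples)]
    points = [[o + t * d for o, d in zip(origin, direction)] for t in depths]
    return depths, points
```

In practice the intrinsics and poses would come from the SfM step, and stratified or importance sampling is often used instead of even spacing; the source does not specify which scheme is used.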

[0031] For each sampling point, on the one hand, the improved CNN pre-trained layer is used to extract features from the input image, generating a feature map containing rich semantic and texture information. Figure 4 illustrates the structure of the pre-trained layer incorporating CNN and bilinear interpolation. As shown in Figure 4, given an image, a pre-trained CNN encoder first extracts features for each pixel of the image to construct a feature map. Then, given the camera intrinsics of the desired new perspective, a ray is cast into space as in NeRF; each point x on the ray is projected onto the image plane of the input image, and the corresponding feature is extracted from the feature map by bilinear interpolation. In this way, the feature information of the input image can be fully utilized, providing rich texture and semantic guidance for subsequent rendering.
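The projection and feature lookup in [0031] can be illustrated with the sketch below. It assumes a pinhole camera, a 3x4 world-to-camera matrix, and a feature map stored at the same resolution as the input image (a real CNN feature map is usually downsampled, so pixel coordinates would need rescaling). The names `project_point` and `bilinear_sample` are hypothetical.

```python
import math


def project_point(x_world, world_to_cam, fx, fy, cx, cy):
    """Project a 3D world point into an input view.

    Returns pixel coordinates (u, v), or None if the point lies behind the camera.
    world_to_cam is an assumed 3x4 row-major matrix [R | t].
    """
    x_cam = [
        sum(world_to_cam[r][c] * x_world[c] for c in range(3)) + world_to_cam[r][3]
        for r in range(3)
    ]
    if x_cam[2] <= 0:
        return None
    return fx * x_cam[0] / x_cam[2] + cx, fy * x_cam[1] / x_cam[2] + cy


def bilinear_sample(feature_map, u, v):
    """Bilinearly interpolate a feature vector at (u, v) = (column, row).

    feature_map[row][col] is a list of channel values; coordinates outside
    the map are clamped to the border.
    """
    h, w = len(feature_map), len(feature_map[0])
    u = min(max(u, 0.0), w - 1.0)
    v = min(max(v, 0.0), h - 1.0)
    c0, r0 = int(math.floor(u)), int(math.floor(v))
    c1, r1 = min(c0 + 1, w - 1), min(r0 + 1, h - 1)
    du, dv = u - c0, v - r0
    channels = len(feature_map[0][0])
    # Weighted sum of the four neighbouring feature vectors.
    return [
        (1 - du) * (1 - dv) * feature_map[r0][c0][k]
        + du * (1 - dv) * feature_map[r0][c1][k]
        + (1 - du) * dv * feature_map[r1][c0][k]
        + du * dv * feature_map[r1][c1][k]
        for k in range(channels)
    ]
```

Bilinear interpolation lets a sampling point that projects between pixel centres receive a smoothly varying feature rather than the feature of the nearest pixel.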

[0032] On the other hand, multi-resolution hash encoding is used to encode the 3D coordinates of the spatial sampling points. This encoding efficiently converts 3D spatial coordinates into high-dimensional feature representations, significantly improving query speed while maintaining expressive power. It also allows the subsequent multilayer perceptron network to be smaller, thereby accelerating training and rendering.
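The source does not give the details of the encoding in [0032]. The sketch below follows the commonly used Instant-NGP style formulation as an assumption: each level has its own grid resolution and hash table, the eight surrounding grid vertices are hashed with a prime-based XOR hash, their feature vectors are trilinearly interpolated, and the per-level results are concatenated. Table sizes, resolutions, the hash primes, and all function names are illustrative choices, and the tables would be trained rather than left at their random initial values.

```python
import math
import random

# Hash primes commonly used for spatial hashing (assumed, not from the source).
PRIMES = (1, 2654435761, 805459861)


def hash_index(ix, iy, iz, table_size):
    """Map an integer grid vertex to a slot in the hash table."""
    return ((ix * PRIMES[0]) ^ (iy * PRIMES[1]) ^ (iz * PRIMES[2])) % table_size


def make_tables(n_levels, table_size, n_features, seed=0):
    """One table of small random feature vectors per resolution level."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1e-4, 1e-4) for _ in range(n_features)] for _ in range(table_size)]
        for _ in range(n_levels)
    ]


def encode(point, tables, base_res=16, growth=2.0):
    """Encode a point in [0, 1]^3 as concatenated per-level features."""
    out = []
    for level, table in enumerate(tables):
        res = int(math.floor(base_res * growth ** level))
        scaled = [p * res for p in point]
        lo = [int(math.floor(s)) for s in scaled]
        frac = [s - l for s, l in zip(scaled, lo)]
        feat = [0.0] * len(table[0])
        # Trilinear interpolation over the 8 corners of the enclosing cell.
        for corner in range(8):
            offs = [(corner >> axis) & 1 for axis in range(3)]
            weight = 1.0
            for axis in range(3):
                weight *= frac[axis] if offs[axis] else 1.0 - frac[axis]
            slot = hash_index(lo[0] + offs[0], lo[1] + offs[1], lo[2] + offs[2], len(table))
            for k, value in enumerate(table[slot]):
                feat[k] += weight * value
        out.extend(feat)
    return out
```

With 4 levels and 2 features per level, `encode` returns an 8-dimensional vector. Because most of the representational capacity sits in the lookup tables, the MLP that consumes this vector can be much smaller than the one used with a frequency-based positional encoding.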

[0033] Next, the image features, the spatial point features obtained by hash encoding, and the viewing direction of the sampling point are concatenated into a combined feature vector, which is then fed into a multilayer perceptron (MLP) network to predict the color value and volume density of the sampling point. By performing volume rendering integration over the color and density of all sampling points along a ray, a two-dimensional image from the target viewpoint, i.e., the initial scene image, can be synthesized. Figure 7 shows example novel perspective images of a flying car model generated using this method. It can be seen that even with a limited number of input images, this method can still generate high-quality images of the flying car from various perspectives. Therefore, introducing CNN pre-trained features enhances the ability to utilize information from sparse input images, and multi-resolution hash encoding significantly improves the query efficiency and rendering speed of positional encoding, enabling the rapid generation of high-quality novel perspective images even with a limited number of input images.
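The concatenation and volume rendering steps in [0033] can be sketched as below. The compositing follows the standard discrete NeRF quadrature (alpha from density and sample spacing, weighted by accumulated transmittance), which is assumed here because the source only names "volume rendering integration". The MLP itself is omitted; `build_mlp_input` and `composite` are hypothetical names.

```python
import math


def build_mlp_input(image_feature, hash_feature, view_dir):
    """Concatenate the three per-sample inputs into one feature vector."""
    return list(image_feature) + list(hash_feature) + list(view_dir)


def composite(colors, sigmas, depths):
    """Composite per-sample colours and densities along one ray.

    colors: list of (r, g, b); sigmas: volume densities; depths: increasing
    sample depths. Returns (pixel colour, accumulated opacity).
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for i, (color, sigma) in enumerate(zip(colors, sigmas)):
        # Distance to the next sample; the last interval is treated as very long.
        delta = depths[i + 1] - depths[i] if i + 1 < len(depths) else 1e10
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        for k in range(3):
            pixel[k] += weight * color[k]
        transmittance *= 1.0 - alpha
    return pixel, 1.0 - transmittance
```

The accumulated opacity returned alongside the colour is close to zero for rays that pass through regions the input views never covered, which is one possible signal for locating blank areas in the rendered image.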

[0034] Finally, the regions to be repaired in the initial scene image are identified, a corresponding mask image is generated, and the initial scene image and the mask image are input into a pre-trained image inpainting model to perform content repair on the regions to be repaired, generating the final target scene image as scene data for the visual simulation of the intelligent flying car (corresponding to step S206). Specifically, because the target viewpoint deviates from the input image viewpoints, unnatural regions often appear in the generated initial scene image, such as blank areas at the edges (because the new viewpoint exposes parts of the scene not covered by the original viewpoints) or deformed areas caused by occlusion. In addition, to meet the need for customized scene content in simulation testing, users can also actively specify certain areas of the image (e.g., specific targets that need to be removed) as regions to be repaired. To this end, this step first automatically detects incomplete regions caused by viewpoint changes based on the geometric relationship between the target viewpoint and the input image viewpoints, and also receives any mask region manually specified by the user. Either or both are taken as the region to be repaired, and a mask image of the same size as the initial scene image is generated, in which the region to be repaired is distinguished from other regions by different pixel values. The initial scene image and the mask image are then input together into the pre-trained image inpainting model. Based on the image content of the unmasked areas surrounding the mask, the model infers and fills in the texture and structure within the masked area, generating inpainted content that is consistent with the surrounding environment. Finally, the inpainted content is fused with the original unmasked areas to obtain a complete and realistic image of the target scene.
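The mask construction and final fusion in [0034] can be sketched as follows. The inpainting model is not reproduced; its output is passed in as `inpainted`. Thresholding a per-pixel opacity map to find blank regions is an assumed stand-in for the geometric detection the source describes, and the function names are hypothetical.

```python
def holes_from_opacity(opacity, threshold=0.5):
    """Mark pixels whose rendered opacity is low as needing repair (1 = repair).

    Assumed stand-in for the geometry-based detection of incomplete regions.
    """
    return [[1 if o < threshold else 0 for o in row] for row in opacity]


def build_mask(auto_holes, user_mask=None):
    """Union of automatically detected holes and an optional user-drawn mask."""
    if user_mask is None:
        return [row[:] for row in auto_holes]
    return [
        [1 if a or u else 0 for a, u in zip(row_a, row_u)]
        for row_a, row_u in zip(auto_holes, user_mask)
    ]


def fuse(initial, inpainted, mask):
    """Keep original pixels outside the mask and inpainted pixels inside it."""
    return [
        [inp if m else ini for ini, inp, m in zip(row_i, row_p, row_m)]
        for row_i, row_p, row_m in zip(initial, inpainted, mask)
    ]
```

Passing only `auto_holes` repairs viewpoint-induced defects, while adding a `user_mask` covering an object corresponds to the target-removal editing described in the text.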

[0035] Figures 8 to 11 show examples of using this method to perform viewpoint transformation and content repair on flying car scenes. Figure 8 shows a low-altitude flying car scene: after the region to be repaired is specified by a mask, the model removes the white vehicle at the bottom of the scene and generates a coherent background. Figure 9 shows a ground-level flying car scene and the repair result after the truck on the right side of the scene is removed. Figure 10 shows the repair of blank edge areas after the pitch angle of a low-altitude flying car scene is adjusted by 15°, and Figure 11 shows the repair of deformed areas after the heading angle of a ground-level flying car scene is adjusted by 15°. These examples demonstrate that, by combining image inpainting with viewpoint transformation, this method not only eliminates artifacts such as edge gaps and distortions caused by viewpoint shifts, generating visually coherent and realistic novel-view images, but also allows selective editing of scene content (such as removing specific targets) by applying masks to specific areas, so that scene reconstruction is completed naturally during the inpainting process. On this basis, the method can effectively expand the original dataset, providing richer scene material for flying car simulation testing and algorithm verification.

[0036] As described in the background section, implementing NeRF reconstruction in autonomous driving scenarios presents numerous challenges. First, NeRF is only suitable for small-scale, static scenes. Input scenes are captured within a very short timeframe, with constant lighting and no motion. The quality of NeRF synthesis significantly degrades once moving objects or changes in lighting occur. Second, NeRF involves substantial computation, resulting in slow training and rendering speeds. Third, data collected from autonomous driving sensors is characterized by single camera trajectories and limited viewpoints. This restricts the 3D scene reconstruction capabilities of related NeRF models in autonomous driving applications. This is because the NeRF network structure, MLP, represents each pixel of the input image using independent neurons and requires independent optimization for each scene. Scenes cannot share any knowledge, and no prior knowledge from the scene can be utilized for reconstruction. When the number of captured views is insufficient, reconstruction quality rapidly degrades or even fails.

[0037] In view of this, this application achieves efficient representation and rapid rendering of 3D scenes under sparse input image conditions through an improved neural radiance field (NeRF) network. It can quickly generate initial scene images from a specified target viewpoint based on input images with limited viewpoints. Furthermore, through an image inpainting model, it repairs image defects caused by viewpoint changes and allows selective editing of scene content according to user needs, thereby obtaining high-quality, multi-view final scene images. Through these technical means, this application achieves rapid generation of high-quality, multi-view flying car simulation scenes under small sample and limited viewpoint conditions, and allows for flexible adjustment and expansion of scene content. This solves the technical problems of traditional NeRF-based scene image generation methods, such as poor reconstruction quality, slow training and rendering speed, limited viewpoints, and inability to effectively generate large-scale and moving scenes in small sample scenarios.

[0038] Optionally, the improved neural radiance field network includes: a CNN pre-training layer, a multi-resolution hash coding layer, and a multilayer perceptron network; and, the operation of generating an initial scene image from the target perspective based on the input image using the improved neural radiance field network includes: obtaining camera parameters corresponding to the input image and target camera parameters for generating the new perspective; emitting light rays into three-dimensional space according to the target camera parameters, and performing spatial sampling on the light rays to obtain multiple spatial sampling points; extracting features from the input image using the CNN pre-training layer to obtain a feature map; for each spatial sampling point, projecting the spatial sampling point onto the feature map according to the camera parameters corresponding to the input image, and extracting the corresponding image features through bilinear interpolation; using the multi-resolution hash coding layer to perform multi-resolution hash coding on the three-dimensional coordinates of the spatial sampling point to obtain spatial sampling point features; concatenating the image features, the spatial sampling point features, and the viewpoint direction of the spatial sampling point, and inputting them into the multilayer perceptron network to predict the color and volume density of the spatial sampling point; and synthesizing the initial scene image from the target perspective based on the color and volume density of all spatial sampling points using volume rendering technology.

[0039] Specifically, the improved neural radiance field network comprises three core components: a CNN pre-training layer, a multi-resolution hash coding layer, and a multilayer perceptron network. The specific steps for generating the initial scene image using this network are explained in detail below with reference to Figure 3 and Figure 4.

[0040] First, the camera parameters corresponding to the input images and the target camera parameters used to generate new perspectives are obtained. Specifically, in the training and application of the neural radiance field network, the intrinsic and extrinsic parameters of the camera corresponding to each input image need to be known. To obtain the camera pose and parameters of the captured images, this embodiment uses COLMAP. COLMAP is a general-purpose Structure from Motion (SfM) and Multi-View Stereo (MVS) pipeline with graphical and command-line interfaces, providing extensive functionality for the reconstruction of ordered and unordered image sets. In practice, images can be taken around the target scene (e.g., a flying car model or an actual scene) using a mobile phone or camera. It is recommended that the angle between adjacent images not exceed 30° to ensure sufficient viewpoint overlap. The acquired image set is input into the COLMAP tool, and the camera parameters corresponding to each image (including camera intrinsic parameters such as focal length and principal point coordinates, and camera extrinsic parameters such as the rotation matrix and translation vector) are generated through the SfM process, thereby determining the three-dimensional spatial position relationship of each pixel in the scene. These camera parameters obtained through COLMAP serve as input conditions for subsequent training of the neural radiance field network.

[0041] Furthermore, after obtaining the input image, the PointRend module in the detectron2 framework can be used to further segment the scene. PointRend is a point-rendering-based image segmentation optimization module that can perform refined processing on complex regions such as object boundaries. By adaptively selecting difficult regions for point-by-point prediction, it obtains high-precision scene segmentation results. Through scene segmentation, different objects and regions in the image (such as flying cars, other vehicles, roads, buildings, and the sky) can be identified. The segmentation output, along with the camera parameters, can be used as input conditions for the improved neural radiance field network, providing more accurate semantic guidance for subsequent neural radiance field reconstruction and image inpainting.

[0042] Meanwhile, in order to generate scene images from a specified new perspective, it is necessary to obtain the target camera parameters set by the user, including the target camera intrinsic parameters and the target camera extrinsic parameters, which define the position and orientation of the virtual camera observing the scene. As shown in the overall framework diagram of Figure 3, the input image and its corresponding camera parameters together constitute the input conditions for network training, while the target camera parameters are used to guide the synthesis of new perspective images.

[0043] Then, based on the target camera parameters, light rays are emitted into three-dimensional space, and spatial sampling is performed on these light rays to obtain multiple spatial sampling points. Specifically, for each pixel in the target viewpoint, a light ray is emitted into three-dimensional space along the line of sight, starting from its corresponding camera position. Several three-dimensional spatial points are discretely selected on this light ray according to a preset sampling strategy (such as uniform sampling or hierarchical sampling) as sampling points for subsequent feature extraction and color density prediction. These sampling points cover the entire light path from the near plane to the far plane and are used to integrate and synthesize the final color value of the pixel.
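The ray emission and sampling step can be illustrated with a minimal NumPy sketch. The function names, the pinhole model with a single focal length, the OpenCV-style camera axes, and the 4x4 camera-to-world matrix are all assumptions made for illustration; the source does not fix these conventions. The `stratified` option draws one random sample per depth bin and is shown only as one possible sampling strategy alongside uniform sampling.

```python
import numpy as np

def pixel_rays(height, width, focal, cam_to_world):
    """Return per-pixel ray origins and unit directions, each (H, W, 3).

    Assumed conventions: pinhole camera, x right / y down / z forward,
    principal point at the image centre, 4x4 camera-to-world matrix.
    """
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    dirs_cam = np.stack([(u - width / 2) / focal,
                         (v - height / 2) / focal,
                         np.ones_like(u)], axis=-1)
    dirs_world = dirs_cam @ cam_to_world[:3, :3].T      # rotate into world frame
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    origins = np.broadcast_to(cam_to_world[:3, 3], dirs_world.shape)
    return origins, dirs_world

def sample_along_rays(origins, dirs, near, far, n_samples,
                      stratified=False, rng=None):
    """Pick n_samples distances per ray between the near and far planes."""
    edges = np.linspace(near, far, n_samples + 1)
    lower, upper = edges[:-1], edges[1:]
    shape = origins.shape[:-1] + (n_samples,)
    if stratified:
        if rng is None:
            rng = np.random.default_rng(0)
        t = lower + (upper - lower) * rng.random(shape)  # one sample per bin
    else:
        t = np.broadcast_to(0.5 * (lower + upper), shape)  # bin centres
    points = origins[..., None, :] + t[..., None] * dirs[..., None, :]
    return points, t
```

Each returned point is `origin + t * direction`, so the array of `t` values can be reused later as the sample distances along the ray.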

[0044] Next, features are extracted from the input image using the CNN pre-training layer to obtain a feature map. As shown in Figure 4, the input image is first fed into a pre-trained convolutional neural network (CNN) encoder, which extracts high-dimensional semantic and texture features from each pixel in the image, generating a feature map corresponding to the resolution of the input image. This feature map preserves the spatial structure information of the input image and provides rich feature representations for subsequent feature queries.

[0045] For each spatial sampling point, the spatial sampling point is projected onto the feature map according to the camera parameters corresponding to the input image, and the corresponding image features are extracted using bilinear interpolation. Specifically, for each spatial sampling point, its coordinates in three-dimensional space and the camera parameters of the input image (especially extrinsic and intrinsic parameters) are used to calculate the two-dimensional pixel position of the sampling point in each input image through projection transformation. Since this projected position may be located between pixel grids, the feature vector corresponding to this position is extracted from the feature map using the bilinear interpolation method, which serves as the image feature of the sampling point under this input viewpoint. For multiple input images, the features extracted from all viewpoints can be aggregated (e.g., by averaging or taking the maximum value) to obtain the final image feature representation.
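A minimal sketch of the projection and bilinear lookup described above, under stated assumptions: a pinhole intrinsic matrix `K`, a 4x4 world-to-camera matrix, a feature map stored as an `(H, W, C)` array, and integer pixel coordinates taken as pixel centres. The function names are hypothetical, and positions that project outside the image are simply clamped to the border here.

```python
import numpy as np

def project_points(points_world, K, world_to_cam):
    """Project (N, 3) world points to (N, 2) pixel positions; also return depth."""
    R, t = world_to_cam[:3, :3], world_to_cam[:3, 3]
    pts_cam = points_world @ R.T + t            # extrinsics: world -> camera
    uvw = pts_cam @ K.T                         # intrinsics: camera -> image plane
    return uvw[:, :2] / uvw[:, 2:3], pts_cam[:, 2]

def bilinear_sample(feature_map, uv):
    """Sample an (H, W, C) feature map at continuous positions uv = (x, y)."""
    h, w, _ = feature_map.shape
    x = np.clip(uv[:, 0], 0, w - 1)             # clamp to the image border
    y = np.clip(uv[:, 1], 0, h - 1)
    x0 = np.clip(np.floor(x).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, h - 2)
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    top = (1 - wx) * feature_map[y0, x0] + wx * feature_map[y0, x0 + 1]
    bottom = (1 - wx) * feature_map[y0 + 1, x0] + wx * feature_map[y0 + 1, x0 + 1]
    return (1 - wy) * top + wy * bottom

def aggregate_views(per_view_features, mode="mean"):
    """Combine per-view (N, C) features by averaging or element-wise maximum."""
    stacked = np.stack(per_view_features, axis=0)
    return stacked.mean(axis=0) if mode == "mean" else stacked.max(axis=0)
```

The depth returned by `project_points` lets a caller discard points that fall behind a camera before aggregating, a check this sketch leaves out.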

[0046] On the other hand, the multi-resolution hash coding layer is used to perform multi-resolution hash coding on the three-dimensional coordinates of the spatial sampling points to obtain the spatial sampling point features. This coding layer maps the three-dimensional spatial coordinates to a compact high-dimensional feature space by introducing a multi-resolution hash table structure. Compared with traditional frequency-based position coding, multi-resolution hash coding can significantly reduce computational overhead and storage requirements while maintaining expressive power, thereby accelerating the inference process of subsequent MLP networks.

[0047] Then, the image features, the spatial sampling point features, and the viewpoint direction of the spatial sampling points are concatenated to form a comprehensive feature vector, which is input into the multilayer perceptron network to predict the color and volume density of the spatial sampling points. The multilayer perceptron network is a lightweight, fully connected neural network that learns the mapping relationship from comprehensive features to color values and volume density, enabling it to represent the geometric structure and appearance texture of a scene. Volume density represents the probability that light rays are terminated at the sampling point, determining the weight of that point's contribution to the final pixel color; the color value characterizes the radiance of that point under a given viewpoint direction.
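To make the concatenate-then-predict step concrete, the following is a small NumPy sketch of a fully connected network with randomly initialized weights. The layer sizes, the ReLU hidden activations, the softplus on density, and the sigmoid on color are assumptions chosen for illustration; the source only states that a lightweight MLP maps the concatenated vector to color and volume density.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Create (weight, bias) pairs for consecutive layer sizes."""
    return [(rng.normal(0.0, np.sqrt(2.0 / m), (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def radiance_mlp(image_feat, point_feat, view_dir, params):
    """Predict (rgb, sigma) for N sampling points from concatenated features."""
    x = np.concatenate([image_feat, point_feat, view_dir], axis=-1)
    for w, b in params[:-1]:
        x = np.maximum(x @ w + b, 0.0)          # hidden layers with ReLU
    w, b = params[-1]
    out = x @ w + b                             # 4 outputs: density + RGB
    sigma = np.logaddexp(out[..., 0], 0.0)      # softplus keeps density >= 0
    rgb = 1.0 / (1.0 + np.exp(-out[..., 1:4]))  # sigmoid keeps color in [0, 1]
    return rgb, sigma
```

With, say, 16 image-feature channels, 8 hash-feature channels, and a 3-component direction, the first layer would take 27 inputs, e.g. `init_mlp([27, 32, 32, 4], np.random.default_rng(0))`.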

[0048] Finally, based on the color and volume density of all spatial sampling points, an initial scene image from the target viewpoint is synthesized using volume rendering technology. Specifically, the color value of each sampling point is weighted and integrated along each ray, with the weights determined by the volume density of each point and the ray propagation distance, ultimately yielding the synthesized color of that pixel. By performing the same volume rendering operation on all pixels, a complete two-dimensional image from the target viewpoint, i.e., the initial scene image, can be generated.
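The weighted integration along each ray can be sketched with the standard discrete volume rendering quadrature used by NeRF-style methods: each sample gets an opacity from its density and the distance to the next sample, and its weight is that opacity multiplied by the transmittance accumulated in front of it. The function name and array layout below are assumptions for illustration.

```python
import numpy as np

def volume_render(colors, sigmas, t_vals):
    """Composite per-sample colors into one color per ray.

    colors: (R, N, 3), sigmas: (R, N), t_vals: (R, N) sorted along each ray.
    Returns (rgb, weights) with shapes (R, 3) and (R, N).
    """
    # Distance between consecutive samples; the last interval is open-ended.
    deltas = np.diff(t_vals, axis=-1)
    last = np.full(t_vals[..., :1].shape, 1e10)
    deltas = np.concatenate([deltas, last], axis=-1)
    # Opacity of each interval from density and interval length.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: fraction of light surviving all earlier samples.
    trans = np.cumprod(1.0 - alphas + 1e-10, axis=-1)
    trans = np.concatenate([np.ones_like(trans[..., :1]), trans[..., :-1]], axis=-1)
    weights = alphas * trans
    rgb = (weights[..., None] * colors).sum(axis=-2)
    return rgb, weights
```

The per-ray sum of `weights` is the accumulated opacity; values well below 1 indicate rays that passed mostly through empty space.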

[0049] It should be noted that during the model training phase, the initial scene image obtained from volume rendering is compared with the ground truth image to construct a loss function (such as L2 loss or perceptual loss). The parameters of the entire improved neural radiance field network are then optimized using the backpropagation algorithm. This optimization process enables the network to learn the mapping relationship from the input image to a 3D scene representation, continuously improving the quality of the generated images. After sufficient training, the network can be used for subsequent novel-view image generation tasks.

[0050] Figure 7 shows example novel-view images of a flying car model generated by this method, demonstrating that high-quality images of the flying car from various perspectives can be generated through the above steps even with a limited number of input images.

[0051] Through the above steps, efficient reconstruction of 3D scenes and generation of new perspective images are achieved under conditions of small sample size and limited viewpoint, providing high-quality initial input for subsequent image inpainting and scene editing.

[0052] Optionally, the CNN pre-training layer adopts the residual structure in the ResNet50 network to reduce the number of network parameters and prevent overfitting.

[0053] Specifically, in this embodiment, the CNN pre-training layer adopts the residual structure of the ResNet50 network to optimize the number of network parameters and improve the model's generalization ability. In traditional NeRF-based scene reconstruction methods, rendering and training are very slow, especially in complex scenes, because the pre-training layer must perform dense image feature extraction; this is a major bottleneck restricting the practicality of such methods. To solve this problem, this embodiment makes targeted improvements to the network structure of the CNN pre-training layer.

[0054] Residual structures are a key design feature in deep convolutional neural networks used to mitigate gradient vanishing and support increasing network depth. In the common ResNet series of networks, ResNet34 uses a basic residual block consisting of two 3×3 convolutional layers, while ResNet50 introduces a bottleneck block. As shown in Figure 5, the bottleneck block sequentially includes a 1×1 convolution (for dimensionality reduction), a 3×3 convolution (for spatial feature extraction), and a 1×1 convolution (for dimensionality increase), forming a "bottleneck" shape. This design can significantly reduce the number of parameters and computational cost while maintaining network depth: the first 1×1 convolution compresses the number of channels, the 3×3 convolution operates on the compressed features, and the final 1×1 convolution restores the number of channels, so the number of feature map channels in the intermediate layer is much smaller than the number of input and output channels.
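The parameter saving of the bottleneck design can be checked with simple arithmetic. The sketch below counts convolution weights only (no biases or normalization layers) and compares the two block types at the same input/output width of 256 channels, with the intermediate width set to one quarter of that. The helper names and the chosen width are illustrative assumptions, and the comparison is per block, not a whole-network total.

```python
def conv_params(kernel, c_in, c_out):
    """Weight count of a kernel x kernel convolution (bias and norm ignored)."""
    return kernel * kernel * c_in * c_out

def basic_block_params(channels):
    """Two 3x3 convolutions at full width."""
    return 2 * conv_params(3, channels, channels)

def bottleneck_params(channels, reduction=4):
    """1x1 reduce -> 3x3 at reduced width -> 1x1 restore."""
    mid = channels // reduction
    return (conv_params(1, channels, mid)
            + conv_params(3, mid, mid)
            + conv_params(1, mid, channels))

basic = basic_block_params(256)
bottle = bottleneck_params(256)
print(f"basic block: {basic:,} weights")
print(f"bottleneck:  {bottle:,} weights")
print(f"ratio:       {basic / bottle:.1f}x")
```

At equal width, almost all of the saving comes from running the 3×3 convolution on the compressed channels.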

[0055] In this method, by replacing the conventional ResNet34 structure with the ResNet50 residual structure in the CNN pre-training layers, the total number of network parameters is reduced from approximately 250 million to approximately 20 million, a reduction of over 90%. This reduction in parameters not only lowers model storage and computational costs, significantly improving training and rendering speed, but also effectively prevents overfitting in scenarios with few training samples. Fewer parameters mean lower model complexity, reduced dependence on training data, and improved generalization ability. At the same time, the bottleneck structure of ResNet50, through deeper nonlinear transformations, can extract richer image features, contributing to improved quality in subsequent scene reconstruction. Figure 5 compares the ResNet34 residual structure (left) with the ResNet50 residual structure (right), showing the structural differences between the two modules.

[0056] By employing the improved CNN pre-training layer described above, this method can extract image features more efficiently under sparse input image conditions, effectively solving the problem of slow training and rendering speed in complex scenes by traditional methods. At the same time, it provides high-quality input for subsequent multi-resolution hashing and MLP networks, significantly improving overall operating efficiency while ensuring the visual realism of the generated images.

[0057] Optionally, the operation of performing multi-resolution hash encoding on the three-dimensional coordinates of the spatial sampling points to obtain spatial sampling point features using the multi-resolution hash encoding layer includes: configuring multiple hash tables of different resolutions, the multiple hash tables of different resolutions being arranged in a geometric progression from high to low; for each resolution level, finding voxel corner points containing the coordinates of the spatial sampling points, and hashing the voxel corner points, using the hash value of the voxel corner points as the key, and retrieving the feature vector corresponding to each voxel corner point from the hash table of the corresponding resolution; performing linear interpolation on the feature vectors of the retrieved multiple voxel corner points according to the position of the spatial sampling point within the voxel to obtain the feature vector at that resolution level; and concatenating the feature vectors at all resolution levels to obtain the spatial sampling point features.

[0058] Specifically, the multi-resolution hash coding layer in this embodiment constructs multiple hash tables with different resolutions to efficiently encode the three-dimensional coordinates of spatial sampling points, thereby significantly improving query speed while ensuring expressive power. The specific steps of this encoding operation are explained in detail below with reference to Figure 6.

[0059] First, multiple hash tables with different resolutions are configured, arranged in a geometric progression from high to low resolution. As shown in Figure 6, the multi-resolution hash coding structure consists of L resolution levels, each corresponding to an independent hash table. The highest resolution level has the finest grid division, with the smallest grid cells, and can capture the fine geometric details and texture information of the scene; the lowest resolution level has the coarsest grid division, with the largest grid cells, and is used to represent the global structure and general outline of the scene. These resolution levels are arranged from high to low according to a geometric progression (e.g., with a common ratio of 2), ensuring that the grid sizes of adjacent levels are multiples of each other, thus achieving multi-scale coverage of three-dimensional space. Each hash table stores the trainable feature vectors of the corresponding grid vertices, and the dimension F of the feature vectors is typically the same across all levels.
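As a small illustration of the geometric progression of levels, the sketch below lists grid resolutions from fine to coarse with a common ratio of 2 and compares the vertex count of a dense grid at each level against a fixed hash table size. The finest resolution of 512, the five levels, and the table size of 2^19 entries are hypothetical values picked for the example.

```python
def level_resolutions(finest, n_levels, ratio=2.0):
    """Grid resolutions ordered fine to coarse, shrinking by `ratio` per level."""
    return [max(1, round(finest / ratio ** level)) for level in range(n_levels)]

def dense_vertices(resolution):
    """Vertex count of a dense cubic grid with `resolution` cells per axis."""
    return (resolution + 1) ** 3

table_size = 2 ** 19  # hypothetical number of entries per hash table
for res in level_resolutions(finest=512, n_levels=5):
    print(f"resolution {res:4d}: {dense_vertices(res):>11,d} dense vertices, "
          f"table holds {table_size:,d} entries")
```

Whenever a level has more grid vertices than the table has entries, several vertices necessarily share an entry, which is where the memory saving over dense storage comes from.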

[0060] For each resolution level, the voxel corner points containing the coordinates of the spatial sampling point are found, and these voxel corner points are hashed. Specifically, given the 3D coordinates of a spatial sampling point, under the current resolution level's grid division, the voxel (i.e., the 3D grid cell) to which the coordinates belong is first determined. This voxel consists of eight corner points (i.e., grid vertices). Then, the 3D coordinates of these eight corner points are hashed respectively to obtain the hash value corresponding to each corner point. The design goal of the hash function is to be able to map any 3D coordinates to the index space of the hash table, while minimizing the probability of collisions where different coordinates are mapped to the same index.

[0061] Next, using the hash value of the voxel corner point as the key, the feature vector corresponding to each voxel corner point is retrieved from the hash table at the corresponding resolution. Each resolution-level hash table is essentially a lookup table with hash values as keys and trainable feature vectors as values. Using the hash value as an index, the feature vector associated with that corner point can be directly read from the hash table. This hash-based retrieval method significantly reduces memory usage compared to traditional dense grid storage methods because no storage space needs to be allocated for unoccupied areas in sparse scenes.

[0062] Then, based on the position of the spatial sampling point within the voxel, linear interpolation is performed on the feature vectors of the retrieved voxel corner points to obtain the feature vector at this resolution level. Specifically, using the three-dimensional normalized coordinates of the spatial sampling point within the voxel as weights, trilinear interpolation is performed on the feature vectors of the eight corner points to obtain the feature representation of the sampling point at this resolution level. Linear interpolation ensures the continuity of the encoding results, allowing similar spatial points to obtain similar feature representations, which is beneficial for the subsequent learning of smooth geometric and color functions by the MLP network.

[0063] Finally, the feature vectors from all resolution levels are concatenated to obtain the spatial sampling point features. As shown in Figure 6, the feature vectors obtained from interpolation at each resolution level (e.g., from coarse to fine resolution) are sequentially concatenated to form a high-dimensional composite feature vector. This vector integrates multi-scale information from global structure to local details, providing a rich input representation for the subsequent MLP network. By inputting this feature vector along with the image features and viewpoint direction into the MLP network, the color and volume density of the sampling points can be predicted.
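The encoding steps above (locate the voxel, hash its eight corners, look up their feature vectors, interpolate trilinearly, concatenate across levels) can be sketched in NumPy as follows. The source does not specify the hash function; the XOR-of-prime-multiples spatial hash used here is the one popularized by Instant-NGP and is an assumption, as are the function names and the convention that points lie in the unit cube.

```python
import numpy as np

# Per-axis multipliers of the assumed spatial hash.
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)
# Integer offsets of the eight corners of a voxel.
OFFSETS = np.array([[i, j, k] for i in (0, 1) for j in (0, 1) for k in (0, 1)])

def hash_corners(corners, table_size):
    """Map non-negative integer corner coordinates (..., 3) to table indices."""
    c = corners.astype(np.uint64) * PRIMES
    h = c[..., 0] ^ c[..., 1] ^ c[..., 2]
    return (h % np.uint64(table_size)).astype(np.int64)

def encode(points, tables, resolutions):
    """Multi-resolution hash encoding of (N, 3) points in the unit cube.

    tables: one (T, F) feature array per level; resolutions: cells per axis.
    Returns an (N, L * F) array of concatenated per-level features.
    """
    feats = []
    for table, res in zip(tables, resolutions):
        scaled = points * res
        base = np.clip(np.floor(scaled).astype(np.int64), 0, res - 1)
        frac = scaled - base                              # position inside voxel
        corners = base[:, None, :] + OFFSETS[None]        # (N, 8, 3)
        corner_feats = table[hash_corners(corners, len(table))]  # (N, 8, F)
        # Trilinear weight of a corner: product over axes of frac or (1 - frac).
        w = np.where(OFFSETS[None] == 1, frac[:, None, :],
                     1.0 - frac[:, None, :]).prod(axis=-1)  # (N, 8)
        feats.append((w[..., None] * corner_feats).sum(axis=1))
    return np.concatenate(feats, axis=-1)
```

In training, the entries of `tables` would be optimized together with the MLP weights; in this sketch they are plain arrays supplied by the caller.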

[0064] Through the aforementioned multi-resolution hash encoding mechanism, this method can efficiently transform 3D spatial coordinates into compact and expressive feature representations. At coarse resolution levels, the number of grid vertices does not exceed the hash table size, so each vertex has a unique storage location; at fine resolution levels, vertices share table entries through the hash function, which bounds memory overhead while still representing fine details. Furthermore, compared to traditional frequency-based positional encoding, hash encoding offers fast lookups and allows a smaller MLP network, thus significantly improving training and rendering speeds while maintaining reconstruction quality.

[0065] Optionally, the operation of identifying the region to be repaired in the initial scene image and generating a corresponding mask image includes: determining the blank or deformed regions of the image edges in the initial scene image caused by the change in viewing angle based on the offset of the target viewing angle relative to the viewing angle of the input image, and marking them as regions to be repaired; generating a mask image with the same size as the initial scene image, wherein the pixel value of the region to be repaired is set to a first value, and the pixel values of other regions are set to a second value.

[0066] Specifically, after the initial scene image is generated, the discrepancy between the target viewpoint and the input image viewpoint often leaves defective areas in the generated image, which need to be identified and repaired. The identification and mask image generation process is explained in detail below with reference to Figures 8 to 11.

[0067] First, based on the offset of the target viewpoint relative to the input image viewpoint, the blank or deformed areas at the edges of the initial scene image caused by the viewpoint change are determined and marked as areas to be repaired. Specifically, when there is a significant difference between the target viewpoint and the input image viewpoint, the new viewpoint exposes 3D scene areas not covered by the original viewpoint, resulting in blank areas at the edges of the generated initial scene image (e.g., the blank areas at the top or bottom edges of the image in Figure 10 after the pitch angle is adjusted by 15° in a low-altitude scene, and the blank areas at the left and right edges of the image in Figure 11 after the heading angle is adjusted by 15° in a ground scene). At the same time, because the occlusion relationships of objects in the scene change under different viewpoints, background areas originally obscured by foreground objects may be partially revealed, and the geometry of the revealed parts may be distorted or deformed, forming unnatural image areas. This step automatically detects these blank and deformed areas by analyzing the geometric transformation relationship between the target viewpoint and the input image viewpoint, combined with the depth information of the 3D scene, and identifies them as areas requiring repair.

[0068] Furthermore, to meet the need for customized scene content in simulation testing, users can also actively specify certain areas in the image as areas to be repaired. As shown in Figure 8 and Figure 9, users can specify targets they want to remove from the scene by manually drawing masks (such as the white vehicle at the bottom of Figure 8 and the truck on the right in Figure 9); these designated areas are also marked as areas to be repaired. The user-defined mask and the automatically detected areas described above can be used together or separately as the source of areas to be repaired, providing a flexible way to edit the scene.

[0069] Then, a mask image with the same size as the initial scene image is generated. In this mask image, the pixel values of the area to be repaired are set to a first value, and the pixel values of other areas are set to a second value. Specifically, the mask image is a single-channel image with the same width and height as the initial scene image, where the value of each pixel position indicates whether that position belongs to the area to be repaired. In this embodiment, the pixel values of the area to be repaired are set to the first value (e.g., 255, representing white), and the pixel values of other areas that do not need to be repaired (i.e., areas that remain unchanged) are set to the second value (e.g., 0, representing black), thus forming a binary mask image. This mask image accurately identifies the boundaries of the areas that need to be filled and reconstructed by the image inpainting model, providing precise guidance for subsequent repair operations.
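A minimal sketch of building the binary mask, assuming NumPy arrays. The source describes automatic detection from viewpoint geometry and scene depth without giving a formula; as a simple stand-in, this sketch flags pixels whose accumulated opacity from volume rendering falls below a threshold, which is an assumption and not the detection rule of the source. The function name, the threshold, and the optional boolean user mask are likewise illustrative.

```python
import numpy as np

def build_mask(accumulated_opacity, user_mask=None, opacity_threshold=0.5,
               repair_value=255, keep_value=0):
    """Return a single-channel uint8 mask the same size as the image.

    accumulated_opacity: (H, W) per-pixel sum of volume rendering weights,
    used here as a proxy for "the new viewpoint saw no reconstructed content".
    user_mask: optional (H, W) boolean array of user-selected pixels.
    """
    to_repair = accumulated_opacity < opacity_threshold   # automatic part
    if user_mask is not None:
        to_repair = to_repair | user_mask.astype(bool)     # union with user part
    return np.where(to_repair, repair_value, keep_value).astype(np.uint8)
```

Passing only `user_mask` with an all-ones opacity map gives a purely user-defined mask, and omitting `user_mask` gives a purely automatic one.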

[0070] The above steps can accurately locate image defect areas caused by changes in viewing angle, while also supporting user-defined editing needs, providing high-quality input conditions for subsequent image restoration models.

[0071] Optionally, the pre-trained image inpainting model is a LAMA image inpainting model based on Fast Fourier Convolution, and the LAMA image inpainting model includes a generator; and the operation of inputting the initial scene image and the mask image into the pre-trained image inpainting model to perform content inpainting on the region to be inpainted and generate the final target scene image includes: superimposing the initial scene image and the mask image and inputting them into the generator; the generator automatically inferring and filling the region to be inpainted corresponding to the mask image based on the image content of the unoccluded region through the Fast Fourier Convolution module, generating the repaired content of the region to be inpainted; and fusing the repaired content of the region to be inpainted with the unoccluded region in the initial scene image to obtain the final target scene image.

[0072] Specifically, in this embodiment, the pre-trained image inpainting model adopts a LAMA (Large Mask Inpainting) image inpainting model based on Fast Fourier Convolution, the core of which is its generator. The specific steps of this repair operation are explained in detail below with reference to Figures 8 to 11.

[0073] First, the initial scene image and the mask image are superimposed and input into the generator. Specifically, after obtaining the mask image generated in step S206, the mask image and the initial scene image are concatenated along the channel dimension to form a multi-channel input tensor. The initial scene image provides RGB three-channel visual content information, and the mask image provides single-channel region indication information. The superposition of the two constitutes a four-channel input, which serves as the generator's input data. This superposition method allows the generator to clearly identify which regions need repair (regions with a mask value of the first value) and which regions need to be preserved (regions with a mask value of the second value), thus accurately focusing on the regions to be repaired during the repair process.
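The four-channel input can be assembled as in the following NumPy sketch. Scaling the image to [0, 1], zeroing the pixels inside the region to be repaired, and the channels-first output layout are assumptions made for illustration; the source only specifies that the RGB image and the single-channel mask are concatenated along the channel dimension.

```python
import numpy as np

def make_generator_input(image_rgb, mask, repair_value=255):
    """Stack an (H, W, 3) uint8 image and an (H, W) uint8 mask into (4, H, W)."""
    m = (mask == repair_value).astype(np.float32)      # 1 = repair, 0 = keep
    img = image_rgb.astype(np.float32) / 255.0
    img = img * (1.0 - m[..., None])                   # hide pixels to be repaired
    stacked = np.concatenate([img, m[..., None]], axis=-1)   # (H, W, 4)
    return np.transpose(stacked, (2, 0, 1))            # channels first
```

The first three channels carry the visible image content and the fourth tells the generator which pixels it is expected to fill.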

[0074] Then, the generator uses a Fast Fourier Convolution (FFC) module to automatically infer and fill the corresponding repaired region in the mask image based on the image content of the unoccluded area, generating the repaired content for the repaired region. The LAMA model's generator uses Fast Fourier Convolution (FFC) as its core building block. This module is designed to address the problem of limited receptive field and difficulty in capturing global contextual information of images in traditional convolutional neural networks.

[0075] Finally, the repaired content of the area to be repaired is fused with the unoccluded area in the initial scene image to obtain the final target scene image. Specifically, for pixel positions marked as areas to be repaired in the mask image, the repaired content output by the generator is used for filling; for pixel positions marked as non-repaired areas in the mask image, the original pixel values of the initial scene image are retained. Through this mask-based selective fusion, the original information of the unoccluded areas is preserved, while the content of the repaired areas is seamlessly embedded, forming a complete, coherent, and realistic target scene image.
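The mask-based selective fusion amounts to a per-pixel choice between the generator output and the original image. A minimal NumPy sketch, with a hypothetical function name and the example mask value of 255 for the region to be repaired:

```python
import numpy as np

def fuse(initial_image, inpainted_image, mask, repair_value=255):
    """Take generator pixels inside the mask and original pixels elsewhere.

    initial_image, inpainted_image: (H, W, 3); mask: (H, W).
    """
    repair = (mask == repair_value)[..., None]     # broadcast over channels
    return np.where(repair, inpainted_image, initial_image)
```

Because pixels outside the mask are copied unchanged, any difference between the fused result and the initial scene image is confined to the region to be repaired.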

[0076] Figures 8 to 11 show example results of image repair using this method. Figure 8 and Figure 9 show the generator inferring and filling in the background of the removed area after the target to be removed is specified by a mask; Figure 10 and Figure 11 show the repair of blank edge areas caused by viewpoint adjustments. These examples show that the LAMA inpainting model based on Fast Fourier Convolution can effectively utilize global image context information to generate repaired content that is consistent with the surrounding environment, resulting in a visually realistic and natural final target scene image.

[0077] Through the above repair operations, this method can not only eliminate image defects caused by changes in viewing angle, but also respond to user-defined editing needs and achieve flexible adjustment of scene content, thereby providing high-quality and diverse scene data for flying car simulation testing.

[0078] Furthermore, the LAMA (Large Mask Inpainting) image inpainting model based on Fast Fourier Convolution includes a discriminator that plays a role during the model training phase. Specifically, during training, the discriminator receives the content of the inpainted region output by the generator and compares it with the corresponding real content in the original image. Through adversarial training, it judges the authenticity of the input content and uses this as the optimization target to backpropagate gradients, prompting the generator to produce more realistic inpainting effects. It should be noted that the discriminator only participates in parameter optimization during the offline training phase of the model. In the application phase of the method (i.e., when performing step S206 for image inpainting), only the trained generator is used for forward inference; the discriminator does not participate in the actual inpainting process. Through this generative adversarial network training mechanism, the LAMA model can learn high-quality image inpainting capabilities, thereby ensuring that the inpainting effect in the application phase is realistic and natural.

[0079] Optionally, the Fast Fourier Convolution module includes a local branch, a global branch, and a fusion unit. Furthermore, the generator, through the Fast Fourier Convolution module, automatically infers and fills the area to be repaired corresponding to the mask image based on the image content of the unoccluded area, generating the repair content for the area to be repaired. This operation includes: allocating all channels of the input feature map to the local branch and the global branch according to a preset ratio, wherein a subset of channels allocated to the local branch is used to extract local detail features, and a subset of channels allocated to the global branch is used to extract global context features; inputting the subset of channels allocated to the local branch to the local branch, extracting local detail features of the image through conventional convolution operations; inputting the subset of channels allocated to the global branch to the global branch, converting the feature map to the frequency domain through Fourier transform, extracting global context features of the image in the frequency domain, and then converting it back to the spatial domain through inverse Fourier transform to obtain the global context features of the image; inputting the local detail features output by the local branch and the global context features output by the global branch to the fusion unit for feature fusion to obtain a fused feature map; and generating the repair content for the area to be repaired based on the fused feature map.

[0080] Specifically, the Fast Fourier Convolution (FFC) module in this embodiment is a core component of the LAMA image inpainting model generator. Its design aims to simultaneously capture local details and global contextual information of an image, thereby achieving high-quality inpainting of large-scale missing regions. The generation process of the inpainted content is described in detail below with reference to the internal structure of this module.

[0081] First, all channels of the input feature map are allocated to the local branch and the global branch according to a preset ratio. The subset of channels allocated to the local branch is used to extract local detail features, and the subset of channels allocated to the global branch is used to extract global context features. Specifically, the FFC module receives a feature map from the previous layer, which has C channels. Based on a preset partitioning ratio (e.g., 50% of the channels are allocated to the local branch and 50% to the global branch), the C channels are divided into two subsets: the local branch channel subset (denoted as C_local) and the global branch channel subset (denoted as C_global), satisfying C_local + C_global = C. The partitioning ratio can be adjusted according to task requirements to balance the weight between preserving local details and global semantic understanding. The two partitioned channel subsets are then processed by two parallel branches respectively.
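The channel partition can be sketched as follows. The function name `split_channels`, the parameter `global_ratio`, the channel-first (C, H, W) layout, and the rounding rule are assumptions introduced for illustration; the application only requires that the two subsets satisfy C_local + C_global = C.

```python
import numpy as np


def split_channels(features, global_ratio=0.5):
    """Split a (C, H, W) feature map into local and global channel subsets."""
    if not 0.0 <= global_ratio <= 1.0:
        raise ValueError("global_ratio must lie in [0, 1]")
    c = features.shape[0]
    c_global = int(round(c * global_ratio))  # channels for the global branch
    c_local = c - c_global                   # remaining channels stay local
    return features[:c_local], features[c_local:]


features = np.zeros((8, 4, 4))
local_part, global_part = split_channels(features, global_ratio=0.5)
print(local_part.shape, global_part.shape)  # (4, 4, 4) (4, 4, 4)
```

Computing the local count as the remainder guarantees that the two subsets always add up to C, whatever ratio is chosen.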

[0082] A subset of channels allocated to the local branch is input to the local branch, and local detail features of the image are extracted through conventional convolution operations. The local branch adopts a conventional convolutional neural network structure, consisting of several two-dimensional convolutional layers, activation function layers (such as ReLU), and normalization layers. These convolutional layers perform a sliding window operation on the input feature map in the spatial domain, with each convolutional kernel focusing only on pixels within its local neighborhood, thus effectively capturing local detail information such as edges, textures, and corners of the image. The output of the local branch preserves the spatial resolution of the input feature map and generates feature representations rich in local details, providing a foundation for fine texture synthesis of subsequent restoration content.

[0083] A subset of channels allocated to the global branch is input to the global branch. The feature map is transformed to the frequency domain using a Fourier transform. Global context features of the image are extracted in the frequency domain, and then transformed back to the spatial domain using an inverse Fourier transform to obtain the global context features of the image. The design of the global branch overcomes the limitation of the limited receptive field of traditional convolution: First, a Fast Fourier Transform (FFT) is performed on the input feature map along the spatial dimension to transform the features from the spatial domain to the frequency domain. In the frequency domain, each frequency point corresponds to the energy of the image at different frequency components, and the value of each frequency point is implicitly related to the global information of the entire image. Subsequently, a convolution operation is performed on the feature map in the frequency domain. This operation can simultaneously adjust the amplitude and phase of all frequency points, thereby achieving the modeling of global context information. After frequency domain convolution, the feature map is transformed back to the spatial domain using an inverse Fourier transform (IFFT). In this way, the global branch can capture the dependencies between distant pixels, understand the overall layout, semantic category, and scene structure of the image, and provide content guidance for the repair area that conforms to global logic.
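The transform, frequency-domain convolution, and inverse transform sequence can be sketched as below. This is a simplified illustration under stated assumptions: the real and imaginary parts of the spectrum are stacked as channels and mixed by a 1x1 convolution, which is one common way of realizing a frequency-domain convolution but is not specified by this application. The function name `global_branch`, the `weight` parameter, and the optional ReLU are hypothetical, and normalization layers are omitted.

```python
import numpy as np


def global_branch(features, weight, apply_relu=True):
    """Frequency-domain processing of a real (C, H, W) feature map.

    weight: (2C, 2C) kernel of a 1x1 convolution applied to the stacked
    real and imaginary parts of the spectrum (hypothetical parameter).
    """
    c, h, w = features.shape
    # Spatial domain -> frequency domain; the result has shape (C, H, W // 2 + 1).
    spectrum = np.fft.rfft2(features, axes=(-2, -1), norm="ortho")
    stacked = np.concatenate([spectrum.real, spectrum.imag], axis=0)
    # A 1x1 convolution is a per-frequency linear mix of the channels.
    mixed = np.einsum("oc,chw->ohw", weight, stacked)
    if apply_relu:
        mixed = np.maximum(mixed, 0.0)
    out_spectrum = mixed[:c] + 1j * mixed[c:]
    # Frequency domain -> spatial domain, restoring the original (H, W) size.
    return np.fft.irfft2(out_spectrum, s=(h, w), axes=(-2, -1), norm="ortho")


rng = np.random.default_rng(0)
feat = rng.standard_normal((2, 8, 8))
out = global_branch(feat, rng.standard_normal((4, 4)))
print(out.shape)  # (2, 8, 8)
```

Each frequency point depends on every spatial position of the input, so an operation applied in the frequency domain influences the whole feature map after the inverse transform. As a sanity check on the sketch, an identity `weight` with `apply_relu=False` returns the input feature map unchanged.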

[0084] The local detail features output from the local branch and the global context features output from the global branch are input into the fusion unit for feature fusion, resulting in a fused feature map. The fusion unit typically integrates the outputs of the two branches using element-wise addition or channel concatenation followed by convolution. This fusion preserves the accuracy of local details while incorporating global semantic consistency, resulting in a feature map that contains rich texture information and conforms to the overall image structure. The fused feature map, as the output of this FFC module, can be further passed to subsequent network layers for further processing.
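The two fusion options mentioned above can be sketched as follows. The function names and the `weight` parameter are hypothetical, and a channel-first (C, H, W) layout is assumed; the 1x1 convolution is written as a channel-mixing matrix for brevity.

```python
import numpy as np


def fuse_add(local_feat, global_feat):
    """Element-wise addition; both inputs must have the same shape."""
    if local_feat.shape != global_feat.shape:
        raise ValueError("element-wise addition requires equal shapes")
    return local_feat + global_feat


def fuse_concat_conv(local_feat, global_feat, weight):
    """Channel concatenation followed by a 1x1 convolution.

    weight: (C_out, C_local + C_global) kernel (hypothetical parameter).
    """
    stacked = np.concatenate([local_feat, global_feat], axis=0)
    return np.einsum("oc,chw->ohw", weight, stacked)


rng = np.random.default_rng(0)
local_feat = rng.standard_normal((3, 5, 5))
global_feat = rng.standard_normal((3, 5, 5))

added = fuse_add(local_feat, global_feat)
# With the kernel [I | I], the concatenation variant reduces to plain addition.
weight = np.concatenate([np.eye(3), np.eye(3)], axis=1)
mixed = fuse_concat_conv(local_feat, global_feat, weight)
print(added.shape, mixed.shape)  # (3, 5, 5) (3, 5, 5)
```

Element-wise addition has no parameters but requires both branches to output the same number of channels, whereas the concatenation variant accepts different channel counts and learns how to weight them.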

[0085] Finally, based on the fused feature map, the repaired content for the region to be repaired is generated. In the generator of a LAMA model, multiple FFC modules are typically stacked to progressively reconstruct the image from low to high resolution. In the last layer, an output convolutional layer maps the fused feature map to an RGB three-channel repair result. This repair result corresponds to the pixel values within the region to be repaired specified by the mask, and its content is automatically inferred by the network based on the context of the unmasked region.

[0086] Through the above design, the Fast Fourier Convolution module can simultaneously take into account the continuity of local details and the rationality of global semantics during the repair process, making it particularly suitable for repairing large-area missing regions. The repair effect shown in Figures 8 to 11 is due to the effective use of global context information by the FFC module, which makes the background filling after removing the target natural and coherent and fills the blank edge areas caused by the viewpoint adjustment in accordance with the scene logic, thereby generating a visually realistic and believable final target scene image.

[0087] In summary, this application proposes a NeRF-based method for generating visual simulation scenes for intelligent flying cars. Its core innovations lie in two aspects: First, an improved neural radiation field network is constructed by combining an improved CNN pre-trained layer with multi-resolution hash coding, and the PointRend module in detectron2 is used for scene segmentation, enabling rapid generation of low-altitude and ground scenes for intelligent flying cars in small-sample environments. Second, LAMA image inpainting technology based on Fast Fourier Convolution (FFC) is introduced, generating mask images to achieve small-angle offsets, target deletion, and scene editing of existing images, thereby effectively expanding the dataset and improving the simulation test scenes.

[0088] Compared with existing technologies, this application has the following advantages: First, applying the small-sample visual simulation scene generation method to the low-altitude and ground environments of flying cars reduces testing costs and provides an efficient simulation verification method for flying car R&D; Second, by combining CNN pre-trained layers and multi-resolution hash coding, the training and rendering speeds are significantly accelerated while image generation quality is improved; Third, image inpainting technology is used to generate novel images by small-angle deflection of existing images, effectively expanding rare sample data and enhancing the diversity and coverage of the dataset; Fourth, by generating mask images, the editing, deletion, and scene replacement of existing images of flying cars are realized, making the test scenes closer to actual working conditions and enabling flexible scene changes, greatly reducing the time and manpower costs required for algorithm testing; Fifth, this method works well with self-made datasets, since input data can be processed through existing modules without complex preprocessing procedures, greatly saving training, rendering, and testing costs.

[0089] Therefore, this application organically combines an improved neural radiation field network with image inpainting technology to achieve rapid generation of high-quality, multi-view flying car simulation scenes under conditions of small sample size and limited viewpoints. It also allows for flexible editing of scene content and expansion of the dataset, effectively solving the technical problems of traditional NeRF-based scene image generation methods in small sample scenes, such as poor reconstruction quality, slow training and rendering speed, limited viewpoints, and inability to effectively generate large-scale and motion scenes. This provides a feasible method for autonomous driving scene simulation and provides theoretical assistance for the research and testing of intelligent flying cars.

[0090] In addition, referring to Figure 1, according to a second aspect of this embodiment, a storage medium is provided. The storage medium includes a stored program, wherein, when the program is executed, a processor performs any of the methods described above.

[0091] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that this application is not limited to the described order of actions, as some steps may be performed in other orders or simultaneously according to this application. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily essential to this application.

[0092] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods according to the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk), and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, or network device, etc.) to execute the methods described in the various embodiments of this application.

[0093] Embodiment 2. Figure 12 shows a NeRF-based visual simulation scene generation device for intelligent flying cars according to this embodiment, which corresponds to the method described in Embodiment 1. Referring to Figure 12, the device includes: an image acquisition module 1210, used to acquire an input image of a target scene, wherein the target scene is a low-altitude flight scene or a ground driving scene of an intelligent flying car; an initial scene image generation module 1220, used to generate an initial scene image from the target perspective based on the input image using an improved neural radiation field network; wherein the improved neural radiation field network uses an improved CNN pre-trained layer to extract features from the input image and uses multi-resolution hash coding to hash the spatial sampling points from the target perspective, and inputs the extracted image features, the hash-coded spatial sampling point features, and the viewpoint direction of the spatial sampling points into a multilayer perceptron network to predict the color and volume density of the spatial sampling points, and finally generates the initial scene image through volume rendering; and a target scene image generation module 1230, used to identify the region to be repaired in the initial scene image, generate a corresponding mask image, and input the initial scene image and the mask image into a pre-trained image inpainting model to perform content repair on the region to be repaired, generating the final target scene image as scene data for visual simulation of the intelligent flying car.

[0094] It should be noted that the NeRF-based intelligent flying car visual simulation scene generation device provided in this embodiment can realize all the functions and steps in the above method embodiments, solve the same technical problems, and achieve the same technical effects. The similarities will not be repeated here.

[0095] Therefore, according to this embodiment, an improved neural radiation field network (NeRF) achieves efficient representation and rapid rendering of 3D scenes under sparse input image conditions. It can quickly generate initial scene images from a specified target viewpoint based on input images with limited viewpoints. Furthermore, an image inpainting model intelligently repairs image defects caused by viewpoint changes, and allows for selective editing of scene content according to user needs, thereby obtaining high-quality, multi-view final scene images. Through the above technical means, this application achieves rapid generation of high-quality, multi-view flying car simulation scenes under small sample and limited viewpoint conditions, and allows for flexible adjustment and expansion of scene content. This solves the technical problems of traditional NeRF-based scene image generation methods, such as poor reconstruction quality, slow training and rendering speed, limited viewpoints, and inability to effectively generate large-scale and motion scenes in small sample scenarios.

[0096] Embodiment 3. Figure 13 shows a NeRF-based visual simulation scene generation device for intelligent flying cars according to this embodiment, which corresponds to the method described in Embodiment 1. Referring to Figure 13, the device includes: a processor 1310; and a memory 1320 connected to the processor 1310, used to provide the processor 1310 with instructions to process the following steps: acquiring an input image of a target scene, wherein the target scene is a low-altitude flight scene or a ground driving scene of an intelligent flying car; using an improved neural radiation field network to generate an initial scene image from the target perspective based on the input image; wherein the improved neural radiation field network uses an improved CNN pre-trained layer to extract features from the input image, and uses multi-resolution hash coding to hash the spatial sampling points from the target perspective, and inputs the extracted image features, the hash-coded spatial sampling point features, and the viewpoint direction of the spatial sampling points into a multilayer perceptron network to predict the color and volume density of the spatial sampling points, and finally generates the initial scene image through volume rendering; identifying the region to be repaired in the initial scene image, generating a corresponding mask image, and inputting the initial scene image and the mask image into a pre-trained image inpainting model to perform content repair on the region to be repaired, generating the final target scene image as scene data for the visual simulation of the intelligent flying car.

[0097] It should be noted that the NeRF-based intelligent flying car visual simulation scene generation device provided in this embodiment can realize all the functions and steps in the above method embodiments, solve the same technical problems, and achieve the same technical effects. The similarities will not be repeated here.

[0098] Therefore, according to this embodiment, an improved neural radiation field network (NeRF) achieves efficient representation and rapid rendering of 3D scenes under sparse input image conditions. It can quickly generate initial scene images from a specified target viewpoint based on input images with limited viewpoints. Furthermore, an image inpainting model intelligently repairs image defects caused by viewpoint changes, and allows for selective editing of scene content according to user needs, thereby obtaining high-quality, multi-view final scene images. Through the above technical means, this application achieves rapid generation of high-quality, multi-view flying car simulation scenes under small sample and limited viewpoint conditions, and allows for flexible adjustment and expansion of scene content. This solves the technical problems of traditional NeRF-based scene image generation methods, such as poor reconstruction quality, slow training and rendering speed, limited viewpoints, and inability to effectively generate large-scale and motion scenes in small sample scenarios.

[0099] The sequence numbers of the embodiments in this application are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.

[0100] In the above embodiments of this application, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0101] In the several embodiments provided in this application, it should be understood that the disclosed technical content can be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the displayed or discussed mutual coupling, direct coupling, or communication connection may be through some interfaces; the indirect coupling or communication connection between units or modules may be electrical or other forms.

[0102] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0103] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0104] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), portable hard drive, magnetic disk, or optical disk.

[0105] The above description is only a preferred embodiment of this application. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the principle of this application, and these improvements and modifications should also be considered within the scope of protection of this application.

Claims

1. A method for generating visual simulation scenes for intelligent flying cars based on NeRF, characterized in that the method comprises: acquiring an input image of a target scene, wherein the target scene is a low-altitude flight scene or a ground driving scene of the intelligent flying car; generating, using an improved neural radiation field network, an initial scene image from the target viewpoint based on the input image, wherein the improved neural radiation field network uses an improved CNN pre-trained layer to extract features from the input image, uses multi-resolution hash coding to hash the spatial sampling points from the target viewpoint, inputs the extracted image features, the hash-coded spatial sampling point features, and the viewpoint direction of the spatial sampling points into a multilayer perceptron network to predict the color and volume density of the spatial sampling points, and finally generates the initial scene image through volume rendering; and identifying the region to be repaired in the initial scene image, generating a corresponding mask image, and inputting the initial scene image and the mask image into a pre-trained image inpainting model to perform content repair on the region to be repaired, generating the final target scene image as scene data for the visual simulation of the intelligent flying car.

2. The method according to claim 1, characterized in that the improved neural radiation field network includes: a CNN pre-trained layer, a multi-resolution hash coding layer, and a multilayer perceptron network; and the operation of generating an initial scene image from the target viewpoint based on the input image using the improved neural radiation field network includes: obtaining the camera parameters corresponding to the input image and the target camera parameters used to generate the new viewpoint; emitting light rays into three-dimensional space based on the target camera parameters, and performing spatial sampling on the light rays to obtain multiple spatial sampling points; extracting features from the input image using the CNN pre-trained layer to obtain a feature map; for each spatial sampling point, projecting the spatial sampling point onto the feature map according to the camera parameters corresponding to the input image, and extracting the corresponding image features by bilinear interpolation; performing multi-resolution hash coding on the three-dimensional coordinates of the spatial sampling points using the multi-resolution hash coding layer to obtain the spatial sampling point features; concatenating the image features, the spatial sampling point features, and the viewpoint direction of the spatial sampling points and inputting them into the multilayer perceptron network to predict the color and volume density of the spatial sampling points; and synthesizing an initial scene image from the target viewpoint using volume rendering based on the color and volume density of all spatial sampling points.

3. The method according to claim 2, characterized in that the CNN pre-trained layer uses the residual structure in the ResNet50 network to reduce the number of network parameters and prevent overfitting.

4. The method according to claim 2, characterized in that the operation of using the multi-resolution hash coding layer to perform multi-resolution hash coding on the three-dimensional coordinates of the spatial sampling points to obtain the spatial sampling point features includes: configuring multiple hash tables with different resolutions, and arranging the multiple hash tables with different resolutions in a geometric progression from high to low; for each resolution level, finding the voxel corner points of the voxel containing the coordinates of the spatial sampling point, and hashing the voxel corner points; using the hash value of each voxel corner point as the key, retrieving the feature vector corresponding to each voxel corner point from the hash table of the corresponding resolution; linearly interpolating the feature vectors of the retrieved voxel corner points based on the position of the spatial sampling point within the voxel to obtain the feature vector at this resolution level; and concatenating the feature vectors at all resolution levels to obtain the spatial sampling point features.

5. The method according to claim 1, characterized in that the operation of identifying the region to be repaired in the initial scene image and generating the corresponding mask image includes: determining, based on the offset of the target viewpoint relative to the input image viewpoint, the blank or deformed areas at the image edges in the initial scene image caused by the viewpoint change, and marking them as areas to be repaired; and generating a mask image with the same size as the initial scene image, wherein, in the mask image, the pixel value of the area to be repaired is set to a first value, and the pixel value of other areas is set to a second value.

6. The method according to claim 1, characterized in that the pre-trained image inpainting model is a LAMA image inpainting model based on fast Fourier convolution, and the LAMA image inpainting model includes a generator; and the operation of inputting the initial scene image and the mask image into the pre-trained image inpainting model to perform content repair on the region to be repaired and generating the final target scene image includes: superimposing the initial scene image and the mask image and inputting them into the generator; automatically inferring and filling, by the generator through a fast Fourier convolution module, the area to be repaired corresponding to the mask image based on the image content of the unoccluded area, thereby generating the repair content for the area to be repaired; and fusing the repair content of the area to be repaired with the unoccluded area in the initial scene image to obtain the final target scene image.

7. The method according to claim 6, characterized in that the fast Fourier convolution module includes a local branch, a global branch, and a fusion unit; and the operation in which the generator, through the fast Fourier convolution module, automatically infers and fills the area to be repaired corresponding to the mask image based on the image content of the unoccluded area, generating the repair content for the area to be repaired, includes: allocating all channels of the input feature map to the local branch and the global branch according to a preset ratio, wherein the subset of channels allocated to the local branch is used to extract local detail features, and the subset of channels allocated to the global branch is used to extract global context features; inputting the subset of channels allocated to the local branch to the local branch, and extracting local detail features of the image through conventional convolution operations; inputting the subset of channels allocated to the global branch to the global branch, transforming the feature map to the frequency domain by Fourier transform, extracting the global context features of the image in the frequency domain, and then transforming back to the spatial domain by inverse Fourier transform to obtain the global context features of the image; inputting the local detail features output by the local branch and the global context features output by the global branch into the fusion unit to perform feature fusion and obtain a fused feature map; and generating the repair content for the area to be repaired based on the fused feature map.

8. A storage medium, characterized in that the storage medium includes a stored program, wherein, when the program is executed, the method described in any one of claims 1 to 7 is performed by a processor.

9. A NeRF-based intelligent flying car visual simulation scene generation device, characterized in that the device comprises: an image acquisition module, used to acquire an input image of a target scene, wherein the target scene is a low-altitude flight scene or a ground driving scene of the intelligent flying car; an initial scene image generation module, used to generate an initial scene image from the target viewpoint based on the input image using an improved neural radiation field network, wherein the improved neural radiation field network uses an improved CNN pre-trained layer to extract features from the input image, uses multi-resolution hash coding to hash the spatial sampling points from the target viewpoint, inputs the extracted image features, the hash-coded spatial sampling point features, and the viewpoint direction of the spatial sampling points into a multilayer perceptron network to predict the color and volume density of the spatial sampling points, and finally generates the initial scene image through volume rendering; and a target scene image generation module, used to identify the region to be repaired in the initial scene image, generate the corresponding mask image, and input the initial scene image and the mask image into a pre-trained image inpainting model to perform content repair on the region to be repaired, generating the final target scene image as scene data for the visual simulation of the intelligent flying car.

10. A NeRF-based intelligent flying car visual simulation scene generation device, characterized in that the device comprises: a processor; and a memory, connected to the processor, for providing the processor with instructions to perform the following processing steps: acquiring an input image of a target scene, wherein the target scene is a low-altitude flight scene or a ground driving scene of the intelligent flying car; generating, using an improved neural radiation field network, an initial scene image from the target viewpoint based on the input image, wherein the improved neural radiation field network uses an improved CNN pre-trained layer to extract features from the input image, uses multi-resolution hash coding to hash the spatial sampling points from the target viewpoint, inputs the extracted image features, the hash-coded spatial sampling point features, and the viewpoint direction of the spatial sampling points into a multilayer perceptron network to predict the color and volume density of the spatial sampling points, and finally generates the initial scene image through volume rendering; and identifying the region to be repaired in the initial scene image, generating a corresponding mask image, and inputting the initial scene image and the mask image into a pre-trained image inpainting model to perform content repair on the region to be repaired, generating the final target scene image as scene data for the visual simulation of the intelligent flying car.