A monocular event camera based simultaneous localization and mapping method
By using a simultaneous localization and mapping (SLAM) method based on a monocular event camera, event tensors are generated and images are reconstructed. The camera pose is optimized by combining differentiable scene representations. This solves the pose tracking and mapping problems of monocular event cameras in extreme scenarios, achieving high-quality online SLAM suitable for applications such as robotics, drones, and augmented reality.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TSINGHUA SHENZHEN INTERNATIONAL GRADUATE SCHOOL
- Filing Date
- 2026-04-07
- Publication Date
- 2026-06-16
AI Technical Summary
Existing technologies lack a SLAM solution that can simultaneously perform highly robust pose estimation and high-quality dense map construction online using only a monocular event camera. Especially in high-speed motion, drastic lighting changes, or extremely high dynamic range scenarios, frame cameras are prone to motion blur, overexposure, or underexposure, leading to a decline in pose tracking and map construction performance.
A simultaneous localization and mapping (SLAM) method based on a monocular event camera is adopted. Event streams are received and converted into event tensors, images are reconstructed from the tensors, and the 3D scene is represented with a differentiable explicit scene representation. A predicted image is rendered from the current camera pose, the camera pose is optimized against the reconstructed image, and the 3D scene parameters are optimized by minimizing the basic mapping loss function within a sliding window. A high-precision camera trajectory and a high-quality 3D map are output.
It achieves online high-precision pose tracking and high-quality map construction under monocular event camera conditions, improves the robustness and reconstruction quality of the system in high-speed and high dynamic range scenarios, solves the problem of insufficient view coverage, and is suitable for application scenarios such as robots, drones and augmented reality.
Figure CN121982243B_ABST
Description
Technical Field
[0001] This invention relates to the field of computer graphics technology, and in particular to a method for simultaneous localization and mapping based on a monocular event camera.
Background Technology
[0002] Simultaneous Localization and Mapping (SLAM) technology is crucial for mobile platforms to achieve autonomous navigation in unknown environments. Because they achieve high-quality, real-time scene rendering, novel explicit differentiable scene representations such as 3D Gaussian Splatting (3DGS) have been incorporated into the visual SLAM framework, forming 3DGS-SLAM. Existing 3DGS-SLAM solutions are primarily based on traditional frame cameras (RGB or RGB-D cameras). However, in scenes with high-speed motion, drastic lighting changes, or extremely high dynamic range, frame cameras suffer from severe motion blur, overexposure, or underexposure. Constraints based on photometric consistency then fail, and the pose tracking and mapping performance of the SLAM system deteriorates significantly.
[0003] Event cameras are a novel type of bio-inspired visual sensor that asynchronously outputs pixel-level brightness change events. They feature microsecond-level temporal resolution, high dynamic range (HDR), low latency, and low power consumption, making them well-suited to address the aforementioned challenges. Existing technologies utilize event cameras in several ways, such as event visual odometry (e.g., EVO, ESVO) and event-to-image reconstruction methods (e.g., E2VID, FireNet). However, event visual odometry typically only outputs camera trajectories without constructing dense maps; event-to-image reconstruction methods are often disconnected from the SLAM process, resulting in noisy and artifact-ridden reconstructed images, and are mostly processed offline, requiring known or externally provided camera poses.
[0004] Therefore, existing technologies lack a SLAM solution that can simultaneously perform robust pose estimation and high-quality dense map construction online (i.e., in real-time or near real-time) solely relying on a monocular event camera. How to combine the advantages of event cameras with the powerful capabilities of 3DGS-SLAM, and solve a series of problems such as event data reconstruction quality, pose tracking robustness, and map optimization integrity, has become a pressing technical challenge in this field.
[0005] The above background information is provided only to aid in understanding the concept and technical solution of this invention. It does not necessarily belong to the prior art of this patent application. In the absence of clear evidence that the above information was disclosed on the filing date of this patent application, the above background information should not be used to evaluate the novelty and inventiveness of this application.
Summary of the Invention
[0006] To address the aforementioned technical problems, this invention proposes a method for simultaneous localization and mapping (SLAM) based on a monocular event camera. This method enables online localization and mapping using only a monocular event camera, while simultaneously outputting high-precision camera trajectories and high-quality 3D maps.
[0007] To achieve the above objectives, the present invention adopts the following technical solution:
[0008] In a first aspect, the present invention discloses a method for simultaneous localization and mapping based on a monocular event camera, comprising the following steps:
[0009] S1: Receive the event stream from the monocular event camera and perform time alignment and windowing to generate an event tensor;
[0010] S2: Reconstruct an image from the event tensor to generate a reconstructed image;
[0011] S3: Represent the 3D scene using a differentiable explicit scene representation method, and project and render the 3D scene in combination with the current camera pose to generate a predicted image;
[0012] S4: Optimize the current camera pose by minimizing the difference between the reconstructed image and the predicted image at the current viewpoint;
[0013] S5: Within a sliding window containing multiple keyframes, optimize the parameters of the 3D scene in multiple rounds by minimizing the basic mapping loss function, wherein the basic mapping loss function includes the difference between the reconstructed image of each keyframe and the predicted image at the corresponding viewpoint;
[0014] S6: Output the camera trajectory based on the optimized current camera pose, and output the 3D map based on the optimized and updated 3D scene parameters.
[0015] Preferably, step S2 includes performing generative event reconstruction on the event tensor to generate a reconstructed image; wherein, performing generative event reconstruction on the event tensor specifically includes:
[0016] S21: The event tensor is processed by a convolutional neural network to reconstruct the initial pseudo-grayscale image;
[0017] S22: The initial pseudo-grayscale image is refined using a diffusion model to generate a pseudo-grayscale image, which is the reconstructed image.
[0018] Preferably, the use of a differentiable explicit scene representation method to represent the three-dimensional scene in step S3 specifically includes: based on three-dimensional Gaussian splatting, the three-dimensional scene is parameterized by a three-dimensional Gaussian scene representation unit, wherein the three-dimensional Gaussian scene representation unit includes multiple Gaussian units, each Gaussian unit including position parameters describing spatial location, shape parameters describing geometric shape, and appearance parameters describing visual appearance.
[0019] Preferably, step S3, which involves projecting and rendering the 3D scene to generate a predicted image based on the current camera pose, specifically includes: projecting the 3D scene onto a 2D image plane based on the given current camera pose, and obtaining the predicted image through differentiable rendering.
[0020] Preferably, step S4 specifically includes: calculating photometric consistency loss and photovoltage contrast loss based on the difference between the reconstructed image and the predicted image at the current viewpoint, weighting the photometric consistency loss and photovoltage contrast loss to form a combined loss function, and then optimizing the calculation of the current camera pose by minimizing the combined loss function.
[0021] Preferably, calculating the photometric consistency loss based on the difference between the currently reconstructed image and the predicted image at the current viewpoint specifically includes: calculating the photometric consistency loss based on the photometric difference between the currently reconstructed image and the predicted image at the current viewpoint in the image intensity domain.
[0022] Preferably, calculating the photovoltage contrast loss based on the difference between the currently reconstructed image and the predicted image from the current viewpoint specifically includes:
[0023] Map the image intensities of the reconstructed image and the predicted image from the current viewpoint to the photovoltage domain, respectively.
[0024] A reference event map is constructed based on the photovoltage changes of two consecutive reconstructed images, and a simulated event map is constructed based on the photovoltage changes of two consecutive predicted images.
[0025] The photovoltage contrast loss is calculated based on the difference between the simulated event map and the reference event map in the logarithmic brightness variation domain.
[0026] Preferably, step S5 specifically includes: optimizing the updated parameters of the 3D scene by minimizing the basic mapping loss function within a sliding window containing multiple keyframes. The basic mapping loss function includes a photometric consistency loss calculated from the difference between the reconstructed image of each keyframe and the predicted image at the corresponding viewpoint, as well as an isotropic regularization term.
[0027] Preferably, after optimizing and updating the parameters of the 3D scene by minimizing the basic mapping loss function, the method further includes: generating at least one virtual camera pose near the current camera motion trajectory by pose extrapolation, and projecting and rendering the 3D scene in combination with the virtual camera pose to generate a virtual rendered image; denoising the virtual rendered image to generate a pseudo-observation image; and using the loss between the pseudo-observation image and the virtual rendered image to progressively optimize and update the parameters of the 3D scene.
[0028] In a second aspect, the present invention discloses a computer-readable storage medium storing a computer program, wherein the computer program is configured to be run by a processor to perform the simultaneous localization and mapping method based on a monocular event camera as described in the first aspect.
[0029] Compared with existing technologies, the beneficial effects of this invention are as follows: Firstly, this invention generates an event tensor based on the event stream of a monocular event camera, and then performs image reconstruction on the event tensor to generate a reconstructed image. Next, it represents the 3D scene using a differentiable explicit scene representation method, and projects and renders the 3D scene in conjunction with the current camera pose to generate a predicted image. Then, it optimizes the current camera pose by minimizing the difference between the current reconstructed image and the predicted image from the current viewpoint. Furthermore, it optimizes the parameters of the 3D scene in multiple rounds by minimizing the basic mapping loss function within a sliding window containing multiple keyframes. These technical features work together to overcome the difficulty of achieving online, robust, high-quality simultaneous localization and mapping (SLAM) with only a monocular event camera, a difficulty that arises from the sparsity, asynchronicity, and noise of event data and from potentially insufficient viewpoint coverage. Ultimately, the invention achieves purely event-driven online SLAM and generates a high-quality, renderable 3D map.
[0030] In a further embodiment, the present invention also has the following beneficial effects:
[0031] (1) This invention obtains high-quality pseudo-grayscale images by generatively reconstructing sparse asynchronous event streams, providing reliable supervision signals for subsequent optimization. Furthermore, this invention effectively improves the quality of pseudo-grayscale images reconstructed from event streams and suppresses noise and artifacts through a two-level image reconstruction architecture combining convolutional neural networks and diffusion models. Among these, generative reconstruction and dual-domain (intensity domain and event mechanism domain) constraints significantly improve the system's pose tracking robustness and reconstruction quality in high-speed, high-dynamic-range scenarios.
[0032] (2) This invention uses 3D Gaussian splatting as the scene representation, achieving efficient rendering of predicted images. This allows errors in the image plane to be directly backpropagated to pose and map parameters, supporting joint optimization. Furthermore, by combining the differentiable rendering characteristics of 3D Gaussian splatting with the imaging physics of the event camera, a joint loss function is constructed that simultaneously includes photometric consistency constraints in the image intensity domain and photovoltage contrast constraints in the logarithmic brightness change domain. This function is used to optimize the camera pose, making the pose tracking process insensitive to absolute brightness deviations and local reconstruction defects in pseudo-grayscale images. This significantly enhances the robustness of the system in extreme scenarios such as high speed and high dynamic range. In particular, the weighted combination of photometric consistency loss and photovoltage contrast loss allows the importance of the two constraints to be balanced flexibly across different scenarios, optimizing pose tracking performance. Furthermore, by constructing the simulated event map and comparing it with the reference event map in the logarithmic brightness variation domain, the imaging process of the event camera is simulated more accurately, making pose tracking more consistent with physical reality. Moreover, the photovoltage contrast loss incorporates the physical generation mechanism of the event camera, namely logarithmic brightness variation, into the optimization objective, which strengthens insensitivity to absolute brightness and enhances the robustness of the system.
[0033] (3) By using a progressive mapping optimization strategy, pseudo-observations under virtual perspective are generated by pose extrapolation and diffusion model to progressively optimize the 3D scene, which effectively solves the map degradation problem caused by insufficient perspective in low frame rate or large parallax scenes, improves the integrity and consistency of 3D reconstruction, and ensures the geometric consistency and texture integrity of the reconstructed map; among them, the virtual rendering image is refined by diffusion model to generate high-quality virtual pseudo grayscale image, which provides a more reliable supervision signal for progressive mapping optimization and further improves the quality of map construction.
[0034] (4) The system structure of the present invention is clear, the modules work together efficiently, it can run in real time on the GPU platform, it has good engineering feasibility, and it is suitable for various application scenarios such as robots, drones, and augmented reality.
[0035] Other beneficial effects of the embodiments of the present invention will be further described below.
Attached Figure Description
[0036] Figure 1 is a flowchart of a monocular SLAM method based on an event camera and 3D Gaussian splatting, as disclosed in a preferred embodiment of the present invention.
[0037] Figure 2 is a block diagram of a monocular SLAM method based on an event camera and 3D Gaussian splatting, as disclosed in a preferred embodiment of the present invention.
Detailed Implementation
[0038] The embodiments of the present invention will be described in detail below. It should be emphasized that the following description is merely exemplary and is not intended to limit the scope and application of the present invention.
[0039] It should be noted that when a component is referred to as "fixed to" or "set on" another component, it can be directly on or indirectly on that other component. When a component is referred to as "connected to" another component, it can be directly connected to or indirectly connected to that other component. Furthermore, a connection can be used for both fixing and circuit / signal connectivity.
[0040] It should be understood that the terms "length", "width", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", and "outer" indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings. They are only for the convenience of describing the embodiments of the present invention and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limitations on the present invention.
[0041] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of embodiments of the present invention, "a plurality of" means two or more, unless otherwise explicitly specified.
[0042] First, the terminology mentioned in the embodiments of this invention will be explained.
[0043] (1) SLAM (Simultaneous Localization and Mapping): refers to the technology of a mobile platform estimating its own pose and building an environmental map in an unknown environment at the same time.
[0044] (2) Event Camera: A new type of visual sensor that does not output full-frame images at a fixed frame rate, but asynchronously outputs "brightness change events" at a microsecond-level time resolution. Each event records pixel coordinates, timestamps and brightness change polarity.
[0045] (3) Event:
[0046] A single event $e_k$ is recorded as:
[0047] $e_k = (\mathbf{x}_k, t_k, p_k)$ (1)
[0048] where $\mathbf{x}_k = (x_k, y_k)$ denotes the pixel coordinates, $t_k$ the timestamp, and $p_k$ the polarity of the brightness change.
[0049] (4) 3DGS / 3D Gaussian Splatting: A scene representation that uses a set of three-dimensional Gaussians, each with position, scale, rotation, color, and opacity parameters, to represent the scene and achieves fast differentiable rendering through alpha compositing.
[0050] (5) 3DGS-SLAM: A SLAM framework that uses 3D Gaussians as a unified map representation and estimates the camera trajectory and the Gaussian map while running online.
[0051] (6) Reconstructed Intensity Image: A grayscale image reconstructed from an event stream, used to replace traditional frame camera images as supervision for 3DGS photometric optimization.
[0052] (7) Diffusion Model: A deep generative model that learns the data distribution through forward "noise addition" and reverse "denoising" processes. It can be used to refine, denoise, and add texture to the image initially reconstructed from events.
[0053] (8) Photovoltage and Logarithmic Brightness Change: In the imaging mechanism of event cameras, an event is triggered by the logarithmic brightness change at a pixel. In this invention, the pixel intensity $I$ is mapped to a "photovoltage" $V$:
[0054] $V = \alpha \log(I + \epsilon)$ (2)
[0055] and event contrast constraints are constructed in this domain. Here $V$ is the photovoltage, $\alpha$ is the scaling factor, $I$ is the pixel intensity, and $\epsilon$ is a small constant that prevents numerical problems.
[0056] (9) Progressive Gaussian Mapping: refers to performing multiple rounds of optimization on the Gaussian set within a sliding window, and continuously strengthening the constraints on Gaussian through "new perspective extrapolation + generative pseudo-observation", so that the map geometry and texture gradually converge.
[0057] As shown in Figure 1, a preferred embodiment of the present invention discloses a monocular SLAM method based on an event camera and 3D Gaussian splatting, comprising the following steps:
[0058] S1: Receive the event stream from the monocular event camera and perform time alignment and windowing to generate an event tensor;
[0059] Specifically, the system continuously receives the event stream from a monocular event camera (the input event stream in Figure 2). The event set can be represented as:
[0060] $\mathcal{E} = \{(\mathbf{x}_k, t_k, p_k)\}_{k=1}^{N}$ (3)
[0061] where $\mathbf{x}_k$ is the pixel position, $t_k$ is the timestamp, $p_k$ is the polarity of the brightness change, and $N$ is the number of events.
[0062] To correspond with 3DGS rendering frames, this embodiment of the invention segments the event stream using either a fixed time window or a fixed event-count window: (a) time window scheme: every $\Delta t$ (e.g., 20 ms, 33 ms), the events of the corresponding time period are retrieved from the circular buffer; (b) count window scheme: every $N_e$ accumulated events form an event block.
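The two segmentation schemes can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: it assumes the timestamps are already sorted and held in a NumPy array instead of a circular buffer, and the function names `split_by_time` and `split_by_count` are hypothetical.

```python
import numpy as np

def split_by_time(t, window):
    """Index arrays, one per non-empty fixed-duration window.

    Assumes `t` is sorted and uses the same time unit as `window`.
    """
    bins = np.floor((t - t[0]) / window).astype(int)
    # For sorted bins, return_index gives the first event of each window.
    _, starts = np.unique(bins, return_index=True)
    return np.split(np.arange(len(t)), starts[1:])

def split_by_count(n_events, count):
    """Index arrays of `count` events each (the last block may be shorter)."""
    return np.split(np.arange(n_events), np.arange(count, n_events, count))

# Six events over about 41 ms, timestamps in seconds.
t = np.array([0.000, 0.004, 0.019, 0.021, 0.035, 0.041])
time_blocks = split_by_time(t, window=0.020)    # 20 ms windows
count_blocks = split_by_count(len(t), count=4)  # 4 events per block
```

The time window scheme yields blocks of varying size at a steady rate, while the count window scheme yields blocks of equal size at a rate that follows scene activity.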
[0063] For each window, an event tensor $E$ is formed according to polarity and time encoding:
[0064] $E \in \mathbb{R}^{H \times W \times C}$ (4)
[0065] In the formula, $\mathbb{R}$ represents the set of real numbers, $H$ the tensor height, $W$ the tensor width, and $C$ the number of tensor channels.
[0066] Example channels include a positive-polarity count channel, a negative-polarity count channel, and a time-decay weight channel. Tensor values are normalized before being input into subsequent networks.
[0067] In other embodiments, the channel definition of the event tensor can be different, for example, adding features such as time binning, polarity integration, or the latest event time. As long as the basic idea is still "encoding the event stream into a multi-channel tensor and then inputting it into the generator network", it is within the scope of this invention.
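A three-channel event tensor of the kind described above might be built as in the sketch below. The channel order, the exponential form of the time-decay weight, and the per-channel max normalization are assumptions chosen for illustration; the source only names the channel types and states that values are normalized.

```python
import numpy as np

def events_to_tensor(x, y, t, p, height, width):
    """Encode one event window as an (H, W, 3) array.

    Assumed channel layout: 0 = positive-event count,
    1 = negative-event count, 2 = time-decay weight of the latest event.
    """
    pos_count = np.zeros((height, width))
    neg_count = np.zeros((height, width))
    recency = np.zeros((height, width))
    positive = p > 0
    # ufunc.at accumulates correctly when several events hit the same pixel.
    np.add.at(pos_count, (y[positive], x[positive]), 1.0)
    np.add.at(neg_count, (y[~positive], x[~positive]), 1.0)
    # Assumed decay: exp(-(t_end - t) / window_length), newer events weigh more.
    span = max(float(t.max() - t.min()), 1e-9)
    decay = np.exp(-(t.max() - t) / span)
    np.maximum.at(recency, (y, x), decay)
    tensor = np.stack([pos_count, neg_count, recency], axis=-1)
    # Per-channel max normalization to [0, 1]; empty channels stay at zero.
    peak = tensor.reshape(-1, 3).max(axis=0)
    return tensor / np.maximum(peak, 1e-9)

# Four events on a 2 x 3 sensor: x = column, y = row, p = polarity.
x = np.array([1, 1, 2, 0])
y = np.array([0, 0, 1, 1])
t = np.array([0.000, 0.010, 0.015, 0.020])
p = np.array([1, 1, -1, 1])
tensor = events_to_tensor(x, y, t, p, height=2, width=3)
```

Using `np.add.at` instead of plain fancy-index assignment matters here, because repeated pixel indices within one window would otherwise be counted only once.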
[0068] S2: Reconstruct the image from the event tensor to generate a reconstructed image;
[0069] In some embodiments, the event tensor is input into the generative event reconstruction module in this step to generate a pseudo-grayscale image, which is the reconstructed image. The generative event reconstruction module consists of two levels: a convolutional neural network and a diffusion model. The convolutional neural network can perform initial reconstruction, while the diffusion model can further refine the image.
[0070] Step S2 specifically includes:
[0071] S21: The event tensor is processed by a convolutional neural network to reconstruct the initial pseudo-grayscale image;
[0072] Specifically, the event tensor $E$ is input to a convolutional neural network with an encoder-decoder structure, which produces an initial pseudo-grayscale image $\hat{I}^{0}$:
[0073] $\hat{I}^{0} = f_{\theta}(E)$ (5)
[0074] where $f_{\theta}$ is the CNN (convolutional neural network) with parameters $\theta$. The encoder extracts the spatiotemporal features of the events, and the decoder uses skip connections to reconstruct the grayscale image.
[0075] S22: Use a diffusion model to refine the initial pseudo-grayscale image and generate a pseudo-grayscale image.
[0076] Specifically, to improve image quality, embodiments of the present invention introduce a conditional diffusion model. During the training phase, noise is progressively added to real or high-quality reference images, and the model learns the reverse denoising process; during the inference phase, the initial pseudo-grayscale image $\hat{I}^{0}$ is used as a conditional input, and a denoised and refined pseudo-grayscale image $\hat{I}$ is obtained in a small number of reverse diffusion steps:
[0077] $\hat{I} = g_{\phi}(\hat{I}^{0}, s)$ (6)
[0078] where $g_{\phi}$ denotes the diffusion model and $s$ indicates the diffusion time step (which can be a small number of steps). The event tensor $E$ can also be encoded as a conditional vector and provided to the diffusion model to preserve event structure information.
[0079] As shown in Figure 2, this step utilizes a two-level structure of "CNN coarse reconstruction + diffusion model refinement" to effectively suppress event noise and artifacts, complete missing textures, and provide high-quality supervision for backend 3DGS photometric optimization.
[0080] In other embodiments, in addition to the diffusion model, other types of generative models (such as normalizing flows, generative adversarial networks (GANs), and energy-based models) may be used. As long as they can denoise and refine the image initially reconstructed by the CNN and improve the quality of the event reconstruction image, they are considered equivalent alternatives to the present invention.
[0081] S3: The 3D scene is represented by a differentiable explicit scene representation method, and the 3D scene is projected and rendered to generate a predicted image in combination with the current camera pose;
[0082] Furthermore, this step uses 3D Gaussian splatting to represent the scene as a 3D Gaussian scene, and projects and renders the 3D Gaussian scene based on the current camera pose to generate a predicted image; step S3 specifically includes:
[0083] S31: Based on three-dimensional Gaussian splatting, the three-dimensional Gaussian scene is parametrically represented by a three-dimensional Gaussian scene representation unit. The three-dimensional Gaussian scene representation unit includes multiple Gaussian units. Each Gaussian unit includes position parameters describing spatial location, shape parameters describing geometric shape, and appearance parameters describing visual appearance.
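[0084] Specifically, this embodiment of the invention uses 3DGS as a unified map representation. The Gaussian set $\mathcal{G}$ is recorded as:
[0085] $\mathcal{G} = \{G_i\}_{i=1}^{M}$ (7)
[0086] where each Gaussian $G_i$ includes:
[0087] a center $\boldsymbol{\mu}_i$;
[0088] a scale vector $\mathbf{s}_i$ and a rotation matrix $R_i$, which together determine the 3D covariance $\Sigma_i = R_i S_i S_i^{\top} R_i^{\top}$ with $S_i = \mathrm{diag}(\mathbf{s}_i)$;
[0089] an opacity $o_i$;
[0090] color parameters $\mathbf{c}_i$ (such as spherical harmonic coefficients).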
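The per-Gaussian parameters listed above can be sketched as a small data structure. The class name `Gaussian3D`, the field names, and the use of a single grayscale value for color are assumptions for illustration; the covariance is built with the usual 3DGS factorization from a rotation matrix and a diagonal scale matrix, which keeps it symmetric and positive semi-definite by construction.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian3D:
    """One Gaussian unit (hypothetical field names)."""
    center: np.ndarray    # (3,) position in world coordinates
    scale: np.ndarray     # (3,) per-axis standard deviations
    rotation: np.ndarray  # (3, 3) rotation matrix
    opacity: float        # in [0, 1]
    color: float          # grayscale value; real systems may store SH coefficients

    def covariance(self):
        """3D covariance R S S^T R^T with S = diag(scale)."""
        S = np.diag(self.scale)
        return self.rotation @ S @ S.T @ self.rotation.T

# A Gaussian stretched along its local x axis, then rotated 90 degrees about z.
rot_z_90 = np.array([[0.0, -1.0, 0.0],
                     [1.0,  0.0, 0.0],
                     [0.0,  0.0, 1.0]])
g = Gaussian3D(center=np.zeros(3), scale=np.array([2.0, 1.0, 0.5]),
               rotation=rot_z_90, opacity=0.8, color=0.6)
cov = g.covariance()  # the long axis now lies along world y
```

Optimizing scale and rotation separately, instead of the covariance entries directly, is what lets gradient updates stay inside the set of valid covariance matrices.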
[0091] In other embodiments, 3D Gaussian splatting can be replaced with other explicit differentiable scene representations, such as Gaussian blended surfaces, voxel-Gaussian blended representations, and hybrid voxel-point cloud representations. As long as the map and pose are still optimized through differentiable rendering and photometric / event constraints, they should be considered reasonable variations under the concept of this invention.
[0092] S32: Project the 3D Gaussian scene onto the 2D image plane based on the given current camera pose, and obtain the predicted image through differentiable rendering.
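[0093] Specifically, given the current camera pose $T$, projecting the 3D Gaussian scene onto the image plane yields the 2D covariance $\Sigma_i'$:
[0094] $\Sigma_i' = J R_c \Sigma_i R_c^{\top} J^{\top}$ (8)
[0095] where $J$ is the Jacobian matrix of the projection, $R_c$ is the rotation part of the camera pose $T$, and $\Sigma_i$ is the 3D covariance of the Gaussian.
[0096] At pixel $\mathbf{u}$, the coverage intensity $a_i(\mathbf{u})$ of each Gaussian and its color $c_i$ are combined by sequential alpha compositing, with the Gaussians ordered front to back, to obtain the final pixel value $\tilde{I}(\mathbf{u})$:
[0097] $\tilde{I}(\mathbf{u}) = \sum_{i} c_i \, a_i(\mathbf{u}) \prod_{j<i} \bigl(1 - a_j(\mathbf{u})\bigr)$ (9)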
[0098] The rendering process is differentiable, which allows photometric errors on the image plane to be backpropagated to the Gaussian parameters and camera pose, enabling joint optimization.
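The projection and compositing steps of S32 can be sketched in NumPy as below. This follows the standard 3DGS formulation, in which the 2D covariance is the 3D covariance transformed by the camera rotation and the projection Jacobian; the pinhole Jacobian, the focal lengths, and the function names are assumptions for illustration. A real renderer runs these operations per tile on the GPU and keeps them differentiable.

```python
import numpy as np

def perspective_jacobian(point_cam, fx, fy):
    """Jacobian of the pinhole projection at a point in camera coordinates."""
    x, y, z = point_cam
    return np.array([[fx / z, 0.0, -fx * x / z**2],
                     [0.0, fy / z, -fy * y / z**2]])

def project_covariance(cov3d, cam_rotation, jacobian):
    """2D image-plane covariance: J R Sigma R^T J^T."""
    return jacobian @ cam_rotation @ cov3d @ cam_rotation.T @ jacobian.T

def composite(alphas, colors):
    """Front-to-back alpha compositing for one pixel.

    `alphas` and `colors` must be sorted from nearest to farthest Gaussian.
    """
    transmittance = 1.0
    value = 0.0
    for a, c in zip(alphas, colors):
        value += c * a * transmittance
        transmittance *= 1.0 - a
    return value

# A unit-covariance Gaussian 2 m in front of an unrotated camera.
J = perspective_jacobian(np.array([0.0, 0.0, 2.0]), fx=100.0, fy=100.0)
cov2d = project_covariance(np.eye(3), np.eye(3), J)

# Two Gaussians cover the pixel: a half-transparent dark one in front
# of an opaque bright one.
pixel = composite(alphas=[0.5, 1.0], colors=[0.2, 0.8])
```

Because every operation here is a product or sum of smooth functions, an image-plane error can be propagated back to both the Gaussian parameters and the camera pose.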
[0099] S4: Optimize the calculation of the current camera pose by minimizing the difference between the current reconstructed image and the predicted image at the current viewpoint;
[0100] Step S4 specifically includes: calculating the photometric consistency loss and photovoltage contrast loss based on the difference between the current pseudo-grayscale image and the predicted image at the current viewpoint, weighting the photometric consistency loss and photovoltage contrast loss to form a combined loss function, and then optimizing the calculation of the current camera pose by minimizing the combined loss function.
[0101] As shown in Figure 2, in the tracking phase, this embodiment of the invention uses both the photometric consistency constraint and the photovoltage contrast constraint to optimize the pose of the current frame.
[0102] Step S4 can specifically include:
[0103] S41: Calculate the photometric consistency loss based on the photometric difference between the current pseudo-grayscale image and the predicted image under the current viewpoint in the image intensity domain.
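[0104] Specifically, for the pseudo-grayscale image $\hat{I}_t$ of the current frame and the image $\tilde{I}_t$ rendered at pose $T_t$, the photometric consistency loss $\mathcal{L}_{\mathrm{photo}}$ is defined on the set of valid pixels $\Omega$:
[0105] $\mathcal{L}_{\mathrm{photo}} = \frac{1}{|\Omega|} \sum_{\mathbf{u} \in \Omega} \bigl| \tilde{I}_t(\mathbf{u}) - \hat{I}_t(\mathbf{u}) \bigr|$ (10)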
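A photometric consistency loss over a valid-pixel set can be sketched as follows. The mean absolute (L1) difference is an assumed form, and the name `photometric_loss` is hypothetical; the source only states that the loss measures the intensity-domain difference between the pseudo-grayscale image and the rendered image on valid pixels.

```python
import numpy as np

def photometric_loss(reconstructed, rendered, valid_mask=None):
    """Mean absolute intensity difference over valid pixels (assumed L1 form)."""
    if valid_mask is None:
        valid_mask = np.ones_like(reconstructed, dtype=bool)
    diff = np.abs(rendered - reconstructed)[valid_mask]
    return float(diff.mean()) if diff.size else 0.0

reconstructed = np.array([[0.2, 0.4],
                          [0.6, 0.8]])
rendered = reconstructed + 0.1          # uniformly too bright
loss_all = photometric_loss(reconstructed, rendered)

# Restrict the loss to the top-left pixel only.
mask = np.array([[True, False],
                 [False, False]])
rendered_one = reconstructed.copy()
rendered_one[0, 0] = 0.5
loss_masked = photometric_loss(reconstructed, rendered_one, mask)
```

Note that a uniform brightness offset produces a nonzero loss here even though the geometry is perfectly aligned, which is the sensitivity to absolute brightness that the event-domain constraint is meant to offset.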
[0106] To reflect the response of the event camera to "logarithmic brightness change", this invention constructs a simulated event map and a reference event map in the logarithmic brightness domain.
[0107] S42: Calculate the photovoltage contrast loss based on the difference between the current pseudo-grayscale image and the predicted image at the current viewpoint. This step specifically includes:
[0108] S421: Map the image intensities of the current pseudo-grayscale image and the predicted image at the current viewpoint to the photovoltage domain, respectively;
[0109] Specifically, the pixel intensity $I$ is mapped to a "photovoltage" $V$:
[0110] $V = \alpha \log(I + \epsilon)$ (11)
[0111] where $\alpha$ is the scaling factor and $\epsilon$ is a small constant that prevents numerical problems.
[0112] S422: Construct a reference event map based on the photovoltage changes of two consecutive pseudo-grayscale images, and construct a simulated event map based on the photovoltage changes of two consecutive predicted images;
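[0113] Specifically, for the predicted images $\tilde{I}_{t-1}$ and $\tilde{I}_t$ rendered in two consecutive frames, the simulated event map $D_t^{\mathrm{sim}}$ is constructed as:
[0114] $D_t^{\mathrm{sim}} = V(\tilde{I}_t) - V(\tilde{I}_{t-1})$ (12)
[0115] where $V(\cdot)$ applies the photovoltage mapping pixel by pixel. Similarly, the reference event map $D_t^{\mathrm{ref}} = V(\hat{I}_t) - V(\hat{I}_{t-1})$ is constructed from two consecutive pseudo-grayscale images $\hat{I}_{t-1}$ and $\hat{I}_t$.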
[0116] S423: The photovoltage contrast loss is calculated based on the difference between the simulated event map and the reference event map in the logarithmic brightness variation domain.
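[0117] Specifically, a mask $M_t$ is constructed according to the reference event map $D_t^{\mathrm{ref}}$ so that greater weight is given to regions with significant logarithmic brightness changes, and the photovoltage contrast loss $\mathcal{L}_{\mathrm{pv}}$ is defined as follows:
[0118] $\mathcal{L}_{\mathrm{pv}} = \frac{1}{|\Omega|} \sum_{\mathbf{u} \in \Omega} M_t(\mathbf{u}) \bigl| D_t^{\mathrm{sim}}(\mathbf{u}) - D_t^{\mathrm{ref}}(\mathbf{u}) \bigr|$ (13), where $D_t^{\mathrm{sim}}$ is the simulated event map and $\Omega$ is the set of valid pixels.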
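Steps S421 to S423 can be sketched end to end as below. The logarithmic photovoltage mapping with a scale factor and a small constant follows the description; the specific mask (one plus the normalized magnitude of the reference change, controlled by a hypothetical `boost` parameter) and the default values of `scale` and `eps` are assumptions for illustration.

```python
import numpy as np

def photovoltage(intensity, scale=1.0, eps=1e-3):
    """Map pixel intensity to the logarithmic 'photovoltage' domain."""
    return scale * np.log(intensity + eps)

def event_map(prev_image, curr_image):
    """Photovoltage change between two consecutive images."""
    return photovoltage(curr_image) - photovoltage(prev_image)

def photovoltage_contrast_loss(rec_prev, rec_curr, ren_prev, ren_curr, boost=1.0):
    """Masked L1 difference between simulated and reference event maps."""
    reference = event_map(rec_prev, rec_curr)   # from pseudo-grayscale images
    simulated = event_map(ren_prev, ren_curr)   # from rendered images
    # Assumed mask: weight 1 everywhere, up to 1 + boost where the
    # reference log-brightness change is largest.
    magnitude = np.abs(reference)
    mask = 1.0 + boost * magnitude / (magnitude.max() + 1e-9)
    return float(np.mean(mask * np.abs(simulated - reference)))

rec_prev = np.array([[0.2, 0.5]])
rec_curr = np.array([[0.4, 0.5]])   # the left pixel brightens, the right is static

same = photovoltage_contrast_loss(rec_prev, rec_curr, rec_prev, rec_curr)
# Renders at half the brightness but with the same relative change.
dimmed = photovoltage_contrast_loss(rec_prev, rec_curr,
                                    0.5 * rec_prev, 0.5 * rec_curr)
# Renders whose brightness change runs in the opposite direction.
reverse = photovoltage_contrast_loss(rec_prev, rec_curr, rec_curr, rec_prev)
```

In this toy case the half-brightness render is penalized only slightly, while the reversed change is penalized heavily, which illustrates why the loss tolerates absolute brightness errors in the pseudo-grayscale images.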
[0119] In other embodiments, the photovoltage contrast loss can take other forms related to the event mechanism, such as: directly constructing and aligning binary events by thresholding the logarithmic brightness difference; replacing the L1 loss with mutual information or correlation measures based on logarithmic brightness; or introducing robust loss functions (Huber, Charbonnier) or frequency-domain event contrast constraints. As long as the basic idea of jointly aligning the rendered image and the pseudo-observations in the "intensity domain + event domain" dual space remains unchanged, such forms fall within the scope of protection of this invention.
[0120] In regions lacking texture, sparse event streams lead to incomplete reconstructed images. Introducing the photovoltage contrast loss partially mitigates this problem, because this loss depends only on brightness variations and not on absolute texture.
[0121] S43: The photometric consistency loss and photovoltage contrast loss are weighted and combined to form a combined loss function. The current camera pose is then optimized by minimizing the combined loss function.
[0122] Specifically, ultimately tracking losses The combined loss function obtained by weighting the two is:
[0123] (14)
[0124] where the weighting coefficient controls the strength of the event-domain constraint.
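The symbols of the combined loss did not survive in this text. One form consistent with the description, with symbol names chosen here purely for illustration (tracking loss, photometric consistency loss, photovoltage contrast loss, and an event-domain weight), would be:

```latex
\mathcal{L}_{\mathrm{track}}
  = \mathcal{L}_{\mathrm{photo}}
  + \lambda_{e}\,\mathcal{L}_{\mathrm{pv}},
\qquad \lambda_{e} \ge 0
```

Setting the weight to zero would reduce tracking to pure intensity-domain alignment; other weightings (for example a convex combination) would also match the wording of S43.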
[0125] By applying gradient descent or Gauss-Newton iteration to the pose parameters, the rendered image is aligned with the pseudo-observation in both the intensity domain and the logarithmic brightness domain, thereby significantly improving tracking robustness in high-dynamic scenes.
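The dual-domain alignment can be illustrated with a deliberately simplified toy problem. In the sketch below the "pose" is a single scalar shift, the "renderer" is a 1-D bump, the residuals are squared (least squares) rather than L1, and the Jacobian is obtained by finite differences; none of this is taken from the source, which optimizes a full camera pose against a rendered 3D scene.

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 200)

def render(shift):
    """Toy renderer: a strictly positive 1-D bump displaced by the scalar pose."""
    return 0.2 + np.exp(-0.5 * (x - shift) ** 2)

def residuals(shift, observed, lam=0.5):
    """Stack intensity-domain and log-brightness-domain residuals."""
    pred = render(shift)
    r_int = pred - observed
    r_log = np.sqrt(lam) * (np.log(pred) - np.log(observed))
    return np.concatenate([r_int, r_log])

def gauss_newton_shift(observed, shift0=0.0, iters=20, h=1e-5):
    """Gauss-Newton on one parameter with a central-difference Jacobian."""
    s = shift0
    for _ in range(iters):
        r = residuals(s, observed)
        J = (residuals(s + h, observed) - residuals(s - h, observed)) / (2 * h)
        s -= (J @ r) / (J @ J)
    return s

observed = render(0.7)               # pseudo-observation at the true shift
estimate = gauss_newton_shift(observed)
```

Starting from a shift of 0, the iteration recovers the true shift of 0.7. The weight `lam` plays the role of the event-domain weight in the combined loss.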
[0126] S5: Within a sliding window containing multiple keyframes, the parameters of the 3D scene are optimized in multiple rounds by minimizing the basic mapping loss function, which includes the difference between the reconstructed image of each keyframe and the predicted image at the corresponding viewpoint.
[0127] In this embodiment, the Gaussian set is optimized in multiple rounds within a sliding window, and a progressive strategy of "new perspective extrapolation + diffusion refinement of pseudo-observations" is introduced for low frame rate / large parallax scenarios.
[0128] Step S5 specifically includes:
[0129] S51: Within a sliding window containing multiple keyframes, the parameters of the updated 3D Gaussian scene are optimized by minimizing the basic mapping loss function, which includes the photometric consistency loss calculated from the difference between the pseudo-grayscale image of each keyframe and the predicted image at the corresponding viewpoint, as well as an isotropic regularization term.
[0130] Specifically, referring to Figure 2, within the keyframe window, the photometric consistency loss is applied to each keyframe, and an isotropic regularization term is added to suppress Gaussian elongation along the ray. The overall optimization objective (i.e., the basic mapping loss function) is:
[0131] (15)
[0132] where the objective is expressed in terms of the poses of the keyframes within the window and the parameters of the Gaussian set.
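A windowed mapping objective of this shape can be sketched as below. The specific regularizer (penalizing each Gaussian's per-axis scale for deviating from that Gaussian's mean scale) is an assumption modeled on a common choice in 3DGS-based SLAM systems, and `lam_iso` is a hypothetical weight; the source's own formula is not reproduced here.

```python
import numpy as np

def isotropic_reg(scales):
    """Mean over Gaussians of sum_k |s_k - mean(s)|, for scales of shape (N, 3)."""
    scales = np.asarray(scales, dtype=float)
    mean = scales.mean(axis=1, keepdims=True)
    return float(np.abs(scales - mean).sum(axis=1).mean())

def mapping_loss(rendered, pseudo_gray, scales, lam_iso=10.0):
    """Sum of per-keyframe L1 photometric losses plus the isotropic term."""
    photo = sum(float(np.abs(r - g).mean()) for r, g in zip(rendered, pseudo_gray))
    return photo + lam_iso * isotropic_reg(scales)

round_scales = np.ones((4, 3))                 # perfectly isotropic Gaussians
needle_scales = np.array([[1.0, 1.0, 4.0]])    # one elongated Gaussian
reg_needle = isotropic_reg(needle_scales)      # |1-2| + |1-2| + |4-2|
```

An elongated Gaussian is penalized while an isotropic one is not, which is the mechanism that discourages stretching along viewing rays.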
[0133] In other embodiments, the basic optimization can employ Gauss-Newton, Levenberg-Marquardt (LM), Adam, or other first-order / second-order optimizers; hyperparameters such as the sliding window size and the keyframe selection criteria can also be adjusted. These are modifications at the engineering implementation level and do not change the overall concept of this invention: "generative event reconstruction + 3DGS + photometric / event dual-domain constraints + progressive mapping optimization".
[0134] S52: Generate at least one virtual camera pose near the current camera motion trajectory by pose extrapolation, and combine the virtual camera pose to project and render the 3D Gaussian scene to generate a virtual rendering image; denoise the virtual rendering image to generate a pseudo-observation image; use the loss between the pseudo-observation image and the virtual rendering image to perform multiple rounds of progressive optimization and update of the parameters of the 3D Gaussian scene.
[0135] Specifically, to address insufficient view coverage, embodiments of the present invention sample new virtual camera viewpoints near the current frame through pose extrapolation:
[0136] (16)
[0137] For example, linear extrapolation of the poses of the two most recent frames is performed in the SE(3) Lie algebra space, with a coefficient controlling the extrapolation step size. The 3DGS scene is then rendered from the virtual camera viewpoint to obtain a virtual rendered image, which is input into a diffusion model for a few denoising steps to obtain a pseudo-observation image:
[0138] (17)
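The linear extrapolation in the SE(3) Lie algebra described above can be sketched with closed-form exponential and logarithm maps. The conventions are assumptions: poses are 4x4 homogeneous matrices, the relative motion is applied by right-multiplication, twists are ordered (translation, rotation), and `alpha` is a hypothetical name for the step-size coefficient.

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix of a 3-vector."""
    return np.array([[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]])

def _coeffs(theta):
    """sin(t)/t, (1-cos t)/t^2, (t-sin t)/t^3 with small-angle limits."""
    if theta < 1e-8:
        return 1.0, 0.5, 1.0 / 6.0
    return (np.sin(theta) / theta,
            (1.0 - np.cos(theta)) / theta ** 2,
            (theta - np.sin(theta)) / theta ** 3)

def se3_exp(xi):
    """Twist (rho, phi) in R^6 -> 4x4 pose."""
    rho, phi = np.asarray(xi[:3], float), np.asarray(xi[3:], float)
    A, B, C = _coeffs(np.linalg.norm(phi))
    K = hat(phi)
    T = np.eye(4)
    T[:3, :3] = np.eye(3) + A * K + B * K @ K
    T[:3, 3] = (np.eye(3) + B * K + C * K @ K) @ rho
    return T

def se3_log(T):
    """4x4 pose -> twist; valid for rotation angles below pi."""
    R, t = T[:3, :3], T[:3, 3]
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    A, B, _ = _coeffs(theta)
    W = (R - R.T) / (2.0 * A)
    phi = np.array([W[2, 1], W[0, 2], W[1, 0]])
    K = hat(phi)
    D = 1.0 / 12.0 if theta < 1e-8 else (1.0 - A / (2.0 * B)) / theta ** 2
    rho = (np.eye(3) - 0.5 * K + D * K @ K) @ t
    return np.concatenate([rho, phi])

def extrapolate_pose(T_prev, T_curr, alpha=1.0):
    """Continue the last relative motion by a fraction alpha."""
    delta = se3_log(np.linalg.inv(T_prev) @ T_curr)
    return T_curr @ se3_exp(alpha * delta)

xi = np.array([0.1, -0.05, 0.2, 0.02, 0.03, -0.01])
T_virtual = extrapolate_pose(np.eye(4), se3_exp(xi), alpha=1.0)
```

For constant motion, extrapolating with `alpha=1` reproduces the pose that one further identical step would reach; smaller values of `alpha` keep the virtual viewpoint closer to the current frame.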
[0139] The pseudo-observation image is taken as the "target image" and compared with the virtual rendered image to construct the mapping loss:
[0140] (18)
[0141] Referring to Figure 2, the gradient is then backpropagated to the Gaussian parameters to achieve multiple rounds of incremental updates. The Gaussian parameters are thus pre-trained in directions where real observations are sparse or parallax is large, thereby improving robustness to subsequent real observations.
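The pseudo-observation loss can be sketched as follows. The diffusion model is replaced here by a simple box filter purely as a stand-in denoiser so that the example runs without any trained model; the L1 form of the loss and all function names are assumptions.

```python
import numpy as np

def box_denoise(img, k=3):
    """Stand-in denoiser (k x k box filter); NOT a diffusion model."""
    img = np.asarray(img, dtype=float)
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def pseudo_observation_loss(virtual_render, denoiser=box_denoise):
    """L1 loss between the virtual render and its denoised version.

    The denoised image acts as a fixed target: in a differentiable
    pipeline, gradients would flow only through virtual_render.
    """
    virtual_render = np.asarray(virtual_render, dtype=float)
    target = denoiser(virtual_render)
    return float(np.abs(virtual_render - target).mean())

rng = np.random.default_rng(0)
clean = np.full((8, 8), 0.5)
noisy = clean + 0.1 * rng.standard_normal((8, 8))
loss_clean = pseudo_observation_loss(clean)
loss_noisy = pseudo_observation_loss(noisy)
```

A render that the denoiser leaves unchanged incurs no loss, while a render with artifacts is pulled toward its cleaned version.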
[0142] In other embodiments, the extrapolation of new viewpoints can employ different interpolation / sampling strategies, such as: using spherical linear interpolation (Slerp) or multi-frame trajectory fitting on SE(3); sampling multiple virtual viewpoints to form a small virtual window; or directly using a learned viewpoint proposal network to select pseudo-viewpoints based on the current geometric uncertainty. The pseudo-observation images can also be generated by other generative models. As long as the core function is to provide additional photometric supervision for insufficiently observed directions, all such variants are equivalent designs of this invention.
[0143] S6: Output the camera trajectory based on the optimized current camera pose, and output the 3D map based on the optimized and updated 3D scene parameters.
[0144] In a preferred embodiment of the present invention, online 3DGS-SLAM is achieved using only an event camera through "generative event reconstruction + 3D Gaussian splatting + photometric / event dual-domain tracking + progressive Gaussian mapping optimization".
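As an illustration of the Slerp alternative, the sketch below interpolates between two unit quaternions and, for a parameter `t` greater than 1, extrapolates beyond the second one. The quaternion order (w, x, y, z) is an assumption, and only the rotational part of the pose is handled.

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation of unit quaternions (w, x, y, z).

    t in [0, 1] interpolates; t > 1 extrapolates past q1.
    """
    q0 = np.asarray(q0, dtype=float) / np.linalg.norm(q0)
    q1 = np.asarray(q1, dtype=float) / np.linalg.norm(q1)
    dot = float(np.dot(q0, q1))
    if dot < 0.0:                 # take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:              # nearly parallel: fall back to normalized lerp
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    q = (np.sin((1.0 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)
    return q / np.linalg.norm(q)

q_id = np.array([1.0, 0.0, 0.0, 0.0])                      # identity rotation
q_90 = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])  # 90 deg about z
q_mid = slerp(q_id, q_90, 0.5)    # 45 deg about z
q_ext = slerp(q_id, q_90, 2.0)    # 180 deg about z
```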
[0145] Another preferred embodiment of the present invention discloses a simultaneous localization and mapping (SLAM) system based on a monocular event camera. This system can be implemented on a GPU-accelerated platform and corresponds to the aforementioned monocular SLAM method, specifically including:
[0146] The event data preprocessing module is used to receive the event stream from the monocular event camera and perform time alignment and windowing to generate event tensors;
[0147] The image reconstruction module, connected to the event data preprocessing module, is used to reconstruct the event tensor to generate a reconstructed image; further, it is used to perform generative event reconstruction on the event tensor to generate a pseudo-grayscale image, which is the reconstructed image; wherein, this module includes an "event accumulation and CNN reconstruction" submodule and a "diffusion model refinement" submodule, and outputs a stable pseudo-grayscale image sequence.
[0148] The scene representation and rendering module is used to store the 3D scene represented by a differentiable explicit scene representation method, and to project and render the 3D scene to generate a predicted image in combination with the current camera pose; further, it is used to store the 3D Gaussian scene represented by 3D Gaussian splatting, and to project and render the 3D Gaussian scene to generate a predicted image in combination with the current camera pose; this module performs the initialization, insertion, merging and pruning of Gaussian sets to maintain a renderable 3D scene representation.
[0149] The pose tracking module, connected to the image reconstruction module and the scene representation and rendering module, is used to optimize the calculation of the current camera pose by minimizing the difference between the current reconstructed image and the predicted image at the current viewpoint. The module renders an image from the current Gaussian map, calculates the photometric consistency loss and photovoltage contrast loss, and optimizes the current camera pose.
[0150] The progressive mapping optimization module, connected to the image reconstruction module, the scene representation and rendering module, and the pose tracking module, is used to optimize the parameters of the 3D scene in multiple rounds by minimizing the basic mapping loss function within a sliding window containing multiple keyframes. The basic mapping loss function includes the difference between the reconstructed image and the predicted image at the corresponding viewpoint for each keyframe. This module performs keyframe sliding window management, new viewpoint extrapolation and pseudo-observation generation, and Gaussian parameter optimization under isotropic regularization constraints.
[0151] The results output module, connected to the pose tracking module and the progressive mapping optimization module, is used to output a camera trajectory based on the optimized current camera pose and a 3D map based on the optimized and updated parameters of the 3D scene. This module outputs the camera trajectory and a 3D Gaussian map available for rendering / visualization online, providing an interface for upper-level navigation, AR, or 3D perception tasks.
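The data flow between the modules listed above can be summarized in a small skeleton. The class and field names are hypothetical, each module is reduced to a plain callable, and every frame is treated as a keyframe for brevity, which is a simplification not stated in the source.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List

@dataclass
class EventSlamPipeline:
    preprocess: Callable[[Any], Any]        # event stream -> event tensor
    reconstruct: Callable[[Any], Any]       # event tensor -> pseudo-grayscale image
    track: Callable[[Any, Any, Any], Any]   # (image, scene, previous pose) -> pose
    map_update: Callable[[Any, list], Any]  # (scene, keyframe window) -> scene
    scene: Any = None
    pose: Any = None
    window_size: int = 5
    keyframes: List[tuple] = field(default_factory=list)
    trajectory: List[Any] = field(default_factory=list)

    def step(self, events):
        """Process one event window and return the current pose and map."""
        tensor = self.preprocess(events)
        image = self.reconstruct(tensor)
        self.pose = self.track(image, self.scene, self.pose)
        # Sliding window: keep only the most recent keyframes.
        self.keyframes = (self.keyframes + [(image, self.pose)])[-self.window_size:]
        self.scene = self.map_update(self.scene, self.keyframes)
        self.trajectory.append(self.pose)
        return self.pose, self.scene

# Trivial stubs, used only to exercise the data flow.
pipeline = EventSlamPipeline(
    preprocess=len,
    reconstruct=lambda tensor: tensor * 2,
    track=lambda image, scene, pose: (pose or 0) + 1,
    map_update=lambda scene, keyframes: len(keyframes),
)
for _ in range(7):
    pipeline.step([1, 2, 3])
```

After seven steps with a window size of five, the trajectory holds seven poses while the mapping stage only ever sees the five most recent keyframes.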
[0152] The monocular SLAM method and system based on an event camera and 3D Gaussian splatting proposed in the preferred embodiment of the present invention include the following key technical points:
[0153] (1) Generative event reconstruction framework of “CNN initial reconstruction + diffusion model refinement”: The event stream is first encoded into an event tensor, and then the initial pseudo grayscale image is obtained through CNN. Subsequently, the conditional diffusion model is introduced for refinement and denoising, thereby significantly improving the image quality of the event reconstruction and providing a stable and reliable supervision signal for 3DGS photometric optimization.
[0154] (2) Explicit scene representation and differentiable rendering based on 3D Gaussian splatting: a 3D Gaussian set is used as a unified map representation, and images are rendered by differentiable alpha compositing, so that photometric errors on the image plane can be propagated directly back to the 3D Gaussian parameters and the camera pose. This is a key foundation for realizing online 3DGS-SLAM.
[0155] (3) Dual-domain pose tracking loss combining photometric consistency and photovoltage contrast: a photometric loss is used in the intensity domain and a photovoltage contrast loss is used in the logarithmic brightness variation domain, and the two are combined by weighting. This significantly improves the robustness of pose estimation in the presence of noise, artifacts, and brightness mismatch.
[0156] (4) Progressive 3D Gaussian mapping optimization for event scenarios: new viewpoints are extrapolated in pose space, pseudo-observations are generated using a diffusion model, and these pseudo-observations are then used as supervision for multiple rounds of progressive updates of the Gaussian set. This effectively alleviates the problem of insufficient viewpoint coverage in low frame rate / large parallax scenes, reduces artifacts such as Gaussian elongation along rays, and improves the quality of 3D reconstruction.
[0157] (5) The module collaboration mechanism of the overall online 3DGS-SLAM system: including the data flow, interface definition and running order between modules such as event data acquisition and encoding, generative reconstruction, 3DGS map management, dual-domain tracking and progressive mapping optimization, so that the system can run at real-time frequency on the actual platform and realize the integrated function of "pure event input → camera trajectory + 3D Gaussian map output".
[0158] The monocular SLAM method and system based on an event camera and 3D Gaussian splatting proposed in the preferred embodiment of the present invention have the following advantages:
[0159] (1) Purely event-driven online 3DGS-SLAM: Trajectory estimation and dense 3D Gaussian reconstruction can be completed using only a monocular event camera. The method does not rely on traditional frame cameras or inertial sensors, which extends the application of 3DGS-SLAM to extreme conditions such as high speed and strong lighting contrast, and overcomes the problems of poor event-reconstructed image quality and unstable tracking in high-speed and high-dynamic scenes.
[0160] (2) Generative event reconstruction significantly improves observation quality: Through the two-level structure of “CNN preliminary reconstruction + diffusion model refinement”, the sparse and noisy event stream is converted into a high-quality pseudo grayscale image, effectively suppressing artifacts and supplementing texture, providing more reliable supervision for 3DGS photometric optimization.
[0161] (3) The photovoltage contrast constraint enhances tracking robustness: The physical generation mechanism of events (logarithmic brightness change) is introduced into the tracking loss. The rendered image and the pseudo-observations are aligned in the logarithmic brightness domain, making the system insensitive to absolute brightness deviation and local artifacts. It can still maintain stable pose estimation in high-speed motion, low-texture and high dynamic range scenes.
[0162] (4) Progressive mapping optimization adapts to low frame rate and large parallax: By generating pseudo-observations through new viewpoint extrapolation and a diffusion model, supervision is provided in advance for directions that have not yet been truly observed, significantly reducing the risk of Gaussian stretching along rays and of artifacts, and improving the completeness of the reconstructed model and the rendering quality.
[0163] (5) High engineering feasibility: each module has a clear structure and can run in real time on a single GPU to realize online processing of "event input → trajectory + map output", which is convenient for deployment on platforms such as robots, drones and augmented reality devices.
[0164] Another preferred embodiment of the present invention discloses a computer-readable storage medium storing a computer program, wherein the computer program is configured to be run by a processor to perform the monocular SLAM method based on an event camera and 3D Gaussian splatting described in the preferred embodiment above.
[0165] Optionally, the aforementioned storage media may include, but are not limited to, USB flash drives, read-only memory (ROM), random access memory (RAM), portable hard drives, magnetic disks, or optical disks, and other media capable of storing computer programs.
[0166] The background section of this invention may include background information about the problems or circumstances surrounding the invention, rather than a description of prior art by others. Therefore, the content included in the background section is not an admission of prior art by the applicant.
[0167] The above description provides a further detailed explanation of the present invention in conjunction with specific / preferred embodiments, and it should not be construed that the specific implementation of the present invention is limited to these descriptions. For those skilled in the art, various substitutions or modifications can be made to these described embodiments without departing from the concept of the present invention, and all such substitutions or modifications should be considered within the scope of protection of the present invention. In the description of this specification, the reference to terms such as "an embodiment," "some embodiments," "preferred embodiment," "example," "specific example," or "some examples," etc., indicates that the specific features, structures, materials, or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described can be combined in any suitable manner in one or more embodiments or examples. Furthermore, those skilled in the art can combine and integrate different embodiments or examples and features of different embodiments or examples described in this specification without contradiction. Although the embodiments of the present invention and their advantages have been described in detail, it should be understood that various changes, substitutions, and modifications can be made herein without departing from the scope defined by the appended claims.
Claims
1. A method for simultaneous localization and mapping based on a monocular event camera, characterized in that it includes the following steps:
S1: Receive the event stream from the monocular event camera and perform time alignment and windowing to generate an event tensor;
S2: Reconstruct the event tensor to generate a reconstructed image;
S3: Represent the 3D scene by a differentiable explicit scene representation method, and project and render the 3D scene in combination with the current camera pose to generate a predicted image;
S4: Optimize the current camera pose by minimizing the difference between the reconstructed image and the predicted image at the current viewpoint;
S5: Within a sliding window containing multiple keyframes, optimize the parameters of the 3D scene in multiple rounds by minimizing the basic mapping loss function, wherein the basic mapping loss function includes the difference between the reconstructed image of each keyframe and the predicted image at the corresponding viewpoint;
S6: Output a camera trajectory based on the optimized current camera pose, and output a 3D map based on the optimized and updated 3D scene parameters;
wherein step S4 includes: calculating a photometric consistency loss and a photovoltage contrast loss based on the difference between the reconstructed image and the predicted image at the current viewpoint; weighting the photometric consistency loss and the photovoltage contrast loss to form a combined loss function; and then optimizing the current camera pose by minimizing the combined loss function;
wherein calculating the photovoltage contrast loss based on the difference between the reconstructed image and the predicted image at the current viewpoint specifically includes: mapping the image intensities of the reconstructed image and the predicted image at the current viewpoint to the photovoltage domain; constructing a reference event map based on the photovoltage changes of two consecutive frames of the reconstructed image, and constructing a simulated event map based on the photovoltage changes of two consecutive frames of the predicted image; and calculating the photovoltage contrast loss based on the difference between the simulated event map and the reference event map in the logarithmic brightness change domain;
and wherein, after optimizing and updating the parameters of the 3D scene by minimizing the basic mapping loss function, the method further includes: generating at least one virtual camera pose near the current camera motion trajectory by pose extrapolation, and projecting and rendering the 3D scene in combination with the virtual camera pose to generate a virtual rendered image; denoising the virtual rendered image to generate a pseudo-observation image; and using the loss between the pseudo-observation image and the virtual rendered image to progressively optimize and update the parameters of the 3D scene.
2. The simultaneous localization and mapping method according to claim 1, characterized in that step S2 includes performing generative event reconstruction on the event tensor to generate a reconstructed image; wherein performing generative event reconstruction on the event tensor specifically includes: S21: The event tensor is processed by a convolutional neural network to reconstruct the initial pseudo-grayscale image; S22: The initial pseudo-grayscale image is refined using a diffusion model to generate a pseudo-grayscale image, which is the reconstructed image.
3. The simultaneous localization and mapping method according to claim 1, characterized in that using a differentiable explicit scene representation method to represent the three-dimensional scene in step S3 specifically includes: based on three-dimensional Gaussian splatting, the three-dimensional scene is parameterized by a three-dimensional Gaussian scene representation unit, wherein the three-dimensional Gaussian scene representation unit includes multiple Gaussian units, each Gaussian unit including position parameters describing spatial location, shape parameters describing geometric shape, and appearance parameters describing visual appearance.
4. The simultaneous localization and mapping method according to claim 1, characterized in that, Step S3, which combines the current camera pose to project and render the 3D scene to generate a predicted image, specifically includes: projecting the 3D scene onto a 2D image plane based on the given current camera pose, and obtaining the predicted image through differentiable rendering.
5. The simultaneous localization and mapping method according to claim 1, characterized in that, The calculation of photometric consistency loss based on the difference between the reconstructed image and the predicted image at the current viewpoint specifically includes: calculating the photometric consistency loss based on the photometric difference between the reconstructed image and the predicted image at the current viewpoint in the image intensity domain.
6. The simultaneous localization and mapping method according to claim 1, characterized in that, Step S5 specifically includes: optimizing the updated parameters of the 3D scene within a sliding window containing multiple keyframes by minimizing the basic mapping loss function, wherein the basic mapping loss function includes a photometric consistency loss calculated from the difference between the reconstructed image of each keyframe and the predicted image at the corresponding viewpoint, as well as an isotropic regularization term.
7. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program, wherein the computer program is configured to be run by a processor to perform the simultaneous localization and mapping method based on a monocular event camera as described in any one of claims 1 to 6.