A real-time indoor scene visual simultaneous localization and mapping method

A signed distance and color prediction network, implemented as a shallow multilayer perceptron and trained in real time, addresses the poor performance and high resource consumption of existing visual simultaneous localization and mapping techniques. It enables lightweight, real-time 3D reconstruction of indoor scenes that generalizes across scene types.

CN116721206B · Active · Publication Date: 2026-06-26 · ZHEJIANG UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: ZHEJIANG UNIV
Filing Date: 2023-05-26
Publication Date: 2026-06-26


Abstract

The application discloses a real-time indoor scene visual simultaneous localization and mapping method. First, a camera collects scene images and corresponding depth maps in real time, which are recorded as the current frame data. The initial camera pose of the current frame is then optimized according to the current frame data, and the optimized pose is used as the initial camera pose of the next frame. If the current frame is a key frame, the current signed distance and color prediction network is trained and updated, and the predicted depth and color value of each pixel in the current key frame are obtained. If the current frame is a normal frame, the current signed distance and color prediction network is used for prediction only, yielding the predicted depth and color value of each pixel in the current normal frame. From these predictions, the surface geometry map of the scene in the current view area is constructed, and the surface geometry map of the whole scene is obtained. The application uses a lightweight network that requires no pre-training, which improves localization speed and ensures the real-time performance of simultaneous localization and mapping.

Description

Technical Field

[0001] This invention relates to a visual simultaneous localization and mapping method, specifically a real-time indoor scene visual simultaneous localization and mapping method.

Background Technology

[0002] Simultaneous localization and mapping (SLAM) is a crucial technology in fields such as robotics and autonomous driving. Its goal is to construct a map of the environment surrounding the user and determine the user's pose within it. Based on SLAM, a system can plan its routes and behaviors using real-time information about the surrounding environment and its own location, ensuring normal and stable operation.

[0003] Visual simultaneous localization and mapping (VSLAM) relies on computer vision information and methods and can achieve relatively good results. Conventional VSLAM methods are mostly based on PTAM, which divides the system into two parts: camera tracking and local mapping. However, these methods often struggle with geometric estimation in weakly textured scenes and unobserved regions, and they require significant memory to store the geometric information of the entire scene.

[0004] VSLAM methods based on implicit scene representation use a multilayer perceptron network that takes coordinates as input to jointly reconstruct the scene geometry and estimate the camera pose. This avoids the complex image matching and local mapping operations of traditional VSLAM methods. However, predicting scene depth by integrating the volume density of sampling points can lead to rough surfaces, insufficient texture detail, and flaws in the reconstructed geometry.

[0005] The first existing technique, described by Sucar et al. in their paper "iMAP: Implicit Mapping and Positioning in Real-Time," uses a sequence of color and depth maps as input data and employs a single multilayer perceptron network to represent the entire scene. However, because the capacity of a single multilayer perceptron network is limited, it cannot capture detailed scene geometry or track camera poses accurately, especially in larger scenes.

[0006] The second existing technique, described by Zhu et al. in their paper "NICE-SLAM: Neural Implicit Scalable Encoding for SLAM," also uses a sequence of color and depth maps as input data. It employs a multi-level feature grid to encode the geometric and appearance information of the scene and introduces neural decoders pre-trained at different resolutions. This allows for more detailed mapping and more accurate localization in larger scenes, and is fast and computationally inexpensive. However, pre-training the decoder on specific datasets makes it difficult to generalize to different types of scenes.

Summary of the Invention

[0007] Existing visual simultaneous localization and mapping (VSLAM) methods suffer from poor performance and accuracy in 3D scene reconstruction, depend on pre-trained network models that do not generalize to different scenes, and use large network models that consume significant memory. To address these problems, this invention proposes a real-time indoor scene VSLAM method. The method uses a shallow multilayer perceptron network that requires no pre-training to construct the 3D geometry of the indoor scene while optimizing the camera pose in real time, thereby improving reconstruction performance and real-time capability while reducing the model's memory and computing requirements.

[0008] The technical solution adopted in this invention is:

[0009] S1: Establish a signed distance and color prediction network;

[0010] S2: Use the camera to acquire scene images and corresponding depth maps in real time and record them as the current frame data. Determine whether the current frame is a key frame. If it is a key frame, execute S3; otherwise, record it as a normal frame and execute S4.

[0011] S3: Based on the current keyframe data, train the current signed distance and color prediction network, obtain the trained signed distance and color prediction network and update the network, and obtain the predicted depth and color value of each pixel in the current keyframe; at the same time, optimize the initial camera pose of the current keyframe based on the current keyframe data, obtain the optimized camera pose of the current keyframe and use it as the initial camera pose of the next frame.

[0012] S4: Based on the current normal frame data, use the current signed distance and color prediction network to make predictions and obtain the prediction output of the network. Calculate the predicted depth and color value of each pixel in the current normal frame based on the prediction output of the network. At the same time, optimize the initial camera pose of the current normal frame based on the current normal frame data, obtain the optimized camera pose of the current normal frame and use it as the initial camera pose of the next frame.

[0013] S5: Based on the predicted depth and color values of the current frame and the optimized camera pose, construct a surface geometry map of the scene from the current viewpoint;

[0014] S6: Repeat S2-S5 continuously to obtain surface geometry diagrams of the scene from different perspectives.
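The control flow of steps S2 to S6 can be sketched as a simple loop. This is only an illustrative skeleton: every callable passed in (`is_keyframe`, `train_network`, `predict`, `optimize_pose`, `build_surface`) is a hypothetical placeholder for the corresponding step, and treating the first frame as a key frame follows the embodiment described later in the text.

```python
def run_slam(frames, init_pose, is_keyframe, train_network, predict,
             optimize_pose, build_surface):
    """Skeleton of S2-S6; all callables are hypothetical stand-ins."""
    pose = init_pose
    surfaces = []
    for index, frame in enumerate(frames):           # S2: acquire current frame
        if index == 0 or is_keyframe(frame, pose):
            depth, color = train_network(frame, pose)  # S3: train and update network
        else:
            depth, color = predict(frame, pose)        # S4: prediction only
        # S3/S4: the optimized pose becomes the initial pose of the next frame.
        pose = optimize_pose(frame, pose)
        surfaces.append(build_surface(depth, color, pose))  # S5
    return surfaces, pose                             # S6: one surface map per view
```

With stub callables, the loop trains on key frames, only predicts on normal frames, and carries the optimized pose forward from frame to frame.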

[0015] In S1, the signed distance and color prediction network is a shallow multilayer perceptron network.

[0016] In S2, there are several ordinary frames between two adjacent keyframes.

[0017] In S3 or S4, for each pixel in the scene image of the current frame, a set of sampling points is first determined within the camera depth range along the ray corresponding to that pixel. The coordinates of all sampling points in the set are then calculated from the ray direction and the sampling point depths. These coordinates are position-encoded using SIREN (sinusoidal activation position encoding) and input into the signed distance and color prediction network to obtain the signed distance and color value corresponding to each sampling point. Finally, the predicted depth and color value corresponding to each pixel in the scene image of the current frame are calculated and used as the predicted depth and color value of the current frame.

[0018] The set of sampling points includes sampling points evenly distributed at equal intervals along each segment of the ray, and sampling points normally distributed within the camera depth range, centered on the scene depth determined by the true depth map.

[0019] Compared with the prior art, the beneficial effects of the present invention are as follows:

[0020] This invention uses a shallow multilayer perceptron network to encode the geometric information of indoor scenes, eliminating the input-image preprocessing that is common in visual simultaneous localization and mapping methods. Furthermore, the lightweight network model reduces memory consumption and computational overhead and increases running speed, thereby improving the real-time performance of the model.

[0021] This invention utilizes scene depth information obtained from a depth camera and employs a signed distance field for scene depth estimation. This avoids the inaccuracies caused by depth estimation using volume density integrals, thereby improving the efficiency and accuracy of indoor scene geometric information restoration and reconstruction.

[0022] The signed distance and color prediction network model used in this invention is trained and optimized in real time during operation, using scene images and depth map data acquired in real time by a depth camera. It does not require pre-training on a specific dataset and can generalize to different types of scenes.

Description of the Drawings

[0023] Figure 1 is a flowchart of the overall process of a real-time indoor scene visual simultaneous localization and mapping method according to an embodiment of the present invention.

[0024] Figure 2 is a detailed flowchart of a real-time indoor scene visual simultaneous localization and mapping method according to an embodiment of the present invention.

[0025] Figure 3 shows the reconstruction result for the scene geometry in a local area of an indoor scene according to an embodiment of the present invention.

Detailed Implementation

[0026] The specific process of the method of the present invention will now be clearly, thoroughly and completely described in conjunction with the accompanying drawings.

[0027] As shown in Figure 1 and Figure 2, the present invention includes the following steps:

[0028] S1: Establish a signed distance and color prediction network;

[0029] In S1, the signed distance and color prediction network is a shallow multilayer perceptron network, that is, a multilayer perceptron network with fewer than 10 hidden layers. In this embodiment, the signed distance and color prediction network is a multilayer perceptron network with 4 hidden layers, and GELU is used as the activation function for the hidden layers.
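A network of this shape can be sketched in plain Python as follows. This is a minimal forward pass under stated assumptions, not the patented implementation: the hidden width, the weight initialization, and the output layout (one signed distance followed by three color channels) are illustrative choices that the text does not specify, and the input is assumed to be already position-encoded.

```python
import math
import random


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def init_mlp(in_dim, hidden_dim, n_hidden=4, out_dim=4, seed=0):
    """Create weights for n_hidden hidden layers plus one output layer.

    hidden_dim, the uniform initialization and out_dim=4 are assumptions.
    """
    rng = random.Random(seed)
    dims = [in_dim] + [hidden_dim] * n_hidden + [out_dim]
    layers = []
    for fan_in, fan_out in zip(dims[:-1], dims[1:]):
        bound = 1.0 / math.sqrt(fan_in)
        weights = [[rng.uniform(-bound, bound) for _ in range(fan_in)]
                   for _ in range(fan_out)]
        layers.append((weights, [0.0] * fan_out))
    return layers


def forward(layers, encoded_point):
    """Return (signed_distance, color) for one encoded sampling point."""
    h = list(encoded_point)
    for index, (weights, biases) in enumerate(layers):
        h = [sum(w * v for w, v in zip(row, h)) + b
             for row, b in zip(weights, biases)]
        if index < len(layers) - 1:      # GELU on hidden layers only
            h = [gelu(v) for v in h]
    return h[0], h[1:]
```

For example, `forward(init_mlp(3, 8), [0.1, 0.2, 0.3])` returns one scalar and a three-element list. Whether the color outputs are further squashed into a fixed range is left open here.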

[0030] S2: Use the camera to acquire scene images and corresponding depth maps in real time and record them as the current frame data. Each pair of scene image and depth map corresponds to one set of camera pose parameters. Determine whether the current frame is a keyframe. If it is a keyframe, proceed to S3; otherwise, record it as a normal frame and proceed to S4.

[0031] In S2, several ordinary frames lie between two adjacent keyframes. In a specific implementation, the first frame is used as the first keyframe and initializes the prediction network. For each subsequent frame, the method computes, over the pixel samples of the current frame, the proportion of pixels whose relative error between the predicted depth and the depth obtained from the true depth map is less than a threshold. A frame whose proportion is greater than the set value is used as a keyframe.
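The keyframe test described above can be sketched as follows. The function names, the default values of `rel_threshold` and `set_value`, and the decision to skip pixels with a non-positive measured depth are all assumptions made for illustration; the comparison direction follows the wording of the text.

```python
def inlier_proportion(predicted, measured, rel_threshold):
    """Share of sampled pixels whose relative depth error is below the threshold.

    Pixels with non-positive measured depth are skipped (an assumption
    about how invalid depth readings are handled).
    """
    valid = [(p, m) for p, m in zip(predicted, measured) if m > 0]
    if not valid:
        return 0.0
    inliers = sum(1 for p, m in valid if abs(p - m) / m < rel_threshold)
    return inliers / len(valid)


def is_keyframe(predicted, measured, rel_threshold=0.1, set_value=0.65):
    """Keyframe decision as worded in the text: proportion greater than the set value."""
    return inlier_proportion(predicted, measured, rel_threshold) > set_value
```

With predicted depths `[1.0, 2.0, 3.0, 4.0]` and measured depths `[1.05, 2.0, 3.9, 0.0]`, two of the three valid pixels fall within a 10% relative error, so the proportion is 2/3.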

[0032] S3: Based on the current keyframe data, train the current signed distance and color prediction network, obtain the trained signed distance and color prediction network and update the network, and obtain the predicted depth and color value of each pixel in the current keyframe; at the same time, optimize the initial camera pose of the current keyframe based on the current keyframe data, obtain the optimized camera pose of the current keyframe and use it as the initial camera pose of the next frame.

[0033] S4: Based on the current normal frame data, use the current signed distance and color prediction network to make predictions and obtain the prediction output of the network. Calculate the predicted depth and color value of each pixel in the current normal frame based on the prediction output of the network. At the same time, optimize the initial camera pose of the current normal frame based on the current normal frame data, obtain the optimized camera pose of the current normal frame and use it as the initial camera pose of the next frame.

[0034] In S3 or S4, for each pixel in the scene image of the current frame, a set of sampling points is first determined within the camera depth range along the ray corresponding to that pixel. The set includes sampling points evenly distributed at equal intervals along each segment of the ray, and sampling points normally distributed within the camera depth range, centered on the scene depth determined by the true depth map. The coordinates of all sampling points in the set are then calculated from the ray direction and the sampling point depths. These coordinates are position-encoded using SIREN (sinusoidal activation position encoding) and input into the signed distance and color prediction network to obtain the signed distance and color value corresponding to each sampling point. Finally, the predicted depth and color value corresponding to each pixel in the scene image of the current frame are calculated and used as the predicted depth and color value of the current frame. During training of the signed distance and color prediction network, the depth loss and color loss are calculated from the predicted depth and color values of each pixel in the scene image of the current frame. The losses are backpropagated and the network parameters are updated. Training is iterated until the network converges, yielding the trained signed distance and color prediction network for the current keyframe and thereby constructing and expressing the three-dimensional geometric structure of the indoor scene.

[0035] Specifically:

[0036] The ray $r$ corresponding to each pixel in the scene image is calculated using the following formula:

[0037] $r = T_{wc} K^{-1} [u, v]$

[0038] where $K$ is the camera intrinsic parameter matrix (its inverse back-projects the pixel into the camera frame), $T_{wc}$ is the camera pose, and $[u, v]$ are the pixel coordinates on the camera image corresponding to ray $r$.
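As a concrete illustration of computing a pixel ray, the sketch below back-projects a pixel through a pinhole intrinsic model and rotates the result into the world frame. The parameter names (`fx`, `fy`, `cx`, `cy`), the pinhole model without distortion, and the split of the pose into a 3×3 rotation and a separate camera center are assumptions; back-projection uses the inverse of the intrinsic mapping.

```python
def pixel_ray(u, v, fx, fy, cx, cy, rotation):
    """World-frame ray direction for pixel (u, v).

    The camera-frame direction has z = 1, so a sampling depth t measured
    along the optical axis maps to the point origin + t * direction.
    """
    d_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]   # inverse intrinsics applied to [u, v, 1]
    return [sum(rotation[i][j] * d_cam[j] for j in range(3)) for i in range(3)]


def point_on_ray(origin, direction, t):
    """3D sampling point at depth t; origin is the camera center in world coordinates."""
    return [o + t * d for o, d in zip(origin, direction)]
```

When the camera sits at the world origin, `point_on_ray` reduces to the product of the sampling depth and the ray, as in the sampling formulas of the text.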

[0039] The three-dimensional coordinates $x_i$ of the sampling points uniformly distributed at equal intervals on the ray are determined by the following formulas:

[0040] $t_i \sim \mathcal{U}\left[t_n + \frac{i-1}{N_c}(t_f - t_n),\; t_n + \frac{i}{N_c}(t_f - t_n)\right], \quad i = 1, \dots, N_c$

[0041] $x_i = t_i r$

[0042] where $t_i$ is the sampling depth value of sampling point $i$ along the ray direction, $r$ is the ray corresponding to a pixel in the scene image, $\mathcal{U}$ denotes the uniform distribution, $t_n$ and $t_f$ are the lower and upper limits of the camera depth along ray $r$, respectively, and $N_c$ is the number of equal intervals into which the ray is divided within the camera depth range, i.e., the number of sampling points obtained using this method.

[0043] The three-dimensional coordinates $x_i$ of the sampling points that are normally distributed within the camera depth range, centered on the scene depth determined by the true depth map, are determined by the following formulas:

[0044] $t_i \sim \mathcal{N}\left(d[u,v],\ \sigma_d^2\right), \quad i = 1, \dots, N_f$

[0045] $x_i = t_i r$

[0046] where $\mathcal{N}$ denotes the normal distribution, $d[u,v]$ is the scene depth at pixel $[u,v]$ obtained by the depth camera, $\sigma_d^2$ is the variance of the distribution (its value is not given here), and $N_f$ is the number of sampling points obtained using this method.
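The two sampling schemes can be illustrated with a short sketch: one depth drawn uniformly inside each of the equal segments of the camera depth range, plus extra depths drawn from a normal distribution centered on the measured depth. The standard deviation `std`, the clamping of normal samples to the depth range, and all function names are assumptions made for this example.

```python
import random


def stratified_depths(t_near, t_far, n_intervals, rng):
    """One uniform sample inside each of n_intervals equal segments of [t_near, t_far]."""
    width = (t_far - t_near) / n_intervals
    return [t_near + (i + rng.random()) * width for i in range(n_intervals)]


def surface_depths(measured_depth, std, n_samples, t_near, t_far, rng):
    """Normal samples centered on the measured depth, clamped to the depth range.

    Clamping is an assumption used to keep samples within the camera range.
    """
    return [min(max(rng.gauss(measured_depth, std), t_near), t_far)
            for _ in range(n_samples)]


def sample_ray_depths(measured_depth, t_near, t_far, n_uniform, n_surface, std, rng):
    """Combined, sorted set of sampling depths for one ray."""
    depths = stratified_depths(t_near, t_far, n_uniform, rng)
    depths += surface_depths(measured_depth, std, n_surface, t_near, t_far, rng)
    return sorted(depths)
```

Calling `sample_ray_depths(2.0, 0.5, 4.5, 8, 16, 0.1, random.Random(0))` yields 24 sorted depths, most of them concentrated near the measured depth of 2.0, which is where the surface is expected to lie.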

[0047] The predicted depth corresponding to the sampling points is calculated using the following formula:

[0048]

[0049] where $s_i$ is the signed distance prediction value of sampling point $i$, obtained by encoding the three-dimensional coordinates $x_i$ of the sampling point with SIREN and feeding the result into the signed distance and color prediction network.

[0050] The predicted color value corresponding to a pixel in the image is calculated from the probability density $\sigma$ at the sampling points and the predicted color value $c$. The probability density $\sigma$ is calculated using the following formula:

[0051]

[0052] where $\sigma(\cdot)$ represents the probability density value, $s$ is the signed distance prediction value for each sampling point, and $tr$ is the cutoff (truncation) distance. The color prediction value corresponding to a pixel is calculated using the following formulas:

[0053]

[0054] $\tau(t) = \sigma(r(t))\,T(t)$

[0055]

[0056] where $T(\cdot)$ represents the transparency function, $r(\cdot)$ represents the ray on which the sampling point lies, $t$ is the depth of the sampling point along the ray $r$ corresponding to the pixel, measured from the camera origin, $\tau(\cdot)$ is the probability density function, and $c$ is the color prediction value at the sampling point.
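To make the relation between the probability density, the transparency, and the pixel color more concrete, the sketch below evaluates a discrete version of τ(t) = σ(r(t)) T(t) along one ray. It assumes the transparency takes the standard volume-rendering form T(t) = exp(-∫σ), approximates the integrals with a simple Riemann sum, and takes the per-sample density values as given inputs, since the mapping from signed distance to density is not reproduced in the text. All of these are assumptions for illustration.

```python
import math


def render_color(depths, densities, colors):
    """Discrete color rendering along one ray.

    depths:    increasing sampling depths t_i (at least two)
    densities: sigma at each sample (assumed to be precomputed)
    colors:    predicted RGB triple at each sample
    Returns (rendered_rgb, weights), where weights[i] = tau(t_i) * delta_i.
    """
    if len(depths) < 2:
        raise ValueError("need at least two samples to form intervals")
    deltas = [b - a for a, b in zip(depths[:-1], depths[1:])]
    deltas.append(deltas[-1])          # assumption: reuse the last spacing
    optical_depth = 0.0
    weights = []
    for sigma, delta in zip(densities, deltas):
        transparency = math.exp(-optical_depth)       # T(t_i), assumed form
        weights.append(sigma * transparency * delta)  # tau(t_i) * delta_i
        optical_depth += sigma * delta
    rendered = [sum(w * c[k] for w, c in zip(weights, colors)) for k in range(3)]
    return rendered, weights
```

If only the middle of three samples has nonzero density, the rendered color equals the color predicted at that sample, which matches the intuition that the pixel takes its color from where the ray meets the surface.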

[0057] In a specific implementation, the scene image is divided into an 8×8 grid, the average rendering loss of each grid region is calculated, and a pixel sampling strategy is established from the results. During model optimization, more pixels are drawn from regions that contain more scene detail or whose geometry has been restored less accurately or less completely, instead of computing over all pixels in the image. This reduces the computational overhead and improves the running efficiency of the method. When optimizing the camera pose, the translation and rotation contained in the camera pose matrix in SE(3) are decoupled into the Cartesian product SO(3)×T(3), and separate Adam optimizers with different learning rates are used for rotation and translation during training to improve the stability of the system.
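One possible reading of the loss-guided pixel sampling is sketched below: the per-region average losses determine how many pixels to draw from each grid cell, and pixels are then drawn uniformly inside each cell. Allocating samples in proportion to the loss, rounding down, and the function names are assumptions; the text only states that regions with higher loss should contribute more pixels. The pose optimization part of the paragraph is not covered by this sketch.

```python
import random


def allocate_samples(cell_losses, total_samples):
    """Number of pixels to draw per grid cell, proportional to its average loss.

    Rounds down, so the counts may sum to slightly less than total_samples.
    """
    total_loss = sum(sum(row) for row in cell_losses)
    n_cells = sum(len(row) for row in cell_losses)
    if total_loss <= 0:
        return [[total_samples // n_cells for _ in row] for row in cell_losses]
    return [[int(total_samples * loss / total_loss) for loss in row]
            for row in cell_losses]


def sample_pixels(counts, height, width, rng):
    """Draw (u, v) pixel coordinates uniformly inside each grid cell."""
    rows, cols = len(counts), len(counts[0])
    pixels = []
    for gy in range(rows):
        v0, v1 = gy * height // rows, (gy + 1) * height // rows
        for gx in range(cols):
            u0, u1 = gx * width // cols, (gx + 1) * width // cols
            for _ in range(counts[gy][gx]):
                pixels.append((rng.randrange(u0, u1), rng.randrange(v0, v1)))
    return pixels
```

For an 8×8 loss grid in which only two cells have nonzero loss in a 3:1 ratio, a budget of 200 pixels is split 150 to 50 between those two cells, and no pixels are spent on the regions that are already well reconstructed.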

[0058] S5: Based on the predicted depth and color values of the current frame and the optimized camera pose, construct a surface geometry map of the scene from the current viewpoint;

[0059] S6: Repeat S2-S5 continuously to obtain surface geometry maps of the scene from different perspectives. As shown in Figure 3, this embodiment restores most of the geometric structure of the various areas of the indoor scene and achieves good results in areas with rich detail.

[0060] The present invention proposes a real-time indoor scene visual simultaneous localization and mapping method, which performs localization and tracking in real time based on the color images and depth maps of the indoor scene collected by a depth camera, and simultaneously restores and reconstructs the three-dimensional geometric structure of the indoor scene. The model is lightweight, has low computational overhead, and requires no pre-training.

[0061] Finally, it should be noted that the above embodiments and descriptions are only used to illustrate the technical solutions of the present invention and not to limit it. Those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the disclosure of the technical solutions of the present invention, and all such modifications and substitutions should be covered within the protection scope of the claims of the present invention.

Claims

1. A real-time indoor scene visual simultaneous localization and mapping method, characterized in that it includes the following steps:
S1: Establish a signed distance and color prediction network;
S2: Use the camera to acquire scene images and corresponding depth maps in real time and record them as the current frame data. Determine whether the current frame is a key frame. If it is a key frame, execute S3; otherwise, record it as a normal frame and execute S4;
S3: Based on the current keyframe data, train the current signed distance and color prediction network, obtain the trained signed distance and color prediction network and update the network, and obtain the predicted depth and color value of each pixel in the current keyframe; at the same time, optimize the initial camera pose of the current keyframe based on the current keyframe data, obtain the optimized camera pose of the current keyframe and use it as the initial camera pose of the next frame;
S4: Based on the current normal frame data, use the current signed distance and color prediction network to make predictions and obtain the prediction output of the network. Calculate the predicted depth and color value of each pixel in the current normal frame based on the prediction output of the network. At the same time, optimize the initial camera pose of the current normal frame based on the current normal frame data, obtain the optimized camera pose of the current normal frame and use it as the initial camera pose of the next frame;
In S3 or S4, for each pixel in the scene image of the current frame, a set of sampling points is first determined within the camera depth range along the ray corresponding to that pixel. The coordinates of all sampling points in the set are then calculated from the ray direction and the sampling point depths. These coordinates are position-encoded using SIREN (sinusoidal activation position encoding) and input into the signed distance and color prediction network to obtain the signed distance and color value corresponding to each sampling point. Finally, the predicted depth and color value corresponding to each pixel in the scene image of the current frame are calculated and used as the predicted depth and color value of the current frame. The set of sampling points includes sampling points that are uniformly distributed at equal intervals along each segment of the ray and sampling points that are normally distributed within the camera depth range, centered on the scene depth determined by the true depth map;
S5: Based on the predicted depth and color values of the current frame and the optimized camera pose, construct a surface geometry map of the scene from the current viewpoint;
S6: Repeat S2-S5 continuously to obtain surface geometry maps of the scene from different perspectives.

2. The real-time indoor scene visual simultaneous localization and mapping method according to claim 1, characterized in that in S1, the signed distance and color prediction network is a shallow multilayer perceptron network.

3. The real-time indoor scene visual simultaneous localization and mapping method according to claim 1, characterized in that in S2, there are several ordinary frames between two adjacent keyframes.