A method for VI-SLAM accumulated error elimination

By recording the observable region of keyframes in the SLAM system and establishing a spatiotemporal local map, and optimizing the pose using IMU pre-integration and feature matching, the problem of low efficiency in cumulative error elimination in the VI-SLAM algorithm is solved, achieving efficient error elimination and data association.

CN116242390B · Active · Publication Date: 2026-06-12 · NINGBO JUNSHENG INTELLIGENT AUTOMOBILE TECH RES INST CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
NINGBO JUNSHENG INTELLIGENT AUTOMOBILE TECH RES INST CO LTD
Filing Date
2022-12-07
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing VI-SLAM algorithms suffer from high computational overhead and low efficiency when eliminating accumulated errors. In particular, loop closure detection is time-consuming in small scenes, and the scope of common view establishment is limited, making it impossible to efficiently reduce errors.

Method used

By recording the observable regions of keyframes in the SLAM system, a spatiotemporal local map is established. The observable regions of new input frames are predicted using IMU pre-integration. Feature matching and pose optimization are performed based on the spatiotemporal local map established by landmarks. Only co-view keyframes are selected for joint optimization.

Benefits of technology

It achieves efficient elimination of accumulated errors with low computational overhead, improves the reliability of data association and the ability to associate historical information, and reduces the amount of computation.

✦ Generated by Eureka AI based on patent content.


Abstract

The application provides a VI-SLAM accumulated error elimination method, comprising the following steps: obtaining observable areas of each historical key frame in a global map based on the global map in a SLAM system; in the case that a new input frame is added to the SLAM system, establishing an observable area of the new input frame; in the case that the observable area of the historical key frame and the observable area of the new input frame overlap, selecting a key frame from the historical key frame, selecting the key frame with the oldest time stamp, obtaining landmark points observed by the key frame with the oldest time stamp, and establishing a space-time domain local map based on the landmark points; performing feature matching on the space-time domain local map and the new input frame, establishing a corresponding relationship, and performing pose optimization on the new input frame; establishing a common view key frame, and performing joint optimization on the common view key frame. The application solves the technical problem that the existing method for reducing accumulated error is time-consuming and inefficient, and achieves the technical effect of improving the elimination efficiency of accumulated error.

Description

Technical Field

[0001] This invention relates to the field of computer vision technology, and more specifically, to a method for eliminating cumulative errors in VI-SLAM.

Background Technology

[0002] High-precision pose estimation for moving cameras is a crucial task in computer vision, serving as a fundamental function in many applications such as AR devices, robot navigation, and autonomous driving. The VI-SLAM algorithm is one of the better solutions for this task, achieving high-precision localization in 3D space by fusing the inputs from both a camera and an IMU (Inertial Measurement Unit).

[0003] However, in practical applications of the VI-SLAM algorithm, state estimation often suffers from error accumulation. Small errors generated by inter-frame pose estimation can accumulate to an unacceptable level over a long period. Therefore, eliminating accumulated errors has become a crucial issue in the practical implementation of the VI-SLAM algorithm. For smaller application scenarios, without relying on external sensors or prior maps, eliminating accumulated errors can rely on the association and optimization of current and historical data. When the camera moves near a historical trajectory, constraints are established and optimized between current and historical observations to balance errors accumulated over long periods of motion. Existing methods for associating historical data and reducing accumulated errors generally fall into two categories: loop closure detection and establishing a common view.

[0004] The conventional approach to loop closure detection is as follows: First, semantic information of the image is used to detect frames from the same scene; second, data associations are established between frames; then, multi-view geometric correlation methods are used to calculate the relative pose between the two frames; finally, the calculated relative pose constraints are added to the global pose constraints for joint optimization to balance the accumulated error. The main drawbacks of this method are: the accuracy of detecting frames from the same scene cannot be guaranteed; detecting frames from the same scene can be time-consuming; and loop closures are prone to occur frequently in small scenes, while frequent global joint optimization is often time-consuming and inefficient.

[0005] The conventional approach to establishing a shared view is to maintain a global 3D landmark map and map each landmark in the map to an image pixel based on feature descriptions. If two frames can be mapped to the same landmark, a constraint can be established between them, meaning they share a view. In subsequent optimization, frames with good shared views are jointly optimized to eliminate accumulated errors. The main drawbacks of this method are: the scope of the shared view is limited, making it unable to eliminate accumulated errors over a large area; the historical data association capability of the shared view is also limited, making it impossible to establish constraints with all historical data.

[0006] The problem is that existing methods for linking historical data and reducing accumulated errors are time-consuming and inefficient.

Summary of the Invention

[0007] This invention solves the technical problems of time-consuming and inefficient existing methods for associating historical data and reducing accumulated errors. It achieves the technical effect of effectively associating historical data with low computational overhead, while selecting only a small amount of historical data for joint optimization, thus efficiently eliminating accumulated errors.

[0008] To address the aforementioned problems, this invention provides a VI-SLAM cumulative error elimination method, comprising: obtaining the observable region of each historical keyframe in the global map of the SLAM system; establishing the observable region of the new input frame when a new input frame is added to the SLAM system; selecting a keyframe from the historical keyframes when the observable regions of the historical keyframes overlap with those of the new input frame, selecting the keyframe with the oldest timestamp based on the keyframe, obtaining the landmark points observed by the keyframe with the oldest timestamp, and establishing a spatiotemporal local map based on the landmark points; performing feature matching between the spatiotemporal local map and the new input frame to establish a correspondence, and optimizing the pose of the new input frame; establishing common-view keyframes and performing joint optimization on the common-view keyframes.

[0009] In one embodiment of the present invention, obtaining the observable region of historical keyframes in a global map includes: representing the observable region of the i-th historical keyframe as a 3D cube region defined by two three-dimensional vectors, where λ is an empirical constant, $M_i$ is the mean of the observable landmarks in the i-th historical keyframe, and $V_i$ is the variance of the observable landmarks in the i-th historical keyframe.

[0010] In one embodiment of the present invention, the mean and variance are calculated according to the following formulas:

[0011]

[0012]

[0013] Where n is the number of 3D landmarks observed in the i-th historical keyframe, $p_k$ represents the k-th 3D landmark observed in the i-th historical keyframe, and $S_i$ is the covariance matrix of all 3D landmarks observed in the i-th historical keyframe; the variance is obtained by taking the three numbers on the diagonal of the covariance matrix.

[0014] In one embodiment of the present invention, when a new input frame is added to the SLAM system, establishing the observable region of the new input frame includes: when a new input frame is added to the SLAM system, predicting the pose through IMU pre-integration to obtain the predicted position corresponding to the new input frame, and establishing the observable region of the new input frame based on the predicted position.

[0015] In one embodiment of the present invention, the IMU pre-integration is calculated according to the following formula:

[0016]

[0017]

[0018]

[0019] Where (p, v, q) represent the translation, velocity, and rotation state variables at different times, respectively; (w, $b_i$, $b_j$) represent the world frame and the IMU body frames at times i and j, respectively; a is the acceleration and ω is the angular velocity.

[0020] In one embodiment of the present invention, selecting a keyframe from the historical keyframes, selecting the keyframe with the oldest timestamp based on that keyframe, and obtaining the landmark points observed by the keyframe with the oldest timestamp includes:

[0021] Select the first historical keyframe that is closest to the new input frame from the historical keyframes, and select the N historical keyframes that are the oldest relative to the timestamp of the first historical keyframe according to the preset spatiotemporal local map, and obtain the landmark points observed by the N+1 historical keyframes.

[0022] In one embodiment of the present invention, pose optimization of a new input frame includes: performing pre-integration constraints on the new input frame and its corresponding adjacent frames respectively, and performing visual reprojection constraints on the landmark points of the new input frame to optimize the pose.

[0023] In one embodiment of the present invention, the objective function for pose optimization is:

[0024]

[0025] Where $T_j$ is the pose of the new input frame, $E_{imu}$ is the residual term constructed for the IMU pre-integration constraint, $E_{vis}$ is the residual term constructed for the visual reprojection constraint, C is the set of all visual observations of the new input frame, $\Sigma_{imu}$ is the covariance matrix corresponding to the IMU pre-integration, $\Sigma_{vis}$ is the covariance matrix corresponding to the visual observations, and $\rho_{Hub}$ is the Huber robust kernel function.

[0026] In one embodiment of the present invention, establishing shared-view keyframes and jointly optimizing the shared-view keyframes includes: taking the historical keyframes that share a view with the new input frame as first-level shared-view frames, taking the historical keyframes that share a view with the first-level shared-view frames, but not with the new input frame, as second-level shared-view frames, and jointly optimizing the new input frame, the first-level shared-view frames and the second-level shared-view frames.

[0027] In one embodiment of the present invention, the objective function for joint optimization is:

[0028]

[0029] Where S is the set of first-level common-view frames; X is the set of landmarks contained in the first-level common-view frames; $E_{imu}$ is the residual term constructed for the IMU pre-integration constraint; $E_{vis}$ is the residual term constructed for the visual reprojection constraint; C is the set of all visual observations of the new input frame; $\Sigma_{imu}$ is the covariance matrix corresponding to the IMU pre-integration; $\Sigma_{vis}$ is the covariance matrix corresponding to the visual observations; and $\rho_{Hub}$ is the Huber robust kernel function.

[0030] In summary, by adopting the technical solution of the present invention, the following technical effects can be achieved:

[0031] (1) This invention records the observable regions of keyframes in the VI-SLAM system and introduces the concept of a spatiotemporal local map based on the overlap of the observable regions. The cumulative error is eliminated through the association between the spatiotemporal map and newly added frames and joint optimization. Furthermore, the spatiotemporal local map is established based on the overlap of observable 3D cube regions, which only includes simple geometric relationship verification, requiring less computation and is more efficient.

[0032] (2) The observable region of the new input frame is predicted by IMU prediction, and the possible historical data is determined directly by the overlap of the observable region, which is more reliable than extracting the semantic information of the image;

[0033] (3) The spatiotemporal local map established by the present invention can contain long-term historical information and has a stronger ability to associate historical information compared with the common-view local map established based on adjacent reference keyframes.

[0034] (4) In the process of optimizing with historical information, only the few historical keyframes that share a view with the new frame are used, which requires less computation than global optimization.

Attached Figure Description

[0035] Figure 1 is a schematic diagram of the spatiotemporal local map construction provided in an embodiment of the present invention.

[0036] Figure 2 is a schematic diagram illustrating the construction of common-view relationships provided in an embodiment of the present invention.

Detailed Implementation

[0037] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, specific embodiments of the present invention will be described in detail below.

[0038] Example 1:

[0039] Referring to Figures 1 and 2, this invention provides a VI-SLAM cumulative error elimination method, comprising:

[0040] Based on the global map in the SLAM system, the observable area of each historical keyframe in the global map is obtained;

[0041] When a new input frame is added to the SLAM system, the observable region of the new input frame is established;

[0042] When the observable region of a historical keyframe overlaps with the observable region of the new input frame, a keyframe is selected from the historical keyframes; based on that keyframe, the keyframe with the oldest timestamp is selected, and the landmark points observed by the keyframe with the oldest timestamp are obtained. Based on the landmark points, a spatiotemporal local map is built.

[0043] The spatiotemporal local map is matched with the new input frame to establish a correspondence, and the pose of the new input frame is optimized.

[0044] Establish common-view keyframes and perform joint optimization on them.

[0045] Furthermore, when a new frame is input again, the above steps are repeated.

[0046] The cumulative error elimination method of this invention is based on a global map containing historical keyframes and 3D landmarks. The historical keyframes and landmarks in the map are continuously added as new data is input; the first batch of landmarks and historical keyframes is generated during SLAM initialization. Subsequently, existing landmarks in the map continuously attempt to establish matching relationships with newly added frames. If the matching relationship is weak, new keyframes need to be inserted into the map, and new landmarks are generated from these new keyframes through inter-frame triangulation or binocular triangulation. During this cyclical process, keyframes and landmarks establish a bidirectional connection: each keyframe records the landmarks it can observe, and each landmark records which keyframes observed it.
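The bidirectional connection between keyframes and landmarks described above can be illustrated with a minimal Python sketch. The class name `GlobalMap`, its attribute names, and the identifiers used in the example are hypothetical and serve only as an illustration; the text does not specify a data structure.

```python
from collections import defaultdict


class GlobalMap:
    """Toy global map that stores keyframe/landmark links in both directions."""

    def __init__(self):
        # keyframe id -> ids of the landmarks it observes
        self.landmarks_of = defaultdict(set)
        # landmark id -> ids of the keyframes that observed it
        self.keyframes_of = defaultdict(set)

    def add_observation(self, keyframe_id, landmark_id):
        # Record the link on both sides so that either lookup is direct.
        self.landmarks_of[keyframe_id].add(landmark_id)
        self.keyframes_of[landmark_id].add(keyframe_id)


gmap = GlobalMap()
gmap.add_observation("KF0", 1)
gmap.add_observation("KF0", 2)
gmap.add_observation("KF1", 2)
```

Storing both directions lets later steps ask "which landmarks does this keyframe see" and "which keyframes see this landmark" without scanning the whole map.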

[0047] This invention records the observable regions of keyframes in a VI-SLAM system and introduces the concept of a spatiotemporal local map based on the overlap of these regions. Accumulated errors are eliminated through the association of the spatiotemporal map with newly added frames and joint optimization. Furthermore, the spatiotemporal local map is built based on the overlap of observable 3D cubic regions, which only includes simple geometric relationship verification, requiring less computation and thus achieving higher efficiency in eliminating accumulated errors.

[0048] Furthermore, for each keyframe, its observable region needs to be calculated. Obtaining the observable regions of historical keyframes in the global map includes: representing the observable region of the i-th historical keyframe as a 3D cube region defined by two three-dimensional vectors, where λ is an empirical constant, $M_i$ is the mean of the observable landmarks in the i-th historical keyframe, and $V_i$ is the variance of the observable landmarks in the i-th historical keyframe.

[0049] Specifically, the two three-dimensional vectors represent the upper and lower bounds of the cube, respectively.

[0050] Furthermore, the mean and variance are calculated using the following formulas:

[0051]

[0052]

[0053] Where n is the number of 3D landmarks observed in the i-th historical keyframe, $p_k$ represents the k-th 3D landmark observed in the i-th historical keyframe, and $S_i$ is the covariance matrix of all 3D landmarks observed in the i-th historical keyframe; the variance is obtained by taking the three numbers on the diagonal of the covariance matrix.

[0054] By using the above formulas to calculate the mean and variance, the observable area of each historical keyframe can be obtained, making the calculation convenient.
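The observable-region computation can be sketched in Python as follows. The formula images are not available in this text, so two points are assumptions: the covariance is normalized by 1/n, and the bounds are taken as the mean minus and plus λ times the variance vector. The function name `observable_region` and the value λ = 2.0 are placeholders.

```python
import numpy as np


def observable_region(landmarks, lam=2.0):
    """Return (lower, upper) bounds of the axis-aligned region of one keyframe.

    landmarks: (n, 3) array of the 3D landmarks observed by the keyframe.
    lam: the empirical constant lambda (2.0 is only a placeholder value).
    """
    pts = np.asarray(landmarks, dtype=float)
    mean = pts.mean(axis=0)                  # M_i: mean of the landmarks
    centered = pts - mean
    cov = centered.T @ centered / len(pts)   # S_i: 1/n normalization (assumed)
    var = np.diag(cov)                       # V_i: the three diagonal entries
    # Assumed bound form: M_i - lambda * V_i and M_i + lambda * V_i.
    return mean - lam * var, mean + lam * var


lower, upper = observable_region(
    [[0, 0, 0], [2, 0, 0], [0, 2, 0], [2, 2, 0]], lam=2.0
)
```

For the four landmarks in the example, the mean is (1, 1, 0) and the variance vector is (1, 1, 0), so under these assumptions the region spans (-1, -1, 0) to (3, 3, 0).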

[0055] Furthermore, when a new input frame is added to the SLAM system, the observable region of the new input frame is established, including: when a new input frame is added to the SLAM system, the pose is predicted by IMU pre-integration to obtain the predicted position corresponding to the new input frame, and the observable region of the new input frame is established based on the predicted position.

[0056] Furthermore, the IMU pre-integration is calculated using the following formula:

[0057]

[0058]

[0059]

[0060] Where (p, v, q) represent the translation, velocity, and rotation state variables at different times, respectively; (w, $b_i$, $b_j$) represent the world frame and the IMU body frames at times i and j, respectively; a is the acceleration and ω is the angular velocity.
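The formula images for the IMU pre-integration are not available in this text. For orientation only, a commonly used continuous-time state propagation between times i and j, written with the symbols defined above, is sketched below. It is an assumption rather than the formula of this invention: sensor biases and noise are omitted, and the symbols $R^{w}_{t}$ (rotation from the body frame at time t to the world frame), $g^{w}$ (gravity in the world frame), $\Delta t$ (time between i and j) and $\Omega(\cdot)$ (the quaternion-rate matrix built from the angular velocity) are introduced here for illustration.

```latex
\begin{aligned}
p^{w}_{b_j} &= p^{w}_{b_i} + v^{w}_{b_i}\,\Delta t
  + \iint_{t\in[i,j]} \left( R^{w}_{t}\,a_t - g^{w} \right) dt^{2} \\
v^{w}_{b_j} &= v^{w}_{b_i}
  + \int_{t\in[i,j]} \left( R^{w}_{t}\,a_t - g^{w} \right) dt \\
q^{w}_{b_j} &= q^{w}_{b_i} \otimes
  \int_{t\in[i,j]} \tfrac{1}{2}\,\Omega(\omega_t)\,q^{b_i}_{t}\,dt
\end{aligned}
```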

[0061] Preferably, predicting the observable region of the new input frame using IMU prediction and directly determining possible historical data based on the overlap of the observable regions is more reliable than extracting image semantic information.

[0062] Specifically, for newly added frames, their observable region also needs to be calculated. Since newly added frames are not associated with landmarks in the initial state, their observable region is set to be the same size as the observable region of the nearest keyframe, and the center position of the region needs to be deduced by calculating the offset of the newly added frame relative to the nearest frame through IMU pre-integration.
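A minimal Python sketch of this step is given below. It assumes that the IMU-predicted position of the new frame and the position of the nearest keyframe are already available, and that the region is shifted by exactly the offset between them; the function and parameter names are hypothetical.

```python
import numpy as np


def predict_new_frame_region(nearest_lower, nearest_upper,
                             nearest_position, predicted_position):
    """Copy the nearest keyframe's region and shift it by the predicted offset."""
    lower = np.asarray(nearest_lower, dtype=float)
    upper = np.asarray(nearest_upper, dtype=float)
    # Offset of the new frame relative to the nearest keyframe,
    # assumed to come from IMU pre-integration.
    offset = (np.asarray(predicted_position, dtype=float)
              - np.asarray(nearest_position, dtype=float))
    # Same size as the nearest keyframe's region; only the center moves.
    return lower + offset, upper + offset


new_lower, new_upper = predict_new_frame_region(
    [-1.0, -1.0, 0.0], [3.0, 3.0, 0.0],
    nearest_position=[0.0, 0.0, 0.0],
    predicted_position=[0.5, 0.0, 0.0],
)
```

No landmark association is needed at this stage, which is why the region can be produced before any feature matching has run.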

[0063] Furthermore, selecting a keyframe from the historical keyframes, selecting the keyframe with the oldest timestamp based on that keyframe, and obtaining the landmark points observed by the keyframe with the oldest timestamp includes:

[0064] Select the first historical keyframe that is closest to the new input frame from the historical keyframes, and select the N historical keyframes that are the oldest relative to the timestamp of the first historical keyframe according to the preset spatiotemporal local map, and obtain the landmark points observed by the N+1 historical keyframes.

[0065] Specifically, when the observable region of a newly added frame overlaps with the observable region of a historical keyframe, all overlapping historical keyframes are considered as candidate frames for association. Among all candidate frames, the keyframe closest to the current frame is first selected. Then, based on the preset upper limit for the number of keyframes to build a local spatiotemporal map, older timestamps are prioritized. Finally, the landmark points observable by the selected keyframes are obtained as the local spatiotemporal map.

[0066] Specifically, the spatiotemporal local map is constructed as shown in Figure 1. The filled camera model represents the newly added frame F, while the unfilled camera models represent the historical keyframes {$KF_{n-6}$, $KF_{n-5}$, ..., $KF_{n-1}$, $KF_n$}. The observable region of the newly added frame is marked with a solid box. Keyframes drawn with dashed lines are those whose observable regions overlap with that of the newly added frame; their observable regions are marked with dashed boxes. For example, assuming 3 keyframes are used to build the spatiotemporal local map, the keyframe $KF_n$ closest to F is selected first. Next, keyframes with older timestamps are prioritized, giving $KF_{n-6}$ and $KF_{n-5}$. $KF_n$, $KF_{n-6}$ and $KF_{n-5}$ are therefore the keyframes used to construct the spatiotemporal local map. The keyframes selected with this strategy contain not only the latest landmarks but also a wealth of long-term historical information, and thus have a stronger ability to associate historical information. The landmarks observed in the selected keyframes are extracted and used as the spatiotemporal local map of the newly added frame F.
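The overlap check and the selection strategy can be sketched in Python as follows. This is an illustration under stated assumptions: "closest" is measured as the distance between region centers, the `Keyframe` class and function names are hypothetical, and the keyframe layout in the example is invented so that the result mirrors the example above.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Keyframe:
    name: str
    timestamp: float
    lower: np.ndarray  # lower bound of the observable region
    upper: np.ndarray  # upper bound of the observable region


def boxes_overlap(lo_a, up_a, lo_b, up_b):
    """Axis-aligned boxes overlap iff their intervals overlap on every axis."""
    return bool(np.all(lo_a <= up_b) and np.all(lo_b <= up_a))


def select_local_map_keyframes(keyframes, new_lower, new_upper, max_keyframes):
    """Pick the nearest overlapping keyframe, then the oldest overlapping ones."""
    candidates = [kf for kf in keyframes
                  if boxes_overlap(kf.lower, kf.upper, new_lower, new_upper)]
    if not candidates:
        return []
    new_center = (new_lower + new_upper) / 2
    # Assumed closeness measure: distance between region centers.
    nearest = min(
        candidates,
        key=lambda kf: np.linalg.norm((kf.lower + kf.upper) / 2 - new_center),
    )
    others = sorted((kf for kf in candidates if kf is not nearest),
                    key=lambda kf: kf.timestamp)  # oldest first
    return [nearest] + others[:max_keyframes - 1]


def cube(cx, half=1.0):
    """Helper for the example: a cube centered at (cx, 0, 0)."""
    center = np.array([cx, 0.0, 0.0])
    return center - half, center + half


# Hypothetical layout: only KF_n-6, KF_n-5, KF_n-1 and KF_n overlap with F.
centers = {"KF_n-6": 11.5, "KF_n-5": 11.2, "KF_n-4": 20.0, "KF_n-3": 25.0,
           "KF_n-2": 20.0, "KF_n-1": 9.0, "KF_n": 9.6}
keyframes = [Keyframe(name, float(t), *cube(cx))
             for t, (name, cx) in enumerate(centers.items())]
new_lower, new_upper = cube(10.0)
selected = select_local_map_keyframes(keyframes, new_lower, new_upper,
                                      max_keyframes=3)
selected_names = [kf.name for kf in selected]
```

With an upper limit of 3 keyframes, `selected_names` is `['KF_n', 'KF_n-6', 'KF_n-5']`. The overlap test is a handful of comparisons per keyframe, which is the simple geometric verification the method relies on.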

[0067] Furthermore, after constructing the spatiotemporal local map, the current frame can be matched with the local map to establish a connection between the new input frame and historical data. To balance real-time performance and feature matching performance, this invention uses ORB feature points for feature association. Since the new input frame has already obtained a prior pose through IMU integration, the efficiency of data association can be improved by projecting the local map onto the newly added frame. After feature association is completed, the pose of the newly added frame needs further optimization to obtain a high-precision real-time pose.
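The projection-guided association can be illustrated with the following Python sketch. It assumes a pinhole camera with intrinsic matrix K, a prior pose given as a world-to-camera rotation and translation, and binary descriptors stored as uint8 arrays (ORB descriptors are 256 bits, i.e. 32 bytes). The search radius, the distance threshold, and all names are hypothetical.

```python
import numpy as np


def hamming(a, b):
    """Hamming distance between two binary descriptors stored as uint8 arrays."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())


def match_by_projection(landmarks, lm_desc, keypoints, kp_desc,
                        R_cw, t_cw, K, radius=15.0, max_dist=50):
    """Return (landmark_index, keypoint_index) pairs found near each projection."""
    pts_c = landmarks @ R_cw.T + t_cw       # landmarks in the camera frame
    matches = []
    for i, p in enumerate(pts_c):
        if p[2] <= 0:                       # behind the camera: skip
            continue
        uv = (K @ p)[:2] / p[2]             # pinhole projection to pixels
        # Only keypoints within the search radius are compared.
        near = np.flatnonzero(np.linalg.norm(keypoints - uv, axis=1) < radius)
        if near.size == 0:
            continue
        dists = [hamming(lm_desc[i], kp_desc[j]) for j in near]
        best = int(np.argmin(dists))
        if dists[best] <= max_dist:
            matches.append((i, int(near[best])))
    return matches


K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
landmarks = np.array([[0.0, 0.0, 1.0], [0.1, 0.0, 1.0], [0.0, 0.0, -1.0]])
zeros = np.zeros(32, dtype=np.uint8)
ones = np.full(32, 255, dtype=np.uint8)
lm_desc = np.stack([zeros, ones, zeros])
keypoints = np.array([[61.0, 50.0], [49.0, 51.0]])
kp_desc = np.stack([ones, zeros])
matches = match_by_projection(landmarks, lm_desc, keypoints, kp_desc,
                              np.eye(3), np.zeros(3), K)
```

Restricting the descriptor comparison to a small window around each projected landmark is what makes the prior pose useful: far fewer candidate pairs are compared than in brute-force matching.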

[0068] Furthermore, pose optimization is performed on the new input frame, including: pre-integration constraints on the new input frame and its corresponding adjacent frames, and visual reprojection constraints on the landmark points of the new input frame, to optimize the pose.

[0069] Furthermore, the objective function for pose optimization is:

[0070]

[0071] Where $T_j$ is the pose of the new input frame, $E_{imu}$ is the residual term constructed for the IMU pre-integration constraint, $E_{vis}$ is the residual term constructed for the visual reprojection constraint, C is the set of all visual observations of the new input frame, $\Sigma_{imu}$ is the covariance matrix corresponding to the IMU pre-integration, $\Sigma_{vis}$ is the covariance matrix corresponding to the visual observations, and $\rho_{Hub}$ is the Huber robust kernel function.

[0072] The visual observation set C in the above formula is the feature matching between the spatiotemporal local map and the new input frame. Since it contains rich historical information, it can suppress the drift of real-time pose estimation to a certain extent.
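How such an objective might be evaluated for a given pose can be sketched in Python. The formula image is not available in this text, so the cost structure is an assumption: one Mahalanobis-weighted IMU term plus a Huber-robustified, Mahalanobis-weighted term for each visual observation in C. The Huber kernel is written in the form commonly used by nonlinear least-squares solvers, acting on the squared error; the function names and the threshold `delta` are hypothetical.

```python
import numpy as np


def huber(s, delta):
    """Huber kernel on a squared error s: identity inside delta, linear outside."""
    if s <= delta ** 2:
        return s
    return 2.0 * delta * np.sqrt(s) - delta ** 2


def pose_cost(e_imu, cov_imu, vis_residuals, cov_vis, delta=1.0):
    """IMU Mahalanobis term plus Huber-robustified visual reprojection terms."""
    # e^T * Sigma^{-1} * e, computed with a linear solve instead of an inverse.
    cost = float(e_imu @ np.linalg.solve(cov_imu, e_imu))
    for e_vis in vis_residuals:
        s = float(e_vis @ np.linalg.solve(cov_vis, e_vis))
        cost += float(huber(s, delta))
    return cost


cost = pose_cost(
    e_imu=np.array([1.0, 0.0, 0.0]),
    cov_imu=np.eye(3),
    vis_residuals=[np.array([0.5, 0.0]), np.array([2.0, 0.0])],
    cov_vis=np.eye(2),
    delta=1.0,
)
```

In the example, the IMU term contributes 1.0, the small visual residual contributes 0.25, and the large one is damped from 4.0 to 3.0, giving a total of 4.25. An optimizer would minimize this cost over the pose of the new input frame.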

[0073] Furthermore, establishing shared-view keyframes and jointly optimizing shared-view keyframes includes: taking historical keyframes that share view with the new input frame as first-level shared-view frames, taking historical keyframes that share view with first-level shared-view frames as second-level shared-view frames, and ensuring that second-level shared-view frames do not share view with the new input frame, and jointly optimizing the new input frame, first-level shared-view frames, and second-level shared-view frames.

[0074] Furthermore, the objective function for joint optimization is:

[0075]

[0076] Where S is the set of first-level common-view frames; X is the set of landmarks contained in the first-level common-view frames; $E_{imu}$ is the residual term constructed for the IMU pre-integration constraint; $E_{vis}$ is the residual term constructed for the visual reprojection constraint; C is the set of all visual observations of the new input frame; $\Sigma_{imu}$ is the covariance matrix corresponding to the IMU pre-integration; $\Sigma_{vis}$ is the covariance matrix corresponding to the visual observations; and $\rho_{Hub}$ is the Huber robust kernel function.

[0077] Specifically, if the current new input frame is selected as a keyframe, joint optimization among co-visible keyframes can be performed based on the co-visibility relationships. The procedure is as follows: keyframes that are co-visible with the current keyframe are selected as first-level co-visible frames, and keyframes that are co-visible with the first-level co-visible frames but are not themselves first-level co-visible frames are selected as second-level co-visible frames. The quantities being optimized are the set S of first-level co-visible frames and the set X of landmarks contained in the first-level co-visible frames. Second-level co-visible frames are also included in the optimization, but they are held fixed; that is, strong priors are added to the corresponding blocks of the Hessian matrix during gradient descent to ensure the stability of the optimization. Figure 2 illustrates an example of constructing the co-visibility relationship. KF represents the current keyframe, the points in circle 1 are the landmarks observable by the current keyframe, and KF1 and KF2 are keyframes that are co-visible with the current keyframe (i.e., first-level co-visible frames). The points in circles 2 and 3 are the other landmarks observable by the first-level co-visible frames, and {KF3, ..., KF7} are the other keyframes that can observe the points in circles 2 and 3 (i.e., second-level co-visible frames). Ultimately, {KF, KF1, KF2} is incorporated into the optimization as S, the three sets of landmarks in circles 1, 2, and 3 are incorporated as X, and the optimization is solved according to the joint objective function described above. Because only the few historical keyframes with co-visibility are used when optimizing with historical information, less computation is required than for global optimization.
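The two-level co-visibility selection can be sketched in Python using keyframe-to-landmark and landmark-to-keyframe lookups. The function name, the landmark ids, and the assignment of landmarks to keyframes in the example are hypothetical; they are chosen only to mirror the structure of the Figure 2 example.

```python
def covisibility_levels(current, landmarks_of, keyframes_of):
    """Split keyframes into first- and second-level co-visible sets.

    landmarks_of: keyframe -> set of landmark ids it observes.
    keyframes_of: landmark id -> set of keyframes that observed it.
    """
    first = set()
    for lm in landmarks_of[current]:
        first |= keyframes_of[lm]
    first.discard(current)

    second = set()
    for kf in first:
        for lm in landmarks_of[kf]:
            second |= keyframes_of[lm]
    second -= first | {current}            # keep only frames not already used

    optimized = {current} | first          # S: poses that are refined
    points = set()
    for kf in optimized:
        points |= landmarks_of[kf]         # X: landmarks that are refined
    return first, second, optimized, points


# Landmarks 1-2 stand for circle 1, 3-4 for circle 2, 5-6 for circle 3.
landmarks_of = {
    "KF": {1, 2},
    "KF1": {1, 3, 4},
    "KF2": {2, 5, 6},
    "KF3": {3}, "KF4": {4},
    "KF5": {5}, "KF6": {6}, "KF7": {5, 6},
}
keyframes_of = {}
for kf, lms in landmarks_of.items():
    for lm in lms:
        keyframes_of.setdefault(lm, set()).add(kf)

first, second, S, X = covisibility_levels("KF", landmarks_of, keyframes_of)
```

Here `first` is {KF1, KF2}, `second` is {KF3, ..., KF7}, `S` is {KF, KF1, KF2} and `X` contains landmarks 1 to 6. In an optimizer, the poses in `second` would contribute residuals but be held fixed.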

[0078] Preferably, the joint optimization described above can be performed each time a new keyframe is inserted. The resulting co-view can combine neighborhood data and long-term historical data simultaneously, which can largely eliminate the accumulated error in the SLAM algorithm, and the computational cost is relatively low.

[0079] While the present invention has been disclosed above, it is not limited thereto. Any person skilled in the art can make various modifications and alterations without departing from the spirit and scope of the invention; therefore, the scope of protection of the present invention should be determined by the scope defined in the claims.

Claims

1. A VI-SLAM accumulated error elimination method, characterized in that it comprises: obtaining, based on the global map in the SLAM system, the observable region of each historical keyframe in the global map; when a new input frame is added to the SLAM system, establishing the observable region of the new input frame; when the observable region of a historical keyframe overlaps with the observable region of the new input frame, selecting a keyframe from the historical keyframes, selecting the keyframe with the oldest timestamp based on that keyframe, and obtaining the landmark points observed by the keyframe with the oldest timestamp, which includes: selecting, from the historical keyframes, the first historical keyframe that is closest to the new input frame, and selecting, according to the preset spatiotemporal local map, the N historical keyframes with the oldest timestamps relative to the first historical keyframe, to obtain the landmark points observed by the N+1 historical keyframes; establishing a spatiotemporal local map based on the landmark points; performing feature matching between the spatiotemporal local map and the new input frame to establish a correspondence, and optimizing the pose of the new input frame; and establishing shared-view keyframes and jointly optimizing the shared-view keyframes, which includes: taking the historical keyframes that share a view with the new input frame as first-level shared-view frames, taking the historical keyframes that share a view with the first-level shared-view frames as second-level shared-view frames, wherein the second-level shared-view frames do not share a view with the new input frame, and jointly optimizing the new input frame, the first-level shared-view frames, and the second-level shared-view frames.

2. The VI-SLAM cumulative error elimination method according to claim 1, characterized in that obtaining the observable region of each historical keyframe in the global map includes: representing the observable region of the i-th historical keyframe as a 3D cube region defined by two three-dimensional vectors, where λ is an empirical constant, $M_i$ is the mean of the observable landmarks in the i-th historical keyframe, and $V_i$ is the variance of the observable landmarks in the i-th historical keyframe.

3. The VI-SLAM cumulative error elimination method according to claim 2, characterized in that the mean is calculated using the following formula: ; ; where n is the number of 3D landmarks observed in the i-th historical keyframe, $p_k$ represents the k-th 3D landmark observed in the i-th historical keyframe, and $S_i$ is the covariance matrix of all 3D landmarks observed in the i-th historical keyframe; the variance is obtained by taking the three numbers on the diagonal of the covariance matrix.

4. The VI-SLAM cumulative error elimination method according to claim 1, characterized in that, When a new input frame is added to the SLAM system, establishing the observable region of the new input frame includes: When a new input frame is added to the SLAM system, the pose is predicted by IMU pre-integration to obtain the predicted position corresponding to the new input frame, and the observable region of the new input frame is established based on the predicted position.

5. The VI-SLAM cumulative error elimination method according to claim 1, characterized in that, The pose optimization of the new input frame includes: Pre-integration constraints are applied to the new input frame and its corresponding adjacent frames, and visual reprojection constraints are applied to the landmark points of the new input frame to optimize the pose.

6. The VI-SLAM cumulative error elimination method according to claim 5, characterized in that the objective function for pose optimization is: ; where $T_j$ is the pose of the new input frame, $E_{imu}$ is the residual term constructed for the IMU pre-integration constraint, $E_{vis}$ is the residual term constructed for the visual reprojection constraint, C is the set of all visual observations of the new input frame, $\Sigma_{imu}$ is the covariance matrix corresponding to the IMU pre-integration, $\Sigma_{vis}$ is the covariance matrix corresponding to the visual observations, and $\rho_{Hub}$ is the Huber robust kernel function.