Fast relocalization method based on sparse semantic anchor points and inertial sensor tight coupling
By tightly coupling sparse semantic anchors with inertial sensing, the problem of long relocalization time and high resource consumption in visual-inertial SLAM technology in complex scenes is solved, achieving fast and low-overhead relocalization, improving user experience and system reliability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHONGQING AEROSPACE POLYTECHNIC COLLEGE
- Filing Date
- 2026-03-10
- Publication Date
- 2026-06-19
AI Technical Summary
Existing visual-inertial SLAM technology has a large computational load and long time consumption in the relocalization process in complex real-world scenarios, and is sensitive to dynamic environments, resulting in virtual content drift, high consumption of computing resources, and poor user experience.
A method tightly coupled sparse semantic anchors and inertial sensing is adopted. The sparse semantic map is used to track the benchmark, and the inertial data is combined for rapid relocalization. Lightweight sparse semantic anchors and IMU-predicted pose are used for short-term prediction to achieve rapid recovery.
It achieves millisecond-level fast relocation, improves the relocation success rate in complex scenarios, reduces computation and storage overhead, enhances environmental robustness, and improves user experience.
Smart Images

Figure CN122244148A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of visual processing technology, and in particular to a fast relocation method based on tight coupling between sparse semantic anchors and inertial sensing. Background Technology
[0002] With the development of smart devices, embedded cameras, inertial measurement units (IMUs), visual SLAM (vSLAM), and visual-inertial SLAM (viSLAM) technologies are enabling diverse and widespread applications. The core of augmented reality devices lies in accurately and stably overlaying virtual information onto the real world. This relies on the device's ability to know its position and attitude in the environment in real time and with precision—a six-degree-of-freedom pose. Currently, simultaneous localization and mapping (SLAM) technology based on the fusion of vision and IMUs is the mainstream solution. This technology achieves stable tracking by continuously tracking visual feature points in the environment and fusing high-frequency data from the IMU.
[0003] However, existing technologies have significant limitations when facing complex real-world scenarios. First, visual tracking is prone to failure when users turn their heads quickly, pass through areas with weak textures such as white walls / plain desktops, or encounter strong changes in lighting. In such cases, the system needs to relocalize—that is, redetermine its position in the known environment. Traditional relocalization methods rely on global feature matching with a pre-built dense 3D point cloud map. This process is computationally intensive and time-consuming, typically taking hundreds of milliseconds to several seconds, causing virtual content to drift drastically or disappear completely, severely disrupting the user experience. Second, maintaining and matching a large, dense map puts continuous pressure on the computing power and battery life of mobile devices. Finally, dense maps are sensitive to dynamic changes in the environment, which can easily lead to positioning errors. Summary of the Invention
[0004] In view of this, the present invention proposes a fast relocalization method based on tight coupling of sparse semantic anchors and inertial sensing, which can achieve millisecond-level fast recovery after visual tracking loss, has stronger adaptability to dynamic environments, and has lower computation and storage overhead.
[0005] The technical solution of this invention is implemented as follows:
[0006] A fast relocalization method based on tight coupling between sparse semantic anchors and inertial sensing includes the following steps:
[0007] Step S1: Collect environmental image sequences, extract and filter feature points with stable geometric properties and high-level semantic properties in the environment to generate a sparse semantic anchor point set, and construct a lightweight sparse semantic map.
[0008] Step S2: Using the sparse semantic map as the tracking reference, start the main thread of visual inertial SLAM, continuously perform visual feature tracking, inertial data pre-integration and tightly coupled pose optimization, and output the device's six degrees of freedom pose and visual tracking status data in real time.
[0009] Step S3: Calculate multi-dimensional tracking quality indicators based on six-degree-of-freedom pose and visual tracking status data and complete comprehensive health assessment. When tracking abnormality is determined based on multi-dimensional tracking quality indicators or comprehensive health assessment, output a relocation trigger signal.
[0010] Step S4: Based on the relocation trigger signal, downgrade the main thread of visual inertial SLAM and switch to pure IMU dead reckoning mode. Perform short-term pose prediction based on high-frequency data from the inertial measurement unit and output the IMU predicted pose in real time.
[0011] Step S5: Based on the relocation trigger signal, start the fast relocation thread, capture the current visual frame, extract feature points and perform feature matching with sparse semantic anchor points in the sparse semantic map;
[0012] Step S6: When a valid anchor point that meets the preset number requirement is matched, the current visual six-degree-of-freedom pose of the device is calculated by the perspective n-point algorithm.
[0013] Step S7: Using the IMU predicted pose as the basis for state prediction and the visual six-DOF pose as the absolute observation correction value, perform tight coupling fusion of the visual six-DOF pose and the IMU predicted pose, and output the optimal six-DOF pose.
[0014] Preferably, step S1 includes the following steps:
[0015] The system acquires environmental image sequences and performs pixel-level semantic classification on each frame of the environmental image sequence using a lightweight real-time semantic segmentation convolutional neural network. It then outputs the semantic category probability and a highly stable semantic region mask for each pixel.
[0016] Within a region defined by a highly stable semantic region mask, corner detectors and line segment detectors are run in parallel to extract candidate feature points, and a robust visual descriptor with rotation and scale invariance is calculated for each candidate feature point.
[0017] Optical flow tracking is performed on candidate feature points between consecutive image frames to select stable tracked candidate feature points. The three-dimensional spatial position of these points in the world coordinate system is calculated using structure of motion reconstruction (SRM) technology. After removing mismatches and outliers using a random sampling consensus algorithm, stable feature points with three-dimensional coordinates are obtained.
[0018] Based on preset screening criteria, stable feature points with three-dimensional coordinates are screened to obtain sparse semantic anchors. Each sparse semantic anchor is encoded and stored. The encoded information includes three-dimensional world coordinates, semantic category labels, visual descriptors and historical observation perspectives. The encoded data of all sparse semantic anchors together constitute a lightweight sparse semantic map.
[0019] Preferably, the preset screening criteria include:
[0020] Semantic salience: located on clear semantic boundaries or specific semantic objects;
[0021] Tracking stability: Successfully tracked in more than a threshold number of consecutive frames;
[0022] 3D position covariance: Low uncertainty in 3D spatial position estimation.
[0023] Preferably, step S2 includes the following specific steps:
[0024] Using sparse semantic gradient as the tracking benchmark, for each new input frame, feature points are tracked using the LK optical flow method within a preset neighborhood of the feature point location in the previous frame. For feature points that fail to be tracked, full-image re-detection and matching are performed, and the two-dimensional-three-dimensional feature point pairs matched between the current frame and the previous frame are output.
[0025] Acquire raw IMU data synchronized with the image frame timestamps, perform zero bias compensation and integration processing on the raw IMU data, and use IMU pre-integration technology to calculate the pre-integrated quantities of the relative position, velocity, and attitude changes of the IMU data between two adjacent image frames, as well as the corresponding covariance matrix.
[0026] Based on the optimization of 2D-3D feature point pairs, pre-integral quantities, and covariance matrices, a sliding window containing the states of the most recent keyframes is maintained. A nonlinear optimization problem is constructed with the visual reprojection residual and the IMU pre-integral residual as optimization objectives. The optimal pose, velocity, and IMU zero bias of the keyframes within the sliding window are solved by a nonlinear optimization algorithm, and the six-degree-of-freedom pose and visual tracking status data of the device are output in real time.
[0027] Preferably, step S3 includes the following specific steps:
[0028] Multi-dimensional quality indicators are calculated based on visual tracking state data, six-degree-of-freedom pose, and IMU raw data. These multi-dimensional quality indicators include the number of tracked feature points, average reprojection error, feature point distribution uniformity, and IMU-visual pose consistency.
[0029] Different weights are assigned to the multi-dimensional quality indicators. After normalization and weighted calculation, the comprehensive health score is obtained by summing. When the comprehensive health score is lower than the health threshold, or when any quality indicator is lower than the corresponding emergency threshold in two consecutive frames, it is judged as a tracking anomaly, and a relocation trigger signal is output.
[0030] Preferably, the specific steps of step S4 are as follows:
[0031] Upon receiving the relocation trigger signal, the visual inertial SLAM main thread is downgraded, and the six-DOF pose and velocity at the last moment before the downgrade is triggered are used as the initial state for IMU prediction to complete the dead reckoning state initialization.
[0032] Using the initial state parameters as the recursive reference, the original measurement values of the gyroscope and accelerometer are read in each IMU measurement cycle. After completing the zero bias compensation, the device attitude is updated by integrating the quaternion differential equation, and the acceleration in the body coordinate system is transformed to the world coordinate system.
[0033] The acceleration in the world coordinate system is integrated to obtain the velocity increment, and the velocity is updated. The velocity is then integrated to obtain the position increment, and the position is updated. Based on the updated velocity and position, the IMU predicted pose is output in real time.
[0034] Preferably, the specific steps of step S5 are as follows:
[0035] Extract the current frame feature points of the semantic region corresponding to the sparse semantic anchor in the sparse semantic map from the current visual frame, and calculate the current frame visual descriptor of the same type as the sparse semantic anchor.
[0036] A fast approximate nearest neighbor search algorithm is used to quickly match the visual descriptor of the current frame with the visual descriptor of sparse semantic anchor points in the sparse semantic map to generate initial matching pairs.
[0037] By eliminating erroneous matches from the initial matching pairs through bidirectional matching verification, valid matching pairs are obtained. When the number of valid matching pairs is greater than or equal to the preset number, the matching is considered successful, and the valid matching pairs are output.
[0038] Preferably, step S6 includes the following specific steps:
[0039] Data preprocessing is performed on valid matching pairs. The preprocessing includes: back-projecting the two-dimensional pixel coordinates of the current frame in the valid matching pair to the normalized camera coordinate system through the camera intrinsic parameter matrix to obtain normalized planar coordinates.
[0040] Using the preprocessed valid matching pairs as input, the core PnP solver is called to calculate candidate poses based on the random sampling consensus algorithm, the minimum sample set is randomly selected, the number of inliers of each candidate pose is counted, and the candidate pose with the most inliers is selected as the initial pose solution.
[0041] Using the initial pose solution as the optimization benchmark, the initial pose is nonlinearly optimized by minimizing the reprojection error using all interior points to obtain the visual six-DOF pose.
[0042] Preferably, the tightly coupled fusion in step S7 is implemented using an error-state Kalman filter framework, and the specific steps are as follows:
[0043] Filter state definition: Define the nominal state vector of the filter, including device position, velocity, attitude quaternion, IMU gyroscope zero bias, accelerometer zero bias. The error state between the nominal state and the real state is used as the estimation object of the filter, and the IMU predicted pose is used as the recursive reference of the nominal state.
[0044] Filter prediction update: Based on the IMU dynamics model, in each IMU measurement cycle, the nominal state is updated driven by the raw IMU data, and the linearized dynamic equation of the error state is derived to update the covariance matrix of the error state.
[0045] Visual observation update: When the observation values of the visual six-degree-of-freedom pose are obtained, the observation equation is constructed and linearized to obtain the observation matrix based on the updated nominal state, the Kalman gain is calculated, and the estimated value of the error state and the covariance matrix are updated based on the observation residual.
[0046] Error injection and state reset: The updated error state estimate is injected into the nominal state to complete the correction of the nominal state. The error state is reset and the corresponding covariance matrix is adjusted. The corrected nominal state is output as the optimal pose of the six degrees of freedom.
[0047] Preferably, after outputting the optimal pose for six degrees of freedom in step S7, the visual inertial SLAM main thread is notified to reinitialize the visual tracking state and continue the visual feature tracking and tightly coupled pose optimization process.
[0048] Compared with the prior art, the beneficial effects of the present invention are:
[0049] 1. Extremely fast relocation speed: The relocation time is reduced from hundreds of milliseconds in traditional methods to less than tens of milliseconds, achieving instantaneous recovery that is imperceptible to users.
[0050] 2. Significantly enhanced environmental robustness: Sparse semantic anchors are more stable to changes in illumination, motion blur and local scene dynamics, improving the relocalization success rate in complex scenes.
[0051] 3. Low system resource consumption: The data volume of sparse semantic maps is 1-2 orders of magnitude smaller than that of dense point cloud maps, which greatly reduces memory consumption, storage requirements and matching calculations, and is beneficial to the battery life of mobile devices.
[0052] 4. Easier map maintenance: Sparse semantic anchors are more tolerant of environmental changes, and changes in local scenes are less likely to cause the entire map to fail, reducing the frequency and difficulty of map maintenance.
[0053] 5. Improved user experience and system reliability: Effectively reduced screen stuttering, drifting and interruptions caused by tracking loss in augmented reality applications, providing a technical foundation for deploying AR applications in a wider and more dynamic environment. Attached Figure Description
[0054] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only preferred embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0055] Figure 1 This is a flowchart of the fast relocation method based on tight coupling between sparse semantic anchors and inertial sensing according to the present invention;
[0056] Figure 2 This is a hardware and software architecture block diagram of the fast relocation method based on tight coupling of sparse semantic anchors and inertial sensing according to the present invention.
[0057] Figure 3 This is a schematic diagram of sparse semantic anchor point extraction and representation in the fast relocation method based on tight coupling between sparse semantic anchor points and inertial sensing according to the present invention.
[0058] Figure 4 This diagram illustrates a comparison between the fast relocation method based on sparse semantic anchors and tight coupling with inertial sensing, as described in this invention, and the traditional dense feature point relocation process. Detailed Implementation
[0059] To better understand the technical content of this invention, a specific embodiment is provided below, and the invention will be further described in conjunction with the accompanying drawings.
[0060] See Figures 1 to 4 The fast relocation method based on tight coupling between sparse semantic anchors and inertial sensing provided by this invention includes the following steps:
[0061] Step S1: Collect environmental image sequences, extract and filter feature points with stable geometric properties and high-level semantic properties in the environment to generate a sparse semantic anchor point set, and construct a lightweight sparse semantic map. Specific steps include:
[0062] The system acquires environmental image sequences and performs pixel-level semantic classification on each frame of the environmental image sequence using a lightweight real-time semantic segmentation convolutional neural network. It then outputs the semantic category probability and a highly stable semantic region mask for each pixel.
[0063] Within a region defined by a highly stable semantic region mask, corner detectors and line segment detectors are run in parallel to extract candidate feature points, and a robust visual descriptor with rotation and scale invariance is calculated for each candidate feature point.
[0064] Optical flow tracking is performed on candidate feature points between consecutive image frames to select stable tracked candidate feature points. The three-dimensional spatial position of these points in the world coordinate system is calculated using structure of motion reconstruction (SRM) technology. After removing mismatches and outliers using a random sampling consensus algorithm, stable feature points with three-dimensional coordinates are obtained.
[0065] Stable feature points with 3D coordinates are selected based on preset screening criteria to obtain sparse semantic anchors. The preset screening criteria include: semantic saliency: the feature point is located on a clear semantic boundary or a specific semantic object; tracking stability: it is successfully tracked in more than a threshold number of consecutive frames; 3D position covariance: the 3D spatial position covariance obtained by triangulation is less than a preset threshold, and the spatial position estimation uncertainty is low. Each sparse semantic anchor is encoded and stored. The encoded information includes 3D world coordinates, semantic category label, visual descriptor and historical observation view set. The encoded data of all sparse semantic anchors together constitute a lightweight sparse semantic map.
[0066] During system initialization, the device is powered on, and all sensors, cameras, and IMUs are initialized. The user is then guided to scan the surrounding environment. The augmented reality device's visual perception module acquires a sequence of environmental images. For each frame of the environmental image sequence, a lightweight, pre-trained real-time semantic segmentation convolutional neural network based on the MobileNetV3-Small backbone is used for analysis. The network outputs the semantic category probability for each pixel, such as "wall," "door frame," "window," "monitor," and "desktop." Simultaneously, regions corresponding to categories such as "door frame," "window corner," "fixed sign," and "edge of large appliances" are prioritized as highly stable semantic region masks. Then, within the regions defined by these highly stable semantic region masks, two feature detectors run in parallel. The corner detector uses either Shi-Tomasi or FAST corner detectors. The algorithm identifies pixels with high gradient changes in the image. The line segment detector uses the LSD or EDLines algorithm to extract line segment features. The endpoints of the line segments and points with high curvature (such as corners) are excellent candidate points, thus obtaining several candidate feature points. For each candidate feature point, its robust visual descriptor is calculated, with ORB descriptors preferred because they have rotation and scale invariance and high computational efficiency. Then, between consecutive image frames, descriptor-based optical flow tracking or feature matching is used to track these candidate feature points. For candidate feature points that are stably tracked in consecutive frames, the binocular visual disparity motion recovery structure technique is used to calculate their three-dimensional spatial position in the world coordinate system and complete triangulation. During the triangulation process, the RANSAC algorithm is used to remove mismatches and outliers with large reprojection errors to ensure the accuracy of the three-dimensional position estimation.
[0067] Finally, candidate points must meet a pre-defined three-level screening criterion to be identified as sparse semantic anchors. The selected sparse semantic anchors are then encoded and stored, their information encoded into a data structure. Ultimately, the encoded data of all sparse semantic anchors together constitute a lightweight sparse semantic map, such as... Figure 3 It intuitively demonstrates the process of extracting and encoding from the original environment to abstract semantic anchors.
[0068] Left half (scene): Shows an indoor scene containing a door, table, and monitor.
[0069] The right half (map) shows a three-dimensional coordinate system containing three stored sparse semantic anchors (Anchor_A, Anchor_B, Anchor_C), which are no longer a dense cloud of points, but sparsely distributed key locations with semantic labels.
[0070] Step S2: Using the sparse semantic map as the tracking reference, start the main thread of visual inertial SLAM, continuously execute visual feature tracking, inertial data pre-integration and tightly coupled pose optimization, and output the device's six-DOF pose and visual tracking status data in real time. Specific steps include:
[0071] Using a sparse semantic map as the tracking benchmark, for each new input frame, feature points are tracked using the LK optical flow method within a preset neighborhood of the feature point locations in the previous frame. For feature points that fail to be tracked, full-map re-detection and matching are performed, and the two-dimensional-three-dimensional feature point pairs matched between the current frame and the previous frame are output.
[0072] Acquire raw IMU data synchronized with the image frame timestamps, perform zero bias compensation and integration processing on the raw IMU data, and use IMU pre-integration technology to calculate the pre-integrated quantities of the relative position, velocity, and attitude changes of the IMU data between two adjacent image frames, as well as the corresponding covariance matrix.
[0073] Based on the optimization of 2D-3D feature point pairs, pre-integral quantities, and covariance matrices, a sliding window containing the states of the most recent keyframes is maintained. A nonlinear optimization problem is constructed with the visual reprojection residual and the IMU pre-integral residual as optimization objectives. The optimal pose, velocity, and IMU zero bias of the keyframes within the sliding window are solved by a nonlinear optimization algorithm, and the six-degree-of-freedom pose and visual tracking status data of the device are output in real time.
[0074] After constructing the sparse semantic map, the system enters a stable operating state, and the visual SLAM main thread works, while simultaneously maintaining and updating the sparse semantic map, such as adding new reliable anchor points. The visual SLAM main thread continuously performs visual feature tracking, inertial data fusion, and pose calculation. Its core working loop under normal conditions is as follows:
[0075] Visual front end (feature tracking):
[0076] Feature extraction and matching: feature point tracking is performed on each new input frame of the image;
[0077] Front-end output: Outputs the pixel displacement (optical flow) of a series of two-dimensional feature points between the current frame and the previous frame, as well as the known three-dimensional map coordinates corresponding to these feature points, to obtain matching two-dimensional-three-dimensional feature point pairs.
[0078] Inertial data preprocessing:
[0079] The raw IMU data synchronized with the image frame timestamps is integrated. To avoid repeated integration during optimization, IMU pre-integration is used. The raw IMU data is integrated between two adjacent image frames to obtain the pre-integrated quantities of relative position, velocity and attitude changes. At the same time, the covariance matrix of the pre-integrated quantity is calculated to characterize the noise.
[0080] Tightly Coupled Pose Optimization:
[0081] The system maintains a sliding window containing the states of several recent keyframes, such as position, pose, velocity, and IMU bias. Then, based on a graph optimization framework, a nonlinear optimization problem is constructed, including visual residuals and IMU residuals. The visual residuals are calculated as follows: for each pair of matched feature points, the predicted value of its 3D map point projection onto the current frame's pixel coordinates is calculated, and the difference is made between this prediction and the actual tracked pixel coordinates. The IMU residuals are calculated as follows: the difference is made between the state change observed by IMU pre-integration and the state change predicted based on the keyframe states within the sliding window. Finally, nonlinear optimization algorithms such as the Gauss-Newton method or the Levenberg-Marquardt method are used to minimize the weighted sum of squares of the visual and IMU residuals, thereby solving for the optimal pose, velocity, and IMU bias for all keyframes within the sliding window, and outputting the device's six-DOF pose and visual tracking state data.
[0082] Step S3: Calculate multi-dimensional tracking quality indicators based on six-DOF pose and visual tracking status data and complete a comprehensive health assessment. When a tracking anomaly is determined based on the multi-dimensional tracking quality indicators or the comprehensive health assessment, output a relocation trigger signal. Specific steps include:
[0083] Multi-dimensional quality indicators are calculated based on visual tracking state data, six-degree-of-freedom pose, and IMU raw data. These multi-dimensional quality indicators include the number of tracked feature points, average reprojection error, feature point distribution uniformity, and IMU-visual pose consistency.
[0084] Different weights are assigned to the multi-dimensional quality indicators. After normalization and weighted calculation, the comprehensive health score is obtained by summing. When the comprehensive health score is lower than the health threshold, or when any quality indicator is lower than the corresponding emergency threshold in two consecutive frames, it is judged as a tracking anomaly, and a relocation trigger signal is output.
[0085] A comprehensive health assessment is performed using multi-dimensional quality indicators. Visual abnormalities are then identified by combining these quality indicators with the overall health assessment. The multi-dimensional quality indicators include:
[0086] Number of tracked feature points: Count the total number of feature points successfully tracked by optical flow in the current frame, and set a minimum threshold N_min (e.g., 30).
[0087] Average reprojection error: Using the current optimal pose, project the 3D map points of all tracked feature points back into the image, and calculate the average pixel distance between the projected points and the actual tracked points.
[0088] Uniformity of feature point distribution: Divide the image into M x N grids, count the number of feature points in each grid, calculate its entropy or standard deviation, and evaluate whether the feature points are overly concentrated in a certain area.
[0089] IMU-Visual Consistency: Compares the pose changes predicted by pure IMU integration with those obtained by visual optimization over a recent short period.
[0090] By assigning a weight to the above quality indicators, normalizing and weighting them, and then summing them, a comprehensive health score can be obtained. When a tracking anomaly is determined, there are two judgment methods. The first is that the comprehensive health score is lower than the health threshold. The second is that when any quality indicator is lower than the preset emergency threshold within two consecutive frames, a tracking anomaly can be determined, and a relocation trigger signal can be output, thereby enabling a fast relocation process.
[0091] Step S4: Based on the relocation trigger signal, downgrade the main thread of visual-inertial SLAM and switch to pure IMU dead reckoning mode. Perform short-term pose prediction based on high-frequency data from the inertial measurement unit and output the IMU predicted pose in real time. The specific steps are as follows:
[0092] Upon receiving the relocation trigger signal, the visual inertial SLAM main thread is downgraded, and the six-DOF pose and velocity at the last moment before the downgrade is triggered are used as the initial state for IMU prediction to complete the dead reckoning state initialization.
[0093] Using the initial state parameters as the recursive reference, the original measurement values of the gyroscope and accelerometer are read in each IMU measurement cycle. After completing the zero bias compensation, the device attitude is updated by integrating the quaternion differential equation, and the acceleration in the body coordinate system is transformed to the world coordinate system.
[0094] The acceleration in the world coordinate system is integrated to obtain the velocity increment, and the velocity is updated. The velocity is then integrated to obtain the position increment, and the position is updated. Based on the updated velocity and position, the IMU predicted pose is output in real time.
[0095] After issuing the relocation trigger signal, the system enters the abnormal handling mode, which is carried out in parallel in two ways: the main thread degradation and the fast relocation thread. When the main thread is in degradation mode, the visual SLAM main thread pauses the complex visual optimization and switches to the pure IMU prediction mode. It uses the high-frequency data of the inertial measurement unit to provide a continuous but cumulative pose estimate in a short time through dead reckoning in order to maintain the most basic spatial coherence and avoid the user from feeling dizzy.
[0096] The core objective of dead reckoning is to estimate the device's pose state at subsequent moments based on the high-precision pose data of the last frame before the anomaly is triggered. Therefore, it is necessary to use the six-degree-of-freedom pose and velocity at the last moment before the degrade is triggered as the initial state, and then perform recursion in each IMU measurement cycle, including calculating the attitude by integrating the angular velocity and calculating the velocity and position by integrating the acceleration.
[0097] Step S5: Based on the relocation trigger signal, start the fast relocation thread, capture the current visual frame, extract feature points, and perform feature matching with sparse semantic anchor points in the sparse semantic map. The specific steps are as follows:
[0098] Extract the current frame feature points of the semantic region corresponding to the sparse semantic anchor in the sparse semantic map from the current visual frame, and calculate the current frame visual descriptor of the same type as the sparse semantic anchor.
[0099] A fast approximate nearest neighbor search algorithm is used to quickly match the visual descriptor of the current frame with the visual descriptor of sparse semantic anchor points in the sparse semantic map to generate initial matching pairs.
[0100] By eliminating erroneous matches from the initial matching pairs through bidirectional matching verification, valid matching pairs are obtained. When the number of valid matching pairs is greater than or equal to the preset number, the matching is considered successful, and the valid matching pairs are output.
[0101] After the relocation trigger signal is output, another parallel process is the fast relocation thread. This thread captures the current visual frame and starts a fast matching process. It uses a fast approximate nearest neighbor search algorithm to quickly match the feature points extracted from the current frame with the sparse semantic anchors stored in the sparse semantic map and obtain initial matching pairs. Then, it eliminates false matches through bidirectional matching verification and finally obtains valid matching pairs. When the number of valid matching pairs is greater than or equal to the preset number (usually 4), it is considered a successful match. If the match fails, the relocation thread will continue to try for a short period of time. If it still fails after multiple attempts, the system may degrade to relying solely on the IMU or prompt the user.
[0102] Step S6: When a valid anchor point that meets the preset number requirement is matched, the current visual six-degree-of-freedom pose of the device is calculated using the perspective n-point algorithm. Specific steps include:
[0103] Data preprocessing is performed on valid matching pairs. The preprocessing includes: back-projecting the two-dimensional pixel coordinates of the current frame in the valid matching pair to the normalized camera coordinate system through the camera intrinsic parameter matrix to obtain normalized planar coordinates.
[0104] Using the preprocessed valid matching pairs as input, the core PnP solver is called to calculate candidate poses based on the random sampling consensus algorithm, the minimum sample set is randomly selected, the number of inliers of each candidate pose is counted, and the candidate pose with the most inliers is selected as the initial pose solution.
[0105] Using the initial pose solution as the optimization benchmark, the initial pose is nonlinearly optimized by minimizing the reprojection error using all interior points to obtain the visual six-DOF pose.
[0106] Using the 2D pixel coordinates of successfully matched valid pairs and their known 3D world coordinates in the map, the precise 6-DOF visual pose of the device can be directly calculated using the perspective n-point algorithm. After obtaining valid matching pairs, data preprocessing is performed. Due to the possibility of mismatches, the RANSAC framework is used to randomly select the smallest sample set and call the core PnP solver to calculate candidate poses. All 3D points are projected onto the image using the candidate poses, and the number of inliers whose distance from the projected points to the measured 2D points is less than a threshold is counted. After repeating the above process several times, the sample set with the largest number of inliers and its corresponding pose can be selected as the initial pose solution. Finally, using all inliers, the initial pose solution is nonlinearly optimized by minimizing the reprojection error to obtain the visual 6-DOF pose.
[0107] The specific solution process of the core PnP solver is as follows:
[0108] Control point selection: Select 4 non-coplanar virtual control points for all 3D points (e.g., use the centroid of all 3D points plus three principal directions).
[0109] Coordinate representation: All 3D points are represented as a weighted sum of these 4 control points.
[0110] Equation construction: Establish a system of linear equations by projecting the 2D normalized coordinates and 3D points (represented by control points), where the unknowns are the coordinates of the control points in the camera coordinate system.
[0111] Solving and Restoration: Solve the system of linear equations to obtain the coordinates of the control points in the camera coordinate system. Then, by comparing the coordinates of the control points in the world coordinate system and the camera coordinate system, the rotation matrix R and translation vector t of the camera relative to the world coordinate system can be solved.
[0112] Step S7: Using the IMU-predicted pose as the basis for state prediction and the visual six-DOF pose as the absolute observation correction value, perform tight-coupled fusion of the visual six-DOF pose and the IMU-predicted pose, and output the optimal six-DOF pose. The tight-coupled fusion is implemented using an error-state Kalman filter framework. The specific steps are as follows:
[0113] Filter state definition: Define the nominal state vector of the filter, including device position, velocity, attitude quaternion, IMU gyroscope zero bias, accelerometer zero bias. The error state between the nominal state and the real state is used as the estimation object of the filter, and the IMU predicted pose is used as the recursive reference of the nominal state.
[0114] Filter prediction update: Based on the IMU dynamics model, in each IMU measurement cycle, the nominal state is updated driven by the raw IMU data, and the linearized dynamic equation of the error state is derived to update the covariance matrix of the error state.
[0115] Visual observation update: When the observation values of the visual six-degree-of-freedom pose are obtained, the observation equation is constructed and linearized to obtain the observation matrix based on the updated nominal state. This matrix describes how the error state affects the observation residual. The Kalman gain is calculated, and the estimated value of the error state and the covariance matrix are updated based on the observation residual.
[0116] Error injection and state reset: The updated error state estimate is injected into the nominal state to complete the correction of the nominal state. The error state is reset and the corresponding covariance matrix is adjusted. The corrected nominal state is output as the optimal pose of six degrees of freedom. After outputting the optimal pose of six degrees of freedom, the visual inertial SLAM main thread is notified to reinitialize the visual tracking state and continue the visual feature tracking and tightly coupled pose optimization process.
[0117] The calculated visual six-DOF pose is tightly coupled and fused with the IMU predicted pose. By utilizing the global accuracy of visual observation, the accumulated error of the IMU can be effectively corrected, and a high-precision and smooth final optimal six-DOF pose is output. The tight coupling fusion is implemented using an extended Kalman filter. Its specific implementation process includes filter state definition, prediction process, update process, and error injection and reset. After outputting the optimal six-DOF pose, it is output to the rendering engine to ensure that the virtual content can be stabilized immediately. At the same time, the visual inertial SLAM main thread is notified. After receiving the correct pose information, the visual inertial SLAM main thread reinitializes or seamlessly continues the previous tracking state using the current frame and pose, restores the full-function visual inertial SLAM operation, and returns to the normal operation state of step S2.
[0118] The beneficial effects of the present invention are illustrated below through an example:
[0119] This embodiment uses a split-type AR glasses as an example for illustration. The glasses include a binocular camera (global shutter, 30FPS) and a nine-axis IMU (1000Hz). The computing unit is placed in the processing box and is powered by the Qualcomm Snapdragon XR2 platform.
[0120] Hardware configuration:
[0121] Visual sensor: Dual 2-megapixel monochrome global shutter cameras for stereo vision and feature extraction.
[0122] Inertial sensor: Six-axis IMU (gyroscope + accelerometer).
[0123] Processing Units: The Hexagon DSP on the Snapdragon XR2 platform is used to run lightweight visual feature extraction and matching algorithms; the Kryo CPU runs the SLAM main thread, state machine, and fusion algorithm; and the Adreno GPU is used for necessary image preprocessing.
[0124] Software implementation details:
[0125] 1. Sparse semantic anchor extraction:
[0126] A lightweight convolutional neural network is used to perform real-time semantic segmentation on the input image, identifying potential stable regions such as "door frame", "window corner", "fixed sign", and "monitor edge".
[0127] Within these regions, feature points are extracted using an improved ORB feature detector, and their binary descriptors with semantic context information are computed.
[0128] By determining its stable 3D position through multi-frame observation and triangulation, a sparse semantic anchor point containing position, semantic label, and robust descriptor is finally generated.
[0129] 2. Fast thread relocation:
[0130] Set the thread priority to the highest level to ensure timely response.
[0131] During matching, the FLANN (Fast Approximate Nearest Neighbor Search) library is used to perform fast matching of binary descriptors.
[0132] The PnP solution uses the RANSAC+EPnP algorithm to combat possible false matches.
[0133] 3. Tightly coupled sensor fusion:
[0134] A fusion framework based on error-state Kalman filtering is adopted. State variables include position, velocity, attitude, and IMU bias.
[0135] When the relocation thread successfully provides the visual pose, it updates the ESKF as an absolute observation to quickly correct the IMU drift.
[0136] System workflow:
[0137] 1. Start-up and Map Building: After the user puts on the glasses, they slowly look around the room for about 15 seconds. The system completes the initial sparse semantic map construction, which contains about 50-100 sparse semantic anchors.
[0138] 2. Normal Use: The user begins using the AR application. The VIO main thread outputs pose stably at a frequency of 30Hz;
[0139] 3. Sudden rapid head turn: The user suddenly looks at a white wall. The number of tracked feature points drops sharply from 150 to 20 within 3 frames, triggering the "tracking loss" flag.
[0140] 4. Rapid recovery:
[0141] The main thread immediately switches to IMU mode, providing pose prediction at approximately 200Hz.
[0142] The relocation thread is started, the current white wall image is processed, and it is matched with sparse semantic anchors such as "wall-door frame intersection" and "corner" stored in the map. Five anchors are successfully matched within 15ms.
[0143] Calculate the visual pose and fuse it with IMU predictions.
[0144] From triggering to outputting a stable pose, the total time is approximately 35ms. The user only perceives a slight shaking of the screen before it stabilizes instantly, and no obvious drift is observed in the virtual content.
[0145] Resumption of operation: The main thread uses the restored pose to restart visual tracking, and the system returns to normal.
[0146] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A fast relocalization method based on tight coupling between sparse semantic anchors and inertial sensing, characterized in that, Includes the following steps: Step S1: Collect environmental image sequences, extract and filter feature points with stable geometric properties and high-level semantic properties in the environment to generate a sparse semantic anchor point set, and construct a lightweight sparse semantic map. Step S2: Using the sparse semantic map as the tracking reference, start the main thread of visual inertial SLAM, continuously perform visual feature tracking, inertial data pre-integration and tightly coupled pose optimization, and output the device's six degrees of freedom pose and visual tracking status data in real time. Step S3: Calculate multi-dimensional tracking quality indicators based on six-degree-of-freedom pose and visual tracking status data and complete comprehensive health assessment. When tracking abnormality is determined based on multi-dimensional tracking quality indicators or comprehensive health assessment, output a relocation trigger signal. Step S4: Based on the relocation trigger signal, downgrade the main thread of visual inertial SLAM and switch to pure IMU dead reckoning mode. Perform short-term pose prediction based on high-frequency data from the inertial measurement unit and output the IMU predicted pose in real time. Step S5: Based on the relocation trigger signal, start the fast relocation thread, capture the current visual frame, extract feature points and perform feature matching with sparse semantic anchor points in the sparse semantic map; Step S6: When a valid anchor point that meets the preset number requirement is matched, the current visual six-degree-of-freedom pose of the device is calculated by the perspective n-point algorithm. Step S7: Using the IMU predicted pose as the basis for state prediction and the visual six-DOF pose as the absolute observation correction value, perform tight coupling fusion of the visual six-DOF pose and the IMU predicted pose, and output the optimal six-DOF pose.
2. The fast relocation method based on tight coupling between sparse semantic anchors and inertial sensing as described in claim 1, characterized in that, The specific steps of step S1 include: The system acquires environmental image sequences and performs pixel-level semantic classification on each frame of the environmental image sequence using a lightweight real-time semantic segmentation convolutional neural network. It then outputs the semantic category probability and a highly stable semantic region mask for each pixel. Within a region defined by a highly stable semantic region mask, corner detectors and line segment detectors are run in parallel to extract candidate feature points, and a robust visual descriptor with rotation and scale invariance is calculated for each candidate feature point. Optical flow tracking is performed on candidate feature points between consecutive image frames to select stable tracked candidate feature points. The three-dimensional spatial position of these points in the world coordinate system is calculated using structure of motion reconstruction (SRM) technology. After removing mismatches and outliers using a random sampling consensus algorithm, stable feature points with three-dimensional coordinates are obtained. Based on preset screening criteria, stable feature points with three-dimensional coordinates are screened to obtain sparse semantic anchors. Each sparse semantic anchor is encoded and stored. The encoded information includes three-dimensional world coordinates, semantic category labels, visual descriptors and historical observation perspectives. The encoded data of all sparse semantic anchors together constitute a lightweight sparse semantic map.
3. The fast relocation method based on tight coupling between sparse semantic anchors and inertial sensing as described in claim 2, characterized in that, The preset screening criteria include: Semantic salience: located on clear semantic boundaries or specific semantic objects; Tracking stability: Successfully tracked in more than a threshold number of consecutive frames; 3D position covariance: Low uncertainty in 3D spatial position estimation.
4. The fast relocation method based on tight coupling between sparse semantic anchors and inertial sensing as described in claim 1, characterized in that, The specific steps of step S2 include: Using sparse semantic gradient as the tracking benchmark, for each new input frame, feature points are tracked using the LK optical flow method within a preset neighborhood of the feature point location in the previous frame. For feature points that fail to be tracked, full-image re-detection and matching are performed, and the two-dimensional-three-dimensional feature point pairs matched between the current frame and the previous frame are output. Acquire raw IMU data synchronized with the image frame timestamps, perform zero bias compensation and integration processing on the raw IMU data, and use IMU pre-integration technology to calculate the pre-integrated quantities of the relative position, velocity, and attitude changes of the IMU data between two adjacent image frames, as well as the corresponding covariance matrix. Based on the optimization of 2D-3D feature point pairs, pre-integral quantities, and covariance matrices, a sliding window containing the states of the most recent keyframes is maintained. A nonlinear optimization problem is constructed with the visual reprojection residual and the IMU pre-integral residual as optimization objectives. The optimal pose, velocity, and IMU zero bias of the keyframes within the sliding window are solved by a nonlinear optimization algorithm, and the six-degree-of-freedom pose and visual tracking status data of the device are output in real time.
5. The fast relocation method based on tight coupling between sparse semantic anchors and inertial sensing as described in claim 1, characterized in that, The specific steps of step S3 include: Multi-dimensional quality indicators are calculated based on visual tracking state data, six-degree-of-freedom pose, and IMU raw data. These multi-dimensional quality indicators include the number of tracked feature points, average reprojection error, feature point distribution uniformity, and IMU-visual pose consistency. Different weights are assigned to the multi-dimensional quality indicators. After normalization and weighted calculation, the comprehensive health score is obtained by summing. When the comprehensive health score is lower than the health threshold, or when any quality indicator is lower than the corresponding emergency threshold in two consecutive frames, it is judged as a tracking anomaly, and a relocation trigger signal is output.
6. The fast relocation method based on tight coupling between sparse semantic anchors and inertial sensing as described in claim 1, characterized in that, The specific steps of step S4 are as follows: Upon receiving the relocation trigger signal, the visual inertial SLAM main thread is downgraded, and the six-DOF pose and velocity at the last moment before the downgrade is triggered are used as the initial state for IMU prediction to complete the dead reckoning state initialization. Using the initial state parameters as the recursive reference, the original measurement values of the gyroscope and accelerometer are read in each IMU measurement cycle. After completing the zero bias compensation, the device attitude is updated by integrating the quaternion differential equation, and the acceleration in the body coordinate system is transformed to the world coordinate system. The acceleration in the world coordinate system is integrated to obtain the velocity increment, and the velocity is updated. The velocity is then integrated to obtain the position increment, and the position is updated. Based on the updated velocity and position, the IMU predicted pose is output in real time.
7. The fast relocation method based on tight coupling between sparse semantic anchors and inertial sensing as described in claim 1, characterized in that, The specific steps of step S5 are as follows: Extract the current frame feature points of the semantic region corresponding to the sparse semantic anchor in the sparse semantic map from the current visual frame, and calculate the current frame visual descriptor of the same type as the sparse semantic anchor. A fast approximate nearest neighbor search algorithm is used to quickly match the visual descriptor of the current frame with the visual descriptor of sparse semantic anchor points in the sparse semantic map to generate initial matching pairs. By eliminating erroneous matches from the initial matching pairs through bidirectional matching verification, valid matching pairs are obtained. When the number of valid matching pairs is greater than or equal to the preset number, the matching is considered successful, and the valid matching pairs are output.
8. The fast relocation method based on tight coupling between sparse semantic anchors and inertial sensing as described in claim 7, characterized in that, The specific steps of step S6 include: Data preprocessing is performed on valid matching pairs. The preprocessing includes: back-projecting the two-dimensional pixel coordinates of the current frame in the valid matching pair to the normalized camera coordinate system through the camera intrinsic parameter matrix to obtain normalized planar coordinates. Using the preprocessed valid matching pairs as input, the core PnP solver is called to calculate candidate poses based on the random sampling consensus algorithm, the minimum sample set is randomly selected, the number of inliers of each candidate pose is counted, and the candidate pose with the most inliers is selected as the initial pose solution. Using the initial pose solution as the optimization benchmark, the initial pose is nonlinearly optimized by minimizing the reprojection error using all interior points to obtain the visual six-DOF pose.
9. The fast relocation method based on tight coupling between sparse semantic anchors and inertial sensing as described in claim 1, characterized in that, The tightly coupled fusion in step S7 is implemented using an error-state Kalman filter framework. The specific steps are as follows: Filter state definition: Define the nominal state vector of the filter, including device position, velocity, attitude quaternion, IMU gyroscope zero bias, accelerometer zero bias. The error state between the nominal state and the real state is used as the estimation object of the filter, and the IMU predicted pose is used as the recursive reference of the nominal state. Filter prediction update: Based on the IMU dynamics model, in each IMU measurement cycle, the nominal state is updated driven by the raw IMU data, and the linearized dynamic equation of the error state is derived to update the covariance matrix of the error state. Visual observation update: When the observation values of the visual six-degree-of-freedom pose are obtained, the observation equation is constructed and linearized to obtain the observation matrix based on the updated nominal state, the Kalman gain is calculated, and the estimated value of the error state and the covariance matrix are updated based on the observation residual. Error injection and state reset: The updated error state estimate is injected into the nominal state to complete the correction of the nominal state. The error state is reset and the corresponding covariance matrix is adjusted. The corrected nominal state is output as the optimal pose of the six degrees of freedom.
10. The fast relocation method based on tight coupling between sparse semantic anchors and inertial sensing according to claim 1, characterized in that, After outputting the optimal pose for six degrees of freedom in step S7, the visual inertial SLAM main thread is notified to reinitialize the visual tracking state and continue the visual feature tracking and tightly coupled pose optimization process.