A dynamic environment SLAM positioning method based on multi-source sensor data fusion

By fusing multi-source sensor data and constructing semantic skeleton maps, the stability and accuracy issues of SLAM localization in dynamic environments are solved, achieving highly robust localization and map construction, and reducing the probability of mismatches and local drift.

CN122306084APending Publication Date: 2026-06-30HANGZHOU FEIKUO TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
HANGZHOU FEIKUO TECHNOLOGY CO LTD
Filing Date
2026-04-29
Publication Date
2026-06-30

Smart Images

  • Figure CN122306084A_ABST
    Figure CN122306084A_ABST
Patent Text Reader

Abstract

This invention discloses a dynamic environment SLAM localization method based on multi-source sensor data fusion, comprising the following steps: acquiring visual, inertial, and laser data and calibrating the buffer; performing HRNet segmentation and ORB-SLAM tracking on the visual data to obtain the initial pose; fusing the inertial and laser results to obtain the current pose; determining and recalculating the divergence of time window data to update the current pose; inserting keyframes and generating a semantic skeleton map; encoding and generating a semantic topology string and associating it with the keyframe index; retrieving loop closure candidate frames and performing geometric verification followed by map optimization to obtain the localization result and map result. This invention achieves highly robust localization and mapping in dynamic environments and improves the accuracy of loop closure recognition and map consistency.
Need to check novelty before this filing date? Find Prior Art

Description

Technical Field

[0001] This invention relates to the field of intelligent positioning technology, and in particular to a dynamic environment SLAM positioning method based on multi-source sensor data fusion. Background Technology

[0002] With the development of robots, autonomous mobile devices, and intelligent mapping applications, SLAM localization methods have been widely used in indoor navigation, unmanned inspection, and complex scene perception. In existing technologies, common solutions mainly employ visual SLAM, laser SLAM, or visual-inertial joint localization, achieving pose estimation and map construction through image feature matching, inertial pre-integration, and laser point cloud registration.

[0003] However, existing technologies still have significant shortcomings in dynamic environments. While multi-source sensor data can simultaneously provide visual, inertial, and geometric information, most current fusion methods directly calculate pose based on the current frame. When visual, inertial, and laser results diverge, there is usually a lack of a recovery process for short time windows, leading to the continuous propagation of local pose errors along the trajectory. Existing semantic maps mostly employ pixel label attachment or dense point cloud storage, lacking the extraction and organization of structural region centerlines, skeleton nodes, and skeleton line segments, making it difficult to form map representations suitable for keyframe association and structural expression. Existing loop closure detection typically focuses on comparing already present structural information, underutilizing structural information that should be present but is not, easily leading to mismatches in similar scenarios, affecting positioning accuracy and map quality.

[0004] Therefore, how to provide a dynamic environment SLAM localization method based on multi-source sensor data fusion is a problem that urgently needs to be solved by those skilled in the art. Summary of the Invention

[0005] One objective of this invention is to propose a dynamic environment SLAM localization method based on multi-source sensor data fusion. This invention comprehensively utilizes HRNet semantic segmentation, ORB-SLAM tracking, short-window playback recalculation, semantic skeleton mapping, and semantic topology loop closure detection techniques to achieve localization and map construction in dynamic environments. It has the advantages of high robustness, high localization accuracy, and accurate loop closure recognition.

[0006] A dynamic environment SLAM localization method based on multi-source sensor data fusion according to an embodiment of the present invention includes the following steps:

[0007] Collect visual data, inertial data, and laser data; complete calibration and establish the initial cache.

[0008] Visual data is input into HRNet to obtain semantic segmentation results, ORB features are extracted from static regions, and the initial pose is obtained based on ORB-SLAM tracking.

[0009] Inertial data is pre-integrated to obtain inertial results, laser data is registered to obtain laser results, and the initial pose, inertial results, and laser results are fused to obtain the current pose;

[0010] The time window data in the original cache is subjected to divergence determination. When the triggering condition is met, the time window data is replayed, registration and trajectory recovery are re-executed, candidate trajectories are generated, and the recalculated trajectory is selected to update the current pose.

[0011] Based on the updated pose, keyframes are inserted, structural region centerlines are extracted from semantic segmentation results and discretized into skeleton nodes and skeleton segments, and ORB features are attached to skeleton nodes and skeleton segments to generate a semantic skeleton map.

[0012] A semantic topology string is generated by encoding the arrangement and connectivity order of structural objects within the keyframe view, and then associated with the keyframe index.

[0013] The semantic topology string and historical semantic topology string are used to retrieve loop closure candidate frames. ORB geometric verification is performed, and the verified keyframes are written into the loop closure edge set and SLAM graph optimization is performed to obtain the localization result and map result.

[0014] Optionally, completing the calibration and establishing the original cache specifically includes:

[0015] The system continuously acquires image frames and generates visual data through a visual sensor, continuously acquires angular velocity and acceleration data and generates inertial data through an inertial measurement unit, and continuously acquires spatial scanning points and generates laser data through a lidar.

[0016] Add timestamps corresponding to the acquisition time to the visual data, inertial data, and laser data respectively, and establish a unified time series according to the timestamps;

[0017] Joint calibration is performed on the vision sensor, inertial measurement unit, and lidar to obtain the first extrinsic parameter of the vision sensor relative to the inertial measurement unit and the second extrinsic parameter of the lidar relative to the vision sensor; the vision data, inertial data, and lidar data are time-aligned according to a unified time series, and the coordinate transformation relationship is registered based on the first and second extrinsic parameters;

[0018] Write the data that has been aligned and associated with coordinates into the original cache in chronological order, and save the corresponding timestamp, first external parameter and second external parameter in the original cache.

[0019] Optionally, obtaining the initial pose based on ORB-SLAM tracking specifically includes:

[0020] Input the current image frame from the visual data into the pre-trained HRNet, and output the semantic segmentation result corresponding to the current image frame;

[0021] Dynamic target regions are removed based on semantic segmentation results, while static regions are retained.

[0022] Perform FAST corner detection within a static region, filter candidate feature points according to corner response values, and generate ORB features based on the rotated BRIEF descriptor;

[0023] The ORB features of the current image frame are matched with the ORB features of the previous image frame to obtain the inter-frame feature correspondence.

[0024] The inter-frame feature correspondence is input into the ORB-SLAM front end to perform motion estimation and tracking, generate the pose transformation of the current image frame relative to the previous image frame, and accumulate the continuous pose transformations into the initial pose.

[0025] Optionally, fusing the initial pose, inertial results, and laser results to obtain the current pose specifically includes:

[0026] Extract the inertial data within the time interval corresponding to the current image frame, perform pre-integration on the angular velocity data and acceleration data to obtain the inertial result relative to the previous image frame, and calculate the inertial pose corresponding to the current image frame based on the current pose and inertial result corresponding to the previous image frame.

[0027] Extract the laser data corresponding to the current image frame, perform distortion correction on the laser data, register the distortion-corrected laser data with the local point cloud to obtain the laser result corresponding to the current image frame, and determine the laser pose corresponding to the current image frame based on the laser result.

[0028] Read the initial pose, and transform the initial pose, inertial pose, and laser pose to a unified coordinate system based on the first and second extrinsic parameters;

[0029] The pose residuals are established for the initial pose, inertial pose and laser pose corresponding to the pose to be determined in the current image frame. The average least squares calculation of the multi-source pose based on the logarithmic mapping residual of Lie algebra is performed on the pose residuals to obtain the current pose corresponding to the current image frame.

[0030] Optionally, the divergence determination of the time window data in the original cache specifically includes:

[0031] Taking the current image frame as the endpoint, extract continuous time window data from the original buffer, calculate the relative rotation angle difference and relative translation distance between each pair of the initial pose, inertial pose and laser pose corresponding to the current image frame, and determine that the trigger condition is met when the maximum value of the relative rotation angle difference is greater than the rotation threshold or the maximum value of the relative translation distance is greater than the displacement threshold.

[0032] After the triggering condition is met, the current pose corresponding to the starting frame of the time window is fixed as the reference pose, and the poses corresponding to each image frame within the time window are set as the poses to be determined; the starting frame of the time window is selected as the first image frame in the captured continuous time window data.

[0033] Based on the time window data, the ORB feature correspondence between adjacent image frames within the time window, the inertial results corresponding to each image frame, and the laser results corresponding to each image frame are regenerated, and the laser results corresponding to each image frame are transformed to a unified coordinate system.

[0034] Using the reference pose, ORB feature correspondence, inertial results, and laser results, a joint least squares calculation is constructed to determine the pose of each image frame within the time window, and the time window pose sequence is obtained by solving the problem.

[0035] The pose sequence within the time window is used as the recalculated trajectory, and the pose corresponding to each image frame in the recalculated trajectory is used to replace the current pose in the original pose sequence within the time window.

[0036] Optionally, generating the semantic skeleton map specifically includes:

[0037] Insert the current image frame as a keyframe based on the updated pose;

[0038] Extract the binary mask corresponding to the structural region from the semantic segmentation result, perform connected component labeling on the binary mask and delete connected regions with an area less than the threshold, and retain the remaining connected regions;

[0039] By successively deleting boundary pixels from the retained connected regions while maintaining the connectivity, the center line of the structural region with a single pixel width is obtained.

[0040] Determine the endpoints and intersections along the centerline of the structural region, determine the centerline pixel sequence between the endpoints and intersections as the skeleton path, select sampling points along the skeleton path at a preset step size, and determine the endpoints, intersections, and sampling points as skeleton nodes, and determine the skeleton path between adjacent skeleton nodes as skeleton line segments.

[0041] Transform the ORB features to a unified coordinate system based on the updated pose, and attach them to the corresponding skeleton nodes or skeleton segments according to the principle of minimum distance.

[0042] Based on the updated pose, the skeleton nodes, skeleton segments, and attached ORB features are spliced ​​together to generate a semantic skeleton map.

[0043] Optionally, the step of encoding the semantic topology string according to the arrangement order and connectivity order of structural objects within the keyframe view specifically includes:

[0044] The field of view of the keyframe is determined based on the updated pose of the keyframe and the current image frame. Skeleton nodes and skeleton segments located within the field of view of the keyframe are selected from the semantic skeleton map. Skeleton nodes and skeleton segments with the same structural region category and connected to each other are identified as structural objects.

[0045] Each structural object is projected onto the keyframe image plane and sorted in ascending order according to the horizontal coordinates of the projection center point of each structural object.

[0046] Based on the connection relationship between skeleton nodes and skeleton segments, the connection order is recorded along the connection path between each structural object within the keyframe's field of view to obtain the connectivity order.

[0047] Each structural object is encoded sequentially according to its structural region category, arrangement order, and connectivity order. The encoding results of each structural object are then concatenated sequentially to generate the semantic topology string corresponding to the keyframe.

[0048] The semantic topology string is stored in correspondence with the keyframe number to form a keyframe index.

[0049] Optionally, obtaining the positioning results and map results specifically includes:

[0050] The semantic topology string of the current keyframe is compared with the historical semantic topology string in the keyframe index on an object-by-object basis. The number of matching objects is counted according to the structural region category, arrangement order and connectivity order. Historical keyframes with a number of matching objects reaching the threshold are identified as loop closure candidate frames.

[0051] For each loopback candidate frame, count the structural objects that appear repeatedly in the corresponding historical key frames and adjacent historical key frames. Record the structural objects that appear consecutively to the frequency threshold but do not appear in the loopback candidate frame as missing evidence templates.

[0052] Compare the structural objects that do not appear in the current keyframe with the missing evidence template, and delete the loop closure candidate frames that do not match the missing evidence template;

[0053] ORB features are extracted from the retained loop closure candidate frames and matched with the ORB features of the current keyframe. The relative pose between keyframe pairs is calculated based on the matching results and geometric verification is performed.

[0054] The keyframe pairs that pass geometric verification are written into the loop edge set. Based on the loop edge set and the keyframe pose, SLAM graph optimization is performed to update the keyframe pose and semantic skeleton map, and the localization result and map result are obtained.

[0055] The beneficial effects of this invention are:

[0056] (1) This invention fuses the initial pose, inertial pose, and laser pose, and performs joint recalculation on the time window data in the original buffer when discrepancies occur. This enables the pose sequence to be redefined within a local time period, avoiding the propagation of single-frame errors along a continuous trajectory. Compared to the method of directly outputting the pose based solely on the current frame result, this improves the stability and continuity of the localization process in dynamic environments.

[0057] (2) This invention extracts the structural regions from the semantic segmentation results as center lines, discretizes them into skeleton nodes and skeleton segments, and attaches ORB features to the skeleton nodes and skeleton segments to form a semantic skeleton map. Compared with the method of directly attaching semantic information to pixels or dense point clouds, this forms a map representation that combines structural regions with local features. It retains the structural backbone information of wall, corridor, door frame and stair area, as well as the local feature information used for localization, which is beneficial for map stitching between key frames and subsequent extraction of structural objects.

[0058] (3) This invention generates a semantic topology string by encoding the arrangement and connectivity order of structural objects within the keyframe view, and filters loop closure candidate frames by combining the missing evidence template. On this basis, ORB geometric verification is then performed, which can reduce the probability of mismatch in repetitive structural scenes. Compared with the method of similarity retrieval based solely on existing structural objects, this invention uses structural information that should have appeared but did not appear in historical keyframes to eliminate candidate results, making the writing of loop closure edges more targeted and beneficial to the accurate updating of subsequent SLAM graph optimization results. Attached Figure Description

[0059] The accompanying drawings are provided to further illustrate the invention and form part of the specification. They are used in conjunction with embodiments of the invention to explain the invention and do not constitute a limitation thereof. In the drawings:

[0060] Figure 1 This is a flowchart of a dynamic environment SLAM localization method based on multi-source sensor data fusion proposed in this invention;

[0061] Figure 2 This is a schematic diagram of semantic skeleton mapping for a dynamic environment SLAM localization method based on multi-source sensor data fusion proposed in this invention.

[0062] Figure 3 This is a schematic diagram illustrating the semantic topology string generation of a dynamic environment SLAM localization method based on multi-source sensor data fusion proposed in this invention. Detailed Implementation

[0063] The present invention will now be described in further detail with reference to the accompanying drawings. These drawings are simplified schematic diagrams, illustrating only the basic structure of the invention, and therefore only show the components relevant to the invention.

[0064] refer to Figures 1-3 A dynamic environment SLAM localization method based on multi-source sensor data fusion includes the following steps:

[0065] Collect visual data, inertial data, and laser data; complete calibration and establish the initial cache.

[0066] Visual data is input into HRNet to obtain semantic segmentation results, ORB features are extracted from static regions, and the initial pose is obtained based on ORB-SLAM tracking.

[0067] Inertial data is pre-integrated to obtain inertial results, laser data is registered to obtain laser results, and the initial pose, inertial results, and laser results are fused to obtain the current pose;

[0068] The time window data in the original cache is subjected to divergence determination. When the triggering condition is met, the time window data is replayed, registration and trajectory recovery are re-executed, candidate trajectories are generated, and the recalculated trajectory is selected to update the current pose.

[0069] Based on the updated pose, keyframes are inserted, structural region centerlines are extracted from semantic segmentation results and discretized into skeleton nodes and skeleton segments, and ORB features are attached to skeleton nodes and skeleton segments to generate a semantic skeleton map.

[0070] A semantic topology string is generated by encoding the arrangement and connectivity order of structural objects within the keyframe view, and then associated with the keyframe index.

[0071] The semantic topology string and historical semantic topology string are used to retrieve loop closure candidate frames. ORB geometric verification is performed, and the verified keyframes are written into the loop closure edge set and SLAM graph optimization is performed to obtain the localization result and map result.

[0072] In this embodiment, completing the calibration and establishing the original cache specifically includes:

[0073] The system continuously acquires image frames and generates visual data using a visual sensor, continuously acquires angular velocity and acceleration data and generates inertial data using an inertial measurement unit, and continuously acquires spatial scanning points and generates laser data using a lidar. The visual data is recorded as a continuous sequence of image frames, the inertial data is recorded as a six-axis measurement sequence arranged according to the sampling time, and the laser data is recorded as a point cloud sequence arranged according to the scanning cycle.

[0074] Add timestamps corresponding to the acquisition time to the visual data, inertial data, and laser data respectively, and establish a unified time series according to the timestamps;

[0075] Joint calibration is performed on the vision sensor, inertial measurement unit, and lidar to obtain the first extrinsic parameter of the vision sensor relative to the inertial measurement unit and the second extrinsic parameter of the lidar relative to the vision sensor. The vision data, inertial data, and lidar data are time-aligned according to a unified time series, and the coordinate transformation relationship is registered based on the first and second extrinsic parameters. The joint calibration is completed using a checkerboard calibration board in conjunction with hand-eye calibration. Both the first and second extrinsic parameters include a rotation matrix and a translation vector.

[0076] Write the data that has been aligned and associated with coordinates into the original cache in chronological order, and save the corresponding timestamp, first external parameter and second external parameter in the original cache.

[0077] In this embodiment, obtaining the initial pose based on ORB-SLAM tracking specifically includes:

[0078] The current image frame in the visual data is input into the pre-trained HRNet, which outputs the semantic segmentation result corresponding to the current image frame. HRNet adopts the HRNet-W18 network structure and outputs a pixel-level semantic label map of the same size as the current image frame.

[0079] Dynamic target regions are removed based on semantic segmentation results, while static regions are retained. A mask is generated for the pixel positions in the semantic label image that belong to the dynamic target regions. Corner detection is not performed on the areas covered by the mask during subsequent feature extraction. The semantic categories corresponding to the dynamic target regions are set as pedestrians, vehicles, and cyclists, while the semantic categories corresponding to the static regions are set as walls, ground, door frames, and pillars.

[0080] Perform FAST corner detection within a static region, filter candidate feature points according to corner response values, and generate ORB features based on the rotated BRIEF descriptor;

[0081] The ORB features of the current image frame are matched with the ORB features of the previous image frame to obtain the inter-frame feature correspondence. The similarity of ORB feature matching is calculated using Hamming distance. When the Hamming distance between the nearest ORB features is not greater than 30, the current ORB feature and the nearest ORB feature are determined to meet the threshold condition and are retained. When the Hamming distance is greater than 30, the threshold condition is not met and the feature is removed. Among the multiple feature point pairs pointing to the same feature point in the retained results, the feature point pair with the smallest Hamming distance is retained.

[0082] The inter-frame feature correspondence is input into the ORB-SLAM front end to perform motion estimation and tracking, generate the pose transformation of the current image frame relative to the previous image frame, and accumulate the continuous pose transformations into the initial pose; the motion estimation uses the random sampling consistency method to eliminate mismatched point pairs, and the initial pose is the cumulative pose result of the current image frame.

[0083] In this embodiment, fusing the initial pose, inertial results, and laser results to obtain the current pose specifically includes:

[0084] Inertial data within the time interval corresponding to the current image frame is extracted. Pre-integration is performed on the angular velocity data and acceleration data to obtain the inertial result relative to the previous image frame. The inertial pose corresponding to the current image frame is calculated based on the current pose and inertial result corresponding to the previous image frame. During the pre-integration process, zero bias compensation is first performed on the angular velocity data and acceleration data, and then the rotation increment, velocity increment and displacement increment are accumulated successively according to the sampling interval.

[0085] The laser data corresponding to the current image frame is extracted, distortion correction is performed on the laser data, and the distortion-corrected laser data is registered with the local point cloud to obtain the laser result corresponding to the current image frame. The laser pose corresponding to the current image frame is determined based on the laser result. The distortion correction uses the inertial result corresponding to the laser point sampling time to perform motion compensation for each laser point within the same scanning cycle. During the registration process, the pose transformation of the current laser data relative to the local point cloud is calculated iteratively using the point-to-surface error. The laser result is the pose transformation result obtained from the registration, and the laser pose is the pose corresponding to the current image frame in the laser coordinate system. The laser data corresponding to the previous 5 to 10 consecutive frames are extracted from the original buffer. Based on the current pose and second extrinsic parameter corresponding to each historical frame, the laser data of each historical frame is transformed to a unified coordinate system and superimposed in chronological order to form a local point cloud. When the number of historical frames is less than 5, the local point cloud is constructed using the laser data of all existing historical frames.

[0086] Read the initial pose, and transform the initial pose, inertial pose, and laser pose to a unified coordinate system based on the first and second extrinsic parameters; the unified coordinate system adopts the visual coordinate system;

[0087] The pose residuals are established for the pose to be determined in the current image frame, corresponding to the initial pose, inertial pose, and laser pose, respectively. The pose residuals are then subjected to multi-source pose average least squares calculation based on Lie algebra logarithmic mapping of the residuals to obtain the current pose corresponding to the current image frame. The pose residuals consist of the relative pose differences between the pose to be determined and the initial pose, inertial pose, and laser pose, and are converted into a residual vector of a uniform dimension through Lie algebra logarithmic mapping. The multi-source pose average least squares calculation is performed using an iterative optimization method, with the initial value being the initial pose.

[0088] In this embodiment, the divergence determination of the time window data in the original cache specifically includes:

[0089] Taking the current image frame as the endpoint, continuous time window data is extracted from the original buffer. The relative rotation angle difference and relative translation distance between each pair of the initial pose, inertial pose, and laser pose corresponding to the current image frame are calculated. When the maximum value of the relative rotation angle difference is greater than the rotation threshold or the maximum value of the relative translation distance is greater than the displacement threshold, the trigger condition is determined to be met. The relative rotation angle difference is obtained by transforming the relative rotation matrix between any two poses, and the relative translation distance is obtained by the Euclidean norm of the translation vector difference between any two poses.

[0090] After the triggering condition is met, the current pose corresponding to the starting frame of the time window is fixed as the reference pose, and the poses corresponding to each image frame within the time window are set as the poses to be determined. The starting frame of the time window is selected as the first image frame in the captured continuous time window data. When fixing the reference pose, only the pose of the starting frame of the time window is retained and does not participate in subsequent iteration updates. The poses corresponding to the remaining image frames within the time window are all used as poses to be determined and participate in trajectory recovery calculation.

[0091] Based on the time window data, the ORB feature correspondence between adjacent image frames within the time window, the inertial results corresponding to each image frame, and the laser results corresponding to each image frame are regenerated, and the laser results corresponding to each image frame are transformed to a unified coordinate system.

[0092] Using the reference pose, ORB feature correspondence, inertial results, and laser results, a joint least squares calculation is constructed for the pose to be determined corresponding to each image frame within the time window, and the time window pose sequence is obtained by solving the problem; the joint least squares calculation is the multi-source pose average least squares calculation.

[0093] The pose sequence within the time window is used as the recalculated trajectory, and the pose corresponding to each image frame in the recalculated trajectory is used to replace the current pose in the original pose sequence within the time window.

[0094] In this embodiment, generating a semantic skeleton map specifically includes:

[0095] Insert the current image frame as a keyframe based on the updated pose;

[0096] Binary masks corresponding to structural regions are extracted from the semantic segmentation results. Connectivity labeling is performed on the binary masks, and connected regions with areas smaller than a threshold are deleted, retaining the remaining connected regions. Structural regions are generated from preset static semantic categories in the semantic segmentation results, including wall regions, corridor regions, door frame regions, and stair regions. The semantic segmentation results are traversed pixel by pixel, with pixels belonging to the wall region, corridor region, door frame region, and stair region assigned a value of 1, and other pixels assigned a value of 0, forming a binary mask. Connectivity labeling is performed on the binary mask using an 8-neighborhood connectivity method, and the number of pixels in each connected region is counted. Connected regions with a pixel count smaller than the area threshold are deleted; the area threshold is set to 1% of the total number of pixels in the current image frame.

[0097] The process involves successively deleting boundary pixels from the retained connected regions while maintaining connectivity, resulting in a single-pixel-width structural region centerline. This deletion of boundary pixels is performed using an iterative thinning method. In each round, boundary pixels on the outer contour of the connected region are first searched to determine if deleting them would cause a break in the connected region. Boundary pixels that do not cause a break are deleted, while those that do cause a break are retained. This process of boundary pixel search, break detection, and deletion is repeated until the remaining pixel width is a single pixel, forming the structural region centerline.

[0098] The endpoints and intersections are determined along the centerline of the structural region. The pixel sequence of the centerline between the endpoints and intersections is determined as the skeleton path. Sampling points are selected along the skeleton path at a preset step size. The endpoints, intersections, and sampling points are determined as skeleton nodes. The skeleton path between adjacent skeleton nodes is determined as a skeleton line segment. Pixels with a connectivity of 1 are determined as endpoints, and pixels with a connectivity of more than 2 are determined as intersections. Starting from an endpoint or intersection, the system tracks pixel by pixel along the centerline to the next endpoint or intersection. The pixel sequence along the centerline is determined as a skeleton path. The pixel distance is accumulated along each skeleton path according to the path length for sampling. The sampling points, along with the endpoints and intersections, are recorded as skeleton nodes. The skeleton path between adjacent skeleton nodes is recorded as a skeleton line segment.

[0099] The ORB features are transformed to a unified coordinate system based on the updated pose, and attached to the corresponding skeleton nodes or skeleton segments according to the principle of minimum distance. When the minimum distance corresponds to a skeleton node, the ORB features are attached to the skeleton node. When the minimum distance corresponds to a skeleton segment, the ORB features are attached to the skeleton segment. When the minimum distance is greater than a distance threshold, no attachment is performed. The distance threshold is set to 0.2m.

[0100] Based on the updated pose, the skeleton nodes, skeleton segments, and attached ORB features are spliced ​​together to generate a semantic skeleton map.

[0101] In this embodiment, encoding the semantic topology string according to the arrangement order and connectivity order of structural objects within the keyframe's field of view specifically includes:

[0102] The field of view of a keyframe is determined based on the updated pose of the keyframe and the current image frame. Skeleton nodes and skeleton segments located within the field of view of the keyframe are selected from the semantic skeleton map. Skeleton nodes and skeleton segments with the same structural region category and connected to each other are identified as structural objects. The field of view of the keyframe is jointly determined by the imaging range of the current image frame and the updated pose. The image boundary is back-projected into the view frustum boundary according to the intrinsic parameter matrix of the current image frame. The view frustum boundary is transformed to a unified coordinate system according to the updated pose. Each skeleton node in the semantic skeleton map is checked to see if it is located within the view frustum boundary. Each skeleton segment is checked to see if its two endpoints are located within the view frustum boundary or intersect with the view frustum boundary. Skeleton nodes and skeleton segments that meet the conditions are retained. Skeleton nodes and skeleton segments with the same structural region category and connected by endpoints as a continuous path are merged into the same structural object.

[0103] Each structural object is projected onto the keyframe image plane, and sorted in ascending order according to the horizontal coordinates of the projection center points of each structural object. Projection calculations are performed on the skeleton nodes and skeleton line segment sampling points in each structural object to obtain the corresponding image plane coordinates. The horizontal and vertical coordinates of all projection points of the same structural object are averaged to obtain the projection center point of the structural object. The projection center points are arranged in ascending order according to their horizontal coordinates. When the difference between the horizontal coordinates of two projection center points is less than 5 pixels, they are arranged in ascending order according to their vertical coordinates.

[0104] Based on the connection relationship between skeleton nodes and skeleton segments, the connection order is recorded along the connection path between each structural object within the keyframe's field of view to obtain the connectivity order. A connectivity graph composed of skeleton nodes and skeleton segments is constructed within the keyframe's field of view, with skeleton nodes as graph nodes and skeleton segments as graph edges. Starting from the skeleton node corresponding to the structural object with the highest order, the connectivity graph is traversed segment by segment along the skeleton segments, and the access order is recorded once each new structural object is reached. When there are multiple subsequent connection branches, the skeleton segment with the smallest angle to the current traversal direction is selected first to continue traversing, thus obtaining the connectivity order.

[0105] Each structural object is encoded sequentially according to its structural region category, arrangement order, and connectivity order. The encoding results of each structural object are then concatenated sequentially to generate the semantic topology string corresponding to the keyframe. Category codes are pre-assigned to structural region categories.

[0106] The semantic topology string is stored in correspondence with the keyframe number to form a keyframe index.

[0107] In this embodiment, obtaining the positioning results and map results specifically includes:

[0108] The semantic topology string of the current keyframe is compared with the historical semantic topology string in the keyframe index on an object-by-object basis. The number of matching objects is counted according to the structural region category, arrangement order and connectivity order. Historical keyframes with a number of matching objects reaching the threshold are identified as loop closure candidate frames. The set interval time between historical keyframes and the current keyframe is used.

[0109] For each loopback candidate frame, the structural objects that appear repeatedly in the corresponding historical keyframes and adjacent historical keyframes are counted. Structural objects that appear consecutively a number of times and do not appear in the loopback candidate frames are recorded as missing evidence templates. During the count, structural objects with the same structural region category and a projection position difference less than a preset distance threshold are judged as repeated occurrences of the same structural object. The frequency threshold is set to 3 times. The missing evidence template records the structural region category, corresponding arrangement position, and corresponding connectivity position.

[0110] The structural objects not appearing in the current keyframe are compared with the missing evidence template, and loop closure candidate frames that do not match the missing evidence template are deleted. During the comparison, the structural region categories that do not appear are extracted from the semantic topology string corresponding to the current keyframe and compared with the structural region categories and corresponding arrangement positions in the missing evidence template. When the structural objects in the missing evidence template appear in the current keyframe, or when the structural objects missing in the current keyframe are inconsistent with the records in the missing evidence template, the corresponding loop closure candidate frames are deleted.

[0111] ORB features are extracted from the retained loop closure candidate frames and matched with the ORB features of the current keyframe. The relative pose between keyframe pairs is calculated based on the matching results and geometric verification is performed. Geometric verification uses the random sampling consistency method to remove mismatched point pairs and calculates the relative pose between keyframe pairs based on the retained feature point pairs. When the number of inliers is not less than 30 and the proportion of inliers is not less than 0.6, the geometric verification is deemed to have passed.

[0112] Keyframe pairs that pass geometric verification are written into the loop closure edge set. SLAM graph optimization is performed based on the loop closure edge set and keyframe poses to update the keyframe poses and semantic skeleton map, obtaining the localization and map results. The loop closure edge set records at least the current keyframe number, loop closure candidate frame number, and relative pose. SLAM graph optimization is performed using a pose graph optimization method, with keyframe poses as graph nodes and pose relationships between adjacent keyframes and loop closure edges as graph edges. After graph optimization, the updated keyframe poses are reapplied to the skeleton nodes, skeleton segments, and attached ORB features in the semantic skeleton map.

[0113] Example 1: To verify the feasibility of this invention in practice, it was applied to a mobile device navigation scenario in a large indoor dynamic passageway environment. This environment includes continuous passageways, opening areas, corner areas, and multiple door frames and staircase transition areas. While the static structure is stable, during equipment operation, there are numerous pedestrians passing through, short-term gatherings, cart obstructions, and localized glare. Relying solely on visual features is susceptible to interference from dynamic targets; inertial data accumulates errors over long-term propagation; and laser registration is prone to localized misregistration in repetitive structural areas, leading to pose jumps, loop closure false detections, and map ghosting.

[0114] In this application, mobile devices simultaneously acquire visual, inertial, and laser data, and establish an initial cache after calibration. Visual data is input into HRNet, which outputs semantic segmentation results, masking pedestrians, vehicles, and cyclists, and extracting ORB features only from static areas such as walls, ground, door frames, and pillars. ORB-SLAM is used to perform continuous frame tracking to obtain the initial pose; inertial data undergoes pre-integration to obtain the inertial pose, and laser data is distorted and registered with local point clouds to obtain the laser pose. The initial pose, inertial pose, and laser pose are unified into the visual coordinate system, and the current pose is calculated using the average least squares of the multi-source pose based on the Lie algebra logarithmic mapping residual, achieving multi-source fusion.

[0115] When the device passes through densely populated or obstructed areas, the three pose results may diverge. This invention does not directly use the current frame output, but instead uses the current frame as the endpoint, extracts short time window data from the original buffer, fixes the starting frame pose as the reference pose, and regenerates the ORB feature correspondence, inertial results, and laser results for the image frames within the window, jointly solving the pose sequence using least squares. This recalculated trajectory utilizes multi-frame information to recover local continuous pose, effectively suppressing jumps and drifts.

[0116] During the mapping phase, this invention extracts binary masks for wall, corridor, door frame, and staircase areas from the semantic segmentation results, deletes small connected regions, and refines the structural region centerlines. Endpoints, intersections, and sampling points are extracted to generate skeleton nodes and skeleton segments. ORB features are then attached to the corresponding nodes or segments to form a semantic skeleton map. This map preserves the main structure of the scene and integrates local features, improving the efficiency of keyframe association in repetitive structural regions and subsequent object extraction.

[0117] In the loop closure detection phase, structural objects within the keyframe's field of view are extracted and a topology string is generated based on their horizontal arrangement and skeleton connectivity paths. When retrieving loop closure candidate frames, structural objects that have appeared frequently in historical keyframes and adjacent frames but are missing from candidate frames are simultaneously compiled to form a missing evidence template. ORB geometric verification is only performed after no structural object in the current frame matches the missing template, effectively compressing the candidate set and reducing false detections. After successful verification, the edges are written into the loop closure edge set, and graph optimization is performed to update the keyframe pose and semantic skeleton map.

[0118] The comparison was conducted across three types of dynamic scenarios: regular traffic scenarios, high-density dynamic interference scenarios, and scenarios with significant repetitive structures, corresponding to scenarios 1, 2, and 3, respectively. The comparison methods included visual tracking only, visual-inertial fusion, and visual-laser fusion. The test metrics included average positioning error, trajectory integrity rate, number of loop closure false detections, repositioning success rate, map integrity, and average processing latency.

[0119] Table 1: Data Comparison of Dynamic Scene Localization and Mapping

[0120] Scene type method Average positioning error Trajectory completeness Number of loopback false detections Relocation success rate Map completeness Average processing latency 1 Visual tracking 0.21 91.8% 3 88.6% 90.4% 42 1 Vision and Inertia 0.16 94.2% 2 91.3% 92.7% 47 1 Vision and Laser 0.14 95.1% 2 93.8% 94.6% 55 1 Method of the present invention 0.07 98.9% 0 98.4% 98.1% 63 2 Visual tracking 0.43 84.9% 6 72.5% 83.6% 45 2 Vision and Inertia 0.31 89.7% 5 81.4% 86.9% 51 2 Vision and Laser 0.24 92.8% 4 86.2% 89.1% 59 2 Method of the present invention 0.11 98.6% 1 97.8% 97.4% 67 3 Visual tracking only 0.38 87.6% 8 76.3% 84.8% 43 3 Vision and Inertia 0.29 90.5% 6 83.7% 87.5% 48 3 Vision and Laser 0.19 93.3% 4 89.4% 91.2% 58 3 Method of the present invention 0.12 97.5% 1 97.1% 96.8% 66

[0121] As shown in Table 1, the method of this invention demonstrates superior positioning accuracy and trajectory integrity compared to the control method in three different dynamic environments. In normal traffic scenarios, the average positioning error of this invention is only 0.07 meters, and the trajectory integrity rate reaches 98.9%, lower than the error levels of visual or fusion methods alone. In high-density dynamic interference scenarios, this invention can effectively suppress dynamic target interference, reducing the average positioning error to 0.11 meters, with only 1 loopback false detection and a repositioning success rate as high as 97.8%. In scenarios with significant repetitive structures, this invention maintains a trajectory integrity rate of 97.5% and a map integrity of 96.8%, significantly reducing local drift and mismatch phenomena of traditional methods. These data fully demonstrate the effectiveness and reliability of joint recalculation of time windows, semantic skeleton map construction, and loopback screening of missing evidence templates in dynamic and complex environments.

[0122] The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitutions or modifications made by those skilled in the art within the scope of the technology disclosed in the present invention, based on the technical solution and inventive concept of the present invention, should be covered within the scope of protection of the present invention.

Claims

1. A dynamic environment SLAM localization method based on multi-source sensor data fusion, characterized in that, Includes the following steps: Collect visual data, inertial data, and laser data; complete calibration and establish the initial cache. Visual data is input into HRNet to obtain semantic segmentation results, ORB features are extracted from static regions, and the initial pose is obtained based on ORB-SLAM tracking. Inertial data is pre-integrated to obtain inertial results, laser data is registered to obtain laser results, and the initial pose, inertial results, and laser results are fused to obtain the current pose; The time window data in the original cache is subjected to divergence determination. When the triggering condition is met, the time window data is replayed, registration and trajectory recovery are re-executed, candidate trajectories are generated, and the recalculated trajectory is selected to update the current pose. Based on the updated pose, keyframes are inserted, structural region centerlines are extracted from semantic segmentation results and discretized into skeleton nodes and skeleton segments, and ORB features are attached to skeleton nodes and skeleton segments to generate a semantic skeleton map. A semantic topology string is generated by encoding the arrangement and connectivity order of structural objects within the keyframe view, and then associated with the keyframe index. The semantic topology string and historical semantic topology string are used to retrieve loop closure candidate frames. ORB geometric verification is performed, and the verified keyframes are written into the loop closure edge set and SLAM graph optimization is performed to obtain the localization result and map result.

2. The dynamic environment SLAM localization method based on multi-source sensor data fusion according to claim 1, characterized in that, The process of completing calibration and establishing the original cache specifically includes: The system continuously acquires image frames and generates visual data through a visual sensor, continuously acquires angular velocity and acceleration data and generates inertial data through an inertial measurement unit, and continuously acquires spatial scanning points and generates laser data through a lidar. Add timestamps corresponding to the acquisition time to the visual data, inertial data, and laser data respectively, and establish a unified time series according to the timestamps; Joint calibration is performed on the vision sensor, inertial measurement unit, and lidar to obtain the first extrinsic parameter of the vision sensor relative to the inertial measurement unit and the second extrinsic parameter of the lidar relative to the vision sensor; the vision data, inertial data, and lidar data are time-aligned according to a unified time series, and the coordinate transformation relationship is registered based on the first and second extrinsic parameters; Write the data that has been aligned and associated with coordinates into the original cache in chronological order, and save the corresponding timestamp, first external parameter and second external parameter in the original cache.

3. The dynamic environment SLAM localization method based on multi-source sensor data fusion according to claim 2, characterized in that, The initial pose obtained based on ORB-SLAM tracking specifically includes: Input the current image frame from the visual data into the pre-trained HRNet, and output the semantic segmentation result corresponding to the current image frame; Dynamic target regions are removed based on semantic segmentation results, while static regions are retained. Perform FAST corner detection within a static region, filter candidate feature points according to corner response values, and generate ORB features based on the rotated BRIEF descriptor; The ORB features of the current image frame are matched with the ORB features of the previous image frame to obtain the inter-frame feature correspondence. The inter-frame feature correspondence is input into the ORB-SLAM front end to perform motion estimation and tracking, generate the pose transformation of the current image frame relative to the previous image frame, and accumulate the continuous pose transformations into the initial pose.

4. The dynamic environment SLAM localization method based on multi-source sensor data fusion according to claim 3, characterized in that, The process of fusing the initial pose, inertial results, and laser results to obtain the current pose specifically includes: Extract the inertial data within the time interval corresponding to the current image frame, perform pre-integration on the angular velocity data and acceleration data to obtain the inertial result relative to the previous image frame, and calculate the inertial pose corresponding to the current image frame based on the current pose and inertial result corresponding to the previous image frame. Extract the laser data corresponding to the current image frame, perform distortion correction on the laser data, register the distortion-corrected laser data with the local point cloud to obtain the laser result corresponding to the current image frame, and determine the laser pose corresponding to the current image frame based on the laser result. Read the initial pose, and transform the initial pose, inertial pose, and laser pose to a unified coordinate system based on the first and second extrinsic parameters; The pose residuals are established for the initial pose, inertial pose and laser pose corresponding to the pose to be determined in the current image frame. The average least squares calculation of the multi-source pose based on the logarithmic mapping residual of Lie algebra is performed on the pose residuals to obtain the current pose corresponding to the current image frame.

5. The dynamic environment SLAM localization method based on multi-source sensor data fusion according to claim 4, characterized in that, The specific steps for determining divergence in the time window data in the original cache include: Taking the current image frame as the endpoint, extract continuous time window data from the original buffer, calculate the relative rotation angle difference and relative translation distance between each pair of the initial pose, inertial pose and laser pose corresponding to the current image frame, and determine that the trigger condition is met when the maximum value of the relative rotation angle difference is greater than the rotation threshold or the maximum value of the relative translation distance is greater than the displacement threshold. After the triggering condition is met, the current pose corresponding to the starting frame of the time window is fixed as the reference pose, and the poses corresponding to each image frame within the time window are set as the poses to be determined; the starting frame of the time window is selected as the first image frame in the captured continuous time window data. Based on the time window data, the ORB feature correspondence between adjacent image frames within the time window, the inertial results corresponding to each image frame, and the laser results corresponding to each image frame are regenerated, and the laser results corresponding to each image frame are transformed to a unified coordinate system. Using the reference pose, ORB feature correspondence, inertial results, and laser results, a joint least squares calculation is constructed for the pose to be determined corresponding to each image frame within the time window, and the time window pose sequence is obtained by solving the problem. The pose sequence within the time window is used as the recalculated trajectory, and the pose corresponding to each image frame in the recalculated trajectory is used to replace the current pose in the original pose sequence within the time window.

6. The dynamic environment SLAM localization method based on multi-source sensor data fusion according to claim 5, characterized in that, The generation of the semantic skeleton map specifically includes: Insert the current image frame as a keyframe based on the updated pose; Extract the binary mask corresponding to the structural region from the semantic segmentation result, perform connected component labeling on the binary mask and delete connected regions with an area less than the threshold, and retain the remaining connected regions; By successively deleting boundary pixels from the retained connected regions while maintaining the connectivity, the center line of the structural region with a single pixel width is obtained. Determine the endpoints and intersections along the centerline of the structural region, determine the centerline pixel sequence between the endpoints and intersections as the skeleton path, select sampling points along the skeleton path at a preset step size, and determine the endpoints, intersections, and sampling points as skeleton nodes, and determine the skeleton path between adjacent skeleton nodes as skeleton line segments. Transform the ORB features to a unified coordinate system based on the updated pose, and attach them to the corresponding skeleton nodes or skeleton segments according to the principle of minimum distance. Based on the updated pose, the skeleton nodes, skeleton segments, and attached ORB features are spliced ​​together to generate a semantic skeleton map.

7. A dynamic environment SLAM localization method based on multi-source sensor data fusion according to claim 6, characterized in that, The process of encoding semantic topology strings according to the arrangement and connectivity order of structural objects within the keyframe view specifically includes: The field of view of the keyframe is determined based on the updated pose of the keyframe and the current image frame. Skeleton nodes and skeleton segments located within the field of view of the keyframe are selected from the semantic skeleton map. Skeleton nodes and skeleton segments with the same structural region category and connected to each other are identified as structural objects. Each structural object is projected onto the keyframe image plane and sorted in ascending order according to the horizontal coordinates of the projection center point of each structural object. Based on the connection relationship between skeleton nodes and skeleton segments, the connection order is recorded along the connection path between each structural object within the keyframe's field of view to obtain the connectivity order. Each structural object is encoded sequentially according to its structural region category, arrangement order, and connectivity order. The encoding results of each structural object are then concatenated sequentially to generate the semantic topology string corresponding to the keyframe. The semantic topology string is stored in correspondence with the keyframe number to form a keyframe index.

8. The dynamic environment SLAM localization method based on multi-source sensor data fusion according to claim 7, characterized in that, The obtained positioning and map results specifically include: The semantic topology string of the current keyframe is compared with the historical semantic topology string in the keyframe index on an object-by-object basis. The number of matching objects is counted according to the structural region category, arrangement order and connectivity order. Historical keyframes with a number of matching objects reaching the threshold are identified as loop closure candidate frames. For each loopback candidate frame, count the structural objects that appear repeatedly in the corresponding historical key frames and adjacent historical key frames. Record the structural objects that appear consecutively to the frequency threshold but do not appear in the loopback candidate frame as missing evidence templates. Compare the structural objects that do not appear in the current keyframe with the missing evidence template, and delete the loop closure candidate frames that do not match the missing evidence template; ORB features are extracted from the retained loop closure candidate frames and matched with the ORB features of the current keyframe. The relative pose between keyframe pairs is calculated based on the matching results and geometric verification is performed. The keyframe pairs that pass geometric verification are written into the loop edge set. Based on the loop edge set and the keyframe pose, SLAM graph optimization is performed to update the keyframe pose and semantic skeleton map, and the localization result and map result are obtained.