A visual SLAM method, system, and storage medium for dynamic scenes

By using a lightweight semantic segmentation network and the Delaunay triangulation method in a visual SLAM system to detect and remove dynamic feature points, high-precision pose estimation is achieved, solving the problem of localization error in dynamic scenes and improving the robustness of robot autonomous operation.

CN116385538B · Active · Publication Date: 2026-06-12 · BEIJING INFORMATION SCI & TECH UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING INFORMATION SCI & TECH UNIV
Filing Date
2023-04-04
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

In dynamic scenes, dynamic objects cause positioning errors in existing visual positioning systems, degrading the robustness and accuracy of robots' autonomous operations.

Method used

The lightweight semantic segmentation network PP-LiteSeg is used for image segmentation. Combined with the Delaunay triangulation method, dynamic feature points are detected and removed, and pose estimation is performed using absolute static feature points.

Benefits of technology

It improves the positioning accuracy and stability of the visual SLAM system in dynamic scenes, and enhances the robot's autonomous operation capability in dynamic environments.

Figure CN116385538B_ABST

Abstract

The application relates to a dynamic scene-oriented visual SLAM method, a system and a storage medium, which comprises the following steps: dividing objects in each acquired frame image into dynamic objects and static objects according to semantic information to obtain corresponding image masks; performing ORB feature extraction on the acquired images, performing feature matching on key points of adjacent two frame images according to the extracted feature points, and simultaneously performing motion state detection according to the image masks; determining the motion attribute of the feature points according to whether the coordinates of the feature points fall in the segmentation region, finally determining the motion attribute of the feature points by combining a Delaunay triangulation method, detecting potential dynamic feature point outliers, and obtaining absolute static feature points; and performing pose estimation by using the absolute static feature points to obtain a high-precision positioning result. The application can solve the influence of dynamic objects in an unknown environment on a visual positioning system, and can be applied in the field of robot visual positioning.

Description

Technical Field

[0001] This invention relates to the field of robot visual localization technology, and in particular to a visual SLAM (simultaneous localization and mapping) method, system and storage medium for dynamic scenes.

Background Technology

[0002] Simultaneous Localization and Mapping (SLAM) is a core technology for robot autonomous navigation. Robots utilize sensors such as LiDAR and cameras to acquire environmental perception information, estimate their own pose in real time, and simultaneously construct a sparse point cloud map. This improves the robot's environmental adaptability and meets the application requirements of autonomous operations. In recent years, with the rapid development of artificial intelligence technology, robots have become an important means of liberating human labor and are widely used in various fields of production and life. Autonomous robot operation in dynamic scenarios has become a challenging research hotspot.

[0003] The publicly available ORB-SLAM2 is an open-source SLAM framework that supports monocular, stereo, and RGB-D cameras. It computes the camera pose in real time while performing sparse 3D reconstruction of the surrounding environment, and it performs real-time loop closure detection and relocalization on a CPU with centimeter-level localization accuracy. ORB-SLAM2 uses methods such as Random Sample Consensus (RANSAC) and bundle adjustment to reduce errors caused by mismatches during feature matching. Both methods assume a static scene; when the system operates in a dynamic environment, dynamic feature points can introduce significant errors. Distinguishing between dynamic and static feature points, and thereby improving the robustness and reliability of the SLAM system in dynamic scenes, is therefore crucial for efficient, robust, and autonomous operation of mobile robots. Existing technologies address this problem in several ways. Because running semantic segmentation on every frame consumes substantial computational resources, one approach adaptively decides whether the current frame needs semantic segmentation based on the motion level of the feature points and the proportion of dynamic feature points, which greatly reduces the computational cost and lets the system run in real time. Another approach uses a single Gaussian model (SGM) to reflect grayscale changes in an image sequence, distinguishes foreground and background regions by grayscale value, and ignores feature points in the foreground region when estimating the camera pose; however, the SGM is susceptible to noise interference, resulting in poor system stability. A further approach employs YOLOv4 to determine whether potential moving objects are present in the current image and segments the foreground and background within each bounding box using the maximum inter-class variance algorithm, filtering out potential dynamic feature points and improving relative pose accuracy by nearly 90%. However, in special scenarios where dynamic targets occupy a significant proportion of the image, the system may experience tracking failures.

[0004] Therefore, eliminating the positioning error that dynamic objects in unknown environments introduce into a visual positioning system has become an urgent technical problem.

Summary of the Invention

[0005] To address the aforementioned problems, the purpose of this invention is to provide a visual SLAM method, system, and storage medium for dynamic scenes that resolve the positioning error that dynamic objects in unknown environments cause in the visual positioning system.

[0006] To achieve the above objectives, the present invention adopts the following technical solution: a visual SLAM method for dynamic scenes, comprising: dividing objects in each frame of an acquired image into dynamic objects and static objects based on semantic information, and obtaining corresponding image masks; performing ORB feature extraction on the acquired images, performing feature matching on key points of adjacent frames of images based on the extracted feature points, and simultaneously performing motion state detection based on the image masks; determining the motion attributes of feature points based on whether the coordinates of the feature points fall within the segmented region, and finally determining the motion attributes of the feature points by combining the Delaunay triangulation method, detecting potential dynamic feature point outliers, and obtaining absolute static feature points; and using the absolute static feature points for pose estimation to obtain high-precision localization results.

[0007] Furthermore, based on semantic information, objects in each acquired frame of image are divided into dynamic objects and static objects, including:

[0008] The lightweight semantic segmentation network PP-LiteSeg is used for segmentation;

[0009] Among the feature points extracted from the acquired image, those on dynamic objects and those on static objects are labeled, and all channels are combined into a single channel, resulting in a segmentation mask for dynamic and static objects in the scene image.

[0010] Furthermore, the motion attributes of feature points are finally determined by combining the Delaunay triangulation method, including:

[0011] The feature points in the image are connected into triangles, and the feature points of dynamic objects are determined by comparing the sides of the corresponding triangles in two adjacent frames. The feature points belonging to dynamic objects are removed from the connected graph constructed by the triangulation method, so as to remove the dynamic regions in the image.

[0012] Furthermore, the feature points of the dynamic object are determined, including:

[0013] Dynamic vector detection is performed by back-projecting the feature points of the current frame corresponding to the two map points connected by the vector edge and generating new vector edges formed by connecting the 3D points.

[0014] Based on the dynamic vector detection results, the dynamic degree of change of vector edges is obtained to detect map points of dynamic objects.

[0015] Furthermore, dynamic vector detection is implemented by a dynamic vector detection function, which is:

[0016]

[0017] In the formula, MN represents the vector sides of the triangle, and M'N' represents the vector sides formed by connecting 3D points generated by the back projection of feature points in the current frame. This is a dynamic vector detection function.

[0018] Furthermore, the degree of dynamic change of the vector edge is obtained, including: using the L2 norm to calculate the length change of the vector edge, and using cosine similarity to calculate the angle change of the vector edge, so as to obtain the degree of dynamic change of the vector edge.

[0019] If no dynamic changes occur, the criteria based on the L2 norm should be met;

[0020] If no change in the angle of the vector edge occurs, the judgment condition based on cosine similarity should be met;

[0021] Dynamic vectors are determined based on two criteria. If a vector edge is marked as a dynamic vector, the dynamic weights of the two map points connected by that dynamic vector are increased.

[0022] Furthermore, the criterion based on the L2 norm is:

[0023]

[0024] The criteria for determining cosine similarity are:

[0025]

[0026] In the formula, ε_dmax and ε_dmin are the maximum and minimum values of the interval for the ratio of the vectors' 2-norms, and ε_cos is the cosine similarity threshold.

[0027] A visual SLAM system for dynamic scenes includes: a first processing module that divides objects in each frame of an acquired image into dynamic and static objects based on semantic information, and obtains corresponding image masks; a second processing module that performs ORB feature extraction on the acquired images, performs feature matching on key points of adjacent frames based on the extracted feature points, and performs motion state detection based on the image masks; a third processing module that determines the motion attributes of feature points based on whether the coordinates of the feature points fall within the segmented region, and finally determines the motion attributes of the feature points using the Delaunay triangulation method, detects potential dynamic feature point outliers, and obtains absolute static feature points; and an output module that uses the absolute static feature points to perform pose estimation to obtain high-precision localization results.

[0028] A computer-readable storage medium storing one or more programs, the one or more programs including instructions that, when executed by a computing device, cause the computing device to perform any of the methods described above.

[0029] A computing device includes: one or more processors, a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs include instructions for performing any of the methods described above.

[0030] The present invention has the following advantages due to the adoption of the above technical solutions:

[0031] This invention proposes a real-time SLAM system for dynamic environments based on ORB-SLAM2. This system utilizes semantic information obtained from the deep learning network PP-LiteSeg combined with triangulation to filter out dynamic feature points in the scene, and uses absolutely static feature points for pose estimation. Experiments on the TUM and KITTI datasets demonstrate that this invention significantly improves camera pose and localization accuracy. The system exhibits high stability and real-time performance, which is of great significance for the localization system of mobile robots, effectively improving the accuracy and efficiency of SLAM systems.

Description of the Attached Figures

[0032] Figure 1 is a flowchart of a visual SLAM method for dynamic scenes in an embodiment of the present invention;

[0033] Figure 2 is a back-projection model from the previous frame to the current frame in an embodiment of the present invention;

[0034] Figure 3a is an ATE comparison diagram of ORB-SLAM2 on the fr3_walking halfsphere sequence in an embodiment of the present invention;

[0035] Figure 3b is an ATE comparison diagram of ORB-SLAM2 on the fr3_walking xyz sequence in an embodiment of the present invention;

[0036] Figure 3c is an ATE comparison diagram of the method of the present invention on the fr3_walking halfsphere sequence in an embodiment of the present invention;

[0037] Figure 3d is an ATE comparison diagram of the method of the present invention on the fr3_walking xyz sequence in an embodiment of the present invention;

[0038] Figure 4a shows the relative pose error (RPE) between the pose calculation result of ORB-SLAM2 and the ground truth on the fr3_walking halfsphere sequence in an embodiment of the present invention;

[0039] Figure 4b shows the relative pose error (RPE) between the pose calculation result of ORB-SLAM2 and the ground truth on the fr3_walking xyz sequence in an embodiment of the present invention;

[0040] Figure 4c shows the relative pose error (RPE) between the pose calculation result of the method of the present invention and the ground truth on the fr3_walking halfsphere sequence in an embodiment of the present invention;

[0041] Figure 4d shows the relative pose error (RPE) between the pose calculation result of the method of the present invention and the ground truth on the fr3_walking xyz sequence in an embodiment of the present invention.

Detailed Implementation

[0042] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the described embodiments of the present invention are within the scope of protection of the present invention.

[0043] It should be noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the scope of exemplary embodiments according to the invention. As used herein, the singular form is intended to include the plural form as well, unless the context clearly indicates otherwise. Furthermore, it should be understood that when the terms "comprising" and / or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components, and / or combinations thereof.

[0044] To address the issues of low localization accuracy and poor robustness in visual SLAM systems for dynamic scenes, this invention proposes a visual SLAM method, system, and storage medium for dynamic scenes, combining a lightweight semantic segmentation network with motion information. Specifically, it includes: introducing a PP-LiteSeg neural network to extract commonly present dynamic targets in dynamic scenes and further extracting features internally; simultaneously, combining spatial geometry principles to further determine dynamic features; and constructing absolutely static initial key targets after screening. Based on this, the system is initialized and the key targets are tracked to solve the camera pose. Experiments were conducted on the public datasets TUM and KITTI, and the results show that the proposed algorithm improves both localization accuracy and real-time performance in dynamic environments compared to the original ORB-SLAM2.

[0045] In one embodiment of the present invention, as shown in Figure 1, a visual SLAM method for dynamic scenes is provided. In this embodiment, semantic segmentation is added to the original ORB-SLAM2, and a triangulation algorithm is used for dynamic object detection during tracking to reduce the interference of dynamic feature points with visual odometry. Specifically, the method includes the following steps:

[0046] 1) Based on semantic information, the objects in each frame of the acquired image are divided into dynamic objects and static objects to obtain the corresponding image mask;

[0047] 2) Perform ORB feature extraction on the acquired images, perform feature matching on key points of adjacent two frames based on the extracted feature points, and perform motion state detection based on the image mask.

[0048] 3) Determine the motion attributes of feature points based on whether their coordinates fall within the segmented region. Combine this with the Delaunay triangulation method to make the final determination of the motion attributes of feature points, detect potential dynamic feature point outliers, and obtain absolute static feature points.

[0049] 4) Use absolute static feature points for pose estimation to obtain high-precision positioning results.

[0050] In step 1) above, objects in each frame of the acquired image are divided into dynamic objects and static objects based on semantic information, including the following steps:

[0051] 1.1) The lightweight semantic segmentation network PP-LiteSeg is used for segmentation;

[0052] In this embodiment, the PP-LiteSeg network is used to perform preliminary semantic segmentation on potential moving objects;

[0053] 1.2) Among the feature points extracted from the acquired image, those on dynamic objects and those on static objects are labeled, and all channels are combined into a single channel, thus obtaining a segmentation mask for dynamic and static objects in the scene image.

[0054] In some feasible implementations, the widespread application of deep learning to semantic segmentation, instance segmentation, and object detection has produced a series of image processing networks, such as SegNet, Mask R-CNN, and YOLO v1-v5. Among these, the YOLO series of object detection networks is relatively fast, but a candidate box may contain both the target and part of the background, so additional background processing would be required. The Mask R-CNN instance segmentation network can directly obtain relatively accurate contour information, but instance segmentation must further differentiate between individuals of the same category, which is more time-consuming. Therefore, this embodiment employs the lightweight semantic segmentation network PP-LiteSeg to ensure both high accuracy and a high processing frame rate.

[0055] The input to PP-LiteSeg is a raw RGB image. For example, since the dynamic objects in an indoor dynamic scene are mainly people, tables, and chairs, the network is fine-tuned using labels for these classes. To cover most dynamic objects with the semantic segmentation labels, the convolutional neural network first segments the categories that might be dynamic or movable, such as pedestrians, tables, and chairs. Assuming the input is an m×n×3 RGB image, the network output is an m×n×L matrix, where L is the number of objects in the image. Each output channel i ∈ {1, …, L} yields the binary mask of one object. In the image dataset, people and chairs are primarily selected. Among the feature points extracted from this environment, those belonging to people and chairs are labeled as dynamic feature points, and the others are labeled as static points. Finally, all channels are combined into a single channel, resulting in a segmentation mask for people and chairs in the scene image.
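The mask handling described above can be sketched in a few lines. This is a minimal illustration, not the patented implementation: it assumes that the L per-object binary masks are combined with a logical OR, that masks are plain nested lists, and that keypoints are (x, y) pixel coordinates. The names `merge_masks` and `label_keypoints` are hypothetical.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]  # m rows by n columns of 0/1 values


def merge_masks(channel_masks: Sequence[Mask]) -> Mask:
    """Combine per-object binary masks into one single-channel mask (logical OR)."""
    rows, cols = len(channel_masks[0]), len(channel_masks[0][0])
    merged = [[0] * cols for _ in range(rows)]
    for mask in channel_masks:
        for r in range(rows):
            for c in range(cols):
                if mask[r][c]:
                    merged[r][c] = 1
    return merged


def label_keypoints(keypoints: Sequence[Tuple[float, float]], mask: Mask) -> List[bool]:
    """Return True for each keypoint (x, y) whose pixel lies inside the mask."""
    rows, cols = len(mask), len(mask[0])
    labels = []
    for x, y in keypoints:
        col, row = int(round(x)), int(round(y))
        inside = 0 <= row < rows and 0 <= col < cols
        labels.append(bool(inside and mask[row][col]))
    return labels
```

Keypoints labeled `True` would be treated as candidate dynamic feature points; the geometric check described later decides the final motion attribute.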

[0056] In step 3) above, the motion attributes of feature points are finally determined using the Delaunay triangulation method, and fully dynamic feature points are eliminated. Specifically:

[0057] The feature points in the image are connected into triangles, and the feature points of dynamic objects are determined by comparing the sides of the corresponding triangles in two adjacent frames. The feature points belonging to dynamic objects are removed from the connected graph constructed by the triangulation method, so as to remove the dynamic regions in the image.
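As a rough illustration of connecting feature points into triangles and comparing corresponding sides across two frames, the sketch below builds a Delaunay triangulation by brute force (a triangle is kept when its circumcircle contains no other point) and then flags edges whose length ratio changes. Several things are assumptions made for the example: the points are taken to be 2D coordinates already expressed in a common frame, the tolerance `tol` is a placeholder, and the function names are hypothetical. For real point counts a library routine such as `scipy.spatial.Delaunay` would be the practical choice.

```python
import math
from itertools import combinations
from typing import Optional, Sequence, Set, Tuple

Point = Tuple[float, float]
Edge = Tuple[int, int]


def circumcircle(a: Point, b: Point, c: Point) -> Optional[Tuple[float, float, float]]:
    """Return (center_x, center_y, radius_squared), or None for collinear points."""
    d = 2.0 * (a[0] * (b[1] - c[1]) + b[0] * (c[1] - a[1]) + c[0] * (a[1] - b[1]))
    if abs(d) < 1e-12:
        return None
    a2 = a[0] ** 2 + a[1] ** 2
    b2 = b[0] ** 2 + b[1] ** 2
    c2 = c[0] ** 2 + c[1] ** 2
    ux = (a2 * (b[1] - c[1]) + b2 * (c[1] - a[1]) + c2 * (a[1] - b[1])) / d
    uy = (a2 * (c[0] - b[0]) + b2 * (a[0] - c[0]) + c2 * (b[0] - a[0])) / d
    return ux, uy, (a[0] - ux) ** 2 + (a[1] - uy) ** 2


def delaunay_edges(points: Sequence[Point]) -> Set[Edge]:
    """Brute-force Delaunay: keep triangles whose circumcircle holds no other point."""
    edges: Set[Edge] = set()
    for i, j, k in combinations(range(len(points)), 3):
        circle = circumcircle(points[i], points[j], points[k])
        if circle is None:
            continue
        ux, uy, r2 = circle
        empty = all(
            (p[0] - ux) ** 2 + (p[1] - uy) ** 2 >= r2 - 1e-9
            for idx, p in enumerate(points)
            if idx not in (i, j, k)
        )
        if empty:
            edges.update({(i, j), (i, k), (j, k)})  # i < j < k, so edges are sorted
    return edges


def changed_edges(edges: Set[Edge], prev_pts: Sequence[Point],
                  curr_pts: Sequence[Point], tol: float = 0.001) -> Set[Edge]:
    """Edges whose length ratio between the two frames deviates from 1 by more than tol."""
    changed: Set[Edge] = set()
    for i, j in edges:
        before = math.dist(prev_pts[i], prev_pts[j])
        after = math.dist(curr_pts[i], curr_pts[j])
        if abs(after / before - 1.0) > tol:
            changed.add((i, j))
    return changed
```

With three fixed points and a fourth that moves between frames, only the two edges touching the moving point are reported as changed, which is the cue used to remove that point from the connected graph.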

[0058] In some feasible implementations, most dynamic objects can be accurately segmented using the PP-LiteSeg network. However, the segmentation of potentially dynamic objects is not ideal. For example, a book held in a person's hand or a chair being sat on is initially classified as a static object and retained, but in a real-world scenario these objects may move along with the person. To address this issue, this embodiment uses a triangulation method to further examine whether the features are dynamic. The check relies on a geometric constraint: as long as nothing moves, the geometric relationship between feature points in an unknown environment remains unchanged, but when a dynamic object in the environment moves, the geometric relationship between the dynamic feature points and the static feature points changes accordingly.

[0059] In some feasible implementations, determining the feature points of a dynamic object includes the following steps:

[0060] 3.1) Dynamic vector detection is performed by back-projecting the feature points of the current frame corresponding to the two map points connected by the vector edge and generating new vector edges formed by connecting the 3D points.

[0061] 3.2) Based on the dynamic vector detection results, obtain the dynamic degree of change of the vector edges to detect the map points of dynamic objects.

[0062] In some embodiments, the vector edges that make up the triangulation represent the geometric relationship between the two map points they connect. Dynamic vector detection is therefore performed by back-projecting the current-frame feature points corresponding to those two map points into 3D points and connecting the 3D points to form a new vector edge. Figure 2 shows the back-projection model from the previous frame to the current frame. Assume that the feature points (m2, n2, k2) of the i-th frame and the feature points (m1, n1, k1) of the previous frame are matched through the projections of map points m, n, and k, and that m and n belong to static objects while k belongs to a dynamic object. The 3D points generated by back-projecting the feature points m2, n2, and k2 of the i-th frame are m', n', and k'. Because m and n belong to static objects and k belongs to a dynamic object, the back-projected 3D points m' and n' coincide with m and n respectively, while k' does not coincide with k. Vectors that represent the geometric relationship between map points can therefore be used to detect dynamic feature points.
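The text does not spell out the back-projection itself. A common way to realize it with an RGB-D camera is the pinhole model, sketched below under that assumption: the intrinsics `fx`, `fy`, `cx`, `cy`, the metric depth value, and the pose `(rotation, translation)` are all illustrative inputs, and the function names are hypothetical.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def back_project(u: float, v: float, depth: float,
                 fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Pinhole back-projection of pixel (u, v) with metric depth into the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(p_cam: Vec3, rotation: Sequence[Sequence[float]],
                    translation: Vec3) -> Vec3:
    """Apply p_world = R * p_cam + t, with R given as a 3x3 row-major matrix."""
    x, y, z = (
        sum(rotation[r][c] * p_cam[c] for c in range(3)) + translation[r]
        for r in range(3)
    )
    return (x, y, z)
```

Back-projecting the matched feature points of the current frame this way gives the 3D points m', n', and k'; for a static map point the result should land on the stored map point, while for a moving object it will not.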

[0063] In step 3.1) above, dynamic vector detection is implemented by a dynamic vector detection function, which is:

[0064]

[0065] In the formula, MN represents the vector sides of the triangle, and M'N' represents the vector sides formed by connecting 3D points generated by the back projection of feature points in the current frame. This is a dynamic vector detection function.

[0066] In triangle ΔMNK, MK, MN, and NK are the three vector sides of the triangle. The back projection of the feature points in the current frame generates triangle ΔM'N'K', whose three vector sides are M'K', M'N', and N'K'. Clearly, vector MN coincides with the vector M'N' generated by the back projection of the corresponding feature points in the current frame, indicating that the motion of the dynamic object does not change vector side MN. However, due to the motion of point K, both vector sides MK and NK are altered.

[0067] In step 3.2) above, obtaining the degree of dynamic change of the vector edge specifically includes: using the L2 norm to calculate the length change of the vector edge, using the cosine similarity to calculate the angle change of the vector edge, and obtaining the degree of dynamic change of the vector edge.

[0068] 3.2.1) If no dynamic changes occur, the determination condition based on the L2 norm should be met;

[0069] The L2 norm is defined as follows:

[0070]

[0071] The significance of the L2 norm lies in determining dynamic vectors by restricting the changes in the length and angle of the sides of two vectors.

[0072] 3.2.2) If no change occurs in the angle of the vector edge, the judgment condition based on cosine similarity should be met;

[0073] Cosine similarity is defined as follows:

[0074]

[0075] In the formula, θ is the angle of the vector side;

[0076] 3.2.3) Determine the dynamic vector based on two judgment conditions. If a vector edge is marked as a dynamic vector, then increase the dynamic weight of the two map points connected by the dynamic vector.
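The formulas at [0070] and [0074] are not reproduced in this text. For reference, the standard definitions of the two quantities are given below, written for a 3D vector edge MN with components (x, y, z) and its back-projected counterpart M'N'. Reading θ as the angle between MN and M'N' is an assumption based on the surrounding description.

```latex
\[
\lVert \overrightarrow{MN} \rVert_2 = \sqrt{x^2 + y^2 + z^2},
\qquad \overrightarrow{MN} = (x, y, z)
\]
\[
\cos\theta =
\frac{\overrightarrow{MN} \cdot \overrightarrow{M'N'}}
     {\lVert \overrightarrow{MN} \rVert_2 \, \lVert \overrightarrow{M'N'} \rVert_2}
\]
```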

[0077] In step 3.2.1) above, the criterion based on the L2 norm is:

[0078]

[0079] In step 3.2.2) above, the judgment criterion based on cosine similarity is:

[0080]

[0081] In the formula, ε_dmax and ε_dmin are the maximum and minimum values of the interval for the ratio of the vectors' 2-norms; in this embodiment, [ε_dmin, ε_dmax] is preferably [0.999, 1.001]. ε_cos is the cosine similarity threshold, preferably 0.999 in this embodiment (corresponding to an angle of approximately 2.6°). A cosine similarity below this threshold indicates that the direction of the vector has deflected by more than this angle.
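A small sketch of how the two criteria might be evaluated with the preferred thresholds is shown here. The criterion formulas themselves are not reproduced in this text, so two things are assumptions: the ratio is taken as ‖M'N'‖ / ‖MN‖, and an edge is treated as dynamic when either criterion fails. The function names are hypothetical, and zero-length edges are not handled.

```python
import math
from typing import Sequence

EPS_D_MIN, EPS_D_MAX = 0.999, 1.001  # preferred ratio interval from the text
EPS_COS = 0.999                      # preferred cosine similarity threshold from the text


def norm(v: Sequence[float]) -> float:
    """L2 norm of a vector."""
    return math.sqrt(sum(c * c for c in v))


def length_unchanged(mn: Sequence[float], mn_new: Sequence[float]) -> bool:
    """L2-norm criterion: the length ratio stays inside [EPS_D_MIN, EPS_D_MAX]."""
    ratio = norm(mn_new) / norm(mn)  # assumed direction of the ratio
    return EPS_D_MIN <= ratio <= EPS_D_MAX


def angle_unchanged(mn: Sequence[float], mn_new: Sequence[float]) -> bool:
    """Cosine-similarity criterion: the direction has not deflected past the threshold."""
    dot = sum(a * b for a, b in zip(mn, mn_new))
    return dot / (norm(mn) * norm(mn_new)) >= EPS_COS


def is_dynamic_vector(mn: Sequence[float], mn_new: Sequence[float]) -> bool:
    """Assumed combination: an edge is dynamic when either criterion fails."""
    return not (length_unchanged(mn, mn_new) and angle_unchanged(mn, mn_new))
```

For example, `is_dynamic_vector((1, 0, 0), (1.2, 0, 0))` fails the length criterion, while a pure rotation of 5° fails the cosine criterion. `math.degrees(math.acos(EPS_COS))` evaluates to about 2.56, the angle implied by a threshold of 0.999.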

[0082] In some embodiments, the dynamic vector is determined based on two criteria:

[0083]

[0084] When vector MN is labeled as a dynamic vector, the dynamic weights of the two map points connected by this dynamic vector are increased:

[0085]

[0086] In the formula, ω_m and ω_n are the dynamic weights of the two endpoints of vector MN. In the example of triangle ΔMNK, assuming the change between vector MK and the vector M′K′ generated by back-projecting the feature points of the current frame exceeds the set threshold, MK is considered a dynamic vector, and the dynamic weights of the two map points connected by MK are increased. Similarly, NK is also considered a dynamic vector. After traversing the vector edges of the triangle, the dynamic weights of map points M and N are both 1, and the dynamic weight of map point K is 2. Therefore, map point K, which belongs to a dynamic object, can be detected, thus achieving the detection of dynamic feature points between adjacent frames.
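The weight bookkeeping in the triangle example can be sketched as a simple vote count. The weight update formula is not reproduced in this text, so the sketch assumes each dynamic vector adds 1 to both of its endpoints, which matches the weights quoted for M, N, and K. The decision threshold `min_weight` and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, List, Tuple


def accumulate_dynamic_weights(
    dynamic_edges: Iterable[Tuple[Hashable, Hashable]]
) -> Dict[Hashable, int]:
    """Add 1 to the dynamic weight of both endpoints of every dynamic vector."""
    weights: Counter = Counter()
    for a, b in dynamic_edges:
        weights[a] += 1
        weights[b] += 1
    return dict(weights)


def dynamic_points(weights: Dict[Hashable, int], min_weight: int = 2) -> List[Hashable]:
    """Map points whose accumulated weight reaches the (assumed) threshold."""
    return sorted(p for p, w in weights.items() if w >= min_weight)
```

With MK and NK marked as dynamic vectors in triangle ΔMNK, the weights come out as M = 1, N = 1, K = 2, so only K is reported.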

[0087] Example 1, Accuracy Experiment: The experimental platform used in this example is an ASUS laptop running a 64-bit Ubuntu 18.04 system, with an Intel Core i5 6200U processor, a maximum CPU frequency of 2.8GHz, and 4GB of memory. The public dataset TUM was used for comparison with the original system ORB-SLAM2, and the performance of the laptop was kept as consistent as possible in each experiment. The experimental results were the average of five runs. The evaluation tools used were the Python script files evaluate_ate.py and evaluate_rpe.py, where the evaluation metrics included absolute trajectory error (ATE) and relative pose error (RPE).

[0088] The TUM dataset, released by the Computer Vision Laboratory at the Technical University of Munich, consists of images captured by a Kinect depth camera at a resolution of 640×480 after a series of preprocessing steps. It contains three groups of sequences: fr1 and fr2 are static scene datasets, while fr3 is a dynamic scene dataset. The fr3_walking sequence shows two people walking around a table, with significant scene changes; therefore, this embodiment selected the fr3_walking sequence as the experimental data. This sequence consists of four sets of camera motions: camera fixed (fr3_walking static), camera moving along the x, y, and z axes (fr3_walking xyz), camera moving along a hemispherical trajectory (fr3_walking halfsphere), and camera rotating about the roll, pitch, and yaw axes (fr3_walking rpy).

[0089] Figures 3a and 3b show the ATE comparison plots of ORB-SLAM2 on the fr3_walking halfsphere and fr3_walking xyz sequences, respectively. Figures 3c and 3d show the ATE comparison plots of the improved algorithm on the corresponding sequences. The ground truth of the camera trajectory is shown as a dashed line, and the estimated camera trajectory (KeyFrameTrajectory) is shown as a solid line. The shorter the distance between the ground truth and the estimated value, the higher the system accuracy. It can be seen that ORB-SLAM2 has a large error in pose estimation on both sequences. The reason is that the feature points used by the ORB-SLAM2 algorithm in the pose estimation stage include a large number of outliers, causing significant drift in its localization trajectory.

[0090] Figures 4a and 4b show the relative pose error (RPE) between the ORB-SLAM2 pose calculation results and the ground truth on the fr3_walking halfsphere and fr3_walking xyz sequences, respectively. Figures 4c and 4d show the RPE of the method in this embodiment on the corresponding sequences. The horizontal axis represents elapsed time in seconds (s), and the vertical axis represents the relative pose error of the algorithm at a given moment in meters (m). Experimental results show that the error range of the improved system has decreased from the original [0.1 m, 1.2 m] to [0.02 m, 0.09 m], demonstrating higher accuracy than the standard ORB-SLAM2 system. Around the 8th and 15th seconds of the sequence, both systems exhibited significant fluctuations. Analysis of the corresponding sequence images revealed that these were caused by a sudden and significant change in the objects within the field of view. The improved system demonstrated higher stability than the original system.

[0091] As shown in Tables 1 to 3, the first column represents the name of the data sequence, and the first row represents the name of the algorithm. It is easy to see that the method in this embodiment significantly improves the pose estimation accuracy compared to the original ORB-SLAM2 system. In high dynamic scenes, the root mean square error (RMSE) of the absolute trajectory error of the pose can be improved by 92% to 97%, the RMSE of the translation component of the relative pose error can be improved by 89% to 94%, and the RMSE of the rotation component can be improved by 88% to 91%.
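To make the reported metric concrete, the sketch below computes the RMSE of the absolute trajectory error and a percentage improvement. It is a simplified illustration: it assumes the two trajectories are already time-associated and aligned (the TUM evaluation scripts handle association and alignment, which this sketch leaves out), and it assumes the improvement is computed as (baseline − improved) / baseline. The function names are hypothetical, and any inputs used with it here are toy values, not results from the tables.

```python
import math
from typing import Sequence


def ate_rmse(estimated: Sequence[Sequence[float]],
             ground_truth: Sequence[Sequence[float]]) -> float:
    """RMSE of the position differences between associated, pre-aligned poses."""
    squared = [
        sum((e - g) ** 2 for e, g in zip(p_est, p_gt))
        for p_est, p_gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared) / len(squared))


def improvement_percent(baseline: float, improved: float) -> float:
    """Relative error reduction of the improved system over the baseline, in percent."""
    return (baseline - improved) / baseline * 100.0
```

Under this definition, an RMSE that drops from 0.5 m to 0.05 m corresponds to an improvement of about 90%.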

[0092] Table 1. Comparison of absolute trajectory errors between the method of this invention and the ORB-SLAM2 algorithm (unit: m)

[0093]

[0094] Table 2. Comparison of relative translation trajectory errors between the method of this invention and the ORB-SLAM2 algorithm (unit: m)

[0095]

[0096] Table 3. Comparison of relative rotation trajectory errors between the method of this invention and the ORB-SLAM2 algorithm (unit: °)

[0097]

[0098] Example 2, Ablation Experiment: An experiment was conducted on KITTI tracking, an outdoor large-scale dynamic scene dataset. The ATE of the method of this invention was compared with that of the ORB-SLAM2 and DynaSLAM algorithms, as shown in Table 4. The results show that the localization accuracy of the method of this invention on the KITTI dataset is significantly better than that of the two algorithms, indicating that the algorithm has a certain degree of generalization ability.

[0099] The method of this invention comprises semantic segmentation and geometric determination, each of which can run independently. To verify the contribution of each to positioning accuracy, ablation experiments were conducted on the method of this invention, and the results are shown in Table 5. The mean positioning error obtained with the semantic segmentation module alone or the geometric module alone is much greater than the 1.76 obtained with the full method, which shows the importance of combining semantic segmentation with geometric determination.

[0100] Table 6 lists the running speeds of several dynamic-scene SLAM algorithms on the KITTI dataset. The method of this invention runs approximately 3 to 4 times as fast as DynaSLAM and DS-SLAM and essentially achieves real-time operation, which indicates that the selected PP-LiteSeg network strikes a good balance between performance and speed.

[0101] Table 4. Comparison of the method of the present invention with experimental results of ORB-SLAM2 and DynaSLAM.

[0102]

[0103] Table 5 Ablation Experiment Results

[0104]

[0105] Table 6 Comparison of Calculation Speeds

[0106]

[0107] In summary, through experiments on highly dynamic dataset sequences and comparisons with the traditional ORB-SLAM2 system, it was found that the method of this invention has higher pose estimation accuracy and greater system stability in dynamic scenes.

[0108] This invention proposes a real-time SLAM method for dynamic environments based on ORB-SLAM2. It uses semantic information obtained from the deep learning network PP-LiteSeg, combined with a triangulation algorithm, to filter out dynamic feature points in the scene, and uses the absolute static feature points for pose estimation. Experiments on the TUM and KITTI datasets demonstrate that this invention significantly improves camera pose estimation and localization accuracy, exhibits high operational stability and a certain degree of real-time performance, and is of great significance for the localization systems of mobile robots.

[0109] In one embodiment of the present invention, a visual SLAM system for dynamic scenes is provided, comprising:

[0110] The first processing module divides the objects in each frame of the acquired image into dynamic objects and static objects based on semantic information, and obtains the corresponding image mask.

[0111] The second processing module performs ORB feature extraction on the acquired image, performs feature matching on key points of adjacent two frames of images based on the extracted feature points, and performs motion state detection based on the image mask.

[0112] The third processing module determines the motion attributes of feature points based on whether their coordinates fall within the segmented region. It then uses the Delaunay triangulation method to make the final determination of the motion attributes of the feature points, detects potential dynamic feature point outliers, and obtains absolute static feature points.

[0113] The output module uses absolute static feature points to perform pose estimation and obtain high-precision positioning results.

[0114] In the above embodiments, objects in each frame of the acquired image are divided into dynamic objects and static objects based on semantic information, including:

[0115] The lightweight semantic segmentation network PP-LiteSeg is used for segmentation;

[0116] Dynamic and static objects are marked for the feature points extracted from the acquired image, and all channels are merged into a single channel, yielding a segmentation mask of the dynamic and static objects in the scene image.
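The mask lookup described above can be sketched as follows. This is a minimal illustration, assuming the single-channel mask stores one integer label per pixel and that the label 1 marks dynamic objects; the function name, the label value, and the handling of points outside the image are all assumptions rather than details given in the text.

```python
def split_by_mask(keypoints, mask, dynamic_label=1):
    """Split (x, y) keypoints into potentially dynamic and static lists.

    mask is a list of rows (single channel); mask[y][x] == dynamic_label
    marks a pixel that belongs to a dynamic object. Keypoints that fall
    outside the mask are kept as static (an assumption of this sketch).
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    dynamic, static = [], []
    for x, y in keypoints:
        col, row = int(round(x)), int(round(y))
        inside = 0 <= row < height and 0 <= col < width
        if inside and mask[row][col] == dynamic_label:
            dynamic.append((x, y))
        else:
            static.append((x, y))
    return dynamic, static


# Toy 3x4 mask with a 2x2 dynamic region; values are illustrative only.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
keypoints = [(0.2, 0.1), (1.4, 1.2), (2.0, 2.0), (3.0, 1.0)]
dynamic, static = split_by_mask(keypoints, mask)
print(dynamic, static)
```

Points flagged here are only candidates; in the method described, the final motion attribute is decided in combination with the triangulation-based check.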

[0117] In the above embodiments, the final determination of the motion attributes of feature points using the Delaunay triangulation method includes:

[0118] The feature points in the image are connected into triangles, and the feature points of dynamic objects are determined by comparing the sides of the corresponding triangles in two adjacent frames. The feature points belonging to dynamic objects are removed from the connected graph constructed by the triangulation method, so as to remove the dynamic regions in the image.
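The graph bookkeeping in this step can be illustrated with a short sketch. It assumes the triangulation is already available as triples of feature point indices (for example, the `simplices` array of `scipy.spatial.Delaunay` could supply them) and shows only how edges are collected and how points flagged as dynamic are removed from the connected graph. The function names and the toy indices are hypothetical.

```python
def edges_from_triangles(triangles):
    """Collect the undirected edges of a triangulation given as (i, j, k) index triples."""
    edges = set()
    for i, j, k in triangles:
        for a, b in ((i, j), (j, k), (k, i)):
            edges.add((min(a, b), max(a, b)))
    return edges


def remove_points(edges, dynamic_points):
    """Drop every edge that touches a point flagged as dynamic."""
    dynamic = set(dynamic_points)
    return {(a, b) for a, b in edges if a not in dynamic and b not in dynamic}


# Toy triangulation over five feature points; point 4 is assumed to be dynamic.
triangles = [(0, 1, 2), (1, 2, 3), (2, 3, 4)]
edges = edges_from_triangles(triangles)
static_edges = remove_points(edges, dynamic_points=[4])
print(sorted(static_edges))
```

Storing each edge as a sorted index pair avoids counting an edge twice when it is shared by two triangles.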

[0119] Among them, determining the feature points of a dynamic object includes:

[0120] Dynamic vector detection is performed by back-projecting the feature points of the current frame that correspond to the two map points connected by a vector edge, and connecting the resulting 3D points to form a new vector edge.

[0121] Based on the dynamic vector detection results, the dynamic degree of change of vector edges is obtained to detect map points of dynamic objects.
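The back-projection step can be sketched as follows, assuming a pinhole camera with known intrinsics and a per-pixel depth measurement (as an RGB-D sensor would provide). The text does not specify the camera model or how depth is obtained, so the intrinsics `fx`, `fy`, `cx`, `cy`, the function names, and the sample values are assumptions.

```python
def back_project(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth in meters into camera coordinates (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def vector_edge(p, q):
    """Vector from 3D point p to 3D point q."""
    return tuple(b - a for a, b in zip(p, q))


# Illustrative intrinsics and measurements, not values from the patent.
fx = fy = 500.0
cx, cy = 320.0, 240.0
m_new = back_project(320.0, 240.0, 2.0, fx, fy, cx, cy)  # feature matched to map point M
n_new = back_project(420.0, 240.0, 2.0, fx, fy, cx, cy)  # feature matched to map point N
edge_new = vector_edge(m_new, n_new)                     # plays the role of M'N'
print(edge_new)
```

The resulting points are in the camera frame. Comparing edge lengths is unaffected by the choice of frame, but comparing edge directions with the map edge MN would require expressing both edges in the same frame.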

[0122] In the above embodiments, dynamic vector detection is implemented by a dynamic vector detection function, which is:

[0123]

[0124] In the formula, MN denotes a vector edge of the triangle, and M'N' denotes the vector edge formed by connecting the 3D points generated by back-projecting the feature points of the current frame; the formula as a whole is the dynamic vector detection function.

[0125] In the above embodiments, obtaining the degree of dynamic change of the vector edge includes: calculating the length change of the vector edge using the L2 norm, calculating the angle change of the vector edge using cosine similarity, and obtaining the degree of dynamic change of the vector edge.

[0126] If the length of the vector edge does not change, the criterion based on the L2 norm should be met;

[0127] If no change in the angle of the vector edge occurs, the judgment condition based on cosine similarity should be met;

[0128] Dynamic vectors are determined based on two criteria. If a vector edge is marked as a dynamic vector, the dynamic weights of the two map points connected by that dynamic vector are increased.
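The two criteria and the weight update can be sketched as follows. This is a minimal illustration under several assumptions: the threshold values are invented, an edge is treated as dynamic when either criterion fails, zero-length edges are treated as unreliable, and each dynamic edge adds 1 to the weight of both endpoints. None of these details are specified in the text, and the function names are hypothetical.

```python
import math


def is_dynamic_edge(mn, mn_new, eps_dmin=0.9, eps_dmax=1.1, eps_cos=0.98):
    """Return True if the edge changed in length or direction beyond the thresholds."""
    len_old = math.sqrt(sum(c * c for c in mn))
    len_new = math.sqrt(sum(c * c for c in mn_new))
    if len_old == 0.0 or len_new == 0.0:
        return True  # degenerate edge: treated as unreliable in this sketch
    ratio = len_new / len_old
    cos_sim = sum(a * b for a, b in zip(mn, mn_new)) / (len_old * len_new)
    length_ok = eps_dmin <= ratio <= eps_dmax  # L2-norm criterion
    angle_ok = cos_sim >= eps_cos              # cosine-similarity criterion
    return not (length_ok and angle_ok)


def accumulate_dynamic_weights(edges, old_points, new_points):
    """Increase the dynamic weight of both endpoints of every dynamic edge."""
    weights = {pid: 0 for pid in old_points}
    for m, n in edges:
        mn = tuple(b - a for a, b in zip(old_points[m], old_points[n]))
        mn_new = tuple(b - a for a, b in zip(new_points[m], new_points[n]))
        if is_dynamic_edge(mn, mn_new):
            weights[m] += 1
            weights[n] += 1
    return weights


# Toy data: point 3 moves between frames, the other points stay fixed.
old_points = {0: (0.0, 0.0, 2.0), 1: (1.0, 0.0, 2.0), 2: (0.0, 1.0, 2.0), 3: (1.0, 1.0, 2.0)}
new_points = dict(old_points)
new_points[3] = (1.5, 1.5, 2.0)
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
weights = accumulate_dynamic_weights(edges, old_points, new_points)
print(weights)
```

In this toy case the moving point collects weight from every edge attached to it, while its static neighbors collect weight only from the edges they share with it, which is what lets the accumulated weight separate dynamic map points from static ones.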

[0129] In the above embodiments, the determination criterion based on the L2 norm is:

[0130]

[0131] The criteria for determining cosine similarity are:

[0132]

[0133] In the formulas, $\varepsilon_{d\max}$ and $\varepsilon_{d\min}$ are the maximum and minimum values of the interval for the ratio of the vectors' 2-norms, and $\varepsilon_{\cos}$ is the cosine similarity threshold.
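The two criteria themselves are not reproduced in this text. Based only on the symbol definitions above, one plausible form is sketched below; the orientation of the ratio (new edge over old edge) and the use of non-strict inequalities are assumptions.

```latex
\varepsilon_{d\min} \le \frac{\lVert M'N' \rVert_2}{\lVert MN \rVert_2} \le \varepsilon_{d\max},
\qquad
\frac{MN \cdot M'N'}{\lVert MN \rVert_2 \, \lVert M'N' \rVert_2} \ge \varepsilon_{\cos}
```

Under this reading, a static edge keeps its length ratio close to 1 and its cosine similarity close to 1 between the two frames.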

[0134] The system provided in this embodiment is used to execute the above-described method embodiments. For specific processes and details, please refer to the above embodiments, which will not be repeated here.

[0135] In one embodiment of the present invention, a computing device is provided, which can be a terminal and may include: a processor, a communication interface, memory, a display screen, and an input device. The processor, communication interface, and memory communicate with each other via a communication bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and internal memory. The non-volatile storage medium stores an operating system and a computer program. When the computer program is executed by the processor, it implements a visual SLAM method for dynamic scenes. The internal memory provides an environment for the operation of the operating system and computer program in the non-volatile storage medium. The communication interface is used for wired or wireless communication with external terminals. Wireless communication can be achieved through Wi-Fi, a carrier network, NFC (Near Field Communication), or other technologies. The display screen can be a liquid crystal display or an e-ink display. The input device can be a touch layer covering the display screen, or buttons, a trackball, or a touchpad mounted on the casing of the computing device, or an external keyboard, touchpad, or mouse. The processor can call logical instructions stored in the memory.

[0136] Furthermore, the logical instructions in the aforementioned memory can be implemented as software functional units and sold or used as independent products, and can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0137] In one embodiment of the present invention, a computer program product is provided, the computer program product including a computer program stored on a non-transitory computer-readable storage medium, the computer program including program instructions, and when the program instructions are executed by a computer, the computer is able to perform the methods provided in the above-described method embodiments.

[0138] In one embodiment of the present invention, a non-transitory computer-readable storage medium is provided, which stores server instructions that cause a computer to perform the methods provided in the above embodiments.

[0139] The computer-readable storage medium provided in the above embodiments has a similar implementation principle and technical effect to the above method embodiments, and will not be described again here.

[0140] This invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce a means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0141] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including an instruction means that implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0142] These computer program instructions may also be loaded onto a computer or other programmable data processing device to cause a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0143] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A visual SLAM method for dynamic scenes, characterized in that it comprises:
dividing objects in each acquired frame image into dynamic objects and static objects based on semantic information, and obtaining corresponding image masks;
performing ORB feature extraction on the acquired images, performing feature matching on key points of two adjacent frames based on the extracted feature points, and at the same time performing motion state detection based on the image masks;
determining the motion attributes of the feature points according to whether their coordinates fall within the segmented region, making the final determination of the motion attributes of the feature points in combination with the Delaunay triangulation method, detecting potential dynamic feature point outliers, and obtaining absolute static feature points, which includes: connecting the feature points in the image into triangles, determining the feature points of dynamic objects by comparing the sides of corresponding triangles in two adjacent frames, and removing the feature points belonging to dynamic objects from the connected graph constructed by the triangulation method, so as to remove the dynamic regions from the image; wherein determining the feature points of dynamic objects includes: performing dynamic vector detection by back-projecting the feature points of the current frame corresponding to the two map points connected by a vector edge and connecting the resulting 3D points to form a new vector edge; and, based on the dynamic vector detection results, obtaining the degree of dynamic change of the vector edge to detect the map points of dynamic objects;
performing pose estimation using the absolute static feature points to obtain a high-precision positioning result;
wherein the dynamic vector detection is implemented by a dynamic vector detection function, which is as follows: ; in the formula, MN is a vector edge of the triangle, and M'N' is the vector edge formed by connecting the 3D points generated by back-projecting the feature points of the current frame, the formula as a whole being the dynamic vector detection function;
obtaining the degree of dynamic change of the vector edge includes: calculating the length change of the vector edge using the L2 norm and calculating the angle change of the vector edge using cosine similarity, thereby obtaining the degree of dynamic change of the vector edge;
dividing objects in each acquired frame image into dynamic objects and static objects based on semantic information includes: performing segmentation using the lightweight semantic segmentation network PP-LiteSeg; and marking dynamic and static objects for the feature points extracted from the acquired image and merging all channels into a single channel, thereby obtaining a segmentation mask of the dynamic and static objects in the scene image;
obtaining the degree of dynamic change of the vector edge further includes: if the length of the vector edge does not change, the criterion based on the L2 norm should be met; if the angle of the vector edge does not change, the criterion based on cosine similarity should be met; dynamic vectors are determined based on the two criteria, and if a vector edge is marked as a dynamic vector, the dynamic weights of the two map points connected by that dynamic vector are increased.

2. The visual SLAM method for dynamic scenes as described in claim 1, characterized in that the determination criterion based on the L2 norm is: ; the determination criterion based on cosine similarity is: ; in the formulas, $\varepsilon_{d\max}$ and $\varepsilon_{d\min}$ are the maximum and minimum values of the interval for the ratio of the vectors' 2-norms, and $\varepsilon_{\cos}$ is the cosine similarity threshold.

3. A visual SLAM system for dynamic scenes, used to implement the visual SLAM method for dynamic scenes as described in any one of claims 1 to 2, characterized in that it comprises: a first processing module, which divides the objects in each acquired frame image into dynamic objects and static objects based on semantic information and obtains the corresponding image masks; a second processing module, which performs ORB feature extraction on the acquired images, performs feature matching on key points of two adjacent frames based on the extracted feature points, and performs motion state detection based on the image masks; a third processing module, which determines the motion attributes of the feature points according to whether their coordinates fall within the segmented region, makes the final determination of the motion attributes of the feature points in combination with the Delaunay triangulation method, detects potential dynamic feature point outliers, and obtains absolute static feature points; and an output module, which performs pose estimation using the absolute static feature points to obtain a high-precision positioning result.

4. A computer-readable storage medium storing one or more programs, characterized in that the one or more programs include instructions that, when executed by a computing device, cause the computing device to perform any of the methods described in claims 1 to 2.

5. A computing device, characterized in that it comprises: one or more processors, a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods described in claims 1 to 2.