A visual SLAM method based on image matching
By combining ORB features and neural network image matching, dynamic points are eliminated and the RANSAC process is optimized, solving the problems of feature sparsity and matching distortion in visual SLAM in dynamic environments, and achieving high-precision and stable pose estimation and map construction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HEFEI UNIV OF TECH
- Filing Date
- 2026-03-24
- Publication Date
- 2026-06-19
Figure CN122244444A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer vision technology, and more specifically to a visual SLAM method based on image matching.
Background Technology
[0002] For dynamic scenes in real-world environments, mainstream dynamic SLAM methods eliminate dynamic points through semantic segmentation or motion consistency detection. Methods such as DynaSLAM and DS-SLAM combine deep learning networks to identify dynamic regions, thereby improving the accuracy of static point matching. However, under conditions of large angles or high-speed motion, these methods often fail in dynamic detection and feature tracking due to severe image distortion and increased parallax, causing keyframe matching drift or sparsity. While deep semantic detection models can identify dynamic objects, they are computationally expensive, and directly embedding them into SLAM systems reduces real-time performance; while geometric methods based on optical flow consistency or fundamental matrix constraints are prone to misclassification in low-texture regions.
[0003] The prior art document CN119863619A discloses a visual SLAM method suitable for dynamic environments. This application discloses a visual SLAM method based on semantic information and suitable for dynamic environments, including: using a deep learning model to perform semantic segmentation of images to identify dynamic regions; performing feature matching based on the confidence of SuperPoint feature points; ensuring the accuracy and robustness of subsequent pose estimation through constant velocity motion model tracking, keyframe reference tracking, 2D and 3D tracking and relocalization based on SuperGlue; and constructing a sparse map of the 3D environment.
[0004] The semantic-segmentation-based monocular vision model used in the aforementioned patent struggles to adjust its matching strategy dynamically for varying texture densities, so matching remains unstable in scenes with large-angle turns. It therefore fails to address the combined problem of "feature sparsity + matching distortion" in dynamic environments. The method proposed in this application achieves dynamic enhancement of local features by integrating traditional ORB features with neural network image matching features, balancing real-time performance and robustness. By combining YOLO detection with multi-view geometric reprojection to remove dynamic points, matching distortion is effectively reduced.
Summary of the Invention
[0005] To address the problems existing in the background technology, the present invention aims to propose a visual SLAM method based on image matching. By utilizing the prior knowledge of dynamic objects in YOLOv5 and combining it with the image matching results of neural networks, the method enhances the feature recognition and matching capabilities of visual SLAM in dynamic motion scenes, effectively maintains the operating efficiency of neural networks, and ensures the operating accuracy of the dynamic SLAM system while possessing certain real-time performance and versatility.
[0006] A visual SLAM method based on image matching includes the following steps:
S1: Preprocess the image, including but not limited to distortion correction, noise suppression, constructing an image pyramid, and performing ORB feature extraction. Then match the feature points extracted from the reference frame and the current frame using Hamming distance to obtain the matching point set M_ORB;
S2: Run the YOLOv5 neural network to obtain the prior dynamic objects in the image sequence, remove the dynamic object points, and calculate the initial pose changes;
S3: Divide the image into a grid, establish the feature point spatial density distribution map D(x,y) of the reference frame, identify sparse regions and find the corresponding projection matching regions, call JamMa image matching, enhance the feature-sparse regions, perform non-maximum suppression and redundancy detection, convert the JamMa image matching pairs into ORB features based on corner points, and merge candidate matches;
S4: Use NPnP pose estimation, use the improved RANSAC method to calculate the weighted projection error, remove matching points with large reprojection errors, and re-optimize until convergence or until the matching is stable;
S5: Perform post-processing operations, including local mapping and loop closure detection, to complete the mapping and localization functions.
Further, in S1, constructing the image pyramid involves building a multi-level image pyramid for each frame based on the multi-scale principle of pyramids, preserving rich structural information at different scales.
[0007] Furthermore, in S1, the ORB feature extraction of the image includes: extracting feature points and feature descriptors in each layer of the image using the ORB algorithm, and recording the scale, orientation and response value of the features.
[0008] Furthermore, S2 includes the following steps: A YOLOv5 deep neural network model performs target detection on the image sequence to obtain prior region information of dynamic objects in the image. Feature points falling into the dynamic target regions are removed from the feature matching results, and reliable matching points in the static background region are retained, reducing the error introduced by the motion of dynamic objects. Using the retained static feature point pairs, combined with the matching results, the PnP or EPnP algorithm performs initial pose estimation to obtain the preliminary pose transformation of the current frame relative to the reference frame.
[0009] Furthermore, S3 includes the following steps:
S31: Divide the input reference frame image into N_x × N_y local grid cells and establish a feature density distribution model D(x,y). Calculate the global average feature density P of the reference frame and set a feature density threshold P_min;
S32: Based on the ORB feature matching results, count the number of matching points in each grid cell. When the local matching density is lower than the set threshold P_min, label the grid cell as a sparse region G_ri, i=1,2,…,k;
S33: Based on the initial pose estimation (R0, T0) in S2 and the camera intrinsic parameters K, perform geometric projection on the sparse regions to obtain the corresponding aligned regions G_ti, i=1,2,…,k, in the current frame;
S34: Extract image features from the reference frame and the current frame, and call the JamMa neural matching module based on the ORB feature density distribution model D(x,y) to perform feature alignment and matching, obtaining the matching point set M_JM and its confidence matrix;
S35: Perform non-maximum suppression on the neural matching results and filter dynamic object regions using the target detection model YOLOv5. On this basis, appropriately relax the suppression conditions for feature-sparse regions to retain more high-confidence static matching points;
S36: Perform spatial redundancy detection between the JamMa and ORB matching point sets. For each JamMa matching point, find the nearest neighbor in the ORB space. Determine the consistent matching points of both channels through Euclidean distance constraints to form the matching point set M_RD;
S37: Re-encode the neural network image matching results into a feature format consistent with the ORB descriptor, and fuse them based on confidence weights to form a unified matching point set M_FS, used for subsequent pose optimization and map updates.
[0010] Furthermore, S34 includes the following steps:
SS1: Image data is fed into the JamMa network backbone module, which is based on the ConvNeXtV2 structure, to extract multi-scale low-resolution feature maps that characterize image structure information at different spatial scales. The ORB feature density distribution model D(x,y) is aligned with these feature maps to obtain the guiding weight d(x,y);
SS2: The two-dimensional feature map and its corresponding guiding weights are expanded into one-dimensional sequences along four directions (horizontal and vertical, each forward and reverse) so that global sequence modeling can be performed through the state space model;
SS3: The feature sequence is input into the multi-layer state space model for feature propagation and residual fusion. The output is reconstructed into two-dimensional space and recombined using the ORB prior weights d(x,y). Cross-layer adaptive feature enhancement is achieved through gated linear units;
SS4: The enhanced features are flattened into a set of feature vectors, the cosine similarity matrix between the features of the two frames is calculated, the index with the maximum response is selected to generate matching point pairs, and low-confidence matches are filtered out based on a confidence threshold.
[0011] Furthermore, the improved RANSAC in S4 includes the following steps:
S41: Define a continuous adjustment factor. When the number of matching points N in the fused matching point set M_RD satisfies N ≥ 600, the adjustment factor equals 1, indicating that the JamMa network and ORB detection exhibit good commonality and robustness; otherwise, the adjustment factor is less than 1;
S42: Set the number of samples and calculate the adaptive adjustment factor from the number of matching points N in M_RD. The sampling strategy is adjusted accordingly: the fused matching points are sampled proportionally, with priority given to the set of matching points detected simultaneously by JamMa and ORB, and the remaining samples are selected randomly from the remaining matching points of M_FS. Based on the sampled matching points, the initial camera pose is solved using the NPnP method;
S43: Project the 3D map points corresponding to the fused matching points onto the current camera coordinate system, and calculate the reprojection error of each observation point;
S44: Determine the geometric consistency of the matching points against a reprojection error threshold. When the reprojection error of a matching point is within the threshold, the matching point is considered an inlier;
S45: To further integrate JamMa and ORB feature information, fusion confidence weights based on the feature source and the adjustment factor are introduced. For matching points detected by both JamMa and ORB, the weight increases with the adjustment factor, while the weight decreases for matching points detected by a single method. The camera pose is then optimized by minimizing the weighted projection error function

$$E = \sum_{i} w_i \left\| \pi(T P_i) - p_i \right\|^2$$

where E is the total weighted projection error; w_i is the fusion confidence weight of the i-th matching point; P_i is the three-dimensional spatial point corresponding to the i-th matching point; T is the rigid-body transformation; π(·) is the camera projection function; p_i is the actual observed pixel coordinates of the i-th matching point in the current image; and ‖·‖ is the Euclidean norm, used to calculate the reprojection error between the projected point and the observed point;
S46: Update the camera pose based on the result of minimizing the weighted projection error, and repeat the above process until the pose update amount is less than the set threshold or the maximum number of iterations is reached.
[0012] Furthermore, in S5, the local mapping includes inputting the optimized pose of the current frame and its corresponding map points into the local mapping module, performing triangulation reconstruction on the newly generated feature points, and filtering out unstable points based on the principles of disparity and visibility. The local mapping module uses the local BA method to jointly optimize the current key frame and its co-view key frames to minimize the reprojection error between multiple frames, thereby improving the accuracy and consistency of the local map structure.
[0013] Furthermore, in S5, the loop closure detection includes: performing global similarity retrieval on the keyframe sequence using a bag-of-features model or a deep learning feature encoding method; when a loop closure candidate frame is detected, using geometric consistency verification to determine whether there is a true loop closure relationship between the two frames; if the verification passes, triggering the loop closure correction module to construct a global pose graph; and eliminating the cumulative loop closure error through global graph optimization to achieve consistent alignment and drift correction of the global map.
[0014] The beneficial effects of this invention are as follows: This invention utilizes ORB features extracted from visual SLAM tasks and neural network image matching information. It performs local mesh partitioning and feature density modeling of the image based on ORB feature points, selectively enhancing features in sparse regions. This effectively improves the matching accuracy of dynamic SLAM systems without sacrificing real-time performance. Furthermore, this invention fully leverages the visual SLAM workflow, organically combining features required for localization with features needed for dynamic object removal, optimizing the RANSAC process, and achieving high accuracy in pose estimation and stable map construction.
[0015] This invention uses a fused matching mechanism to solve the matching distortion that traditional SLAM is prone to under large-angle turns or in low-texture areas, which significantly improves the robustness and reliability of visual SLAM systems in dynamic scenes.
Attached Figure Description
[0016] Figure 1 is an overall flowchart of a visual SLAM method based on image matching according to the present invention; Figure 2 shows the distribution of feature points extracted from KITTI image data using ORB features; Figure 3 shows the distribution of feature points obtained from KITTI image data using JamMa image matching; Figure 4 shows the feature point matching result for KITTI image data using JamMa image matching.
Detailed Implementation
[0017] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.
[0018] A visual SLAM method based on image matching, as shown in Figure 1, includes the following steps:
S1: Preprocessing operations are performed on the input image sequence, including distortion correction, brightness normalization, and noise suppression. Based on the multi-scale pyramid principle, a multi-level image pyramid is constructed for each frame to retain rich structural information at different scales. Feature points and feature descriptors are extracted from each layer using the ORB algorithm, and the scale, orientation, and response value of the features are recorded. For the ORB feature sets extracted from the current frame and the reference frame, the similarity between feature descriptors is calculated using Hamming distance to obtain the matching point set M_ORB. The distribution of feature points is shown in Figure 2. This provides a foundation for feature association in subsequent pose estimation.
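The Hamming-distance matching in S1 can be illustrated with a minimal sketch. It assumes each binary descriptor is stored as a Python integer, and it adds a mutual nearest-neighbor check and a `max_dist` cut-off; both are illustrative choices that the description does not state. The names `hamming` and `match_hamming` are hypothetical.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")


def match_hamming(ref_desc, cur_desc, max_dist=64):
    """Brute-force mutual nearest-neighbor matching on Hamming distance.

    ref_desc, cur_desc: lists of ints, one int per binary descriptor.
    max_dist: assumed cut-off, not a value taken from the description.
    Returns a list of (ref_index, cur_index, distance) tuples.
    """
    if not ref_desc or not cur_desc:
        return []
    # Closest current-frame descriptor for every reference descriptor.
    best_for_ref = [
        min(range(len(cur_desc)), key=lambda j: hamming(d, cur_desc[j]))
        for d in ref_desc
    ]
    # Closest reference descriptor for every current-frame descriptor.
    best_for_cur = [
        min(range(len(ref_desc)), key=lambda i: hamming(ref_desc[i], d))
        for d in cur_desc
    ]
    matches = []
    for i, j in enumerate(best_for_ref):
        dist = hamming(ref_desc[i], cur_desc[j])
        # Keep the pair only if both sides agree and the distance is small.
        if best_for_cur[j] == i and dist <= max_dist:
            matches.append((i, j, dist))
    return matches
```

In a real pipeline the descriptors would typically be 256-bit ORB descriptors and the search would be accelerated; the brute-force loop here only makes the matching rule explicit.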
[0019] S2: A YOLOv5 deep neural network model is used to perform target detection on the image sequence to obtain prior region information of dynamic objects in the image. By removing feature points falling into the dynamic target region from the feature matching results, reliable matching points in the static background region are retained to reduce error interference introduced by the motion of dynamic objects. Using the retained static feature point pairs, combined with the matching results, the PnP or NPnP algorithm is used to perform initial pose estimation to obtain the preliminary pose transformation of the current frame relative to the reference frame.
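The dynamic-point removal in S2 can be sketched as a simple box test. The sketch assumes the detector output has already been reduced to axis-aligned boxes `(x_min, y_min, x_max, y_max)` for each frame and that a match is dropped when either endpoint lies inside a box; the function names are hypothetical and the detector itself is not shown.

```python
def in_any_box(pt, boxes):
    """True if the 2D point lies inside at least one axis-aligned box."""
    x, y = pt
    return any(x0 <= x <= x1 and y0 <= y <= y1 for x0, y0, x1, y1 in boxes)


def remove_dynamic_matches(matches, ref_pts, cur_pts, ref_boxes, cur_boxes):
    """Keep only matches whose two keypoints lie outside all dynamic boxes.

    matches: list of (ref_index, cur_index) pairs.
    ref_pts, cur_pts: keypoint pixel coordinates for each frame.
    ref_boxes, cur_boxes: assumed detector output, one list per frame.
    """
    return [
        (i, j)
        for i, j in matches
        if not in_any_box(ref_pts[i], ref_boxes)
        and not in_any_box(cur_pts[j], cur_boxes)
    ]
```

The surviving static pairs are what the subsequent PnP-style initial pose estimation would consume.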
[0020] S3: Divide the image into N_x × N_y regular grid cells, establish a feature density distribution model D(x,y), and calculate the number of matching points within each grid cell. When the number of matching points within a grid cell is less than a preset threshold, the region is marked as a sparse region. Image data and their corresponding feature density distribution models from the reference frame and the current frame are extracted, and the JamMa algorithm is called for high-precision image matching. The JamMa matching results enhance the features of sparse regions. Duplicate response points are removed through non-maximum suppression, and matching pairs that overlap with or are redundant with the original ORB matching results are filtered out, retaining non-redundant matching pairs. The JamMa matching results are converted to ORB description format based on corner points and assigned weight scores according to the matching confidence. Finally, the ORB and JamMa matching results are weighted and fused to form an enhanced matching candidate set M_FS.
[0021] S4: A multi-view geometric consistency method is employed to further detect and eliminate potential dynamic matching points. Based on the candidate matching results, PnP or NPnP algorithms are used for initial pose estimation, followed by pose refinement based on a weighted reprojection error model. The confidence scores of JamMa and ORB feature matching are used as weighting factors to weight and sum the reprojection errors of each matching point, constructing a weighted least squares optimization objective function. The Levenberg-Marquardt iterative algorithm is used to optimize the pose of the current frame. During iteration, the damping coefficient is adjusted according to error changes, gradually eliminating matching points with large reprojection errors until the reprojection errors converge or the matching set stabilizes, resulting in high-precision and robust camera pose estimation.
[0022] S5: After completing weighted pose optimization, the system enters the post-processing stage, performing local mapping and loop closure detection operations to establish a consistent global map. The specific steps are as follows:
[0023] Local mapping: The optimized pose of the current frame and its corresponding map points are input into the local mapping module. The newly generated feature points are triangulated and reconstructed, and unstable points are filtered out based on disparity and visibility principles. The local mapping module uses the local BA method to jointly optimize the current keyframe and its co-view keyframes, minimizing the reprojection error between multiple frames to improve the accuracy and consistency of the local map structure.
[0024] Loop closure detection: Global similarity retrieval of keyframe sequences is performed using a bag-of-words model or deep learning feature encoding methods. When a loop closure candidate frame is detected, geometric consistency verification is used to determine whether a true loop closure relationship exists between the two frames. If the verification passes, the loop closure correction module is triggered to construct a global pose graph, and global graph optimization is used to eliminate accumulated loop closure errors, achieving consistent alignment and drift correction of the global map.
[0025] Global Map: Records the pose, observable map points, and their confidence distribution for each keyframe. During the localization phase, new input frames undergo fast ORB feature extraction and matching, and relocalization is performed in the global map. PnP-RANSAC-based pose estimation is then used to quickly recover the camera pose. When the system detects that the number of matches with existing keyframes exceeds a threshold, the current frame is upgraded to a new keyframe, triggering a local map update mechanism.
[0026] S3 includes the following steps: S31: Divide the input reference frame image into N_x × N_y local grid cells and establish a feature density distribution network model D(x,y). Calculate the global average feature density P of the reference frame and set a feature density threshold P_min.
[0027] S32: Read the matching point set M_ORB output by the ORB feature matching module and map each matching point to the reference frame grid based on its coordinates. The system counts the number of matching points n_i contained in each grid cell and compares it with the preset threshold P_min. If the number of matches in a grid cell is less than the threshold, the region is considered sparse and is added to the sparse region set G_ri, i=1,2,…,k.
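Steps S31 and S32 amount to a histogram over grid cells followed by a threshold test. The sketch below assumes matched points are given as pixel coordinates inside the image and that `p_min` is supplied by the caller, since the description does not say how P_min is derived from the global average density P. The function name `sparse_cells` is hypothetical.

```python
def sparse_cells(points, width, height, nx, ny, p_min):
    """Count matched points per grid cell and list the sparse cells.

    points: (x, y) pixel coordinates with 0 <= x <= width, 0 <= y <= height.
    nx, ny: number of grid columns and rows (N_x and N_y in the text).
    p_min: per-cell count threshold (P_min in the text), chosen by the caller.
    Returns (counts, sparse): counts[row][col] and a list of (row, col).
    """
    counts = [[0] * nx for _ in range(ny)]
    cell_w, cell_h = width / nx, height / ny
    for x, y in points:
        # Clamp so points on the right or bottom border fall in the last cell.
        col = min(int(x / cell_w), nx - 1)
        row = min(int(y / cell_h), ny - 1)
        counts[row][col] += 1
    sparse = [
        (r, c) for r in range(ny) for c in range(nx) if counts[r][c] < p_min
    ]
    return counts, sparse
```

The returned `sparse` list plays the role of the sparse region set G_ri; projecting those cells into the current frame (S33) additionally needs depth or map-point information that this sketch does not model.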
[0028] S33: Based on the initial camera pose change (R0, T0) estimated by the system and the camera intrinsic parameter matrix K, the sparse regions G_ri of the reference frame are projected onto the image coordinate system of the current frame, yielding the set of aligned regions G_ti in the current frame.
[0029] S34: Extract image data from the reference frame and the current frame, and input the image information together with the ORB feature density distribution network D(x,y) into the feature matching module of the JamMa model. Feature extraction, state propagation and similarity calculation are performed to obtain the matching point set M_JM of the region and its confidence matrix mconf. The resulting distribution of feature points is shown in Figure 3.
[0030] S35: Non-maximum suppression is applied to the matching results M_JM, retaining only the matching points with the highest local confidence. The suppression conditions are appropriately relaxed for feature-sparse regions and their corresponding regions to increase the number of effective matches. The system calls the YOLOv5 dynamic object detection module to identify movable object regions in the image and uses the detected dynamic masks to filter out matching pairs that fall on dynamic objects, retaining the static ones. The filtered results form a stable neural matching point set M_JM. The image matching results are shown in Figure 4.
[0031] S36: Perform spatial redundancy detection on the ORB matching point set and the JamMa matching point set. The system calculates the Euclidean distance d(p_i, q_j) between adjacent point pairs in the two matching sets. When the distance is less than a preset threshold, the matching point pair is considered redundant, and the redundant matching points form the matching point set M_RD.
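The redundancy check in S36 can be sketched as a nearest-neighbor search with a distance gate. The sketch assumes both point sets are given as pixel coordinates in the same frame; the threshold value `max_dist` and the function name `consistent_matches` are illustrative assumptions.

```python
import math


def consistent_matches(jamma_pts, orb_pts, max_dist=2.0):
    """Pair each JamMa point with its nearest ORB point if they are close.

    jamma_pts, orb_pts: lists of (x, y) pixel coordinates in the same image.
    max_dist: assumed pixel threshold; the description only says "preset".
    Returns (jamma_index, orb_index) pairs, i.e. candidates for M_RD.
    """
    pairs = []
    if not orb_pts:
        return pairs
    for a, p in enumerate(jamma_pts):
        # Nearest ORB point by Euclidean distance (brute force).
        b = min(range(len(orb_pts)), key=lambda k: math.dist(p, orb_pts[k]))
        if math.dist(p, orb_pts[b]) < max_dist:
            pairs.append((a, b))
    return pairs
```

A spatial index such as a k-d tree would replace the brute-force search when the point sets are large.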
[0032] S37: The matching point set M_JM is re-encoded into a data structure consistent with the ORB feature description format so that the SLAM system can recognize it. The unified matching points contain location, scale, and confidence information, and are weighted and merged with the ORB matching results to form the final fused matching point set M_FS.
[0033] S34 includes the following steps: N1: The reference frame image and the current frame image are input into the backbone module of the JamMa network. This module is based on the ConvNeXtV2 architecture and consists of multiple layers of convolutional, normalization, and nonlinear activation units, capable of extracting deep features with semantic information from the input image. The system performs downsampling operations on the input image, progressively extracting local structure and global texture information. The backbone network outputs two sets of low-resolution feature maps, denoted as feat0 for the corresponding reference frame and feat1 for the corresponding current frame, with a resolution of 1 / 8 of the original image. A linear projection layer is used to perform channel mapping and normalization on these two sets of features to keep their channel number consistent. In this process, the feature depth of the feature density distribution network D(x,y) is spatially aligned with the image data to obtain the guiding weight d(x,y) so that the result can be mapped to the feature map.
[0034] N2: The features feat0 and feat1 obtained in N1 are subjected to directional expansion and serialization together with their guiding weights d(x,y). The system traverses each feature map row by row and column by column, expanding the original two-dimensional features into one-dimensional feature sequences. The expansion is performed in four directions: the horizontal left-to-right scan sequence and its reverse, and the vertical top-to-bottom scan sequence and its reverse. The four directional sequences are then merged to form a unified input sequence set feat0_s.
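The four-direction expansion in N2 can be made concrete with a small sketch. It assumes the feature map is a plain list of rows and that each reverse scan is simply the reversed forward sequence; the dictionary keys and the function name `four_direction_scan` are hypothetical.

```python
def four_direction_scan(fmap):
    """Flatten a 2D grid (list of rows) into four 1D scan sequences.

    Returns a dict with:
      "lr": row-by-row, left to right
      "rl": the reverse of "lr"
      "tb": column-by-column, top to bottom
      "bt": the reverse of "tb"
    """
    rows = [list(r) for r in fmap]
    left_right = [v for r in rows for v in r]
    # zip(*rows) iterates over columns; it yields nothing for an empty map.
    top_bottom = [v for col in zip(*rows) for v in col]
    return {
        "lr": left_right,
        "rl": left_right[::-1],
        "tb": top_bottom,
        "bt": top_bottom[::-1],
    }
```

The same function can be applied to the guiding weights d(x,y) so that features and weights stay aligned position by position in every sequence.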
[0035] N3: The direction sequences feat_s output by N2 are fed into the JamMa state-space module MambaBlock for multi-layer propagation and fusion. This module consists of 8 state-space units, each containing a normalization layer, a linear mapping layer, and a gated fusion unit, used to perform dynamic feature propagation in the sequence dimension. During operation, the module normalizes the input sequence to stabilize the feature distribution in different directions. Feature information is passed along the time step dimension through the state-space propagation layer to achieve global dependency modeling. The output of each propagation layer is added to the input features through the residual path to form the enhanced output. After multi-layer propagation, the model obtains comprehensive features that fuse information from multiple directions and recombines them into a two-dimensional feature map structure. After the two-dimensional features are restored, the system uses the synchronously unfolded and aligned density distribution map d(x,y) to perform pixel-by-pixel weighted enhancement of the features, so that sparse feature regions obtain higher responses. The multi-layer outputs are weighted and fused through the gated linear unit to obtain the enhanced features feat_c0 and feat_c1.
[0036] N4: The reference frame features feat_c0 and the current frame features feat_c1 enhanced by N3 are flattened into sets of feature vectors, and the similarity matrix S between them is calculated. During runtime, linear projection and normalization operations are first performed on the features of the two frames to ensure scale consistency across different channel dimensions. The cosine similarity of the features between the two frames is calculated using matrix multiplication, resulting in a three-dimensional similarity matrix S, where each element represents the degree of similarity between a point in the reference frame and a point in the current frame. The system applies Softmax normalization to the similarity matrix to obtain a confidence matrix mconf. For each feature point, the position with the highest confidence is selected as the matching result, and low-confidence matches are filtered out based on a threshold thr. The filtered matching pairs are mapped back to the feature plane through indexing to obtain the matching point coordinates mkpts0 and mkpts1, which together constitute the matching point set.
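The similarity and selection logic of N4 can be sketched in a few lines. This is a simplified illustration, not the JamMa implementation: it assumes a row-wise Softmax with a `temperature` parameter, operates on plain Python lists, and omits the linear projection. The names `match_features`, `thr` defaults, and `temperature` are assumptions introduced here.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def softmax(row, temperature):
    """Numerically stable softmax over one row of similarities."""
    m = max(row)
    e = [math.exp((x - m) / temperature) for x in row]
    s = sum(e)
    return [x / s for x in e]


def match_features(feat0, feat1, thr=0.2, temperature=0.1):
    """Match each reference feature to its most confident current feature.

    feat0, feat1: lists of equal-length feature vectors.
    Returns (index0, index1, confidence) for matches with confidence >= thr.
    """
    if not feat1:
        return []
    f0 = [l2_normalize(v) for v in feat0]
    f1 = [l2_normalize(v) for v in feat1]
    matches = []
    for i, a in enumerate(f0):
        # Cosine similarity equals the dot product of unit vectors.
        sims = [sum(x * y for x, y in zip(a, b)) for b in f1]
        conf = softmax(sims, temperature)
        j = max(range(len(conf)), key=conf.__getitem__)
        if conf[j] >= thr:
            matches.append((i, j, conf[j]))
    return matches
```

Mapping the returned indices back to pixel coordinates (mkpts0 and mkpts1 in the text) would multiply the feature-map position by the downsampling factor, which is 8 for the 1/8-resolution maps described in N1.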
[0037] At this point, feature point matching pairs suitable for pose estimation have been obtained.
The improved NPnP+RANSAC method in S4 includes the following steps:
S41: In the front-end matching and fusion stage, a continuous adjustment factor is introduced to adaptively adjust the participation of the JamMa and ORB matching results in the overall matching constraints. Let the set of matching points detected simultaneously by JamMa and ORB be M_RD and its number of matching points be N. When N ≥ 600, the adjustment factor equals 1; the joint matching points then provide stable, high-confidence geometric constraints and can dominate the pose estimation process. When N < 600, the adjustment factor is less than 1.
[0038] S42: Based on the adjustment factor, an adaptive sampling strategy is adopted. Given a total number of samples s, the sampling proportion is adjusted accordingly: samples are drawn preferentially from the highly consistent matching point set M_RD to enhance the geometric reliability of the sampled point set, and the remaining samples are selected randomly from the remaining matching points of M_FS to ensure the diversity of their distribution in image space. Based on the fused matching points obtained from the sampling and their corresponding 3D map points, the initial camera pose of the current frame is solved using the NPnP method.
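The adaptive sampling of S41 and S42 can be sketched as follows. The text only states that the adjustment factor is 1 at N ≥ 600 and below 1 otherwise, so the linear ramp N / 600 used here is an assumption, as is drawing `round(factor * s)` samples from the joint set. Matches are represented by hashable identifiers, and all function names are hypothetical.

```python
import random


def adjustment_factor(n_joint, n_full=600):
    """Return 1.0 once the joint set holds n_full matches.

    Below that, a linear ramp n_joint / n_full is ASSUMED; the source
    only states that the factor is less than 1 in this case.
    """
    return min(1.0, n_joint / n_full)


def adaptive_sample(m_rd, m_fs, s, rng=None):
    """Draw s matches, preferring the jointly detected set m_rd.

    m_rd: matches found by both JamMa and ORB (hashable identifiers).
    m_fs: the full fused match set, a superset of m_rd.
    rng: optional random.Random instance for reproducibility.
    """
    if rng is None:
        rng = random.Random()
    factor = adjustment_factor(len(m_rd))
    # Share of the sample taken from the joint set (assumed rounding rule).
    k_joint = min(len(m_rd), round(factor * s))
    joint_part = rng.sample(m_rd, k_joint)
    joint_set = set(m_rd)
    remaining = [m for m in m_fs if m not in joint_set]
    k_rest = min(len(remaining), s - k_joint)
    return joint_part + rng.sample(remaining, k_rest)
```

With this rule, a joint set of 600 or more matches supplies the whole minimal sample, which is one way to read the statement that the joint matching points then dominate the pose estimation.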
[0039] S43: After the initial camera pose is obtained, the 3D map points corresponding to the fused matching points are projected onto the current camera imaging plane to obtain the predicted pixel positions. These are compared with the observed 2D positions of the matching points to calculate the reprojection error of each matching point.
[0040] S44: A reprojection error threshold is set and a consistency check is performed on all fused matching points. When the reprojection error of a matching point is within the threshold, the point is considered an inlier and is used for subsequent pose optimization; otherwise, it is considered an outlier, and its influence is reduced or it is removed during the optimization process.
[0041] S45: To further integrate the complementary advantages of JamMa and ORB features, a fusion confidence weight is introduced for each matching point during the pose optimization process. For matching points detected by both JamMa and ORB, the weight increases with the adjustment factor to highlight their dominant role in the optimization process; for matching points detected by a single method, the weight is reduced accordingly to minimize the interference of low-confidence observations on pose estimation. Finally, fine-grained optimization of the camera pose is achieved by minimizing the following weighted projection error function:

$$E = \sum_{i} w_i \left\| \pi(T P_i) - p_i \right\|^2$$

where w_i is the fusion confidence weight of the i-th matching point, P_i is its corresponding three-dimensional spatial point, T is the rigid-body transformation, π(·) is the camera projection function, and p_i is the observed pixel coordinates of the i-th matching point in the current image.
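Steps S43 to S45 can be illustrated by evaluating the weighted error for a fixed pose. The sketch assumes a pinhole camera given as `(fx, fy, cx, cy)`, a pose given as a 3×3 rotation and a translation, squared residuals, a strict threshold comparison, and points in front of the camera; none of these details are fixed by the description. The function names are hypothetical, and the minimization itself is not shown.

```python
import math


def project(K, R, t, P):
    """Pinhole projection of a 3D point P under pose (R, t).

    K: (fx, fy, cx, cy); R: 3x3 rotation as nested lists; t: translation.
    Assumes the transformed point has positive depth.
    """
    Xc = [sum(R[r][c] * P[c] for c in range(3)) + t[r] for r in range(3)]
    fx, fy, cx, cy = K
    return (fx * Xc[0] / Xc[2] + cx, fy * Xc[1] / Xc[2] + cy)


def weighted_reprojection_error(K, R, t, points3d, pixels, weights, tau):
    """Sum of weighted squared reprojection errors over the inliers.

    tau: reprojection error threshold in pixels (inlier if error < tau).
    Returns (total_error, inlier_indices).
    """
    total = 0.0
    inliers = []
    for i, (P, p, w) in enumerate(zip(points3d, pixels, weights)):
        u, v = project(K, R, t, P)
        err = math.hypot(u - p[0], v - p[1])
        if err < tau:
            inliers.append(i)
            total += w * err ** 2
    return total, inliers
```

An optimizer such as Levenberg-Marquardt would call a function like this repeatedly while updating `R` and `t`, stopping when the pose update falls below a threshold or an iteration limit is reached.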
[0042] S46: Update the current camera pose based on the result of minimizing the weighted projection error, and repeat the reprojection error calculation, inlier selection and weighted optimization process until the pose update amount between two adjacent iterations is less than the preset threshold or the maximum number of iterations is reached. Then terminate the iteration and output the final camera pose estimation result.
Claims
1. A visual SLAM method based on image matching, characterized in that it includes the following steps:
S1: Preprocess the image, including but not limited to distortion correction, noise suppression, constructing an image pyramid, and performing ORB feature extraction. Then match the feature points extracted from the reference frame and the current frame using Hamming distance to obtain the matching point set M_ORB;
S2: Run the YOLOv5 neural network to obtain the prior dynamic objects in the image sequence, remove the dynamic object points, and calculate the initial pose changes;
S3: Divide the image into a grid, establish the feature point spatial density distribution map D(x,y) of the reference frame, identify sparse regions and find the corresponding projection matching regions, call JamMa image matching, enhance the feature-sparse regions, perform non-maximum suppression and redundancy detection, convert the JamMa image matching pairs into ORB features based on corner points, and merge candidate matches;
S4: Use NPnP pose estimation, use the improved RANSAC method to calculate the weighted projection error, remove matching points with large reprojection errors, and re-optimize until convergence or until the matching is stable;
S5: Perform post-processing operations, including local mapping and loop closure detection, to complete the mapping and positioning functions.
2. The visual SLAM method based on image matching according to claim 1, characterized in that, in S1, constructing the image pyramid comprises building a multi-level image pyramid for each image frame based on the multi-scale pyramid principle, so that rich structural information is preserved at different scales.
3. The visual SLAM method based on image matching according to claim 1, characterized in that, in S1, the ORB feature extraction comprises: extracting feature points and feature descriptors in each pyramid layer of the image using the ORB algorithm, and recording the scale, orientation and response value of each feature.
4. The visual SLAM method based on image matching according to claim 1, characterized in that S2 includes the following steps: using a YOLOv5 deep neural network model to perform object detection on the image sequence to obtain prior region information of dynamic objects in the image; removing feature points that fall within the dynamic object regions from the feature matching results and retaining reliable matching points in the static background region, so as to reduce the error introduced by the motion of dynamic objects; and using the retained static feature point pairs with the PnP or EPnP algorithm to perform initial pose estimation, obtaining the preliminary pose transformation of the current frame relative to the reference frame.
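The removal of feature points inside dynamic object regions can be illustrated with a short Python sketch. It assumes the detector output is already available as axis-aligned boxes `(x1, y1, x2, y2)`; the function names and sample coordinates are hypothetical, and the detection and PnP steps themselves are not shown.

```python
def in_any_box(pt, boxes):
    """True if pixel pt = (x, y) lies inside any box (x1, y1, x2, y2)."""
    x, y = pt
    return any(x1 <= x <= x2 and y1 <= y <= y2 for x1, y1, x2, y2 in boxes)


def remove_dynamic_matches(matches, ref_boxes, cur_boxes):
    """Keep matches whose two endpoints both lie outside the dynamic boxes.

    matches   : list of ((x_ref, y_ref), (x_cur, y_cur)) pixel pairs
    ref_boxes : dynamic-object boxes detected in the reference frame
    cur_boxes : dynamic-object boxes detected in the current frame
    """
    return [(p, q) for p, q in matches
            if not in_any_box(p, ref_boxes) and not in_any_box(q, cur_boxes)]


# Hypothetical detections: one moving object in each frame.
ref_boxes = [(200, 100, 260, 220)]
cur_boxes = [(220, 110, 280, 230)]
matches = [
    ((50, 60), (55, 62)),      # static background
    ((210, 150), (400, 300)),  # reference point on the object
    ((300, 200), (305, 198)),  # static background
    ((400, 100), (230, 160)),  # current point on the object
]
static_matches = remove_dynamic_matches(matches, ref_boxes, cur_boxes)
```

The surviving pairs are the ones that would be passed on to the initial pose estimation.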
5. The visual SLAM method based on image matching according to claim 1, characterized in that S3 includes the following steps:
S31: Divide the input reference frame image into $N_x \times N_y$ local grid cells and establish a feature density distribution model D(x,y) over the grid cells; calculate the global average feature density P of the reference frame and set a feature density threshold $P_{min}$;
S32: Based on the ORB feature matching results, count the number of matching points in each grid cell; when the local matching density is lower than the set threshold $P_{min}$, label the grid cell as a sparse region $G_{ri}$, i=1,2,…,k;
S33: Based on the initial pose estimation (R0, T0) in S2 and the camera intrinsic parameters K, perform geometric projection on the sparse regions to obtain the corresponding aligned regions $G_{ti}$, i=1,2,…,k, in the current frame;
S34: Extract image features from the reference frame and the current frame, and call the JamMa neural matching module based on the ORB feature density distribution model D(x,y) to perform feature alignment and matching, obtaining the matching point set $M_{JM}$ and its confidence matrix;
S35: Perform non-maximum suppression on the neural matching results and filter dynamic object regions using the object detection model YOLOv5; on this basis, appropriately relax the suppression conditions for feature-sparse regions to retain more high-confidence static matching points;
S36: Perform spatial redundancy detection between the JamMa and ORB matching point sets: for each JamMa matching point, find its nearest neighbor in the ORB space, and determine the matching points consistent across both channels through Euclidean distance constraints to form the matching point set $M_{RD}$;
S37: Re-encode the neural network image matching results into a feature format consistent with the ORB descriptor, and fuse them based on confidence weights to form a unified matching point set $M_{FS}$ used for subsequent pose optimization and map updates.
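Steps S31 and S32 can be sketched in Python as a per-cell count followed by thresholding. The rule that sets the threshold as a fraction `ratio` of the mean cell count is an assumption made for this example; the claim only states that a threshold is set. Function names and sample points are hypothetical.

```python
def density_map(points, width, height, nx, ny):
    """Count points per cell of an nx-by-ny grid; returns grid[row][col]."""
    grid = [[0] * nx for _ in range(ny)]
    for x, y in points:
        # Clamp so points on the right or bottom border fall in the last cell.
        col = min(int(x * nx / width), nx - 1)
        row = min(int(y * ny / height), ny - 1)
        grid[row][col] += 1
    return grid


def sparse_cells(grid, ratio=0.5):
    """Return ((row, col) list, p_min) for cells with count below p_min.

    p_min = ratio * mean cell count is an assumed rule, not taken from the text.
    """
    n_cells = len(grid) * len(grid[0])
    mean = sum(map(sum, grid)) / n_cells
    p_min = ratio * mean
    cells = [(r, c) for r, row in enumerate(grid)
             for c, n in enumerate(row) if n < p_min]
    return cells, p_min


# Hypothetical matched points in a 640x480 reference frame, 4x2 grid.
pts = [(10, 10), (20, 30), (100, 100), (170, 50),
       (300, 300), (630, 470), (600, 400), (500, 450)]
grid = density_map(pts, width=640, height=480, nx=4, ny=2)
sparse, p_min = sparse_cells(grid, ratio=0.5)
```

The cells returned in `sparse` correspond to the regions that would then be projected into the current frame and handed to the neural matcher.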
6. The visual SLAM method based on image matching according to claim 5, characterized in that S34 includes the following steps:
SS1: Feed the image data into the JamMa network backbone module, which is based on the ConvNeXtV2 structure, to extract multi-scale low-resolution feature maps that characterize image structure information at different spatial scales; align the ORB feature density distribution model D(x,y) to obtain the guiding weight d(x,y);
SS2: Expand the two-dimensional feature map and its corresponding guiding weights into one-dimensional sequences along four directions, namely horizontal and vertical, each forward and backward, so that global sequence modeling can be performed by the state space model;
SS3: Input the feature sequences into the multi-layer state space model for feature propagation and residual fusion; reconstruct the output into two-dimensional space by weighted recombination with the ORB prior weights d(x,y); achieve cross-layer adaptive feature enhancement through gated linear units;
SS4: Flatten the enhanced features into a set of feature vectors, calculate the cosine similarity matrix between the features of the two frames, select the index corresponding to the maximum response to generate matching point pairs, and filter out low-confidence matches based on the confidence threshold.
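The final matching step SS4 can be sketched on its own in Python. This does not reproduce the JamMa network; it only shows cosine similarity, selection of the maximum response and confidence filtering on already extracted feature vectors. The threshold value and the toy vectors are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def match_by_cosine(feats_a, feats_b, conf_thresh=0.8):
    """For each feature in frame A pick the most similar feature in frame B.

    Returns (index_a, index_b, similarity) for pairs whose similarity reaches
    conf_thresh, an assumed confidence threshold.
    """
    if not feats_b:
        return []
    matches = []
    for i, fa in enumerate(feats_a):
        sims = [cosine(fa, fb) for fb in feats_b]   # one row of the matrix
        j = max(range(len(sims)), key=sims.__getitem__)
        if sims[j] >= conf_thresh:
            matches.append((i, j, sims[j]))
    return matches


# Toy 2-D feature vectors; real features are high-dimensional.
feats_ref = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
feats_cur = [[0.0, 2.0], [3.0, 0.1], [-1.0, 0.2]]
m_jm = match_by_cosine(feats_ref, feats_cur, conf_thresh=0.8)
pairs = [(i, j) for i, j, _ in m_jm]
```

The third reference feature has a best similarity of about 0.73, below the assumed threshold, so it produces no match.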
7. The visual SLAM method based on image matching according to claim 5, characterized in that the improved RANSAC in S4 includes the following steps:
S41: Define a continuous adjustment factor $\alpha$: when the number of matching points N in the matching point set $M_{RD}$ satisfies N ≥ 600, $\alpha = 1$, indicating that the JamMa network and ORB detection exhibit good commonality and robustness; otherwise, $\alpha < 1$;
S42: Set the number of samples $n$ and calculate the adaptive adjustment factor $\alpha$ from the number of matching points N in $M_{RD}$. The sampling strategy is adjusted accordingly: samples are drawn proportionally from the fused matching points, with priority given to the matching points detected simultaneously by JamMa and ORB, and the remaining samples are selected randomly from the remaining matching points in $M_{FS}$; based on the sampled matching points, solve the initial camera pose using the NPnP method;
S43: Project the 3D map points corresponding to the fused matching points into the current camera coordinate system and calculate the reprojection error $e_i$ of each observation point;
S44: Determine the geometric consistency of the matching points based on the reprojection error threshold $\tau$: when $e_i < \tau$, the matching point is considered an inlier;
S45: To further integrate JamMa and ORB feature information, a fusion confidence weight $w_i$ based on the feature source and the adjustment factor is introduced: for matching points detected by both JamMa and ORB, the weight increases with $\alpha$; for matching points detected by a single method, the weight is reduced accordingly. The camera pose is then estimated by minimizing the weighted projection error function
$$E = \sum_i w_i \, \lVert \pi(T_{cw} X_i) - u_i \rVert^2$$
where $E$ represents the total weighted projection error; $w_i$ represents the fusion confidence weight of the i-th matching point; $X_i$ represents the three-dimensional spatial point corresponding to the i-th matching point; $T_{cw}$ represents the rigid body transformation; $\pi(\cdot)$ represents the camera projection function; $u_i$ represents the actual observed pixel coordinates of the i-th matching point in the current image; and $\lVert \cdot \rVert$ represents the Euclidean norm, used to calculate the reprojection error between the projected point and the observed point;
S46: Update the camera pose based on the result of minimizing the weighted projection error, and repeat the above process until the pose update amount is less than the set threshold or the maximum number of iterations is reached.
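The inlier test and the weighted projection error of S43 to S45 can be sketched in Python as follows. Several things here are assumptions made for illustration only: a pinhole camera with intrinsics `(fx, fy, cx, cy)`, a squared error term, the name `alpha` for the adjustment factor, and the particular weighting rule in `fusion_weight` (the claim states only that doubly detected points are weighted up and singly detected points are weighted down).

```python
import math


def project(K, R, t, X):
    """Pinhole projection of 3D point X under pose (R, t); K = (fx, fy, cx, cy)."""
    xc = [sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3)]
    fx, fy, cx, cy = K
    return (fx * xc[0] / xc[2] + cx, fy * xc[1] / xc[2] + cy)


def reprojection_errors(K, R, t, points3d, pixels):
    """Euclidean distance between each projected point and its observation."""
    return [math.dist(project(K, R, t, X), u) for X, u in zip(points3d, pixels)]


def fusion_weight(both_detected, alpha, base=1.0, gain=0.5):
    """Hypothetical rule: weight rises with alpha for points found by both
    matchers and falls for points found by only one."""
    return base + gain * alpha if both_detected else base - gain * alpha


def weighted_error(errors, weights, tau):
    """Sum of w_i * e_i^2 over inliers (e_i < tau); also returns inlier indices."""
    inliers = [i for i, e in enumerate(errors) if e < tau]
    return sum(weights[i] * errors[i] ** 2 for i in inliers), inliers


K = (500.0, 500.0, 320.0, 240.0)
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # identity rotation
t = [0.0, 0.0, 0.0]
points3d = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 1.0, 4.0)]
pixels = [(320.0, 240.0), (571.0, 240.0), (320.0, 375.0)]
both = [True, False, True]   # detected by both JamMa and ORB?

errors = reprojection_errors(K, R, t, points3d, pixels)
weights = [fusion_weight(b, alpha=1.0) for b in both]
total, inliers = weighted_error(errors, weights, tau=2.0)
```

With the sample data the third point reprojects 10 pixels away from its observation and is excluded as an outlier before the weighted sum is formed.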
8. The visual SLAM method based on image matching according to claim 1, characterized in that, in S5, the local mapping includes: inputting the optimized pose of the current frame and its corresponding map points into the local mapping module, triangulating the newly generated feature points, and filtering out unstable points based on parallax and visibility criteria; the local mapping module uses local bundle adjustment (BA) to jointly optimize the current keyframe and its covisible keyframes, minimizing the reprojection error across multiple frames and thereby improving the accuracy and consistency of the local map structure.
9. The visual SLAM method based on image matching according to claim 1, characterized in that, in S5, the loop closure detection includes: performing global similarity retrieval on the keyframe sequence using a bag-of-features model or a deep learning feature encoding method; when a loop closure candidate frame is detected, using geometric consistency verification to determine whether a real loop closure relationship exists between the two frames; if the verification passes, triggering the loop closure correction module to construct a global pose graph; and eliminating the accumulated error through global graph optimization to achieve consistent alignment and drift correction of the global map.