A SLAM method, apparatus, and medium based on ORB description operator enhancement

By enhancing the ORB descriptor with a lightweight neural network and combining global geometric and representational information, the problems of excessive computational requirements and insufficient feature matching accuracy in SLAM systems are solved, achieving more efficient and reliable feature matching and environmental awareness.

CN118411532B · Active · Publication Date: 2026-06-30 · TONGJI UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
TONGJI UNIV
Filing Date
2024-04-22
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

In existing SLAM systems, deep learning-based descriptor replacement may lead to excessive computational demands and insufficient hardware support. Furthermore, traditional feature matching methods ignore the geometric and hue information of key points in the image, resulting in insufficient matching accuracy and robustness.

Method used

A lightweight neural network model is used to enhance the ORB descriptor. By combining global geometric information and representation information through feature self-enhancement and mutual enhancement, the performance of the feature descriptor is improved, and the matching accuracy is optimized by using a composite loss function.

Benefits of technology

It improves the matching accuracy and robustness of the SLAM system, reduces the computational burden, enables efficient operation in practical applications, and enhances the system's reliability and robustness.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention relates to a SLAM method, apparatus, and medium based on ORB descriptor operator enhancement. The method includes the following steps: acquiring adjacent image sequences; performing homography transformation on the images and converting them to the HLS color space; extracting image feature points and their corresponding feature descriptors using the ORB feature extraction algorithm in 3D reconstruction; performing self-enhancement and mutual enhancement processing on the extracted feature descriptors using a descriptor operator enhancement network; matching feature points in adjacent frames using the enhanced feature descriptors; and performing a SLAM task based on the matched feature point pairs to achieve environmental perception and localization. Compared with existing technologies, this invention has advantages such as low computational power requirement and good SLAM tracking stability.

Description

Technical Field

[0001] This invention relates to the field of computer vision technology, and in particular to a SLAM method, apparatus and medium based on ORB descriptor enhancement.

Background Technology

[0002] Extracting sparse keypoints or local features from images is an integral part of many computer vision tasks, including Structure-from-Motion (SfM), Simultaneous Localization and Mapping (SLAM), and visual localization. Matching keypoints across different images plays a crucial role in these tasks, and feature descriptors are used to represent the keypoints being matched. Feature descriptors can be real-valued or binary.

[0003] Early descriptor operators were often handcrafted, but with the development of machine learning and deep learning, learning-based descriptor operators have become increasingly popular. For example, SuperPoint is a trained descriptor operator that has demonstrated excellent performance in various computer vision tasks. Compared to handcrafted descriptor operators, deep learning-based descriptor operators are better suited to keypoint matching on different images and achieve better results in challenging scenarios. However, since descriptor operators are already integrated into real-world systems, replacing them with entirely new ones can present several challenges. First, new descriptor operators may require significantly more computational power, potentially exceeding the capabilities of existing hardware. Second, the change in descriptor type may necessitate substantial modifications to the framework code to accommodate the new descriptor type. Furthermore, traditional feature matching methods typically consider only the local geometric attributes and visual descriptor operators of keypoints, neglecting information about other keypoints in the image, as well as the geometric and hue information of the keypoints themselves. Therefore, the matching process still generates matching relationships that ignore geometric and hue information.

Summary of the Invention

[0004] The purpose of this invention is to provide a SLAM method, device, and medium based on ORB descriptor enhancement. It proposes a lightweight neural network model that improves the performance of the ORB descriptor, minimizes computational overhead, and makes full use of existing descriptors to improve matching accuracy and robustness.

[0005] The objective of this invention can be achieved through the following technical solutions:

[0006] A SLAM method based on ORB description operator enhancement includes the following steps:

[0007] S1, acquire the adjacent image sequence, perform homography transformation on the image and convert it to the HLS color space;

[0008] S2, using the ORB feature extraction algorithm in 3D reconstruction to extract image feature points and their corresponding feature descriptors;

[0009] S3, the extracted feature descriptor operators are subjected to self-enhancement and mutual enhancement processing using a descriptor enhancement network;

[0010] S4, use the enhanced feature descriptors to match feature points between adjacent frames, and perform the SLAM task based on the matched feature point pairs to achieve environmental perception and localization.

[0011] Step S1 includes the following steps:

[0012] S11, the points of the image on the first viewing plane are projected onto the second viewing plane using the homography transformation matrix H, wherein the homography transformation matrix H is a 3x3 homogeneous matrix;

[0013] S12 converts the image after homography transformation from RGB space to HLS space:

[0014] Let the RGB color of a pixel be $(l_R, l_G, l_B)$, let $l_{\max}$ be the maximum of $l_R$, $l_G$ and $l_B$, and let $l_{\min}$ be their minimum, where $l_R$, $l_G$ and $l_B$ take values in $[0, 1]$;

[0015] 1) Calculate the brightness L:

[0016] $L = (l_{\max} + l_{\min}) / 2$

[0017] When $l_{\max} = l_{\min}$, we have $l_{\max} = l_R = l_G = l_B = l_{\min}$, which indicates that the color is gray; in this case S = 0 and H does not represent any color;

[0018] 2) Calculate the saturation S:

[0019] If the brightness $L \le 0.5$, then $S = (l_{\max} - l_{\min}) / (l_{\max} + l_{\min})$;

[0020] If the brightness $L > 0.5$, then $S = (l_{\max} - l_{\min}) / (2 - l_{\max} - l_{\min})$;

[0021] 3) Calculate hue H:

[0022] When $l_{\max} = l_R$, $H = 60 \times (l_G - l_B) / (l_{\max} + l_{\min})$, and the color is between yellow and magenta;

[0023] When $l_{\max} = l_G$, $H = 120 + 60 \times (l_G - l_B) / (l_{\max} + l_{\min})$, and the color is between cyan and yellow;

[0024] When $l_{\max} = l_B$, $H = 240 + 60 \times (l_G - l_B) / (l_{\max} + l_{\min})$, and the color is between magenta and cyan;

[0025] If the calculated result of hue H is negative, then add 360 to the original calculated result to obtain the final hue.

[0026] Step S2 includes the following steps:

[0027] S21, Feature Extraction: Extracting feature points and corresponding feature descriptors from an image based on a feature extraction algorithm;

[0028] S22, Image Pair Matching: Based on exhaustive matching, sequence matching, spatial matching and transitive matching, the correspondence between images is determined from different perspectives. Based on the matching results, image pairs are established. Each image pair includes a reference image and a target image.

[0029] S23, Match feature points in the matched image pairs: Based on the KD tree nearest neighbor search algorithm, calculate the distance or similarity between the descriptor vectors of feature points to perform feature point matching;

[0030] S24, based on geometric verification, filter out feature points that are mismatched;

[0031] S25, perform sparse reconstruction and incremental reconstruction of the image, estimate the camera pose and reconstruct a sparse representation of the scene;

[0032] S26. Based on the sparse representation of the camera pose and scene, generate depth maps and normal maps, perform dense reconstruction, and obtain depth images for each image.

[0033] S27. Save the camera pose of each image, ordered by timestamp, as the trajectory ground truth, and use it as the reference pseudo-ground truth for training the enhanced descriptors.

[0034] The feature descriptor is used to characterize the feature information of an image. The feature information of the image includes independent information and set information of feature points. The independent information includes global geometric information and feature point representation information. The set information includes the relative positional relationship of the feature point set. The descriptor enhancement network includes a feature self-enhancement network based on representation information and a feature mutual enhancement network based on feature point set.

[0035] The feature self-enhancing network, based on the MLP model, fuses geometric information, feature information, and representation information into the extracted feature descriptor, thereby enhancing the descriptor:

[0036]

[0037]

[0038] where $d_i$ is the feature descriptor extracted in step S2, $\mathrm{MLP}_{desc}$ is the MLP model for the feature information, $d_i^{tr}$ is the enhanced descriptor, $\mathrm{MLP}_{geo}$ is the MLP model for the geometric information, $p_i = (x_i, y_i, c_i, \theta_i)$ denotes all available geometric information of keypoint i, $\mathrm{MLP}_{hl}$ is the MLP model for the representation information, and $q_i = (h_i, l_i, s_i)$ denotes all representation information of keypoint i. $\mathrm{MLP}_{geo}(p_i)$ and $\mathrm{MLP}_{hl}(q_i)$ map the global geometric and representation information of the descriptor into a high-dimensional space.

[0039] The feature mutual enhancement network captures spatial contextual cues of sparse local features extracted from the same image based on the Transformer model, and is represented as follows:

[0040]

[0041] Here, Trans denotes the Transformer operation and $d_i^{tr}$ denotes the enhanced descriptor;

[0042] The Transformer model employs an AFT network, whose input consists of the N local features within the same image and whose output is the set of feature descriptors mutually enhanced in the feature space.

[0043] The loss function of the descriptor enhancement network is expressed as:

[0044] $\mathcal{L} = \mathcal{L}_{AP} + \lambda \mathcal{L}_{boost}$

[0045] where $\mathcal{L}_{AP}$ is the loss function that maximizes the average precision of the features, $\mathcal{L}_{boost}$ is the descriptor boosting loss function, and $\lambda$ is the weight that adjusts the descriptor boosting loss function;

[0046]

[0047]

[0048] where $d_i^{tr}$ denotes the enhanced descriptor, $d_i$ denotes the extracted original feature descriptor, AP denotes the average precision, and N denotes the number of descriptors.

[0049] The average precision of each descriptor in the loss function of the descriptor enhancement network is calculated with the differentiable method FastAP:

[0050] Given a transformed feature $d^{tr}$ in the first image and the feature set in the second image, FastAP computes the pairwise distance vector $Z \in \mathbb{R}^N$ and uses the ground-truth match labels $M = \{M^+, M^-\}$. The range $\Omega$ of the distances is then quantized into a finite set of b elements, $\Omega = \{z_1, z_2, \ldots, z_b\}$, and precision and recall are reformulated as functions of the Euclidean distance z:

[0051] $\mathrm{Prec}(z) = P(M^+ \mid Z < z), \quad \mathrm{Rec}(z) = P(Z < z \mid M^+)$

[0052] where $P(M^+ \mid Z < z)$ is the prior distribution of the positive matches $M^+$ conditioned on $Z < z$, and $P(Z < z \mid M^+)$ is the cumulative distribution function of Z;

[0053] Use the area under the precision-recall curve to represent the average precision:

[0054]

[0055] where the ground-truth labels M of the matching points are obtained from the ground-truth poses and depth maps.
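The histogram idea behind the average-precision computation can be illustrated with a small sketch. It is a hard-binned, non-differentiable simplification that assumes the usual FastAP form, the sum over bins of $H_j^+ h_j^+ / H_j$ divided by the number of positives; the function name, the bin count, and the distance range of [0, 2] (which presumes L2-normalized descriptors) are assumptions made for illustration and are not taken from the patent text.

```python
import numpy as np

def binned_average_precision(distances, is_positive, num_bins=10, max_distance=2.0):
    """Histogram-based average precision over quantized distances.

    distances:   distances from one query descriptor to N candidates.
    is_positive: boolean labels, True where the candidate is a correct match.
    """
    distances = np.asarray(distances, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    edges = np.linspace(0.0, max_distance, num_bins + 1)

    # h_pos[j]: positives falling in bin j; h_all[j]: all candidates in bin j.
    h_pos, _ = np.histogram(distances[is_positive], bins=edges)
    h_all, _ = np.histogram(distances, bins=edges)

    # Cumulative counts up to and including bin j.
    H_pos = np.cumsum(h_pos)
    H_all = np.cumsum(h_all)

    n_pos = int(is_positive.sum())
    if n_pos == 0:
        return 0.0
    valid = H_all > 0  # skip leading empty bins to avoid division by zero
    terms = H_pos[valid] * h_pos[valid] / H_all[valid]
    return float(terms.sum() / n_pos)

# Positives closer than negatives give a perfect score.
good = binned_average_precision([0.1, 0.3, 1.5, 1.7], [True, True, False, False])
# Positives farther than negatives give a low score.
bad = binned_average_precision([0.1, 0.3, 1.5, 1.7], [False, False, True, True])
```

A trainable version would replace the hard histogram counts with soft bin assignments so that gradients can flow back to the descriptors.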

[0056] A SLAM device based on ORB descriptor enhancement includes a memory, a processor, and a program stored in the memory. When the processor executes the program, the method described above is implemented.

[0057] A storage medium stores a program that implements the method described above when executed.

[0058] Compared with the prior art, the present invention has the following beneficial effects:

[0059] (1) The present invention considers the global geometric information of feature points, their own representation information, and the geometric interaction information with other feature points, improving the reliability and robustness of the system.

[0060] (2) The present invention constructs a composite loss function, making the average precision of the descriptor higher and the evaluation system more complete.

[0061] (3) The present invention introduces a feature enhancement network and constructs a lightweight feature enhancement network. The system can make full use of the global context information in the image, and at the same time it does not introduce too much computational burden and can run efficiently in practical applications.

Brief Description of the Drawings

[0062] Figure 1 is the flowchart of the method of the present invention;

[0063] Figure 2 is the schematic diagram before and after the homography transformation. Among them, (2a) represents the original RGB image, and (2b) represents the image after the homography transformation;

[0064] Figure 3 is the schematic diagram of the image comparison before and after the homography transformation and the HLS color space conversion;

[0065] Figure 4 is the schematic diagram of the structure of the feature self-enhancement network. Among them, (4a) is the overall structure, and (4b) is the encoder structure;

[0066] Figure 5 illustrates different attention calculation methods, where (5a) is the dot-product attention calculation method and (5b) is the approximate dot-product attention calculation method;

[0067] Figure 6 is the schematic diagram of the structure of the feature mutual enhancement network, where (6a) is the overall structure and (6b) is the specific structure of the attention aggregation;

[0068] Figure 7 is the schematic diagram of the overall structure of the SLAM system to which the descriptor enhancement network is applied;

[0069] Figure 8 is a comparison chart of the matching effects of ORB, SIFT, and the method of the present invention under small angle changes;

[0070] Figure 9 is a comparison chart of the matching effects of ORB, SIFT, and the method of the present invention under large angle changes;

[0071] Figure 10 is a comparison chart of the back-end mapping effects of ORB and the method of the present invention.

Detailed Implementation

[0072] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments. These embodiments are based on the technical solution of the present invention and provide detailed implementation methods and specific operating procedures. However, the scope of protection of the present invention is not limited to the following embodiments.

[0073] This embodiment provides a SLAM method based on ORB descriptor enhancement. As shown in Figure 1, it includes the following steps:

[0074] S1: Obtain the sequence of adjacent images, perform homography transformation on the images, and convert them to the HLS color space.

[0075] Specifically, step S1 includes the following steps:

[0076] S11, using the homography transformation matrix H, the points of the image on the first viewing plane are projected onto the second viewing plane.

[0077] Homography refers to the process of projecting a point on one view plane onto another view plane using a 3×3 homogeneous matrix H, which is also called the homography transformation matrix. Taking a two-dimensional planar image as an example, if a point $p_1 = (x_1, y_1, 1)$ in the image is projected onto a point $p_2 = (x_2, y_2, 1)$ in a new image through homography, the projection relationship can be expressed as:

[0078] $p_2 = H p_1 \quad (1)$

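[0079] Represented in matrix form as follows:

[0080] $$\begin{bmatrix} x_2 \\ y_2 \\ 1 \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} \begin{bmatrix} x_1 \\ y_1 \\ 1 \end{bmatrix} \quad (2)$$

[0081] Substituting the third row of equation (2) into the first two rows and rearranging, we obtain:

[0082] $$\begin{aligned} h_{11}x_1 + h_{12}y_1 + h_{13} - h_{31}x_1x_2 - h_{32}y_1x_2 - h_{33}x_2 &= 0 \\ h_{21}x_1 + h_{22}y_1 + h_{23} - h_{31}x_1y_2 - h_{32}y_1y_2 - h_{33}y_2 &= 0 \end{aligned} \quad (3)$$

[0083] Equation (3) can be written as the product of a matrix and a vector, i.e.

[0084] $$\begin{bmatrix} x_1 & y_1 & 1 & 0 & 0 & 0 & -x_1x_2 & -y_1x_2 & -x_2 \\ 0 & 0 & 0 & x_1 & y_1 & 1 & -x_1y_2 & -y_1y_2 & -y_2 \end{bmatrix} h = 0 \quad (4)$$

[0085] where $h = [h_{11}, h_{12}, h_{13}, h_{21}, h_{22}, h_{23}, h_{31}, h_{32}, h_{33}]^T$ is a 9-dimensional column vector. If we let

[0086] $$A = \begin{bmatrix} x_1 & y_1 & 1 & 0 & 0 & 0 & -x_1x_2 & -y_1x_2 & -x_2 \\ 0 & 0 & 0 & x_1 & y_1 & 1 & -x_1y_2 & -y_1y_2 & -y_2 \end{bmatrix} \quad (5)$$

[0087] then equation (4) can be written as

[0088] $Ah = 0 \quad (6)$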

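[0089] Here $A \in \mathbb{R}^{2 \times 9}$ is the matrix computed from a single pair of points. Since homogeneous coordinates are used to represent points on the plane, there exists a non-zero scalar s such that $b_1 = sHa^T$ and $b = Ha^T$ both represent the same point b. Fixing this scale, for instance by letting $s = 1/h_{33}$, gives:

[0090] $$b = \begin{bmatrix} h_{11}/h_{33} & h_{12}/h_{33} & h_{13}/h_{33} \\ h_{21}/h_{33} & h_{22}/h_{33} & h_{23}/h_{33} \\ h_{31}/h_{33} & h_{32}/h_{33} & 1 \end{bmatrix} a^T \quad (7)$$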

[0091] As can be seen from equation (7), only 8 variables are needed to determine the homography transformation matrix H. Since each point pair provides two equations, at least 4 point pairs are needed to obtain the homography transformation matrix. The pixels in the original image can then be projected onto the new coordinates using equation (7). The feature points in the image should also satisfy this projection relationship, so correct matching point pairs can be selected according to the projection error. The images before and after the homography transformation are shown in Figure 2.
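The linear system Ah = 0 described above can be solved with a singular value decomposition once at least four point pairs are stacked. The following is a minimal sketch under that assumption; the function names are hypothetical, no coordinate normalization or outlier handling is included, and the patent text does not state which solver the embodiment uses.

```python
import numpy as np

def estimate_homography(src, dst):
    """Estimate H (3x3) with dst ~ H @ src from N >= 4 point pairs."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    rows = []
    for (x1, y1), (x2, y2) in zip(src, dst):
        # Each point pair contributes the two rows of the 2x9 matrix A.
        rows.append([x1, y1, 1, 0, 0, 0, -x1 * x2, -y1 * x2, -x2])
        rows.append([0, 0, 0, x1, y1, 1, -x1 * y2, -y1 * y2, -y2])
    A = np.array(rows)
    # h is the right singular vector for the smallest singular value of A.
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]  # fix the free scale so that h33 = 1

def project_points(H, pts):
    """Apply H to 2D points and convert back from homogeneous coordinates."""
    pts = np.asarray(pts, dtype=float)
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]

def projection_error(H, src, dst):
    """Per-pair distance between projected source points and targets."""
    return np.linalg.norm(project_points(H, src) - np.asarray(dst, dtype=float), axis=1)

H_true = np.array([[1.2, 0.1, 5.0], [-0.05, 0.9, -3.0], [0.001, 0.002, 1.0]])
src = np.array([[0, 0], [100, 0], [100, 80], [0, 80], [50, 30]], dtype=float)
dst = project_points(H_true, src)
H_est = estimate_homography(src, dst)
errors = projection_error(H_est, src, dst)
```

Point pairs whose projection error exceeds a chosen threshold can then be rejected as mismatches, which is the selection by projection error mentioned above.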

[0092] S12 converts the image after homography transformation from RGB space to HLS space.

[0093] The HLS color space (Hue, Lightness, Saturation / Chroma) is a way of describing color. Unlike the common RGB and HSV color models, it decomposes color into three main attributes: hue, lightness, and saturation.

[0094] Hue indicates the position of a color in the spectrum; it describes the basic character of a color, such as red, orange, yellow, green, cyan, blue, or violet. Hue values typically range from 0 to 360 degrees. Brightness describes the lightness or darkness of a color, with values ranging from 0 to 100, where 0 represents black and 100 represents white. Changes in brightness are usually achieved by adjusting the lightness or darkness of the color. Saturation indicates the vividness or purity of a color, with values typically ranging from 0 to 100, where 0 represents gray and 100 represents the most vivid or pure color.

[0095] Specifically, the value of a pixel in the HLS color space can be obtained by transforming the coordinates of the corresponding pixel in the RGB color space. Let the RGB color of a pixel be $(l_R, l_G, l_B)$, let $l_{\max}$ be the maximum of $l_R$, $l_G$ and $l_B$, and let $l_{\min}$ be their minimum, where $l_R$, $l_G$ and $l_B$ take values in $[0, 1]$. In the HLS color space, the color $(l_R, l_G, l_B)$ is expressed as (H, L, S), where H represents hue, L represents lightness, and S represents saturation.

[0096] 1) Calculate the brightness L:

[0097] $L = (l_{\max} + l_{\min}) / 2$

[0098] When $l_{\max} = l_{\min}$, we have $l_{\max} = l_R = l_G = l_B = l_{\min}$, which indicates that the color is gray; in this case S = 0 and H does not represent any color.

[0099] 2) Calculate the saturation S:

[0100] If the brightness $L \le 0.5$, then $S = (l_{\max} - l_{\min}) / (l_{\max} + l_{\min})$;

[0101] If the brightness $L > 0.5$, then $S = (l_{\max} - l_{\min}) / (2 - l_{\max} - l_{\min})$;

[0102] 3) Calculate hue H:

[0103] When $l_{\max} = l_R$, $H = 60 \times (l_G - l_B) / (l_{\max} + l_{\min})$, and the color is between yellow and magenta;

[0104] When $l_{\max} = l_G$, $H = 120 + 60 \times (l_G - l_B) / (l_{\max} + l_{\min})$, and the color is between cyan and yellow;

[0105] When $l_{\max} = l_B$, $H = 240 + 60 \times (l_G - l_B) / (l_{\max} + l_{\min})$, and the color is between magenta and cyan.

[0106] Based on the periodicity of H, if the calculated result of hue H is negative, then 360 is added to the original calculated result to obtain the final hue.

[0107] Based on the above conversion principle, an image can be converted from the RGB color space to the HLS color space. An image processed with the homography transformation and the HLS color space conversion is shown in Figure 3.
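For a single pixel, the conversion can be tried with Python's standard library. This sketch assumes that the conventional RGB-to-HLS conversion implemented by `colorsys.rgb_to_hls` is the intended one; that function returns hue, lightness and saturation each in [0, 1], so the hue is scaled to degrees here. The wrapper name is hypothetical.

```python
import colorsys

def rgb_to_hls_degrees(r, g, b):
    """Convert RGB values in [0, 1] to (H in degrees, L in [0, 1], S in [0, 1])."""
    h, l, s = colorsys.rgb_to_hls(r, g, b)
    return h * 360.0, l, s

red = rgb_to_hls_degrees(1.0, 0.0, 0.0)    # saturated red
gray = rgb_to_hls_degrees(0.5, 0.5, 0.5)   # gray: saturation is zero
blueish = rgb_to_hls_degrees(0.2, 0.4, 0.6)
```

For the third pixel, the lightness is (0.6 + 0.2) / 2 = 0.4 and, because L ≤ 0.5, the saturation is (0.6 - 0.2) / (0.6 + 0.2) = 0.5, which agrees with the lightness and saturation formulas above. A whole image would normally be converted with a vectorized routine rather than pixel by pixel.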

[0108] S2 uses the ORB feature extraction algorithm in 3D reconstruction to extract image feature points and their corresponding feature descriptors.

[0109] This embodiment uses COLMAP for three-dimensional reconstruction.

[0110] Specifically, step S2 includes the following steps:

[0111] S21, Feature Extraction: Extract feature points and corresponding feature descriptors from the image based on feature extraction algorithms.

[0112] S22, Image Pair Matching: Based on exhaustive matching, sequence matching, spatial matching and transitive matching, the correspondence between images is determined from different perspectives. Based on the matching results, image pairs are established, and each image pair includes a reference image and a target image.

[0113] S23, Match feature points in the matched image pairs: Based on the KD-tree nearest neighbor search algorithm, calculate the distance or similarity between the descriptor vectors of feature points to perform feature point matching.

[0114] S24, based on geometric verification, filters out mismatched feature points.

[0115] In this embodiment, geometric verification refers to constraining three-dimensional points using epipolar geometry. The geometric constraint method involves randomly selecting 8 pairs of matching points from an image pair, solving the fundamental matrix using the normalized eight-point algorithm, then counting the number of point pairs that satisfy the epipolar geometric constraints, repeating the above steps within a set number of times, and selecting the matching with the largest number of point pairs that satisfy the conditions as the refined matching result.
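The verification loop described above (sample 8 pairs, solve the fundamental matrix with the normalized eight-point algorithm, count the pairs that satisfy the epipolar constraint, keep the best set) can be sketched as follows. The function names, the use of the Sampson error as the epipolar test, the iteration count, and the threshold are assumptions for illustration; COLMAP's own implementation is not reproduced here.

```python
import numpy as np

def _normalize(pts):
    """Hartley normalization: zero mean, mean distance sqrt(2) from the origin."""
    mean = pts.mean(axis=0)
    scale = np.sqrt(2.0) / np.mean(np.linalg.norm(pts - mean, axis=1))
    T = np.array([[scale, 0.0, -scale * mean[0]],
                  [0.0, scale, -scale * mean[1]],
                  [0.0, 0.0, 1.0]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    return pts_h @ T.T, T

def eight_point(x1, x2):
    """Normalized eight-point estimate of F such that x2^T F x1 = 0."""
    n1, T1 = _normalize(x1)
    n2, T2 = _normalize(x2)
    A = np.column_stack([
        n2[:, 0] * n1[:, 0], n2[:, 0] * n1[:, 1], n2[:, 0],
        n2[:, 1] * n1[:, 0], n2[:, 1] * n1[:, 1], n2[:, 1],
        n1[:, 0], n1[:, 1], np.ones(len(n1)),
    ])
    _, _, vt = np.linalg.svd(A)
    F = vt[-1].reshape(3, 3)
    # Enforce rank 2 by zeroing the smallest singular value.
    u, s, vt = np.linalg.svd(F)
    F = u @ np.diag([s[0], s[1], 0.0]) @ vt
    return T2.T @ F @ T1  # undo the normalization

def sampson_error(F, x1, x2):
    """First-order approximation of the squared epipolar distance."""
    h1 = np.column_stack([x1, np.ones(len(x1))])
    h2 = np.column_stack([x2, np.ones(len(x2))])
    Fx1 = h1 @ F.T   # row i is F @ x1_i
    Ftx2 = h2 @ F    # row i is F.T @ x2_i
    num = np.sum(h2 * Fx1, axis=1) ** 2
    den = Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2 + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2
    return num / den

def ransac_fundamental(x1, x2, iterations=500, threshold=1.0, seed=0):
    """Return the boolean inlier mask of the best model found."""
    rng = np.random.default_rng(seed)
    best_mask = np.zeros(len(x1), dtype=bool)
    for _ in range(iterations):
        sample = rng.choice(len(x1), size=8, replace=False)
        F = eight_point(x1[sample], x2[sample])
        mask = sampson_error(F, x1, x2) < threshold
        if mask.sum() > best_mask.sum():
            best_mask = mask
    return best_mask
```

Here `x1` and `x2` are (N, 2) arrays of matched pixel coordinates in the two images, and the returned mask marks the matches kept as the refined result.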

[0116] S25 performs sparse and incremental image reconstruction, estimates camera pose, and reconstructs a sparse representation of the scene.

[0117] After completing the feature matching and geometric verification described above, preliminary reconstruction (i.e., sparse reconstruction) can begin. First, the reconstruction is initialized: two matched images are selected, and the pose of one image is set to the identity matrix. Then the essential matrix E is computed from the point pairs between the two images, and the pose of the other image is obtained from E. After the poses of the two images are obtained, the matched point pairs can be converted into a 3D point cloud by triangulation.

[0118] After the initial sparse reconstruction is completed, incremental reconstruction is performed. Incremental reconstruction includes the following steps:

[0119] 1) Select as the next image the one with the most matching point pairs.

[0120] 2) Estimate the intrinsic matrix using the matched feature points, and then estimate the pose of the image.

[0121] 3) Triangulation is performed to generate three-dimensional spatial points. Triangulation uses the image coordinates of the matched points, the poses of the two matched images, and the camera intrinsic matrix to compute the three-dimensional coordinates of the matched points.

[0122] 4) Perform BA optimization using the Ceres library for all generated 3D points and estimated poses. Points with errors exceeding a threshold are removed by minimizing the reprojection error.

[0123] 5) Finally, perform BA optimization on all data, i.e., global optimization.
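The triangulation in step 3) can be illustrated with the standard linear (DLT) method. This is a sketch with hypothetical function names that assumes noise-free pixel coordinates and known 3x4 projection matrices P = K [R | t]; the bundle adjustment of steps 4) and 5) is not shown.

```python
import numpy as np

def projection_matrix(K, R, t):
    """Build the 3x4 projection matrix P = K [R | t]."""
    return K @ np.column_stack([R, t])

def project_point(P, X):
    """Project a 3D point to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def triangulate_point(P1, P2, x1, x2):
    """Linear triangulation of one matched point pair (x1 in image 1, x2 in image 2)."""
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The homogeneous 3D point is the null vector of A.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
a = np.deg2rad(10)
R = np.array([[np.cos(a), 0, np.sin(a)], [0, 1, 0], [-np.sin(a), 0, np.cos(a)]])
P1 = projection_matrix(K, np.eye(3), np.zeros(3))
P2 = projection_matrix(K, R, np.array([-1.0, 0.0, 0.0]))
X_true = np.array([0.3, -0.2, 5.0])
X_est = triangulate_point(P1, P2, project_point(P1, X_true), project_point(P2, X_true))
```

With noisy matches the result is only an initial estimate, which is why the reprojection error is subsequently minimized in the optimization steps.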

[0124] S26. Based on the sparse representation of the camera pose and scene, a depth map and a normal map are generated, and dense reconstruction is performed to obtain a depth image for each image.

[0125] After reconstructing the sparse representation of the scene and the camera pose of the input image, a denser geometric scene can be recovered. COLMAP has an integrated dense reconstruction pipeline that generates depth and normal maps for all registered images and fuses the depth and normal maps to merge a sparse point cloud into a dense point cloud. Finally, a Poisson or Delaunay triangulation reconstruction method is used to estimate a dense surface from the fused point cloud. The specific dense reconstruction consists of the following four parts:

[0126] 1) Undistort the images using the camera intrinsic parameters.

[0127] 2) Calculate the depth map and normal map of the registered image.

[0128] 3) Integrate depth and normal maps into the point cloud.

[0129] 4) Generate a dense point cloud and output it.

[0130] S27. Save the camera pose of each image, ordered by timestamp, as the trajectory ground truth, and use it as the reference pseudo-ground truth for training the enhanced descriptors.

[0131] The above operations yield the camera pose and depth image for each image. The camera pose of each input image is then saved by timestamp as the trajectory ground truth, serving as the reference pseudo-ground truth for each training run of the enhanced descriptors.

[0132] S3 utilizes a descriptor enhancement network to perform self-enhancement and mutual enhancement processing on the extracted feature descriptors.

[0133] This embodiment proposes a lightweight neural network to improve the discriminability of ORB feature points. As input, the network requires only the ORB feature point, the hue and saturation values of the corresponding pixel in the HLS color space, and the global geometric information of the feature point. Given this information, the descriptor enhancement network generates new descriptors end-to-end. These new descriptors are more discriminative and more robust in matching than the original descriptors. Furthermore, the network does not need to process image information, only the ORB information extracted by SLAM, which makes it more lightweight and efficient. In addition, the network can be easily embedded and integrated into monocular SLAM systems.

[0134] Feature descriptors are used to characterize the feature information of an image. This feature information includes independent and set information of feature points. The independent information includes global geometric information and feature point representation information, while the set information includes the relative positional relationships of the feature point set. In this embodiment, the descriptor enhancement network includes a feature self-enhancement network based on representation information and a feature mutual enhancement network based on feature point sets.

[0135] S31, Feature Self-Enhancement

[0136] In the frame sequence, each detected keypoint is described by a visual descriptor $d_i$, where $d_i$ is a D-dimensional binary vector. These feature descriptors are used to compute similarity across different images and ultimately to determine the relationships between those images. Therefore, good descriptors should remain robust to changes in viewpoint and lighting conditions.

[0137] This embodiment uses the SIFT algorithm as an example. The traditional similarity measure is Euclidean distance, but this measure does not always accurately capture the similarity between images. To solve this problem, Hellinger distance is used to measure the similarity of SIFT descriptors to obtain better matching performance.

[0138] Using the Hellinger distance as a similarity metric essentially projects the original descriptors from one space to another. To achieve this projection, a Multi-Layer Perceptron (MLP), with its strong non-linear mapping and learning capabilities, can be used to map the descriptors. By training the MLP model layer by layer, the original features can be transformed into enhanced features, with the expectation of better measuring similarity and matching performance in the new feature space.
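The statement that the Hellinger distance amounts to a projection into another space can be checked numerically. In the sketch below (function names are hypothetical), each non-negative descriptor is L1-normalized and square-rooted element-wise; the squared Euclidean distance between two mapped descriptors then equals 2 minus twice their Hellinger kernel, so the Euclidean distance in the mapped space is √2 times the Hellinger distance.

```python
import numpy as np

def hellinger_map(desc, eps=1e-12):
    """L1-normalize a non-negative descriptor, then take element-wise square roots."""
    desc = np.asarray(desc, dtype=float)
    desc = desc / (desc.sum(axis=-1, keepdims=True) + eps)
    return np.sqrt(desc)

def hellinger_kernel(a, b, eps=1e-12):
    """Hellinger (Bhattacharyya) kernel between two non-negative descriptors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a / (a.sum() + eps)
    b = b / (b.sum() + eps)
    return float(np.sum(np.sqrt(a * b)))

a = np.array([4.0, 0.0, 1.0, 3.0])
b = np.array([1.0, 2.0, 2.0, 3.0])
squared_euclidean = float(np.sum((hellinger_map(a) - hellinger_map(b)) ** 2))
kernel = hellinger_kernel(a, b)
```

This fixed mapping is a hand-designed projection; the MLP described above replaces it with a learned one.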

[0139] According to Cybenko's theorem, the MLP is a universal approximator. Therefore, an MLP, denoted $\mathrm{MLP}_{desc}$, can be used to approximate the projection function. For keypoint i, the transformed descriptor $d_i^{tr}$ is a nonlinear projection of the extracted descriptor $d_i$:

[0140] $d_i^{tr} = \mathrm{MLP}_{desc}(d_i)$

[0141] Given that the network training phase is guided by a loss function with Euclidean and Hamming distance constraints, this MLP-based model allows the transformed descriptor to adapt well to similarity measurement tasks in both Euclidean and Hamming spaces, especially for handcrafted descriptors. However, this projection does not fully utilize the geometric and representation information of the keypoints. Therefore, this embodiment introduces two further models, $\mathrm{MLP}_{geo}$ and $\mathrm{MLP}_{hl}$, which embed the geometric and representation information into high-dimensional vectors to further improve the performance of the descriptor.

[0142]

[0143] where $d_i$ is the feature descriptor extracted in step S2, $\mathrm{MLP}_{desc}$ is the MLP model for the feature information, $d_i^{tr}$ is the enhanced descriptor, $\mathrm{MLP}_{geo}$ is the MLP model for the geometric information, $p_i = (x_i, y_i, c_i, \theta_i)$ denotes all available geometric information of keypoint i, $\mathrm{MLP}_{hl}$ is the MLP model for the representation information, and $q_i = (h_i, l_i, s_i)$ denotes all representation information of keypoint i. $\mathrm{MLP}_{geo}(p_i)$ and $\mathrm{MLP}_{hl}(q_i)$ map the global geometric and representation information of the descriptor into a high-dimensional space.

[0144] As shown in Figure 4, the feature self-enhancement network includes a geometric encoder, a feature encoder, a representation encoder, a feature fusion layer, a Dropout layer, and a linear spatial projection layer. The geometric encoder, feature encoder, and representation encoder encode the geometric information, feature information, and representation information of different dimensions, respectively, each with an output dimension of 128. The feature encoder has two layers and is used to fuse the feature information, reducing the input dimension to 128. The geometric encoder has four layers; the first three layers expand the dimension of the geometric information from 4 to 128, and the last layer fuses this information. The representation encoder has two layers and expands the dimension of the representation information from 6 to 128. The feature fusion layer fuses the outputs of the three encoders, and the result is passed through the Dropout layer and then the linear spatial projection layer.
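A forward pass through a network of this shape can be sketched in NumPy. Several points are assumptions made only for illustration: the three encoder outputs are fused by summation, the hidden widths are arbitrary, the weights are random and untrained, Dropout is omitted because it is inactive at inference, and the representation input is taken as the 3-vector $q_i = (h_i, l_i, s_i)$. The class and function names are hypothetical.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Random weights and zero biases for a fully connected stack."""
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    """Apply the layers, with ReLU after every layer except the last."""
    for k, (W, b) in enumerate(params):
        x = x @ W + b
        if k < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

class SelfEnhancer:
    def __init__(self, desc_dim=256, geo_dim=4, rep_dim=3, width=128, seed=0):
        rng = np.random.default_rng(seed)
        self.mlp_desc = init_mlp(rng, [desc_dim, 256, width])          # two layers
        self.mlp_geo = init_mlp(rng, [geo_dim, 32, 64, width, width])  # four layers
        self.mlp_hl = init_mlp(rng, [rep_dim, 64, width])              # two layers
        self.proj = init_mlp(rng, [width, desc_dim])                   # linear projection

    def __call__(self, d, p, q):
        # Assumed fusion: element-wise sum of the three 128-dimensional encodings.
        fused = mlp(self.mlp_desc, d) + mlp(self.mlp_geo, p) + mlp(self.mlp_hl, q)
        return mlp(self.proj, fused)

rng = np.random.default_rng(0)
d = rng.integers(0, 2, size=(5, 256)).astype(float)  # 5 ORB descriptors as 0/1 vectors
p = rng.uniform(size=(5, 4))                         # (x, y, c, theta) per keypoint
q = rng.uniform(size=(5, 3))                         # (h, l, s) per keypoint
enhanced = SelfEnhancer()(d, p, q)
```

Each keypoint is processed independently here: changing one row of the inputs changes only the corresponding row of the output.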

[0145] Feature self-enhancement algorithms independently enhance the descriptor of each feature point without considering the correlation between feature points. However, this method does not utilize the spatial relationships between different feature points, while spatial contextual cues can significantly improve matching ability. Therefore, the enhanced features obtained from the feature self-enhancement stage still perform poorly in some challenging environments (such as repetitive patterns or weakly textured scenes). To address this issue, these descriptor operators are further subjected to mutual enhancement processing.

[0146] S32, Feature Mutual Enhancement

[0147] In this embodiment, the feature enhancement algorithm uses a Transformer to capture spatial contextual cues of sparse local features extracted from the same image. These features are represented by projection as follows:

[0148]

[0149] Here, Trans represents the Transformer operation. This represents the enhanced descriptor.

[0150] The Transformer model employs an AFT network. It takes the N local features within the same image as input and outputs feature descriptors that are mutually enhanced in the feature space. Compared with the MLP-based projection, the Transformer-based projection processes all feature points within the same image simultaneously.

[0151] By introducing an attention mechanism, the Transformer performs global context aggregation over the input local features. This allows it to better capture the relationships between features and the spatial information from local to global scale. By aggregating information from all features, it provides richer feature representations, which improves the recognizability and distinguishability of $d_i$. When features are extracted from repetitive environments in particular, traditional methods are easily affected by repetitive patterns, whereas the Transformer adaptively adjusts the feature weights and can therefore better distinguish repetitive local feature points. This approach not only improves the accuracy of local features extracted from repetitive environments but also makes the feature descriptors as a whole more robust and discriminative.

[0152] The Transformer layer consists of two sub-layers: an attention layer and a position-wise fully connected feed-forward network. The vanilla Transformer [51] uses multi-head attention (MHA) as the attention layer. For a given input X, MHA computes a weighted aggregation of the feature vectors of the keypoints. MHA consists of multiple attention heads, each with its own parameter matrices for feature transformation and similarity calculation, which increases the model's ability to attend to different points of interest.

[0153] Specifically, for an input $X \in \mathbb{R}^{N \times D}$, the h-th attention head first performs feature transformation through matrix multiplication to obtain the linearly transformed feature representations:

[0154] $$Q_h = X W_h^Q, \qquad K_h = X W_h^K, \qquad V_h = X W_h^V \qquad (12)$$

[0155] where $W_h^Q$ is the Q weight matrix of the h-th attention head, $W_h^K$ is the K weight matrix, and $W_h^V$ is the V weight matrix. The h-th head attention of X is then defined as:

[0156] $$\mathrm{head}_h = \mathrm{softmax}\!\left(\frac{Q_h K_h^{\top}}{\sqrt{D_k}}\right) V_h \qquad (13)$$

[0157] where $W_h^Q$, $W_h^K$, and $W_h^V$ are the linear projections of head h, and $D_k$ is the scaling factor given by the feature dimension. The features are thus weighted and summed using the attention scores to obtain the output of the h-th attention head. Next, the output features of all attention heads are concatenated:

[0158] $$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\, W^O \qquad (14)$$

[0159] where H is the number of attention heads and $W^O$ is the weight matrix that combines the different attention heads. Finally, a position-wise fully connected feed-forward network is applied to obtain the final output of the encoder layer.
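For reference, the multi-head attention computation described above can be sketched in NumPy as follows. This is an illustrative sketch of standard scaled dot-product attention, not code from the embodiment; the function names, the random weights, and the sizes N = 6, D = 8, and H = 2 are assumptions chosen only to make the example runnable.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    """X: (N, D); Wq, Wk, Wv: (H, D, Dk); Wo: (H * Dk, D). Returns (N, D)."""
    heads = []
    for h in range(Wq.shape[0]):
        Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]          # per-head linear projections
        scores = softmax(Q @ K.T / np.sqrt(Q.shape[1]))     # (N, N) attention matrix
        heads.append(scores @ V)                            # weighted sum of value rows
    return np.concatenate(heads, axis=1) @ Wo               # concatenate heads, then mix

rng = np.random.default_rng(0)
N, D, H = 6, 8, 2          # hypothetical sizes: keypoints, feature dimension, heads
Dk = D // H
X = rng.standard_normal((N, D))
Wq, Wk, Wv = (rng.standard_normal((H, D, Dk)) for _ in range(3))
Wo = rng.standard_normal((H * Dk, D))
out = multi_head_attention(X, Wq, Wk, Wv, Wo)
```

The `(N, N)` matrix `scores` is the term whose size grows quadratically with the number of keypoints.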

[0160] However, the biggest problem with using Transformer is the high memory and computational cost. Since MHA involves large-scale matrix multiplication and similarity calculations on the feature matrix, it requires significant storage space and computational resources. The dot product attention computation graph is shown in Figure (5a).

[0161] The output of MHA is the concatenation of the outputs of all attention heads along the channel dimension. MHA uses an attention matrix to realize the global interaction between $Q_h$ and $K_h$, and computing this attention matrix requires the matrix dot product between $Q_h$ and $K_h$, which results in a time and space complexity of $O(N^2 D)$. This complexity makes it difficult for the vanilla Transformer to scale to inputs with a large context size N. Here, N is the number of local features in an image in the SLAM task, so MHA is not suitable for the feature enhancement task of this embodiment.

[0162] To address the aforementioned issues, this embodiment replaces the MHA operation in the vanilla Transformer with the more efficient Attention-Free Transformer (AFT). The network flowchart of AFT is shown in Figure (5b). In traditional attention mechanisms, dot-product attention is used to calculate the attention weights, and the input feature vectors are then weighted and summed according to these weights. Unlike MHA or linearized attention, AFT neither computes nor approximates dot-product attention. Specifically, AFT rearranges the calculation order of Q, K, and V, similar to linear attention, but multiplies K and V element-wise instead of using matrix multiplication. The AFT calculation for keypoint i can be expressed as:

[0163] $$f_i(X) = \sigma(Q_i) \odot \frac{\sum_{j=1}^{N} \exp(K_j) \odot V_j}{\sum_{j=1}^{N} \exp(K_j)} \qquad (15)$$

[0164] where $\sigma(\cdot)$ is the Sigmoid function, $\odot$ denotes element-wise multiplication, $Q_i$ is the i-th row of Q, and $K_j$ and $V_j$ are the j-th rows of K and V, respectively. AFT can be regarded as an improved variant of MHA in which the number of attention heads equals the feature dimension D of the model and the similarity kernel function used in MHA is:

[0165] $$\mathrm{sim}(Q, K) = \sigma(Q) \cdot \mathrm{softmax}(K) \qquad (16)$$

[0166] In this way, attention is computed with element-wise multiplication instead of matrix multiplication, which reduces the time and space complexity to $O(ND)$, significantly lower than that of traditional MHA.
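The element-wise attention described above can be sketched in a few lines of NumPy. This is a minimal illustration of the simple AFT form (a Sigmoid gate on Q multiplied by a softmax-over-keypoints weighted sum of V), assuming Q, K, and V have already been obtained by linear projection; the function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aft_simple(Q, K, V):
    """Q, K, V: (N, D) arrays, one row per keypoint. Returns an (N, D) array."""
    # Softmax over the N keypoints, computed independently for each channel.
    w = np.exp(K - K.max(axis=0, keepdims=True))
    w = w / w.sum(axis=0, keepdims=True)              # (N, D)
    # Element-wise product with V, summed over keypoints: one global context row.
    context = (w * V).sum(axis=0, keepdims=True)      # (1, D)
    # Each keypoint gates the shared context with its own query.
    return sigmoid(Q) * context                        # (N, D)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 4)) for _ in range(3))
out = aft_simple(Q, K, V)
```

No `(N, N)` matrix is ever formed: every intermediate array has at most N × D entries, which is where the O(ND) cost comes from.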

[0167] Figure 6 shows the overall structure and details of the feature mutual enhancement network. The mutual enhancement network adds an attention aggregation layer to the feature self-enhancement network; the details of the attention aggregation layer are shown in Figure (6b). The attention aggregation layer consists of multiple attention units chained together, each containing AFT, an MLP, Dropout, residual blocks, and a linear normalization layer. By stacking attention layers, the network can fit features with strong recognizability and uniqueness.

[0168] S33, Training the descriptor enhancement network using a loss function.

[0169] Similar to the training method of SuperGlue, this embodiment adds an optimal transport layer to the last layer of the network, treating the feature point matching problem as an optimal transport problem. This embodiment treats the descriptor matching problem as nearest neighbor retrieval and uses average precision (AP) to train the feature descriptors.

[0170] Consider the transformed local features $d^{tr} = \{d_1^{tr}, \ldots, d_N^{tr}\}$. This embodiment aims to maximize the AP of all features, so one of the network training objectives is to minimize the cost function:

[0171] $$L_{AP} = 1 - \frac{1}{N} \sum_{i=1}^{N} AP(d_i^{tr}) \qquad (17)$$

[0172] where $d_i^{tr}$ denotes the enhanced descriptor, $d_i$ denotes the extracted original feature descriptor, AP denotes the average precision, and N denotes the number of descriptors.

[0173] To ensure that the enhanced features outperform the original features, this embodiment also uses a second loss function, the descriptor enhancement loss function, which forces the transformed features to perform better than the original features and which, together with equation (17), forms a composite loss function:

[0174] $$L_{boost} = \frac{1}{N} \sum_{i=1}^{N} \max\!\left(0, \frac{AP(d_i)}{AP(d_i^{tr})} - 1\right) \qquad (18)$$

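[0175] The final composite loss function consists of the two loss functions, as follows:

[0176] $$L = L_{AP} + \lambda L_{boost} \qquad (19)$$

[0177] where $L_{AP}$ is the average precision loss of equation (17), $L_{boost}$ is the descriptor enhancement loss, and λ is the weight that adjusts the second loss function.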
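As a rough illustration of how the two terms combine, the following sketch computes a composite loss from per-descriptor AP values. The average precision term follows the text (one minus the mean AP of the enhanced descriptors). The enhancement term is an assumption: it is written here as a hinge on the ratio AP(original) / AP(enhanced), which is zero whenever the enhanced descriptor is at least as good as the original. The function name `composite_loss`, the default `lam=10.0`, and the `eps` guard are hypothetical.

```python
import numpy as np

def composite_loss(ap_original, ap_enhanced, lam=10.0, eps=1e-8):
    """Return (total, loss_ap, loss_boost) from per-descriptor AP values."""
    ap_original = np.asarray(ap_original, dtype=float)
    ap_enhanced = np.asarray(ap_enhanced, dtype=float)
    # Average precision term: minimised when every enhanced descriptor reaches AP = 1.
    loss_ap = 1.0 - ap_enhanced.mean()
    # Assumed enhancement term: penalise only descriptors that got worse.
    ratio = ap_original / (ap_enhanced + eps)   # eps avoids division by zero
    loss_boost = np.maximum(0.0, ratio - 1.0).mean()
    return loss_ap + lam * loss_boost, loss_ap, loss_boost

# One descriptor improved (0.5 -> 0.9), one degraded (0.8 -> 0.4).
total, loss_ap, loss_boost = composite_loss([0.5, 0.8], [0.9, 0.4], lam=2.0)
```

In this example only the degraded descriptor contributes to `loss_boost`, so the weight λ controls how strongly regressions relative to the original ORB descriptor are punished.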

[0178] To allow the network to be trained by back-propagation, this embodiment uses the differentiable FastAP method to calculate the average precision of each descriptor.

[0179] Given a transformed feature $d^{tr}$ in the first image and the feature set $\{d_1^{tr}, \ldots, d_N^{tr}\}$ in the second image, FastAP uses the ground-truth labels of the matching pairs $M = \{M^+, M^-\}$ to compute the pairwise distance vector $Z \in \mathbb{R}^N$ and its range Ω. Through distance quantization, Ω is quantized into a finite set of b elements, $\Omega = \{z_1, z_2, \ldots, z_b\}$. Precision and recall can then be reformulated as functions of the Euclidean distance z:

[0180] $$\mathrm{Prec}(z) = P(M^+ \mid Z < z), \qquad \mathrm{Rec}(z) = P(Z < z \mid M^+) \qquad (20)$$

[0181] where $P(M^+ \mid Z < z)$ is the prior distribution of the positive matches $M^+$ conditioned on $Z < z$, and $P(Z < z \mid M^+)$ is the cumulative distribution function (CDF) of Z. Finally, AP is represented by the area under the precision-recall curve, as shown in equation (21), where $\Delta\mathrm{Rec}(z)$ is the increment of recall at distance z. The ground-truth labels of the matching pairs M are obtained from the ground-truth poses and depth maps, that is, from the reference pseudo ground truth obtained in step S27.

[0182] $$AP = \sum_{z \in \Omega} \mathrm{Prec}(z)\, \Delta\mathrm{Rec}(z) \qquad (21)$$
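The histogram-based AP computation can be sketched as follows for a single query descriptor. This is a simplified NumPy illustration of the distance quantization idea, not the training code of the embodiment: distances are softly assigned to b bins with a triangular kernel, and precision and recall increments are read from the cumulative histograms. The function name `fast_ap`, the default of 10 bins, and the range `z_max=2.0` (the largest Euclidean distance between unit-length descriptors) are assumptions.

```python
import numpy as np

def fast_ap(dist, is_pos, num_bins=10, z_max=2.0):
    """Histogram approximation of AP for one query.

    dist:   (N,) distances from the query to the N candidate descriptors.
    is_pos: (N,) booleans, True where the candidate is a ground-truth match.
    """
    dist = np.asarray(dist, dtype=float)
    is_pos = np.asarray(is_pos, dtype=bool)
    n_pos = int(is_pos.sum())
    if n_pos == 0:
        return 0.0
    centers = np.linspace(0.0, z_max, num_bins)       # quantized distances z_1..z_b
    width = z_max / (num_bins - 1)
    # Triangular soft assignment of each distance to its neighbouring bins.
    w = np.clip(1.0 - np.abs(dist[:, None] - centers[None, :]) / width, 0.0, None)
    h_pos = w[is_pos].sum(axis=0)                     # positives per bin
    h_all = w.sum(axis=0)                             # all candidates per bin
    H_pos = np.cumsum(h_pos)                          # positives with Z <= z
    H_all = np.cumsum(h_all)                          # candidates with Z <= z
    safe = np.where(H_all > 0, H_all, 1.0)            # avoid 0 / 0 in empty prefixes
    precision = H_pos / safe                          # estimate of P(M+ | Z <= z)
    recall_step = h_pos / n_pos                       # recall increment per bin
    return float(np.sum(precision * recall_step))

best = fast_ap([0.0, 2.0, 2.0], [True, False, False])   # match ranked first
worst = fast_ap([2.0, 0.0, 0.0], [True, False, False])  # match ranked last
```

Because the soft assignment is piecewise linear in the distances, the same computation written in an automatic differentiation framework yields gradients with respect to the descriptors, which is the property the training procedure relies on.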

[0183] S4. Use the enhanced feature description operator to perform feature point matching between adjacent frames, and perform the SLAM task based on the obtained feature point pairs to achieve environmental perception and positioning

[0184] The descriptor enhancement network described above is applied to the ORB-SLAM2 system; the specific framework is shown in Figure 7.

[0185] First, the input of the ORB-SLAM2 system is a sequence of adjacent images. Representative feature points and their corresponding feature information are then extracted from each image by the ORB feature extraction algorithm. Next, the feature enhancement module applies self-enhancement and mutual enhancement processing to the extracted feature information to improve the description quality and discriminability of the features. The enhanced features are used for matching between adjacent frames.

[0186] In the feature matching stage, the enhanced feature points in different images are matched to find their correspondences. Matching with the enhanced features improves the matching accuracy and robustness of the front-end visual odometry and thereby increases the number of feature point pairs obtained in front-end matching. Finally, based on the obtained feature point pairs, tasks such as back-end map construction in the ORB-SLAM2 system are performed, thereby achieving environmental perception and positioning.
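The matching of enhanced descriptors between two adjacent frames can be illustrated with a brute-force nearest neighbor search. The text does not specify the matching rule used inside ORB-SLAM2, so this sketch makes two assumptions: the enhanced descriptors are real-valued and compared by Euclidean distance, and a match is kept only if it is a mutual nearest neighbor. The function name `mutual_nearest_neighbors` is hypothetical.

```python
import numpy as np

def mutual_nearest_neighbors(desc_a, desc_b):
    """desc_a: (Na, D), desc_b: (Nb, D). Returns an (M, 2) array of index pairs."""
    # Squared Euclidean distance between every descriptor pair, shape (Na, Nb).
    d2 = (np.sum(desc_a ** 2, axis=1)[:, None]
          + np.sum(desc_b ** 2, axis=1)[None, :]
          - 2.0 * desc_a @ desc_b.T)
    nn_ab = d2.argmin(axis=1)            # best candidate in frame B for each A
    nn_ba = d2.argmin(axis=0)            # best candidate in frame A for each B
    idx_a = np.arange(desc_a.shape[0])
    keep = nn_ba[nn_ab] == idx_a         # keep only mutually consistent pairs
    return np.stack([idx_a[keep], nn_ab[keep]], axis=1)

# Toy example: frame B holds the descriptors of frame A, permuted and slightly shifted.
desc_a = np.eye(4)
desc_b = np.eye(4)[[2, 0, 3, 1]] + 0.01
matches = mutual_nearest_neighbors(desc_a, desc_b)
```

The resulting index pairs are the feature point pairs that a front end would pass on to pose estimation and mapping.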

[0187] Combining the feature enhancement module with the ORB-SLAM2 system increases both the number of successfully matched point pairs in the monocular visual odometry and the density of map points in the back-end mapping. The present invention greatly improves the tracking stability and mapping density of the monocular visual SLAM system and can effectively improve the performance of computer vision tasks, especially when handling large viewpoint changes. The matching results before and after feature enhancement under different viewpoint changes are shown in Figure 8 and Figure 9. As Figure 8 shows, under small angle changes the matching results of ORB, SIFT, and the method of this invention are comparable; as Figure 9 shows, both ORB and SIFT feature matching suffer from intersecting matching lines, a problem avoided by the method of this invention. A comparison of the mapping results on a self-collected dataset is shown in Figure 10. As Figure 10 shows, the point cloud generated by the method of this invention is denser than that generated by the ORB method. Furthermore, this invention avoids the hardware and code modifications required to replace existing descriptors.

[0188] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0189] The preferred embodiments of the present invention have been described in detail above. It should be understood that those skilled in the art can make numerous modifications and variations based on the concept of the present invention without creative effort. Therefore, all technical solutions that can be obtained by those skilled in the art based on the concept of the present invention through logical analysis, reasoning, or limited experimentation on the basis of existing technology should be within the scope of protection defined by the claims.

Claims

1. A SLAM method based on ORB description operator enhancement, characterized in that it includes the following steps: S1, acquiring an adjacent image sequence, performing homography transformation on the images, and converting them to the HLS color space; S2, extracting image feature points and their corresponding feature descriptors using the ORB feature extraction algorithm in 3D reconstruction; S3, performing self-enhancement and mutual enhancement processing on the extracted feature descriptors using a descriptor enhancement network; S4, matching feature points in adjacent frames using the enhanced feature descriptors, and performing the SLAM task based on the matched feature point pairs to achieve environmental perception and localization; the loss function of the descriptor enhancement network is expressed as: $L = L_{AP} + \lambda L_{boost}$, where $L_{AP}$ is the loss function that maximizes the average precision of the features, $L_{boost}$ is the descriptor enhancement loss function, and $\lambda$ is the weight that adjusts the descriptor enhancement loss function; $L_{AP} = 1 - \frac{1}{N} \sum_{i=1}^{N} AP(d_i^{tr})$ and $L_{boost} = \frac{1}{N} \sum_{i=1}^{N} \max\left(0, \frac{AP(d_i)}{AP(d_i^{tr})} - 1\right)$, where $d_i^{tr}$ denotes the enhanced descriptor, $d_i$ denotes the extracted original feature descriptor, AP denotes the average precision, and N denotes the number of descriptors.

2. The SLAM method based on ORB description operator enhancement according to claim 1, characterized in that step S1 includes the following steps: S11, applying the homography transformation matrix to the image, where the homography transformation matrix projects points of the image onto a second view plane and is a 3×3 homogeneous matrix; S12, converting the image after homography transformation from RGB space to HLS space: let the RGB color of an image pixel be $(R, G, B)$, let $max$ be the maximum value among $(R, G, B)$ and $min$ the minimum value, with $R$, $G$, and $B$ in the range $[0, 1]$; 1) calculate lightness $L$: $L = (max + min)/2$; when $max = min$, $S = 0$, which indicates that the color is gray, and $H$ does not represent any color; 2) calculate saturation $S$: if lightness $L \le 0.5$, then $S = (max - min)/(max + min)$; if lightness $L > 0.5$, then $S = (max - min)/(2 - max - min)$; 3) calculate hue $H$: when $max = R$, $H = 60 \times (G - B)/(max - min)$, and the color is between yellow and magenta; when $max = G$, $H = 120 + 60 \times (B - R)/(max - min)$, and the color is between cyan and yellow; when $max = B$, $H = 240 + 60 \times (R - G)/(max - min)$, and the color is between magenta and cyan; if the calculated hue $H$ is negative, 360 is added to the result to obtain the final hue.

3. The SLAM method based on ORB description operator enhancement according to claim 1, characterized in that, Step S2 includes the following steps: S21, Feature Extraction: Extracting feature points and corresponding feature descriptors from an image based on a feature extraction algorithm; S22, Image Pair Matching: Based on exhaustive matching, sequence matching, spatial matching and transitive matching, the correspondence between images is determined from different perspectives. Based on the matching results, image pairs are established, and each image pair includes a reference image and a target image. S23, Match feature points in the matched image pairs: Based on the KD tree nearest neighbor search algorithm, calculate the distance or similarity between the descriptor vectors of feature points to perform feature point matching; S24, based on geometric verification, filter out feature points that are mismatched; S25, perform sparse reconstruction and incremental reconstruction of the image, estimate the camera pose and reconstruct a sparse representation of the scene; S26. Based on the sparse representation of the camera pose and scene, generate depth maps and normal maps, perform dense reconstruction, and obtain depth images for each image. S27. Save the camera pose of each image as a trajectory ground truth value according to the timestamp, and use it as a reference pseudo-ground truth value for each training of the augmentation descriptor.

4. The SLAM method based on ORB description operator enhancement according to claim 1, characterized in that, The feature descriptor is used to characterize the feature information of an image. The feature information of the image includes independent information and set information of feature points. The independent information includes global geometric information and feature point representation information. The set information includes the relative positional relationship of the feature point set. The descriptor enhancement network includes a feature self-enhancement network based on representation information and a feature mutual enhancement network based on feature point set.

5. The SLAM method based on ORB description operator enhancement according to claim 4, characterized in that the feature self-enhancement network, based on the MLP model, fuses geometric information, feature information, and representation information into the extracted feature descriptor, thereby enhancing the descriptor: $d_i^{tr} = \mathrm{MLP}_{desc}(d_i) + \mathrm{MLP}_{geo}(p_i) + \mathrm{MLP}_{rep}(c_i)$, where $d_i$ denotes the feature descriptor extracted in step S2, $\mathrm{MLP}_{desc}$ denotes the MLP model for feature information, $d_i^{tr}$ denotes the enhanced descriptor, $\mathrm{MLP}_{geo}$ denotes the MLP model for geometric information, $p_i$ denotes all available geometric information of the i-th keypoint, $\mathrm{MLP}_{rep}$ denotes the MLP model for representation information, and $c_i$ denotes all representation information of the i-th keypoint; through $\mathrm{MLP}_{geo}$ and $\mathrm{MLP}_{rep}$, the global geometric information and representation information of the descriptor are mapped to a high-dimensional space.

6. The SLAM method based on ORB description operator enhancement according to claim 4, characterized in that the feature mutual enhancement network captures the spatial contextual cues of sparse local features extracted from the same image based on the Transformer model, expressed as follows: $(d_1^{tr}, \ldots, d_N^{tr}) \leftarrow \mathrm{Trans}(d_1^{tr}, \ldots, d_N^{tr})$, where $\mathrm{Trans}$ denotes the Transformer operation and $d_i^{tr}$ denotes the enhanced descriptor; the Transformer model uses an AFT network, its input is the N local features within the same image, and its output is the feature descriptors mutually enhanced in the feature space.

7. The SLAM method based on ORB description operator enhancement according to claim 1, characterized in that the average precision of each descriptor in the loss function of the descriptor enhancement network is calculated based on the differentiable method FastAP: given a transformed feature $d^{tr}$ in the first image and the feature set $\{d_1^{tr}, \ldots, d_N^{tr}\}$ in the second image, FastAP uses the ground-truth labels of the matching pairs $M = \{M^+, M^-\}$ to compute the pairwise distance vector $Z \in \mathbb{R}^N$ and its range $\Omega$; the range $\Omega$ is quantized by distance quantization into a finite set of b elements, $\Omega = \{z_1, z_2, \ldots, z_b\}$; precision and recall are reformulated as functions of the Euclidean distance z: $\mathrm{Prec}(z) = P(M^+ \mid Z < z)$ and $\mathrm{Rec}(z) = P(Z < z \mid M^+)$, where $P(M^+ \mid Z < z)$ denotes the prior distribution of the positive matches $M^+$ conditioned on $Z < z$, and $P(Z < z \mid M^+)$ denotes the cumulative distribution function of Z; the area under the precision-recall curve represents the average precision: $AP = \sum_{z \in \Omega} \mathrm{Prec}(z)\, \Delta\mathrm{Rec}(z)$, where $\Delta\mathrm{Rec}(z)$ is the increment of recall at distance z; the ground-truth labels of the matching points M are obtained using the ground-truth poses and depth maps.

8. A SLAM device based on ORB description operator enhancement, comprising a memory, a processor, and a program stored in the memory, characterized in that, When the processor executes the program, it implements the method as described in any one of claims 1-7.

9. A storage medium having a program stored thereon, characterized in that, When the program is executed, it implements the method as described in any one of claims 1-7.