A precision self-supervised visual relocalization method based on block random fern coding
By employing block random fern coding and pose accuracy self-supervision, the problem of visual relocalization under large viewing angle difference was solved, achieving high-precision visual relocalization and improving the positioning accuracy and real-time performance of robots and other devices.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HANGZHOU DIANZI UNIV
- Filing Date
- 2023-06-29
- Publication Date
- 2026-06-12
AI Technical Summary
Existing visual relocalization technologies struggle to achieve accurate matching and real-time positioning under large viewing angle differences, which degrades the user experience and limits the application of autonomous mobile devices such as robots.
A precision self-supervised visual relocalization method based on block random fern coding is adopted. Image pyramids are constructed on the color map and depth map, block random fern coding is performed, the codes are matched against the historical keyframe database, and a pose accuracy self-identification module is used to optimize the relocalization pose and improve its accuracy.
It achieves accurate visual relocalization under large angular difference, improves the accuracy and recall of relocalization, enhances user experience, and is suitable for real-time localization and environmental map construction of autonomous mobile devices such as robots.
Figure CN116740179B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer vision, and in particular to visual relocalization for robots, mixed reality (MR) devices, and similar equipment. It provides a method for real-time indoor visual relocalization under large viewing angle differences.
Background Technology
[0002] Visual SLAM (Visual Simultaneous Localization and Mapping) utilizes visual signals captured by monocular, binocular, or RGB-D cameras as input to help robots and other autonomous mobile devices simultaneously achieve device localization and construct a 3D map of their surroundings in unknown environments. This technology is now widely used in drones, autonomous vehicles, service robots, and augmented reality (AR) devices, providing them with high-precision localization and environmental mapping. While significant progress has been made, its robustness in localization has not yet reached the level required for large-scale application. Currently, it is mainly limited by situations where numerous moving objects appear within the camera's field of view, or by physical limitations when RGB-D cameras capture objects (such as numerous specular reflections or pure black objects), resulting in numerous erroneous depth values. These issues can lead to matching errors in the visual signals, causing camera localization failure. When erroneous localization occurs, a visual relocalization technology that can accurately identify the error and quickly obtain the correct pose of the device in the environment can significantly improve the robustness of visual SLAM systems, thereby promoting their practical application in the robotics and other autonomous mobile device industries.
[0003] Current visual relocalization technologies can be categorized into four types:
[0004] The first method is camera pose regression, which uses deep learning to directly regress the camera pose from the input image. Although this method can obtain the camera pose end-to-end, it currently has low generalization ability and the accuracy of relocalization is far from meeting the requirements of robot applications.
[0005] The second method is scene coordinate regression, which uses deep learning to regress the 3D scene coordinates of the point observed at each pixel. The 3D pose of the camera is then calculated using methods such as RANSAC. This type of method has made great progress in recent years and achieves high relocalization accuracy. However, when facing a new scene it requires lengthy training on that scene. That is, these methods must learn prior knowledge of a specific scene and lack generalization ability. This type of method is therefore currently difficult to apply in real-time SLAM systems.
[0006] The third method is sparse feature matching. This method extracts sparse feature points (such as SIFT or ORB) from the input image and matches them with feature points in 3D space to obtain the camera's pose in the environment at the current moment. Although this method is simple and effective, the calculation of feature point matching increases exponentially with the number of feature points in the scene, making it difficult to embed this method into large-scale real-time SLAM systems.
[0007] The fourth method is keyframe image retrieval. This method compares the input image with historical keyframe images and estimates the 3D pose of the current camera from the current frame and the historical keyframe with the smallest difference. It is widely used in real-time visual SLAM systems and 3D reconstruction systems because of its high positioning accuracy and real-time performance. However, the biggest bottleneck of this type of method is the difficulty of relocalizing from new viewpoints and under large field-of-view differences. Accurate relocalization can only be achieved when the camera moves close to a pose at which a historical keyframe was collected. This severely impacts the user experience of relocalization and limits the application of visual relocalization.
Summary of the Invention
[0008] To address the problem of visual relocalization under large angular differences, this invention proposes a precision self-supervised visual relocalization method based on block random fern coding.
[0009] This invention describes a method that, based on the color and depth map information of the current input frame, accurately and automatically calculates the six-degree-of-freedom (6DOF) pose of the camera in a known 3D space. First, an image pyramid is constructed on the input color and depth maps to obtain multi-scale information. Then, fern coding is performed on each image layer to obtain multi-scale, multi-sub-region coding information of the input image. Efficient binary matching is performed between the coding information of the sub-regions and all sub-region coding information in the historical keyframe database to obtain a candidate list of keyframes whose coding is similar to that of some sub-region of the input frame. Next, the six-DOF pose of the current camera is estimated from the overall image information of the keyframe whose region is closest to the local region of the current frame, together with its corresponding historical camera pose. The accuracy of this pose is then checked by a pose accuracy self-identification module. A high-precision pose is output directly. For a low-precision or completely incorrect pose, the system loops back to the historical keyframe candidate list and selects the next-best historical keyframe that matches the local information of the current frame for pose estimation, until the pose accuracy self-identification module identifies a high-precision relocalization pose. The method of this invention not only yields accurate relocalization poses; experiments also show that it can achieve relocalization at new viewpoints with large angular differences. This greatly improves the user experience during relocalization, since the user only needs to roughly scan the original historical area to relocalize correctly.
[0010] The precision self-supervised visual relocalization method based on block random fern coding mainly includes the following two implementation phases:
[0011] Phase 1: Block random fern coding and matching.
[0012] Step 1.1: Construction of image blocks.
[0013] First, the depth map is converted into a surface normal map. Then, an image pyramid is constructed on the input color map and the normal map respectively. On each layer of the pyramid, the region is segmented to construct blocks.
[0014] Step 1.2: Encoding of the block random fern.
[0015] Block random fern coding is performed on different layers of the color image pyramid and the normal vector image pyramid, respectively.
[0016] Step 1.3: Matching of block encoding information.
[0017] During 3D scene construction, we collect historical keyframes. First, we perform block-based random fern coding on each input image frame. By comparing the global coding information of the current frame with that of historical keyframes, we determine whether the frame is a new keyframe. If it is a keyframe, we record its coding information and corresponding camera pose. When the camera re-enters the scene, we use the keyframe database collected from the known scene (which stores the coding information of each keyframe and its six-DOF pose in the scene) to compare the similarity between the block-based random fern coding information of the current frame and the block-based coding information of historical keyframes. We then temporarily store the historical keyframe coding information and pose corresponding to the most similar block in the keyframe candidate list.
[0018] Phase 2: Precision self-supervised relocalization pose optimization.
[0019] Step 2.1: Pose estimation based on historical keyframes.
[0020] From the historical keyframe candidate list, based on the similarity matching results, the pose with the closest similarity in the local area is selected. Using this pose, the known 3D scene, and the current frame input information, the camera pose of the current frame is estimated.
[0021] Step 2.2: Self-identification of pose accuracy.
[0022] Because erroneous keyframe matches are inevitable during the image matching stage, they lead to errors in pose estimation. To improve the accuracy of relocalization, this method identifies the accuracy of each relocalization pose. A high-precision pose is output directly; for a low-precision or even completely incorrect pose, the next-best match is searched for in the historical keyframe candidate list. Phase 2 is then repeated until a high-precision relocalization pose is obtained.
[0023] The beneficial effects of this invention are as follows:
[0024] (1) An efficient block-based random fern coding matching method is proposed to achieve visual relocalization under large viewpoint differences. The biggest bottleneck of image matching-based visual relocalization methods is that it is difficult to extend the method to achieve relocalization on new viewpoints. That is, the correct pose can only be regressed when the camera pose is almost completely consistent with the pose collected from historical keyframes. This greatly affects the user experience when using relocalization and limits the practical application of relocalization methods. This method innovatively proposes an efficient block-based random fern coding matching method to achieve correct image matching under large viewpoint differences, and then correctly estimates the accurate pose of the camera in the scene.
[0025] (2) A self-detection method for relocalization accuracy is proposed. To improve the accuracy of relocalization, we propose a self-supervised method for detecting relocalization accuracy. High-precision relocalization poses are output directly. For low-precision or even completely incorrect relocalization poses, the method loops back to the matching stage, searches for the next-best matching result, and recalculates the relocalization pose until a correct relocalization pose is output.
[0026] (3) High-precision visual relocalization. Extensive experiments have demonstrated that this method achieves accurate relocalization under large angular differences, and that the accuracy and recall of the relocalization achieved by this invention exceed the best results of other visual relocalization techniques.
Attached Figure Description
[0027] Figure 1 is a flowchart of the precision self-supervised visual relocalization based on block random fern coding according to an embodiment of the present invention.
[0028] Figure 2 is a schematic flowchart of block random fern coding and block coding matching according to an embodiment of the present invention.
[0029] Figures 3-6 are schematic diagrams of precise relocalization under large viewing angle differences according to an embodiment of the present invention.
Detailed Implementation
[0030] The specific implementation of this invention patent will be described in detail below with reference to the accompanying drawings.
[0031] As shown in the flowchart of Figure 1, the precision self-supervised visual relocalization based on block random fern coding mainly includes the following two implementation phases:
[0032] Phase 1: Block random fern coding and matching.
[0033] The specific process for this phase is shown in Figure 2. It is mainly divided into three steps.
[0034] Step 1.1 Construction of image blocks.
[0035] To extend matching to new viewpoints, we convert the depth map, which characterizes the geometric properties of the captured object, into a surface normal vector map. This normal vector does not change with the camera's viewpoint, which prepares the data for the translation, rotation, and scale invariance added in the following steps. To enhance the matching of multi-scale information and improve the scale invariance of this visual relocalization method, we first construct image pyramids on the input color image and on the normal vector map, respectively. Then, to enhance translation invariance, we divide the image on each pyramid layer into regions of the same size. For example, a w × h image is divided into n^2 equal parts, each of size (w / n) × (h / n).
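The block construction in Step 1.1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: it assumes NumPy, all function names are hypothetical, the normal map is approximated from depth gradients without camera intrinsics, and the pyramid uses plain 2x subsampling without smoothing.

```python
import numpy as np

def depth_to_normals(depth):
    """Approximate per-pixel surface normals from a depth map.

    Assumption: the depth map is treated as a height field (no camera
    intrinsics), so the normal is proportional to (-dz/dx, -dz/dy, 1).
    """
    dz_dy, dz_dx = np.gradient(depth.astype(np.float64))
    normals = np.dstack((-dz_dx, -dz_dy, np.ones_like(dz_dx)))
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)
    return normals  # shape (h, w, 3), components in [-1, 1]

def build_pyramid(image, levels):
    """Image pyramid by repeated 2x subsampling (no smoothing, for brevity)."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(pyramid[-1][::2, ::2])
    return pyramid

def split_into_blocks(image, n):
    """Split an h x w image into n * n blocks of size (h // n) x (w // n)."""
    h, w = image.shape[:2]
    bh, bw = h // n, w // n
    return [image[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
            for r in range(n) for c in range(n)]

# Synthetic inputs: a tilted plane as depth and a black color image.
depth = np.fromfunction(lambda y, x: 1.0 + 0.01 * x, (48, 64))
color = np.zeros((48, 64, 3), dtype=np.uint8)

normal_pyr = build_pyramid(depth_to_normals(depth), levels=3)
color_pyr = build_pyramid(color, levels=3)
blocks = split_into_blocks(color_pyr[0], n=4)  # 16 blocks on the finest layer
```

Each pyramid layer of both the color image and the normal map would be split in the same way, so that a block on one layer can be compared with blocks at other positions and scales.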
[0036] Step 1.2 Encoding of the block random fern.
[0037] We then establish random ferns with the same pattern at the different scales and blocks generated in Step 1.1. Within a single block, m fern points are randomly selected, and all fern points in the block are binary-encoded to obtain the coding information of the block. A pixel fern point consists of its location l within the block and a threshold τ for each image channel at that location. At pixel location l, the value of each channel of the input image is compared with the corresponding threshold to obtain the binary code. The binary code of each channel is calculated using the following formula:
[0038]
[0039] Within the same scene, the information for each fern point on a block (the fern point location and the threshold for each channel) is randomly generated only once, and the same pattern is used to construct fern points and obtain coding information in every block. The color thresholds τ_r, τ_g, τ_b are randomly generated integers between 0 and 255, and the normal vector thresholds are randomly generated float values between -1 and 1.
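A minimal sketch of the block random fern coding of Step 1.2, assuming NumPy. The function names and the dictionary layout are hypothetical. The direction of the comparison (a bit is 1 when the channel value is greater than or equal to its threshold) is also an assumption, because the formula itself is not reproduced in this text.

```python
import numpy as np

def make_fern_pattern(block_h, block_w, m, rng):
    """Generate m fern points once per scene; the pattern is reused for every block."""
    return {
        "rows": rng.integers(0, block_h, size=m),
        "cols": rng.integers(0, block_w, size=m),
        # color thresholds: random integers in [0, 255]
        "tau_color": rng.integers(0, 256, size=(m, 3)),
        # normal thresholds: random floats in [-1, 1)
        "tau_normal": rng.uniform(-1.0, 1.0, size=(m, 3)),
    }

def encode_block(color_block, normal_block, pattern):
    """Return an (m, 6) bit array: 3 color channels followed by 3 normal channels."""
    r, c = pattern["rows"], pattern["cols"]
    # Assumed comparison direction: bit = 1 when value >= threshold.
    color_bits = color_block[r, c] >= pattern["tau_color"]
    normal_bits = normal_block[r, c] >= pattern["tau_normal"]
    return np.hstack((color_bits, normal_bits)).astype(np.uint8)

rng = np.random.default_rng(0)
pattern = make_fern_pattern(12, 16, m=8, rng=rng)

# A saturated white block whose normals are all (1, 1, 1): every bit is 1,
# because 255 >= any color threshold and 1.0 >= any normal threshold.
color_block = np.full((12, 16, 3), 255, dtype=np.uint8)
normal_block = np.ones((12, 16, 3))
code = encode_block(color_block, normal_block, pattern)
```

Because the pattern is generated once and shared, two blocks with the same content produce the same code regardless of where they sit in the image, which is what allows a block in the input frame to be matched against a block at a different position in a keyframe.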
[0040] Step 1.3 Matching of block encoding information.
[0041] To obtain the accurate camera pose in the current 3D space, we estimate the six-DOF pose of the camera in the known space from the pose information of the matched historical keyframes and the image information captured in the current frame. After the block random fern coding of Step 1.2, we calculate the similarity between blocks of two frames using the block (local) Hamming distance (LHD):
[0042]
[0043] Here, the first operand is the coding information (coded features) of the s1-th block of the g-th frame, and the second operand is the coded features of the s2-th block of the k-th frame. The operator ≡ returns 0 when the six-bit binary codes over the six channels of the two fern points are completely equal, and returns 1 otherwise.
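The operator ≡ and the block Hamming distance can be illustrated with a short sketch. The function name is hypothetical, and normalizing the mismatch count by the number of fern points m is an assumption, since the exact form of the formula is not reproduced in this text.

```python
def block_hamming_distance(code_a, code_b):
    """Fraction of fern points whose six-bit codes differ between two blocks.

    Each code is a sequence of m fern points, and each fern point is a
    sequence of six bits (three color channels, three normal channels).
    A fern point contributes 0 when all six bits are equal and 1 otherwise.
    """
    if len(code_a) != len(code_b):
        raise ValueError("both blocks must use the same fern pattern")
    mismatches = sum(1 for fa, fb in zip(code_a, code_b) if tuple(fa) != tuple(fb))
    return mismatches / len(code_a)

block_g = [(1, 0, 1, 0, 1, 0), (0, 0, 0, 0, 0, 0)]
block_k = [(1, 0, 1, 0, 1, 0), (0, 0, 0, 0, 0, 1)]  # second fern differs in one bit
distance = block_hamming_distance(block_g, block_k)
```

Note that a single differing bit makes the whole fern point count as a mismatch; the distance counts fern points, not individual bits.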
[0044] We express the global similarity between two frames using the global Hamming distance between them:
[0045]
[0046] where C_g represents the global coding features of the g-th input frame and C_k represents the global coding features of the k-th input frame.
[0047] During the reconstruction phase of the current scene (establishing a known scene), the global Hamming distance between two frames is calculated to determine the global similarity between the current frame and historical keyframes. When the similarity is less than a certain threshold, the frame is recorded as a historical keyframe and stored in the historical keyframe database. When visual relocalization is required, the similarity between two frames is calculated by measuring the block Hamming distance between the coded features of each block in the input frame and the coded features of all blocks in the historical keyframes. When the similarity is greater than a set threshold, the coded information and pose of the historical keyframe corresponding to the closest matching block are temporarily stored in the keyframe candidate list.
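The two uses of the coded features described above, keyframe collection during mapping and candidate retrieval during relocalization, can be sketched as follows. All names and both threshold values are hypothetical, and similarity is assumed to be one minus the normalized Hamming distance.

```python
def fern_mismatch_ratio(code_a, code_b):
    """Fraction of fern points whose multi-channel bit codes differ."""
    pairs = list(zip(code_a, code_b))
    return sum(1 for fa, fb in pairs if fa != fb) / len(pairs)

def global_similarity(frame_a, frame_b):
    """1 - global Hamming distance; a frame is a list of block codes in fixed order."""
    flat_a = [fern for block in frame_a for fern in block]
    flat_b = [fern for block in frame_b for fern in block]
    return 1.0 - fern_mismatch_ratio(flat_a, flat_b)

def maybe_add_keyframe(database, frame_code, pose, new_kf_threshold=0.8):
    """Mapping phase: store the frame if it is dissimilar to every stored keyframe."""
    if all(global_similarity(frame_code, kf["code"]) < new_kf_threshold for kf in database):
        database.append({"code": frame_code, "pose": pose})
        return True
    return False

def build_candidate_list(database, frame_code, match_threshold=0.7):
    """Relocalization phase: compare every input block with every stored block."""
    candidates = []
    for kf_index, kf in enumerate(database):
        best = max(1.0 - fern_mismatch_ratio(b_in, b_kf)
                   for b_in in frame_code for b_kf in kf["code"])
        if best > match_threshold:
            candidates.append((best, kf_index, kf["pose"]))
    candidates.sort(key=lambda c: c[0], reverse=True)  # best match first
    return candidates

# Two blocks with two fern points each (bit codes shortened to two bits for brevity).
block_a = [(1, 1), (0, 0)]
block_b = [(0, 1), (1, 0)]
frame_1 = [block_a, block_b]
frame_2 = [block_b, block_a]  # same content, blocks at swapped positions

database = []
added_first = maybe_add_keyframe(database, frame_1, "pose_1")
added_again = maybe_add_keyframe(database, frame_1, "pose_1_again")
candidates = build_candidate_list(database, frame_2)
```

In this toy case the global similarity between `frame_1` and `frame_2` is 0, so a purely global comparison would miss the match, while the block-level comparison finds a perfect block match and returns the keyframe as a candidate.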
[0048] Phase 2: Precision self-supervised relocalization pose optimization.
[0049] Step 2.1 Pose estimation based on historical keyframes.
[0050] From the candidate list of historical keyframes, based on the similarity matching results, the pose corresponding to the local block of the historical keyframe that is most similar to a local block of the current frame is selected. Using this pose, the known 3D scene, and the input information of the current frame (including the depth map and the color map), the camera pose of the current frame is estimated. Because this method matches local blocks, the viewpoint difference between the input frame and the matched historical keyframe may be large, which makes it difficult to calculate the pose of the current frame accurately using the commonly used ICP (Iterative Closest Point) algorithm alone. Therefore, we first use the fast 3D registration method TEASER for coarse registration and then refine the pose of the current camera iteratively with ICP.
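The coarse-to-fine idea can be sketched with a small point-to-point ICP in NumPy. This is an illustration under stated assumptions only: the coarse stage here is NOT TEASER but a hand-made rough guess standing in for its output, the function names are hypothetical, and nearest neighbours are found by brute force.

```python
import numpy as np

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that dst is close to src @ R.T + t."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, dst_c - R @ src_c

def icp(src, dst, R, t, iterations=20):
    """Point-to-point ICP refinement starting from the coarse guess (R, t)."""
    for _ in range(iterations):
        moved = src @ R.T + t
        # brute-force nearest neighbours; acceptable for a small cloud
        d2 = ((moved[:, None, :] - dst[None, :, :]) ** 2).sum(axis=2)
        nearest = dst[d2.argmin(axis=1)]
        R, t = kabsch(src, nearest)
    # mean distance to the last set of nearest neighbours
    residual = np.sqrt(((src @ R.T + t - nearest) ** 2).sum(axis=1)).mean()
    return R, t, residual

def rot_z(deg):
    a = np.deg2rad(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a), np.cos(a), 0.0],
                     [0.0, 0.0, 1.0]])

# Synthetic cloud: a jittered 3 x 3 x 3 grid and a known ground-truth motion.
rng = np.random.default_rng(0)
grid = np.array([[x, y, z] for x in range(3) for y in range(3) for z in range(3)],
                dtype=float)
src = grid + rng.uniform(-0.1, 0.1, size=grid.shape)
R_true, t_true = rot_z(40.0), np.array([0.5, -0.2, 0.1])
dst = src @ R_true.T + t_true

# Stand-in for the coarse registration result: 2 degrees and 2 cm off.
R_coarse, t_coarse = rot_z(38.0), t_true + 0.02
R_est, t_est, residual = icp(src, dst, R_coarse, t_coarse)
```

Starting ICP from the identity instead of the coarse guess would make the first nearest-neighbour assignment unreliable for a 40° rotation, which is the failure mode the coarse registration stage is meant to avoid.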
[0051] Step 2.2 Self-identification of pose accuracy.
[0052] To our knowledge, no existing relocalization method can detect the accuracy of its own relocalization pose in a self-supervised manner. We designed and implemented a relocalization accuracy self-detection module using a Support Vector Machine (SVM), as follows. To detect the accuracy of the relocalization pose, the accuracy detection problem is transformed into the problem of identifying the accuracy interval of the relocalization pose. In this invention, a multi-classifier Θ classifies the accuracy interval y of the relocalization pose from the intermediate variable values x generated by the ICP algorithm, i.e., y = Θ(x). The multi-classifier Θ is constructed with an SVM. To train it, we construct a training dataset containing the ICP intermediate variable values x := (H, R, O) and the localization accuracy interval category y, where x comprises the Hessian matrix H, the residual R, and the outlier percentage O. These variables all reflect, to some extent, the degree of convergence of ICP. In our training dataset, the accuracy interval y is divided into 11 categories: Category 1 (displacement error within 1 cm and rotation error within 1°), Category 2 (displacement error within 2 cm and rotation error within 2°, but not Category 1), and so on, with Category 11 covering displacement errors greater than 10 cm or rotation errors greater than 10°. This dataset can be constructed on any localization or relocalization dataset containing ground-truth poses. For example, we constructed our relocalization pose accuracy detection dataset on the well-known 7-Scenes visual relocalization dataset. After relocalizing each frame, we save the ICP intermediate variables x and the relocalization pose, and compute the difference between the ground-truth pose and the relocalization pose to obtain y.
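A minimal sketch of training such a multi-classifier, assuming scikit-learn and NumPy. The feature layout (a flattened 6 × 6 Hessian followed by the residual and the outlier percentage), the function name, and the kernel choice are assumptions. The training data below is synthetic stand-in data generated only so that the example runs; it does not represent real ICP statistics or any result of the invention.

```python
import numpy as np
from sklearn.svm import SVC

def icp_features(hessian, residual, outlier_pct):
    """Flatten x := (H, R, O) into one feature vector (6 x 6 Hessian assumed)."""
    h = np.asarray(hessian, dtype=float).ravel()
    return np.concatenate((h, [residual, outlier_pct]))

# Synthetic stand-in data, NOT real measurements: a larger residual and more
# outliers are simply assigned a worse accuracy category.
rng = np.random.default_rng(0)
X, y = [], []
for category in range(1, 12):
    for _ in range(20):
        residual = 0.01 * category + rng.normal(0.0, 0.001)
        outliers = 0.05 * category + rng.normal(0.0, 0.005)
        hessian = np.eye(6) * (12 - category)
        X.append(icp_features(hessian, residual, outliers))
        y.append(category)
X, y = np.array(X), np.array(y)

# SVC handles the 11-class problem internally with a one-vs-one scheme.
classifier = SVC(kernel="rbf", gamma="scale")
classifier.fit(X, y)

predicted = classifier.predict(X[:1])  # accuracy category for one ICP run
```

In practice the labels y would come from comparing each relocalization pose with the ground-truth pose of the dataset, as described above, and feature scaling would normally be applied before fitting an RBF-kernel SVM.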
[0053] After the accuracy of the relocalization pose has been self-identified, high-precision relocalization poses are output directly. For low-precision or even completely incorrect poses, we search for the next-best match in the historical keyframe candidate list and repeat Phase 2 until a high-precision relocalization pose is obtained. The boundary between the high-precision and low-precision categories is set according to the application and does not require retraining; only a threshold needs to be modified. By default, categories 1 to 5 are considered high-precision.
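The retry loop of Phase 2 can be sketched as below. The function and parameter names are hypothetical; `estimate_pose` and `classify_accuracy` stand for Step 2.1 and Step 2.2 and are passed in as callables, and the stubs in the usage part are illustrative only.

```python
def relocalize(candidates, estimate_pose, classify_accuracy, high_precision_max=5):
    """Try candidates best-first until a pose is classified as high precision.

    Returns (pose, category), or (None, None) when every candidate is rejected.
    """
    for candidate in candidates:
        pose, icp_stats = estimate_pose(candidate)   # Step 2.1
        category = classify_accuracy(icp_stats)      # Step 2.2
        if category <= high_precision_max:
            return pose, category
    return None, None

# Illustrative stubs: the best-ranked keyframe yields a poor pose,
# the second one a good pose.
fake_residuals = {"kf_a": 0.20, "kf_b": 0.01, "kf_c": 0.02}

def estimate_pose_stub(keyframe):
    return "pose_from_" + keyframe, {"residual": fake_residuals[keyframe]}

def classify_stub(stats):
    return 11 if stats["residual"] > 0.05 else 2

pose, category = relocalize(["kf_a", "kf_b", "kf_c"], estimate_pose_stub, classify_stub)
```

Changing what counts as high precision only requires a different `high_precision_max`; the classifier itself is untouched, which matches the statement that no retraining is needed.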
[0054] Furthermore, the specific categories are as follows:
[0055] Category 1: Displacement error within 1 cm and rotation error within 1°;
[0056] Category 2: Displacement error within 2 cm and rotation error within 2°, and not in Category 1;
[0057] Category 3: Displacement error within 3 cm and rotation error within 3°, and not in Categories 1-2;
[0058] Category 4: Displacement error within 4 cm and rotation error within 4°, and not in Categories 1-3;
[0059] Category 5: Displacement error within 5 cm and rotation error within 5°, and not in Categories 1-4;
[0060] Category 6: Displacement error within 6 cm and rotation error within 6°, and not in Categories 1-5;
[0061] Category 7: Displacement error within 7 cm and rotation error within 7°, and not in Categories 1-6;
[0062] Category 8: Displacement error within 8 cm and rotation error within 8°, and not in Categories 1-7;
[0063] Category 9: Displacement error within 9 cm and rotation error within 9°, and not in Categories 1-8;
[0064] Category 10: Displacement error within 10 cm and rotation error within 10°, and not in Categories 1-9;
[0065] Category 11: Displacement error greater than 10 cm or rotation error greater than 10°.
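The category list above reduces to a simple rule: the category is determined by the larger of the two errors, rounded up. A minimal sketch, with a hypothetical function name and assuming that the interval boundaries are inclusive (an error of exactly 1 cm falls in Category 1):

```python
import math

def accuracy_category(displacement_error_cm, rotation_error_deg):
    """Map a pose error to one of the 11 accuracy categories."""
    worst = max(displacement_error_cm, rotation_error_deg)
    if worst > 10:
        return 11
    # Category k requires both errors to be within k (cm and degrees).
    return max(1, math.ceil(worst))

examples = [
    accuracy_category(0.5, 0.3),    # both within 1
    accuracy_category(1.5, 0.2),    # displacement pushes it to Category 2
    accuracy_category(0.5, 4.2),    # rotation pushes it to Category 5
    accuracy_category(10.0, 10.0),  # still within 10
    accuracy_category(3.0, 12.0),   # rotation above 10 degrees
]
```

This is the labeling rule that would be applied to the difference between the ground-truth pose and the relocalization pose when building the training dataset.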
[0066] Figure 3 is a schematic diagram of relocalization under a large viewpoint difference caused by translation; Figure 4 is a schematic diagram of relocalization under a large viewpoint difference caused by rotation; Figure 5 is a schematic diagram of relocalization under a large viewpoint difference caused by scale transformation; Figure 6 is a schematic diagram of relocalization under a large viewpoint difference during scale changes. In each figure, the top left corner shows the color image of the historical keyframe and the bottom left corner shows the color image of the current input frame. It can be seen that the pose of the current frame differs greatly from that of the historical keyframe under all four transformations. The right side shows the view from the relocalization result. The colored portion in the bottom left corner represents the visual information of the historical keyframe that is visible from the relocalized pose, while the grayscale image in the top right corner represents the input image information of the current frame under the relocalized pose.
[0067] The above description, in conjunction with specific / preferred embodiments, provides a further detailed explanation of the present invention. It should not be construed that the specific implementation of the present invention is limited to these descriptions. Those skilled in the art can make various substitutions or modifications to these described embodiments without departing from the inventive concept, and all such substitutions or modifications should be considered within the scope of protection of the present invention.
[0068] The parts of this invention not described in detail are well-known to those skilled in the art.
Claims
1. A high-precision self-supervised visual relocalization method based on block random fern coding, characterized in that it mainly includes the following two implementation phases:
Phase 1: Block random fern coding and matching;
Step 1.1: Construction of image blocks; the depth map is converted into a surface normal map, image pyramids are then constructed on the input color map and on the normal map respectively, and each pyramid layer is divided into regions to construct blocks;
Step 1.2: Encoding of the block random fern; block random fern coding is performed on the different layers of the color image pyramid and the normal vector image pyramid, respectively;
Step 1.3: Matching of block encoding information; based on the keyframe database collected in the known 3D scene, the block random fern coding information of the current frame is compared with the block coding information of the historical keyframes in the keyframe database, and the historical keyframe coding information and pose corresponding to the closest matching block are temporarily stored in the keyframe candidate list;
Phase 2: Precision self-supervised relocalization pose optimization;
Step 2.1: Pose estimation based on historical keyframes; from the keyframe candidate list, based on the similarity matching results, the pose with the closest similarity in the local region is selected; using this pose, the known 3D scene, and the current frame input information, the camera pose of the current frame is estimated to obtain the relocalization pose; the current frame input information includes a color map and a depth map;
Step 2.2: Self-identification of pose accuracy; to improve the accuracy of relocalization, the accuracy of each relocalization pose is identified; a high-precision pose is output directly; for a low-precision or even completely wrong pose, the next-best match is searched for in the keyframe candidate list, and the steps of Phase 2 are then repeated until a high-precision relocalization pose is obtained;
The specific method for encoding the block random fern is as follows: random ferns with the same pattern are established at different scales and in different blocks; within a single block, m pixel fern points are randomly selected, and all pixel fern points in the block are binary-encoded to obtain the coding information of the block; a single pixel fern point includes the location l of the pixel fern point within the block and a threshold τ for each image channel at that location; at pixel location l, the value of each image channel of the input image is compared with the corresponding threshold to obtain the binary code; in the same scene, the information of each pixel fern point in a block is randomly generated only once, and the same rule is used to construct pixel fern points and obtain coding information in every block; the color thresholds τ_r, τ_g, τ_b are randomly generated integers between 0 and 255, and the normal vector thresholds are float values randomly generated between -1 and 1.
2. The precision self-supervised visual relocalization method based on block random fern coding according to claim 1, characterized in that the specific method for matching block coding information is as follows: to obtain the accurate camera pose in the current 3D space, the pose information of the matched historical keyframes and the current frame input information are used to estimate the six-DOF pose of the camera in the known 3D scene; after the block random fern coding of step 1.2, the block Hamming distance is used to calculate the similarity between blocks of two frames, where the first operand denotes the coding information, i.e., the coded features, of the s1-th block of the g-th frame and, similarly, the second operand denotes the coded features of the s2-th block of the k-th frame; when the six-bit binary codes over the six channels of two pixel fern points are completely equal, the operator ≡ returns 0, otherwise the operator ≡ returns 1; the global similarity between two frames is expressed using the global Hamming distance between the two frames, where C_g denotes the global coding features of the g-th input frame and C_k denotes the global coding features of the k-th input frame; during the reconstruction phase of the current scene, the global Hamming distance between two frames is calculated to determine the global similarity between the current frame and the historical keyframes; when the similarity is less than a certain threshold, the current frame is recorded as a historical keyframe and stored in the historical keyframe database; when visual relocalization is required, the similarity between two frames is calculated from the block Hamming distance between the coded features of each block of the input frame and the coded features of all blocks of the historical keyframes; when the similarity is greater than a set threshold, the historical keyframe coding information and pose corresponding to the closest matching block are temporarily stored in the keyframe candidate list.
3. A precision self-supervised visual relocalization method based on block random fern coding according to claim 1 or 2, characterized in that the specific method for self-identification of pose accuracy is as follows: a multi-classifier Θ takes as input the intermediate variable values x generated by the ICP algorithm and classifies the accuracy interval y of the relocalization pose, i.e., y = Θ(x); the accuracy interval y is divided into 11 categories, including Category 1, which has a displacement error within 1 cm and a rotation error within 1°; Category 2, which has a displacement error within 2 cm and a rotation error within 2° and is not Category 1; and so on; Category 11 is a displacement error greater than 10 cm or a rotation error greater than 10°.
4. The precision self-supervised visual relocalization method based on block random fern coding according to claim 1, characterized in that a Support Vector Machine (SVM) is used to build the multi-classifier Θ; to train the multi-classifier, a training dataset containing the ICP intermediate variable values x and the localization accuracy interval category y is constructed on any localization or relocalization dataset that includes ground-truth pose values, where x includes the Hessian matrix H, the residual R, and the outlier percentage O.