Vision-based relocalization method and electronic device

By combining an RGB/monochrome camera, a depth camera, and external visual inertial odometry, the method addresses relocalization accuracy and persistence in AR applications and achieves efficient, robust relocalization under environmental changes. It is suitable for mobile AR devices and multi-user scenarios.

CN115516524B · Active · Publication Date: 2026-06-12 · GUANGDONG OPPO MOBILE TELECOMMUNICATIONS CORP LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
GUANGDONG OPPO MOBILE TELECOMMUNICATIONS CORP LTD
Filing Date
2021-06-03
Publication Date
2026-06-12


Abstract

A vision-based relocalization method executable by an electronic device is provided. The method includes sequence-based pose refinement to improve relocalization accuracy. The device selects a query frame sequence from an input frame sequence based on an evaluation of depth image-based single-frame relocalization associated with the input frame sequence. The input frame sequence is obtained from different viewpoints. The device refines an estimated pose associated with the query frame sequence using an external pose associated with the query frame sequence for vision-based relocalization. The external pose is obtained from an external odometry.

Description

Technical Field

[0001] This application relates to the field of augmented reality (AR) systems, and in particular to a vision-based relocalization method.

Background Technology

[0002] In augmented reality (AR) applications, vision-based relocalization is a key component supporting AR object persistence and multi-user registration. Persistence is the ability to maintain virtual objects in the same physical location and orientation as they were previously in real-world space during an AR session or across different AR sessions. For example, during the first AR session, a user uses an AR app to place a virtual sofa in a room. Some time later, the user enters another AR session using the same app, which displays the virtual sofa in the same location and orientation. The process of AR object persistence is also known as relocalization, which involves re-estimating the device's pose relative to a previously stored "map" representation. For multiple user interactions within an AR session, one user device can set a reference, or "anchor point," which can be some reference point or object in real-world space. Other user devices can relocalize themselves by matching some sensory data with the "anchor point." Relocalization can utilize different sensory data, with vision-based relocalization being the most common.

[0003] Vision-based relocalization typically uses digital images from a camera as input and calculates a six-degrees-of-freedom (6DoF) camera pose with respect to a predefined coordinate system as output. Therefore, after relocalization, the device can be tracked in the same coordinate system as the previous AR session or a different user's AR session.

Summary of the Invention

[0004] Numerous studies on vision-based relocalization have been published, many of which are implemented in conjunction with Simultaneous Localization and Mapping (SLAM) processes. These techniques have been widely developed and integrated into current AR software products (such as ARKit and ARCore) and current AR hardware products (such as AR glasses). Relocalization typically requires a sparse or dense map representation of the environment. The visual appearance of the map is then used to provide an initial pose estimate, followed by a pose refinement stage based on the application. Most methods use red-green-blue (RGB) images for relocalization.

[0005] The purpose of this application is to propose a vision-based relocalization method and an electronic device.

[0006] In a first aspect, embodiments of the present invention provide a vision-based relocalization method executable by an electronic device, comprising: selecting a query frame sequence from an input frame sequence based on an evaluation of depth image-based single-frame relocalization associated with the input frame sequence, wherein the input frame sequence is obtained from different viewpoints; and refining an estimated pose associated with the query frame sequence for vision-based relocalization using an external pose associated with the query frame sequence, wherein the external pose is obtained from an external odometry.

[0007] In a second aspect, embodiments of the present invention provide an electronic device including a camera, a depth camera, an inertial measurement unit (IMU), and a processor. The camera is configured to acquire an input frame sequence. Each input frame includes a color space image. The depth camera is configured to acquire a depth image associated with the color space image. The IMU is configured to provide an external odometry associated with the color space image. The processor is configured to perform: selecting a query frame sequence from the input frame sequence based on an evaluation of single-frame relocalization based on the depth image associated with the input frame sequence, wherein the input frame sequence is obtained from different viewpoints; and refining the estimated pose associated with the query frame sequence for vision-based relocalization using an external pose associated with the query frame sequence, wherein the external pose is obtained from the external odometry.

[0008] The disclosed methods can be implemented in a chip. The chip may include a processor configured to call and run a computer program stored in memory to cause a device on which the chip is mounted to perform the disclosed methods.

[0009] The disclosed methods can be programmed as computer-executable instructions stored in a non-transitory computer-readable medium. When loaded into a computer, the non-transitory computer-readable medium instructs the computer's processor to execute the disclosed methods.

[0010] The non-transitory computer-readable medium may include at least one selected from the group consisting of: hard disk, CD-ROM, optical storage device, magnetic storage device, read-only memory, programmable read-only memory, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and flash memory.

[0011] The disclosed methods can be programmed into a computer program product that enables a computer to execute the disclosed methods.

[0012] The disclosed methods can be programmed into computer programs that cause a computer to execute the disclosed methods.

[0013] To overcome these challenges, this invention utilizes an RGB / monochrome camera and a depth camera. Unlike other RGB and depth (RGBD) relocalization methods, this invention also uses the output of an external visual inertial odometry (VIO) available on most AR devices. The VIO output includes the device's pose. VIO is the process of determining the device's position and orientation by analyzing relevant image and inertial measurement unit (IMU) data. This invention provides mapping and relocalization enhanced with VIO, and is efficient, decoupled from the SLAM process, highly flexible in deployment, and requires no learning process. VIO uses an RGB / monochrome camera and an IMU providing external odometry. In other words, this invention ultimately uses data from an RGB / monochrome camera, an IMU, and a depth camera. By using heterogeneous sensor data as input, the proposed method can improve the accuracy of relocalization. Furthermore, this invention utilizes image sequences as input and can provide long-term persistence. For example, n frames of sensory data can be used for relocalization. If visual changes to the environment occur for a small subset of frames after the mapping process, the disclosed method can still select the unchanged frames from the n-frame sequence to perform relocalization. Compared to single-frame-based relocalization, the proposed relocalization method is sequence-based and exhibits more robust performance when visual changes persist over long periods.

Attached Figure Description

[0014] To illustrate the embodiments of this application or related technologies more clearly, the accompanying drawings are briefly described below. The drawings show only some embodiments of this application, and those skilled in the art can derive other drawings from them without creative effort.

[0015] Figure 1 shows a schematic diagram of virtual object relocalization.

[0016] Figure 2 shows a schematic diagram of a system including a mobile device performing a relocalization method according to an embodiment of this application.

[0017] Figure 3 shows a schematic diagram of three types of vision-based relocalization methods.

[0018] Figure 4 shows a schematic diagram of a mapping pipeline for a vision-based relocalization method.

[0019] Figure 5 shows a schematic diagram of a mapping pipeline for a vision-based relocalization method according to an embodiment of this application.

[0020] Figure 6 shows a schematic diagram of a relocalization pipeline for a vision-based relocalization method according to an embodiment of this application.

[0021] Figure 7 shows a block diagram of a system for wireless communication according to an embodiment of this application.

Detailed Implementation

[0022] The technical aspects, structural features, achieved objectives, and effects of the embodiments of the present invention will be described in detail below with reference to the accompanying drawings. Specifically, the terminology used in the embodiments of the present invention is only used to describe the purpose of specific embodiments and is not intended to limit the present invention.

[0023] Referring to Figure 1, for example, during a first AR session A, a user uses an AR application executed by electronic device 10 to place a virtual object 220 (e.g., an avatar) in a room with a table 221. Some time later, the user enters another AR session B using the same application, and even if the device moves to a different location, the application can still display the virtual object 220 in the same position and orientation relative to the table 221. Another user's electronic device 10c can display the virtual object 220 in the same position and orientation relative to the table 221 in AR session C.

[0024] As shown in Figure 1, vision-based relocalization can aid in persistence and multi-user registration. Recently, depth cameras have been increasingly installed on mobile devices such as smartphones and AR glasses. Depth information obtained from depth cameras adds geometric detail to the RGB appearance, which can be used to improve the accuracy and robustness of relocalization.

[0025] Referring to Figure 2, a system including mobile devices 10a and 10b, base station (BS) 200a, and network entity device 300 executes the disclosed method according to embodiments of this application. Mobile devices 10a and 10b may be mobile phones, AR glasses, or other AR processing devices. Figure 2 is shown for illustrative purposes and not for limitation, and the system may include more mobile devices, BSs, and CN entities. Connections between devices and device components are shown as lines and arrows in the figure. Mobile device 10a may include processor 11a, memory 12a, transceiver 13a, camera 14a, depth camera 15a, and inertial measurement unit (IMU) 16a. Mobile device 10b may include processor 11b, memory 12b, transceiver 13b, camera 14b, depth camera 15b, and inertial measurement unit (IMU) 16b. Each of cameras 14a and 14b acquires and generates a color space image from the scene. Each of depth cameras 15a and 15b acquires and generates a depth image from the scene. IMU 16a measures and generates an external odometry for device 10a. IMU 16b measures and generates an external odometry for device 10b. The device odometry uses data from motion sensors to estimate the change in the device's position over time. A color space image camera, such as camera 14a or 14b, is configured to acquire a sequence of input frames, each of which includes a color space image. A depth camera, such as depth camera 15a or 15b, is configured to acquire a depth image associated with the color space image in each frame. An IMU, such as IMU 16a or 16b, is configured to provide an external odometry associated with the color space image in each frame.

[0026] Base station 200a may include processor 201a, memory 202a, and transceiver 203a. Network entity device 300 may include processor 301, memory 302, and transceiver 303. Each of processors 11a, 11b, 201a, and 301 may be configured to implement the proposed functions, processes, and / or methods described herein. The radio interface protocol layer may be implemented in processors 11a, 11b, 201a, and 301. Each of memories 12a, 12b, 202a, and 302 is operable to store various programs and information to operate the connected processor. Each of transceivers 13a, 13b, 203a, and 303 is operable to be coupled to the connected processor to transmit and / or receive radio signals or wired signals. Base station 200a may be one of eNB, gNB, or other types of radio nodes and may be configured with radio resources for mobile devices 10a and 10b.

[0027] Each of processors 11a, 11b, 201a, and 301 may include an application-specific integrated circuit (ASIC), other chipsets, logic circuitry, and / or data processing devices. Each of memories 12a, 12b, 202a, and 302 may include read-only memory (ROM), random access memory (RAM), flash memory, memory cards, storage media, and / or other storage devices. Each of transceivers 13a, 13b, 203a, and 303 may include baseband circuitry and radio frequency (RF) circuitry to process radio frequency signals. When embodiments are implemented in software, the techniques described herein can be implemented using modules, processes, functions, entities, etc., that perform the functions described herein. These modules may be stored in memory and executed by the processor. Memory may be implemented within or outside the processor, wherein those may be communicatively coupled to the processor in various ways known in the art.

[0028] An example of the electronic device 10 in this description may be either of mobile devices 10a and 10b.

[0029] Referring to Figure 3, three common pipelines for vision-based relocalization implement the direct regression, matching and refinement, and matching regression methods, respectively. Image 310 is input into the pipeline. An electronic device can execute these methods to implement the pipelines.

[0030] The direct regression pipeline 320 implements the direct regression method using an end-to-end approach, which leverages a deep neural network (DNN) to directly regress the pose 350. A pose can be defined as the 6DoF translation and orientation of the user's camera with respect to a coordinate space. The 6DoF pose of a 3D object represents the object's position and orientation. Pose is defined in ARCore as:

[0031] "Pose represents an immutable rigid transformation from one coordinate space to another. As provided from all ARCore APIs, poses always describe the transformation from an object's local coordinate space to the world coordinate space... The transformation is defined using a quaternion rotation about the origin followed by a translation."

[0032] Pose from the ARCore API can be considered equivalent to an OpenGL model matrix.

[0033] The matching regression pipeline 340, which implements the matching regression method, extracts features from the image, finds matches between the extracted features and the stored map, and finally calculates the pose from the matches. The map can be a virtually reconstructed environment. The map is generated from sensors such as RGB cameras, depth cameras, or LiDAR sensors, and can be acquired locally or downloaded from a server. The matching and refinement pipeline 330, which implements the matching and refinement method, obtains sparse or dense features of the frame (block 331), directly regresses the match between the features and the map (block 332), then calculates the pose based on the match (block 333), and outputs the calculated pose (block 350).

[0034] Vision-based relocalization also requires a mapping process to generate a representation of the real-world space. The mapping method is typically designed around the specific relocalization method used. For example, the direct regression methods in Figure 3 require a DNN learning step in mapping. Matching regression methods also use a learning process in mapping, one that is not limited to DNNs. Matching and refinement mapping pipelines typically use keyframe-based methods. Common keyframe methods include the enhanced hierarchical bag-of-words library (DBoW2) and random ferns. The mapping process is shown in Figure 4. The electronic device can execute the mapping procedure. For example, when mapping begins, frame 20, having an image 21 and a pose 22, is preprocessed (block 401) to extract sparse or dense features. A keyframe check is then performed (block 402) to determine whether the current frame 20 is eligible to become a new keyframe. If the current frame 20 is eligible, frame 20 is added to the keyframe database 30 and indexed there (block 403). The keyframe database is used in the subsequent relocalization process to retrieve the keyframes most similar to the input frames. If the current frame 20 is not eligible to become a new keyframe, frame 20 is discarded (block 404).
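The keyframe check, add, and discard steps (blocks 402 to 404) can be sketched as follows. This is a minimal illustration, not the method of any cited library: the `Frame` and `KeyframeDatabase` types, the element-wise dissimilarity measure, and the `min_dissim` threshold are all assumptions made for the example, loosely in the spirit of dissimilarity-based keyframe selection.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    features: tuple  # per-frame descriptor, e.g. a short code (hypothetical layout)
    pose: tuple      # pose reported for the frame


@dataclass
class KeyframeDatabase:
    keyframes: list = field(default_factory=list)

    def min_dissimilarity(self, frame):
        """Fraction of differing descriptor elements to the closest stored keyframe."""
        if not self.keyframes:
            return 1.0  # an empty database makes any frame maximally novel
        return min(
            sum(a != b for a, b in zip(frame.features, kf.features)) / len(frame.features)
            for kf in self.keyframes
        )


def process_frame(db, frame, min_dissim=0.3):
    """Keyframe check (block 402): store the frame (403) or discard it (404)."""
    if db.min_dissimilarity(frame) >= min_dissim:
        db.keyframes.append(frame)  # add and index the new keyframe
        return True
    return False                    # too similar to an existing keyframe


db = KeyframeDatabase()
process_frame(db, Frame((0, 1, 0, 1), (0.0, 0.0, 0.0)))  # stored
process_frame(db, Frame((0, 1, 0, 1), (0.1, 0.0, 0.0)))  # discarded as redundant
```

A real pipeline would replace the linear scan with whatever index the chosen keyframe method provides.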

[0035] Despite the numerous relocalization methods developed, many present significant challenges in AR applications. The first challenge is long-term persistence, meaning virtual objects should persist indefinitely. In indoor scenes, the environment is constantly changing. For example, chairs can be moved, cups can be placed in different locations, and bed sheets can be changed periodically. Outdoor scenes are affected by lighting, occlusion, and seasonal changes. A rudimentary solution might be to continuously update the map, which is impractical in most cases. The second challenge is the limited computing power of most AR mobile devices, necessitating efficient relocalization solutions. The third challenge is that multi-user AR applications, especially in indoor scenes, require high relocalization accuracy for a good user experience.

[0036] To overcome these challenges, this invention utilizes both an RGB / monochrome camera and a depth camera. Unlike other RGB and depth (RGBD) relocalization methods, this invention also uses the output of an external visual inertial odometry (VIO) available on most AR devices. The VIO output includes the device's pose. VIO is the process of determining the device's position and orientation by analyzing relevant image and inertial measurement unit (IMU) data. This invention provides VIO-enhanced mapping and relocalization, which is efficient, decoupled from the SLAM process, highly flexible in deployment, and requires no learning process. VIO uses an RGB / monochrome camera and an IMU providing external odometry. In other words, this invention ultimately uses data from an RGB / monochrome camera, an IMU, and a depth camera. By using heterogeneous sensor data as input, this method improves the accuracy of relocalization. Furthermore, this invention utilizes image sequences as input and can provide long-term persistence. For example, n frames of sensory data can be used for relocalization. If visual changes to the environment occur for a small subset of frames after the mapping process, the disclosed method can still select the unchanged frames from the n-frame sequence to perform relocalization. Compared to single-frame-based relocalization, the proposed relocalization method is sequence-based and exhibits more robust performance in the face of long-term persistent visual changes.

[0037] This invention requires RGB / monochrome images, depth images, and external odometry data for each frame, and combines the data from the query frame sequence as input. Note that this invention provides embodiments of matching and refinement methods and does not rely on any specific keyframe selection and retrieval model. Figure 5 shows the mapping pipeline of the disclosed method. Any current RGB / monochrome keyframe selection method can be used in this invention. For example, a keyframe selection method is disclosed in an article entitled "Real-time RGB-D camera relocalization via random ferns for keyframe encoding" by Ben Glocker, Jamie Shotton, Antonio Criminisi, and Shahram Izadi in IEEE Transactions on Visualization and Computer Graphics 21, no. 5 (2014): 571-583. Another keyframe selection method is disclosed in an article entitled "DBoW2: Enhanced hierarchical bag-of-word library for C++". A keyframe is a frame that can represent important information in the mapping. As shown in Figure 4 and Figure 5, each frame is checked to see whether it qualifies as a keyframe. If a frame qualifies as a keyframe, it is stored in the keyframe database. Query frames are special keyframes in the relocalization process, and their selection criteria are completely different from those of keyframes in the mapping process.

[0038] If the current frame 20 is eligible to become a new keyframe, then frame 20 is added to the keyframe database 30 and indexed therein. In addition to the keyframe itself, a 3D point cloud 23 derived from the depth image of the keyframe is also recorded (block 403'), so each keyframe has an associated 3D point cloud 23. Point clouds can be generated from a depth camera. Therefore, a series of 3D point clouds is constructed, which can be combined into a 3D map point cloud.
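As an illustration of how a per-keyframe point cloud can be generated from a depth image, the sketch below back-projects each valid depth pixel through a pinhole model. The function name and the intrinsic parameters `fx`, `fy`, `cx`, `cy` are assumptions for the example; the source does not specify how the point cloud is computed.

```python
def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth image (rows of metric depths) to 3D camera-frame points.

    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points = []
    for v, row in enumerate(depth):      # v: pixel row
        for u, z in enumerate(row):      # u: pixel column, z: depth along the optical axis
            if z <= 0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Tiny 2x2 depth image with two valid pixels (hypothetical values).
cloud = depth_to_point_cloud([[0.0, 2.0], [1.0, 0.0]], fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

Transforming each keyframe's cloud by the keyframe pose and concatenating the results is one way the per-keyframe clouds could be combined into a map point cloud.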

[0039] The relocalization process can be performed on the same device or on different users' devices in a later AR session. For example, the vision-based relocalization method of this application is performed by device 10. The vision-based relocalization method includes selecting a query frame sequence from an input frame sequence based on an evaluation of depth image-based single-frame relocalization associated with the input frame sequence. The input frame sequence is obtained from different viewpoints. Each input frame in the input frame sequence includes a color space image associated with a depth image, and the evaluation of depth image-based single-frame relocalization includes evaluating the point cloud registration of the current frame in the input frame sequence using depth information from the depth image associated with the current frame and depth information from depth images associated with multiple keyframes in a 3D map. The multiple keyframes include the k nearest keyframes relative to the current frame, where k is a positive integer. The point cloud registration of the current frame may include an Iterative Closest Point (ICP) algorithm applied to the current frame. The device refines the estimated pose associated with the query frame sequence for vision-based relocalization using the external pose associated with the query frame sequence. The external pose is obtained from an external odometry.

[0040] Embodiments of the relocalization method of this application include a brief preprocessing step and two stages for estimating the 6DoF pose: a first stage 620 for sequence generation and a second stage 630 for pose refinement. Figure 6 shows the entire relocalization pipeline. Device 10 can implement the relocalization pipeline by executing the disclosed vision-based relocalization method.

[0041] For example, frame 20 includes a color space image 21, a depth image 23, and an odometry pose 24. The color space image may include an RGB or monochrome image obtained from a camera. The depth image 23 may be obtained from a depth camera. The odometry pose may be obtained from an external odometry. Frame 20 is processed as the current frame in the preprocessing step, the first stage for sequence generation, and the second stage for pose refinement. This invention introduces a novel pipeline that combines color space images, depth images, and external odometry to estimate relocalization. Furthermore, this invention proposes a method for generating multi-modal sequences to reduce erroneous relocalization. Additionally, a vision-based relocalization method with sequence-based pose refinement is proposed to improve relocalization accuracy.

[0042] As shown in Figure 6, device 10 acquires one or more frames when executing the disclosed relocalization method. From the one or more frames, a frame is selected as the current frame 20, which includes a color space image 21, a depth image 23, and a 6DoF pose 24 from the external odometry. All color space images, depth images, and 6DoF poses are synchronized. In the preprocessing of the current frame 20 (block 610), the color space image 21, depth image 23, and odometry pose 24 are registered to the same reference frame of the RGB / monochrome camera (e.g., one of cameras 14a and 14b shown in Figure 2) using extrinsic parameters that can be obtained through a calibration process. Here, the extrinsic parameters refer to the transformation matrix between the monochrome / RGB camera and the depth camera. For example, pinhole camera parameters are represented by a 4×3 matrix called the camera matrix. This matrix maps the 3-D world scene to the image plane. The calibration algorithm calculates the camera matrix using extrinsic and intrinsic parameters. Extrinsic parameters represent the camera's position in the 3-D scene. Intrinsic parameters represent the camera's optical center and focal length. Preprocessing the one or more frames outputs a frame sequence containing images with depth information and poses, which is passed to the first stage 620 for sequence generation.
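For reference, the pinhole projection that the camera matrix encodes can be written as below. This is the common column-vector form, in which the camera matrix K [R | t] is 3×4; the 4×3 shape mentioned in the text corresponds to the transposed, row-vector convention. The symbols s, K, R, t, f_x, f_y, c_x, and c_y are notation chosen for this illustration (zero skew is assumed) and do not appear in the source.

```latex
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
  = K \, [\, R \mid t \,]
    \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix},
\qquad
K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
```

Here (X, Y, Z) is a 3-D scene point, (u, v) is its pixel location, s is a scale factor, the extrinsic parameters R and t give the camera's rotation and translation, and the intrinsic matrix K holds the focal lengths and optical center.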

[0043] The first stage of sequence generation:

[0044] The first stage is the sequence generation stage, which is configured to select and store a sequence of distinct frames acquired from different viewpoints. Each selected frame has a high probability of yielding a correct pose estimate. Note that the frames selected from multiple input frames in this stage are different from the keyframes stored for mapping and retrieval, because the frames input to this stage were captured at different times or from different devices. The frames selected in this stage are called query frames. A query frame needs to have a different viewpoint from all other query frames in the stored sequence and have the potential to yield a correct pose estimate. As shown in Figure 6, the first stage has four main steps.

[0045] Pose check:

[0046] The first step in this stage is a pose check (block 621). This step ensures that a new query frame comes from a different viewpoint relative to the query frames previously added to the sequence. When the query frame sequence is not empty, the device compares the pose of the current frame 20 with the pose of at least one stored query frame in the sequence to determine whether the current frame represents a sufficiently different viewpoint from the stored query frames. If there are no query frames in the sequence, the pose check is skipped. The device 10 uses the pose from the external odometry associated with the current frame 20 to check whether the current frame 20 has a sufficient viewpoint difference relative to the previous query frames. The pose of the current frame 20 is compared with the last one or more query frames in the sequence. When comparing the pose of the current frame 20 with the pose of a stored query frame, if the Euclidean distance between the two poses is greater than a threshold δ_trans, or the angle difference between the two poses is greater than a threshold δ_rot, the current frame 20 is selected for further processing in the next step. If the Euclidean distance is not greater than δ_trans and the angle difference is not greater than δ_rot, device 10 determines that the current frame is not a valid query frame, and the current frame 20 is ignored (block 625).
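The pose check can be sketched as follows. This is a minimal example under stated assumptions: a pose is represented as a `(position, unit quaternion)` pair, the threshold values are placeholders, and the function and parameter names (`passes_pose_check`, `last_n`) are hypothetical.

```python
import math


def rotation_angle(q1, q2):
    """Angle in radians between two unit quaternions given as (w, x, y, z)."""
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    return 2.0 * math.acos(min(1.0, dot))  # clamp guards against rounding above 1


def passes_pose_check(current, stored_queries,
                      delta_trans=0.2, delta_rot=math.radians(15), last_n=1):
    """Return True if `current` differs enough from the last `last_n` query frames.

    `current` and each stored entry are (position, quaternion) pairs taken from
    the external odometry. An empty sequence skips the check.
    """
    if not stored_queries:
        return True
    cur_pos, cur_quat = current
    for pos, quat in stored_queries[-last_n:]:
        moved = math.dist(cur_pos, pos) > delta_trans         # translation test
        turned = rotation_angle(cur_quat, quat) > delta_rot   # rotation test
        if not (moved or turned):
            return False  # viewpoint too close to a stored query frame
    return True


identity = (1.0, 0.0, 0.0, 0.0)
stored = [((0.0, 0.0, 0.0), identity)]
passes_pose_check(((0.5, 0.0, 0.0), identity), stored)  # far enough in translation
```

Comparing against more than one recent query frame is controlled by `last_n`; the source leaves the number of compared frames open.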

[0047] Single-frame relocalization:

[0048] The second step is single-frame relocalization (block 622). Device 10 performs an evaluation of depth image-based single-frame relocalization for the current frame 20. Specifically, (1) feature extraction of the current frame 20 is performed according to the keyframe selection method used during mapping. For example, a keyframe selection method is disclosed in an article entitled “Real-time RGB-D camera relocalization via random ferns for keyframe encoding” in IEEE Transactions on Visualization and Computer Graphics 21, no. 5 (2014): 571-583 by Ben Glocker, Jamie Shotton, Antonio Criminisi, and Shahram Izadi. Another keyframe selection method is disclosed in an article entitled “DBoW2: Enhanced hierarchical bag-of-word library for C++” by D. Gálvez-López and J. D. Tardós.

[0049] Device 10 then performs a k-nearest neighbors (kNN) search in the keyframe database to find the k nearest keyframes, where k is a positive integer. The distance measure for kNN depends on how the features are defined. For example, if random ferns are used as the features of a frame, the distance is calculated as the Hamming distance between the ferns of the current frame 20 and those of one of the k nearest keyframes. Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski disclosed an ORB-based frame feature extraction method in their paper “ORB: An efficient alternative to SIFT or SURF” presented at the 2011 IEEE International Conference on Computer Vision, pp. 2564-2571. If sparse features such as ORB are used as the features of a frame, the distance can be calculated as the Hamming distance between the ORB descriptor of the current frame 20 and the ORB descriptor of one of the k nearest keyframes.
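A kNN keyframe lookup under the Hamming distance can be sketched as below. As a simplifying assumption, each frame is summarized by a single binary code stored as a Python integer, and the database is a plain list of `(keyframe_id, code, pose)` tuples; real fern or ORB encodings carry more structure than this.

```python
import heapq


def hamming(a, b):
    """Number of differing bits between two binary codes stored as integers."""
    return bin(a ^ b).count("1")


def k_nearest_keyframes(query_code, keyframes, k):
    """Return the k keyframes whose codes are closest to `query_code`.

    `keyframes` is a list of (keyframe_id, code, pose) tuples (hypothetical layout).
    Ties keep database order because heapq.nsmallest is stable.
    """
    return heapq.nsmallest(k, keyframes, key=lambda kf: hamming(query_code, kf[1]))


database = [
    ("a", 0b1111, "pose_a"),
    ("b", 0b1110, "pose_b"),
    ("c", 0b0000, "pose_c"),
]
nearest = k_nearest_keyframes(0b1100, database, k=2)
initial_poses = [pose for _, _, pose in nearest]  # k initial poses for the current frame
```

The poses stored with the retrieved keyframes are what the next step uses as initial estimates.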

[0050] (2) The k nearest keyframes provide k initial poses for the current frame. These k poses are associated with the k nearest keyframes and are pre-stored in the keyframe database during the mapping process. Device 10 then performs an Iterative Closest Point (ICP) algorithm between the 3D point cloud from the depth image associated with the current frame and the 3D point cloud associated with each nearest keyframe to refine the k poses. Thus, k refined poses associated with the k nearest keyframes are generated.

[0051] (3) Among all k refined poses, the pose with the smallest inlier RMSE (root mean square error) and the largest inlier percentage is selected as the estimated pose for the current frame 20 in the next stage. Device 10 calculates the inlier RMSE E_rmse of a specific pose among the k refined poses, for the specific keyframe associated with that pose, as:

[0052] $$E_{rmse} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \lVert p_i - q_i \rVert^{2}}$$

[0053] where N is the number of inlier point pairs and p_i represents a 3D point in the point cloud of the current frame;

[0054] q_i represents the corresponding 3D point in the point cloud of the specific keyframe; and

[0055] the operation ‖·‖ outputs the Euclidean norm of its operand.

[0056] The inlier percentage of a specific pose is the percentage of one or more inliers out of all 3D points in the current frame 20. One or more inliers are defined as points in the current frame that map to points in a specific keyframe of the 3D map during ICP. k refined poses are associated with k inlier RMSEs and k inlier percentages. Device 10 selects one of the k refined poses that has the minimum inlier root mean square error (RMSE) and the maximum inlier percentage to form the estimated pose for the current frame.
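The two ICP metrics and the pose selection can be sketched as follows. Several points here are assumptions rather than details from the source: inliers are taken to be current-frame points whose nearest keyframe point lies within `max_dist`, the current-frame points are assumed to be already transformed by the refined pose, and "minimum RMSE and maximum percentage" is resolved by ordering on RMSE first and breaking ties on the inlier fraction.

```python
import math


def inlier_metrics(frame_points, keyframe_points, max_dist):
    """Return (inlier RMSE, inlier fraction) for already-aligned point clouds."""
    squared_errors = []
    for p in frame_points:
        d = min(math.dist(p, q) for q in keyframe_points)  # nearest-neighbor distance
        if d <= max_dist:
            squared_errors.append(d * d)
    if not squared_errors:
        return math.inf, 0.0  # no correspondences at all
    rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
    return rmse, len(squared_errors) / len(frame_points)


def select_best_pose(candidates):
    """Pick from (pose, rmse, inlier_fraction) tuples: lowest RMSE, then highest fraction."""
    return min(candidates, key=lambda c: (c[1], -c[2]))


frame_cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
keyframe_cloud = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.0)]
rmse, fraction = inlier_metrics(frame_cloud, keyframe_cloud, max_dist=0.5)
```

The brute-force nearest-neighbor search is quadratic and only suitable for illustration; a spatial index would be used for real point clouds.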

[0057] ICP metric check:

[0058] The third step is the ICP metric check (block 623). In single-frame relocalization, ICP is used for point transformation. In the ICP metric check, the ICP results are used as a double check. The ICP metric is a combination of the inlier RMSE and the inlier percentage, and the check uses both to determine whether a frame can be used as a query frame. If the selected pose of the current frame has an inlier RMSE below a threshold $\delta_{rmse}$ and an inlier percentage above a threshold $\delta_{per}$, the current frame 20 becomes a query frame and is added to the query frame sequence (block 624). Otherwise, the current frame is ignored (block 625), and the process continues with the next frame.

[0059] Two key conditions can lead to a high inlier RMSE:

[0060] 1) The current frame 20 includes regions that have not been mapped by the mapping process;

[0061] 2) The current frame 20 includes a region that has already been mapped, but keyframe retrieval failed to find a good match.

[0062] In the second condition, the initial pose for ICP may be too far from the ground-truth pose. The first condition should be avoided: if the current frame contains a region that has not been mapped, relocalization is not performed at all. A current frame containing such a region is called an out-of-map frame. Unless the out-of-map frame has a similar appearance and geometry to some keyframe in the map, its inlier RMSE is likely to be high. The thresholds $\delta_{rmse}$ and $\delta_{per}$ can be set empirically, but may vary with the depth camera parameters and the mapped scene. The process of finding optimal thresholds $\delta_{rmse}$ and $\delta_{per}$ can be performed after the mapping process. Device 10 can use the keyframes in the map as input to perform single-frame relocalization, which is the process of determining the pose of a frame relative to the map. In the keyframe database, each keyframe stores a camera pose. This pose is calculated in the mapping phase and can be called the "ground-truth pose". The mapping process selects a set of keyframes and calculates the poses of the selected keyframes; in this step, these poses are treated as ground truth. Since the ground-truth pose of each keyframe is known, the outcome of relocalization can be determined. Because relocalization succeeds when the translation and rotation errors of the estimated pose are below a threshold, the selection of query frames can be viewed as a classification problem with the ICP metrics as features. The ICP metrics can include measurements related to the inlier RMSE and the inlier percentage. Machine learning, such as simple decision tree learning, can then be applied to these ICP metric parameters to avoid most negative cases.
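As a rough illustration of treating threshold selection as a classification problem, the sketch below replaces decision tree learning with a plain exhaustive search over candidate threshold pairs. That substitution is an assumption made for brevity and is not the learning method named in the text. The sample format (inlier RMSE, inlier percentage, success flag) and the function names are hypothetical. The success flag stands for "relocalizing this keyframe gave translation and rotation errors below the tolerance".

```python
def accept(rmse, pct, d_rmse, d_per):
    """ICP metric check: RMSE below d_rmse and inlier percentage above d_per."""
    return rmse < d_rmse and pct > d_per


def fit_thresholds(samples):
    """Search (d_rmse, d_per) on labelled relocalization outcomes.

    samples: list of (rmse, pct, success) tuples, one per keyframe that was
             relocalized against the map after mapping (assumed format).
    The search first minimizes accepted failures (negative cases),
    then rejected successes.
    """
    rmse_candidates = sorted({s[0] for s in samples}) + [float("inf")]
    per_candidates = [float("-inf")] + sorted({s[1] for s in samples})
    best = None
    for d_rmse in rmse_candidates:
        for d_per in per_candidates:
            false_accepts = sum(
                1 for r, p, ok in samples
                if accept(r, p, d_rmse, d_per) and not ok)
            misses = sum(
                1 for r, p, ok in samples
                if not accept(r, p, d_rmse, d_per) and ok)
            score = (false_accepts, misses)
            if best is None or score < best[0]:
                best = (score, d_rmse, d_per)
    return best[1], best[2]
```

The search is quadratic in the number of distinct metric values, which is acceptable for the offline, post-mapping use described above. A decision tree would learn comparable axis-aligned splits.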

[0063] When the inlier RMSE of the selected refined pose of the current frame 20 is lower than the RMSE threshold $\delta_{rmse}$, and the inlier percentage of the selected refined pose of the current frame 20 is higher than the percentage threshold $\delta_{per}$, device 10 selects the current frame 20 as a query frame and adds it to the query frame sequence. As a result, the estimated pose of the selected current frame 20 is obtained as one of the estimated poses of the query frames. When a query frame is added to the sequence, device 10 also stores the corresponding point cloud from the depth image associated with the query frame. To improve efficiency, the point cloud may be downsampled. Device 10 can use the point cloud for pose refinement. This process can be repeated for each input frame to generate multiple query frames and their estimated poses.
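The gating and storage step can be sketched as follows. The voxel-grid downsampling is an assumption, since the text only says the point cloud may be downsampled. The dictionary layout, the default voxel size, and the function names are likewise hypothetical.

```python
def voxel_downsample(points, voxel):
    """Keep the first point seen in each cubic voxel of edge length `voxel`."""
    kept = {}
    for p in points:
        key = tuple(int(c // voxel) for c in p)
        kept.setdefault(key, p)
    return list(kept.values())


def maybe_add_query_frame(sequence, frame_id, pose, rmse, pct, cloud,
                          d_rmse, d_per, voxel=0.05):
    """Append the frame to `sequence` if it passes the ICP metric check.

    Returns True when the frame was added (block 624 in the text) and
    False when it was ignored (block 625).
    """
    if rmse < d_rmse and pct > d_per:
        sequence.append({
            "frame": frame_id,
            "pose": pose,  # estimated pose from single-frame relocalization
            "cloud": voxel_downsample(cloud, voxel),
        })
        return True
    return False
```

Storing the downsampled cloud with each query frame keeps the later sequence-level refinement independent of the original depth images.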

[0064] The second stage of pose refinement:

[0065] The pose refinement stage uses a refined subset of frames from the query frame sequence to refine the estimated pose of the query frames (block 631). This stage starts when the number of query frames exceeds a threshold $N_{seq}$. Although all query frames satisfy the ICP metric check in the first stage, not all of them are used for the final pose refinement, because of errors in pose estimation or ICP. For example, since a tabletop in a room is a plane similar to the ground plane, the point cloud of the tabletop may be matched to the point cloud of the ground. The goal of the second stage is to select a sufficient number of inlier frames from the query frames. Note that inliers here refer to frames, not to points as in ICP. A method similar to Random Sample Consensus (RANSAC) can be used to select the inliers. The algorithm for the second stage is shown in Table 1:

[0066] Table 1

[0067]

[0068] In this pose refinement process, the input to the second stage is all query frames in the sequence, together with their external poses from the odometry and their estimated poses relative to the map from the sequence generation stage. The external poses are generated by the external odometry. The estimated poses are generated during the relocalization process. As shown in lines 1 and 2 of Algorithm 1, device 10 uses the estimated poses of the query frames to transform all point clouds from all query frames into the reference coordinate frame of the 3D map. Any map has an origin and orientations for the x, y, and z axes. The coordinate system of the map is called the reference coordinate frame. The reference coordinate frame is not a frame in the sequence.

[0069] As shown in line 3 of Algorithm 1, device 10 calculates the Euclidean RMSE between each transformed point cloud of the query frames and the points of the reference coordinate frame in the 3D map. As shown in line 4 of Algorithm 1, device 10 evaluates the calculated Euclidean RMSE associated with the query frames to generate multiple inlier frames: when the calculated Euclidean RMSE of frame i is less than a threshold $\delta_{rmse}$, frame i in the query frame sequence is identified as an inlier frame. Device 10 combines the point clouds from all inlier frames and refines the estimated pose of the inlier frames using ICP to generate a refined estimated pose. Device 10 can use the refined estimated pose to improve vision-based relocalization. For example, device 10 can use the refined estimated pose to better relocalize AR sessions, AR content, or virtual objects. After relocalization, the virtual object can be placed into the scene.

[0070] Referring to Figure 6, in the second stage device 10 selects, from all estimated poses, a frame i whose estimated pose is sufficiently good. Therefore, for each estimated pose, device 10 uses that estimated pose, as shown in line 2 of Algorithm 1, to transform all point clouds in all query frames into the reference coordinate frame of the map. Frame i has an estimated pose and an external pose, and through a transformation the point cloud $PC_j$ of the j-th frame in the query frame sequence is transformed into the reference coordinate frame, where $PC_j$ (j = 1..n) covers all frames in the sequence. Essentially, the algorithm iterates over the frames (i = 0:n, as shown in line 1), uses frame i as the reference in each iteration, and transforms all frames in the sequence using the transformation in line 2.

[0071] Then, using the pose of frame i, device 10 calculates the Euclidean RMSE between points in the point cloud $PC_{seq}$ of all frames in the sequence and points in the point cloud $PC_{map}$ of the map. If the Euclidean RMSE is less than the threshold $\delta_{rmse}$, frame i is considered an inlier. When the number of inliers is large enough, for example greater than n / 2, all inlier frames are saved as elements of a refined subset. In one embodiment, once such inlier frames are found, device 10 returns the inlier frames and the transformation applied to them as the output of the second stage. Each inlier frame in the second-stage output is associated with the estimated pose, the external pose, and the transformation for all j in the sequence (1..n). Variable i is the index of the frame selected for pose initialization, and variable j is a frame index from 1 to n. The transformation involves an inverse rotation. This early-return strategy reduces the computational cost of Algorithm 1. In an alternative embodiment, after all query frames with estimated poses have been evaluated, the frame with the largest number of inliers is selected and saved as an element of the refined subset. Ties are broken in favor of the smaller RMSE. In other words, if two frames have the same number of inliers, embodiments of the disclosed method prefer the frame with the smaller RMSE when selecting frames for the refined subset. Device 10 combines the point clouds from all inlier frames, refines the estimated pose using ICP, and outputs the refined estimated pose $P_{final}$ as part of the output of the second stage.
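The second stage can be sketched in Python under several loudly stated assumptions, because Algorithm 1 and its transformation are not reproduced in the text. The sketch assumes that each estimated pose maps camera coordinates to map coordinates and each external pose maps camera coordinates to odometry coordinates. It also assumes that the point cloud of frame j is moved into the map by composing the estimated pose of frame i, the inverse external pose of frame i, and the external pose of frame j. It tests each transformed frame against the map with a brute-force nearest-neighbor RMSE, which is one reading of the inlier test. The final ICP refinement over the combined cloud is omitted, and all names are hypothetical.

```python
from math import sqrt


def compose(a, b):
    """Rigid transform that applies b first, then a. Each is (R, t)."""
    Ra, ta = a
    Rb, tb = b
    R = [[sum(Ra[i][k] * Rb[k][j] for k in range(3)) for j in range(3)]
         for i in range(3)]
    t = [sum(Ra[i][k] * tb[k] for k in range(3)) + ta[i] for i in range(3)]
    return R, t


def inverse(a):
    """Inverse of a rigid transform: (R^T, -R^T t)."""
    R, t = a
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]
    return Rt, [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]


def apply(a, p):
    """Apply rigid transform a = (R, t) to a 3D point p."""
    R, t = a
    return tuple(sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3))


def rmse_to_map(cloud, map_cloud):
    """RMSE of brute-force nearest-neighbor distances from cloud to map_cloud."""
    sq = 0.0
    for p in cloud:
        sq += min(sum((a - b) ** 2 for a, b in zip(p, q)) for q in map_cloud)
    return sqrt(sq / len(cloud))


def select_inlier_frames(frames, map_cloud, d_rmse):
    """Early-return inlier frame selection (assumed reading of Algorithm 1).

    frames: list of dicts with 'est' (camera to map), 'ext' (camera to
            odometry) and 'cloud' (points in camera coordinates).
    Returns (i, inlier_indices) for the first reference frame i that yields
    more than n / 2 inlier frames, or None when no frame qualifies.
    """
    n = len(frames)
    for i, ref in enumerate(frames):
        # Odometry-to-map alignment implied by frame i's two poses.
        odom_to_map = compose(ref["est"], inverse(ref["ext"]))
        inliers = []
        for j, frame in enumerate(frames):
            T = compose(odom_to_map, frame["ext"])
            moved = [apply(T, p) for p in frame["cloud"]]
            if rmse_to_map(moved, map_cloud) < d_rmse:
                inliers.append(j)
        if len(inliers) > n / 2:
            return i, inliers
    return None
```

In this reading a frame with a poor estimated pose can still be an inlier under another frame's hypothesis, since only its external pose and point cloud are used once the reference frame fixes the odometry-to-map alignment.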

[0072] Device 10 determines whether pose refinement was successful (block 632). When pose refinement succeeds with an estimated pose having the minimum average RMSE, that estimated pose and the inliers associated with it are stored as the refined estimated pose $P_{final}$ (block 634). If, after processing all frames, device 10 cannot find an estimated pose with enough inliers, device 10 removes the outlier frame whose estimated pose has the minimum mean RMSE and repeats the first and second stages for the other input frames. Device 10 processes each new frame as the current frame in the first and second stages until the refined subset has enough frames.

[0073] Outlier removal occurs when, after processing a sequence of N frames, the second stage cannot obtain frames that meet the criteria. Outlier removal slightly shortens the N-frame sequence, and the second stage then waits until the sequence has N frames again.

[0074] The proposed method utilizes RGB / monochrome images, depth images, and an external odometry as input to achieve vision-based relocalization. This method employs a conventional pipeline, offering fast computation speed and suitability for mobile AR devices. Sequence-based relocalization achieves higher accuracy than single-frame methods. Because it uses sequences rather than single frames as input, the method is also robust to visual changes in the environment.

[0075] Figure 7 is a block diagram of an example system 700 for the vision-based relocalization method disclosed in an embodiment of this application. The embodiments described herein can be implemented in the system using any appropriately configured hardware and / or software. The system 700 shown in Figure 7 includes a radio frequency (RF) circuit 710, a baseband circuit 720, a processing unit 730, a memory 740, a display 750, a camera module 760, a sensor 770, and an input / output (I / O) interface 780, which are coupled to each other as shown.

[0076] Processing unit 730 may include circuitry, such as, but not limited to, one or more single-core or multi-core processors. The processor may include any combination of general-purpose processors and special-purpose processors (e.g., graphics processors and application processors). The processor may be coupled to memory and configured to execute instructions stored in memory to enable various applications and / or operating systems running on the system.

[0077] Baseband circuitry 720 may include circuitry, such as, but not limited to, one or more single-core or multi-core processors. The processor may include a baseband processor. The baseband circuitry may handle various radio control functions enabling communication with one or more radio networks via RF circuitry. Radio control functions may include, but are not limited to, signal modulation, encoding, decoding, radio frequency shifting, etc. In some embodiments, the baseband circuitry may provide communication compatible with one or more radio technologies. For example, in some embodiments, the baseband circuitry may support communication with 5G NR, LTE, Evolved Universal Terrestrial Radio Access Network (EUTRAN) and / or other Wireless Metropolitan Area Networks (WMAN), Wireless Local Area Networks (WLAN), and Wireless Personal Area Networks (WPAN). Embodiments of the baseband circuitry configured to support radio communication with more than one radio protocol may be referred to as multimode baseband circuitry. In various embodiments, baseband circuitry 720 may include circuitry operating with signals not strictly considered to be at baseband frequencies. For example, in some embodiments, the baseband circuitry may include circuitry operating with signals having an intermediate frequency (IF) between the baseband frequency and the radio frequency.

[0078] RF circuit 710 can communicate with a wireless network using modulated electromagnetic radiation through a non-solid medium. In various embodiments, the RF circuit may include switches, filters, amplifiers, etc., to facilitate communication with the wireless network. In various embodiments, RF circuit 710 may include circuitry that operates with signals not strictly considered to be in the radio frequency range. For example, in some embodiments, the RF circuit may include circuitry that operates with a signal having an intermediate frequency (IF), which is between the baseband frequency and the radio frequency.

[0079] In various embodiments, the transmitter circuitry, control circuitry, or receiver circuitry discussed above with respect to the UE, eNB, or gNB may be wholly or partially embodied in the RF circuitry, baseband circuitry, and / or processing unit. As used herein, “circuit” may refer to, belong to, or include application-specific integrated circuits (ASICs), electronic circuitry, processors (shared, dedicated, or grouped) and / or memories (shared, dedicated, or grouped) executing one or more software or firmware programs, combinational logic circuitry, and / or other suitable hardware components providing the aforementioned functionality. In some embodiments, electronic device circuitry may be implemented in one or more software or firmware modules, or the functionality associated with the circuitry may be implemented by one or more software or firmware modules. In some embodiments, some or all of the components of the baseband circuitry, processing unit, and / or memory may be implemented together on a system-on-a-chip (SOC).

[0080] Memory 740 can be used to load and store data and / or instructions, for example, for the system. Memory used in one embodiment may include any combination of suitable volatile memory (e.g., dynamic random access memory (DRAM)) and / or non-volatile memory (e.g., flash memory). In various embodiments, I / O interface 780 may include one or more user interfaces designed to enable a user to interact with the system and / or peripheral component interfaces designed to enable peripheral components to interact with the system. User interfaces may include, but are not limited to, physical keyboards or keypads, touchpads, speakers, microphones, etc. Peripheral component interfaces may include, but are not limited to, non-volatile memory ports, universal serial bus (USB) ports, audio jacks, and power interfaces.

[0081] Camera module 760 may include a color space image camera and a depth camera, such as depth camera 15a or 15b. The color space image camera is configured to acquire a sequence of input frames, where each input frame includes a color space image. The depth camera is configured to acquire a depth image associated with the color space image in each frame.

[0082] Sensor 770 is configured to provide an external odometry associated with a color space image in each frame. In various embodiments, sensor 770 may include one or more sensing devices to determine environmental conditions and / or location information relevant to the system. In some embodiments, the sensor may include, but is not limited to, an IMU, a gyroscope sensor, an accelerometer, a proximity sensor, an ambient light sensor, and a positioning unit. The positioning unit may also be part of or interact with baseband circuitry and / or RF circuitry to communicate with components of a positioning network, such as Global Positioning System (GPS) satellites. In various embodiments, display 750 may include a display, such as a liquid crystal display (LCD) and a touchscreen display. In various embodiments, system 700 may be a mobile computing device, such as, but not limited to, a laptop, tablet, netbook, ultrabook, smartphone, etc. In various embodiments, the system may have more or fewer components and / or different architectures. Where appropriate, the methods described herein can be implemented as a computer program. The computer program may be stored on a storage medium, such as a non-transitory storage medium.

[0083] The embodiments of this application are combinations of technologies / processes that can be used to create the final product.

[0084] Those skilled in the art will understand that the various units, algorithms, and steps described and disclosed in the embodiments of this invention are implemented through electronic hardware or a combination of computer software and electronic hardware. Whether a function runs in hardware or software depends on the application conditions and the design requirements of the technical solution. Those skilled in the art can use different methods to implement the functions for each specific application, but such implementations should not exceed the scope of this application. Those skilled in the art will understand that the working process of the above-described systems, devices, and units can refer to the working process of the systems, devices, and units in the above embodiments, and the units are basically the same. For ease of description and simplicity, these working processes will not be described in detail.

[0085] It is understood that the systems, devices, and methods disclosed in the embodiments of the present invention can be implemented in other ways. The above embodiments are merely exemplary. The division of units is based solely on logical function; other divisions exist in the implementation. Multiple units or components can be combined or integrated into another system. It is also possible to omit or skip certain features. On the other hand, the mutual coupling, direct coupling, or communication coupling shown or discussed operates through some ports, devices, or units, whether indirectly or through electrical, mechanical, or other forms of communication.

[0086] The units described as separate components may or may not be physically separate. The units shown as units may or may not be physical units, i.e., they may be located in one place or distributed across multiple network units. Some or all units may be used depending on the purpose of the embodiment. Furthermore, the various functional units in different embodiments may be integrated into a single processing unit, or they may be physically independent, or two or more units may be integrated into a single processing unit.

[0087] If software functional units are implemented, used, and sold as a product, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solutions proposed in this invention can be implemented substantially or partially in the form of a software product. Alternatively, the portion of the technical solution that contributes over the prior art can be implemented as a software product. The software product is stored in a storage medium and includes multiple commands for a computing device (e.g., a personal computer, server, or network device) to execute all or part of the steps disclosed in the embodiments of this invention. The storage medium includes a USB flash drive, a portable hard drive, a read-only memory (ROM), a random access memory (RAM), a floppy disk, or other media capable of storing program code.

[0088] The proposed solution employs a matching and refinement pipeline and includes a two-stage process to refine the pose. The first stage selects query frames into the sequence. The second stage selects inlier frames from the sequence. Finally, the inlier frames are used to refine the pose. The disclosed method achieves high relocalization accuracy while maintaining efficiency with low computational resources. Due to the inlier selection within the sequence, the invention avoids the drawbacks of keyframe-based methods, including poor initialization and poor ICP due to insufficient geometric detail. Furthermore, the sequence employs inlier frames with good geometric fit. When the sequence is long enough to cover static portions of the scene without visual changes, the disclosed method can handle scenes with visual changes.

[0089] Although this application has been described in conjunction with what are considered to be the most practical and preferred embodiments, it should be understood that this application is not limited to the disclosed embodiments, but is intended to cover various arrangements made without departing from the broadest interpretation of the appended claims.

Claims

1. A vision-based relocalization method executable by an electronic device, characterized in that the method includes:
selecting a query frame sequence from an input frame sequence based on an evaluation of depth image-based single-frame relocalization associated with the input frame sequence, wherein the input frame sequence consists of different frames obtained from different viewpoints, each input frame in the input frame sequence includes an RGB image associated with a depth image, and the evaluation of depth image-based single-frame relocalization includes: evaluating the point cloud registration of the current frame in the input frame sequence using depth information from the depth image associated with the current frame and depth information from depth images associated with a plurality of keyframes in a 3D map; and
refining, using the external pose associated with the query frame sequence, the estimated pose associated with the query frame sequence for vision-based relocalization, wherein the external pose is obtained from external odometry;
wherein the plurality of keyframes includes k nearest keyframes relative to the current frame, where k is a positive integer, and the point cloud registration of the current frame includes an Iterative Closest Point (ICP) algorithm applied to the current frame; and
the method further includes:
providing the current frame with k poses associated with the k nearest keyframes;
performing the ICP algorithm between the 3D point cloud of the depth image associated with the current frame and the 3D point cloud associated with each of the k nearest keyframes to refine the k poses associated with the k nearest keyframes; and
selecting one of the k refined poses that has the minimum inlier root mean square error (RMSE) and the maximum inlier percentage to form the estimated pose of the current frame.

2. The vision-based relocalization method as described in claim 1, characterized in that the method further includes: when the query frame sequence has another query frame other than the current frame, comparing the pose of the current frame with the pose of at least one stored query frame in the query frame sequence to determine whether the current frame represents a completely different viewpoint from the stored query frame.

3. The vision-based relocalization method as described in claim 2, characterized in that the method further includes: when the Euclidean distance between the pose of the current frame and the pose of the at least one stored query frame is greater than a threshold, determining that the current frame represents a completely different viewpoint from the stored query frame; and performing the evaluation of depth image-based single-frame relocalization on the current frame, wherein the current frame represents a viewpoint completely different from the stored query frame.

4. The vision-based relocalization method as described in claim 1, characterized in that, for a specific pose among the k poses, the inlier RMSE of the specific pose is calculated with respect to the specific keyframe, among the k keyframes, that is associated with the specific pose; the inlier percentage of the specific pose is the percentage of one or more inliers among all 3D points in the current frame; the one or more inliers are defined as points in the current frame that are mapped to points in the specific keyframe in the 3D map during the ICP; and the k refined poses are associated with k inlier RMSEs and k inlier percentages.

5. The vision-based relocalization method as described in claim 4, characterized in that the method further includes: when the inlier RMSE of the selected refined pose of the current frame is lower than an RMSE threshold, and the inlier percentage of the selected refined pose of the current frame is higher than a specific percentage threshold, selecting the current frame as a query frame and adding it to the query frame sequence, wherein the estimated pose of the selected current frame is obtained as one of the estimated poses of the query frames.

6. The vision-based relocalization method as described in claim 5, characterized in that the method further includes storing the depth image associated with the current frame added to the query frame sequence.

7. The vision-based relocalization method as described in claim 5, characterized in that the method further includes: using the estimated poses of the query frames to transform all point clouds of all the query frames into the reference coordinate frame of the 3D map; calculating the Euclidean RMSE between each transformed point cloud of the query frames and the points of the reference coordinate frame in the 3D map; evaluating the calculated Euclidean RMSE of the query frames to generate multiple inlier frames, wherein, when the calculated Euclidean RMSE of the i-th frame is less than a threshold, the i-th frame in the query frame sequence is determined to be an inlier frame; and combining the point clouds from all inlier frames and refining the estimated poses of the inlier frames using ICP to generate a refined estimated pose.

8. The vision-based relocalization method as described in claim 7, characterized in that the i-th frame has an estimated pose and an external pose, and, through a transformation, the point cloud of the j-th frame in the query frame sequence is transformed into the reference coordinate frame.

9. An electronic device, characterized in that it includes:
a camera configured to acquire a sequence of input frames, where each input frame includes an RGB image;
a depth camera configured to acquire a depth image associated with the RGB image;
an inertial measurement unit configured to provide an external odometry associated with the RGB image; and
a processor configured to execute:
selecting a query frame sequence from the input frame sequence based on an evaluation of depth image-based single-frame relocalization associated with the input frame sequence, wherein the input frame sequence consists of different frames obtained from different viewpoints, each input frame in the input frame sequence includes an RGB image associated with a depth image, and the evaluation of depth image-based single-frame relocalization includes: determining the point cloud registration of the current frame in the input frame sequence using depth information of the depth image associated with the current frame and depth information of the depth images associated with a plurality of keyframes in a three-dimensional (3D) map; and
refining, using the external pose associated with the query frame sequence, the estimated pose associated with the query frame sequence for vision-based relocalization, wherein the external pose is obtained from the external odometry;
wherein the plurality of keyframes includes k nearest keyframes relative to the current frame, where k is a positive integer, the point cloud registration of the current frame includes an Iterative Closest Point (ICP) algorithm applied to the current frame, and the processor is further configured to execute:
providing the current frame with k poses associated with the k nearest keyframes;
performing the ICP algorithm between the 3D point cloud of the depth image associated with the current frame and the 3D point cloud associated with each of the k nearest keyframes to refine the k poses associated with the k nearest keyframes; and
selecting one of the k refined poses that has the minimum inlier root mean square error (RMSE) and the maximum inlier percentage to form the estimated pose of the current frame.

10. The electronic device as claimed in claim 9, characterized in that the processor is also configured to execute: when the query frame sequence has another query frame other than the current frame, comparing the pose of the current frame with the pose of at least one stored query frame in the query frame sequence to determine whether the current frame represents a completely different viewpoint from the stored query frame.

11. The electronic device as claimed in claim 10, characterized in that the processor is also configured to execute: when the Euclidean distance between the pose of the current frame and the pose of the at least one stored query frame is greater than a threshold, determining that the current frame represents a completely different viewpoint from the stored query frame; and performing the evaluation of depth image-based single-frame relocalization on the current frame, wherein the current frame represents a viewpoint completely different from the stored query frame.

12. The electronic device as claimed in claim 11, characterized in that, for a specific pose among the k poses, the inlier RMSE of the specific pose is calculated with respect to the specific keyframe, among the k keyframes, that is associated with the specific pose; the inlier percentage of the specific pose is the percentage of one or more inliers among all 3D points in the current frame; the one or more inliers are defined as points in the current frame that are mapped to points in the specific keyframe in the 3D map during the ICP; and the k refined poses are associated with k inlier RMSEs and k inlier percentages.

13. The electronic device as claimed in claim 12, characterized in that the processor is also configured to execute: when the inlier RMSE of the selected refined pose of the current frame is lower than an RMSE threshold, and the inlier percentage of the selected refined pose of the current frame is higher than a specific percentage threshold, selecting the current frame as a query frame and adding it to the query frame sequence, wherein the estimated pose of the selected current frame is obtained as one of the estimated poses of the query frames.

14. The electronic device as claimed in claim 13, characterized in that the processor is also configured to execute: storing the depth image associated with the current frame added to the query frame sequence.

15. The electronic device as claimed in claim 13, characterized in that the processor is also configured to execute: using the estimated poses of the query frames to transform all point clouds of all the query frames into the reference coordinate frame of the 3D map; calculating the Euclidean RMSE between each transformed point cloud of the query frames and the points of the reference coordinate frame in the 3D map; evaluating the calculated Euclidean RMSE of the query frames to generate multiple inlier frames, wherein, when the calculated Euclidean RMSE of the i-th frame is less than a threshold, the i-th frame in the query frame sequence is determined to be an inlier frame; and combining the point clouds from all inlier frames and refining the estimated poses of the inlier frames using ICP to generate a refined estimated pose.

16. The electronic device as claimed in claim 15, characterized in that the i-th frame has an estimated pose and an external pose, and, through a transformation, the point cloud of the j-th frame in the query frame sequence is transformed into the reference coordinate frame.

17. A chip, characterized in that it includes: a processor configured to invoke and run a computer program stored in memory, causing a device on which the chip is mounted to perform the method as described in any one of claims 1 to 8.

18. A computer-readable storage medium storing a computer program, characterized in that the computer program causes a computer to perform the method as described in any one of claims 1 to 8.

19. A computer program product comprising a computer program, characterized in that the computer program causes a computer to perform the method as described in any one of claims 1 to 8.
