Instant positioning and mapping method and device in dynamic environment and electronic equipment

By combining deep learning models and dense optical flow algorithms, a mask map is generated and the dynamic probability of map points is updated, which solves the problem of distinguishing between static and dynamic instances in dynamic environments and improves the accuracy and robustness of real-time localization and mapping.

CN116412809B · Active · Publication Date: 2026-06-12 · BEIJING UNIV OF POSTS & TELECOMM

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: BEIJING UNIV OF POSTS & TELECOMM
Filing Date: 2023-02-22
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

In dynamic environments, existing technologies struggle to effectively distinguish between static and dynamic instances, resulting in poor accuracy and robustness in real-time localization and mapping.

Method used

By combining deep learning models and dense optical flow algorithms, a first mask image and a second mask image are generated, the dynamic probability of map points is updated, reprojection residuals and feature relative position residuals are constructed, camera pose is calculated, and dynamic and static features are distinguished.

Benefits of technology

It improves the accuracy and robustness of real-time localization and mapping, reduces the number of missed and false detections, and improves the accuracy of camera pose solving.


Abstract

The application provides a real-time positioning and mapping method, device and electronic equipment in a dynamic environment. The method comprises: obtaining an environment image and a preset map, and obtaining a key frame according to the environment image and the preset map; according to the key frame, a first mask map is calculated through a trained deep learning model; according to the first mask map, a second mask map is calculated through a dense optical flow algorithm; the dynamic probability of a map point of the preset map is updated according to the first mask map and the second mask map; a re-projection residual and a feature relative position residual are constructed according to the dynamic probability of the map point, and a camera pose is calculated according to the re-projection residual and the feature relative position residual; and a target map is constructed according to the camera pose. Through the method, device and electronic equipment provided by the application, the accuracy and robustness of real-time positioning and mapping can be improved.

Description

Technical Field

[0001] This application relates to the field of computer vision technology, and in particular to a method, apparatus and electronic device for real-time localization and mapping in dynamic environments.

Background Technology

[0002] Simultaneous Localization and Mapping (SLAM) is a technique that builds a map based on the surrounding environment while moving. Using SLAM, a robot can start from an unknown location in an unknown environment, perform self-localization based on its position and a pre-set map during movement, and simultaneously build an incremental map based on its self-localization, thus achieving autonomous localization and navigation.

[0003] In the process of self-localization based on the preset map, the robot usually first uses the preset camera to acquire environmental images, and then performs feature matching on the corresponding feature points in the environmental images and the preset map based on static instances to perform self-localization or build an incremental map.

[0004] In related technologies, prior semantic information of the environmental images acquired by the camera is mainly obtained through deep learning models, and then dynamic instances are removed. However, the technical solutions in these technologies for identifying static and dynamic instances are prone to missed detections, resulting in poor accuracy and robustness of localization and mapping in dynamic environments.

Summary of the Invention

[0005] In view of this, the purpose of this application is to propose a method, apparatus and electronic device for real-time localization and mapping in dynamic environments.

[0006] To achieve the above objectives, this application provides a real-time localization and mapping method in a dynamic environment, comprising:

[0007] Acquire environmental images and a preset map, and obtain keyframes based on the environmental images and the preset map;

[0008] Based on the keyframes, the first mask image is calculated using a trained deep learning model;

[0009] Based on the first mask image, the second mask image is calculated using the dense optical flow algorithm;

[0010] Update the dynamic probability of map points in the preset map based on the first mask image and the second mask image;

[0011] The reprojection residual and the feature relative position residual are constructed based on the dynamic probability of the map points, and the camera pose is calculated based on the reprojection residual and the feature relative position residual.

[0012] Construct a target map based on the camera pose.

[0013] Optionally, the step of calculating the first mask image based on the keyframe using a trained deep learning model includes:

[0014] Based on the keyframes, instance segmentation is performed using a trained deep learning model to obtain instance segmentation results;

[0015] A sub-mask image is generated based on the instance segmentation result; each sub-mask image contains at least one instance;

[0016] Obtain the depth value corresponding to each instance in each of the sub-mask images;

[0017] The first mask image is obtained by sequentially overlaying all of the sub-mask images in descending order of depth values.

[0018] Optionally, the step of calculating the second mask image based on the first mask image using a dense optical flow algorithm includes:

[0019] Obtain the first mask image from the previous moment;

[0020] Based on the first mask image of the previous moment and the current moment, the offset value of each pixel coordinate is calculated by a preset dense optical flow algorithm.

[0021] Based on the offset value and the first mask image at the previous moment, the coordinate values of each pixel in the second mask image are calculated using the following formula:

[0022] $u_t = u_{t-1} + \Delta u, \quad v_t = v_{t-1} + \Delta v$

[0023] where $u_t$ is the horizontal coordinate of any pixel in the second mask image, $v_t$ is the vertical coordinate of that pixel, $u_{t-1}$ is the horizontal coordinate of the corresponding pixel at the previous moment, $v_{t-1}$ is the vertical coordinate of the corresponding pixel at the previous moment, $\Delta u$ is the horizontal coordinate offset, and $\Delta v$ is the vertical coordinate offset.

[0024] The second mask image is obtained based on the coordinate values of each pixel in the second mask image.

[0025] Optionally, the step of acquiring the environmental image and the preset map, and obtaining keyframes based on the environmental image and the preset map, includes:

[0026] Acquire environmental images;

[0027] Based on the environmental image, image feature points are obtained using a preset feature extraction method;

[0028] The image feature points are matched with the map points to obtain an image frame containing the feature matching relationship between the image feature points and the map points;

[0029] In response to determining that the image frame satisfies a first preset condition, the image frame is designated as a key frame; the key frame includes at least one instance; any instance includes at least one key frame feature point of the key frame.

[0030] Optionally, before updating the dynamic probabilities of map points in the preset map, the method further includes:

[0031] The intersection over union (IoU) of all feature points in the first mask image and all feature points in the second mask image is calculated using an IoU function.

[0032] In response to the IoU of any of the feature points being greater than or equal to a preset threshold, the mask value of the feature point in the first mask image is used as the target mask value of the feature point.

[0033] In response to the IoU of any of the feature points being less than the preset threshold, and the feature point belonging to the first mask image, the mask value of the feature point in the first mask image is used as the target mask value of the feature point.

[0034] In response to the IoU of any of the feature points being less than the preset threshold, and the feature point belonging to the second mask image, the mask value of the feature point in the second mask image is used as the target mask value of the feature point.

[0035] All of the aforementioned feature points are used as target feature points, and a target mask map is generated based on all of the target feature points and the target mask values corresponding to the target feature points; the target feature points have a corresponding relationship with the map points.

[0036] Optionally, updating the dynamic probability of map points in the preset map includes:

[0037] Determine the map point corresponding to each of the target feature points;

[0038] Obtain the prior dynamic probability of this map point;

[0039] The dynamic probability of this map point is updated using the following formula;

[0040] $p(m_t) = \eta \, p(z_t \mid m_t) \, p(m_t \mid z_{t-1}, m_0)$;

[0041] where $\eta$ is the preset normalization coefficient, $m_t$ is the state of the map point at time t, $z_t$ is the observed state of the map point at time t, $z_{t-1}$ is the observed state of the map point at time t-1, and $m_0$ is the initial state of the map point.

[0042] Optionally, the step of constructing reprojection residuals and feature relative position residuals based on the dynamic probabilities of the map points, and calculating camera pose based on the reprojection residuals and feature relative position residuals, includes:

[0043] Based on the dynamic probabilities of all the map points, static instances are selected from the instances;

[0044] Based on the preset map, static map points are obtained from the map points;

[0045] The camera pose is calculated using the following formula;

[0046]

[0047]

[0048]

[0049] where $e_p$ is the reprojection residual of any static instance; $e_r$ is the position difference of any static map point at different times; $N$ is the number of static map point and static instance pairs at the same time; $M$ is the number of static instance pairs at the same time; $T_{cw}$ is the camera pose at time t; $p(m_i)$ is the dynamic probability of the i-th static map point; $m$ is a static map point; $x$ is a static instance; $z$ is the observation of the static map point; $\pi(T_{cw} m_i)$ is the projection function of the i-th static map point; and $\omega_k$ is a preset weight.
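
As a rough illustration of how a dynamic probability can enter a reprojection residual, the sketch below projects map points with a pinhole model and sums the squared pixel errors. The weight (1 - p) on each term, the pinhole intrinsics (fx, fy, cx, cy), the rotation and translation form of the pose, and all function names are assumptions made for this example; they are not taken from the formulas of this application.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def project(point_w: Vec3, rotation: Sequence[Sequence[float]],
            translation: Vec3, fx: float, fy: float,
            cx: float, cy: float) -> Tuple[float, float]:
    """Pinhole projection of a world point into pixel coordinates.

    Assumes the transformed point lies in front of the camera (z > 0).
    """
    x, y, z = (
        sum(rotation[i][j] * point_w[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    return (fx * x / z + cx, fy * y / z + cy)


def weighted_reprojection_cost(map_points, observations, dynamic_probs,
                               rotation, translation, intrinsics) -> float:
    """Sum of squared reprojection errors, each scaled by (1 - p).

    The (1 - p) weighting is an assumed choice: points that are more
    likely to be dynamic contribute less to the cost.
    """
    cost = 0.0
    for m, z, p in zip(map_points, observations, dynamic_probs):
        u, v = project(m, rotation, translation, *intrinsics)
        e_u, e_v = z[0] - u, z[1] - v
        cost += (1.0 - p) * (e_u * e_u + e_v * e_v)
    return cost
```

A pose solver would minimise such a cost over the rotation and translation; the optimisation itself is outside this sketch.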

[0050] Optionally, the first preset condition is:

[0051] The time difference between the current time and the last time a keyframe was built exceeds the preset time;

[0052] And / or, the inlier rate of the feature points in the image frame is less than a preset threshold; the inlier rate represents the ratio of static feature points in the image frame.

[0053] Based on the same inventive concept, this application also provides a real-time positioning and mapping device in a dynamic environment, comprising:

[0054] The acquisition module is configured to acquire an environmental image and a preset map, and obtain keyframes based on the environmental image and the preset map.

[0055] The first calculation module is configured to calculate the first mask image based on the keyframe using a trained deep learning model.

[0056] The second calculation module is configured to calculate the second mask image based on the first mask image using a dense optical flow algorithm.

[0057] The third calculation module is configured to update the dynamic probability of map points of the preset map based on the first mask map and the second mask map, and update the dynamic probability of instances based on the first mask map and the second mask map.

[0058] The fourth calculation module is configured to construct reprojection residuals and feature relative position residuals based on the dynamic probabilities of the map points, and to calculate the camera pose based on the reprojection residuals and feature relative position residuals.

[0059] The generation module is configured to construct a target map based on the camera pose, distinguish dynamic features from static features based on the dynamic probability of the instance, construct reprojection residuals and feature relative position residuals based on static features, calculate camera pose, and construct a target map based on the camera pose.

[0060] Based on the same inventive concept, this application also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the real-time positioning and mapping method in a dynamic environment as described in any of the above claims.

[0061] As can be seen from the above, the real-time localization and mapping method, apparatus and electronic device in a dynamic environment provided in this application obtain a first mask image through a deep learning model, and then obtain a second mask image through a dense optical flow algorithm based on the first mask image. The static and dynamic instances in the image are determined by the first and second mask images together, and the dynamic instances are then removed. This reduces missed and false detections when determining the dynamic probability of instances, and improves the accuracy and robustness of real-time localization and mapping.

Attached Figure Description

[0062] To more clearly illustrate the technical solutions in this application or related technologies, the drawings used in the description of the embodiments or related technologies will be briefly introduced below. Obviously, the drawings described below are only embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0063] Figure 1 is a flowchart of the real-time localization and mapping method in a dynamic environment according to one or more embodiments of this application;

[0064] Figure 2 is a schematic diagram of the structure of a real-time positioning and mapping device in a dynamic environment according to one or more embodiments of this application;

[0065] Figure 3 is a schematic diagram illustrating experimental results of the real-time localization and mapping method in a dynamic environment according to one or more embodiments of this application;

[0066] Figure 4 is a schematic diagram illustrating experimental results of the real-time localization and mapping method in a dynamic environment according to one or more embodiments of this application;

[0067] Figure 5 is a schematic diagram of the structure of a real-time localization and mapping system in a dynamic environment according to one or more embodiments of this application;

[0068] Figure 6 is a schematic diagram of the hardware structure of an electronic device according to one or more embodiments of this application.

Detailed Implementation

[0069] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with specific embodiments and the accompanying drawings.

[0070] It should be noted that, unless otherwise defined, the technical or scientific terms used in the embodiments of this application should have the ordinary meaning understood by one of ordinary skill in the art to which this application pertains. The terms "first," "second," and similar terms used in the embodiments of this application do not indicate any order, quantity, or importance, but are merely used to distinguish different components. Terms such as "comprising" or "including" mean that the element or object preceding the word encompasses the elements or objects listed after the word and their equivalents, without excluding other elements or objects. Terms such as "connected" or "linked" are not limited to physical or mechanical connections, but can include electrical connections, whether direct or indirect. Terms such as "upper," "lower," "left," and "right" are only used to indicate relative positional relationships; when the absolute position of the described object changes, the relative positional relationship may also change accordingly.

[0071] Simultaneous Localization and Mapping (SLAM) is a technique that builds a map based on the surrounding environment while moving. Using SLAM, a robot can start from an unknown location in an unknown environment, perform self-localization based on its position and a pre-set map during movement, and simultaneously build an incremental map based on its self-localization, thus achieving autonomous localization and navigation.

[0072] In the process of using SLAM technology to achieve real-time localization and incremental map building for robots, environmental images acquired by the robot through its onboard camera are utilized, and feature matching is performed between these environmental images and a pre-defined map. Both the environmental images and the pre-defined map include at least one instance, and all instances can be categorized into static and dynamic instances based on dynamic probability. Instances represent buildings, objects, animals, etc., that can serve as reference points, and dynamic probability represents the likelihood that the instance's position will change; based on this probability, instances can be classified as static or dynamic. Each instance includes at least one feature point in both the environmental images and the pre-defined map.

[0073] As described in the background section, in related technologies, dynamic instances in images are removed using deep learning methods based on prior semantic information. However, filtering dynamic instances using deep learning methods is prone to false detections or missed detections. Therefore, in dynamic environments, especially those with limited static context, the accuracy and robustness of real-time localization and mapping solutions employed by these technologies are unsatisfactory.

[0074] Therefore, this application proposes a real-time localization and mapping method in a dynamic environment. Based on the determination of instances through deep learning, instances are further determined through dense optical flow algorithm. The dynamic probabilities of instances obtained by the two methods are used to filter and summarize all instances to avoid missed or false detections, thereby improving the accuracy and robustness of real-time localization and mapping.

[0075] The technical solutions of one or more embodiments of this application will be described in detail below through specific examples.

[0076] Referring to Figure 1, the real-time localization and mapping method of one or more embodiments of this specification includes the following steps:

[0077] Step S101: Obtain an environmental image and a preset map, and obtain keyframes based on the environmental image and the preset map.

[0078] In this step, keyframes are first acquired to facilitate subsequent steps. In some embodiments, these keyframes can be obtained by feature matching between the acquired image and a preset map.

[0079] In some embodiments, the above keyframes can be obtained through the following specific steps:

[0080] Step S201: Acquire environmental images.

[0081] In some embodiments, the aforementioned environmental image can be obtained through a preset camera. In some embodiments, the aforementioned camera can be an RGB-D camera. In other embodiments, the aforementioned camera can be a stereo camera. RGB-D cameras and stereo cameras can directly or indirectly calculate the pixel information and depth information of the image. Different camera choices, as long as they achieve the corresponding purpose, will not affect the scope of protection of this invention.

[0082] Step S202: Based on the environmental image, obtain image feature points using a preset feature extraction method.

[0083] In this step, feature extraction is performed on the environmental image obtained in step S201.

[0084] In some embodiments, image preprocessing may be performed before feature extraction. In some embodiments, the preprocessing may include converting a color image to a grayscale image.

[0085] In some embodiments, the above feature extraction methods may include: obtaining ORB features using the FAST algorithm, where ORB features include feature points and descriptors; obtaining SIFT features using the SIFT algorithm, where SIFT features include feature points; obtaining SURF features using the SURF algorithm, where SURF features include feature points; or obtaining SuperPoint features using SuperPoint extraction, where SuperPoint features include feature points and descriptors. Different feature extraction methods, as long as they achieve the corresponding objectives, will not affect the scope of protection of this invention.

[0086] Step S203: Perform feature matching between the image feature points and the map points to obtain an image frame containing the feature matching relationship between the image feature points and the map points.

[0087] In this step, feature matching is performed between the feature points obtained in step S202 and the feature points of the preset map to obtain an image frame containing the feature matching relationship between the image feature points and the map points.

[0088] In some embodiments, the aforementioned preset map can also be obtained through initialization. In some embodiments, the specific steps of the initialization may include: calculating the three-dimensional coordinates of the feature points based on their depth values in the environmental image. In some embodiments, the specific steps of the initialization may further include: triangulating the feature points based on the binocular disparity of the environmental image, and initializing the map based on the triangulated feature points. Different initialization methods, as long as they achieve the corresponding purpose, will not affect the scope of protection of this invention.
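
The two initialization routes mentioned above can be sketched with the standard pinhole camera relations. This is a minimal illustration: the intrinsics (fx, fy, cx, cy), the stereo baseline, the use of rectified-stereo disparity, and the function names are assumptions for this example and are not specified in this application.

```python
from typing import Tuple


def back_project(u: float, v: float, depth: float,
                 fx: float, fy: float,
                 cx: float, cy: float) -> Tuple[float, float, float]:
    """Lift pixel (u, v) with a metric depth to a 3D point in the camera frame."""
    if depth <= 0.0:
        raise ValueError("depth must be positive")
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def stereo_depth(disparity: float, fx: float, baseline: float) -> float:
    """Depth from rectified-stereo disparity: Z = fx * baseline / disparity."""
    if disparity <= 0.0:
        raise ValueError("disparity must be positive")
    return fx * baseline / disparity
```

With an RGB-D camera the depth is read directly from the depth map; with a stereo camera, `stereo_depth` would supply the depth that `back_project` then uses.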

[0089] Step S204: In response to determining that the image frame meets the first preset condition, the image frame is used as a keyframe.

[0090] In some embodiments, the first preset condition may be: the time since the last keyframe acquisition exceeds a predetermined time; and / or, the inlier rate of the feature points in the image frame is less than a predetermined threshold. In some embodiments, the predetermined time may be 0.25 seconds. The inlier rate represents the ratio of static feature points in the image frame.
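
The keyframe condition can be sketched as a small predicate. The 0.25 second interval comes from the text above; the inlier-rate threshold of 0.5, the choice to combine the two tests with a logical "or" (the text allows "and / or"), and the function name are assumptions for this example.

```python
def is_keyframe(now_s: float, last_keyframe_s: float,
                static_points: int, total_points: int,
                max_interval_s: float = 0.25,
                min_inlier_rate: float = 0.5) -> bool:
    """Return True when the frame should be promoted to a keyframe.

    The inlier rate is the share of static feature points in the frame.
    A frame with no feature points is treated as having an inlier rate of 0.
    """
    elapsed = now_s - last_keyframe_s
    inlier_rate = static_points / total_points if total_points else 0.0
    return elapsed > max_interval_s or inlier_rate < min_inlier_rate
```

For example, a frame arriving 0.5 seconds after the last keyframe qualifies regardless of its inlier rate, while a frame arriving after 0.1 seconds qualifies only if fewer than half of its feature points are static.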

[0091] As described above, instances are reference objects that can be used for real-time localization or mapping. In some embodiments, both the aforementioned environmental image and the aforementioned preset map may include at least one instance. Therefore, the feature map obtained based on the aforementioned environmental image and the aforementioned preset map should also include at least one instance. That is, the aforementioned keyframes should also include at least one instance. Each of the aforementioned instances includes feature points from at least one keyframe.

[0092] In other embodiments, the aforementioned environmental image or preset map does not include instances; that is, it is impossible to identify instances that are included in both the environmental image and the preset map. In this case, the keyframe does not include instances. In some embodiments, the keyframe background can be used for localization in this situation.

[0093] Since the above steps are to obtain the key frame by matching the feature points of the above environmental image with the map points of the above preset map, the feature points in the above key frame have a corresponding relationship with the map points of the above preset map, and the dynamic probability of the feature points in the key frame is the dynamic probability of the corresponding map points of the preset map.

[0094] Step S102: Based on the keyframe, calculate the first mask image using the trained deep learning model.

[0095] In some embodiments, the deep learning model described above can be an image segmentation model. In some embodiments, the deep learning model described above can be at least one of the Mask R-CNN model, the FCIS model, and the YOLACT model. Different deep learning models, as long as they achieve the corresponding objectives, do not affect the scope of protection of this invention.

[0096] As mentioned above, each instance corresponds to a pixel region. In some embodiments, the pixel regions corresponding to different instances in the keyframe may overlap, which affects the subsequent determination of the dynamic probability of the instance. To solve this problem, in some embodiments, at least one sub-mask image can be obtained first through instance segmentation. Each sub-mask image includes a unique instance, and the sub-mask images correspond one-to-one with the instances in the keyframe. Then, the average depth of the pixels in the pixel region of the unique instance in each sub-mask image is calculated, and this average depth is used as the depth value of the instance. The sub-mask images are then superimposed in descending order of the depth values of their instances to obtain the first mask image. Through the above steps, the instances in the first mask image are superimposed sequentially from farthest to nearest according to their distance.
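
The far-to-near overlay described above can be sketched as follows. This is a minimal illustration in which each sub-mask is a binary grid, the instance label is an integer, and the background label is 0; these representations and the function name are assumptions for this example.

```python
from typing import List, Tuple

BACKGROUND = 0  # assumed background label

# (instance_id, depth_value, binary pixel grid of the same size as the frame)
SubMask = Tuple[int, float, List[List[int]]]


def compose_first_mask(height: int, width: int,
                       sub_masks: List[SubMask]) -> List[List[int]]:
    """Overlay sub-masks from farthest to nearest.

    Painting in descending depth order means a nearer instance overwrites
    a farther one wherever their pixel regions overlap.
    """
    mask = [[BACKGROUND] * width for _ in range(height)]
    ordered = sorted(sub_masks, key=lambda s: s[1], reverse=True)
    for instance_id, _depth, pixels in ordered:
        for r in range(height):
            for c in range(width):
                if pixels[r][c]:
                    mask[r][c] = instance_id
    return mask
```

When no instance is segmented, the list of sub-masks is empty and every pixel keeps the background value.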

[0097] In some embodiments, the aforementioned depth value may represent the distance of the instance from the camera. In some embodiments, the depth value of the instance can be obtained by taking the average of the non-zero value set of the pixel region corresponding to the instance in the depth map, based on data collected by an RGB-D camera. In some embodiments, the depth of the instance can be obtained by reconstructing the pixel region corresponding to the instance into a world coordinate system map based on data collected by a stereo camera, and taking the average depth of the resulting map point set as the depth value of the instance. Different methods for obtaining the instance depth value do not affect the scope of protection of this invention as long as they achieve the corresponding purpose.
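
For the RGB-D case, the instance depth described above is the mean of the non-zero depth readings inside the instance region. A minimal sketch, assuming the depth map is a 2D grid in which 0 marks a missing reading and the instance region is given as (row, column) pairs; the function name is hypothetical.

```python
from typing import Iterable, List, Optional, Tuple


def instance_depth(depth_map: List[List[float]],
                   instance_pixels: Iterable[Tuple[int, int]]) -> Optional[float]:
    """Average the valid (non-zero) depth readings of one instance.

    Returns None when the region contains no valid depth reading.
    """
    values = [depth_map[r][c] for r, c in instance_pixels if depth_map[r][c] > 0]
    if not values:
        return None
    return sum(values) / len(values)
```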

[0098] In some embodiments, the keyframe does not contain instances, that is, the keyframe only contains the background or people, objects, etc. that cannot be used as references. In this case, the instances cannot be segmented by the deep learning model trained above, and there is no corresponding sub-mask image. Therefore, all pixels in the corresponding mask image are set to the preset background value.

[0099] Step S103: Based on the first mask image, calculate the second mask image using the dense optical flow algorithm.

[0100] In implementing this application, the applicant discovered that segmenting instances solely through image segmentation or other deep learning models may result in either false negatives or false positives. False negatives occur when actual instances are not detected during the segmentation process; false positives occur when instances with dynamic probabilities below a preset threshold, which could serve as fixed references for subsequent camera pose and mapping calculations, are incorrectly classified as unsuitable for reference. These issues may lead to an insufficient number of static instances, resulting in low accuracy in camera pose calculations and further contributing to poor mapping accuracy or robustness.

[0101] Therefore, the applicant proposes a method that, after obtaining the first mask image through a deep learning model, obtains a second mask image through a dense optical flow algorithm, and uses the second mask image and the first mask image to verify each other, so as to avoid the occurrence of missed detections and false detections as much as possible.

[0102] Specifically, in some embodiments, the second mask image can be obtained through the following steps: obtaining the first mask image from the previous moment; calculating the offset value of each pixel coordinate based on the first mask image from the previous moment and the current moment using a preset dense optical flow algorithm; and calculating the coordinate values of each pixel in the second mask image using the following formula based on the offset value and the first mask image from the previous moment: $u_t = u_{t-1} + \Delta u, \quad v_t = v_{t-1} + \Delta v$, where $u_t$ is the horizontal coordinate of any pixel in the second mask image, $v_t$ is the vertical coordinate of that pixel, $u_{t-1}$ is the horizontal coordinate of the corresponding pixel at the previous moment, $v_{t-1}$ is the vertical coordinate of the corresponding pixel at the previous moment, $\Delta u$ is the horizontal coordinate offset value, and $\Delta v$ is the vertical coordinate offset value; the second mask image is obtained based on the coordinate values of each pixel in the second mask image.
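
The propagation step can be sketched as a forward warp of the previous mask. This is a minimal illustration that assumes each pixel's new coordinate is its previous coordinate plus the flow offset, that the dense flow field has already been computed by some optical flow algorithm, and that offsets are rounded to the nearest pixel; the function name and the background label 0 are hypothetical.

```python
from typing import List, Tuple


def propagate_mask(prev_mask: List[List[int]],
                   flow: List[List[Tuple[float, float]]],
                   background: int = 0) -> List[List[int]]:
    """Shift every labelled pixel of the previous mask by its flow offset.

    flow[v][u] holds (du, dv), the horizontal and vertical offsets of the
    pixel at column u, row v. Pixels that leave the image are dropped.
    """
    height = len(prev_mask)
    width = len(prev_mask[0]) if height else 0
    new_mask = [[background] * width for _ in range(height)]
    for v_prev in range(height):
        for u_prev in range(width):
            label = prev_mask[v_prev][u_prev]
            if label == background:
                continue
            du, dv = flow[v_prev][u_prev]
            u_t = u_prev + int(round(du))
            v_t = v_prev + int(round(dv))
            if 0 <= u_t < width and 0 <= v_t < height:
                new_mask[v_t][u_t] = label
    return new_mask
```

A forward warp of this kind can leave small holes where no source pixel lands; a production system would likely fill them, which this sketch does not attempt.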

[0103] Step S104: Update the dynamic probability of the instance based on the first mask image and the second mask image.

[0104] As described above, both the first and second mask images include at least one feature point, and all feature points can be matched with corresponding map points. In some embodiments, the first and second mask images may contain the same feature point and correspond to the same or different dynamic probabilities. In some embodiments, a feature point may exist in the first / second mask image but not in the second / first mask image.

[0105] In some embodiments, the intersection over union (IoU) of all instances in the first mask map with all instances in the second mask map can be calculated using an IoU function, and the most likely dynamic probability of all possible instances can be determined using the IoU.

[0106] In some embodiments, in response to the intersection over union (IoU) of any of the feature points being greater than or equal to a preset threshold, the mask value of the feature point in the first mask image is used as the target mask value of the feature point; in response to the IoU of any of the feature points being less than the preset threshold, and the feature point belonging to the first mask image, the mask value of the feature point in the first mask image is used as the target mask value of the feature point; in response to the IoU of any of the feature points being less than the preset threshold, and the feature point belonging to the second mask image, the mask value of the feature point in the second mask image is used as the target mask value of the feature point; all of the feature points are used as target feature points, and a target mask image is generated based on all of the target feature points and the target mask values corresponding to the target feature points; the target feature points have a correspondence with the map points.
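
The three selection rules can be sketched as follows. The text does not spell out how the IoU of a feature point is obtained, so this example assumes it is the IoU of the instance region containing the point, compared between the two masks, and that instance labels are consistent across both masks. The threshold of 0.5, the background label 0, and all function names are likewise assumptions.

```python
from typing import List, Set, Tuple

Mask = List[List[int]]
Pixel = Tuple[int, int]  # (row, column)


def pixels_with_label(mask: Mask, label: int) -> Set[Pixel]:
    """Collect the pixel coordinates that carry the given instance label."""
    return {(r, c) for r, row in enumerate(mask)
            for c, val in enumerate(row) if val == label}


def iou(a: Set[Pixel], b: Set[Pixel]) -> float:
    """Intersection over union of two pixel sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def target_mask_value(point: Pixel, first_mask: Mask, second_mask: Mask,
                      threshold: float = 0.5, background: int = 0) -> int:
    """Pick the target mask value of one feature point."""
    r, c = point
    first_val, second_val = first_mask[r][c], second_mask[r][c]
    label = first_val if first_val != background else second_val
    if label == background:
        return background  # the point lies on the background in both masks
    overlap = iou(pixels_with_label(first_mask, label),
                  pixels_with_label(second_mask, label))
    if overlap >= threshold:
        return first_val   # rule 1: the masks agree, trust the first mask
    if first_val != background:
        return first_val   # rule 2: the point belongs to the first mask
    return second_val      # rule 3: the point belongs only to the second mask
```

Under these assumptions, rule 3 is what lets the optical-flow mask recover an instance that the segmentation model missed in the current frame.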

[0107] In some embodiments, after all candidate feature points in the keyframe have been determined as described above, the dynamic probability of each feature point is calculated as follows. Because the feature points obtained at this stage correspond one-to-one with the map points of the preset map, they can be treated as the map points. In some embodiments, the initial dynamic probability of a map point can be set to 0.5. In some embodiments, based on prior information about the map points, the dynamic probability of a map point presumed to be dynamic can be set to 0.9, and that of a map point presumed to be static can be set to 0.1. In some embodiments, the dynamic probability is calculated by obtaining the prior dynamic probabilities of all map points in the target mask image and then updating the dynamic probability of each map point using the following formula:

$$p(m_t) = \eta\, p(z_t \mid m_t)\, p(m_t \mid z_{t-1}, m_0)$$

where $\eta$ is the preset normalization coefficient, $m_t$ is the state of the map point at time $t$, $z_t$ is the observation state of the map point at time $t$, $z_{t-1}$ is the observation state of the map point at time $t-1$, and $m_0$ is the initial state of the map point.
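The recursive update can be sketched as a binary Bayes filter over the two states "dynamic" and "static". This is only an illustration under stated assumptions: the observation likelihood `hit_rate` is a hypothetical parameter (0.9 here, echoing the prior values mentioned above), and the normalization coefficient is computed from the two unnormalized terms, whereas the text describes it as preset.

```python
def update_dynamic_probability(prior: float, observed_dynamic: bool,
                               hit_rate: float = 0.9) -> float:
    """Return the posterior probability that a map point is dynamic.

    prior            -- p(m_t | z_{t-1}, m_0), the probability before this frame
    observed_dynamic -- whether the target mask marks the point as dynamic (z_t)
    hit_rate         -- assumed p(z_t agrees with the true state); hypothetical
    """
    # Likelihood of the observation under each hypothesis, p(z_t | m_t).
    lik_dynamic = hit_rate if observed_dynamic else 1.0 - hit_rate
    lik_static = 1.0 - hit_rate if observed_dynamic else hit_rate
    unnorm_dynamic = lik_dynamic * prior
    unnorm_static = lik_static * (1.0 - prior)
    # eta normalizes the two hypotheses so that they sum to one.
    eta = 1.0 / (unnorm_dynamic + unnorm_static)
    return eta * unnorm_dynamic
```

Starting from the initial value 0.5, repeated "dynamic" observations push the probability toward 1 and repeated "static" observations push it toward 0, so a single misclassified frame does not flip the state of a map point.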

[0108] Step S105: Construct reprojection residuals and feature relative position residuals based on the dynamic probabilities of the map points, and calculate the camera pose based on the reprojection residuals and feature relative position residuals.

[0109] In some embodiments, the above dynamic probability can be regarded as the updated dynamic probability, and subsequent calculations such as camera pose solving and map building can be performed based on the updated dynamic probability.

[0110] In some embodiments, constructing reprojection residuals and feature relative position residuals based on the dynamic probabilities of the map points, and calculating camera pose based on the reprojection residuals and feature relative position residuals, includes: selecting static instances from the instances based on the dynamic probabilities of at least all the map points; selecting static map points from the map points based on the preset map; and calculating camera pose according to the following formula.

[0111]

[0112]

[0113]

[0114] Here, $e_p$ is the reprojection residual of any static instance; $e_r$ is the position difference of any static map point at different times; $N$ is the number of pairs of static map points and static instances at the same time; $M$ is the number of static instance pairs at the same time; $T_{cw}$ is the camera pose at time $t$; $p(m_i)$ is the dynamic probability of the $i$-th static map point; $m$ is a static map point; $x$ is a static instance; $z$ is the observation of a static map point; $\pi(T_{cw} m_i)$ is the projection function of the $i$-th static map point; and $\omega_k$ is a preset weight.
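Because the formulas of paragraphs [0111] to [0113] are not reproduced in this text, the following Python sketch is not the applicant's cost function. It only illustrates, under stated assumptions, how a reprojection residual $z - \pi(T_{cw} m_i)$ might be weighted by the dynamic probability of a map point. The pinhole intrinsics `fx`, `fy`, `cx`, `cy`, the split of $T_{cw}$ into `rotation` and `translation`, and the weight $1 - p(m_i)$ are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def project(rotation: Sequence[Sequence[float]], translation: Sequence[float],
            point: Sequence[float], fx: float, fy: float,
            cx: float, cy: float) -> Tuple[float, float]:
    """Pinhole projection pi(T_cw * m) of a world point into the image."""
    # Transform the point from the world frame into the camera frame.
    xc = [sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
          for r in range(3)]
    return fx * xc[0] / xc[2] + cx, fy * xc[1] / xc[2] + cy


def weighted_reprojection_cost(rotation, translation, points, observations,
                               dynamic_probs, fx, fy, cx, cy) -> float:
    """Sum of squared reprojection residuals, down-weighted by p(m_i).

    The weight (1 - p) is an assumed choice: points that are more likely to
    be dynamic contribute less to the pose estimate.
    """
    cost = 0.0
    for m, z, p in zip(points, observations, dynamic_probs):
        u, v = project(rotation, translation, m, fx, fy, cx, cy)
        ex, ey = z[0] - u, z[1] - v
        cost += (1.0 - p) * (ex * ex + ey * ey)
    return cost
```

A pose solver would minimize such a cost over the rotation and translation; the sketch only evaluates it for a given pose.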

[0115] In some embodiments, when selecting feature point pairs for constructing relative position residuals, since the relative position errors of static feature point pairs at different times follow a t-distribution, feature point pairs that satisfy the t-test are selected to construct the constraint term, and the weight $\omega_k$ of the corresponding constraint term is calculated according to the distribution.
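The text does not specify the test statistic or the weight formula, so the following Python sketch is only one possible reading. It assumes that each feature point pair yields a scalar relative position error, that a scale `sigma` is known, and that a pair is kept when its standardized error lies within a two-sided critical value of the t-distribution (2.571 for 5 degrees of freedom at the 95% level). The weight uses the common Student-t robust weight $(\nu + 1) / (\nu + (r/\sigma)^2)$. The function names and default values are hypothetical.

```python
from typing import List, Sequence, Tuple


def t_weight(residual: float, sigma: float, nu: float = 5.0) -> float:
    """Student-t robust weight; large standardized residuals get small weights."""
    s = residual / sigma
    return (nu + 1.0) / (nu + s * s)


def select_pairs(residuals: Sequence[float], sigma: float, nu: float = 5.0,
                 critical: float = 2.571) -> List[Tuple[int, float]]:
    """Return (pair index k, weight omega_k) for pairs that pass the gate.

    `critical` is the assumed two-sided 95% critical value for nu = 5.
    """
    selected = []
    for k, r in enumerate(residuals):
        if abs(r / sigma) <= critical:
            selected.append((k, t_weight(r, sigma, nu)))
    return selected
```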

[0116] Step S106: Construct a target map based on the camera pose.

[0117] In implementing this application, the applicant discovered that the processing time for a single frame of image using a deep learning model for image segmentation cannot meet the latency requirements of practical application scenarios. Therefore, the applicant proposes a non-blocking tracking thread structure. That is, if the aforementioned dynamic probability is not calculated within a preset time, the original dynamic probability is still used for calculation during camera pose determination; if the aforementioned dynamic probability calculation is completed within the preset time, the updated dynamic probability is used during camera pose determination.
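The fallback behavior described above can be sketched in Python with a worker thread and a bounded wait. This is a minimal illustration, not the applicant's thread structure: `segment_and_update` is a hypothetical stand-in for segmentation plus the probability update, and `budget_s` plays the role of the preset time.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def segment_and_update(delay_s: float, updated: float) -> float:
    """Hypothetical stand-in for segmentation plus the probability update."""
    time.sleep(delay_s)  # simulates model inference time
    return updated


def probability_for_tracking(executor: ThreadPoolExecutor, original: float,
                             delay_s: float, updated: float,
                             budget_s: float) -> float:
    """Use the updated probability if it arrives within budget_s seconds,
    otherwise keep the original probability so that tracking is not stalled."""
    future = executor.submit(segment_and_update, delay_s, updated)
    try:
        return future.result(timeout=budget_s)
    except FutureTimeout:
        return original
```

The tracking thread waits at most `budget_s` before solving the camera pose, so a slow segmentation result affects only later frames.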

[0118] As shown in Figures 3 and 4, the applicant conducted experiments on the TUM dataset and the KITTI dataset to evaluate ORB-SLAM3 and the method provided in this application.

[0119] The experimental results cover three aspects: absolute trajectory error (ATE), relative pose error (RPE), and time. ATE measures the direct difference between the estimated pose and the ground-truth pose, and gives an intuitive indication of the algorithm's accuracy and global trajectory consistency. RPE describes the accuracy of the pose difference between two frames separated by a fixed time interval, compared with the same difference in the ground truth, and therefore measures the odometry error directly.
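The two error metrics can be illustrated with a simplified Python sketch. It assumes that both trajectories are already time-associated and expressed in the same frame, and it uses positions only; full ATE and RPE evaluations operate on complete poses and normally align the trajectories first. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Position = Tuple[float, float, float]


def ate_rmse(estimated: Sequence[Position], ground_truth: Sequence[Position]) -> float:
    """Root-mean-square of the per-frame position error (translation-only ATE)."""
    sq = [sum((e - g) ** 2 for e, g in zip(pe, pg))
          for pe, pg in zip(estimated, ground_truth)]
    return math.sqrt(sum(sq) / len(sq))


def rpe_rmse(estimated: Sequence[Position], ground_truth: Sequence[Position],
             delta: int = 1) -> float:
    """RMSE of the error in the displacement between frames i and i + delta."""
    sq = []
    for i in range(len(estimated) - delta):
        de = [b - a for a, b in zip(estimated[i], estimated[i + delta])]
        dg = [b - a for a, b in zip(ground_truth[i], ground_truth[i + delta])]
        sq.append(sum((x - y) ** 2 for x, y in zip(de, dg)))
    return math.sqrt(sum(sq) / len(sq))
```

A constant offset between the two trajectories produces a nonzero ATE but a zero RPE in this sketch, which shows why the two metrics are reported separately.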

[0120] As can be seen from the figures, the method provided in this application improves the accuracy of instance detection and of the associated dynamic probabilities, strengthens the constraints used in camera pose solving, alleviates the missed-detection problem of the models used in related technologies, and improves the accuracy and robustness of real-time localization and mapping in dynamic scenes.

[0121] It should be noted that the method in this embodiment can be executed by a single device, such as a computer or server. The method can also be applied in a distributed scenario, where multiple devices cooperate to complete the task. In such a distributed scenario, one of these devices may execute only one or more steps of the method in this embodiment, and the multiple devices will interact with each other to complete the method described.

[0122] It should be noted that the above description describes some embodiments of this application. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recorded in the claims can be performed in a different order than that shown in the above embodiments and still achieve the desired result. Furthermore, the processes depicted in the drawings do not necessarily require a specific or sequential order to achieve the desired result. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

[0123] Based on the same inventive concept, corresponding to any of the above embodiments, this application also provides a real-time positioning and mapping device in a dynamic environment.

[0124] Referring to Figure 2, the real-time positioning and mapping device in the dynamic environment includes:

[0125] The acquisition module 11 is configured to acquire an environmental image and a preset map, and to obtain keyframes based on the environmental image and the preset map;

[0126] The first calculation module 12 is configured to calculate the first mask image based on the keyframe using a trained deep learning model.

[0127] The second calculation module 13 is configured to calculate the second mask image based on the first mask image using a dense optical flow algorithm.

[0128] The third calculation module 14 is configured to update the dynamic probability of map points in the preset map based on the first mask image and the second mask image;

[0129] The fourth calculation module 15 is configured to construct reprojection residuals and feature relative position residuals based on the dynamic probabilities of the map points, and to calculate the camera pose based on the reprojection residuals and feature relative position residuals.

[0130] The generation module 16 is configured to construct a target map based on the camera pose.

[0131] As shown in Figure 5, one or more embodiments of this application can be applied to the following SLAM system, which includes:

[0132] The image acquisition module is configured to acquire environmental images through a preset camera device.

[0133] The inter-frame pose calculation module is configured to perform image matching based on the aforementioned environmental image and a preset map to obtain a feature map. The map points of the preset map and the dynamic probabilities of the instances corresponding to the map points are obtained through the mapping module. After obtaining the feature map, the camera pose is calculated according to a preset algorithm, and when the feature map meets preset conditions, the feature map is sent as a keyframe to the instance segmentation module, the mapping module, and the loop closure detection module respectively.

[0134] The instance segmentation module is configured to calculate updated instance dynamic probabilities based on keyframes. Its specific calculation process and algorithm are as described above and will not be repeated here. After obtaining the updated dynamic probabilities, these probabilities are sent to the mapping module.

[0135] The mapping module is configured to update map points based on the received dynamic probabilities of instances, and to maintain the keyframes and their covisibility relationships.

[0136] The loop closure detection module is configured to determine whether a loop closure has occurred based on preset conditions, and to calculate the relative pose between loop closure frames when loop closure calculation is required.

[0137] The backend optimization module is configured to optimize map points and camera poses based on the dynamic probabilities of instances.
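The keyframe hand-off performed by the inter-frame pose calculation module can be sketched as a simple predicate. The sketch follows the first preset condition stated in claim 7 (elapsed time since the last keyframe, and / or a low inlier rate of static feature points) and treats "and / or" as a logical OR. The function name and the default thresholds are hypothetical.

```python
def is_keyframe(now_s: float, last_keyframe_s: float,
                static_matches: int, total_matches: int,
                max_interval_s: float = 0.5,
                min_inlier_rate: float = 0.7) -> bool:
    """Decide whether the current image frame should become a keyframe.

    The inlier rate is the share of matched feature points that are static.
    Both thresholds are assumed values, not taken from this application.
    """
    inlier_rate = static_matches / total_matches if total_matches else 0.0
    too_old = (now_s - last_keyframe_s) > max_interval_s
    too_few_static = inlier_rate < min_inlier_rate
    return too_old or too_few_static
```

A frame that passes this check would then be sent to the instance segmentation, mapping, and loop closure detection modules.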

[0138] For ease of description, the above device is described in terms of its functions, divided into modules. Of course, in implementing this application, the functions of the modules can be implemented in one or more pieces of software and / or hardware.

[0139] The apparatus described above is used to implement the real-time localization and mapping method in the corresponding dynamic environment in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.

[0140] Based on the same inventive concept, corresponding to the methods of any of the above embodiments, this application also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the real-time positioning and mapping method in a dynamic environment as described in any of the above embodiments.

[0141] Figure 6 illustrates a more specific hardware structure of the electronic device provided in this embodiment. The device may include a processor 1010, a memory 1020, an input / output interface 1030, a communication interface 1040, and a bus 1050. The processor 1010, memory 1020, input / output interface 1030, and communication interface 1040 are interconnected within the device via the bus 1050.

[0142] The processor 1010 can be implemented using a general-purpose CPU (Central Processing Unit), microprocessor, application-specific integrated circuit (ASIC), GPU (Graphics Processing Unit), NPU (Neural Processing Unit), or one or more integrated circuits, to execute relevant programs and implement the technical solutions provided in the embodiments of this specification. Different implementation methods, as long as they achieve the corresponding objectives, do not affect the scope of protection of this invention.

[0143] The memory 1020 can be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), static storage device, dynamic storage device, etc. The memory 1020 can store the operating system and other applications. When the technical solutions provided in the embodiments of this specification are implemented by software or firmware, the relevant program code is stored in the memory 1020 and is called and executed by the processor 1010.

[0144] The input / output interface 1030 is used to connect input / output modules to realize information input and output. Input / output modules can be configured as components within the device (not shown in the figure) or externally connected to the device to provide corresponding functions. Input devices may include keyboards, mice, touchscreens, microphones, various sensors, etc., while output devices may include displays, speakers, vibrators, indicator lights, etc.

[0145] The communication interface 1040 is used to connect a communication module (not shown in the figure) to enable communication between this device and other devices. The communication module can communicate via wired means (such as USB or an Ethernet cable) or wireless means (such as a mobile network, Wi-Fi, or Bluetooth).

[0146] Bus 1050 includes a pathway for transmitting information between various components of the device, such as processor 1010, memory 1020, input / output interface 1030, and communication interface 1040.

[0147] It should be noted that although the above-described device only shows the processor 1010, memory 1020, input / output interface 1030, communication interface 1040, and bus 1050, in specific implementations, the device may also include other components necessary for normal operation. Furthermore, those skilled in the art will understand that the above-described device may only include the components necessary for implementing the embodiments of this specification, and not necessarily all the components shown in the figures.

[0148] The electronic devices described above are used to implement the real-time positioning and mapping methods in the corresponding dynamic environment in any of the foregoing embodiments, and have the beneficial effects of the corresponding method embodiments, which will not be repeated here.

[0149] Those skilled in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope of this application (including the claims) is limited to these examples. Within the framework of this application, the technical features of the above embodiments or of different embodiments can also be combined, the steps can be implemented in any order, and there are many other variations of the different aspects of the embodiments described above, which are not described in detail for the sake of brevity.

[0150] Additionally, to simplify the description and discussion, and to avoid obscuring the embodiments of this application, the well-known power / ground connections to integrated circuit (IC) chips and other components may or may not be shown in the provided drawings. Furthermore, the apparatus may be shown in block diagram form to avoid obscuring the embodiments of this application, and this also takes into account the fact that the details of the implementation of these block diagram apparatuses are highly dependent on the platform on which the embodiments of this application will be implemented (i.e., these details should be fully understood by those skilled in the art). While specific details (e.g., circuits) have been set forth to describe exemplary embodiments of this application, it will be apparent to those skilled in the art that the embodiments of this application can be implemented without these specific details or with variations thereof. Therefore, these descriptions should be considered illustrative rather than restrictive.

[0151] Although this application has been described in conjunction with specific embodiments thereof, many substitutions, modifications, and variations of these embodiments will be apparent to those skilled in the art from the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may be used with the embodiments discussed.

[0152] The embodiments of this application are intended to cover all such substitutions, modifications, and variations that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the embodiments of this application should be included within the protection scope of this application.

Claims

1. A real-time positioning and mapping method in a dynamic environment, characterized in that the method comprises: acquiring an environmental image and a preset map, and obtaining a keyframe based on the environmental image and the preset map; calculating a first mask image based on the keyframe using a trained deep learning model; calculating a second mask image based on the first mask image using a dense optical flow algorithm; updating the dynamic probability of map points in the preset map based on the first mask image and the second mask image; constructing a reprojection residual and a feature relative position residual based on the dynamic probability of the map points, and calculating a camera pose based on the reprojection residual and the feature relative position residual; and constructing a target map based on the camera pose; wherein updating the dynamic probability of the map points of the preset map comprises: determining the map point corresponding to each target feature point; obtaining the prior dynamic probability of the map point; and updating the dynamic probability of the map point using the following formula: $p(m_t) = \eta\, p(z_t \mid m_t)\, p(m_t \mid z_{t-1}, m_0)$; where $\eta$ is the preset normalization coefficient, $m_t$ is the state of the map point at time $t$, $z_t$ is the observation state of the map point at time $t$, $z_{t-1}$ is the observation state of the map point at time $t-1$, and $m_0$ is the initial state of the map point.

2. The real-time positioning and mapping method in a dynamic environment according to claim 1, characterized in that calculating the first mask image based on the keyframe using a trained deep learning model comprises: performing instance segmentation on the keyframe using the trained deep learning model to obtain an instance segmentation result; generating sub-mask images based on the instance segmentation result, each sub-mask image containing at least one instance; obtaining the depth value corresponding to each instance in each of the sub-mask images; and obtaining the first mask image by sequentially overlaying all of the sub-mask images in descending order of depth value.

3. The real-time positioning and mapping method in a dynamic environment according to claim 2, characterized in that calculating the second mask image based on the first mask image using a dense optical flow algorithm comprises: obtaining the first mask image of the previous moment; calculating the offset value of each pixel coordinate with a preset dense optical flow algorithm, based on the first mask image of the previous moment and that of the current moment; calculating the coordinate values of each pixel in the second mask image from the offset values and the first mask image of the previous moment using the following formula: $x_t = x_{t-1} + \Delta x$, $y_t = y_{t-1} + \Delta y$; where $x_t$ is the abscissa of any pixel in the second mask image, $y_t$ is the ordinate of that pixel in the second mask image, $x_{t-1}$ is the abscissa of the corresponding pixel at the previous moment, $y_{t-1}$ is the ordinate of the corresponding pixel at the previous moment, $\Delta x$ is the abscissa offset value, and $\Delta y$ is the ordinate offset value; and obtaining the second mask image based on the coordinate values of each pixel in the second mask image.

4. The real-time positioning and mapping method in a dynamic environment according to claim 3, characterized in that acquiring the environmental image and the preset map, and obtaining the keyframe based on the environmental image and the preset map, comprises: acquiring the environmental image; obtaining image feature points from the environmental image using a preset feature extraction method; matching the image feature points with the map points to obtain an image frame containing the feature matching relationship between the image feature points and the map points; and, in response to determining that the image frame meets a first preset condition, designating the image frame as the keyframe.

5. The real-time positioning and mapping method in a dynamic environment according to claim 4, characterized in that, before updating the dynamic probabilities of the map points in the preset map, the method further comprises: calculating the intersection-over-union (IoU) of all feature points in the first mask image and all feature points in the second mask image using an IoU function; in response to the IoU of any of the feature points being greater than or equal to a preset threshold, using the mask value of the feature point in the first mask image as the target mask value of the feature point; in response to the IoU of any of the feature points being less than the preset threshold and the feature point belonging to the first mask image, using the mask value of the feature point in the first mask image as the target mask value of the feature point; in response to the IoU of any of the feature points being less than the preset threshold and the feature point belonging to the second mask image, using the mask value of the feature point in the second mask image as the target mask value of the feature point; and using all of the feature points as target feature points, and generating a target mask image based on all of the target feature points and the target mask values corresponding to the target feature points; wherein the target feature points correspond to the map points.

6. The real-time positioning and mapping method in a dynamic environment according to claim 2, characterized in that constructing the reprojection residual and the feature relative position residual based on the dynamic probabilities of the map points, and calculating the camera pose based on the reprojection residual and the feature relative position residual, comprises: selecting static instances from the instances based on the dynamic probabilities of all the map points; obtaining static map points from the map points based on the preset map; and calculating the camera pose using a formula [formula image not reproduced in the text]; where $e_p$ is the reprojection residual of any static instance, $e_r$ is the position difference of any static map point at different times, $N$ is the number of pairs of static map points and static instances at the same time, $M$ is the number of static instance pairs at the same time, $T_{cw}$ is the camera pose at time $t$, $p(m_i)$ is the dynamic probability of the $i$-th static map point, $m$ is a static map point, $x$ is a static instance, $z$ is the observation of a static map point, $\pi(T_{cw} m_i)$ is the projection function of the $i$-th static map point, and $\omega_k$ is a preset weight.

7. The real-time positioning and mapping method in a dynamic environment according to claim 4, characterized in that the first preset condition is: the time difference between the current time and the time at which the last keyframe was created exceeds a preset time; and / or the inlier rate of the feature points in the image frame is less than a preset threshold, where the inlier rate represents the proportion of static feature points in the image frame.

8. A real-time positioning and mapping device for dynamic environments, characterized in that the device comprises: an acquisition module configured to acquire an environmental image and a preset map, and to obtain a keyframe based on the environmental image and the preset map; a first calculation module configured to calculate a first mask image based on the keyframe using a trained deep learning model; a second calculation module configured to calculate a second mask image based on the first mask image using a dense optical flow algorithm; a third calculation module configured to update the dynamic probability of map points in the preset map based on the first mask image and the second mask image; a fourth calculation module configured to construct a reprojection residual and a feature relative position residual based on the dynamic probabilities of the map points, and to calculate a camera pose based on the reprojection residual and the feature relative position residual; and a generation module configured to construct a target map based on the camera pose; wherein the third calculation module is specifically configured to: determine the map point corresponding to each target feature point; obtain the prior dynamic probability of the map point; and update the dynamic probability of the map point using the following formula: $p(m_t) = \eta\, p(z_t \mid m_t)\, p(m_t \mid z_{t-1}, m_0)$; where $\eta$ is the preset normalization coefficient, $m_t$ is the state of the map point at time $t$, $z_t$ is the observation state of the map point at time $t$, $z_{t-1}$ is the observation state of the map point at time $t-1$, and $m_0$ is the initial state of the map point.

9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the program, it implements the method as described in any one of claims 1 to 7.
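The mask construction recited in claims 2 and 3 can be illustrated with a short Python sketch. It assumes, for illustration only, that a mask is a dictionary from pixel coordinates to a hypothetical instance identifier, that each sub-mask carries a single depth value, and that the optical flow is given as an integer offset per pixel. The function names are not taken from this application.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int]


def overlay_by_depth(sub_masks: List[Tuple[float, int, List[Pixel]]]) -> Dict[Pixel, int]:
    """Overlay sub-masks in descending order of depth.

    Each entry is (depth, instance_id, pixels). Farther instances are drawn
    first, so nearer instances overwrite them where they overlap.
    """
    mask: Dict[Pixel, int] = {}
    for _, instance_id, pixels in sorted(sub_masks, key=lambda s: s[0], reverse=True):
        for px in pixels:
            mask[px] = instance_id
    return mask


def propagate_with_flow(prev_mask: Dict[Pixel, int],
                        flow: Dict[Pixel, Tuple[int, int]]) -> Dict[Pixel, int]:
    """Shift every labeled pixel of the previous mask by its flow offset."""
    moved: Dict[Pixel, int] = {}
    for (x, y), instance_id in prev_mask.items():
        dx, dy = flow.get((x, y), (0, 0))  # pixels without flow stay in place
        moved[(x + dx, y + dy)] = instance_id
    return moved
```

In this sketch, a pixel covered by both a near and a far instance ends up labeled with the near one, and the propagated mask gives a prediction for the current frame that does not depend on a new segmentation result.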