FPGA-based high-energy-efficiency real-time dense SLAM method and system

By extracting binocular camera image features on an FPGA and performing disparity estimation and key point matching to generate a dense disparity map, the problem of limited computing resources and sparse point clouds in existing technologies is solved, realizing an efficient and real-time dense SLAM method that is suitable for micro-sized autonomous mobile robots.

CN117745817B · Active · Publication Date: 2026-06-16 · SUN YAT SEN UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: SUN YAT SEN UNIV
Filing Date: 2023-12-15
Publication Date: 2026-06-16

Application Information

IPC: G06T7/73; G01C21/00; G01C21/20; G01B11/02; G01B11/00; G01S17/08; G01S17/06; G01S17/89

CPC: Y02D10/00

AI Tagging

Application Domain

Image analysis; Navigational calculation instruments

AI Technical Summary

⚠Technical Problem

Existing FPGA-based visual SLAM accelerators cannot provide robust feature representations in complex environments and have limited computing resources, only constructing incomplete sparse point clouds, which cannot meet the needs of advanced tasks such as autonomous navigation and autonomous obstacle avoidance.

⚗Method used

A high-efficiency real-time dense SLAM method based on FPGA is adopted. By extracting features from binocular camera images, performing disparity estimation and key point matching, and combining binary neural networks and non-maximum suppression algorithms, a dense disparity map and visual key points are generated and sent to the CPU for pose estimation and 3D dense mapping.

🎯Benefits of technology

It improves computational efficiency, reduces the size and power consumption of computing devices, enhances positioning accuracy and robustness, can generate dense point clouds to meet advanced task requirements, and improves real-time performance.

Abstract

The application discloses a high-energy-efficiency real-time dense SLAM method and system based on FPGA, and the method comprises the following steps: extracting features to obtain a feature map; performing parallax estimation according to the feature map to obtain a dense parallax map; extracting key points of a left eye image as first key points, performing feature correlation with a left feature map, and performing feature augmentation and aggregation to obtain second key points; performing non-maximum suppression on the second key points to retain, in each set size neighborhood, the second key points with the highest key point response value as third key points; performing heap sorting on the third key points to retain the third key points with the largest key point response value as visual key points; calculating the Hamming distance between the visual key points and the to-be-matched key points from a CPU, and performing matching to obtain a matched key point pair. The application can reduce the required computing resources for simultaneous localization and mapping, and can be widely applied in the technical field of simultaneous localization and mapping.

Description

Technical Field

[0001] This application relates to the field of simultaneous localization and mapping technology, and in particular to a high-energy-efficiency real-time dense SLAM method and system based on FPGA.

Background Technology

[0002] Simultaneous Localization and Mapping (SLAM) aims to estimate a robot's trajectory and perceive the geometry and appearance of its surroundings using onboard sensors. Current mainstream SLAM solutions typically employ LiDAR or RGB-D cameras to acquire distance information for subsequent localization. However, LiDAR is generally expensive and bulky, making it unsuitable for small, autonomously moving robots, while RGB-D cameras often perform poorly outdoors in direct sunlight.

[0003] Over the past few decades, widely recognized visual SLAM frameworks such as ORB-SLAM and VINS have emerged. These mature frameworks are software-based solutions, running on CPU / GPU platforms that require hundreds of watts of power. This makes them unsuitable for micro-mobile robots with strict limitations on computing resources, power consumption, and workload. To further broaden the applicability of visual SLAM and balance energy efficiency and real-time performance, there has been extensive research on hardware acceleration of visual SLAM. However, most existing FPGA-based visual SLAM accelerators are designed for mature SLAM algorithms that use hand-designed visual features (such as ORB and SIFT). These algorithms cannot provide robust feature representations in complex environments. Furthermore, due to the limited computing resources on FPGAs, the aforementioned FPGA-based accelerators can only construct incomplete maps with sparse point clouds, which cannot meet the requirements of advanced tasks such as autonomous navigation and obstacle avoidance.

Summary of the Invention

[0004] In view of this, this application provides a high-energy-efficiency real-time dense SLAM method and system based on FPGA to reduce the computing resources required for simultaneous localization and mapping.

[0005] One aspect of this application provides a high-energy-efficiency real-time dense SLAM method based on FPGA, comprising:

[0006] Features are extracted from the left and right images of the binocular camera to obtain a left feature map and a right feature map, respectively. Disparity estimation is performed based on the left and right feature maps to obtain a dense disparity map. The dense disparity map is then sent to the CPU for pose estimation and 3D dense mapping.

[0007] Key points are extracted from the left image of the binocular camera as first key points. The first key points are then associated with the left feature map and subjected to feature augmentation and aggregation to obtain second key points. Non-maximum suppression is applied to the second key points to retain the second key points with the highest key point response values in each neighborhood of a set size as third key points. The third key points are then heap sorted to retain the third key points with the largest key point response values as visual key points. The visual key points are then sent to the CPU for pose estimation.

[0008] Calculate the Hamming distance between the visual keypoint and the keypoint to be matched from the CPU, and match the visual keypoint and the keypoint to be matched according to the Hamming distance to obtain a matching keypoint pair; determine whether the matching keypoint pair meets the preset maximum Hamming distance and motion model constraints. If it does, send the matching keypoint pair to the CPU so that the CPU can perform pose estimation based on the matching keypoint pair.

[0009] Optionally, the step of extracting features from the left image and the right image of the stereo camera to obtain a left feature map and a right feature map includes:

[0010] Features are extracted from the left image of the binocular camera using a binary neural network to obtain the left feature map;

[0011] Features are extracted from the right image of the binocular camera using a binary neural network to obtain the right feature map.

[0012] Optionally, extracting the key points from the left image of the stereo camera as the first key points includes:

[0013] The left image from the binocular camera is cached in multiple rows to extract multiple image blocks;

[0014] FAST corner detection is performed on each image block to determine the coordinates of key points that meet the corner detection conditions, and the key points corresponding to each key point coordinate are taken as the first key points.

[0015] Optionally, the step of associating the first key point with the left feature map and performing feature augmentation and aggregation to obtain the second key point includes:

[0016] Find and associate the features corresponding to the first key point in the left feature map;

[0017] The second keypoint is obtained by augmenting and aggregating other image features in the neighborhood of the first keypoint that is associated with the feature.

[0018] Optionally, the step of performing non-maximum suppression on the second keypoint to retain the second keypoint with the highest keypoint response value as the third keypoint within each neighborhood of a predetermined size includes:

[0019] Using the non-maximum suppression range as the side length, the feature map containing the second key point is divided into multiple non-maximum suppression blocks;

[0020] The response value of each of the second key points in each of the non-maximum suppression blocks is compared with the response value of the corresponding maximum point;

[0021] If the response value of a second key point in the non-maximum suppression block is greater than the response value of the corresponding maximum point, then the corresponding maximum point is replaced with that second key point; otherwise, the second key point is discarded; each of the final maximum points is used as the third key point.

[0022] Optionally, the step of performing a heap sort on the third keypoints to retain the third keypoint with the largest keypoint response value as a visual keypoint includes:

[0023] Assign a unique identifier to each of the third key points and temporarily store the descriptors of each of the third key points;

[0024] Each of the third key points is input into a max-heap, sorted from high to low according to the key point response value, and the top-ranked third key points are retained.

[0025] Based on the identifiers of each of the third key points retained in the max-heap, the corresponding descriptors are found, and each of the third key points retained in the max-heap is output as the visual key points.

[0026] Another aspect of this application provides a high-efficiency real-time dense SLAM system based on FPGA, comprising: a CPU and an FPGA; wherein the CPU and the FPGA are connected via an AXI4 bus;

[0027] The FPGA is used to execute the aforementioned FPGA-based high-efficiency real-time dense SLAM method;

[0028] The CPU is used for pose estimation and 3D dense mapping.

[0029] Another aspect of this application provides a high-efficiency real-time dense SLAM device based on FPGA, comprising:

[0030] The first unit is used to extract features from the left image and the right image of the binocular camera to obtain a left feature map and a right feature map; perform disparity estimation based on the left feature map and the right feature map to obtain a dense disparity map, and send the dense disparity map to the CPU so that the CPU can perform pose estimation and 3D dense mapping based on the dense disparity map.

[0031] The second unit is used to extract key points from the left image of the binocular camera as first key points, associate the first key points with the left feature map and perform feature augmentation and aggregation to obtain second key points; perform non-maximum suppression on the second key points to retain the second key points with the highest key point response values in each neighborhood of a set size as third key points; perform heap sort on the third key points to retain the third key points with the largest key point response values as visual key points, and send the visual key points to the CPU for the CPU to perform pose estimation based on the visual key points;

[0032] The third unit is used to calculate the Hamming distance between the visual keypoint and the keypoint to be matched from the CPU, match the visual keypoint and the keypoint to be matched according to the Hamming distance to obtain a matching keypoint pair; determine whether the matching keypoint pair meets the preset maximum Hamming distance and motion model constraints, and if it does, send the matching keypoint pair to the CPU so that the CPU can perform pose estimation based on the matching keypoint pair.

[0033] Another aspect of this application provides an electronic device, including a processor and a memory;

[0034] The memory is used to store programs;

[0035] The processor executes the program to implement the aforementioned method.

[0036] Another aspect of this application provides a computer-readable storage medium storing a program that is executed by a processor to implement the aforementioned method.

[0037] This application also discloses a computer program product or computer program including computer instructions stored in a computer-readable storage medium. A processor of an electronic device can read the computer instructions from the computer-readable storage medium and execute the computer instructions to cause the electronic device to perform the aforementioned method.

[0038] This application includes at least the following beneficial effects:

[0039] This application sends the calculated dense disparity map and visual key points to the CPU, reducing CPU computation and improving efficiency. At the same time, this application can also receive key points to be matched from the CPU and then match them with visual key points. The feature matching process is also performed independently of the CPU, further reducing CPU computation, lowering the size and power consumption of the computing equipment required for simultaneous localization and mapping, and improving real-time performance. It can be applied to micro-sized autonomous mobile robots.

Attached Figure Description

[0040] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0041] Figure 1 is an example structural diagram of a high-efficiency real-time dense SLAM system provided in an embodiment of this application;

[0042] Figure 2 is a flowchart of a high-efficiency real-time dense SLAM method based on FPGA provided in an embodiment of this application;

[0043] Figure 3 is an example flowchart of a high-energy-efficiency real-time dense SLAM method based on FPGA provided in an embodiment of this application;

[0044] Figure 4 is a schematic diagram of a key point detection module provided in an embodiment of this application;

[0045] Figure 5 is a schematic diagram of a feature aggregation module provided in an embodiment of this application;

[0046] Figure 6 is a schematic diagram of a non-maximum suppression module provided in an embodiment of this application;

[0047] Figure 7 is a schematic diagram of a heap sorting module provided in an embodiment of this application;

[0048] Figure 8 is a schematic diagram of a pipeline optimization provided in an embodiment of this application;

[0049] Figure 9 is a structural block diagram of a high-efficiency real-time dense SLAM device provided in an embodiment of this application.

Detailed Implementation

[0050] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0051] It should be noted that although functional modules are divided in the device schematic diagram and the logical order is shown in the flowchart, in some cases, the steps shown or described may be performed in a different order than the module division in the device or the order in the flowchart.

[0052] The terms "first," "second," etc., used in the specification, claims, and accompanying drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0053] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of this application only and is not intended to limit this application.

[0054] First, a high-efficiency real-time dense SLAM system based on FPGA provided in this application embodiment will be described, comprising a CPU and an FPGA; wherein the CPU and FPGA are connected via an AXI4 bus;

[0055] The FPGA is used to execute a high-efficiency real-time dense SLAM method based on FPGA provided in the embodiments of this application.

[0056] The CPU is used for pose estimation and 3D dense mapping.

[0057] As an optional implementation, the CPU and FPGA in this embodiment can be the CPU and FPGA of a Zynq system-on-a-chip, that is, the high-efficiency real-time dense SLAM system of this application can be applied to a Zynq system-on-a-chip.

[0058] Referring to Figure 1, this embodiment provides an example structural diagram of a high-efficiency real-time dense SLAM system.

[0059] To achieve real-time localization and dense 3D mapping on a miniature robot platform with strict requirements on computing device size and power, the software methods deployed on the FPGA side can include feature extraction, depth estimation, and feature matching modules. Simultaneously, a localization backend and a dense 3D reconstruction backend run on the CPU side. The CPU and FPGA sides interact via an AXI4 bus.

[0060] Next, a high-energy-efficiency real-time dense SLAM method based on FPGA provided in this application is described. Referring to Figure 2, this method can be applied to an FPGA and includes steps S100 to S120, as follows:

[0061] S100: Extract features from the left and right images of the binocular camera to obtain a left feature map and a right feature map; perform disparity estimation based on the left and right feature maps to obtain a dense disparity map, and send the dense disparity map to the CPU so that the CPU can perform pose estimation and 3D dense mapping based on the dense disparity map.

[0062] Specifically, referring to Figure 3, the three steps in this embodiment can be executed by the corresponding modules respectively. S100 can be executed by the depth estimation module shown in Figure 3.

[0063] First, the left and right images captured by the stereo cameras are input into the FPGA and then into the depth estimation module. The BNN (Binary Neural Network) in the depth estimation module extracts features from the left and right images. It's important to note that the features extracted in this step are shared by both the disparity estimation part of the depth estimation module and the feature aggregation part of the feature extraction module, to obtain high-quality dense disparity maps and high-quality visual keypoint descriptors, respectively. The dense disparity map is then sent to the CPU via the AXI4 bus for pose estimation and 3D dense mapping.
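The text does not state how the disparity is computed from the two feature maps. As a minimal sketch, assuming each pixel carries one binary feature word and that the images are rectified, a Hamming matching cost along the same row followed by a winner-take-all choice could look like this. The names `hamming`, `disparity_row`, and `max_disp` are hypothetical, and this is not presented as the patented implementation.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary feature words."""
    return bin(a ^ b).count("1")


def disparity_row(left_row, right_row, max_disp):
    """Winner-take-all disparity for one rectified image row.

    left_row / right_row hold one binary feature word (int) per pixel.
    For left pixel x, the candidates are right pixels x - d, d in [0, max_disp].
    """
    out = []
    for x, feat in enumerate(left_row):
        best_d, best_cost = 0, None
        for d in range(0, min(max_disp, x) + 1):
            cost = hamming(feat, right_row[x - d])
            if best_cost is None or cost < best_cost:
                best_d, best_cost = d, cost
        out.append(best_d)
    return out


# Toy input: the right row is the left row shifted by two pixels.
left = [0b0001, 0b0010, 0b0100, 0b1000, 0b1111, 0b1010]
right = left[2:] + [0b0000, 0b0000]
row_disparity = disparity_row(left, right, max_disp=3)
```

In the toy input, every left pixel that has a true counterpart in the right row is assigned a disparity of 2. A real design would typically add cost aggregation and sub-pixel refinement, which are omitted here.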

[0064] Therefore, the implementation method for extracting features from binocular camera images can be further specified as follows:

[0065] Features are extracted from the left image of the binocular camera using a binary neural network to obtain the left feature map;

[0066] Features are extracted from the right image of the binocular camera using a binary neural network to obtain the right feature map.
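The internal layers of the binary neural network are not described in the text. The general reason such networks suit FPGAs is that a dot product over {-1, +1} values reduces to an XNOR followed by a bit count. The sketch below illustrates that identity only; `binary_dot`, `binary_neuron`, and the `threshold` parameter are hypothetical names, not the patent's design.

```python
def binary_dot(a_bits: int, w_bits: int, n: int) -> int:
    """Dot product of two length-n vectors over {-1, +1}.

    Each vector is packed into an int: bit 1 encodes +1, bit 0 encodes -1.
    """
    mask = (1 << n) - 1
    agree = bin(~(a_bits ^ w_bits) & mask).count("1")  # XNOR, then popcount
    return 2 * agree - n  # agreements add +1, disagreements add -1


def binary_neuron(a_bits: int, w_bits: int, n: int, threshold: int = 0) -> int:
    """Binarized activation: 1 if the dot product reaches the threshold, else 0."""
    return 1 if binary_dot(a_bits, w_bits, n) >= threshold else 0
```

For example, `binary_dot(0b1011, 0b1001, 4)` has three agreeing bit positions out of four, so the result is 2.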

[0067] S110: Extract key points from the left image of the binocular camera as first key points; perform feature association and feature augmentation and aggregation on the first key points and the left feature map to obtain second key points; perform non-maximum suppression on the second key points to retain the second key points with the highest key point response values in each neighborhood of a set size as third key points; perform heap sort on the third key points to retain the third key points with the largest key point response values as visual key points, and send the visual key points to the CPU for the CPU to perform pose estimation based on the visual key points.

[0068] Specifically, S110 can be performed by the feature extraction module shown in Figure 3.

[0069] The left image from the binocular camera can be input into the feature extraction module for keypoint detection, which extracts keypoints and their coordinates from the image. In the feature aggregation module, this embodiment performs feature association and feature augmentation on the keypoints and their corresponding feature maps to obtain the corresponding keypoint descriptors. Subsequently, to avoid excessive concentration of visual keypoints in textured areas, this embodiment applies non-maximum suppression to the aggregated keypoints. Non-maximum suppression retains only the visual keypoint with the highest keypoint response value within a certain neighborhood, preventing excessive concentration of keypoints from interfering with localization. Then, to keep the total number of visual keypoints consistent, the keypoints are heap sorted to retain the visual keypoints with the largest keypoint response values. Finally, the FPGA sends the keypoints to the CPU via the AXI4 bus for pose estimation. Simultaneously, the visual keypoints belonging to this frame are sent to the matching module for subsequent feature matching operations.

[0070] Furthermore, the step of extracting key points from the left image of the binocular camera as the first key point includes:

[0071] The left image from the binocular camera is cached in multiple rows to extract multiple image blocks;

[0072] FAST corner detection is performed on each image block to determine the coordinates of key points that meet the corner detection conditions, and the key points corresponding to each key point coordinate are taken as the first key points.

[0073] A schematic diagram of the key point detection module in the feature extraction module is shown in Figure 4. The left image from the binocular camera is buffered to extract image patches, and FAST corner detection is performed on the image patches. The coordinates of key points in the image that meet the corner detection criteria are output to the next module.
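The FAST corner test can be sketched in software as follows. The text names FAST but gives no threshold or arc length, so the values `t=20` and `n=9` are assumptions. The sketch also indexes a whole in-memory image instead of modeling the row buffers, and it returns coordinates only.

```python
# Offsets (dx, dy) of the 16-pixel circle of radius 3 used by FAST.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]


def is_fast_corner(img, x, y, t=20, n=9):
    """Segment test: at least n contiguous circle pixels are all brighter than
    center + t or all darker than center - t (t and n are assumed values)."""
    c = img[y][x]
    ring = [img[y + dy][x + dx] for dx, dy in CIRCLE]
    for sign in (1, -1):  # 1 checks for a brighter arc, -1 for a darker arc
        flags = [sign * (p - c) > t for p in ring]
        run = 0
        for f in flags + flags[:n - 1]:  # wrap so an arc may cross index 0
            run = run + 1 if f else 0
            if run >= n:
                return True
    return False


def detect_keypoints(img, t=20, n=9):
    """Return (x, y) for every pixel, at least 3 pixels from the border,
    that passes the segment test."""
    h, w = len(img), len(img[0])
    return [(x, y) for y in range(3, h - 3) for x in range(3, w - 3)
            if is_fast_corner(img, x, y, t, n)]
```

A 7x7 patch is the smallest input that contains one testable pixel, since the circle reaches three pixels in each direction.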

[0074] Further, the step of associating the first key point with the left feature map and performing feature augmentation and aggregation to obtain the second key point includes:

[0075] Find and associate the features corresponding to the first key point in the left feature map;

[0076] The second keypoint is obtained by augmenting and aggregating other image features in the neighborhood of the first keypoint that is associated with the feature.

[0077] Specifically, a schematic diagram of the feature aggregation module in the feature extraction module is shown in Figure 5. For visual keypoints that meet the FAST corner conditions, the corresponding features are found in the feature map by lookup. To enhance the robustness of the visual features, the feature aggregation module augments and aggregates other image features in the neighborhood of each visual keypoint and outputs the result.
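The text does not specify how neighborhood features are combined. Purely as an illustration, the sketch below assumes that each feature-map cell holds a binary word and that aggregation means concatenating the word at the keypoint with the words of its four direct neighbors. The function name, the `bits` width, and the `offsets` pattern are all hypothetical.

```python
def aggregate_descriptor(feature_map, x, y, bits=8,
                         offsets=((0, 0), (-1, 0), (1, 0), (0, -1), (0, 1))):
    """Build a keypoint descriptor by concatenating the binary feature word at
    (x, y) with the words at the given neighbor offsets.

    Coordinates are clamped at the borders. Each word is truncated to `bits`
    bits before concatenation.
    """
    h, w = len(feature_map), len(feature_map[0])
    word_mask = (1 << bits) - 1
    desc = 0
    for dx, dy in offsets:
        nx = min(max(x + dx, 0), w - 1)
        ny = min(max(y + dy, 0), h - 1)
        desc = (desc << bits) | (feature_map[ny][nx] & word_mask)
    return desc
```

With a 3x3 map of 4-bit words, the center keypoint yields a 20-bit descriptor ordered as center, left, right, up, down.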

[0078] Further, the step of performing non-maximum suppression on the second keypoint to retain the second keypoint with the highest keypoint response value in each neighborhood of a set size as the third keypoint includes:

[0079] Using the non-maximum suppression range as the side length, the feature map containing the second key point is divided into multiple non-maximum suppression blocks;

[0080] The response value of each of the second key points in each of the non-maximum suppression blocks is compared with the response value of the corresponding maximum point;

[0081] If the response value of a second key point in the non-maximum suppression block is greater than the response value of the corresponding maximum point, then the corresponding maximum point is replaced with that second key point; otherwise, the second key point is discarded; each of the final maximum points is used as the third key point.

[0082] Specifically, a schematic diagram of the non-maximum suppression module in the feature extraction module is shown in Figure 6. This embodiment designs an innovative non-maximum suppression method. The input image is divided into multiple non-maximum suppression blocks, using the non-maximum suppression range as the side length. Visual keypoints arrive as a stream, and the non-maximum suppression module compares each one with the maximum temporarily stored in the BRAM for the corresponding block. If the response value of the input visual keypoint is greater than that of the stored maximum, the input visual keypoint replaces the stored maximum; otherwise, the input visual keypoint is discarded. After all visual keypoints associated with a non-maximum suppression block have been input, the last stored maximum is the maximum of that block. To avoid read / write conflicts, the non-maximum suppression module uses registers R[0] and R[1] for read / write buffering. Finally, the retained visual keypoints are input into the heap sorting module to keep the number of visual keypoints consistent between frames.
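The block-wise streaming comparison can be sketched in software as follows. A dictionary stands in for the BRAM table of per-block maxima, and the R[0] / R[1] read / write buffering is not modeled. The function and parameter names are hypothetical.

```python
def block_nms(keypoints, nms_range):
    """Streaming non-maximum suppression over square blocks.

    keypoints: iterable of (x, y, response) tuples arriving in any order.
    nms_range: side length of each suppression block, in pixels.
    Returns one surviving keypoint per non-empty block.
    """
    best = {}  # block index -> current maximum (stand-in for the BRAM table)
    for x, y, resp in keypoints:
        block = (x // nms_range, y // nms_range)
        stored = best.get(block)
        if stored is None or resp > stored[2]:
            best[block] = (x, y, resp)  # new maximum replaces the stored one
        # otherwise the incoming keypoint is discarded
    return list(best.values())
```

Each keypoint costs one lookup and one comparison regardless of the block size, which is consistent with the claim that a large suppression window needs few hardware resources.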

[0083] Further, the step of performing a heap sort on the third keypoints to retain the third keypoint with the largest keypoint response value as a visual keypoint includes:

[0084] Assign a unique identifier to each of the third key points and temporarily store the descriptors of each of the third key points;

[0085] Each of the third key points is input into a max-heap, sorted from high to low according to the key point response value, and the top-ranked third key points are retained.

[0086] Based on the identifiers of each of the third key points retained in the max-heap, the corresponding descriptors are found, and each of the third key points retained in the max-heap is output as the visual key points.

[0087] Specifically, a schematic diagram of the heap sorting module in the feature extraction module is shown in Figure 7. To avoid sorting visual keypoints together with their large descriptors, the heap sorting module in this embodiment assigns a unique identifier to each input visual keypoint and temporarily stores the descriptors of the visual keypoints in the BRAM. The heap sorting part maintains a max-heap, which retains a certain number of visual keypoints with the largest response values, together with their identifiers. The response value and identifier of each visual keypoint input to the heap sorting module are added to the max-heap for sorting. After all visual keypoints have been input, the visual keypoints retained in the max-heap are those with the largest response values. In the output stage, the descriptor of each retained visual keypoint is looked up in the BRAM by its identifier, and the keypoint is then output from the heap sorting module.
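The idea of sorting only small (response, identifier) pairs while the descriptors wait in a side store can be sketched as below. Note one deliberate substitution: the text describes a max-heap, whereas this sketch uses Python's `heapq`, which is a min-heap, bounded to `k` entries so that the weakest retained keypoint sits at the root and is evicted first. The retained set is the same; the hardware structure may differ. All names are hypothetical.

```python
import heapq


def select_top_k(keypoints, k):
    """Keep the k keypoints with the largest response values.

    keypoints: iterable of (response, descriptor) pairs.
    Returns up to k (response, descriptor) pairs, strongest first.
    """
    if k <= 0:
        return []
    descriptor_store = {}  # identifier -> descriptor (stand-in for BRAM)
    heap = []              # holds only (response, identifier) pairs
    for ident, (resp, desc) in enumerate(keypoints):
        descriptor_store[ident] = desc
        if len(heap) < k:
            heapq.heappush(heap, (resp, ident))
        elif resp > heap[0][0]:
            heapq.heapreplace(heap, (resp, ident))  # evict the weakest entry
    ranked = sorted(heap, reverse=True)
    # Output stage: fetch each descriptor by its identifier.
    return [(resp, descriptor_store[ident]) for resp, ident in ranked]
```

For instance, `select_top_k([(5, "a"), (9, "b"), (1, "c"), (7, "d")], 2)` keeps the entries with responses 9 and 7.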

[0088] Traditional 3D reconstruction methods must traverse and perform calculations on all voxels in space. However, observations show that most regions do not contain the surfaces of obstacles or objects, thus eliminating the need for voxel updates. In the backend of dense 3D reconstruction, this application proposes a "coarse-to-fine" 3D reconstruction method: first, it determines whether a large-scale voxel block contains the surfaces of obstacles or objects. If the voxel block is occupied by the surface of an object, it is further subdivided and calculated; if it does not contain the surface of an object, it is skipped without calculation. This method avoids unnecessary calculations on a large number of voxel blocks that do not contain object surfaces, significantly improving the computational efficiency of dense 3D reconstruction.
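The coarse-to-fine traversal can be illustrated with a small sketch. It assumes the observed surface is already given as a set of fine-voxel indices and that each coarse block contains `sub` fine voxels per edge. The per-voxel update of a real reconstruction backend is reduced to recording the voxel. All names are hypothetical.

```python
from itertools import product


def coarse_to_fine(surface_voxels, grid_blocks, sub):
    """Visit fine voxels only inside coarse blocks that contain a surface.

    surface_voxels: set of (x, y, z) fine-voxel indices holding a surface.
    grid_blocks: number of coarse blocks along each axis.
    sub: number of fine voxels along each edge of a coarse block.
    Returns (list of occupied fine voxels found, count of fine voxels visited).
    """
    occupied_blocks = {(x // sub, y // sub, z // sub) for x, y, z in surface_voxels}
    updated, visited = [], 0
    for block in product(range(grid_blocks), repeat=3):
        if block not in occupied_blocks:
            continue  # coarse test failed: skip the whole block
        bx, by, bz = block
        for dx, dy, dz in product(range(sub), repeat=3):
            voxel = (bx * sub + dx, by * sub + dy, bz * sub + dz)
            visited += 1
            if voxel in surface_voxels:
                updated.append(voxel)  # a real backend would update the voxel here
    return updated, visited
```

With a 4x4x4 grid of blocks and `sub=4`, a full traversal touches 4096 fine voxels, while a scene whose surface falls in two blocks touches only 128.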

[0089] S120: Calculate the Hamming distance between the visual keypoint and the keypoint to be matched from the CPU; match the visual keypoint and the keypoint to be matched according to the Hamming distance to obtain a matching keypoint pair; determine whether the matching keypoint pair satisfies the preset maximum Hamming distance and motion model constraints; if satisfied, send the matching keypoint pair to the CPU so that the CPU can perform pose estimation based on the matching keypoint pair.

[0090] Specifically, the matching module receives visual keypoints from the feature extraction module and matches them with keypoints to be matched from the CPU side. Depending on the requirements of the matching stage, the keypoints to be matched on the CPU side may be visual keypoints from the previous frame or global feature map points. In the matching module, the Hamming distance between the keypoints to be matched is first calculated, and then it checks whether the successfully matched data meets pre-set constraints such as the maximum Hamming distance and motion model constraints. Only keypoint pairs that simultaneously meet all constraints pass the matching constraint check and are sent as matching results to the CPU via the AXI4 bus for pose estimation.
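A software sketch of the matching and constraint check follows. The descriptor comparison uses the Hamming distance as described. The motion model constraint is not detailed in the text, so a simple bound on pixel displacement (`max_shift`) is used here as a stand-in assumption. All names are hypothetical.

```python
def match_keypoints(current, candidates, max_hamming, max_shift):
    """Match each current keypoint to its nearest candidate in descriptor space.

    current / candidates: lists of (x, y, descriptor) with int descriptors.
    A pair is kept only if its Hamming distance is at most max_hamming and
    the candidate lies within max_shift pixels on both axes.
    Returns a list of (current_index, candidate_index) pairs.
    """
    matches = []
    if not candidates:
        return matches
    for i, (x, y, desc) in enumerate(current):
        # Nearest neighbor by Hamming distance (XOR, then bit count).
        distances = [bin(desc ^ c[2]).count("1") for c in candidates]
        j = min(range(len(candidates)), key=distances.__getitem__)
        cx, cy, _ = candidates[j]
        # Stand-in motion model: bounded displacement between the two points.
        within_motion = abs(x - cx) <= max_shift and abs(y - cy) <= max_shift
        if distances[j] <= max_hamming and within_motion:
            matches.append((i, j))
    return matches
```

A pair that agrees perfectly in descriptor space is still rejected when the two points are farther apart than the assumed motion bound allows.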

[0091] Furthermore, embodiments of this application perform pipeline optimization for the heterogeneous operations executed on the CPU and the FPGA respectively. Figure 8 is a schematic diagram of the pipeline optimization, in which DE represents depth estimation, FE represents feature extraction, FM represents feature matching, PE represents pose estimation, and PO represents pose optimization. As shown in Figure 8, while the CPU is performing the pose estimation and pose optimization steps for the Nth frame, the FPGA has already begun the depth estimation, feature extraction, and feature matching tasks for the (N+1)th frame, which improves the system's output rate and further enhances real-time performance.
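The effect of overlapping the two stages can be estimated with a simple timing model. It assumes a constant FPGA time per frame (DE + FE + FM) and a constant CPU time per frame (PE + PO), which is an idealization. The function name and the durations used in any example are hypothetical, not measurements from the patent.

```python
def total_time(n_frames, t_fpga, t_cpu, pipelined):
    """Time to process n_frames when each frame needs t_fpga on the FPGA
    followed by t_cpu on the CPU."""
    if n_frames <= 0:
        return 0
    if not pipelined:
        return n_frames * (t_fpga + t_cpu)
    # The FPGA works on frame N+1 while the CPU handles frame N, so once the
    # first frame has filled the pipeline the slower stage sets the pace.
    return t_fpga + t_cpu + (n_frames - 1) * max(t_fpga, t_cpu)
```

With assumed stage times of 20 and 30 time units, ten frames take 500 units sequentially and 320 units when pipelined.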

[0092] Compared with the prior art, the embodiments of this application have the following beneficial effects:

[0093] 1. Existing technologies require computing devices that are too large and consume too much power, making them unsuitable for deployment in micro-robots. This application designs a SLAM acceleration method based on a Zynq system-on-a-chip with an FPGA and CPU. Feature detection, feature matching, and dense disparity map calculation are deployed on the FPGA, thereby improving the system's real-time performance. The FPGA, as a hardware accelerator, significantly improves computational efficiency, thus reducing the size and power consumption of the required computing devices.

[0094] 2. Existing technologies are generally based on manually designed image feature descriptors, which have poor robustness in highly dynamic scenes and are prone to tracking loss. This application uses image descriptors based on binary neural networks, and uses deep neural networks to identify and match key points in the image, thereby improving the robustness and accuracy of localization.

[0095] 3. Due to limitations in computing power, existing technologies generally only construct incomplete maps with sparse point clouds, which cannot meet the requirements of advanced tasks such as autonomous navigation and obstacle avoidance. This application obtains a dense disparity map from the current camera's viewpoint based on binary neural network features and projects the dense disparity map into three-dimensional space to build a map with dense point clouds, thereby meeting the needs of advanced tasks.

[0096] 4. Existing technologies employ direct traversal of small-scale voxel blocks for dense 3D mapping, resulting in high computational complexity and long computation time. This application, in dense 3D mapping, first traverses large-scale voxel blocks and then selects voxel blocks containing object surfaces for further subdivision, thereby reducing invalid traversal of voxel blocks in empty areas and significantly improving the running speed of dense 3D mapping.
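Item 3 above mentions projecting the dense disparity map into three-dimensional space. A minimal sketch of that projection is given below, assuming a rectified stereo pair and a pinhole camera model, where depth is Z = fx * B / d. The intrinsics `fx`, `fy`, `cx`, `cy`, the baseline, and the function name are illustrative assumptions; this application does not state the camera model or parameters here.

```python
from typing import List, Tuple


def disparity_to_points(
    disparity: List[List[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    baseline: float,
) -> List[Tuple[float, float, float]]:
    """Back-project every pixel with a valid (positive) disparity to a 3D point
    in the left camera frame."""
    points = []
    for v, row in enumerate(disparity):
        for u, d in enumerate(row):
            if d <= 0:
                continue  # no valid disparity for this pixel
            z = fx * baseline / d      # depth from disparity
            x = (u - cx) * z / fx      # pinhole back-projection
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Tiny 2x2 disparity map with two valid pixels (hypothetical values).
cloud = disparity_to_points(
    [[0.0, 10.0], [25.0, 0.0]],
    fx=100.0, fy=100.0, cx=0.0, cy=0.0, baseline=0.5,
)
```

Because every valid pixel yields one point, the resulting cloud is dense wherever the disparity map is dense, in contrast to a cloud built only from sparse keypoints.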

[0097] This application deploys feature extraction and feature matching from the visual SLAM framework on an FPGA for hardware acceleration, significantly improving the computational efficiency of both. Image features are extracted using a binary neural network (BNN) to improve the accuracy and robustness of localization and dense mapping. The features extracted by the BNN are used both for feature point matching in visual SLAM and for generating dense disparity maps, which avoids deploying the resource-intensive BNN redundantly on the FPGA and reduces FPGA resource and power consumption. Because the method is deployed on the FPGA, the feature point pairs required for trajectory tracking and the dense disparity maps required for subsequent dense 3D reconstruction are obtained simultaneously with hardware acceleration. During dense 3D reconstruction, this application first traverses large-scale voxel blocks and further subdivides and traverses only those large-scale voxel blocks that contain object surfaces, thus avoiding traversal of invalid voxel space and improving the computational efficiency of dense 3D reconstruction. This application also designs a non-maximum suppression algorithm suited to hardware accelerators; it needs very few FPGA hardware resources to achieve a large non-maximum suppression window, making it well suited to SLAM hardware accelerators.
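The block-wise non-maximum suppression recited in claim 5 can be sketched in software as a single streaming pass that keeps one running maximum per block. This is an illustrative sketch only: the dictionary stands in for whatever per-block registers or buffers the hardware uses, and the function name and tuple layout are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical keypoint layout: (x, y, response value).
Keypoint = Tuple[int, int, float]


def block_nms(keypoints: List[Keypoint], block_size: int) -> List[Keypoint]:
    """Keep only the strongest keypoint inside each block_size x block_size block."""
    best: Dict[Tuple[int, int], Keypoint] = {}
    for x, y, response in keypoints:
        block = (x // block_size, y // block_size)
        current_max = best.get(block)
        # A stronger keypoint replaces the stored maximum of its block;
        # a weaker one is discarded immediately.
        if current_max is None or response > current_max[2]:
            best[block] = (x, y, response)
    return list(best.values())


kept = block_nms([(1, 1, 5.0), (2, 3, 9.0), (10, 1, 4.0), (6, 7, 2.0)], block_size=8)
```

Each incoming keypoint needs only one comparison against one stored value, so the storage grows with the number of blocks rather than with the window area, which is consistent with the low resource use described above.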

[0098] The technical solutions related to this application run on general-purpose computing devices with high-performance CPUs, thus requiring large computing devices with high power consumption. This application uses a Zynq system-on-a-chip with an FPGA, resulting in a small computing device with low power consumption. Related technical solutions use optical flow for front-end localization, thus exhibiting insufficient localization accuracy and robustness. This application uses a feature point method for front-end localization and extracts image features using a binary neural network (BNN), resulting in high localization accuracy and strong robustness. Related technical solutions generate dense disparity maps based on the SGM method for subsequent dense mapping, resulting in poor quality of the dense disparity maps. This application obtains dense disparity maps based on image features extracted by a binary neural network through cost aggregation and other methods, resulting in higher quality dense disparity maps. Related technical solutions perform dense mapping based on voxel block traversal, resulting in long processing times. This application uses a method that first traverses large-scale voxel blocks and then subdivides voxel blocks containing object surfaces, thus significantly reducing the time required for the dense mapping step.

[0099] If a general-purpose CPU / GPU computing platform were used to achieve the same purpose, the power consumption of the computing device would increase by tens of times and its size would also increase significantly, failing to meet the stringent onboard-equipment requirements of micro-sized unmanned systems. If hand-designed feature descriptors were used, localization accuracy and robustness would suffer. If the shared BNN (binary neural network) structure were not used, more on-chip FPGA logic resources would be consumed, making on-chip deployment impossible.

[0100] Referring to Figure 9, this application provides a high-efficiency real-time dense SLAM device based on FPGA, comprising:

[0101] The first unit is used to extract features from the left image and the right image of the binocular camera to obtain a left feature map and a right feature map; perform disparity estimation based on the left feature map and the right feature map to obtain a dense disparity map, and send the dense disparity map to the CPU so that the CPU can perform pose estimation and 3D dense mapping based on the dense disparity map.

[0102] The second unit is used to extract key points from the left image of the binocular camera as first key points, associate the first key points with the left feature map and perform feature augmentation and aggregation to obtain second key points; perform non-maximum suppression on the second key points to retain the second key points with the highest key point response values in each neighborhood of a set size as third key points; perform heap sort on the third key points to retain the third key points with the largest key point response values as visual key points, and send the visual key points to the CPU for the CPU to perform pose estimation based on the visual key points.

[0103] The third unit is used to calculate the Hamming distance between the visual keypoint and the keypoint to be matched from the CPU, match the visual keypoint and the keypoint to be matched according to the Hamming distance to obtain a matching keypoint pair; determine whether the matching keypoint pair meets the preset maximum Hamming distance and motion model constraints, and if it does, send the matching keypoint pair to the CPU so that the CPU can perform pose estimation based on the matching keypoint pair.

[0104] The specific implementation of this high-efficiency real-time dense SLAM device is basically the same as the specific embodiment of the high-efficiency real-time dense SLAM method described above, and will not be repeated here.
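The heap-based selection performed by the second unit (detailed in claim 6) can be sketched as follows. This is a software illustration only: Python's `heapq` is a min-heap, so responses are negated to emulate a max-heap, and the descriptor store is modeled as a dictionary keyed by identifier. The function name, the input layout, and `k` are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def select_top_keypoints(
    keypoints: List[Tuple[float, int]], k: int
) -> List[Tuple[int, float, int]]:
    """keypoints: list of (response, descriptor).
    Returns the k strongest as (identifier, response, descriptor), strongest first."""
    descriptors: Dict[int, int] = {}
    heap: List[Tuple[float, int]] = []
    for ident, (response, descriptor) in enumerate(keypoints):
        # Store the (wide) descriptor separately; only the response and a
        # small identifier travel through the heap.
        descriptors[ident] = descriptor
        heapq.heappush(heap, (-response, ident))  # negate to emulate a max-heap
    selected = []
    for _ in range(min(k, len(heap))):
        neg_response, ident = heapq.heappop(heap)
        # Look the descriptor up again by identifier on output.
        selected.append((ident, -neg_response, descriptors[ident]))
    return selected


top = select_top_keypoints([(0.3, 0b1010), (0.9, 0b1111), (0.5, 0b0001)], k=2)
```

Keeping descriptors out of the heap means each heap entry stays small, which is one plausible reading of why the claim assigns identifiers before sorting.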

[0105] This application also provides an electronic device, which includes a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the above-described high-efficiency real-time dense SLAM method.

[0106] Specifically, the electronic device may be a user terminal or a server.

[0107] This application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described high-energy-efficiency real-time dense SLAM method.

[0108] Memory, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs and non-transitory computer-executable programs. Furthermore, memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory may optionally include memory remotely located relative to the processor, and these remote memories can be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

[0109] This application also discloses a computer program product or computer program, which includes computer instructions stored in a computer-readable storage medium. The processor of an electronic device can read the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, causing the electronic device to perform the method shown in Figure 2.

[0110] In some alternative embodiments, the functions / operations mentioned in the block diagrams may not occur in the order shown in the operation diagrams. For example, depending on the functions / operations involved, two consecutively shown blocks may actually be executed substantially simultaneously, or the blocks may sometimes be executed in reverse order. Furthermore, the embodiments presented and described in the flowcharts of this application are provided by way of example to provide a more comprehensive understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and sub-operations described as part of a larger operation are executed independently.

[0111] Furthermore, although this application is described in the context of functional modules, it should be understood that, unless otherwise stated, one or more of the described functions and / or features may be integrated into a single physical device and / or software module, or one or more functions and / or features may be implemented in a separate physical device or software module. It is also understood that a detailed discussion of the actual implementation of each module is unnecessary for understanding this application. Rather, given the properties, functions, and internal relationships of the various functional modules in the apparatus disclosed herein, the actual implementation of the module will be understood within the scope of conventional technology for an engineer. Therefore, those skilled in the art can implement the application set forth in the claims using ordinary techniques without excessive experimentation. It is also understood that the specific concepts disclosed are merely illustrative and not intended to limit the scope of this application, which is determined by the full scope of the appended claims and their equivalents.

[0112] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause an electronic device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0113] The logic and / or steps represented in the flowcharts or otherwise described herein can, for example, be considered a sequenced list of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transmit programs for use by, or in conjunction with, an instruction execution system, apparatus, or device.

[0114] More specific examples of computer-readable media (a non-exhaustive list) include: electrical connections (electronic devices) having one or more wires, portable computer disk drives (magnetic devices), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable compact disc read-only memory (CD-ROM). Furthermore, computer-readable media can even be paper or other suitable media on which the program can be printed, because the program can be obtained electronically, for example, by optically scanning the paper or other medium, followed by editing, interpreting, or otherwise processing as necessary, and then stored in computer memory.

[0115] It should be understood that various parts of this application can be implemented using hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented using software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, it can be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logical functions on data signals, application-specific integrated circuits (ASICs) having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), etc.

[0116] In the description of this specification, the references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of this application. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.

[0117] Although embodiments of this application have been shown and described, those skilled in the art will understand that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of this application, the scope of which is defined by the claims and their equivalents.

[0118] The above is a detailed description of the preferred embodiments of this application, but this application is not limited to the embodiments described. Those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of this application, and these equivalent modifications or substitutions are all included within the scope defined by the claims of this application.

Claims

1. A high-energy-efficiency real-time dense SLAM method based on FPGA, characterized in that it comprises: extracting features from the left image and the right image of a binocular camera to obtain a left feature map and a right feature map, respectively; performing disparity estimation based on the left feature map and the right feature map to obtain a dense disparity map, and sending the dense disparity map to the CPU so that the CPU can perform pose estimation and 3D dense mapping based on the dense disparity map; extracting key points from the left image of the binocular camera as first key points, associating the first key points with the left feature map, and performing feature augmentation and aggregation to obtain second key points; performing non-maximum suppression on the second key points to retain, in each neighborhood of a set size, the second key point with the highest key point response value as a third key point; performing heap sort on the third key points to retain the third key points with the largest key point response values as visual key points, and sending the visual key points to the CPU so that the CPU can perform pose estimation based on the visual key points; calculating the Hamming distance between the visual key points and the key points to be matched from the CPU, and matching the visual key points with the key points to be matched according to the Hamming distance to obtain matching key point pairs; and determining whether each matching key point pair meets the preset maximum Hamming distance and motion model constraints and, if it does, sending the matching key point pair to the CPU so that the CPU can perform pose estimation based on the matching key point pair.

2. The high-energy-efficiency real-time dense SLAM method based on FPGA according to claim 1, characterized in that the step of extracting features from the left image and the right image of the binocular camera to obtain a left feature map and a right feature map comprises: extracting features from the left image of the binocular camera using a binary neural network to obtain the left feature map; and extracting features from the right image of the binocular camera using a binary neural network to obtain the right feature map.

3. The high-energy-efficiency real-time dense SLAM method based on FPGA according to claim 1, characterized in that the step of extracting key points from the left image of the binocular camera as first key points comprises: caching multiple rows of the left image of the binocular camera to extract multiple image blocks; and performing FAST corner detection on each image block to determine the coordinates of key points that meet the corner detection conditions, and taking the key point at each such coordinate as a first key point.

4. The high-energy-efficiency real-time dense SLAM method based on FPGA according to claim 1, characterized in that the step of associating the first key points with the left feature map and performing feature augmentation and aggregation to obtain second key points comprises: finding, in the left feature map, the features corresponding to each first key point and associating them with that key point; and augmenting and aggregating the other image features in the neighborhood of each first key point that has been associated with features, to obtain the second key points.

5. The high-energy-efficiency real-time dense SLAM method based on FPGA according to claim 1, characterized in that the step of performing non-maximum suppression on the second key points to retain, in each neighborhood of a set size, the second key point with the highest key point response value as a third key point comprises: dividing the feature map containing the second key points into multiple non-maximum suppression blocks, using the non-maximum suppression range as the side length; comparing the response value of each second key point in each non-maximum suppression block with the response value of the maximum point of that block; if the response value of the second key point is greater than the response value of the corresponding maximum point, replacing the corresponding maximum point with the second key point, and otherwise discarding the second key point; and taking each maximum point as a third key point.

6. The high-energy-efficiency real-time dense SLAM method based on FPGA according to claim 1, characterized in that the step of performing heap sort on the third key points to retain the third key points with the largest key point response values as visual key points comprises: assigning a unique identifier to each third key point and temporarily storing the descriptor of each third key point; inputting each third key point into a max-heap, sorting the third key points from high to low by key point response value, and retaining the top-ranked third key points; and finding the corresponding descriptors based on the identifiers of the third key points retained in the max-heap, and outputting the third key points retained in the max-heap as the visual key points.

7. A high-efficiency real-time dense SLAM system based on FPGA, characterized in that it comprises a CPU and an FPGA, wherein the CPU and the FPGA are connected via an AXI4 bus; the FPGA is used to execute the high-energy-efficiency real-time dense SLAM method based on FPGA according to any one of claims 1 to 6; and the CPU is used for pose estimation and 3D dense mapping.

8. A high-efficiency real-time dense SLAM device based on FPGA, characterized in that it comprises: a first unit, used to extract features from the left image and the right image of a binocular camera to obtain a left feature map and a right feature map, perform disparity estimation based on the left feature map and the right feature map to obtain a dense disparity map, and send the dense disparity map to the CPU so that the CPU can perform pose estimation and 3D dense mapping based on the dense disparity map; a second unit, used to extract key points from the left image of the binocular camera as first key points, associate the first key points with the left feature map and perform feature augmentation and aggregation to obtain second key points, perform non-maximum suppression on the second key points to retain, in each neighborhood of a set size, the second key point with the highest key point response value as a third key point, perform heap sort on the third key points to retain the third key points with the largest key point response values as visual key points, and send the visual key points to the CPU so that the CPU can perform pose estimation based on the visual key points; and a third unit, used to calculate the Hamming distance between the visual key points and the key points to be matched from the CPU, match the visual key points with the key points to be matched according to the Hamming distance to obtain matching key point pairs, determine whether each matching key point pair meets the preset maximum Hamming distance and motion model constraints, and, if it does, send the matching key point pair to the CPU so that the CPU can perform pose estimation based on the matching key point pair.

9. An electronic device, characterized in that it comprises a processor and a memory; the memory is used to store a program; and the processor executes the program to implement the method according to any one of claims 1 to 6.

10. A computer-readable storage medium, characterized in that the storage medium stores a program which, when executed by a processor, implements the method according to any one of claims 1 to 6.