Binocular image keypoint matching method based on a hierarchical optimization strategy, and storage medium
By constructing a matching cost volume pyramid and optimizing it level by level, the method addresses the instability of existing keypoint matching algorithms under varying lighting and in untrained scenes, as well as their computational complexity, and achieves high-precision keypoint matching and 3D depth information acquisition.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TONGJI UNIV
- Filing Date
- 2023-12-08
- Publication Date
- 2026-06-26
AI Technical Summary
Existing keypoint matching algorithms are unstable when faced with changes in lighting and untrained scenes. Traditional methods are computationally intensive and have complex hyperparameter settings. Data-driven methods require a large amount of labeled data, while direct deep feature matching algorithms are computationally intensive and unstable.
A binocular image key point matching method based on a hierarchical optimization strategy is adopted. Feature maps are extracted through a pre-trained deep neural network, a matching cost pyramid is constructed, and high-precision key point matching results are obtained through layer-by-layer optimization. Cosine similarity calculation and local extrema are used to determine matching pairs.
The method improves the accuracy and robustness of keypoint matching, reduces sensitivity to untrained scenes, simplifies deployment, reduces computational complexity, and can acquire high-precision 3D depth information with just a consumer-grade camera.
Figure CN117746071B_ABST
Description
Technical Field
[0001] This invention relates to the field of computer vision, and in particular to a binocular image keypoint matching method and medium based on a hierarchical optimization strategy.
Background Technology
[0002] Keypoint matching algorithms aim to find the pixels in two images that correspond to the same scene point. The matched keypoint pairs can be used for 3D scene understanding tasks such as object pose estimation, image matching, real-time localization, and map building. Furthermore, when a keypoint matching algorithm is applied to stereo image pairs acquired by a stereo camera, the matching process is reduced from a 2D search to a 1D search. Combined with the stereo camera's baseline and focal length, the positional relationship of a matched keypoint pair in the images can be used to calculate the distance from the corresponding real-world point to the stereo camera. This distance can then be used for image perspective transformation, image correction, training deep learning-based depth estimation networks, or as seed points in seed-growing depth estimation methods to guide the depth estimation of neighboring pixels.
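The distance calculation mentioned above can be illustrated with the standard pinhole relation for a rectified stereo pair, Z = f · B / d, where f is the focal length in pixels, B is the distance between the two camera centres, and d is the horizontal offset (disparity) between the matched pixels. The following is a minimal sketch under those assumptions; the function name and the numeric values are illustrative and do not come from the patent.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Depth Z = f * B / d for a rectified stereo pair.

    Non-positive disparities carry no depth information and map to inf.
    """
    d = np.asarray(disparity_px, dtype=float)
    depth = np.full(d.shape, np.inf)
    valid = d > 0
    depth[valid] = focal_px * baseline_m / d[valid]
    return depth

# Illustrative values only: 800 px focal length, 12 cm between the cameras.
depths = disparity_to_depth([64.0, 32.0, 0.0], focal_px=800.0, baseline_m=0.12)
```

Halving the disparity doubles the estimated depth, which is why sub-pixel matching errors matter most for distant points.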
[0003] In related technologies, existing keypoint matching methods have several shortcomings:
[0004] Traditional keypoint matching algorithms based on manually designed features, such as SIFT, SURF, and BRISK, first extract keypoints using local feature detectors and descriptors, and then use nearest neighbor search algorithms to match the keypoints. While this approach is relatively clear and simple, the manually designed local feature detectors and descriptors are not robust to changes in environmental factors such as lighting, making it difficult to achieve stable results in practical applications.
[0005] Data-driven keypoint matching algorithms, such as SuperPoint, D2-Net, R2D2, SuperGlue, and NCNet, use deep neural networks for feature extraction and keypoint matching, achieving more accurate results than traditional algorithms. However, methods still based on keypoint detection, such as SuperPoint, D2-Net, and R2D2, often mismatch keypoints in weakly textured regions; methods that directly match image pixels, such as NCNet, often fail to extract effective local information to describe pixels due to scale issues. Furthermore, these data-driven keypoint matching algorithms require large-scale, manually labeled datasets for network training and exhibit poor performance in unfamiliar scenarios.
[0006] Methods that directly use deep features for end-to-end matching, such as DFM (Deep Feature Matching), use a pre-trained neural network feature extractor to extract pixel-level features from the image and perform pixel-level matching. This algorithm uses a hierarchical optimization strategy to reduce computation and improve accuracy. However, at each layer it uses a nearest neighbor search for hierarchical keypoint optimization, which introduces a large number of hyperparameters and increases the difficulty of practical deployment.
Summary of the Invention
[0007] The purpose of this invention is to provide a high-precision binocular image key point matching method and medium based on a hierarchical optimization strategy.
[0008] The objective of this invention can be achieved through the following technical solutions:
[0009] A binocular image keypoint matching method based on a hierarchical optimization strategy includes the following steps:
[0010] S1. Acquire multiple pairs of left and right stereo images with different resolutions captured by the stereo camera;
[0011] S2. Process the multiple pairs of left and right stereo images with a pre-trained deep neural network feature extractor to obtain multiple pairs of left and right feature maps with different resolutions;
[0012] S3. Calculate the matching cost volume of each pair of left and right feature maps using a similarity calculation method, so as to construct a matching cost volume pyramid containing matching cost volumes of different resolutions;
[0013] S4. Based on the matching cost volume pyramid, search the matching cost volume with the lowest resolution to obtain initial keypoint matching pairs;
[0014] S5. In the matching cost volume with the second-lowest resolution, calculate the matching cost between the image patches corresponding to each initial keypoint matching pair, and take the pixel pairs whose matching cost satisfies the local extremum condition as the keypoint matching result of that level;
[0015] S6. Repeat step S5 to optimize the matching cost volumes in the pyramid level by level, from low resolution to high, until the final keypoint matching result is obtained.
[0016] Furthermore, in step S2, the pre-trained deep neural network feature extractor is obtained by training a VGG neural network.
[0017] Further, in step S3, the matching cost volume of each pair of left and right feature maps is calculated using cosine similarity, and the expression for calculating the matching cost is:
[0018] $$C_i(p,d)=\frac{\langle F_i^L(x,y),\,F_i^R(x-d,y)\rangle}{\lVert F_i^L(x,y)\rVert_2\,\lVert F_i^R(x-d,y)\rVert_2}$$
[0019] In the formula, $C_i(p,d)$ represents the matching cost, $x$ and $y$ represent the x-coordinate and y-coordinate of pixel $p$, $d$ represents the disparity candidate, and $F_i^L$ and $F_i^R$ are the $i$-th left and right feature maps.
[0020] Furthermore, in step S4, the initial key point matching pairs are obtained using the nearest neighbor search algorithm.
[0021] Furthermore, the step of obtaining the initial keypoint matching pairs includes:
[0022] Calculate the peak ratio of the matching cost for each pixel in the lowest-resolution matching cost volume;
[0023] Determine whether the peak ratio of the matching cost for each left pixel exceeds a set threshold. If it does, the left pixel and its corresponding right pixel are taken as an initial keypoint matching pair; otherwise, they are not.
[0024] Furthermore, the expression for calculating the peak ratio of the matching cost is as follows:
[0025] $$\mathrm{PKR}_k(p^L)=\frac{C_k(p^L,d_2)}{C_k(p^L,d_1)}$$
[0026] In the formula, $\mathrm{PKR}_k(p^L)$ is the peak ratio of the matching cost at left pixel $p^L$, $k$ represents the number of layers in the feature map, $C_k$ is the matching cost between the left and right feature maps of the $k$-th layer, and $d_1$ and $d_2$ represent the disparities corresponding to the minimum and second-minimum matching costs of left pixel $p^L$.
[0027] Furthermore, in step S5, the steps for constructing the local extremum condition include:
[0028] Dividing the corresponding image patch into upper and lower sub-image patches, and obtaining their left and right neighboring-pixel image patches;
[0029] Constructing the local extremum condition based on the upper and lower sub-image patches and their left and right neighboring-pixel image patches.
[0030] Furthermore, the local extremum condition is as follows:
[0031]
[0032]
[0033] In the formulas, the sub-patch symbols denote the upper and lower sub-image patches of the $(k-1)$-th layer; the neighbor symbols denote the left and right neighboring-pixel image patches corresponding to a sub-image patch; the matching cost takes two sets of pixels as input; and $C_{k-1}$ is the matching cost between the left and right feature maps of the $(k-1)$-th layer.
[0034] Further, in step S5, the matching cost that satisfies the local extremum condition is:
[0035]
[0036] In the formula, the expression gives the matching cost at the left pixel, and $d$ is the disparity.
[0037] The present invention also provides a computer-readable storage medium including one or more programs executable by one or more processors of an electronic device, the one or more programs including instructions for performing the binocular image keypoint matching method based on the hierarchical optimization strategy as described above.
[0038] Compared with the prior art, the present invention has the following beneficial effects:
[0039] (1) This invention constructs a matching cost volume pyramid containing matching cost volumes of different resolutions and optimizes it level by level using a hierarchical optimization strategy, thereby achieving matching accuracy superior to traditional keypoint matching algorithms.
[0040] (2) This invention uses only a pre-trained neural network feature extractor, which eliminates the training process required by data-driven keypoint matching algorithms and improves robustness to unseen scenes.
[0041] (3) This invention exploits the local extremum property of the matching cost when applying the keypoint matching algorithm to stereo images. Compared with the existing DFM method, it avoids introducing a large number of hyperparameters, which greatly facilitates practical deployment.
[0042] (4) The present invention can obtain high-precision three-dimensional depth information using only a consumer-grade camera, which is much cheaper than using LiDAR.
Brief Description of the Drawings
[0043] Figure 1 is a schematic diagram of the method flow of the present invention;
[0044] Figure 2 is a schematic diagram of constructing the matching cost volume pyramid from the feature maps extracted by the neural network feature extractor;
[0045] Figure 3 is a schematic diagram of the image patches in the next layer that correspond to a keypoint match at a given layer;
[0046] Figure 4 is a schematic diagram demonstrating the keypoint matching effect of the present invention.
Detailed Description
[0047] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments. These embodiments are based on the technical solution of the present invention and provide detailed implementation methods and specific operating procedures. However, the scope of protection of the present invention is not limited to the following embodiments.
[0048] This embodiment provides a binocular image keypoint matching method based on a hierarchical optimization strategy. The method optimizes low-resolution keypoint matching pairs between the left and right images acquired by a stereo camera layer by layer to generate high-resolution keypoint matching results. As shown in Figure 1, the method includes the following steps:
[0049] S1. Acquire multiple pairs of left and right stereo images with different resolutions captured by the stereo camera.
[0050] S2. Process the multiple pairs of left and right stereo images with a pre-trained deep neural network feature extractor to obtain multiple pairs of left and right feature maps with different resolutions.
[0051] The pre-trained deep neural network used for feature extraction is a VGG neural network trained on the ImageNet dataset for an image classification task. The feature maps of the left and right images at different resolutions are defined as $\{F_i^L\}_{i=1}^{k}$ and $\{F_i^R\}_{i=1}^{k}$, where $k$ represents the number of layers in the feature map, $F_1$ has the highest resolution, and the resolution of $F_{i+1}$ is 1/2 that of $F_i$.
[0052] S3. Calculate the matching cost volume of each pair of left and right feature maps using a similarity calculation method, so as to construct a matching cost volume pyramid containing matching cost volumes of different resolutions.
[0053] As shown in Figure 2, the matching cost volume pyramid is defined as $\{C_i\}_{i=1}^{k}$. Each matching cost volume is obtained by calculating the cosine similarity between the left and right feature maps:
[0054] $$C_i(p,d)=\frac{\langle F_i^L(x,y),\,F_i^R(x-d,y)\rangle}{\lVert F_i^L(x,y)\rVert_2\,\lVert F_i^R(x-d,y)\rVert_2}$$
[0055] where $x$ and $y$ represent the horizontal and vertical coordinates of pixel $p$, $d$ represents the disparity candidate, $F_i^L$ and $F_i^R$ are the $i$-th left and right feature maps, and $\langle\cdot,\cdot\rangle$ represents the vector dot product.
[0056] Furthermore, the matching cost volume is normalized to the range 0 to 1, where a smaller matching cost indicates a more accurate match.
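Step S3 can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the patented implementation: the function name, the (channels, height, width) array layout, the convention that left pixel (x, y) is compared with right pixel (x - d, y), and the mapping (1 - similarity) / 2 used to obtain a cost in [0, 1] where smaller is better are all assumptions, since the text only states that the volume is normalized to 0 to 1.

```python
import numpy as np

def cosine_cost_volume(feat_left, feat_right, max_disp):
    """Cosine-similarity cost volume for one pyramid level.

    feat_left, feat_right: arrays of shape (C, H, W); requires max_disp < W.
    Returns an array of shape (max_disp + 1, H, W) with values in [0, 1].
    """
    eps = 1e-8
    fl = feat_left / (np.linalg.norm(feat_left, axis=0, keepdims=True) + eps)
    fr = feat_right / (np.linalg.norm(feat_right, axis=0, keepdims=True) + eps)
    _, h, w = fl.shape
    # 1.0 (worst cost) marks positions where x - d falls outside the right image.
    cost = np.ones((max_disp + 1, h, w))
    for d in range(max_disp + 1):
        # Left pixel (x, y) is compared with right pixel (x - d, y).
        sim = np.sum(fl[:, :, d:] * fr[:, :, : w - d], axis=0)
        # Assumed mapping of similarity in [-1, 1] to a cost in [0, 1].
        cost[d, :, d:] = (1.0 - sim) / 2.0
    return cost

# Synthetic check: the left features are the right features shifted by 3 pixels.
rng = np.random.default_rng(0)
right = rng.random((8, 4, 16))
left = np.roll(right, 3, axis=2)
cost = cosine_cost_volume(left, right, max_disp=5)
best = cost[:, 2, 10:].argmin(axis=0)  # lowest-cost disparity per pixel
```

Calling the function once per feature-map pair, from the finest to the coarsest level, would yield the list of volumes that forms the pyramid.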
[0057] S4. Based on the matching cost body pyramid, search on the matching cost body with the lowest resolution to obtain initial key point matching pairs.
[0058] This step uses the nearest neighbor search algorithm to obtain initial keypoint matching pairs on the matching cost volume with the lowest resolution. Specifically:
[0059] First, the peak ratio of the matching cost for each pixel in the $k$-th layer is defined as:
[0060] $$\mathrm{PKR}_k(p^L)=\frac{C_k(p^L,d_2)}{C_k(p^L,d_1)}$$
[0061] where $d_1$ and $d_2$ represent the disparities corresponding to the minimum and second-minimum matching costs of left pixel $p^L$;
[0062] Secondly, for any left pixel $p^L$ whose peak ratio exceeds a set threshold, $p^L$ and its corresponding pixel $p^R$ in the right image are considered a keypoint matching pair;
[0063] Finally, an initial keypoint matching pair is retained only if it is matched bidirectionally, i.e. the left-to-right search from $p^L$ selects $p^R$ and the right-to-left search from $p^R$ selects $p^L$.
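The selection of initial matching pairs in step S4 can be sketched as below. The sketch assumes that the peak ratio is the second-minimum cost divided by the minimum cost, that the cost volume is indexed as `cost[d, y, x]` with left pixel (x, y) paired with right pixel (x - d, y), and that the right-to-left search reuses the same costs re-indexed by right pixel. The helper names `right_view` and `initial_matches` are hypothetical.

```python
import numpy as np

def right_view(cost_left):
    """Re-index a left-referenced volume so that [d, y, x] refers to right pixel (x, y)."""
    cost_right = np.ones_like(cost_left)
    w = cost_left.shape[2]
    for d in range(cost_left.shape[0]):
        cost_right[d, :, : w - d] = cost_left[d, :, d:]
    return cost_right

def initial_matches(cost_left, cost_right, pkr_threshold):
    """Return (x_left, y, d) triples that pass the peak-ratio and bidirectional tests."""
    eps = 1e-8
    ordered = np.sort(cost_left, axis=0)
    pkr = ordered[1] / (ordered[0] + eps)   # assumed: second-minimum over minimum cost
    d_left = cost_left.argmin(axis=0)       # best disparity seen from each left pixel
    d_right = cost_right.argmin(axis=0)     # best disparity seen from each right pixel
    _, h, w = cost_left.shape
    matches = []
    for y in range(h):
        for x in range(w):
            d = int(d_left[y, x])
            x_right = x - d
            if x_right < 0 or pkr[y, x] <= pkr_threshold:
                continue
            if int(d_right[y, x_right]) == d:   # both directions agree
                matches.append((x, y, d))
    return matches

# Tiny example: one distinctive match at left x = 2 with disparity 1.
cost_left = np.full((3, 1, 4), 0.5)
cost_left[1, 0, 2] = 0.05
pairs = initial_matches(cost_left, right_view(cost_left), pkr_threshold=2.0)
```

Only a single threshold is tuned here; pixels whose best and second-best costs are nearly equal are treated as ambiguous and discarded.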
[0064] S5. In the matching cost volume with the second-lowest resolution, calculate the matching cost between the image patches corresponding to each initial keypoint matching pair, and take the pixel pairs whose matching cost satisfies the local extremum condition as the keypoint matching result of that level.
[0065] As shown in Figure 3, the steps are as follows:
[0066] First, locate the image patches in the second-lowest-resolution layer that correspond to each initial keypoint matching pair: an initial keypoint matching pair in the $k$-th layer corresponds, in the $(k-1)$-th layer, to a pair of image patches of size 2×2.
[0067] Next, determine whether the matching cost between the pair of image patches in the $(k-1)$-th layer satisfies local-extremum continuity. Each image patch is first divided into upper and lower sub-image patches of size 2×1, and the left and right neighboring-pixel image patches of each sub-image patch are defined.
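The correspondence between a match at layer k and its pair of 2×2 patches at layer k-1 follows from the resolution doubling between adjacent layers. A minimal sketch, assuming integer pixel coordinates, exact factor-of-two scaling, and a right pixel located at x - d; the function name is hypothetical:

```python
def child_patches(x_left, y, d):
    """Map a layer-k match (left pixel (x, y), disparity d) to its 2x2 patches at layer k-1.

    Each patch is listed row by row, so the first two entries form the upper
    sub-patch and the last two entries form the lower sub-patch.
    """
    x_right = x_left - d
    left_patch = [(2 * x_left + dx, 2 * y + dy) for dy in (0, 1) for dx in (0, 1)]
    right_patch = [(2 * x_right + dx, 2 * y + dy) for dy in (0, 1) for dx in (0, 1)]
    return left_patch, right_patch

left_patch, right_patch = child_patches(x_left=5, y=3, d=2)
upper_left, lower_left = left_patch[:2], left_patch[2:]
```

Pairing any left pixel with any right pixel from the same row of the two patches gives a disparity of 2d - 1, 2d, or 2d + 1 at the finer layer, so the search stays confined to three candidates per row.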
[0068] The local-extremum continuity of the matching cost between image patches is defined as:
[0069]
[0070] where:
[0071]
[0072] and the matching cost takes two sets of pixels as input.
[0073] Pixel pairs whose matching cost satisfies the local extremum condition are selected as the optimization results of the initial keypoint matching pairs at this layer. First, the matching cost that satisfies the local extremum condition is defined as:
[0074]
[0075] Furthermore, for a left pixel and disparity $d$ that satisfy the above conditions, the left pixel and the right pixel it maps to under $d$ are taken as the hierarchical optimization result, at layer $k-1$, of the keypoint matching pair from layer $k$.
[0076] S6. Repeat step S5 to optimize the matching cost volumes in the pyramid level by level, from low resolution to high, until the final keypoint matching result is obtained.
[0077] The hierarchical optimization strategy described in step S5 is repeated $k-1$ times to obtain the keypoint matching results for layers $k-1$, $k-2$, ..., 1. Finally, the optimization result of the keypoint matching pairs in layer 1, which has the highest resolution, is taken as the final keypoint matching result output by this algorithm. Figure 4 shows the keypoint matching results in this embodiment.
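The layer-by-layer loop of steps S5 and S6 can be sketched as follows. Because the exact local extremum condition is not reproduced in this text, the sketch substitutes a simpler stand-in rule: within each row of the 2×2 patches it keeps the lowest-cost pixel pair, and only if that cost is a strict local minimum along the disparity axis. The function names, the `cost[d, y, x]` indexing, and the stand-in rule are assumptions and should not be read as the claimed condition.

```python
import numpy as np

def refine_level(matches, cost_fine):
    """Refine (x, y, d) matches from a coarser layer on the next finer cost volume."""
    n_disp, h, w = cost_fine.shape
    refined = []
    for x, y, d in matches:
        for yy in (2 * y, 2 * y + 1):                      # upper and lower patch rows
            candidates = []
            for xl in (2 * x, 2 * x + 1):                  # left pixels in this row
                for xr in (2 * (x - d), 2 * (x - d) + 1):  # right pixels in this row
                    dd = xl - xr
                    if 0 <= dd < n_disp and yy < h and xl < w and xr >= 0:
                        candidates.append((cost_fine[dd, yy, xl], xl, dd))
            if not candidates:
                continue
            c, xl, dd = min(candidates)
            # Stand-in test: strict local minimum along the disparity axis.
            below = cost_fine[dd - 1, yy, xl] if dd > 0 else np.inf
            above = cost_fine[dd + 1, yy, xl] if dd + 1 < n_disp else np.inf
            if c < below and c < above:
                refined.append((xl, yy, dd))
    return refined

def coarse_to_fine(initial, cost_pyramid):
    """cost_pyramid[0] is the finest volume; cost_pyramid[-1] is where `initial` was found."""
    matches = initial
    for cost_fine in reversed(cost_pyramid[:-1]):
        matches = refine_level(matches, cost_fine)
    return matches

# Two-level example: one coarse match, one distinctive pixel pair at the fine level.
cost_fine = np.full((6, 4, 8), 0.5)
cost_fine[3, 2, 5] = 0.1
cost_coarse = np.full((3, 2, 4), 0.5)
final = coarse_to_fine([(2, 1, 1)], [cost_fine, cost_coarse])
```

In this example the upper row of the patch yields a refined match, while the lower row, where all candidate costs are equal, is rejected as ambiguous.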
[0078] This invention overcomes the problems of low robustness of traditional keypoint matching algorithms to changes in illumination and poor performance of deep learning-based keypoint matching algorithms in untrained scenes. It achieves high-precision keypoint matching using only a pre-trained deep feature extractor.
[0079] This invention provides a convenient and highly accurate keypoint matching method for binocular images, which can be used for various tasks such as image perspective transformation, image correction, and self-supervised training of deep learning-based stereo matching networks.
[0080] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0081] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code. The solutions in the embodiments of the present invention can be implemented using various computer languages, such as the object-oriented programming language Java and the interpreted scripting language JavaScript.
[0082] This invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0083] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0084] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0085] Although preferred embodiments of the invention have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including both the preferred embodiments and all changes and modifications falling within the scope of the invention.
Claims
1. A binocular image keypoint matching method based on a hierarchical optimization strategy, characterized in that it includes the following steps: S1. Acquire multiple pairs of left and right stereo images with different resolutions captured by the stereo camera; S2. Process the multiple pairs of left and right stereo images with a pre-trained deep neural network feature extractor to obtain multiple pairs of left and right feature maps with different resolutions; S3. Calculate the matching cost volume of each pair of left and right feature maps using a similarity calculation method, so as to construct a matching cost volume pyramid containing matching cost volumes of different resolutions; S4. Based on the matching cost volume pyramid, search the matching cost volume with the lowest resolution to obtain initial keypoint matching pairs; S5. In the matching cost volume with the second-lowest resolution, calculate the matching cost between the image patches corresponding to each initial keypoint matching pair, and take the pixel pairs whose matching cost satisfies the local extremum condition as the keypoint matching result of that level; wherein, in step S5, the steps for constructing the local extremum condition include: dividing the corresponding image patch into upper and lower sub-image patches, and obtaining their left and right neighboring-pixel image patches; and constructing, based on the upper and lower sub-image patches and their left and right neighboring-pixel image patches, the local extremum condition, in whose expression the sub-patch symbols denote the upper and lower sub-image patches of the $(k-1)$-th layer, the neighbor symbols denote the left and right neighboring-pixel image patches corresponding to a sub-image patch, the matching cost takes two sets of pixels as input, and $C_{k-1}$ is the matching cost between the left and right feature maps of the $(k-1)$-th layer; and wherein, in step S5, the expression for the matching cost that satisfies the local extremum condition gives the matching cost at the left pixel, with $d$ being the disparity; S6. Repeat step S5 to optimize the matching cost volumes in the pyramid level by level, from low resolution to high, until the final keypoint matching result is obtained.
2. The binocular image keypoint matching method based on a hierarchical optimization strategy according to claim 1, characterized in that, in step S2, the pre-trained deep neural network feature extractor is obtained by training a VGG neural network.
3. The binocular image keypoint matching method based on a hierarchical optimization strategy according to claim 1, characterized in that, in step S3, the matching cost volume of each pair of left and right feature maps is calculated using cosine similarity, and the expression for calculating the matching cost is: $C_i(p,d)=\langle F_i^L(x,y),F_i^R(x-d,y)\rangle / (\lVert F_i^L(x,y)\rVert_2\,\lVert F_i^R(x-d,y)\rVert_2)$. In the formula, $C_i(p,d)$ is the matching cost, $x$ and $y$ represent the x-coordinate and y-coordinate of pixel $p$, $d$ indicates the disparity, and $F_i^L$ and $F_i^R$ are the $i$-th left and right feature maps.
4. The binocular image keypoint matching method based on a hierarchical optimization strategy according to claim 1, characterized in that, in step S4, the nearest neighbor search algorithm is used to obtain the initial keypoint matching pairs.
5. The binocular image keypoint matching method based on a hierarchical optimization strategy according to claim 4, characterized in that the steps for obtaining the initial keypoint matching pairs include: calculating the peak ratio of the matching cost for each pixel in the lowest-resolution matching cost volume; and determining whether the peak ratio of the matching cost for each left pixel exceeds a set threshold, wherein if it does, the left pixel and its corresponding right pixel are taken as an initial keypoint matching pair, and otherwise they are not.
6. The binocular image keypoint matching method based on a hierarchical optimization strategy according to claim 5, characterized in that the expression for calculating the peak ratio of the matching cost is: $\mathrm{PKR}_k(p^L)=C_k(p^L,d_2)/C_k(p^L,d_1)$. In the formula, $\mathrm{PKR}_k(p^L)$ is the peak ratio of the matching cost at left pixel $p^L$, $k$ indicates the number of layers in the feature map, $C_k$ is the matching cost between the left and right feature maps of the $k$-th layer, and $d_1$ and $d_2$ represent the disparities corresponding to the minimum and second-minimum matching costs of left pixel $p^L$.
7. A computer-readable storage medium, characterized in that it includes one or more programs executable by one or more processors of an electronic device, said one or more programs including instructions for performing the binocular image keypoint matching method based on the hierarchical optimization strategy as described in any one of claims 1-6.