A dense image matching method applied in weak texture environments
By using an improved dense image matching method based on ResNet-18 and a linear window attention network, the problem of feature matching error accumulation in low-texture environments is solved, achieving high-precision and low-cost feature extraction and matching, which is suitable for image matching tasks in complex traffic scenarios.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HEFEI INSTITUTE OF PHYSICAL SCIENCE CHINESE ACADEMY OF SCIENCES
- Filing Date
- 2023-06-06
- Publication Date
- 2026-06-19
Figure CN116863168B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the fields of computer vision and image matching technology, and in particular to a dense image matching method applied in weak texture environments.
Background Technology
[0002] Local feature matching between images is fundamental to many 3D computer vision tasks, including Structure-from-Motion (SfM), Simultaneous Localization and Mapping (SLAM), and visual localization. Image matching based on deep learning has become a current research hotspot. The main advantages of using deep learning for feature extraction are: 1. Owing to the nature of convolution and pooling, translation within the image has little impact on the final feature vector, so the extracted features are less prone to overfitting; 2. Compared with other methods, the features extracted by convolutional networks are more stable, which effectively improves matching accuracy; 3. The fitting ability of the overall model can be controlled through the choice of convolutions, pooling, and final output feature vectors. Overfitting can be mitigated by reducing the dimensionality of the feature vectors, while underfitting can be mitigated by increasing the output dimensionality of the convolutional layers, which makes this approach more flexible than other feature extraction methods. Although deep learning methods have gradually replaced traditional image matching approaches in many vision tasks and achieved remarkable results, matching errors from both traditional and deep learning methods persist into subsequent processing stages and gradually accumulate, severely hindering the effective implementation of the final visual task. Incorrect matches corrupt estimates that need to be accurate, causing some visual task results to deviate significantly from reality. Therefore, designing a high-precision and high-efficiency matching method that meets the needs of practical, large-scale real-world applications is the main trend in the future development of image matching.
[0003] The shortcomings of existing technologies are that matching errors, whether from traditional or deep learning methods, are retained in subsequent processing stages and gradually accumulate, severely restricting the effective implementation of the final visual task. Incorrect matches corrupt estimates that need to be accurate, causing some visual task results to deviate significantly from reality. Furthermore, due to factors such as poor texture, repetitive patterns, viewpoint changes, illumination variations, and motion blur, feature detectors may fail to extract a sufficient number of repeatable interest points between images. In low-texture regions there are no repeatable interest points, so even with perfect descriptors it is impossible to find correct correspondences.
Summary of the Invention
[0004] The purpose of this invention is to overcome the shortcomings of the existing technology. To achieve the above objective, a dense image matching method applied in a weak texture environment is adopted to solve the problems mentioned in the background technology.
[0005] A dense image matching method applied in weakly textured environments includes the following steps:
[0006] Step S1: Obtain image data under weak texture environment;
[0007] Step S2: Extract features from the acquired image data based on the local feature extraction network to obtain coarse feature maps and fine feature maps;
[0008] Step S3: Based on the linear window attention network, the global attention layer is combined with the optimal differentiable matching layer to process the coarse feature map and obtain coarse feature matching pairs;
[0009] Step S4: Input the coarse feature matching pairs into the fine matching module built based on the multi-head multilayer perceptron to refine and obtain the final matching point pairs of the image data;
[0010] Step S5: Finally, the obtained matching point pairs are corrected and normalized using softmax to obtain the final matching detection result.
[0011] As a further aspect of the present invention, the specific steps in step S2 include:
[0012] An improved ResNet-18 is used as a local feature extraction network to extract two levels of coarse and fine feature maps from the acquired image data.
[0013] The coarse feature map is set to the feature map at 1 / 8 of the original image size, and the fine feature map is set to the feature map at 1 / 2 of the original image size.
[0014] As a further aspect of the present invention: the local feature extraction network is composed of three types of convolutions: Conv7×7, Conv3×3, and Conv1×1.
[0015] As a further aspect of the present invention, the specific steps in step S3 include:
[0016] Step S31: Construct a linear window-based attention network and perform self-attention in the global scope of the Transformer for subsequent dense prediction feature matching tasks;
[0017] Step S32: Divide the feature maps of the image data extracted by the local feature extraction network into non-overlapping windows, and introduce these windows to reduce computational complexity and storage requirements.
[0018] Specifically, a linear window attention network block, or LWA block, is constructed using a shifted window partitioning method. Each LWA block contains a linear window attention network (LWA) and a linear shifted window attention layer (LSWA), which are used alternately within the LWA block.
[0019] As a further aspect of the present invention, the specific steps for calculating the computational complexity include:
[0020] Let $x \in \mathbb{R}^{N \times C}$ be a sequence of N feature vectors of dimension C. Then the Transformer attention whose computational complexity is analyzed is:
[0021] $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^{T})V$
[0022] where Q, K, and V are N×C matrices;
[0023] First, the similarity $QK^{T}$ is calculated, which yields an N×N matrix with a computational complexity of $O(N^{2}C)$;
[0024] Then, a softmax calculation is performed on each row of the matrix; the complexity over the N rows of the matrix is $O(N^{2})$;
[0025] Finally, the result is multiplied by the matrix V to apply the weights, that is, an N×N matrix is multiplied by an N×C matrix, with a computational complexity of $O(N^{2}C)$;
[0026] In this method, the feature maps extracted from image data by the local feature extraction network are evenly distributed into a non-overlapping window arrangement, and the window-based computation area is controlled.
[0027] Define the input feature map as H×W×C and set the window size to M. There are then $\frac{H}{M} \times \frac{W}{M}$ windows, and the input sequence within each window has length $M^{2}$. Substituting these into the Transformer formula above, Q, K, and V are each $M^{2} \times C$ matrices;
[0028] The complexity of the similarity calculation $QK^{T}$ over all windows, with N = H×W, is:
[0029] $\frac{H}{M} \cdot \frac{W}{M} \cdot O\left((M^{2})^{2} C\right) = O(HWM^{2}C) = O(NM^{2}C)$
[0030] As a further aspect of the present invention, the specific steps in step S4 include:
[0031] Based on fine-grained features, a multi-head multilayer perceptron is used to refine the coarse feature matching pairs.
[0032] For each coarse-level feature matching pair, the position of the pair is first located, then two local windows of size W×W are cropped and passed to the fine matching module based on the multi-head multilayer perceptron. Finally, the final matching point pairs of the image data are obtained.
[0033] Compared with the prior art, the present invention has the following technical advantages:
[0034] The above technical solution employs a dense image matching method using a linear window attention network based on a Transformer network. A local feature extraction network is used to extract two levels of feature maps from the image, followed by coarse-grained and fine-grained processing and refinement to ultimately achieve high-accuracy detection and matching results. This addresses the problem that feature detectors may fail to extract enough repeatable interest points between images due to factors such as poor texture, repetitive patterns, viewpoint changes, illumination variations, and motion blur. It also addresses the issue that in low-texture regions, there are no repeatable interest points, making it impossible to find correct correspondences even with perfect descriptions. Furthermore, it significantly reduces errors in feature detection estimation, extracts more accurate, repeatable, and matchable features, and improves matching accuracy.
Description of the Drawings
[0035] The specific embodiments of the present invention will now be described in detail with reference to the accompanying drawings:
[0036] Figure 1 is a schematic diagram illustrating the steps of the dense image matching method according to an embodiment of this application;
[0037] Figure 2 is a schematic diagram of the overall network structure of the image matching method disclosed in this application;
[0038] Figure 3 is a schematic diagram of the structure of a local feature extraction network according to an embodiment of this application;
[0039] Figure 4 is a schematic diagram of the structure of the LWA block in an embodiment of this application;
[0040] Figure 5 is a schematic diagram of the structure of a linear window attention network according to an embodiment of this application;
[0041] Figure 6 is a schematic diagram of the indoor visualization results of an embodiment disclosed in this application;
[0042] Figure 7 is a schematic diagram of the outdoor visualization results of an embodiment disclosed in this application.
Detailed Implementation
[0043] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0044] Referring to Figure 1 and Figure 2, in this embodiment of the invention, a dense image matching method applied in a weakly textured environment includes the following steps:
[0045] In this embodiment, a linear window attention network (LWA) is proposed, and a feature matching method (TRLWAM) is constructed.
[0046] Step S1: Obtain image data under weak texture environment;
[0047] In this embodiment, image data based on a weak texture environment is acquired through an image acquisition device, and necessary preprocessing is performed;
[0048] Step S2: Extract features from the acquired image data based on a local feature extraction network to obtain coarse and fine feature maps. Specific steps include:
[0049] In this embodiment, an improved ResNet-18 is used as the local feature extraction network to extract two levels of features from the image. We define the coarse-level features as the feature map at 1/8 of the original image size, and the fine-level features as the feature map at 1/2 of the original image size. Figure 3 is a schematic diagram of the local feature extraction network. The network consists of three types of convolutions: Conv7×7, Conv3×3, and Conv1×1.
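As a non-limiting illustration of the two feature levels described above, the following minimal sketch computes the spatial sizes of the coarse (1/8) and fine (1/2) feature maps for a given input image. The function name `feature_map_shapes` and the requirement that the image sides be divisible by 8 are assumptions made for the example and are not stated in this application.

```python
def feature_map_shapes(height, width, coarse_stride=8, fine_stride=2):
    """Return ((coarse_h, coarse_w), (fine_h, fine_w)) for an input image.

    Assumption for this sketch: both image sides are multiples of the
    coarse stride, so the two maps align exactly.
    """
    if height % coarse_stride or width % coarse_stride:
        raise ValueError("image sides must be divisible by the coarse stride")
    coarse = (height // coarse_stride, width // coarse_stride)
    fine = (height // fine_stride, width // fine_stride)
    return coarse, fine


coarse_hw, fine_hw = feature_map_shapes(480, 640)
print(coarse_hw, fine_hw)  # (60, 80) (240, 320)
```

Each coarse cell therefore covers a 4×4 block of fine cells, which is the ratio used when a coarse match is later located on the fine feature map.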
[0050] Step S3: Based on a linear window attention network, the global attention layer is combined with the optimal differentiable matching layer to process the coarse feature map, yielding coarse feature matching pairs. Specific steps include:
[0051] Step S31: Construct a linear window-based attention network and perform self-attention in the global scope of the Transformer for subsequent dense prediction feature matching tasks;
[0052] Step S32: Divide the feature maps of the image data extracted by the local feature extraction network into non-overlapping windows, and introduce these windows to reduce computational complexity and storage requirements.
[0053] Specifically, a linear window attention network block, namely an LWA block, is constructed using a shifted window partitioning method. Figure 4 is a schematic diagram of the structure of an LWA block. Each LWA block contains a linear window attention network (LWA) and a linear shifted window attention layer (LSWA), which are used alternately within the LWA block.
[0054] In a specific implementation, the LWA block is a crucial component of the TRLWAM image matching method of this invention. Dividing the feature maps into non-overlapping windows reduces computational complexity and storage requirements, and cross-window connections are introduced while maintaining the efficient computation of non-overlapping windows. LWA blocks are constructed using a shifted window partitioning method, with each block containing a Linear Window Attention Network (LWA) and a Linear Shifted Window Attention Layer (LSWA), which are used alternately within the block. As shown in Figure 4, in layer l (left), a linear window attention network (LWA) computes attention within each window. In the next layer l+1 (right), the window partitioning is shifted (LSWA), generating new windows. The attention computation in the new windows spans the boundaries of the previous windows in layer l, providing connections between them.
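As a non-limiting illustration of the window partitioning and shifted window partitioning described above, the following minimal sketch operates on a small grid of token indices. The shift amount of half a window and the cyclic wrap-around are assumptions borrowed from common shifted-window designs; this application does not state them, and any masking of wrapped positions is omitted.

```python
def window_partition(grid, m):
    """Split an H x W grid (list of rows) into non-overlapping m x m windows."""
    h, w = len(grid), len(grid[0])
    assert h % m == 0 and w % m == 0, "H and W must be multiples of m"
    windows = []
    for top in range(0, h, m):
        for left in range(0, w, m):
            windows.append([row[left:left + m] for row in grid[top:top + m]])
    return windows


def cyclic_shift(grid, s):
    """Roll the grid up and left by s positions, wrapping around (assumed)."""
    h, w = len(grid), len(grid[0])
    return [[grid[(r + s) % h][(c + s) % w] for c in range(w)]
            for r in range(h)]


# 4 x 4 grid of token indices with window size M = 2.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
regular = window_partition(grid, 2)                    # layer l
shifted = window_partition(cyclic_shift(grid, 1), 2)   # layer l+1

print(regular[0])  # [[0, 1], [4, 5]]
print(shifted[0])  # [[5, 6], [9, 10]]
```

In this example the first shifted window contains tokens 5, 6, 9, and 10, which belong to four different windows of the regular partition, so attention computed in layer l+1 connects windows that were separate in layer l.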
[0055] In this embodiment, Figure 5 is a schematic diagram of the Linear Window Attention Network (LWA). The specific steps for calculating the complexity include:
[0056] Transformers apply self-attention over the global scope, which makes them widely applicable to various vision tasks. However, their quadratic computational complexity poses a major obstacle for feature matching tasks that require dense predictions.
[0057] Let $x \in \mathbb{R}^{N \times C}$ be a sequence of N feature vectors of dimension C. Then the Transformer attention whose computational complexity is analyzed is:
[0058] $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^{T})V$
[0059] where Q, K, and V are N×C matrices;
[0060] First, the similarity $QK^{T}$ is calculated, which yields an N×N matrix with a computational complexity of $O(N^{2}C)$;
[0061] Then, a softmax calculation is performed on each row of the matrix; the complexity over the N rows of the matrix is $O(N^{2})$;
[0062] Finally, the result is multiplied by the matrix V to apply the weights, that is, an N×N matrix is multiplied by an N×C matrix, with a computational complexity of $O(N^{2}C)$;
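As a non-limiting illustration, the three steps above can be sketched in plain Python, with the number of multiplications (or exponentials) counted for each step. The function name `softmax_attention` is hypothetical, and no scaling factor is applied to the similarity because the text above does not mention one.

```python
import math


def softmax_attention(q, k, v):
    """Compute softmax(Q K^T) V for N x C lists and count the work per step."""
    n, c = len(q), len(q[0])
    # Step 1: similarity Q K^T, an N x N matrix (N*N*C multiplications).
    scores = [[sum(q[i][d] * k[j][d] for d in range(c)) for j in range(n)]
              for i in range(n)]
    # Step 2: softmax on each of the N rows (N*N exponentials).
    weights = []
    for row in scores:
        peak = max(row)
        exps = [math.exp(s - peak) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    # Step 3: (N x N) times (N x C) product with V (N*N*C multiplications).
    out = [[sum(weights[i][j] * v[j][d] for j in range(n)) for d in range(c)]
           for i in range(n)]
    counts = {"qk": n * n * c, "softmax": n * n, "av": n * n * c}
    return out, counts


q = [[0.0, 0.0], [0.0, 0.0]]
k = [[0.0, 0.0], [0.0, 0.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out, counts = softmax_attention(q, k, v)
print(out)     # [[2.0, 3.0], [2.0, 3.0]]
print(counts)  # {'qk': 8, 'softmax': 4, 'av': 8}
```

Doubling N quadruples every entry of `counts`, which is the quadratic growth that makes global attention costly for dense prediction.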
[0063] In this embodiment, the linear window attention network (LWA) distributes the feature maps of the image data extracted by the local feature extraction network evenly into a non-overlapping window arrangement, and reduces the computational load of the network by controlling the window-based computation area.
[0064] Define the input feature map as H×W×C and set the window size to M. There are then $\frac{H}{M} \times \frac{W}{M}$ windows, and the input sequence within each window has length $M^{2}$. Substituting these into the Transformer formula above, Q, K, and V are each $M^{2} \times C$ matrices;
[0065] The complexity of the similarity calculation $QK^{T}$ over all windows, with N = H×W, is:
[0066] $\frac{H}{M} \cdot \frac{W}{M} \cdot O\left((M^{2})^{2} C\right) = O(HWM^{2}C) = O(NM^{2}C)$
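As a non-limiting worked example of the effect of window partitioning on the similarity calculation, the following sketch counts the multiplications needed for $QK^{T}$ over the whole feature map and within non-overlapping windows. The feature map size 64×80, the channel count 256, and the window size 8 are illustrative assumptions and are not values given in this application.

```python
def full_attention_mults(h, w, c):
    """Multiplications for Q K^T over all N = h * w tokens at once."""
    n = h * w
    return n * n * c


def window_attention_mults(h, w, c, m):
    """Multiplications for Q K^T inside each of the (h/m) * (w/m) windows."""
    assert h % m == 0 and w % m == 0, "H and W must be multiples of M"
    num_windows = (h // m) * (w // m)
    tokens_per_window = m * m
    return num_windows * tokens_per_window * tokens_per_window * c


h, w, c, m = 64, 80, 256, 8  # illustrative values only
full = full_attention_mults(h, w, c)
windowed = window_attention_mults(h, w, c, m)
print(full // windowed)  # 80
```

The ratio equals N / M² (here 5120 / 64), so for a fixed window size the windowed cost grows linearly with the number of tokens N rather than quadratically.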
[0067] The computational complexity of the weighted multiplication with matrix V is $O(NM^{2}C)$. From the Transformer formula above, the computational cost of softmax attention is $O(N^{2}C)$, because the complete attention matrix must be stored to compute the gradients with respect to the queries, keys, and values. A feature map is therefore also applied to the queries and keys. This yields a positive similarity function and replaces traditional softmax attention with dot-product attention based on feature maps. Attention based on feature maps performs comparably to a complete Transformer, and the sums over the keys and values can be computed in one step and reused for each query. Because no softmax calculation is performed within a linear window, this mapping reduces the computational workload and memory requirements.
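As a non-limiting illustration of dot-product attention based on feature maps, the following minimal sketch computes the key and value sums once and reuses them for every query, and checks the result against the explicit N×N form. The feature map elu(x) + 1 is an assumption taken from the linear attention literature; this application does not state which feature map is used, and the function names are hypothetical.

```python
import math


def phi(x):
    """Assumed feature map elu(x) + 1, which is positive for every input."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def linear_attention(q, k, v):
    """Feature-map attention with key/value summaries computed only once."""
    c, d = len(k[0]), len(v[0])
    fk = [phi(row) for row in k]
    # Summaries shared by all queries: kv is C x D, z has length C.
    kv = [[sum(fk[j][a] * v[j][b] for j in range(len(k))) for b in range(d)]
          for a in range(c)]
    z = [sum(fk[j][a] for j in range(len(k))) for a in range(c)]
    out = []
    for row in q:
        fq = phi(row)
        denom = sum(fq[a] * z[a] for a in range(c))
        out.append([sum(fq[a] * kv[a][b] for a in range(c)) / denom
                    for b in range(d)])
    return out


def quadratic_reference(q, k, v):
    """Same result computed through the explicit N x N similarity matrix."""
    out = []
    for qi in q:
        sims = [sum(a * b for a, b in zip(phi(qi), phi(kj))) for kj in k]
        total = sum(sims)
        out.append([sum(s * vj[b] for s, vj in zip(sims, v)) / total
                    for b in range(len(v[0]))])
    return out


q = [[0.5, -1.0], [2.0, 0.0]]
k = [[1.0, -0.5], [-2.0, 0.3], [0.0, 0.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
fast = linear_attention(q, k, v)
slow = quadratic_reference(q, k, v)
```

Both functions return the same values, but `linear_attention` never forms the N×N matrix: its cost per query depends only on the channel dimensions, which is why the workload and memory grow linearly with the sequence length.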
[0068] Step S4: Input the coarse feature matching pairs into the fine matching module built based on a multi-head multilayer perceptron to refine them and obtain the final matching point pairs of the image data. Specific steps include:
[0069] Based on fine-grained features, a multi-head multilayer perceptron is used to refine the coarse feature matching pairs.
[0070] For each coarse-level feature matching pair, the position of the pair is first located, then two local windows of size W×W are cropped and passed to the fine matching module based on the multi-head multilayer perceptron. Finally, the final matching point pairs of the image data are obtained.
[0071] In this embodiment, a detector-free image matching method (TRLWAM) is designed that exploits the global receptive field of the attention mechanism.
[0072] First, the local feature extraction network extracts two levels of feature maps: coarse and fine. The tensor is then divided into windows by a window partitioning function that takes the window size as a parameter.
[0073] At the coarse level, in the linear window attention block (LWA block), the coarse feature map of shape (N, H, W, C) is partitioned into windows of shape (num_windows*B, window_size, window_size, C). After passing through the LWA block, the dense pixel-level coarse matches M_c are extracted.
[0074] At the fine level, the matching results are refined using a multi-head multilayer perceptron. For each coarse-level match, we first locate its position (i, j) and then crop two local windows of size W×W, which are passed to the fine matching module.
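As a non-limiting illustration of locating a coarse match on the fine feature maps and cropping the two local windows, the following minimal sketch uses single-channel maps stored as lists of rows. The scale factor 4 follows from the 1/8 and 1/2 feature levels; the centering of the window on the scaled position, the zero padding at the borders, the default W = 5, and all function names are assumptions made for the example.

```python
def crop_window(fine_map, center_row, center_col, w, pad=0.0):
    """Crop a w x w window (w odd) centered on a fine-map position."""
    half = w // 2
    rows, cols = len(fine_map), len(fine_map[0])
    window = []
    for r in range(center_row - half, center_row + half + 1):
        line = []
        for c in range(center_col - half, center_col + half + 1):
            inside = 0 <= r < rows and 0 <= c < cols
            line.append(fine_map[r][c] if inside else pad)  # assumed padding
        window.append(line)
    return window


def crop_match_windows(fine_a, fine_b, match, w=5, scale=4):
    """Map a coarse match ((ra, ca), (rb, cb)) to both fine maps and crop."""
    (ra, ca), (rb, cb) = match
    return (crop_window(fine_a, ra * scale, ca * scale, w),
            crop_window(fine_b, rb * scale, cb * scale, w))


# 16 x 16 single-channel fine map whose values encode their own position.
fine = [[r * 16 + c for c in range(16)] for r in range(16)]
win_a, win_b = crop_match_windows(fine, fine, ((1, 1), (0, 0)), w=3)
print(win_a)  # [[51, 52, 53], [67, 68, 69], [83, 84, 85]]
```

The second window is centered on the corner of the map, so its first row and column are filled with the padding value.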
[0075] Step S5: Finally, the obtained matching point pairs are corrected and normalized using softmax to obtain the final matching detection result (i, j′).
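As a non-limiting illustration of a softmax-based correction, the following minimal sketch normalizes a W×W window of matching scores with softmax and takes the expected position as a sub-pixel offset from the window center. Taking the expectation is an assumption about how the correction could be realized; this application states only that the matching point pairs are corrected and normalized using softmax, and the function names are hypothetical.

```python
import math


def softmax_2d(scores):
    """Normalize a W x W score window into probabilities that sum to 1."""
    peak = max(max(row) for row in scores)
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def expected_offset(scores):
    """Expected (row, col) offset from the window center under the softmax."""
    probs = softmax_2d(scores)
    half = len(scores) // 2
    d_row = sum(p * (r - half) for r, row in enumerate(probs) for p in row)
    d_col = sum(p * (c - half) for row in probs for c, p in enumerate(row))
    return d_row, d_col


flat = [[0.0] * 3 for _ in range(3)]
peaked = [[0.0, 0.0, 50.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(expected_offset(flat))    # uniform scores give a zero offset
print(expected_offset(peaked))  # close to (-1.0, 1.0), the top-right cell
```

Adding the offset to the window center on the second image would give the corrected position j′ for a match whose first point i is kept fixed.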
[0076] In this embodiment, we use grouped convolutions to implement the multi-head MLP. MLP layers replace all of the multi-head self-attention blocks, with each group representing one head. The input is a 3D array (N, w×w, f), where w is the size of the local window cropped from the fine feature map. The multi-head MLP is implemented using one-dimensional convolutions, with each feature map arranged in rows so that no convolution is computed across different windows, which reduces the computational workload. In the fine-level refinement module, TRLWAM thus further reduces computational cost by replacing the attention layers with MLP layers.
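As a non-limiting illustration of how a grouped one-dimensional convolution can act as a multi-head layer, the following minimal sketch applies one weight matrix per head to that head's slice of each token's features. A kernel size of 1, the absence of bias and activation, and the function name `grouped_linear` are assumptions; this application does not specify these details.

```python
def grouped_linear(tokens, weights):
    """Apply one weight matrix per head to that head's slice of each token.

    tokens:  list of feature vectors of length f (one per window position)
    weights: list of `heads` square matrices, each (f / heads) x (f / heads)
    This is equivalent to a kernel-size-1 grouped 1D convolution (assumed).
    """
    heads = len(weights)
    f = len(tokens[0])
    assert f % heads == 0, "feature size must be divisible by the head count"
    g = f // heads
    out = []
    for tok in tokens:
        row = []
        for h, wmat in enumerate(weights):
            chunk = tok[h * g:(h + 1) * g]  # features owned by head h
            row.extend(sum(wmat[o][i] * chunk[i] for i in range(g))
                       for o in range(g))
        out.append(row)
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]
tokens = [[1.0, 2.0, 3.0, 4.0]]
result = grouped_linear(tokens, [identity, swap])
print(result)  # [[1.0, 2.0, 4.0, 3.0]]
```

Each head sees only its own group of features and each token is processed independently, so no computation crosses between heads or between window positions.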
[0077] The following describes the process of setting up three different experiments to verify the effectiveness of the method in this invention:
[0078] (1) Homography estimation experiment
[0079] Table 1 below shows that the image matching method of this invention significantly outperforms the other methods at error thresholds of 3, 5, and 10 pixels. Although its results are slightly lower than those of LoFTR, our model requires less computation and captures image details and long-range dependencies well.
[0080] Table 1: Homography estimation on HPatches. The AUC of the corner error is reported as a percentage.
[0081]
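As a non-limiting illustration of the corner error used in homography estimation, the following minimal sketch warps the four image corners with an estimated and a ground-truth homography and averages the distances between them. Averaging over the four corners and the function names are assumptions about the evaluation protocol, which this application does not detail.

```python
import math


def warp_point(hmat, x, y):
    """Apply a 3 x 3 homography (list of rows) to the point (x, y)."""
    xs = hmat[0][0] * x + hmat[0][1] * y + hmat[0][2]
    ys = hmat[1][0] * x + hmat[1][1] * y + hmat[1][2]
    zs = hmat[2][0] * x + hmat[2][1] * y + hmat[2][2]
    return xs / zs, ys / zs


def mean_corner_error(h_est, h_gt, width, height):
    """Mean distance in pixels between the corners warped by both matrices."""
    corners = [(0, 0), (width - 1, 0), (width - 1, height - 1), (0, height - 1)]
    total = 0.0
    for x, y in corners:
        xe, ye = warp_point(h_est, x, y)
        xg, yg = warp_point(h_gt, x, y)
        total += math.hypot(xe - xg, ye - yg)
    return total / len(corners)


h_gt = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
h_est = [[1.0, 0.0, 3.0], [0.0, 1.0, 4.0], [0.0, 0.0, 1.0]]
err = mean_corner_error(h_est, h_gt, 640, 480)
print(err)  # 5.0
```

An estimate is then counted as correct at a threshold of 3, 5, or 10 pixels when this error falls below that threshold.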
[0082] (2) Indoor pose estimation
[0083] Compared with handcrafted and learned matchers, this invention significantly improves pose accuracy, and it achieves comparable performance when optimal transport is used as the differentiable matching layer. As shown in Table 2 and Figure 2, this invention achieves excellent results in indoor pose estimation.
[0084] Table 2 Evaluation of Indoor Pose Estimation
[0085]
[0086] Figure 6 shows the visualization results of finding dense correspondences on textureless walls (first row) and on floors with repeating patterns (second row). Even in blurry areas with low texture or repeating patterns, this invention produces high-quality matches.
[0087] (3) Outdoor pose estimation
[0088] The AUC of the pose error at thresholds (5°, 10°, 20°) is reported, where the pose error is defined as the maximum of the angular errors in rotation and translation. To recover the camera pose, we use RANSAC to solve for the fundamental matrix from the predicted matches. As shown in Table 3, this invention achieves clear advantages over the other methods in outdoor pose estimation.
[0089] Table 3: Evaluation of outdoor pose estimation on MegaDepth
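As a non-limiting illustration of this metric, the following minimal sketch takes the pose error as the maximum of the two angular errors and computes the AUC of the cumulative error curve up to each threshold. The trapezoidal integration of the recall curve is an assumption modeled on common evaluation practice; this application does not describe how the AUC is computed, and the function names are hypothetical.

```python
def pose_error(rotation_err_deg, translation_err_deg):
    """Pose error: the larger of the rotation and translation angular errors."""
    return max(rotation_err_deg, translation_err_deg)


def pose_auc(errors, thresholds):
    """Area under the recall-vs-error curve up to each threshold, normalized."""
    errs = sorted(errors)
    n = len(errs)
    xs = [0.0] + errs
    ys = [0.0] + [(i + 1) / n for i in range(n)]
    aucs = []
    for t in thresholds:
        # Keep curve points with error below t, then extend flat up to t.
        k = sum(1 for e in xs if e < t)
        px = xs[:k] + [t]
        py = ys[:k] + [ys[k - 1]]
        area = sum((px[i + 1] - px[i]) * (py[i + 1] + py[i]) / 2.0
                   for i in range(len(px) - 1))
        aucs.append(area / t)
    return aucs


errors = [pose_error(1.0, 2.0), pose_error(20.0, 3.0)]  # [2.0, 20.0]
print(pose_auc(errors, [5, 10, 20]))
```

With these two image pairs, only the first contributes below the 10° threshold, so the AUC at 10° is 0.45.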
[0090]
[0091] Figure 7 shows the visualization results of outdoor image matching. This invention produces high-quality matches in outdoor scenes as well.
[0092] Based on the above test results, the beneficial effects of the present invention can be concluded:
[0093] This invention proposes a detector-free image matching method based on an attention mechanism that also has a lower computational cost. First, coarse and fine feature maps are extracted by a local feature extraction network, which reduces the network's computational cost. Then, leveraging the advantages of the linear window attention mechanism, global attention is combined with an optimal differentiable matching layer to extract coarse feature matching pairs. Finally, the coarse feature matching pairs are fed into a fine matching module to obtain the final matching point pairs. Experimental results show that this invention not only achieves better matching results but also reduces computational cost, making it more suitable for image matching in complex traffic scenarios.
[0094] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention. The scope of the invention is defined by the appended claims and their equivalents, all of which should be included within the scope of protection of the invention.
Claims
1. A dense image matching method applied in a weakly textured environment, characterized in that it includes the following steps:
Step S1: Obtain image data under a weak texture environment;
Step S2: Extract features from the acquired image data based on the local feature extraction network to obtain coarse feature maps and fine feature maps;
Step S3: Based on the linear window attention network, the global attention layer is combined with the optimal differentiable matching layer to process the coarse feature map, yielding coarse feature matching pairs. The specific steps include:
Step S31: Construct a linear window-based attention network and perform self-attention in the global scope of the Transformer for subsequent feature matching tasks involving dense prediction;
Step S32: Divide the feature maps of the image data extracted by the local feature extraction network into non-overlapping windows, and introduce these windows to reduce computational complexity and storage requirements. Specifically, a linear window attention network block, namely an LWA block, is constructed using a shifted window partitioning method. Each LWA block contains a linear window attention network (LWA) and a linear shifted window attention layer (LSWA), which are used alternately within the LWA block;
Step S4: Input the coarse feature matching pairs into the fine matching module built based on a multi-head multilayer perceptron to refine them and obtain the final matching point pairs of the image data. The specific steps include: Based on fine-grained features, a multi-head multilayer perceptron is used to refine the coarse feature matching pairs. For each coarse-level feature matching pair, the position of the pair is first located, then two local windows of size W×W are cropped and passed to the fine matching module of the multi-head multilayer perceptron to obtain the final matching point pairs of the image data;
Step S5: Finally, the obtained matching point pairs are corrected and normalized using softmax to obtain the final matching detection result.
2. The dense image matching method applied in a weakly textured environment according to claim 1, characterized in that the specific steps in step S2 include: An improved ResNet-18 is used as a local feature extraction network to extract two levels of coarse and fine feature maps from the acquired image data. The coarse feature map is set to the feature map at 1/8 of the original image size, and the fine feature map is set to the feature map at 1/2 of the original image size.
3. The dense image matching method applied in a weakly textured environment according to claim 2, characterized in that the local feature extraction network consists of three types of convolutions: Conv7×7, Conv3×3, and Conv1×1.
4. The dense image matching method applied in a weakly textured environment according to claim 1, characterized in that the specific steps for calculating the computational complexity include: Let $x \in \mathbb{R}^{N \times C}$ be a sequence of N feature vectors of dimension C; then the Transformer attention whose computational complexity is analyzed is: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^{T})V$, where Q, K, and V are N×C matrices; First, the similarity $QK^{T}$ is calculated, which yields an N×N matrix with a computational complexity of $O(N^{2}C)$; A softmax calculation is then performed on each row of the matrix, and the complexity over the N rows of the matrix is $O(N^{2})$; Finally, the result is multiplied by the matrix V to apply the weights, that is, an N×N matrix is multiplied by an N×C matrix, with a computational complexity of $O(N^{2}C)$; In this method, the feature maps extracted from image data by the local feature extraction network are evenly distributed into a non-overlapping window arrangement, and the window-based computation area is controlled. Define the input feature map as H×W×C and set the window size to M; there are then $\frac{H}{M} \times \frac{W}{M}$ windows, and the input sequence within each window has length $M^{2}$. Substituting these into the Transformer formula, Q, K, and V are each $M^{2} \times C$ matrices; The complexity of the similarity calculation $QK^{T}$ over all windows, with N = H×W, is: $\frac{H}{M} \cdot \frac{W}{M} \cdot O\left((M^{2})^{2} C\right) = O(NM^{2}C)$.