An underwater terrain image template matching method combining contrastive learning

By combining contrastive learning with an underwater terrain image template matching method, and through data augmentation, a dual-tower branch feature network, and an attention mechanism, the difficulty of underwater terrain image matching is solved, and high-accuracy matching in complex environments is achieved.

CN115496926B · Active · Publication Date: 2026-06-26 · HARBIN ENG UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HARBIN ENG UNIV
Filing Date
2022-09-16
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Underwater terrain image matching faces challenges such as high local self-similarity, weak texture characteristics, unstructured characteristics, and noise, making it difficult for traditional methods to match, especially in complex environments where it is difficult to provide accurate matching results.

Method used

By combining contrastive learning, we can achieve feature extraction, fusion, and matching through data augmentation, dual-tower branch feature networks, attention mechanisms, and model optimization, thereby improving the model's robustness to intensity and texture differences.

Benefits of technology

It improves the accuracy and robustness of underwater terrain image matching, and can effectively identify and match terrain features in complex environments.

✦ Generated by Eureka AI based on patent content.

Abstract

The application relates to an underwater terrain image template matching method combining contrastive learning, and belongs to the technical field of digital image processing. The method is realized through the following steps: step one, view data augmentation is performed on the input data; step two, features are extracted from the augmented data of step one; step three, samples are extracted from the features of step two; step four, the sample features extracted in step three are fused; and step five, the model is optimized using the features fused in step four. The method realizes end-to-end training in a self-supervised form without additional data labeling, and the contrast between positive and negative samples improves the ability to discriminate interference targets.

Description

Technical Field

[0001] This invention relates to an underwater terrain image template matching method that incorporates the concept of contrastive learning, and belongs to the field of digital image processing technology.

Background Technology

[0002] Underwater topographic image matching plays a crucial role in underwater topographic change monitoring, underwater topographic-assisted positioning and navigation, and other fields. Although image matching, as a core task of computer vision, has made significant progress over the past few decades, processing underwater topographic images remains a challenging task. Because most underwater topography exhibits a gradual change trend under large-scale observation, underwater topographic images often show high local self-similarity, lack significant point features, and instead exhibit weak regional texture characteristics. Furthermore, the unstructured nature of topographic features often makes it difficult to clearly define target boundaries in underwater topographic images. In addition, the raw elevation data inevitably contains anomalies due to sensor errors, environmental noise, irregular carrier movement, and spatiotemporal differences between multiple measurements, leading to unavoidable differences in intensity and texture between topographic images. Simple filtering processes for noise processing can result in the loss of some true details, introducing new noise pollution. Given the unique characteristics of underwater topographic images and the differences in intensity and texture caused by complex environments, effectively improving matching performance has become our focus.

[0003] Image matching mainly includes feature-based and template-based methods. Due to the complexity of underwater terrain images, feature-based matching methods are often difficult to apply alone because they cannot extract stable and repeatable features. Therefore, we will focus on template-based matching methods. In traditional template matching methods, Normalized Cross-Correlation (NCC) and Sum of Squared Differences (SSD) directly use the gray values of the template and the search window to calculate the degree of matching, and are therefore very sensitive to noise and intensity value changes. Deformable Diversity Similarity (DDIS) expresses the similarity between the template and the target window by the diversity of nearest neighbor matching of points between the template and the target window. However, pixel-based matching is sensitive to gray value changes in underwater terrain images, and cannot provide accurate matching results when the intensity values of the matched images differ greatly. Co-occurrence-based template matching (CoTM) proposes a similarity measure by quantizing the co-occurrence frequency of pixel pairs, instead of directly quantizing the intensity difference between pixel values. However, this method performs poorly on grayscale images. Quality-aware template matching (QATM) uses the matching quality between matching pairs to perform soft ranking to quantify the uniqueness of matching pairs, thereby improving matching performance. However, this method depends heavily on the distinguishability of the extracted features, so its performance decreases when the self-similarity of the search images is high.

Summary of the Invention

[0004] This invention addresses the challenge of matching underwater terrain images due to the unique inherent features and the variations in intensity and texture caused by complex environments. It proposes an underwater terrain image template matching method incorporating contrastive learning principles.

[0005] The technical solution adopted by the present invention to solve the above problems is as follows: the underwater terrain image template matching method combining contrastive learning described in the present invention is implemented through the following steps:

[0006] Step 1: Perform view data augmentation on the input data;

[0007] Step 2: Extract features from the augmented data from Step 1;

[0008] Step 3: Extract samples from the features extracted in Step 2;

[0009] Step 4: Fuse the sample features extracted in Step 3;

[0010] Step 5: Optimize the model using the features fused in Step 4.

[0011] Furthermore, in step one, data augmentation is performed by adjusting the Gaussian noise variance and the terrain model resolution. The underwater terrain image is directly used as the model input, and each underwater terrain image is subjected to two random data augmentations to generate two corresponding variant images.

[0012] Furthermore, in step two, a convolutional neural network is used as the backbone of the feature extraction network. Feature extraction is performed through a dual-tower branch feature network. The two tower feature networks share feature parameters. The two variant images generated are respectively processed through the dual-tower branch feature network to extract features and generate two sets of feature maps.

[0013] Furthermore, in step three, sample extraction is based on patches. Two sets of patches of the same size as the window are obtained by sliding a window over the two sets of feature maps. Patches at corresponding positions in the two sets of feature maps are positive samples, while patches at non-corresponding positions are negative samples. Multi-size samples are extracted by adding a mask.

[0014] Furthermore, in step four, feature fusion is based on an attention mechanism. The center position feature of a patch and all position features in the patch are multiplied with the corresponding elements of the variable matrices to generate a query matrix, a key matrix, and a value matrix, respectively. The similarity between the query matrix and the key matrix is then calculated to generate a weight matrix. The weight matrix is multiplied element-wise with the value matrix, and all elements are summed to generate a fusion vector of dimension 1×1×N, where N is the number of channels of the feature vector.

[0015] Furthermore, the model optimization in step five is achieved through the following steps;

[0016] Step i: Calculate the cosine similarity of the two sets of fused vectors to generate a similarity matrix;

[0017] Step ii: Calculate the loss value using the loss function;

[0018] Step iii: Perform gradient backpropagation using mini-batch stochastic gradient descent.

[0019] Step iv: Maximize the similarity value of positive sample pairs and minimize the similarity value of negative sample pairs using the loss function.

[0020] The beneficial effects of this invention are: by combining the feature extraction capabilities of deep models with a contrast-based model training method, the robustness of the model to differences in intensity and texture, as well as its ability to discriminate interference windows, can be improved, thereby increasing the accuracy of matching.

Attached Figure Description

[0021] Figure 1 is an example of the matching result of the present invention;

[0022] Figure 2 is a second example of the matching results of the present invention;

[0023] Figure 3 is a flowchart illustrating the present invention.

Detailed Implementation

[0024] Specific implementation method one: With reference to Figures 1 to 3, this embodiment describes an underwater terrain image template matching method that incorporates contrastive learning principles, implemented through the following steps:

[0025] Step 1: Perform view data augmentation on the input data;

[0026] Step 2: Extract features from the augmented data from Step 1;

[0027] Step 3: Extract samples from the features extracted in Step 2;

[0028] Step 4: Fuse the sample features extracted in Step 3;

[0029] Step 5: Optimize the model using the features fused in Step 4.

[0030] The above steps are used to match underwater terrain image templates.

[0031] Specific Implementation Method Two: With reference to Figures 1 to 3, in this implementation method, data augmentation is performed in step one by adjusting the Gaussian noise variance and the terrain model resolution. The underwater terrain image is directly used as the model input, and each underwater terrain image is randomly augmented twice to generate two corresponding variant images, so that the model framework can analyze and compare the data.
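The two-view augmentation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the block-averaging resolution change, the candidate factors, and the noise range `max_sigma` are all assumptions, and the noise is parameterized by its standard deviation rather than its variance.

```python
import numpy as np

def augment(terrain, rng, max_sigma=0.05, factors=(1, 2, 4)):
    """Return one random variant of a 2-D terrain grid (hypothetical helper)."""
    h, w = terrain.shape
    if any(h % f or w % f for f in factors):
        raise ValueError("grid size must be divisible by every resolution factor")
    f = int(rng.choice(factors))
    # Lower the model resolution by averaging f x f cells, then repeat cells
    # so the variant keeps the original grid size.
    coarse = terrain.reshape(h // f, f, w // f, f).mean(axis=(1, 3))
    variant = np.repeat(np.repeat(coarse, f, axis=0), f, axis=1)
    # Add zero-mean Gaussian noise with a randomly drawn standard deviation.
    sigma = rng.uniform(0.0, max_sigma)
    return variant + rng.normal(0.0, sigma, size=variant.shape)

def two_views(terrain, seed=0):
    """Augment the same terrain image twice to obtain two variant images."""
    rng = np.random.default_rng(seed)
    return augment(terrain, rng), augment(terrain, rng)
```

Calling `two_views` on a 16×16 grid returns two arrays of the same shape that differ in resolution and noise level, which is the property the contrastive pairing relies on.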

[0032] Specific implementation method three: With reference to Figures 1 to 3, in this implementation method, a convolutional neural network is used in step two as the backbone of the feature extraction network. Feature extraction is performed through a dual-tower branch feature network, and the two tower feature networks share feature parameters. The two generated variant images are passed through the dual-tower branch feature network, one per branch, to extract features and generate two sets of feature maps, thereby completing the feature extraction.
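Parameter sharing between the two towers can be illustrated with a toy single-layer network. The sketch below assumes one convolution layer with ReLU and randomly initialized kernels; the class name `TwoTower`, the layer depth, and the kernel size are illustrative choices, since the text does not specify the backbone architecture.

```python
import numpy as np

def conv2d_same(image, kernels):
    """Apply N odd-sized k x k kernels with zero padding; returns an (H, W, N) map."""
    n, k, _ = kernels.shape
    padded = np.pad(image, k // 2)          # zero padding on every side
    h, w = image.shape
    out = np.empty((h, w, n))
    for r in range(h):
        for c in range(w):
            window = padded[r:r + k, c:c + k]
            # Correlate the window with every kernel at once -> N responses.
            out[r, c] = np.tensordot(kernels, window, axes=([1, 2], [0, 1]))
    return np.maximum(out, 0.0)             # ReLU

class TwoTower:
    """Both branches read the same kernel bank, so parameters are shared."""
    def __init__(self, n_channels=8, k=3, seed=0):
        rng = np.random.default_rng(seed)
        self.kernels = rng.normal(0.0, 0.1, size=(n_channels, k, k))

    def __call__(self, view_a, view_b):
        return conv2d_same(view_a, self.kernels), conv2d_same(view_b, self.kernels)
```

Because there is only one set of kernels, feeding the same image to both branches yields identical feature maps, and any parameter update affects both towers at once.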

[0033] Specific implementation method four: With reference to Figures 1 to 3, in this implementation method, sample extraction in step three is based on patches. Two sets of patches of the same size as the window are obtained by sliding a window over the two sets of feature maps. Patches at corresponding positions in the two sets of feature maps are positive samples, while patches at non-corresponding positions are negative samples. Multi-size samples are extracted by adding a mask. The sample extraction is completed through the above steps.
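A sliding-window patch extractor and the resulting positive/negative labelling might look like the sketch below. It assumes square patches, no padding, and a row-major ordering of window positions; the helper names are hypothetical.

```python
import numpy as np

def extract_patches(feature_map, size, stride):
    """Slide a size x size window over an (H, W, N) feature map.

    Returns the patches as an (n_p, size, size, N) array together with the
    top-left position of each window.
    """
    h, w, _ = feature_map.shape
    positions = [(r, c)
                 for r in range(0, h - size + 1, stride)
                 for c in range(0, w - size + 1, stride)]
    patches = np.stack([feature_map[r:r + size, c:c + size] for r, c in positions])
    return patches, positions

def positive_mask(n_p):
    """(n_p, n_p) boolean matrix: entry (i, j) is True when patch i of one
    view and patch j of the other view come from the same window position."""
    return np.eye(n_p, dtype=bool)
```

Applying `extract_patches` with the same `size` and `stride` to both feature maps gives two aligned patch sets, so the diagonal of `positive_mask` marks the positive pairs and every off-diagonal entry marks a negative pair.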

[0034] Specific Implementation Method Five: With reference to Figures 1 to 3, in this implementation method, feature fusion in step four is based on an attention mechanism. The center feature of a patch is selected and multiplied element-wise with all positional features within that patch to generate a query matrix, a key matrix, and a value matrix, respectively. The similarity between the query matrix and the key matrix is then calculated to generate a weight matrix. The weight matrix is then multiplied element-wise with the value matrix, and all elements are summed to generate a fusion vector of dimension 1×1×N, where N is the number of channels in the feature vector. This completes the feature fusion.

[0035] Specific Implementation Method Six: With reference to Figures 1 to 3, in this implementation method, the model optimization in step five is achieved through the following steps:
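One way to read this fusion step is as single-head attention in which the patch center supplies the query and every position in the patch supplies keys and values. The sketch below makes that assumption explicit: it uses learned projection matrices `w_q`, `w_k`, `w_v` (hypothetical names) and scaled dot-product similarity, which may differ in detail from the element-wise formulation in the text.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

def fuse_patch(patch, w_q, w_k, w_v):
    """Fuse an (h, w, N) patch into a single N-dimensional vector."""
    h, w, n = patch.shape
    tokens = patch.reshape(h * w, n)       # every position in the patch
    center = patch[h // 2, w // 2]         # center position feature
    q = center @ w_q                       # query, shape (d,)
    k = tokens @ w_k                       # keys, shape (h*w, d)
    v = tokens @ w_v                       # values, shape (h*w, N)
    weights = softmax(k @ q / np.sqrt(q.size))   # one weight per position
    return weights @ v                     # weighted sum -> fused vector (N,)
```

The output has one entry per channel, matching the 1×1×N fusion vector; for a patch whose positions all carry the same feature, the weights are uniform and the result reduces to that feature projected by `w_v`.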

[0036] Step i: Calculate the cosine similarity of the two sets of fused vectors to generate a similarity matrix;

[0037] Step ii: Calculate the loss value using the loss function;

[0038] Step iii: Perform gradient backpropagation using mini-batch stochastic gradient descent.

[0039] Step iv: Maximize the similarity value of positive sample pairs and minimize the similarity value of negative sample pairs using the loss function.

[0040] The model is optimized through the steps described above.
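Step i can be sketched as a row-normalized matrix product. The function name and the small `eps` guard against zero-length vectors are assumptions added for illustration.

```python
import numpy as np

def cosine_similarity_matrix(z, z_prime, eps=1e-12):
    """Entry (i, j) is the cosine similarity between fused vector i of one
    branch and fused vector j of the other branch.

    z and z_prime have shape (n_p, N): one fused vector per patch.
    """
    zn = z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)
    zpn = z_prime / (np.linalg.norm(z_prime, axis=1, keepdims=True) + eps)
    return zn @ zpn.T
```

When the second set of vectors is a lightly perturbed copy of the first, the largest value in each row sits on the main diagonal, which is the pattern that steps ii to iv train the network to produce for positive pairs.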

[0041] Example

[0042] In this invention, when performing underwater terrain image template matching, the model framework achieves end-to-end training in a self-supervised manner without requiring additional data annotation, and the contrast between positive and negative samples improves the ability to distinguish interfering targets. After preliminary preparation, the training data is augmented by adjusting the Gaussian noise variance and the terrain model resolution. In terms of accuracy, underwater terrain is mainly affected by the quality of the bathymetry data, which results in height deviations. Due to the complexity of the underwater environment, the depth error is usually a superposition of multiple errors (system navigation parameter errors, sensor measurement errors, chart errors, seabed terrain modeling errors, tidal range, etc.) and is typically modeled as a zero-mean Gaussian distribution with variance δ. In terms of content, variations in intensity values, texture, and detail can occur due to the unstructured nature of underwater terrain, the accuracy of depth sensors, the resolution of underwater terrain models, and the influence of noise. These differences in accuracy and content ultimately affect the appearance of the terrain image. The augmented data is then passed through a convolutional neural network backbone for feature extraction. The output feature maps of the upper and lower branches of the backbone network are partitioned at the patch level to construct positive and negative instances; N is the number of channels of a feature map, and W and H are its dimensions. With the stride set to S and the padding value set to P, the two feature maps are densely sampled to yield a set of patches of size w×h, where P_i denotes the i-th patch, (x_i, y_i) are the coordinates of the center point of P_i, and n_p is the total number of extracted patches:

[0043]
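For orientation, the patch count of a sliding window is commonly written as below. This is a hedged reconstruction under the usual convention that the padding P is applied on every side of a W×H feature map, with patch size w×h and stride S; it is not necessarily the patent's own expression.

```latex
n_p = \left( \left\lfloor \frac{W + 2P - w}{S} \right\rfloor + 1 \right)
      \left( \left\lfloor \frac{H + 2P - h}{S} \right\rfloor + 1 \right)
```

For example, with W = H = 8, w = h = 4, S = 2, and P = 0, each factor is ⌊4/2⌋ + 1 = 3, giving n_p = 9 patches.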

[0044] Each patch is treated as an instance sample: patches at corresponding positions in the two feature maps are positive samples of each other, and patches at non-corresponding positions are negative samples of each other. This completes sample extraction. Based on the attention mechanism, a weight matrix W is generated by measuring the similarity between the query vector Q and the key vector K, and a weighted sum of W and the value vector V is then computed to extract information. The calculation of the attention mechanism can be expressed as:

[0045]

[0046] Where d is the dimension of the feature, and the softmax function is used to normalize W.
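A form consistent with these definitions is standard scaled dot-product attention, shown here as an assumption rather than as the patent's exact expression: W is the softmax-normalized similarity between Q and K, scaled by the feature dimension d, and the output is the weighted sum of V.

```latex
W = \operatorname{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d}} \right),
\qquad
\operatorname{Attention}(Q, K, V) = W V
```

Dividing by the square root of d keeps the dot products in a range where the softmax does not saturate as the feature dimension grows.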

[0047] The center vector of P_i is selected and, after a linear transformation, yields Q; K and V are generated from P_i through two different linear transformations. Q, K, and V are then used together as input to the attention module to achieve information fusion within the patch. Because P_i is obtained through a sliding window, a different center vector is obtained for each P_i. This leads to different correlation coefficients and fusion vectors for feature vectors that are repeated between adjacent patches. Furthermore, we introduce multi-head attention to enhance the fitting ability of the features.

[0048] Furthermore, since the sample extraction module needs to extract patches of different sizes, but the internal parameters of the attention calculation module are fixed, to ensure encoding consistency, we uniformly select the largest patch size as input. Then, within the attention module, we adjust the model's region of interest by adding a mask. Specifically, by adjusting the padding value P, we first extract the maximum-size patch at each center point (x_i, y_i), and then hide the excess area with a mask in the attention calculation module, thereby extracting patches of different sizes.

[0049] After feature fusion is completed, cosine similarity is used as the similarity measure between patches, and the score matrix between the output features is calculated, where τ is the temperature scalar, sim(·) denotes cosine similarity, and z and z′ are the sample sets of the upper and lower branches, respectively. The score matrix has dimension n_p × n_p, where n_p is the number of samples extracted. The main diagonal of the score matrix holds the similarity scores of positive instance pairs. The network's loss function is shown below:
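The masking idea can be sketched as a softmax that assigns zero weight to positions outside the desired patch size. The helper names are hypothetical, and the sketch assumes square patches whose size differs from the maximum by an even number so the smaller region stays centered.

```python
import numpy as np

def center_mask(max_size, size):
    """Flattened boolean mask: True inside the centered size x size region
    of a max_size x max_size patch."""
    if size > max_size or (max_size - size) % 2:
        raise ValueError("size must not exceed max_size and must share its parity")
    mask = np.zeros((max_size, max_size), dtype=bool)
    off = (max_size - size) // 2
    mask[off:off + size, off:off + size] = True
    return mask.ravel()

def masked_softmax(scores, mask):
    """Softmax over attention scores where masked-out positions get weight 0."""
    scores = np.where(mask, scores, -np.inf)   # hidden positions -> exp(-inf) = 0
    e = np.exp(scores - scores[mask].max())
    return e / e.sum()
```

With a 5×5 maximum patch and a 3×3 target size, only the 9 central positions receive non-zero attention weight, so the same fixed-size attention module behaves as if it had been given a 3×3 patch.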

[0050]

[0051] Where B represents the number of sample batches, and the two operators in the loss expression denote the trace of a matrix and the sum of all its elements, respectively. Training with this loss function completes the optimization of the model. The trained model can then perform image matching inference.
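A loss built from the trace and the element sum of a temperature-scaled score matrix can be sketched as below. The exact form is an assumption: the sketch takes the score matrix to be exp(sim/τ) and the loss to be the negative log of trace over total sum, averaged over the batch; the function names and the default τ = 0.1 are illustrative.

```python
import numpy as np

def contrastive_loss(similarity, tau=0.1):
    """-log(trace(M) / sum(M)) with M = exp(similarity / tau).

    similarity is an (n_p, n_p) cosine-similarity matrix whose main
    diagonal holds the positive pairs.
    """
    m = np.exp(similarity / tau)
    return float(-np.log(np.trace(m) / m.sum()))

def batch_loss(similarities, tau=0.1):
    """Average the per-image loss over a batch of similarity matrices."""
    return float(np.mean([contrastive_loss(s, tau) for s in similarities]))
```

Under this form the loss approaches 0 when the diagonal dominates the matrix and equals log(n_p) when all scores are identical, so minimizing it raises positive-pair similarity and lowers negative-pair similarity together.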

[0052] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention in any way. Although the present invention has been disclosed above with reference to preferred embodiments, it is not intended to limit the present invention. Any person skilled in the art can make some modifications or alterations to the above-disclosed technical content to create equivalent embodiments without departing from the scope of the present invention. Any simple modifications, equivalent substitutions, and improvements made to the above embodiments without departing from the scope of the present invention, based on the technical essence of the present invention and within the spirit and principles of the present invention, shall still fall within the protection scope of the present invention.

Claims

1. A method for underwater terrain image template matching that incorporates contrastive learning, characterized in that the method is implemented through the following steps: Step 1: Perform view data augmentation on the input data; data augmentation is performed by adjusting the Gaussian noise variance and the terrain model resolution. The underwater terrain image is directly used as the model input, and each underwater terrain image is randomly augmented twice to generate two corresponding variant images. Step 2: Extract features from the augmented data in Step 1; use a convolutional neural network as the backbone of the feature extraction network, and perform feature extraction through a dual-tower branch feature network. The two tower feature networks share feature parameters. The two variant images are passed through the dual-tower branch feature network to extract features and generate two sets of feature maps, respectively. Step 3: Extract samples from the features extracted in Step 2. Sample extraction is based on patches. Two sets of patches of the same size as the window are obtained by sliding a window over the two sets of feature maps. Patches at corresponding positions in the two sets of feature maps are positive samples, while patches at non-corresponding positions are negative samples. Multi-size samples are extracted by adding a mask. Step 4: Fuse the sample features extracted in Step 3. Feature fusion is based on the attention mechanism. The center position feature of the selected patch and all position features in the patch are multiplied with the corresponding elements of the variable matrix to generate the query matrix, key matrix, and value matrix, respectively. The similarity between the query matrix and the key matrix is calculated to generate the weight matrix. The corresponding elements of the weight matrix and the value matrix are multiplied and all elements are added to generate a fusion vector with a dimension of 1×1×N, where N is the number of channels of the feature vector. Step 5: Optimize the model using the features fused in Step 4.

2. The underwater terrain image template matching method combining contrastive learning as described in claim 1, characterized in that: The model optimization in step five is achieved through the following steps; Step i: Calculate the cosine similarity of the two sets of fused vectors to generate a similarity matrix; Step ii: Calculate the loss value using the loss function; Step iii: Perform gradient backpropagation using mini-batch stochastic gradient descent. Step iv: Maximize the similarity value of positive sample pairs and minimize the similarity value of negative sample pairs using the loss function.