A method for constructing a meta-learning-based heterogeneous image matching network and a method for matching
By using a meta-learning-based heterogeneous image matching network and employing feature pairing and saliency judgment branches to train a surface feature differential meta-selection network, the positioning accuracy problem of heterogeneous image matching in complex environments is solved, achieving autonomous matching capability and making it suitable for aircraft navigation in any scenario.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- XI AN JIAOTONG UNIV
- Filing Date
- 2023-10-26
- Publication Date
- 2026-06-26
AI Technical Summary
Existing heterogeneous image matching technologies based on road network features have insufficient positioning accuracy in areas with road changes or missing features, and their application range is limited, making it difficult to achieve high-precision aircraft navigation in electronic warfare environments.
A meta-learning-based heterogeneous image matching network construction method is adopted. Through feature pairing branch, saliency judgment branch and merging processing module, meta-learning is used to construct the final loss function, train the surface feature differential meta-selection network, and select highly salient regions for matching.
It achieves robust matching between aerial images and pre-stored satellite images in unknown scenarios, improving the aircraft's navigation and positioning capabilities in complex environments. It is applicable to any scenario and has autonomous matching capabilities.
Figure CN117436479B_ABST
Description
Technical Field
[0001] This invention belongs to the field of navigation, positioning, and image matching, and specifically relates to a method for constructing a heterogeneous image matching network based on meta-learning and to a matching method.
Background Technology
[0002] Positioning is an indispensable and crucial function for all airborne platforms, including aircraft, missiles, and airships. Only with accurate coordinates of the aircraft in the geographic coordinate system can tasks such as flight navigation and target detection be reliably performed. Currently, high-precision positioning for airborne platforms primarily relies on satellite navigation systems. However, due to inherent limitations of satellite navigation, its usability is significantly reduced in harsh environments such as those involving electronic warfare. With the continuous development and advancement of satellite signal suppression and decoy techniques, satellite jamming equipment is becoming increasingly sophisticated and widespread. Even the most advanced reconnaissance drones are not immune to being deceived and captured by satellite jamming devices, and ordinary civilian drones are inevitably vulnerable to satellite signal interference. Against this backdrop, all types of aircraft will face increasingly higher risks of navigation interference during flight, potentially leading to fatal flight accidents.
[0003] Image matching is the foundation of visual positioning. Road network-based visual geolocation is an existing image matching method. This technology exploits the wide distribution and stability of road networks to train an aircraft to intelligently extract road networks from the ground. During flight, the extracted road network is then matched against a pre-stored road reference map to geolocate the aircraft. The method is applied in three steps: 1) preparing a road distribution reference map of the region in which the aircraft flies; 2) extracting the road network during flight; 3) matching the extracted road network against the road network reference map of the flight area.
[0004] However, this heterogeneous image matching technique based on road network features has the following drawbacks: 1) It requires the preparation of a relatively accurate road reference map, which can cause positioning errors when roads change; 2) The method has a limited range of applicability, and matching accuracy cannot be guaranteed in areas where road features are missing or the road distribution is monotonous. Therefore, overcoming these shortcomings has become an urgent problem to be solved in this field.
Summary of the Invention
[0005] To address the aforementioned problems in the existing technology, this invention provides a method for constructing a heterogeneous image matching network based on meta-learning and a matching method thereon. The technical problem to be solved by this invention is achieved through the following technical solution:
[0006] In a first aspect, embodiments of the present invention provide a method for constructing a heterogeneous image matching network based on meta-learning, comprising:
[0007] A target network is constructed, comprising a feature pairing branch, a saliency judgment branch, and a merging processing module. The feature pairing branch is used to obtain the point-pair matching probability from the input paired heterogeneous images. The paired heterogeneous images include an aerial image and a satellite image of the same region. The saliency judgment branch is used to calculate the saliency value of each point in the satellite feature map output by the feature pairing branch, to characterize whether the corresponding point is a credible landmark. The merging processing module is used to perform matching based on the point-pair matching probability and the credible landmarks, and to output a confidence matrix.
[0008] Based on the obtained saliency values, meta-learning is used to construct the final loss function of the target network;
[0009] A heterogeneous image dataset carrying point-pair matching annotation information is obtained. A meta-dataset is constructed by filtering salient points. The target network is trained using the heterogeneous image dataset, the meta-dataset, and the final loss function to obtain a trained surface feature differential meta-selection network. The surface feature differential meta-selection network is used to output the corresponding confidence matrix for input unknown paired heterogeneous images to obtain image matching and registration results.
[0010] In one embodiment of the present invention, the feature pairing branch includes a feature extractor, a position encoding module, a self-attention layer, and a cross-attention layer;
[0011] Accordingly, the satellite feature map is the output of the satellite image after passing through the feature extractor.
[0012] In one embodiment of the present invention, the step of constructing the final loss function of the target network based on the obtained saliency value using meta-learning includes:
[0013] Define the salient points on the satellite feature map;
[0014] The loss function of the target network is designed using the definition of the salient points, and an explicit initial loss function is constructed.
[0015] Meta-learning is used to rewrite the initial loss function into an implicitly expressed final loss function.
[0016] In one embodiment of the present invention, determining the definition of salient points on the satellite feature map includes:
[0017] A salient point is defined as one that is salient in both the spatial and channel dimensions. The definition of a salient point is expressed as follows:
[0018] $(i,j)$ is a salient point $\iff$ $D^k_{ij}$ is a local maximum in $D^k$, where:
[0019] The input satellite image is denoted $I$, and the satellite feature map is denoted $F$, where $F$ is a three-dimensional tensor, $F \in \mathbb{R}^{h \times w \times n}$. Here $h \times w$ is the size of the feature map output by one channel of the feature extractor, $h$ and $w$ are user-defined positive integers, and $n$ is the number of channels, also a positive integer. From a channel perspective, $F$ is a dense set of $h \times w$ descriptor vectors $d$, where $d_{ij} = F_{ij:}$, $1 \le i \le h$, $1 \le j \le w$, $d \in \mathbb{R}^n$. From a spatial perspective, $F$ is a set of $n$ feature response maps $D^k$ obtained from feature detection on the $n$ channels, where $D^k = F_{::k}$ and $D^k \in \mathbb{R}^{h \times w}$. $(i,j)$ denotes pixel coordinates in the satellite feature map, and $D^k_{ij}$ denotes the value of $D^k$ at point $(i,j)$. $(i,j)$ being a salient point is equivalent to $D^k_{ij}$ being the maximum within a local region of $D^k$, the local region being a grid of preset size, such as 3×3. $m$ denotes the channel index, $m \in [1,n]$, and $k$ is the value of $m$ that maximizes $D^m_{ij}$, i.e. $k = \arg\max_m D^m_{ij}$.
[0020] In one embodiment of the present invention, the loss function of the target network is designed using the definition of the salient points, and an explicitly expressed initial loss function is constructed, including:
[0021] Based on the definition of the salient point, the saliency value $s_{ij}$ is used to quantitatively characterize the saliency of each point on the satellite feature map, where $s_{ij}$ is expressed as:
[0022]
[0023] where the spatial term denotes the saliency of point $(i,j)$ in the spatial dimension; the channel term denotes the saliency of point $(i,j)$ in the channel dimension; $\gamma_{ij}$ denotes the combined saliency of point $(i,j)$ over the spatial and channel dimensions; $s_{ij}$ denotes the normalized saliency value of point $(i,j)$ on the satellite feature map; $(i',j')$ denotes points on the satellite feature map other than $(i,j)$; $N(i,j)$ denotes the 3×3 neighborhood centered on point $(i,j)$; $D^m_{ij}$ denotes the value of $D^m$ at point $(i,j)$; and $\gamma_{i'j'}$ denotes the combined saliency of point $(i',j')$.
[0024] Introducing $s_{ij}$ into the design of the loss function of the target network, the explicitly expressed initial loss function is constructed as follows:
[0025]
[0026] where L1 denotes the initial loss function; the annotation term denotes the point-pair annotation information of the paired heterogeneous images in the dataset; the positive-sample term denotes the points whose annotation value equals 1; the negative-sample term denotes the x-coordinate of points that lie in the same column as a positive sample but whose annotation value is not 1; the pairing probability is the probability, output by the feature pairing branch, that a positive sample is a pairing point; the negative-sample probability is the matching probability output by the feature pairing branch for a negative sample; $\lambda$ denotes the normalization parameter for the negative-sample loss; and $s_{ij}$ denotes the normalized saliency value of point $(i,j)$ on the satellite feature map.
[0027] In one embodiment of the present invention, rewriting the initial loss function into an implicitly expressed final loss function using meta-learning includes:
[0028] The feature pairing branch is redefined as N(ω); where ω is the parameter of the feature pairing branch;
[0029] A single-layer MLP network V(·,θ) is introduced, and the saliency judgment branch is redefined as V(N(ω),θ);
[0030] Based on the redefined feature pairing branch and saliency judgment branch, the contents of the initial loss function are rewritten as follows:
[0031]
[0032]
[0033] Based on the rewritten content, the implicitly expressed final loss function is:
[0034]
[0035] where L2 denotes the final loss function.
[0036] In one embodiment of the present invention, the target network is trained using the heterogeneous image dataset, the meta-dataset, and the final loss function to obtain a trained surface feature differential meta-selection network, including:
[0037] Using the heterogeneous image dataset, the meta-dataset, and the final loss function, the target network is trained using multiple gradient descent steps to optimize parameters ω and θ, resulting in a trained surface feature differential meta-selection network.
[0038] In one embodiment of the present invention, the feature extractor is a ResNet+FPN network; the ResNet+FPN network outputs a feature map at one-eighth the size of the original image.
[0039] In one embodiment of the present invention, the position encoding module uses the Sinusoidal position encoding method.
[0040] In a second aspect, embodiments of the present invention provide a heterogeneous image matching method based on meta-learning, including:
[0041] Obtain the paired heterogeneous images to be matched;
[0042] The heterogeneous images to be matched are input into the surface feature differential meta-selection network to obtain the corresponding confidence matrix. The feature point pairing result is obtained based on the confidence matrix, and the image matching and registration result is obtained based on the feature point pairing result. The surface feature differential meta-selection network is obtained based on the meta-learning-based heterogeneous image matching network construction method described in the first aspect.
[0043] The beneficial effects of this invention are:
[0044] The meta-learning-based heterogeneous image matching network construction method provided by this invention first constructs a target network comprising a feature pairing branch, a saliency judgment branch, and a merging processing module. The feature pairing branch extracts features from the satellite and aerial images and then pairs the extracted feature points. The saliency judgment branch judges the saliency of the extracted feature points and selects those with high saliency; high-saliency points are generally distinctive points in the image, such as buildings and roads with regional semantic features. The merging processing module performs threshold filtering and geometric consistency checks on the paired feature points to obtain the mapping relationship between the images. A self-designed meta-learning-based loss function is used during training. The training dataset is a heterogeneous image dataset carrying point-pair matching annotation information. Because precise annotation is costly, this dataset may contain many wrong or inaccurate annotations; therefore, a small number of reliable images with accurate point-pair annotations are selected by manual screening as the meta-dataset. The target network is trained using the heterogeneous image dataset, the meta-dataset, and the final loss function to obtain a trained surface feature differential meta-selection network, which outputs the corresponding confidence matrix for unknown paired heterogeneous images. Because this invention employs meta-learning, the saliency judgment branch can select matching points in highly salient regions. The meta-learning-based heterogeneous image matching method implemented with this trained network therefore has the ability to self-select matching regions. When matching unknown paired heterogeneous images, it can solve the robust matching problem between aerial images and pre-stored satellite images during flight, can select feature-robust regions more quickly, and is applicable to any scenario. It is a novel intelligent and autonomous matching method for situations in which an aircraft's other navigation and positioning methods fail.
Description of the Drawings
[0045] Figure 1 is a flowchart illustrating a method for constructing a heterogeneous image matching network based on meta-learning, as provided in an embodiment of the present invention;
[0046] Figure 2 is a schematic diagram of network optimization during the training process of this invention;
[0047] Figure 3 is a schematic diagram illustrating the actual application process of the present invention;
[0048] Figure 4 is a flowchart illustrating a heterogeneous image matching method based on meta-learning provided in an embodiment of the present invention;
[0049] Figure 5(a) shows the distribution of feature points extracted using an existing feature extraction network from an aerial photograph of the first region;
[0050] Figure 5(b) shows the distribution of feature points extracted using an existing feature extraction network from a satellite image of the first region;
[0051] Figure 5(c) shows the registration result calculated from the paired feature points extracted from Figure 5(a) and Figure 5(b);
[0052] Figure 6(a) shows the distribution of feature points extracted using an existing feature extraction network from an aerial photograph of the second region;
[0053] Figure 6(b) shows the distribution of feature points extracted using an existing feature extraction network from a satellite image of the second region;
[0054] Figure 6(c) shows the registration result calculated from the paired feature points extracted from Figure 6(a) and Figure 6(b);
[0055] Figure 7(a) shows the distribution of feature points extracted using the network of this invention in an aerial photograph of the first region;
[0056] Figure 7(b) shows the distribution of feature points extracted using the network of this invention in a satellite image of the first region;
[0057] Figure 7(c) shows the registration result calculated from the paired feature points extracted from Figure 7(a) and Figure 7(b);
[0058] Figure 8(a) shows the distribution of feature points extracted using the network of this invention in an aerial photograph of the second region;
[0059] Figure 8(b) shows the distribution of feature points extracted using the network of this invention in the satellite image of the second region;
[0060] Figure 8(c) shows the registration result calculated from the paired feature points extracted from Figure 8(a) and Figure 8(b).
Detailed Implementation
[0061] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0062] Image matching is fundamental to visual positioning, enabling accurate aircraft localization for tasks such as flight navigation and target detection. This invention proposes a meta-learning-based heterogeneous image matching network construction method and a meta-learning-based heterogeneous image matching method. The method employs meta-learning technology, utilizing Earth remote sensing big data to train the aircraft's cognitive abilities, enhancing its intelligent understanding of the Earth's surface during flight. It possesses the ability to self-select matching regions and can solve the robust matching problem between aerial images and pre-stored satellite images during flight, making it applicable in any scenario. Detailed explanations follow.
[0063] In a first aspect, embodiments of the present invention provide a method for constructing a heterogeneous image matching network based on meta-learning. As shown in Figure 1, the method may include the following steps:
[0064] S1, Build the target network;
[0065] The target network includes a feature pairing branch, a saliency judgment branch, and a merging processing module; specifically:
[0066] 1) The feature pairing branch
[0067] The input to the feature pairing branch is a pair of heterogeneous images, which include aerial and satellite images of the same region. The feature pairing branch mainly extracts features from the satellite and aerial images, and then pairs the extracted feature points. That is, the feature pairing branch is used to obtain the probability of point-to-point matching for the input pair of heterogeneous images. The probability of point-to-point matching refers to the probability that pixels at corresponding positions in the two images belong to the same landmark.
[0068] In one optional implementation, the feature pairing branch includes a feature extractor, a position encoding module, a self-attention layer, and a cross-attention layer;
[0069] The aerial and satellite images, after being processed by the feature extractor, yield their respective feature maps. The feature extractor can be implemented using any existing neural network architecture with feature extraction capabilities; for example, optionally, the feature extractor can be a ResNet+FPN network; the ResNet+FPN network outputs a feature map one-eighth the size of the original image.
[0070] Of course, the structure of the feature extractor described in the embodiments of the present invention is not limited thereto.
[0071] The attention mechanism relies on the Transformer model, which abandons RNNs and CNNs as the basic models for sequence learning. A recurrent neural network (RNN) is inherently sequential and therefore naturally carries the position of each word within the sequence. When the RNN structure is replaced entirely by attention, this word-order information is lost, and the model cannot know the relative or absolute position of each word in the sentence. A word-order signal must therefore be added to the word vectors to help the model learn this information; position encoding is a method for doing so. Position encoding lets the input data carry positional information so that the model can identify positional features.
[0072] Accordingly, the position encoding module in this embodiment of the invention uses a preset position encoding method to encode the feature maps obtained after the aerial and satellite images have passed through the feature extractor, and then feeds them into the self-attention and cross-attention layers. In one optional embodiment, the position encoding module uses the Sinusoidal position encoding method. Of course, the position encoding method used by the position encoding module in this embodiment is not limited to this. For more on position encoding and the Sinusoidal position encoding method, please refer to the relevant technical literature; detailed descriptions are not provided here.
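As a rough illustration of the Sinusoidal position encoding named above, the following sketch uses the standard one-dimensional Transformer formulation applied to a flattened feature map. The function names, the flattening to one dimension, and the base of 10000 are assumptions made for illustration; this text does not specify them, and a two-dimensional variant may be what is actually used.

```python
import math


def sinusoidal_encoding(num_positions, dim, base=10000.0):
    """Return a num_positions x dim table of sinusoidal position codes."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim // 2):
            angle = pos / (base ** (2 * i / dim))
            row.append(math.sin(angle))  # even channel 2i
            row.append(math.cos(angle))  # odd channel 2i + 1
        table.append(row)
    return table


def add_position_encoding(features):
    """Add the position code to each descriptor of a flattened feature map.

    `features` is a list of descriptors (one per feature-map point), assumed
    here to be flattened row by row.
    """
    table = sinusoidal_encoding(len(features), len(features[0]))
    return [[f + p for f, p in zip(feat, code)]
            for feat, code in zip(features, table)]
```

Because the code depends only on the position index, the same table can be reused for every image of the same feature-map size.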
[0073] Self-attention and cross-attention layers are used to process the input data. They represent two different attention mechanisms: self-attention layers take feature points from one image as input and are used to find relationships between points within the same image; cross-attention layers take feature points from two images as input and are used to find relationships between points in one image and points in the other. By combining self-attention and cross-attention layers, the relationships between feature points within and between images can be better characterized, thus obtaining the point-pair matching probability for the paired heterogeneous images. The self-attention and cross-attention layers can be implemented using existing structures, which will not be described in detail here.
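The difference between the two layer types can be sketched with a bare scaled dot-product attention in which only the source of the keys and values changes. This is a minimal sketch under simplifying assumptions: learned projection matrices, multiple heads, and residual connections are omitted, and all function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Scaled dot-product attention with identity projections.

    Each query descriptor is replaced by a weighted mean of `keys_values`,
    weighted by its softmax-normalized similarity to each of them.
    """
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, keys_values))
                    for c in range(len(q))])
    return out


def self_attention(feats):
    # Queries, keys, and values all come from the same image.
    return attend(feats, feats)


def cross_attention(feats_a, feats_b):
    # Queries come from image A; keys and values come from image B.
    return attend(feats_a, feats_b)
```

In a full network the two layer types would typically be interleaved several times, so that each image's descriptors are conditioned both on their own context and on the other image.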
[0074] 2) The saliency judgment branch
[0075] The saliency judgment branch is used to calculate the saliency value of each point in the satellite feature map output by the feature pairing branch to characterize whether the corresponding point is a credible landmark.
[0076] The satellite feature map is the output of the satellite image after passing through the feature extractor.
[0077] Since the points on the aerial and satellite images are in one-to-one correspondence, this embodiment of the invention only needs to calculate the saliency of the points on the satellite image. Significance, also known as statistical significance, refers to the level of risk incurred in rejecting the null hypothesis when it is in fact true; this level is also called the probability level or significance level. The significance level is a statistical concept used to determine whether the results of an experiment or study are statistically significant. Commonly used significance levels include 0.05, 0.01, and 0.001, with 0.05 the most common. The more significant a research result is, i.e., the smaller its p-value (the probability of observing the current statistic or a more extreme one if the null hypothesis is true), the further the result deviates from the mean or expected result, and the greater the difference.
[0078] In this embodiment of the invention, the input to the saliency judgment branch is only the feature map of the satellite image after passing through the feature extractor, and the output is a weighted feature map. Specifically, for example, if the input is a feature map of h*w*n, the output is a feature map of h*w*(n+1), with the added channel representing the weight of each point on the feature map. A higher weight indicates a more salient point.
[0079] The saliency judgment branch determines, at the feature-map level, whether points in the satellite feature map are reliable landmarks, that is, landmarks whose changes over time will not affect the matching results. The branch obtains the saliency value of each point in the satellite feature map and selects the points with large saliency values as reliable landmarks. Points with large saliency values are generally distinctive points in the image, such as buildings, roads, and other points with regional semantic features. The purpose of this invention is to use only matching points from strongly salient regions.
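The input and output shapes of the saliency judgment branch, and the selection of reliable landmarks, can be sketched as follows. This is only an illustration of the data layout: `toy_weight` is a hypothetical stand-in for the learned weight, and the threshold value is an assumption.

```python
def toy_weight(descriptor):
    """Hypothetical stand-in for the learned per-point weight."""
    return max(abs(x) for x in descriptor)


def append_weight_channel(feature_map, weight_fn):
    """Turn an h x w x n map into h x w x (n + 1) by appending a weight per point."""
    return [[list(d) + [weight_fn(d)] for d in row] for row in feature_map]


def select_landmarks(weighted_map, threshold):
    """Return the (i, j) coordinates whose weight (last channel) reaches the threshold."""
    return [(i, j)
            for i, row in enumerate(weighted_map)
            for j, d in enumerate(row)
            if d[-1] >= threshold]


# A 2 x 2 feature map with n = 2 channels.
feature_map = [[[0.0, 0.0], [3.0, 4.0]],
               [[1.0, 0.0], [0.0, 0.5]]]
weighted = append_weight_channel(feature_map, toy_weight)
landmarks = select_landmarks(weighted, threshold=1.0)
```

Only the coordinates returned by `select_landmarks` would then take part in matching.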
[0080] In this embodiment of the invention, the saliency judgment branch mainly determines the saliency of feature points by adding a saliency judgment to the loss function. The loss function used in the saliency judgment branch will be specifically explained in S2.
[0081] 3) The merging processing module
[0082] The merging processing module performs threshold filtering and geometric consistency checks on the paired feature point pairs to obtain the mapping relationship between images. This relationship is then used to perform matching based on the probability of matching between the point pairs and credible landmarks, and outputs a confidence matrix.
[0083] Specifically, the merging processing module extracts only reliable landmarks for matching and outputs a confidence matrix. The number of rows of the confidence matrix is the number of pixels in the downsampled satellite image, and the number of columns is the number of pixels in the downsampled aerial image. For example, if the satellite image has width W1 and height H1 and the aerial image has width W2 and height H2, the confidence matrix has (H1/8)×(W1/8) rows and (H2/8)×(W2/8) columns. The value of each element of the confidence matrix is the probability that the pair of points it indexes on the satellite and aerial images corresponds to the same landmark. The higher the confidence, the more reliable the point pair is as a corresponding landmark.
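The size of the confidence matrix and a simple form of threshold filtering can be sketched as follows. The mutual-best check is an assumption added for illustration (this text only names threshold filtering and geometric consistency checks), the geometric consistency check itself is omitted, and image sizes are assumed to be divisible by 8.

```python
def confidence_shape(h1, w1, h2, w2, stride=8):
    """Rows index downsampled satellite pixels, columns downsampled aerial pixels."""
    return (h1 // stride) * (w1 // stride), (h2 // stride) * (w2 // stride)


def filter_matches(conf, threshold):
    """Keep (row, col, score) entries at or above the threshold that are also
    the best score in both their row and their column."""
    matches = []
    for r, row in enumerate(conf):
        for c, score in enumerate(row):
            if score < threshold:
                continue
            col_best = max(conf[k][c] for k in range(len(conf)))
            if score == max(row) and score == col_best:
                matches.append((r, c, score))
    return matches
```

For a 480×640 satellite image and a 240×320 aerial image, `confidence_shape(480, 640, 240, 320)` gives a 4800×1200 matrix.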
[0084] S2, Based on the obtained saliency values, meta-learning is used to construct the final loss function of the target network;
[0085] In one optional implementation, S2 may include the following steps:
[0086] S21, Define the salient points on the satellite feature map;
[0087] Specifically, S21 includes:
[0088] A salient point is defined as one that is salient in both the spatial and channel dimensions. The definition of a salient point is expressed as follows:
[0089] $(i,j)$ is a salient point $\iff$ $D^k_{ij}$ is a local maximum in $D^k$ (1), where:
[0090] The input satellite image is denoted $I$, and the satellite feature map is denoted $F$, where $F$ is a three-dimensional tensor, $F \in \mathbb{R}^{h \times w \times n}$. Here $h \times w$ is the size of the feature map output by one channel of the feature extractor, $h$ and $w$ are user-defined positive integers, and $n$ is the number of channels, also a positive integer. $F$ contains all the information from satellite image $I$. From a channel perspective, $F$ is a dense set of $h \times w$ descriptor vectors $d$, where $d_{ij} = F_{ij:}$, $1 \le i \le h$, $1 \le j \le w$, $d \in \mathbb{R}^n$. From a spatial perspective, $F$ is a set of $n$ feature response maps $D^k$ obtained from feature detection on the $n$ channels, where $D^k = F_{::k}$ and $D^k \in \mathbb{R}^{h \times w}$. Therefore, at the level of $F$, a salient point can be defined as one that is salient in both space and channel, as shown in equation (1). Here, $(i,j)$ denotes pixel coordinates in the satellite feature map, and $D^k_{ij}$ denotes the value of $D^k$ at point $(i,j)$. $(i,j)$ being a salient point is equivalent to $D^k_{ij}$ being the maximum within a local region of $D^k$, the local region being a grid of preset size, such as 3×3. $m$ denotes the channel index, $m \in [1,n]$, and $k$ is the value of $m$ that maximizes $D^m_{ij}$, i.e. $k = \arg\max_m D^m_{ij}$.
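The definition in equation (1) can be checked point by point with a short sketch. It assumes a feature map stored as nested lists indexed `F[i][j][k]` with 0-based indices, a 3×3 local region clipped at the borders, and ties counted as maxima; these conventions and the function names are illustrative choices, not taken from the patent.

```python
def is_salient(F, i, j):
    """(i, j) is salient iff D^k_ij is the maximum of its 3x3 region in D^k,
    where k is the channel with the largest response at (i, j)."""
    h, w = len(F), len(F[0])
    desc = F[i][j]
    k = max(range(len(desc)), key=lambda m: desc[m])  # k = argmax_m D^m_ij
    region = [F[a][b][k]
              for a in range(max(0, i - 1), min(h, i + 2))
              for b in range(max(0, j - 1), min(w, j + 2))]
    return desc[k] >= max(region)


def salient_points(F):
    """List every (i, j) that satisfies the definition."""
    return [(i, j)
            for i in range(len(F))
            for j in range(len(F[0]))
            if is_salient(F, i, j)]


# A 3 x 3 map with 2 channels and a single strong response at the centre.
F = [[[0.1, 0.0] for _ in range(3)] for _ in range(3)]
F[1][1] = [0.9, 0.2]
points = salient_points(F)
```

This hard test is not differentiable, which is why a soft saliency value is needed for training.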
[0091] S22, Design the loss function of the target network using the definition of the salient points, and construct an explicitly expressed initial loss function;
[0092] Specifically, S22 may include the following steps:
[0093] S221, according to the definition of the salient point, the saliency value $s_{ij}$ is used to quantitatively characterize the saliency of each point on the satellite feature map, where $s_{ij}$ is expressed as:
[0094]
[0095] where the spatial term denotes the saliency of point $(i,j)$ in the spatial dimension; the channel term denotes the saliency of point $(i,j)$ in the channel dimension; $\gamma_{ij}$ denotes the combined saliency of point $(i,j)$ over the spatial and channel dimensions; $s_{ij}$ denotes the normalized saliency value of point $(i,j)$ on the satellite feature map; $(i',j')$ denotes points on the satellite feature map other than $(i,j)$; $N(i,j)$ denotes the 3×3 neighborhood centered on point $(i,j)$; $D^m_{ij}$ denotes the value of $D^m$ at point $(i,j)$; and $\gamma_{i'j'}$ denotes the combined saliency of point $(i',j')$.
[0096] At this point, $s_{ij}$ characterizes the saliency of each point $(i,j)$ on $F$. In this embodiment of the invention it can therefore be added to the loss function, so that the network pays more attention to the loss of salient points and appropriately ignores the loss of non-salient points. Step S222 is therefore executed.
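Equation (2) is not reproduced in this text. The surviving definitions (a spatial term over the 3×3 neighborhood N(i,j), a channel term, their combination γ, and a normalization over the feature map) resemble the soft detection score used in D2-Net, so the sketch below assumes that form. It should be read as one plausible instance, not as the patent's exact formula; it also assumes non-negative feature responses and clips the neighborhood at the borders.

```python
import math


def saliency_scores(F):
    """Assumed D2-Net-style soft saliency for a map indexed F[i][j][m].

    spatial(i, j, m) = exp(D^m_ij) / sum over N(i, j) of exp(D^m_i'j')
    channel(i, j, m) = D^m_ij / max over channels of D_ij
    gamma_ij         = max over m of spatial * channel
    s_ij             = gamma_ij / sum of gamma over the whole map
    """
    h, w, n = len(F), len(F[0]), len(F[0][0])
    gamma = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            peak = max(F[i][j])
            best = 0.0
            for m in range(n):
                denom = sum(math.exp(F[a][b][m])
                            for a in range(max(0, i - 1), min(h, i + 2))
                            for b in range(max(0, j - 1), min(w, j + 2)))
                spatial = math.exp(F[i][j][m]) / denom
                channel = F[i][j][m] / peak if peak > 0 else 0.0
                best = max(best, spatial * channel)
            gamma[i][j] = best
    total = sum(sum(row) for row in gamma)
    if total == 0:
        raise ValueError("feature map has no positive responses")
    return [[g / total for g in row] for row in gamma]
```

Unlike a hard local-maximum test, these values vary smoothly with the feature responses, so they can be used as weights inside a loss function.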
[0097] S222, introducing $s_{ij}$ into the design of the loss function of the target network, the explicitly expressed initial loss function is constructed as follows:
[0098]
[0099] where L1 denotes the initial loss function; the annotation term denotes the point-pair annotation information of the paired heterogeneous images in the dataset; the positive-sample term denotes the points whose annotation value equals 1; the negative-sample term denotes the x-coordinate of points that lie in the same column as a positive sample but whose annotation value is not 1; the pairing probability is the probability, output by the feature pairing branch, that a positive sample is a pairing point; the negative-sample probability is the matching probability output by the feature pairing branch for a negative sample; $\lambda$ denotes the normalization parameter for the negative-sample loss; and $s_{ij}$ denotes the normalized saliency value of point $(i,j)$ on the satellite feature map.
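Equation (3) is not reproduced in this text, so its exact form is unknown. Based only on the listed terms (positive samples, negative samples, the weight λ on the negative-sample loss, and the saliency value s_ij), one plausible shape is a saliency-weighted negative log-likelihood. The sketch below is an assumption of that kind; the function name and the small constant `eps` are hypothetical.

```python
import math


def weighted_matching_loss(positives, negatives, lam=1.0, eps=1e-12):
    """Assumed saliency-weighted loss.

    positives / negatives: lists of (s_ij, p), where s_ij is the saliency of
    the satellite point and p is the predicted matching probability.
    Positive pairs are pushed towards p = 1, negative pairs towards p = 0,
    and lam scales the negative-sample part.
    """
    pos = -sum(s * math.log(p + eps) for s, p in positives)
    neg = -sum(s * math.log(1.0 - p + eps) for s, p in negatives)
    return pos + lam * neg
```

With this form, a poorly matched point on a highly salient landmark contributes more to the loss than the same error on a non-salient point, which is the behaviour the text describes.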
[0100] S23, using meta-learning, the initial loss function is rewritten into an implicitly expressed final loss function.
[0101] The approach above manually defines the saliency of points on the feature map as a combination of the spatial and channel dimensions, which is an explicit, hand-crafted expression. However, this invention aims to let the network learn this saliency itself, without manually constructing analytical expressions. Therefore, drawing on meta-learning work, a single-layer MLP (Multilayer Perceptron, a fully connected neural network) V(·,θ) is used to characterize saliency. The goal is for the network to compute the saliency of each point's feature representation automatically and to update it automatically as the network iterates.
[0102] Please refer to Figure 2 and Figure 3. Figure 2 is a schematic diagram of network optimization during the training process of this invention, and Figure 3 is a schematic diagram illustrating the actual application process of the present invention. The overall network of the present invention is in effect a cross-modal remote sensing image matching network that automatically assesses the reliability of surface features. In the figures, $I_A$ and $I_B$ denote the paired heterogeneous images; the feature pairing branch is labeled as the feature pairing network, and the saliency judgment branch is labeled as saliency judgment. The position encoding module in the feature pairing branch is abbreviated as position encoding. The self-attention and cross-attention layers output two sets of feature matrices, which are three-dimensional arrays representing the weighted, sorted features obtained after the self-attention and cross-attention layers. The weights represent the probability that a point has a paired point in the other image. The pairing probability between point pairs is then obtained by computing the similarity of the weighted, sorted features.
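The last step, turning feature similarity into pairing probabilities, can be sketched as follows. The text does not say which similarity or normalization is used; the dot-product similarity and the dual-softmax normalization below are assumptions borrowed from common detector-free matchers, and the function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def pairing_probabilities(sat_feats, aerial_feats):
    """Dual-softmax over a dot-product similarity matrix.

    Rows index satellite points and columns index aerial points. An entry is
    large only if the pair is the preferred match in both directions.
    """
    sim = [[sum(a * b for a, b in zip(s, q)) for q in aerial_feats]
           for s in sat_feats]
    rows, cols = len(sim), len(sim[0])
    row_sm = [softmax(row) for row in sim]
    col_sm = [softmax([sim[r][c] for r in range(rows)]) for c in range(cols)]
    return [[row_sm[r][c] * col_sm[c][r] for c in range(cols)]
            for r in range(rows)]
```

The resulting matrix has the same layout as the confidence matrix described for the merging processing module.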
[0103] S23 may specifically include the following steps:
[0104] S231, redefine the feature pairing branch as N(ω);
[0105] Specifically, the network of the feature pairing branch is defined as N(ω), whose parameters are composed of the parameters of the feature extractor and the self-attention and cross-attention mechanism layers, where ω is the parameter of the feature pairing branch.
[0106] S232, introduce a layer of MLP network V(·,θ), and redefine the saliency judgment branch as V(N(ω),θ);
[0107] S233, based on the redefined feature pairing branch and saliency judgment branch, the content of the initial loss function is rewritten as follows:
[0108]
[0109]
[0110] Specifically, rewriting the two summation terms of equation (3) with the above definitions yields equations (4) and (5).
[0111] S234, Based on the rewritten content, the final loss function of the implicit expression is:
[0112]
[0113] where L_2 denotes the final loss function. In the initial loss function, s_ij is the normalized saliency value of point (i,j) on the satellite feature map. V(·,θ) plays the same role, but the saliency weights that were previously expressed explicitly are now expressed implicitly through V(·,θ). The advantage is that the network learns saliency from the characteristics of the data rather than relying solely on manually designed saliency values, which yields more robust salient features.
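As a rough illustration of steps S231 to S234, the following sketch shows how a single-layer MLP V(·,θ) could map each C-dimensional descriptor of the satellite feature map to a saliency weight that then scales a per-point matching loss. The sigmoid activation, the sum-to-one normalization, the negative log-likelihood loss term, and the names `saliency_head` and `weighted_matching_loss` are assumptions made for illustration only; they are not the expressions of equations (4) to (6).

```python
import numpy as np

def saliency_head(feat, weight, bias):
    """Single-layer MLP V(., theta) applied independently at every point.

    feat:   (C, H, W) satellite feature map (output of the pairing branch).
    weight: (C,) vector; together with the scalar bias it plays the role of theta.
    Returns an (H, W) map of normalized saliency values that sums to 1.
    """
    logits = np.tensordot(weight, feat, axes=([0], [0])) + bias  # shape (H, W)
    raw = 1.0 / (1.0 + np.exp(-logits))   # sigmoid: an assumed activation
    return raw / raw.sum()                # assumed sum-to-one normalization

def weighted_matching_loss(pair_prob, saliency, eps=1e-9):
    """Saliency-weighted negative log-likelihood (an assumed loss form).

    pair_prob: (H, W) probability that the pairing branch assigns to the
               annotated match of each satellite point (hypothetical layout).
    """
    return float(-(saliency * np.log(pair_prob + eps)).sum())

rng = np.random.default_rng(0)
feat = rng.normal(size=(16, 8, 8))            # toy feature map, C=16
theta_w, theta_b = rng.normal(size=16) * 0.1, 0.0
s = saliency_head(feat, theta_w, theta_b)
loss = weighted_matching_loss(np.full((8, 8), 0.5), s)
```

Because the weights come from a learnable function rather than a fixed formula, gradients of a loss with respect to `theta_w` and `theta_b` can reshape which points count as salient, which is the point of the implicit expression.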
[0114] S3, obtain a heterogeneous image dataset carrying point pair matching annotation information, and construct a meta-dataset by filtering salient points in it. Use the heterogeneous image dataset, the meta-dataset and the final loss function to train the target network to obtain the trained surface feature differential meta-selection network.
[0115] The process of training the target network in this embodiment of the invention is a process of optimizing ω and θ.
[0116] First, the meta-dataset needs to be defined. Once a heterogeneous image dataset carrying point-pair matching annotation information has been obtained, salient points can be determined from the annotations. The meta-dataset is obtained by filtering salient points out of this dataset as matching points; that is, every sample in the meta-dataset must be reliable and salient. Here, M manually annotated sample points from each image are selected as the meta-data.
[0117] Specifically, the target network is trained using the heterogeneous image dataset, the meta-dataset, and the final loss function to obtain a trained surface feature differential meta-selection network, including:
[0118] Using the heterogeneous image dataset, the meta-dataset, and the final loss function, the target network is trained using multiple gradient descent steps to optimize parameters ω and θ, resulting in a trained surface feature differential meta-selection network.
[0119] The core of this invention lies in the improved loss function, which can be combined with existing neural network training methods. This step mainly uses the loss function to backpropagate and update the network parameters. Specifically, the optimization is bi-level: ω is fixed while θ is optimized, then θ is fixed while ω is optimized, so the two are optimized alternately. In short:
[0120] Optimization of ω: given the current optimal value θ* of θ, the optimal value ω* of ω is obtained from:
[0121] $\omega^*(\theta^*) = \arg\min_{\omega} L_C(\omega, \theta^*)$ (7);
[0122] where $L_C(\omega, \theta^*)$ denotes the loss computed by equation (4) from the confidence matrix obtained by passing the data through the entire network, with θ fixed at its current optimal value θ*.
[0123] Optimization of θ: fix ω at its current optimal value ω*. Feeding the meta-data into V(·,θ) yields $L_{\mathrm{meta}}(\omega^*(\theta))$, the loss computed by equation (8) from the confidence matrix obtained by passing the meta-data through the entire network with ω fixed at ω*. Gradient descent then selects the θ that minimizes $L_{\mathrm{meta}}(\omega^*(\theta))$, denoted θ*:
[0124]
[0125]
[0126] $\theta^* = \arg\min_{\theta} L_{\mathrm{meta}}(\omega^*(\theta))$ (10);
[0127] ω and θ are updated alternately according to the steps above. Because the exact optimal values ω* and θ* cannot be obtained under the current conditions, the results of multiple gradient descent steps are used in their place. For details of this process, refer to the relevant technical documentation; no further explanation is given here.
[0128] The two parameters ω and θ are determined through the above process, yielding the trained surface feature differential meta-selection network.
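The alternating update of ω and θ can be illustrated on a deliberately small problem. The sketch below is not the network of this invention: it assumes a scalar regression model in place of the feature pairing branch, one learnable weight logit per training sample in place of the MLP V(·,θ), and a one-step look-ahead to approximate ω*(θ), as is common in meta-reweighting schemes. All variable names and step sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_clean, n_bad = 30, 10
x = rng.uniform(1.0, 2.0, size=n_clean + n_bad)
y = 2.0 * x                        # true slope is 2
y[n_clean:] = -2.0 * x[n_clean:]   # the last 10 labels are corrupted
x_meta = rng.uniform(1.0, 2.0, size=5)
y_meta = 2.0 * x_meta              # small, trusted meta-dataset

omega = 0.0                        # parameter of the toy model (stands in for omega)
theta = np.zeros(x.size)           # one weight logit per sample (stands in for theta)
alpha, beta = 0.1, 10.0            # step sizes for omega and theta

for _ in range(500):
    w = sigmoid(theta)
    g = (omega * x - y) * x        # per-sample gradient of 0.5 * (omega*x - y)**2
    # Look-ahead step: omega as a function of theta, one gradient step ahead.
    omega_look = omega - alpha * np.mean(w * g)
    # Gradient of the meta loss at the look-ahead point, using meta-data only.
    g_meta = np.mean((omega_look * x_meta - y_meta) * x_meta)
    # Chain rule: dL_meta/dtheta_i = g_meta * d(omega_look)/d(theta_i).
    d_theta = g_meta * (-alpha / x.size) * w * (1.0 - w) * g
    theta -= beta * d_theta        # update theta with omega held fixed
    # Update omega with the new theta held fixed.
    omega -= alpha * np.mean(sigmoid(theta) * g)

clean_weight = sigmoid(theta[:n_clean]).mean()
bad_weight = sigmoid(theta[n_clean:]).mean()
```

In this toy setting the weights of the corrupted samples are driven down and those of the clean samples up, so `omega` moves toward the slope supported by the trusted meta-data. This mirrors, in miniature, how a reliable meta-dataset can steer the saliency weights of a larger, noisier training set.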
[0129] The surface feature differential meta-selection network is used to output the corresponding confidence matrix for the input unknown paired heterogeneous images in order to obtain the image matching and registration results.
[0130] The confidence matrix represents the pairing probability between point pairs, that is, the point-pair pairing result on the heterogeneous images. In the confidence matrix of Figure 3, (1/8)^2 H_B W_B and (1/8)^2 H_A W_A (see the earlier description of the confidence matrix) are the total numbers of pixels in the aerial and satellite images after downsampling. For how the image matching and registration results are obtained, see the description in the second part below.
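To make the dimensions concrete, the following sketch builds a confidence matrix of size (1/8)^2 H_A W_A by (1/8)^2 H_B W_B from two sets of flattened coarse features. The text only states that pairing probabilities are obtained from feature similarity; the scaled dot product, the dual-softmax normalization, the `temperature` parameter, and the choice of rows for image A and columns for image B are assumptions borrowed from common detector-free matchers.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def confidence_matrix(feat_a, feat_b, temperature=0.1):
    """feat_a: (N_A, C), feat_b: (N_B, C) flattened coarse features.

    Returns an (N_A, N_B) matrix with entries in [0, 1].
    """
    c = feat_a.shape[1]
    sim = feat_a @ feat_b.T / (c * temperature)   # assumed similarity scaling
    # Dual-softmax (assumed): product of row-wise and column-wise softmax.
    return softmax(sim, axis=1) * softmax(sim, axis=0)

H_A, W_A, H_B, W_B, C = 64, 96, 80, 80, 32
n_a = (H_A // 8) * (W_A // 8)    # (1/8)^2 * H_A * W_A = 96
n_b = (H_B // 8) * (W_B // 8)    # (1/8)^2 * H_B * W_B = 100
rng = np.random.default_rng(0)
conf = confidence_matrix(rng.normal(size=(n_a, C)), rng.normal(size=(n_b, C)))
```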
[0131] The meta-learning-based heterogeneous image matching network construction method provided by this invention first constructs a target network comprising a feature pairing branch, a saliency judgment branch, and a pooling processing module. The feature pairing branch extracts features from the satellite and aerial images and then pairs the extracted feature points. The saliency judgment branch judges the saliency of the extracted feature points and selects those with high saliency. Highly salient points are generally distinctive points in the image, such as buildings and roads with regional semantic features. The pooling processing module applies threshold filtering and geometric consistency checks to the paired feature points to obtain the mapping relationship between the images. A self-designed meta-learning-based loss function is used during training. The training dataset is a heterogeneous image dataset carrying point-pair matching annotation information; because precise annotation is costly, it may contain many wrong or inaccurate annotations. Therefore, after manual screening, a small number of reliable images with accurate point-pair annotations are selected as the meta-dataset. The target network is trained using the heterogeneous image dataset, the meta-dataset, and the final loss function to obtain the trained surface feature differential meta-selection network, which then outputs the corresponding confidence matrix for input unknown paired heterogeneous images. Because meta-learning enables the saliency judgment branch to select matching points in highly salient regions, the meta-learning-based heterogeneous image matching method implemented with the trained network can select matching regions by itself. When matching unknown paired heterogeneous images, it addresses the problem of robustly matching aerial images against pre-stored satellite images during flight, selects feature-robust regions more quickly, and is applicable to any scenario. It is a novel intelligent and autonomous matching method for situations where an aircraft's other navigation and positioning methods fail.
[0132] Secondly, based on the meta-learning-based heterogeneous image matching network construction method provided by this invention, this embodiment provides a meta-learning-based heterogeneous image matching method. As shown in Figure 4, the method includes:
[0133] S100, Obtain the paired heterogeneous images to be matched;
[0134] Here, the paired heterogeneous images to be matched are aerial and satellite images of the same area on which pixel matching has not yet been performed.
[0135] S200, the heterogeneous images to be matched are input into the surface feature differential meta-selection network to obtain the corresponding confidence matrix, the feature point pairing result is obtained according to the confidence matrix, and the image matching and registration result is obtained according to the feature point pairing result;
[0136] Refer to Figure 3. The confidence matrix only gives the pairing probability between point pairs, that is, the point-pair pairing result on the heterogeneous images. It must therefore be thresholded, for example by retaining only elements whose confidence exceeds 0.01, and subjected to geometric consistency constraints, from which a homography transformation matrix H is regressed. H represents the image matching and registration result.
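The post-processing described in S200 can be sketched as follows: threshold the confidence matrix at 0.01, convert the surviving coarse-grid indices to pixel coordinates, and fit a homography H. The mutual-maximum test in `select_matches`, the stride of 8 (taken from the one-eighth feature map size), and the plain direct linear transform are illustrative assumptions. A robust geometric consistency step such as RANSAC is not shown and would normally precede the final fit.

```python
import numpy as np

def select_matches(conf, threshold=0.01):
    """Keep entries above the threshold that are maxima of their row and column.

    The mutual-maximum test is an added assumption; the text only names
    thresholding. Returns an array of (row, col) index pairs.
    """
    keep = conf > threshold
    keep &= conf == conf.max(axis=1, keepdims=True)
    keep &= conf == conf.max(axis=0, keepdims=True)
    return np.argwhere(keep)

def index_to_pixel(idx, width, stride=8):
    """Map flattened coarse-grid indices to (x, y) pixel coordinates."""
    idx = np.asarray(idx)
    xy = np.stack([(idx % width) * stride, (idx // width) * stride], axis=1)
    return xy.astype(float)

def homography_dlt(src, dst):
    """Estimate H with dst ~ H @ src from at least 4 point pairs (DLT)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    h = vt[-1].reshape(3, 3)       # null-space vector of the constraint matrix
    return h / h[2, 2]             # fix the scale so that H[2, 2] == 1

conf = np.array([[0.90, 0.001, 0.0],
                 [0.002, 0.005, 0.0],
                 [0.0, 0.30, 0.60]])
matches = select_matches(conf)
pixels = index_to_pixel([0, 13], width=12)

# Synthetic check: recover a known homography from noise-free point pairs.
H_true = np.array([[1.1, 0.05, 3.0], [-0.02, 0.95, -2.0], [1e-3, 2e-3, 1.0]])
src = np.array([[0, 0], [8, 0], [0, 8], [8, 8], [4, 2], [2, 5]], dtype=float)
proj = np.c_[src, np.ones(len(src))] @ H_true.T
dst = proj[:, :2] / proj[:, 2:3]
H_est = homography_dlt(src, dst)
```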
[0137] The surface feature differential meta-selection network is obtained based on the meta-learning-based heterogeneous image matching network construction method provided in the first aspect.
[0138] The specific process of constructing the heterogeneous image matching network based on meta-learning to obtain the differential meta-selection network for the surface features is detailed in the relevant content of the first aspect and will not be elaborated here.
[0139] In the heterogeneous image matching network construction method based on meta-learning provided in the first aspect of the present invention, meta-learning technology is used to enable the saliency judgment branch to select matching points in regions with strong saliency. Therefore, the heterogeneous image matching method based on meta-learning implemented by the trained surface feature differential meta-selection network has the ability to self-select matching regions. It can solve the robust matching problem between aerial images and pre-stored satellite images during flight when performing unknown pairwise heterogeneous image matching, and can be applied to any scenario.
[0140] To facilitate understanding of the embodiments of the present invention, experimental data are used for illustration below.
[0141] In the experiment, the feature extractor in the feature pairing branch was a ResNet+FPN network, and the final feature map size was [size missing]. The input to the self-attention and cross-attention layers was positionally encoded with the sinusoidal positional encoding method. Both ω and θ were optimized with SGD. The dataset contained 5,000 aerial and satellite images with human-annotated aerial-satellite pairings. Data augmentation included rotation, flipping, and scaling by a factor between 0.5 and 2 relative to the original size. The experimental platform used four NVIDIA GeForce RTX 3090 GPUs, with 2000 training iterations and an initial learning rate of 0.001.
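For readers unfamiliar with sinusoidal positional encoding, the sketch below generates a two-dimensional variant for a feature map. The text names only the sinusoidal method; splitting the channels into four groups (sine and cosine of the x position, sine and cosine of the y position), the base of 10000, and the function name are assumptions.

```python
import numpy as np

def sinusoidal_encoding_2d(channels, height, width):
    """Return a (channels, height, width) encoding; channels must be divisible by 4."""
    if channels % 4 != 0:
        raise ValueError("channels must be divisible by 4")
    quarter = channels // 4
    # Geometric progression of frequencies, as in the standard 1D formulation.
    freq = 1.0 / (10000.0 ** (np.arange(quarter) / quarter))
    ys = np.arange(height)[None, :, None] * freq[:, None, None]  # (quarter, H, 1)
    xs = np.arange(width)[None, None, :] * freq[:, None, None]   # (quarter, 1, W)
    pe = np.zeros((channels, height, width))
    pe[0 * quarter:1 * quarter] = np.sin(xs)   # broadcast over rows
    pe[1 * quarter:2 * quarter] = np.cos(xs)
    pe[2 * quarter:3 * quarter] = np.sin(ys)   # broadcast over columns
    pe[3 * quarter:4 * quarter] = np.cos(ys)
    return pe

pe = sinusoidal_encoding_2d(channels=32, height=10, width=12)
```

The encoding would typically be added to the feature map before the self-attention and cross-attention layers so that attention can take position into account.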
[0142] Before employing the meta-learning region self-selection technique proposed in this invention, the results of region matching for heterogeneous remote sensing images are shown in Figures 5(a), 5(b), 5(c), 6(a), 6(b), and 6(c). Figures 5(a) and 5(b) show the distribution of feature points extracted from the aerial and satellite images of the first region using existing feature extraction networks, indicated by red circles. Figure 5(c) shows the registration result calculated based on the paired feature points extracted above. Figures 6(a) and 6(b) show the distribution of feature points extracted from the aerial and satellite images of the second region using existing feature extraction networks, indicated by red circles. Figure 6(c) shows the registration result calculated based on the paired feature points extracted above. Both the registration results in Figures 5(c) and 6(c) contain errors.
[0143] The region selection results show that the feature points extracted in Figures 5(a), 5(b), 6(a), and 6(b) are widely scattered. Many feature points were also extracted in rapidly changing areas such as grassland. Compared with the building area in the lower right corner, features in these areas are not reliable enough, so the corresponding matches deviate. As a result, the registration of the two images is also biased, for example at the road in the upper left corner of Figure 5(c).
[0144] After the meta-learning region self-selection process, the matching results obtained in the heterogeneous remote sensing image matching problem are shown in Figures 7(a), 7(b), 7(c), 8(a), 8(b), and 8(c). Figures 7(a) and 7(b) show the distribution of feature points extracted using the network of this invention from the aerial and satellite images of the first region, respectively, indicated by red circles. Figure 7(c) shows the registration result calculated based on the paired feature points extracted above. Figures 8(a) and 8(b) show the distribution of feature points extracted using the network of this invention from the aerial and satellite images of the second region, respectively, indicated by red circles. Figure 8(c) shows the registration result calculated based on the paired feature points extracted above. It can be seen that compared with existing networks, the feature points extracted by this invention for image registration are concentrated in the salient areas of the image, and the feature points are very dense. For example, after using the method of this invention, there are almost no matching points in areas with large variations, such as grassland, and the final registration result will not have any deviation. This shows that the data self-selection strategy based on the meta-learning framework is effective.
[0145] The above description is merely a preferred embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention are included within the scope of protection of the present invention.
Claims
1. A method for constructing a heterogeneous image matching network based on meta-learning, characterized in that it includes: A target network is constructed, comprising a feature pairing branch, a saliency judgment branch, and a pooling processing module. The feature pairing branch is used to obtain the probability of point-to-point matching from the input paired heterogeneous images. The paired heterogeneous images include aerial images and satellite images of the same region. The saliency judgment branch is used to calculate the saliency value of each point in the satellite feature map output by the feature pairing branch to characterize whether the corresponding point is a credible landmark. The pooling processing module is used to perform matching based on the probability of point-to-point matching and the credible landmarks, and to output a confidence matrix. Based on the obtained saliency values, meta-learning is used to construct the final loss function of the target network; A heterogeneous image dataset carrying point-pair matching annotation information is obtained. A meta-dataset is constructed by filtering salient points from it. The target network is trained using the heterogeneous image dataset, the meta-dataset, and the final loss function to obtain a trained surface feature differential meta-selection network. The surface feature differential meta-selection network is used to output the corresponding confidence matrix for input unknown paired heterogeneous images to obtain image matching and registration results.
The step of constructing the final loss function of the target network using meta-learning based on the obtained saliency values includes: Determining the definition of salient points on the satellite feature map includes: defining salient points as those that are salient in both space and channel; the definition of salient points is expressed as follows: ; The input satellite image is denoted as The satellite feature map is denoted as , It is a three-dimensional tensor. , It is the size of the feature map output by one channel in the feature extractor. and The value is a user-defined positive integer. It is the number of channels. It is a positive integer; from the perspective of the channel dimension, It is a by Descriptor vectors The dense set that is composed of , , , From a spatial perspective, It is by Feature detection of each channel Zhang's characteristic corresponding diagram A set; , ; This represents the pixel coordinates in the satellite feature map. express midpoint The coordinates; if As a salient point, it is equivalent to exist The maximum value is found in a local area, which is a grid area of a preset size. The preset size includes... ; Indicates the number of channels selected. ,make Take the largest Value as ; The loss function of the target network is designed using the definition of the salient points, and an explicit initial loss function is constructed, including: based on the definition of the salient points, using... 
The significance value is used to quantitatively characterize the significance of each point on the satellite feature map, where, The expression is: ; in, Indicates the point Significance in spatial dimension; Indicates the point Significance in the channel dimension; Indicates the point The combined significance of the spatial dimension and the channel dimension; Points on the satellite feature map The normalized significance value; Including satellite feature map Outside points; Indicated by point The 3x3 neighborhood centered on the center; express midpoint The value; Point The overall significance; Introduction Design the loss function for the target network, and construct the explicit initial loss function as follows: ; in, Represents the initial loss function; It is represented as point-to-point annotation information of paired heterogeneous images in the dataset; express The point where the value of 1 is equal to 1; express neutralization The x-coordinate of a point in the same column but not equal to 1; Indicates the output of the feature pairing branch The probability of being a pairing point; The matching probability of the feature pairing branch for the output of a negative sample; This represents the normalization parameter for negative sample loss; Points on the satellite feature map The normalized significance value; Meta-learning is used to rewrite the initial loss function into an implicitly expressed final loss function.
2. The method for constructing a heterogeneous image matching network based on meta-learning according to claim 1, characterized in that, The feature pairing branch includes a feature extractor, a position encoding module, a self-attention layer, and a cross-attention layer; Accordingly, the satellite feature map is the output of the satellite image after passing through the feature extractor.
3. The method for constructing a heterogeneous image matching network based on meta-learning according to claim 2, characterized in that, The initial loss function is rewritten into an implicitly expressed final loss function using meta-learning, including: The feature pairing branch is redefined as N(ω), where ω denotes the parameters of the feature pairing branch; A one-layer MLP network V(·,θ) is introduced, and the saliency judgment branch is redefined as V(N(ω),θ); Based on the redefined feature pairing branch and saliency judgment branch, the contents of the initial loss function are rewritten as follows: [equations missing]; Based on the rewritten content, the implicitly expressed final loss function is: [equation missing]; where L_2 represents the final loss function.
4. The method for constructing a heterogeneous image matching network based on meta-learning according to claim 3, characterized in that, The target network is trained using the heterogeneous image dataset, the meta-dataset, and the final loss function to obtain a trained surface feature differential meta-selection network, including: Using the heterogeneous image dataset, the meta-dataset, and the final loss function, the target network is trained using multiple gradient descent steps to optimize the parameters ω and θ, and the trained surface feature differential meta-selection network is obtained.
5. The method for constructing a heterogeneous image matching network based on meta-learning according to claim 2, characterized in that, The feature extractor is a ResNet+FPN network; the ResNet+FPN network outputs a feature map that is one-eighth the size of the original image.
6. The method for constructing a heterogeneous image matching network based on meta-learning according to claim 2, characterized in that, The position encoding module uses the Sinusoidal position encoding method.
7. A heterogeneous image matching method based on meta-learning, characterized in that it includes: Obtaining the paired heterogeneous images to be matched; The heterogeneous images to be matched are input into the surface feature differential meta-selection network to obtain the corresponding confidence matrix. The feature point pairing result is obtained based on the confidence matrix, and the image matching and registration result is obtained based on the feature point pairing result. The surface feature differential meta-selection network is obtained based on the meta-learning-based heterogeneous image matching network construction method described in any one of claims 1 to 6.