A cross-view geolocation method based on information fusion joint representation learning

By employing a dual-branch network structure and information fusion strategy, the problem of global information neglect in cross-view geolocation is solved, thereby improving the accuracy and stability of the model in matching images from different viewpoints.

CN118230034B · Active · Publication Date: 2026-06-19 · NORTHEASTERN UNIV CHINA

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
NORTHEASTERN UNIV CHINA
Filing Date
2024-03-12
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In cross-view geolocation tasks, existing technologies overemphasize local information while ignoring global information, leading to decreased model performance and increased model complexity due to changes in viewpoint.

Method used

A dual-branch network structure is adopted to extract global and local features from satellite and UAV images respectively. Information fusion is performed through a global information module, a local information module, and a global-local hybrid module. Global and local receptive layer strategies are introduced, and the feature robustness is improved by using a hybrid information receptive strategy and a convolutional block attention module.

Benefits of technology

It improves the accuracy and robustness of cross-view geolocation and enhances the accuracy and stability of model image matching from different viewpoints.

✦ Generated by Eureka AI based on patent content.

[Figure CN118230034B_ABST]

Abstract

This invention belongs to the field of computer vision technology and discloses a cross-view geolocation method based on information fusion and joint representation learning. First, feature information from the image is acquired through a global information module and a local information module, helping the network to better learn information from the image. Furthermore, a global-local fusion module is introduced to allow local information to assist global features, thereby better learning the latent information in the image. Second, a global receptive layer is introduced into each module to enhance the extraction of contextual information from the image and improve model performance. Finally, tests are conducted on the University-1652 dataset, and the test results demonstrate that the proposed network framework outperforms state-of-the-art algorithms, verifying the effectiveness of the algorithm.

Description

Technical Field

[0001] This invention relates to the field of computer vision technology, and more specifically to a cross-view geolocation method based on information fusion joint representation learning.

Background Technology

[0002] Cross-view geolocation involves retrieving the most relevant images of the same geographical location from images acquired from different platforms and has been widely used in many fields, such as accurate delivery, autonomous driving, action recognition, event detection, and gait recognition. In the era of digital maps, it is often necessary to estimate the geospatial location of a given image. For example, given a drone view image, it is necessary to retrieve images of the same location from other viewpoints to obtain the geographic information in the current image. Therefore, cross-view geolocation has become an effective solution. However, view images acquired from different platforms have different viewpoints and imaging methods; for example, ground views are almost perpendicular to the horizon, while satellite views are almost parallel to the horizon. Therefore, cross-view geolocation is a highly challenging task.

[0003] A paper in *IEEE Transactions on Geoscience and Remote Sensing, 1–11, 2023* proposes a Target Enhancement Module (TFE) and a Feature Alignment and Unification Module (EM) to mine semantic information in images and improve model performance. However, this method focuses excessively on local information while ignoring global information, so the model overlooks some key information in the image and its performance decreases. A paper in *IEEE Transactions on Circuits and Systems for Video Technology, 4804-4815, 2021* proposes first converting UAV-view images into satellite-view images through projection transformation, and then segmenting the converted images with a square ring cutting method to complete the cross-view geolocation task. However, this operation increases the complexity of the model. In addition, this method also focuses only on local information, so it loses key global information and model performance decreases.

Summary of the Invention

[0004] The purpose of this invention is to propose a cross-view geolocation method based on information fusion joint representation learning, which improves the accuracy and robustness of cross-view geolocation.

[0005] The technical solution of this invention is as follows: a cross-view geolocation method based on information fusion joint representation learning, which acquires satellite-view image data and UAV-view image data respectively and establishes a cross-view geolocation model. The model is a dual-branch network whose two branches have the same structure but do not share weights. Each branch network includes a ResNet-50 backbone network module, a global information module branch, a local information module branch, a global-local hybrid module branch, and a classifier module. Satellite-view image data and UAV-view image data are input to different branch networks of the model. In each branch, the ResNet-50 backbone network module extracts global features from the input image, and the global features are then processed by the global information module branch, the local information module branch, and the global-local hybrid module branch. Each module branch is trained to learn a mapping function that maps images from different sources to a shared feature space. In this feature space, features of images with the same geographic label from different platforms lie closer together, while features of images with different geographic labels lie farther apart. The classifier module classifies the features in the feature space.

[0006] The global information module is used to deeply mine the global information of the image and provide two global feature descriptors for the final result; the global feature descriptors are the global information descriptor and the feature descriptor after introducing the global receptive layer.

[0007] The global information module includes a max pooling layer and a global receptive layer mechanism;

[0008] The global features $F$ are fed into the max pooling layer to obtain the global information descriptor, expressed as follows:

[0009] $f_g = \mathrm{MaxPool}(F)$ (1)

[0010] where $\mathrm{MaxPool}(\cdot)$ denotes the max pooling operation and $f_g$ is the output global information descriptor;

[0011] A global receptive layer mechanism is introduced into the global information module to increase the global receptive field, thereby obtaining more effective global features;

[0012] The global receptive layer mechanism first divides the global features into blocks, with the number of blocks set by the numbers of horizontal and vertical slices chosen for the feature. The block features are then segmented by channel. Each newly generated block feature is composed of a large share of features from its own block and a small share of features from the channel segments of other blocks, and the new block features together form a new complete global feature. A max pooling operation on this complete global feature yields the global feature descriptor after introducing the global receptive layer.

[0013] The specific process for obtaining the complete global features is as follows:

[0014] Two offsets control the sampling position for the channel-based segmentation, and the channels are further split into subgroups so that each subgroup has a set number of channels. On this basis the block features are segmented along the spatial dimensions, with each segmented part containing 1 pixel. The quantities involved are the sampling-based channel segmentation array, the number of channels of the feature, and the height and width of the global feature;

[0015] The offset is defined as follows:

[0016] (2)

[0017] where k and l are block indexes;

[0018] The block indexes are applied to the two offsets to generate a pair of index-specific offsets; the sampling location is then expressed as follows:

[0019] (3)

[0020] where the operator in formula (3) denotes the feature block operation;

[0021] Each segmented feature obtained after channel-based segmentation is combined to generate a complete global feature. The formula for each newly generated segmented feature is shown below:

[0022] (4)

[0023] where the index in formula (4) identifies the newly generated segmented feature block after channel-based combination.

[0024] The local information module divides the global features into local blocks and introduces a local receptive layer to mine contextual information from each local block feature.

[0025] (5)

[0026] where the input is the j-th block feature after local segmentation, N is the number of blocks into which the global feature is divided, the output is the j-th block feature generated by the local receptive layer, the remaining inputs are the features of the other blocks after partitioning, and the feature height, feature width, and number of feature channels give the feature dimensions;

[0027] A square ring partitioning strategy is used to divide the processed block features into regions;

[0028] The square ring partitioning strategy assumes that the center of the input image and the center of the extracted block features are aligned, and divides the entire block feature into regions according to distance from the image center;

[0029] (6)

[0030] where the output is the j-th block feature after square ring division, m is the number of regions produced by the square ring division, and the operator in formula (6) denotes the square ring partitioning process;

[0031] All features from the partitions are used to obtain feature descriptors through max pooling.

[0032] A global-local information fusion module assists in mining and utilizing potential information in images;

[0033] The global features are downsampled so that they have the same dimension as the locally segmented features, and the two are merged to generate new multi-block local features. A hybrid information receptive strategy is introduced to recombine the merged local features into multiple new block features; its specific process is similar to the global receptive layer mechanism. These new block features are combined to generate hybrid global features with the same dimension as the original global features. A convolutional block attention module is introduced into the global-local hybrid module to mine the information of the hybrid global features. The features generated by the global-local hybrid module are max-pooled to generate a unified form of feature descriptor.

[0034] The classifier module includes a fully connected layer, a normalization layer, a dropout layer, and a classification layer; the classifier module predicts the geographic labels of corresponding features based on the features output by each module branch in the shared feature space.

[0035] The outputs of the global information module branch, the local information module branch, and the global-local hybrid module branch are collectively denoted as the feature $f$. Taking $f$ as input, the classifier module outputs a vector $z$ whose dimension equals the number of geographic label categories $C$;

[0036] $z = \mathcal{F}_{\mathrm{classifier}}(f)$ (7)

[0037] During training, the cross-entropy loss function is used. It is defined as follows:

[0038] $\hat{p}(y \mid f) = \dfrac{\exp(z(y))}{\sum_{c=1}^{C} \exp(z(c))}$ (8)

[0039] $L_{CE} = -\log \hat{p}(y \mid f)$ (9)

[0040] where $z(y)$ is the logit score of the true geographic label $y$, $\hat{p}(y \mid f)$ is the predicted probability that $f$ belongs to geographic label $y$, and $z(c)$ is the predicted logit score of geographic label $c$.

[0041] The beneficial effects of this invention are as follows: to address non-robust feature extraction in cross-view image geolocation algorithms caused by changes in viewpoint and content, this invention proposes a cross-view geolocation model based on a global receptive layer. First, feature information from the image is acquired through a global information module and a local information module, helping the network better learn information from the image. Furthermore, a global-local fusion module is introduced to allow local information to assist global features, thereby better learning potential information in the image. Second, a global receptive layer, a local receptive layer, and a hybrid information receptive strategy are introduced into the respective modules to enhance the extraction of contextual information from the image and improve model performance. Finally, testing on the University-1652 dataset demonstrates that the proposed network framework outperforms state-of-the-art algorithms, validating the algorithm's effectiveness.

Attached Figure Description

[0042] Figure 1 is a flowchart of the present invention.

Detailed Implementation

[0043] Figure 1 is the main flowchart of the technical solution of this invention. As shown in Figure 1, the cross-view geolocation method based on information fusion joint representation learning proposed in this invention includes the following steps:

[0044] (1) Training data: This invention constructs a relevant training set using RGB images from the University-1652 dataset and provides input images for the network;

[0045] (2) Model training: The network framework proposed in this invention is designed as a dual-branch architecture, with identical structures in both branches but no shared weights. Each branch in the framework is mainly divided into three modules: a global information module, a local information module, and a global-local hybrid module.

[0046] The network model designed in this invention uses ResNet-50 as the backbone network to extract features and obtains global features through this backbone. The extraction of global features can be expressed by the following formula:

[0047] $F = \mathcal{F}_{\mathrm{backbone}}(x)$ (1)

[0048] where $x$ is the input image and $F$ is the global feature map extracted from $x$.

[0049] a. Global Information Module

[0050] First, we introduce the global information module. Although this invention feeds the input images obtained from different platforms into different network branches, the model structure of each branch is the same; only the weights are not shared between the branches.

[0051] After the global features are obtained, they are passed to the global information module. We believe that global information in an image has a significant impact on the model's performance. Therefore, this invention designs this module to deeply mine the global information of the image and provide two global feature descriptors for the final result: a global information descriptor and a feature descriptor obtained after introducing the global receptive layer. First, we obtain the global information descriptor by inputting the global features $F$ obtained from the backbone network into a max pooling layer. This process can be represented by the following formula:

[0052] $f_g = \mathrm{MaxPool}(F)$ (2)

[0053] where $\mathrm{MaxPool}(\cdot)$ denotes the max pooling operation and $f_g$ is the output feature descriptor.
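The pooling step above can be illustrated with a minimal pure-Python sketch. It assumes the pooling is global over the spatial dimensions, so a C x H x W feature map collapses to one value per channel; the patent text only says "max pooling", and the function name and toy values are hypothetical.

```python
def global_max_pool(feature_map):
    """Reduce a C x H x W feature map (nested lists) to a length-C descriptor."""
    # For each channel, keep only the largest activation over all spatial positions.
    return [max(max(row) for row in channel) for channel in feature_map]


# Toy feature map with C=2 channels and a 2 x 2 spatial grid (illustrative values).
toy_features = [
    [[0.1, 0.7], [0.3, 0.2]],
    [[-1.0, -0.5], [-0.2, -0.9]],
]
descriptor = global_max_pool(toy_features)
print(descriptor)
```

In a real ResNet-50 branch the same reduction would run on a tensor with many more channels, typically through a deep learning framework rather than nested lists.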

[0054] Simply using pooling layers to extract global information from global features is insufficient to capture the complete information. Therefore, this invention introduces a global receptive layer mechanism into the global module to increase the network's global receptive field and thus obtain more effective global features. The mechanism first divides the features into a set number of blocks. The blocks are then further segmented by channel, and each newly generated block feature is composed of a large share of features from the original location and a small share of features from other blocks, so that together they form a new complete global feature. This is achieved by using two offsets to control the sampling location and by further splitting the channels into subgroups, each with a set number of channels, after which the features are segmented along the spatial dimensions so that each part contains 1 pixel. The offset can be defined as follows:

[0055] (3)

[0056] where k and l are block indexes. The block indexes are applied to the two offsets to generate a pair of index-specific offsets, and the sampling location can then be represented as follows:

[0057] (4)

[0058] Based on this, each new block feature can be combined to generate a complete global feature. Each newly generated block feature can be formulated as follows:

[0059] (5)

[0060] where the index in formula (5) identifies the newly generated feature block.

[0061] By dividing and merging the global features, each block in the merged feature set can obtain information about the original location and global context, which is equally important. This increases the global receptive field of the global features, allowing the merged global features to uncover potential key information in the input image and thus improve the robustness of the features. Finally, a new global feature descriptor is obtained through formula (2) to help the network model improve performance.
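The divide-and-merge idea can be sketched as follows. This is only an illustration of "mostly own-block channels plus a few channels borrowed from other blocks": the patent's actual offset and sampling formulas are not reproduced in this text, so the grouping rule, the cyclic choice of source block, and the names `mix_blocks_by_channel`, `group_size`, and `borrowed_groups` are all assumptions.

```python
def mix_blocks_by_channel(blocks, group_size, borrowed_groups):
    """Rebuild each block from its own channels plus a few channels of other blocks.

    blocks: list of N blocks, each a list of C per-channel slices.
    The last `borrowed_groups` channel groups of every new block are taken
    from neighbouring blocks (assumed cyclic shift); all other groups stay put.
    """
    n, channels = len(blocks), len(blocks[0])
    if channels % group_size != 0:
        raise ValueError("channel count must be divisible by group_size")
    num_groups = channels // group_size
    if not 0 <= borrowed_groups <= num_groups:
        raise ValueError("borrowed_groups must be between 0 and the group count")
    first_borrowed = num_groups - borrowed_groups
    mixed = []
    for j in range(n):
        new_block = []
        for c in range(channels):
            group = c // group_size
            # Offset 0 keeps the block's own channel; 1, 2, ... borrow from other blocks.
            offset = 0 if group < first_borrowed else group - first_borrowed + 1
            new_block.append(blocks[(j + offset) % n][c])
        mixed.append(new_block)
    return mixed


# Four blocks with eight channels each; strings stand in for channel slices.
blocks = [[f"b{j}c{c}" for c in range(8)] for j in range(4)]
mixed = mix_blocks_by_channel(blocks, group_size=2, borrowed_groups=1)
print(mixed[0])
```

With these settings each new block keeps six of its own eight channel slices and borrows two from the next block, and every original slice is used exactly once, so the merged feature keeps the original dimensions.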

[0062] b. Local information module

[0063] In cross-view geolocation tasks, the content of the image changes greatly because the viewpoint changes significantly, so the contextual information in the image must be extracted to assist the model. To make better use of this contextual information, we adopt a local block approach that mines as much additional information from the image as possible, improving the stability of the network model. After the extracted global features are divided into blocks, each block carries part of the information in the global features, but the information content differs between blocks. To further increase the receptive field of each part, we introduce a local receptive layer strategy that improves its ability to mine contextual information. Because the feature content of each block is different, each block draws content that it lacks from the other blocks while preserving its own feature content as far as possible. This fully mines the contextual information in the image and provides richer feature information for the next operation. The process is similar to formula (5) and can be formulated as follows:

[0064] (6)

[0065] where the input is the j-th block feature after segmentation and N is the number of blocks into which the global feature is divided.

[0066] Building on this, to better utilize contextual information, we employ a square ring partitioning strategy to divide the processed block features into regions. The square ring strategy assumes that the center of the image is approximately aligned with the center of the feature map, and divides the entire region into blocks based on distance from the image center. Since geographic targets are typically located at the center of the image, while other relevant information is distributed elsewhere, this division also means that the divided regions are approximately spatially aligned across different viewpoints. This increases the similarity of corresponding parts across viewpoints, which helps ensure model accuracy. This process can be formalized as follows:

[0067] (7)

[0068] where the output is the j-th block feature after processing, m is the number of regions, and the operator in formula (7) denotes the square ring partitioning process.

[0069] This segmentation strategy can not only obtain geographic target information in the image, but also obtain contextual information regions at different distances from the geographic targets. Therefore, this strategy can effectively assist the network model in mining contextual information in the image. In addition, all segmented features will obtain feature descriptors through formula (2) to improve the accuracy of the final model.
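A square ring partition followed by max pooling might look like the sketch below. It assumes rings of equal width measured by the normalized Chebyshev distance from the feature-map center; the patent's formula is not reproduced in this text, so that distance rule and the names `square_ring_index` and `square_ring_max_pool` are illustrative assumptions.

```python
def square_ring_index(row, col, height, width, num_rings):
    """Return the ring (0 = innermost) that a spatial position falls into."""
    center_y, center_x = (height - 1) / 2, (width - 1) / 2
    # Normalised Chebyshev distance: 0 at the centre, approaching 1 at the border.
    dist = max(abs(row - center_y) / (height / 2), abs(col - center_x) / (width / 2))
    return min(int(dist * num_rings), num_rings - 1)


def square_ring_max_pool(feature_map, num_rings):
    """Max-pool a C x H x W map inside each ring -> num_rings descriptors of length C."""
    height, width = len(feature_map[0]), len(feature_map[0][0])
    descriptors = [[float("-inf")] * len(feature_map) for _ in range(num_rings)]
    for c, channel in enumerate(feature_map):
        for row in range(height):
            for col in range(width):
                ring = square_ring_index(row, col, height, width, num_rings)
                descriptors[ring][c] = max(descriptors[ring][c], channel[row][col])
    return descriptors


# One-channel 4 x 4 map whose values are 0..15 in row-major order.
toy_map = [[[row * 4 + col for col in range(4)] for row in range(4)]]
ring_descriptors = square_ring_max_pool(toy_map, num_rings=2)
print(ring_descriptors)
```

Here the inner ring covers the central 2 x 2 positions and the outer ring covers the border. A ring that contains no positions (possible when `num_rings` is large relative to the map) would keep the `-inf` placeholder, so the ring count should be chosen to fit the feature-map size.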

[0070] c. Global-Local Information Hybrid Module

[0071] To better utilize global and contextual information in the input image, this invention designs a global-local information fusion module to assist the model in fully mining and utilizing the potential information in the image. Since each part of the global image after being segmented by the local information module contains a large amount of contextual information, and global features contain even more complete and crucial information, it may be difficult to discover this crucial information due to the excessive amount of content in global features. Therefore, this invention introduces local information to assist global features in acquiring more potential information, thereby improving model performance.

[0072] Since the global features and the segmented features have different dimensions, the global features first need to be downsampled so that the processed features are consistent with the segmented features. Then, the processed global features and the segmented features are merged to generate multiple features. On this basis, a hybrid information receptive strategy is introduced to recombine the newly generated features into multiple new segmented features. Finally, these features are combined to generate a global feature with the same dimensions as the original global feature. The newly generated global feature consists of the original global features and the segmented local features, so it also contains more contextual information. Since the attention mechanism can effectively obtain key information in the image, this invention introduces a convolutional block attention module (CBAM) in the global-local information fusion module to mine more useful information in the processed features and further improve the stability of the model. It is worth noting that the features finally generated in the global-local information fusion module are also converted into a unified form of feature descriptor through formula (2) to improve the accuracy of the final model.

[0073] d. Loss function

[0074] Through the three modules described above, each branch will obtain some features. However, these features are extracted from different branches, and due to different acquisition platforms, they may have different distributions and cannot be directly used for feature matching. To solve this problem, this invention will set a mapping function to map all images from different acquisition sources to a shared feature space. In this feature space, the feature distance of the same geographic label from different platforms will be closer, while the feature distance of different geographic labels will be wider.

[0075] The classifier consists of four parts: a fully connected layer (FC), a batch normalization layer (BN), a dropout layer, and a classification layer (CLS), which is itself a fully connected layer. The classifier module predicts the geographic label of each feature. Given a part feature $f$ as input, the classifier module outputs a vector $z$ whose dimension equals the number of geographic label categories $C$. This process can be expressed by the following formula:

[0076] $z = \mathcal{F}_{\mathrm{classifier}}(f)$ (8)

[0077] During training, this invention selects the cross-entropy loss function to train the network model. The cross-entropy loss function is defined as follows:

[0078] $\hat{p}(y \mid f) = \dfrac{\exp(z(y))}{\sum_{c=1}^{C} \exp(z(c))}$ (9)

[0079] $L_{CE} = -\log \hat{p}(y \mid f)$ (10)

[0080] where $z(y)$ is the logit score of the true geographic label $y$, and $\hat{p}(y \mid f)$ is the predicted probability that $f$ belongs to geographic label $y$, that is, the logit score normalized by the softmax function in formula (9). This invention optimizes the entire network model by accumulating the cross-entropy loss across the different image parts on the different platforms.
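As a rough illustration of accumulating the loss over parts and platforms, the sketch below computes softmax cross-entropy per part and sums the results. The plain unweighted sum, the dictionary layout, and the names `cross_entropy` and `total_loss` are assumptions; the text does not state how the terms are weighted.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one logit vector and one true class index."""
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]


def total_loss(part_logits_by_platform, label):
    """Sum the cross-entropy over every part of every platform branch."""
    return sum(
        cross_entropy(logits, label)
        for parts in part_logits_by_platform.values()
        for logits in parts
    )


# Hypothetical classifier outputs for one location with C=3 classes.
example_logits = {
    "satellite": [[2.0, 0.0, 0.0], [1.0, 1.0, 1.0]],
    "uav": [[0.0, 3.0, 0.0]],
}
loss = total_loss(example_logits, label=0)
print(round(loss, 4))
```

Both branches are scored against the same geographic label, which is what pulls features of the same location from different platforms toward each other in the shared feature space.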

[0081] (3) Image retrieval: During the testing phase, we also output the features of different branches through the classifier module. Based on this, we determine whether the images of different platforms represent the same geographic target by comparing the similarity of the features of different input images, thereby matching accurate results.
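The retrieval step can be sketched as ranking gallery features by their similarity to a query feature. Cosine similarity is assumed here as the similarity measure because the text only says "similarity"; the function names and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat an all-zero feature as matching nothing.
    return dot / (norm_a * norm_b)


def rank_gallery(query, gallery):
    """Return gallery names ordered from most to least similar to the query."""
    return sorted(
        gallery,
        key=lambda name: cosine_similarity(query, gallery[name]),
        reverse=True,
    )


# A UAV-view query feature matched against three satellite-view gallery features.
query_feature = [1.0, 0.0]
gallery_features = {"a": [0.9, 0.1], "b": [0.0, 1.0], "c": [-1.0, 0.0]}
ranking = rank_gallery(query_feature, gallery_features)
print(ranking)
```

The top-ranked gallery entry is taken as the predicted match for the query location.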

[0082] To address the issue of robust feature extraction in previous cross-view geolocation algorithms, this invention proposes a global-local model network based on a global receptive layer to solve the cross-view geolocation problem. First, we acquire feature information from the image through global and local modules to help the network better learn information from the image. Furthermore, a global-local fusion module is introduced to allow local information to assist global features, thereby better learning the latent information in the image. Second, we introduce a global receptive layer into each module to enhance the extraction of contextual information from the image and improve model performance. Finally, testing on the University-1652 dataset demonstrates that the proposed network framework outperforms state-of-the-art algorithms, validating the algorithm's effectiveness.

[0083] To verify the effectiveness of the algorithm in long-term visual positioning accuracy, this invention was tested on the University-1652 dataset. According to the experimental results, the proposed algorithm achieves Recall@1 and AP of 85.85% and 87.88% respectively when retrieving satellite-view imagery from UAV-view imagery, and 91.73% and 85% respectively when retrieving UAV-view imagery from satellite-view imagery.
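For readers unfamiliar with the two metrics, the sketch below computes Recall@1 and average precision (AP) for a single query from a ranked list of gallery labels. It uses the standard non-interpolated AP definition; the official University-1652 evaluation script may compute AP differently, and the function names and the toy ranking are illustrative only.

```python
def recall_at_k(ranked_labels, true_label, k=1):
    """1.0 if a correct match appears in the top-k results, else 0.0."""
    return 1.0 if true_label in ranked_labels[:k] else 0.0


def average_precision(ranked_labels, true_label):
    """Mean of the precision values taken at the rank of each correct match."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == true_label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Ranked gallery labels for one query whose true location is "A".
ranked = ["A", "B", "A", "C"]
r1 = recall_at_k(ranked, "A", k=1)
ap = average_precision(ranked, "A")
print(r1, round(ap, 4))
```

The reported figures are these per-query values averaged over all queries in each retrieval direction.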

Claims

1. A cross-view geolocation method based on information fusion joint representation learning, characterized in that: satellite-view image data and UAV-view image data are acquired respectively; a cross-view geolocation model is established, the model being a dual-branch network whose two branches have the same structure but do not share weights; each branch network includes a ResNet-50 backbone network module, a global information module branch, a local information module branch, a global-local hybrid module branch, and a classifier module; the satellite-view image data and the UAV-view image data are input into different branch networks of the cross-view geolocation model; the ResNet-50 backbone network module extracts global features $F$ from each image; the global features are processed by the global information module branch, the local information module branch, and the global-local hybrid module branch; each module branch is trained to learn a mapping function that maps images from different sources to a shared feature space, in which features of images with the same geographic label from different platforms lie closer together and features of images with different geographic labels lie farther apart; the classifier module classifies the features in the feature space; the global information module is used to deeply mine the global information of the image and provide two global feature descriptors for the final result, namely the global information descriptor and the feature descriptor after introducing the global receptive layer;
the global information module includes a max pooling layer and a global receptive layer mechanism; the global features $F$ are fed into the max pooling layer to obtain the global information descriptor, expressed as follows: $f_g = \mathrm{MaxPool}(F)$ (1), where $\mathrm{MaxPool}(\cdot)$ denotes the max pooling operation and $f_g$ is the output global information descriptor; a global receptive layer mechanism is introduced into the global information module to increase the global receptive field, thereby obtaining more effective global features; the global receptive layer mechanism first divides the global features into blocks, with the number of blocks set by the numbers of horizontal and vertical slices chosen for the feature, and then segments the block features by channel; each newly generated block feature is composed of a large share of features from its own block and a small share of features from the channel segments of other blocks, and the new block features together form a new complete global feature; a max pooling operation on the complete global feature yields the global feature descriptor after introducing the global receptive layer; the specific process for obtaining the complete global feature is as follows: two offsets control the sampling position for the channel-based segmentation, and the channels are further split into subgroups so that each subgroup has a set number of channels; on this basis the block features are segmented along the spatial dimensions, with each segmented part containing 1 pixel; the quantities involved are the sampling-based channel segmentation array, the number of channels of the feature, and the height and width of the global feature; the offset is defined as follows: (2), where k and l are block indexes; the block indexes are applied to the two offsets to generate a pair of index-specific offsets, and the sampling position is then expressed as follows: (3), where the operator denotes the feature block operation; each segmented feature obtained after channel-based segmentation is combined to generate a complete global feature;
the formula for each newly generated segmented feature is as follows: (4), where the index identifies the newly generated segmented feature block after channel-based combination.

2. The cross-view geolocation method based on information fusion joint representation learning according to claim 1, characterized in that: the local information module divides the global features into local blocks and introduces a local receptive layer to mine contextual information from each local block feature: (5), where the input is the j-th block feature after local segmentation, N is the number of blocks into which the global feature is divided, the output is the j-th block feature generated by the local receptive layer, the remaining inputs are the features of the other blocks after partitioning, and the feature height, feature width, and number of feature channels give the feature dimensions; a square ring partitioning strategy is used to divide the processed block features into regions; the square ring partitioning strategy assumes that the center of the input image and the center of the extracted block features are aligned, and divides the entire block feature into regions according to distance from the image center: (6), where the output is the j-th block feature after square ring division, m is the number of regions produced by the square ring division, and the operator denotes the square ring partitioning process; all partitioned features are converted into feature descriptors through max pooling.

3. The cross-view geolocation method based on information fusion joint representation learning according to claim 2, characterized in that: a global-local information fusion module assists in mining and utilizing potential information in images; the global features are downsampled so that they have the same dimension as the locally segmented features, and the two are merged to generate new multi-block local features; a hybrid information receptive strategy is introduced to recombine the merged local features into multiple new block features, its specific process being similar to the global receptive layer mechanism; these new block features are combined to generate hybrid global features with the same dimension as the original global features; a convolutional block attention module is introduced into the global-local hybrid module to mine the information of the hybrid global features; the features generated by the global-local hybrid module are max-pooled to generate a unified form of feature descriptor.

4. The cross-view geolocation method based on information fusion joint representation learning according to claim 3, characterized in that: the classifier module includes a fully connected layer, a normalization layer, a dropout layer, and a classification layer; the classifier module predicts the geographic labels of the corresponding features based on the features output by each module branch in the shared feature space; the outputs of the global information module branch, the local information module branch, and the global-local hybrid module branch are collectively denoted as the feature $f$; taking $f$ as input, the classifier module outputs a vector $z$ whose dimension equals the number of geographic label categories $C$: $z = \mathcal{F}_{\mathrm{classifier}}(f)$ (7); during training, the cross-entropy loss function is used, defined as follows: $\hat{p}(y \mid f) = \dfrac{\exp(z(y))}{\sum_{c=1}^{C} \exp(z(c))}$ (8), $L_{CE} = -\log \hat{p}(y \mid f)$ (9), where $z(y)$ is the logit score of the true geographic label $y$, $\hat{p}(y \mid f)$ is the predicted probability that $f$ belongs to geographic label $y$, and $z(c)$ is the predicted logit score of geographic label $c$.