Remote sensing image positioning method and system based on deep feature vector retrieval matching

The method combines deep feature vector retrieval and matching, a model fine-tuned on remote sensing image samples, an inverted index structure, and the RANSAC algorithm to position remote sensing images efficiently and accurately. It addresses the insufficient accuracy and efficiency of existing remote sensing image positioning and is suitable for geographic information collection, environmental monitoring, and urban planning.

CN122244705A · Pending · Publication Date: 2026-06-19 · NAT UNIV OF DEFENSE TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
NAT UNIV OF DEFENSE TECH
Filing Date
2026-03-17
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing remote sensing image positioning technologies are sensitive to illumination changes and differences in shooting perspective, and they incur high computation and storage costs for high-dimensional feature vectors, making it difficult to meet the requirements for efficient, accurate, and stable positioning. Furthermore, existing deep learning models do not fully consider the unique characteristics of remote sensing images, resulting in insufficient matching accuracy.

Method used

A deep feature extraction model fine-tuned on remote sensing image samples is used to extract global feature vectors. The vectors are clustered and stored using an inverted index structure. Candidate tiles are obtained through nearest neighbor retrieval, local feature matching is performed against them, and mismatched points are removed with the RANSAC algorithm. The affine transformation matrix is then calculated to achieve accurate mapping.

Benefits of technology

It improves the initial accuracy and overall processing efficiency of remote sensing image positioning, solves the problem of high matching error rate in complex scenarios, and meets the accuracy requirements of practical applications such as geographic information collection, environmental monitoring, and urban planning.

✦ Generated by Eureka AI based on patent content.

Figure CN122244705A_ABST

Abstract

This invention relates to a remote sensing image positioning method and system based on deep feature vector retrieval and matching, belonging to the field of remote sensing image positioning technology. The method includes: first, using a deep feature extraction model fine-tuned on remote sensing image samples to extract the first global feature vectors of global orthophoto tiles, and constructing a global feature vector database through inverted index clustering; second, preprocessing the remote sensing image to be located, extracting its second global feature vector, and obtaining the top K candidate tile images through nearest neighbor retrieval in the database; third, performing coarse and fine local feature matching between the image to be located and the candidate tile images, eliminating mismatched points using the RANSAC algorithm, and obtaining the reference tile image and corresponding point information; finally, calculating the affine transformation matrix from the corresponding points to complete the mapping from pixel coordinates to the geographic coordinate system, and outputting accurate geographic location information. This invention effectively improves the accuracy and efficiency of remote sensing image positioning and is suitable for various civilian application scenarios.

Description

Technical Field

[0001] This invention relates to the field of remote sensing image positioning technology, and in particular to a remote sensing image positioning method and system based on deep feature vector retrieval and matching.

Background Technology

[0002] In the field of remote sensing imagery, accurate positioning of remote sensing images is a crucial link supporting core operations such as geographic information acquisition, environmental monitoring, emergency response, and urban planning. Its positioning accuracy and efficiency directly determine the reliability and practicality of subsequent applications. Currently, remote sensing image positioning mainly relies on obtaining initial geographic coordinates using Global Navigation Satellite System (GNSS) signals, or manually selecting feature points and comparing them with reference images containing known geographic information to achieve coordinate calibration of the remote sensing image to be positioned. However, these traditional positioning methods face numerous technical bottlenecks and struggle to meet the demands for efficient, accurate, and stable positioning.

[0003] To improve positioning efficiency, some technical solutions attempt to use automatic matching methods based on manually designed features. These methods extract local feature points from images using algorithms and compare them to achieve matching and positioning between the remote sensing image to be located and a reference image. However, such methods are sensitive to changes in image illumination, shooting angle, and temporal phase. In the complex scenes commonly encountered in remote sensing, inaccurate feature point extraction and high matching error rates are prone to occur, leading to decreased positioning accuracy. Furthermore, with the development of remote sensing technology, the amount of reference image data globally has exploded. Traditional feature matching methods require comparing all reference image data one by one. The computation and storage costs of high-dimensional feature vectors are enormous, making it difficult to achieve rapid retrieval of global-scale reference image databases and failing to meet the needs of real-time or near-real-time positioning.

[0004] Meanwhile, existing deep learning-based image matching technologies are mostly based on training models with natural scene images, without fully considering the unique characteristics of remote sensing images. Remote sensing images cover diverse scenes such as terrain, landforms, and artificial features, and have characteristics such as large resolution differences and sparse textures of features. Directly applying models trained with natural scenes can easily lead to inaccurate feature representation, resulting in matching accuracy that cannot meet the needs of scenarios with high positioning accuracy requirements, such as geographic information collection and refined urban planning.

Summary of the Invention

[0005] Therefore, it is necessary to provide a remote sensing image positioning method and system based on deep feature vector retrieval and matching that addresses the problem that the accuracy of existing remote sensing image positioning struggles to meet the needs of practical applications.

[0006] A remote sensing image localization method based on deep feature vector retrieval and matching, the method comprising:

[0007] Step 1: The first global feature vector of the global orthophoto tile is extracted using the deep feature extraction model fine-tuned on remote sensing image samples. The first global feature vector is then clustered and stored using an inverted index structure to construct a global feature vector database.
Step 2: Preprocess the remote sensing image to be located to obtain a preprocessed remote sensing image.
Step 3: Use the fine-tuned deep feature extraction model to extract the second global feature vector of the preprocessed remote sensing image, and input the second global feature vector into the global feature vector database for nearest neighbor retrieval to obtain the candidate tile images ranked in the top K by similarity.
Step 4: Perform coarse and fine local feature extraction and matching on the remote sensing image to be located and the candidate tile image, and output the feature matching result; then use the RANSAC algorithm to remove mismatched points in the feature matching result to obtain the reference tile image and corresponding point information.
Step 5: Based on the corresponding point information, calculate the affine transformation matrix between the remote sensing image to be located and the reference tile image. Map the pixel coordinates of the remote sensing image to be located to the geographic coordinate system of the reference tile image through the affine transformation matrix, and output accurate geographic location information.

[0008] On the other hand, a remote sensing image positioning system based on deep feature vector retrieval and matching is also provided, including:
The feature vector library construction module, which is used to extract the first global feature vector of global orthophoto tiles using a deep feature extraction model fine-tuned on remote sensing image samples, and to cluster and store the first global feature vector through an inverted index structure to build a global feature vector database.
The remote sensing image preprocessing module, which is used to preprocess the remote sensing image to be located to obtain a preprocessed remote sensing image.
The global feature retrieval module, which is used to extract the second global feature vector of the preprocessed remote sensing image using the fine-tuned deep feature extraction model, and to input the second global feature vector into the global feature vector database for nearest neighbor retrieval to obtain the candidate tile images ranked in the top K by similarity.
The local feature fine matching module, which is used to extract and match local features at both coarse and fine granularity between the remote sensing image to be located and the candidate tile image, and to output the feature matching results; the RANSAC algorithm is then used to remove mismatched points in the feature matching results to obtain the reference tile image and corresponding point information.
The coordinate mapping and positioning module, which is used to calculate the affine transformation matrix between the remote sensing image to be located and the reference tile image based on the corresponding point information, and to map the pixel coordinates of the remote sensing image to be located to the geographic coordinate system of the reference tile image through the affine transformation matrix, thereby outputting accurate geographic location information.

[0009] Compared with existing technologies, the remote sensing image localization method and system based on deep feature vector retrieval and matching provided by this invention have the following beneficial effects:
1. A deep feature extraction model fine-tuned on remote sensing image samples is used to extract global feature vectors. This model adapts to the unique features of remote sensing images, such as terrain, landforms, and man-made features. It solves the problem of inaccurate feature representation of remote sensing images by existing deep learning models trained on natural scenes. This improves the recognition and representation capabilities of global feature vectors for remote sensing images, laying a precise foundation for subsequent retrieval and matching work from the source of feature extraction, and improving the initial accuracy of positioning.

[0010] 2. By clustering and storing the first global feature vector using an inverted index structure and constructing a global feature vector database, combined with a nearest neighbor retrieval strategy, efficient retrieval of massive global orthophoto tiles is achieved. This avoids the high computational overhead of comparing all reference image data one by one in traditional feature matching methods. It can quickly lock the candidate tile images in the top K positions of similarity ranking, significantly reducing the processing range of subsequent local feature matching. While ensuring positioning accuracy, it significantly improves the overall processing efficiency of remote sensing image positioning.

[0011] 3. By performing coarse and fine local feature extraction and matching on remote sensing images, sub-pixel precision matching point locations are obtained. At the same time, by removing mismatched points from the feature matching results, outliers that do not conform to geometric constraints are effectively identified and eliminated, while globally consistent high-precision inliers and the corresponding point information are retained. This solves the problems of traditional manually designed feature matching methods being sensitive to complex scenes and having a high matching error rate, significantly improving the accuracy and robustness of local feature matching, and providing a reliable geometric basis for subsequent coordinate mapping.

[0012] 4. By calculating a high-precision affine transformation matrix from the corresponding point information, a precise mapping from the pixel coordinates of the remote sensing image to the geographic coordinate system of the reference tile image is achieved. The entire method forms an end-to-end remote sensing image positioning process that includes precise global feature extraction, efficient retrieval of massive data, fine matching of local features, and precise coordinate mapping. The technical means of each link support and cooperate with each other, effectively improving the positioning accuracy of remote sensing images as a whole. This enables the method to meet the accuracy requirements of remote sensing image positioning in practical application scenarios such as geographic information collection, environmental monitoring, emergency response, and urban planning, and gives it strong practical application value.

Attached Figure Description

[0013] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings required in the embodiments will be briefly described below. It should be understood that the following drawings only show some embodiments of the present invention, and those skilled in the art can obtain other related drawings based on these drawings without creative effort.

[0014] Figure 1 is a flowchart illustrating the remote sensing image localization method based on deep feature vector retrieval and matching in Example 1; Figure 2 is a schematic diagram of the deep feature extraction model architecture in Example 1; Figure 3 is a flowchart illustrating the LoFTR algorithm in Example 1; Figure 4 is a block diagram of the remote sensing image positioning system based on deep feature vector retrieval and matching in Example 2; Figure 5 is a diagram of the internal structure of the computer device in Example 3.

[0015] The objectives, features, and advantages of this invention will be further explained in conjunction with the embodiments and with reference to the accompanying drawings.

Detailed Implementation

[0016] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, and not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0017] It should be noted that in this invention, the use of terms such as "first," "second," etc., is for descriptive purposes only and should not be construed as indicating or implying their relative importance or implicitly specifying the number of technical features indicated. Therefore, a feature defined as "first" or "second" may explicitly or implicitly include at least one of that feature. In the description of this invention, "a plurality of" means at least two, such as two, three, etc., unless otherwise explicitly specified.

[0018] It is understood that the technical solutions of the various embodiments of the present invention can be combined with each other, but only if they are based on the ability of those skilled in the art to implement them. When the combination of technical solutions is contradictory or cannot be implemented, it should be considered that such combination of technical solutions does not exist and is not within the scope of protection claimed by the present invention.

[0019] The embodiments of the present invention will now be described in detail with reference to the accompanying drawings.

Example 1

[0020] This embodiment discloses a remote sensing image positioning method based on deep feature vector retrieval and matching. By combining deep feature extraction, efficient index retrieval, coarse and fine local matching, and accurate coordinate mapping, it solves the problem that the positioning accuracy of existing remote sensing images is difficult to meet the needs of practical applications.

[0021] As shown in Figure 1, the remote sensing image localization method based on deep feature vector retrieval and matching provided in this embodiment includes the following steps: Step 1: The first global feature vector of the global orthophoto tile is extracted using the deep feature extraction model fine-tuned on remote sensing image samples. The first global feature vector is then clustered and stored using an inverted index structure to construct a global feature vector database.

[0022] Step 2: Preprocess the remote sensing image to be located to obtain a preprocessed remote sensing image.

[0023] Step 3: The fine-tuned deep feature extraction model is used to extract the second global feature vector of the preprocessed remote sensing image. The second global feature vector is then input into the global feature vector database for nearest neighbor retrieval to obtain the candidate tile images ranked in the top K by similarity.

[0024] Step 4: Perform coarse and fine local feature extraction and matching between the remote sensing image to be located and the candidate tile image, and output the feature matching results; then use the RANSAC algorithm to remove mismatched points in the feature matching results to obtain the reference tile image and corresponding point information.

[0025] Step 5: Based on the corresponding point information, calculate the affine transformation matrix between the remote sensing image to be located and the reference tile image. The pixel coordinates of the remote sensing image to be located are mapped to the geographic coordinate system of the reference tile image through the affine transformation matrix, and the accurate geographic location information is output.

[0026] In the specific implementation of step 1, the deep feature extraction model is first fine-tuned using remote sensing image samples. Preferably, the deep feature extraction model is the DINOv2 model. Since the original DINOv2 model is mainly trained on natural images, this embodiment collects 5000 sets of remote sensing image sample data to fine-tune the model and thereby eliminate the domain differences between remote sensing images and natural images. Through fine-tuning, the model can more accurately capture the unique terrain, landform, and texture features in remote sensing images.

[0027] After fine-tuning the model, global orthophoto tiles are input into the fine-tuned deep feature extraction model. The ViT encoder in the model performs multi-scale feature extraction on the global orthophoto tiles. The network architecture of the deep feature extraction model is shown in Figure 2. Given a global orthophoto tile of size C×H×W, C indicates the number of feature channels, and H and W represent the image height and width, respectively. First, a convolutional layer with a kernel size of 16×16 and a stride of 16 is used to segment the input image into multiple image blocks, each block being 16×16 in size. Next, each image block is converted into a one-dimensional vector through a linear mapping. Then, the one-dimensional vector is input into a Transformer encoder, which consists of multiple Vision Transformer (ViT) encoding blocks. The number of encoding blocks, m, depends on the specific scale of the model; preferably, m is set to 12. Each encoding block includes three parts: layer normalization, multi-head self-attention, and a multilayer perceptron. The internal processing flow of each encoding block is as follows: the one-dimensional vector first passes through an embedding block to generate position-aware features, then undergoes layer normalization before being input into the multi-head attention module to capture global dependencies between features, and element-wise addition is used to achieve a residual connection. Subsequently, it passes through a normalization layer again before being input into the multilayer perceptron module to complete the nonlinear transformation of the features. Finally, element-wise addition is used to achieve a second residual connection, outputting the encoded features.
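The block segmentation and encoding-block flow described above can be written compactly. The following is a sketch in standard pre-norm ViT notation, not the patent's own formulas: H and W are the image height and width, and the symbols x_i (a flattened image block), E (the linear mapping), E_pos (the position embedding), and z_l (the token sequence after encoding block l) are assumed names. Any class or register tokens the model may use are omitted.

```latex
\begin{aligned}
N &= \frac{H}{16}\cdot\frac{W}{16} && \text{(number of } 16\times 16 \text{ image blocks)}\\
z_0 &= [\,x_1E;\ x_2E;\ \dots;\ x_NE\,] + E_{\mathrm{pos}} && \text{(linear mapping plus position embedding)}\\
z'_\ell &= z_{\ell-1} + \mathrm{MSA}\big(\mathrm{LN}(z_{\ell-1})\big) && \text{(first residual connection)}\\
z_\ell &= z'_\ell + \mathrm{MLP}\big(\mathrm{LN}(z'_\ell)\big), \quad \ell = 1,\dots,m && \text{(second residual connection)}
\end{aligned}
```

For example, a 224×224 tile would yield N = 14 × 14 = 196 image blocks.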

[0028] After processing by the Transformer encoder, a feature matrix of size c×D (number of channels × feature vector dimension) is obtained. Layer normalization is then performed on the feature matrix, transforming it into a 1×n one-dimensional feature vector. Finally, the first global feature vector is output through the head module. It is worth noting that the head module can be flexibly selected according to the specific image task requirements.

[0029] After extracting the first global feature vector, the global feature vector database is constructed. Specifically, to achieve efficient retrieval of global orthophoto tiles, this embodiment uses the FAISS library to construct the global feature vector database. First, the extracted first global feature vector is input into the IndexIVFFlat inverted index structure of the FAISS library for cluster space partitioning, resulting in the partitioned first global feature vector. Then, the partitioned first global feature vectors are stored according to their cluster spaces, and a spatial indexing mechanism is established for each cluster space, ultimately constructing the global feature vector database. During storage, each vector is assigned to the nearest cluster center; during retrieval, the system first locates cluster centers that may contain similar vectors, performing a precise search only within that specific space, thereby significantly reducing the computational overhead of high-dimensional vectors and accelerating the retrieval process.
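The store-by-cluster and search-by-cluster behavior described above can be illustrated with a small pure-Python inverted file. This is a conceptual sketch only, not the FAISS implementation: the names `train_centroids`, `ToyIVFIndex`, and the tile identifiers are hypothetical, and the squared Euclidean distance is an assumed metric. The `nprobe` argument plays the role of the number of cluster spaces visited per query.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vector, centroids):
    """Index of the centroid closest to `vector`."""
    return min(range(len(centroids)), key=lambda i: sq_dist(vector, centroids[i]))


def train_centroids(vectors, n_clusters, iters=10, seed=0):
    """Plain k-means (Lloyd's algorithm) to partition the feature space."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, n_clusters)]
    for _ in range(iters):
        buckets = [[] for _ in range(n_clusters)]
        for v in vectors:
            buckets[nearest(v, centroids)].append(v)
        for c, members in enumerate(buckets):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


class ToyIVFIndex:
    """Inverted file: one list of (tile_id, vector) entries per cluster."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.lists = [[] for _ in centroids]

    def add(self, tile_id, vector):
        # Storage: each vector goes into the list of its nearest centroid.
        self.lists[nearest(vector, self.centroids)].append((tile_id, vector))

    def search(self, query, k, nprobe=1):
        # Retrieval: rank centroids first, then scan only the nprobe closest lists.
        order = sorted(range(len(self.centroids)),
                       key=lambda i: sq_dist(query, self.centroids[i]))
        candidates = [e for c in order[:nprobe] for e in self.lists[c]]
        candidates.sort(key=lambda e: sq_dist(query, e[1]))
        return [(tile_id, sq_dist(query, vec)) for tile_id, vec in candidates[:k]]
```

A small `nprobe` scans fewer vectors but can miss a true neighbor that sits just across a cluster boundary, which is the usual speed versus recall trade-off of inverted-file indexes.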

[0030] This step fine-tunes the deep feature extraction model using remote sensing domain samples, enabling the model to accurately capture the unique features of remote sensing images. This enhances the representational ability of the first global feature vector for remote sensing images, laying a precise feature foundation for subsequent retrieval and matching. Simultaneously, by clustering and storing the first global feature vector using an inverted index structure and establishing a spatial index, the orderly storage of massive orthophoto tile features at a global scale is achieved. This significantly reduces the computational overhead of subsequent nearest neighbor retrieval and improves the storage and retrieval efficiency of massive data.

[0031] In the specific implementation of step 2, Gaussian smoothing and downsampling are first performed sequentially on the remote sensing image to be located, constructing a multi-resolution image pyramid arranged from high-resolution images at the lower levels to low-resolution images at the upper levels. The specific algorithm is the pyramid downsampling method, which includes: setting the remote sensing image to be located as layer 0 of the pyramid; performing Gaussian filter convolution smoothing on the current layer image to eliminate high-frequency noise and prevent aliasing caused by downsampling, where the Gaussian filter is the convolution kernel used for image smoothing and denoising; after smoothing, removing the even-numbered rows and columns of the current layer image to complete downsampling, generating the next-layer image at 1/4 the size of the image before downsampling. The above Gaussian smoothing and downsampling operations are repeated until the preset number of layers is reached, resulting in a set of multi-resolution remote sensing images to be located.
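The smooth-then-drop-rows loop above can be sketched in a few lines. This is a minimal illustration, assuming a grayscale image stored as a list of rows, a 5-tap binomial kernel as the Gaussian approximation, and edge replication at the borders; none of these specifics come from the source, and the function names are hypothetical.

```python
def gaussian_blur(img):
    """Separable 5-tap binomial blur ([1, 4, 6, 4, 1] / 16) with edge replication."""
    kernel = [1, 4, 6, 4, 1]
    h, w = len(img), len(img[0])

    def clamp(i, n):
        return max(0, min(n - 1, i))

    # Horizontal pass, then vertical pass over the horizontally blurred rows.
    rows = [[sum(kernel[t] * img[y][clamp(x + t - 2, w)] for t in range(5)) / 16.0
             for x in range(w)] for y in range(h)]
    return [[sum(kernel[t] * rows[clamp(y + t - 2, h)][x] for t in range(5)) / 16.0
             for x in range(w)] for y in range(h)]


def downsample(img):
    """Drop the even-numbered rows and columns (counting from 1), keeping 1/4 of the pixels."""
    return [row[::2] for row in img[::2]]


def build_pyramid(img, levels):
    """Layer 0 is the input image; each further layer is blurred, then downsampled."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(downsample(gaussian_blur(pyramid[-1])))
    return pyramid
```

Blurring before dropping rows matters: skipping the blur keeps high-frequency detail that the coarser grid cannot represent, which shows up as aliasing.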

[0032] After constructing the multi-resolution image pyramid, random augmentation processing is performed on the multi-resolution remote sensing image set to be located. Random augmentation processing includes geometric transformation and illumination transformation. Geometric transformation is one or more combinations of rotation, translation, scaling, and flipping to change the spatial position and shape of the image, simulating the perspective differences when an aircraft takes pictures at different headings, altitudes, and attitudes. Illumination transformation is one or two combinations of brightness adjustment and contrast adjustment to simulate the imaging effects under different weather conditions, time of day, or light intensity.
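A random augmentation of this kind might look like the sketch below. It is an assumption-laden illustration: it covers only flipping and 90-degree rotation on the geometric side, uses a linear gain and bias for contrast and brightness, and the parameter ranges and the function name `augment` are hypothetical rather than taken from the source.

```python
import random


def augment(img, rng):
    """Apply one random geometric change and one random illumination change.

    `img` is a grayscale image as a list of rows with values in [0, 255].
    """
    out = [row[:] for row in img]
    if rng.random() < 0.5:
        out = [row[::-1] for row in out]            # horizontal flip
    for _ in range(rng.randrange(4)):               # rotate by 0, 90, 180 or 270 degrees
        out = [list(r) for r in zip(*out[::-1])]    # one 90-degree clockwise turn
    gain = rng.uniform(0.8, 1.2)                    # contrast factor (assumed range)
    bias = rng.uniform(-20.0, 20.0)                 # brightness offset (assumed range)
    return [[min(255.0, max(0.0, gain * (p - 128.0) + 128.0 + bias)) for p in row]
            for row in out]
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which helps when comparing retrieval results across runs.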

[0033] The images that have undergone Gaussian smoothing, downsampling, and random augmentation are used as preprocessed remote sensing images.

[0034] This step constructs a multi-resolution image pyramid, enabling multi-scale representation of the remote sensing image to be located. This reduces the computational load of subsequent retrieval of high-resolution images while ensuring matching accuracy, significantly improving image matching speed. Simultaneously, random augmentation in both the geometric and illumination dimensions improves the usability of features in complex real-world scenarios, allowing subsequent feature matching to adapt to remote sensing images captured under different conditions and enhancing the robustness of the localization method across scenarios.

[0035] In the specific implementation of step 3, the deep feature extraction model fine-tuned in step 1 is called to extract features from the preprocessed remote sensing image obtained in step 2. The extraction process is the same as for the first global feature vector and is not repeated here. The result is the second global feature vector, which characterizes the remote sensing image to be located.

[0036] The second global feature vector is then input into the global feature vector database constructed in step 1 for nearest neighbor retrieval. Nearest neighbor retrieval employs a two-level retrieval strategy based on cluster space: first, the distance between the second global feature vector and each cluster center in the global feature vector database is calculated to quickly locate the cluster center to which similar vectors belong; the cluster center serves as the feature reference point for each cluster space. Then, nearest neighbor search is performed only within this cluster center and its adjacent cluster spaces, and the similarity between vectors is calculated, avoiding brute-force comparison of all feature vectors. This two-level search strategy significantly reduces retrieval complexity, enabling tile matching globally to achieve response times in seconds or even milliseconds.

[0037] After similarity calculation, all search results are sorted in descending order of similarity, and the top K stored (first) global feature vectors are extracted. The orthophoto tiles corresponding to these feature vectors are considered to be the regions geographically closest to the remote sensing image to be located. Therefore, based on the storage association between each stored feature vector and its tile, the corresponding global orthophoto tiles are retrieved, and finally the candidate tile images are obtained.

[0038] This step extracts the second global feature vector by reusing the fine-tuned deep feature extraction model, ensuring consistency in feature extraction between the remote sensing image to be located and the reference tile images and providing a consistent basis for retrieval matching. At the same time, through a nearest neighbor retrieval strategy based on cluster space, it achieves rapid retrieval of massive global reference images, enabling the approximate geographical area of the remote sensing image to be located to be locked in seconds or even milliseconds. This significantly narrows the processing range of subsequent local feature fine matching and improves the processing efficiency of the overall positioning process while ensuring retrieval accuracy.

[0039] In the specific implementation of step 4, coarse and fine local feature extraction and matching are first performed between the remote sensing image to be located and the candidate tile image obtained in step 3. This process is implemented using the LoFTR algorithm. As shown in Figure 3, the LoFTR algorithm includes the following steps: Step 401: Input the remote sensing image to be located and the candidate tile image, and use a standard convolutional structure with a feature pyramid network to extract the coarse feature map and fine feature map of each image respectively. Preferably, the coarse feature map is 1/8 of the original image dimension, and the fine feature map is 1/2 of the original image dimension.

[0040] Step 402: Flatten the coarse feature map into a one-dimensional vector, add position encoding, and input it into the LoFTR module. The coarse feature maps of the remote sensing image to be located and the candidate tile image are transformed through the self-attention layer and the mutual attention layer. The self-attention layer integrates the neighborhood context feature information of a single image, and the mutual attention layer integrates the interaction feature information between the remote sensing image to be located and the candidate tile image. Finally, the coarse feature maps are transformed into a feature form that is easy to match, and the coarse matching feature representations of the two images are output respectively.

[0041] Step 403: Based on the coarse matching feature representation, calculate the matching score matrix for all positions using a product method; then, based on the matching score matrix, calculate the optimal matching probability of the coarse matching feature representation at all positions using the dual-Softmax method; subsequently, based on this optimal matching probability, filter the matching score matrix using the nearest neighbor algorithm, filtering out outlier matching pairs whose optimal matching probability scores are lower than a preset threshold, and retaining only matching pairs whose scores are higher than the threshold to obtain coarse matching pairs.
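Step 403 can be sketched as follows. This is a minimal illustration, assuming the "product method" is a dot product between feature vectors, that the nearest neighbor filter is a mutual nearest neighbor check, and that scores are scaled by a temperature; the default values and the function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dual_softmax_matches(feat_a, feat_b, temperature=0.1, threshold=0.2):
    """Return coarse matches (i, j, probability) between two lists of feature vectors."""
    # Matching score matrix: dot product of every position pair.
    scores = [[sum(x * y for x, y in zip(fa, fb)) / temperature for fb in feat_b]
              for fa in feat_a]
    row_sm = [softmax(row) for row in scores]                 # softmax over positions in B
    col_sm = [softmax(list(col)) for col in zip(*scores)]     # softmax over positions in A
    prob = [[row_sm[i][j] * col_sm[j][i] for j in range(len(feat_b))]
            for i in range(len(feat_a))]
    matches = []
    for i, row in enumerate(prob):
        j = max(range(len(row)), key=row.__getitem__)
        # Keep (i, j) only if it is also the best row for column j and clears the threshold.
        is_mutual = all(prob[i][j] >= prob[r][j] for r in range(len(prob)))
        if is_mutual and prob[i][j] >= threshold:
            matches.append((i, j, prob[i][j]))
    return matches
```

Multiplying the row-wise and column-wise softmax values means a pair scores highly only when each position prefers the other, which suppresses one-to-many matches in repetitive texture.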

[0042] Step 404: For each matching point pair in the coarse matching pairs, the pair is mapped to the corresponding positions on the fine feature maps, and the fine feature map regions around those positions are cropped and input into the LoFTR module. Through the self-attention layer and the mutual attention layer, the cropped fine feature maps of the remote sensing image to be located and the candidate tile image are converted into fine matching feature representations, denoted as FA and FB, respectively. The feature similarity between the central feature of FA and all features in FB is calculated, and then the sub-pixel precision matching point position in FB is calculated based on the feature similarity. The feature matching results are obtained by integrating all sub-pixel precision matching point positions.
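One common way to turn feature similarities into a sub-pixel position is a soft-argmax: the similarities are converted to a probability map over the cropped window, and the expected coordinate under that map is taken. The sketch below assumes this approach, a dot-product similarity, and a square window of feature vectors; the function name and the `temperature` parameter are hypothetical.

```python
import math


def subpixel_offset(center_feat, window_feats, temperature=1.0):
    """Expected (dx, dy) offset, in window cells, relative to the window centre.

    `center_feat` is the central feature of FA; `window_feats` is a w x w grid
    of feature vectors from FB.
    """
    w = len(window_feats)
    sims = [[sum(a * b for a, b in zip(center_feat, f)) / temperature for f in row]
            for row in window_feats]
    peak = max(max(row) for row in sims)
    weights = [[math.exp(s - peak) for s in row] for row in sims]   # unnormalised softmax
    total = sum(sum(row) for row in weights)
    centre = (w - 1) / 2.0
    dx = sum(weights[y][x] * (x - centre) for y in range(w) for x in range(w)) / total
    dy = sum(weights[y][x] * (y - centre) for y in range(w) for x in range(w)) / total
    return dx, dy
```

Because the result is a weighted average over all cells rather than the index of the single best cell, it can fall between grid positions, which is where the sub-pixel precision comes from.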

[0043] After the feature matching results are output, the RANSAC algorithm is used to remove mismatched points from the feature matching results. This includes: taking the feature matching results as a set of matching points, and repeatedly randomly sampling from the set of matching points according to the RANSAC algorithm to obtain a set of sampling points.

[0044] A geometric transformation model between the remote sensing image to be located and the candidate tile image is estimated based on the sampled point set. The consistency between the remaining matching points in the matching point set and the geometric constraints of the geometric transformation model is then verified. Outliers that do not conform to the geometric constraints are identified and removed, while inliers that conform to the global geometric constraints are retained. Outliers are mismatched points, and inliers are accurately matched points.
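
The sampling, model estimation, and consistency check described above can be sketched as follows. The text leaves the geometric transformation model open; this sketch assumes a 2-D affine model, and the iteration count, pixel tolerance, and function names are hypothetical.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2x3 affine matrix A with dst ~= src @ A[:, :2].T + A[:, 2]."""
    X = np.hstack([src, np.ones((len(src), 1))])
    sol, *_ = np.linalg.lstsq(X, dst, rcond=None)   # shape (3, 2)
    return sol.T                                     # shape (2, 3)

def ransac_affine(src, dst, n_iters=200, tol=2.0, seed=0):
    """src, dst: (N, 2) matched point positions. Returns (A, inlier_mask)."""
    rng = np.random.default_rng(seed)
    X = np.hstack([src, np.ones((len(src), 1))])
    best = np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        # Random minimal sample: three point pairs determine an affine model.
        idx = rng.choice(len(src), size=3, replace=False)
        A = fit_affine(src[idx], dst[idx])
        # Points whose reprojection error is below tol agree with the model.
        err = np.linalg.norm(X @ A.T - dst, axis=1)
        inliers = err < tol
        if inliers.sum() > best.sum():
            best = inliers
    # Refit on all inliers of the best model; the rest are treated as mismatches.
    return fit_affine(src[best], dst[best]), best
```

A production system would more likely call an existing robust estimator; the loop is written out here only to make the three stages visible.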

[0045] Finally, based on the matching consistency of the inliers, a unique reference tile image is determined from the candidate tile images. The point information of the inliers in the remote sensing image to be located and in the reference tile image is extracted, and the two sets of point information are integrated to form the corresponding point information, that is, the precise positions of the matched feature points in the remote sensing image to be located and in the reference tile image.

[0046] This step utilizes the LoFTR algorithm to achieve coarse-grained and fine-grained local feature extraction and matching. By leveraging the feature fusion capabilities of self-attention and mutual attention layers, it can extract highly discriminative features even in remote sensing image scenes with sparse textures. Furthermore, the sub-pixel precision of the matching point locations significantly improves the accuracy of local feature matching. Simultaneously, the geometric verification and mismatch point removal using the RANSAC algorithm effectively filter out outliers that do not conform to geometric constraints, ensuring the reliability and accuracy of the corresponding point information and providing a reliable geometric basis for subsequent coordinate mapping and geolocation.

[0047] In the specific implementation of step 5, the pixel coordinates of the remote sensing image to be located and the geographic coordinates of the reference tile image are extracted from the corresponding point information obtained in step 4, forming two sets of corresponding coordinates.

[0048] Based on these two sets of coordinates, an affine transformation matrix between the remote sensing image to be located and the reference tile image is obtained through a linear regression scheme. Preferably, the linear regression scheme uses the least squares method. The affine transformation matrix covers the geometric relationships between the images, namely translation, rotation, scaling, and shearing, and describes the spatial mapping relationship between the remote sensing image to be located and the reference tile image.
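
As a minimal sketch of the least squares solution, assume the corresponding points are given as two (N, 2) arrays with N of at least 3. The function names solve_affine and apply_affine are hypothetical.

```python
import numpy as np

def solve_affine(pixel_xy, geo_xy):
    """Least-squares 2x3 affine matrix mapping pixel coordinates to geographic ones.
    pixel_xy, geo_xy: (N, 2) arrays of corresponding points, N >= 3."""
    X = np.hstack([pixel_xy, np.ones((len(pixel_xy), 1))])   # homogeneous, (N, 3)
    params, *_ = np.linalg.lstsq(X, geo_xy, rcond=None)      # (3, 2)
    return params.T                                          # (2, 3)

def apply_affine(A, pixel_xy):
    """Map (N, 2) pixel coordinates through the 2x3 affine matrix A."""
    return pixel_xy @ A[:, :2].T + A[:, 2]
```

The six entries of the matrix hold the translation in the last column and the combined rotation, scaling, and shearing in the left 2x2 block. With more than three point pairs the system is overdetermined and the least squares solution averages out residual matching noise.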

[0049] After calculating the affine transformation matrix, input all pixel coordinates of the remote sensing image to be located into the affine transformation matrix to complete the mapping of the pixel coordinates of the remote sensing image to be located to the geographic coordinate system of the reference tile image.

[0050] Because global orthophoto tiles possess a pre-defined standardized geographic coordinate system, the precise geographic location information of the center point or key target of the image to be located is calculated using a coordinate mapping formula, combined with the latitude, longitude, and resolution parameters of the reference tile image itself, and this precise geographic location information is then output. Preferably, the output information may also include positioning accuracy evaluation indicators and the corrected positioning image.
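
The coordinate mapping formula is not given in the text. The sketch below assumes a north-up tile in a longitude/latitude coordinate system with a uniform resolution in degrees per pixel; tiles in a projected system such as Web Mercator would need a different formula. The class TileGeoref and its fields are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TileGeoref:
    lon_left: float      # longitude of the tile's top-left corner, degrees
    lat_top: float       # latitude of the tile's top-left corner, degrees
    deg_per_px_x: float  # resolution along columns, degrees per pixel
    deg_per_px_y: float  # resolution along rows, degrees per pixel

    def pixel_to_geo(self, col, row):
        """Convert a tile pixel position to (longitude, latitude)."""
        lon = self.lon_left + col * self.deg_per_px_x
        lat = self.lat_top - row * self.deg_per_px_y   # rows increase southward
        return lon, lat
```

Applying pixel_to_geo to the tile-side matched points yields the geographic coordinates used to fit the affine transformation matrix, after which the center point of the image to be located can be mapped through that matrix.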

[0051] This step solves the affine transformation matrix based on high-precision corresponding points through linear regression, realizing the accurate mapping of the remote sensing image to be located from pixel coordinates to geographic coordinates, and completing the automatic conversion from image visual features to geographic coordinates. The whole process forms an end-to-end remote sensing image positioning system, which makes the output geographic location information have the characteristics of high precision and high reliability, and can directly meet the usage needs of actual application scenarios, effectively solving the core problem of insufficient positioning accuracy of remote sensing images in existing technologies.

[0052] It should be understood that, although the steps in Figure 1 and Figure 3 of this embodiment are shown sequentially as indicated by the arrows, they are not necessarily executed in that order. Unless otherwise specified in this document, there is no strict order in which these steps are performed; they can be executed in other orders. At least some of the steps in Figure 1 and Figure 3 may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these sub-steps or stages is not necessarily sequential; they can be executed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.

[0053] Example 2 Based on the remote sensing image localization method based on deep feature vector retrieval and matching in Example 1, this example discloses a remote sensing image localization system based on deep feature vector retrieval and matching. As shown in Figure 4, the system includes: a feature vector library construction module 601, a remote sensing image preprocessing module 602, a global feature retrieval module 603, a local feature fine matching module 604, and a coordinate mapping and positioning module 605, wherein: The feature vector library construction module 601 is used to extract the first global feature vector of the global orthophoto tiles using a depth feature extraction model fine-tuned on remote sensing image samples, and to cluster and store the first global feature vector through an inverted index structure to construct a global feature vector database.

[0054] The remote sensing image preprocessing module 602 is used to preprocess the remote sensing image to be located to obtain a preprocessed remote sensing image.

[0055] The global feature retrieval module 603 is used to extract the second global feature vector of the preprocessed remote sensing image using the fine-tuned depth feature extraction model, and input the second global feature vector into the global feature vector database for nearest neighbor retrieval to obtain the candidate tile images ranked in the top K by similarity.

[0056] The local feature fine matching module 604 is used to extract and match local features at both coarse and fine granularity between the remote sensing image to be located and the candidate tile image, and output the feature matching results. Then, the RANSAC algorithm is used to remove mismatched points in the feature matching results to obtain the reference tile image and corresponding point information.

[0057] The coordinate mapping and positioning module 605 is used to calculate the affine transformation matrix between the remote sensing image to be positioned and the reference tile image based on the corresponding point information. The affine transformation matrix maps the pixel coordinates of the remote sensing image to be positioned to the geographic coordinate system of the reference tile image, and outputs accurate geographic location information.

[0058] In this embodiment, the specific working process and working principle of the feature vector library construction module 601, remote sensing image preprocessing module 602, global feature retrieval module 603, local feature fine matching module 604, and coordinate mapping and positioning module 605 are the same as those in Embodiment 1, and therefore will not be described again in this embodiment. Each unit module can be implemented entirely or partially through software, hardware, or a combination thereof. Each unit module can be embedded in or independent of the processor in a computer device in hardware form, or it can be stored in the memory of a computer device in software form, so that the processor can call and execute the operations corresponding to the above unit modules.

[0059] Example 3 As shown in Figure 5, this embodiment discloses a computer device including a transmitter, a receiver, a memory, and a processor. The transmitter is used to send instructions and data, the receiver is used to receive instructions and data, the memory is used to store computer-executable instructions, and the processor is used to execute the computer-executable instructions stored in the memory to implement the method in Embodiment 1 above.

[0060] It should be noted that the aforementioned memory can be either standalone or integrated with the processor. When the memory is set up independently, the computer device also includes a bus for connecting the memory and the processor.

[0061] Example 4 This embodiment discloses a computer-readable storage medium storing computer-executable instructions. When a processor executes the computer-executable instructions, it implements the method in Embodiment 1 above.

[0062] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

[0063] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0064] The embodiments described above are merely examples of several implementations of the present invention, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present invention, and these modifications and improvements all fall within the scope of protection of the present invention.

Claims

1. A remote sensing image localization method based on deep feature vector retrieval and matching, characterized in that the method includes: Step 1: Extract the first global feature vector of the global orthophoto tiles using a depth feature extraction model fine-tuned on remote sensing image samples, and cluster and store the first global feature vector using an inverted index structure to construct a global feature vector database; Step 2: Preprocess the remote sensing image to be located to obtain a preprocessed remote sensing image; Step 3: Use the fine-tuned depth feature extraction model to extract the second global feature vector of the preprocessed remote sensing image, and input the second global feature vector into the global feature vector database for nearest neighbor retrieval to obtain the candidate tile images ranked in the top K by similarity; Step 4: Perform coarse-grained and fine-grained local feature extraction and matching on the remote sensing image to be located and the candidate tile images, and output the feature matching results; then use the RANSAC algorithm to remove mismatched points from the feature matching results to obtain the reference tile image and the corresponding point information; Step 5: Based on the corresponding point information, calculate the affine transformation matrix between the remote sensing image to be located and the reference tile image, map the pixel coordinates of the remote sensing image to be located to the geographic coordinate system of the reference tile image through the affine transformation matrix, and output accurate geographic location information.

2. The remote sensing image localization method based on depth feature vector retrieval and matching according to claim 1, characterized in that, in step 1, the first global feature vector of the global orthophoto tile is extracted using a depth feature extraction model fine-tuned from remote sensing image samples, including: The global orthophoto tiles are input into the fine-tuned DINOv2 model, and the ViT encoder in the model is used to extract multi-scale features from the global orthophoto tiles to obtain the feature matrix. The feature matrix is subjected to layer normalization to obtain the first global feature vector.

3. The remote sensing image localization method based on depth feature vector retrieval and matching according to claim 2, characterized in that, The first global feature vector is clustered and stored using an inverted index structure to construct a global feature vector database, including: The first global feature vector is input into the IndexIVFFlat inverted index structure of the FAISS library to perform cluster space partitioning, and the partitioned first global feature vector is obtained. The first global feature vector after partitioning is stored according to the cluster space and a spatial indexing mechanism is established to construct the global feature vector database.

4. The remote sensing image localization method based on depth feature vector retrieval and matching according to claim 1, characterized in that, Step 2 involves preprocessing the remote sensing image to be located, including: Gaussian smoothing and downsampling are performed sequentially on the remote sensing images to be located to construct a multi-resolution image pyramid and obtain a set of multi-resolution remote sensing images to be located. Random augmentation processing is performed on the multi-resolution remote sensing image set to be located to obtain the preprocessed remote sensing image.

5. The remote sensing image localization method based on deep feature vector retrieval and matching according to claim 4, characterized in that, Gaussian smoothing and downsampling are sequentially performed on the remote sensing images to be located to construct a multi-resolution image pyramid, resulting in a set of multi-resolution remote sensing images to be located, including: The remote sensing image to be located is set as the 0th layer of the pyramid. After performing Gaussian filter convolution smoothing on the current layer image, even rows and even columns are removed to complete downsampling and generate the next layer image. Repeat the Gaussian smoothing and downsampling operations as described above until the preset number of layers is reached to obtain the set of multi-resolution remote sensing images to be located.

6. The remote sensing image localization method based on depth feature vector retrieval and matching according to any one of claims 1 to 5, characterized in that, in step 3, the second global feature vector is input into the global feature vector database for nearest neighbor retrieval, including: The second global feature vector is input into the global feature vector database, and the distance between the second global feature vector and each cluster center in the global feature vector database is calculated to locate the cluster center to which the similar vectors belong. Within the cluster space of that cluster center and the adjacent cluster spaces, a nearest neighbor search is performed for the second global feature vector and the similarity is calculated. The search results are sorted in descending order of similarity, the top K results are extracted, and the corresponding global orthophoto tiles are retrieved to obtain the candidate tile images.

7. The remote sensing image localization method based on depth feature vector retrieval and matching according to any one of claims 1 to 5, characterized in that, In step 4, coarse and fine granular local feature extraction and matching are performed on the remote sensing image to be located and the candidate tile image, and the feature matching results are output, including: Extract coarse and fine feature maps from the remote sensing image to be located and the candidate tile image; The coarse feature map is converted into a coarse matching feature representation using the LoFTR module; the matching score matrix of the coarse matching feature representation is calculated, and coarse matching pairs are obtained by filtering the matching score matrix using the nearest neighbor algorithm; The coarse matching pairs are mapped to the fine feature map, the fine feature map regions corresponding to the coarse matching pairs are cropped, and the fine feature map is converted into a fine matching feature representation through the LoFTR module; The sub-pixel precision matching point positions are calculated based on the fine matching feature representation, and the feature matching results are obtained by integrating all sub-pixel precision matching point positions.

8. The remote sensing image localization method based on depth feature vector retrieval and matching according to claim 7, characterized in that, in step 4, the RANSAC algorithm is used to remove mismatched points from the feature matching results, including: The feature matching results are used as a set of matching points, and a set of sampling points is obtained by repeatedly sampling randomly from the set of matching points according to the RANSAC algorithm; Based on the sampling point set, estimate the geometric transformation model between the remote sensing image to be located and the candidate tile image; Verify the consistency between the remaining matching points in the matching point set and the geometric constraints of the geometric transformation model, identify and remove outliers that do not conform to the geometric constraints, and retain inliers that conform to the global geometric constraints; Based on the matching consistency of the inliers, a unique reference tile image is determined. The point information of the inliers in the remote sensing image to be located and in the reference tile image is extracted and integrated to form the corresponding point information.

9. The remote sensing image localization method based on depth feature vector retrieval and matching according to claim 8, characterized in that, In step 5, based on the corresponding point information, the affine transformation matrix between the remote sensing image to be located and the reference tile image is calculated, including: Extract the pixel coordinates of the remote sensing image to be located and the geographic coordinates of the reference tile image from the corresponding point information; Based on the pixel coordinates and the geographic coordinates, an affine transformation matrix between the remote sensing image to be located and the reference tile image is obtained by linear regression.

10. A remote sensing image positioning system based on deep feature vector retrieval and matching, characterized in that the system includes: The feature vector library construction module is used to extract the first global feature vector of the global orthophoto tiles using a depth feature extraction model fine-tuned on remote sensing image samples, and to cluster and store the first global feature vector through an inverted index structure to build a global feature vector database. The remote sensing image preprocessing module is used to preprocess the remote sensing image to be located to obtain a preprocessed remote sensing image. The global feature retrieval module is used to extract the second global feature vector of the preprocessed remote sensing image using the fine-tuned depth feature extraction model, and input the second global feature vector into the global feature vector database for nearest neighbor retrieval to obtain the candidate tile images ranked in the top K by similarity. The local feature fine matching module is used to extract and match local features at both coarse and fine granularity between the remote sensing image to be located and the candidate tile images, and output the feature matching results; then, the RANSAC algorithm is used to remove mismatched points from the feature matching results to obtain the reference tile image and the corresponding point information. The coordinate mapping and positioning module is used to calculate the affine transformation matrix between the remote sensing image to be located and the reference tile image based on the corresponding point information, and to map the pixel coordinates of the remote sensing image to be located to the geographic coordinate system of the reference tile image through the affine transformation matrix, thereby outputting accurate geographic location information.