Unmanned aerial vehicle photograph and ground three-dimensional real scene matching coordinate calculation method and system

By acquiring images with a UAV's onboard camera, combining semantic segmentation and motion detection to identify dynamic interference, extracting features from the static background image, matching them with a real-scene 3D model, and applying the PnP algorithm with bundle adjustment optimization, the method addresses the low accuracy of UAV coordinate estimation and achieves high-precision UAV positioning.

CN122244145A · Pending · Publication Date: 2026-06-19 · BEIJING ZHONGYAO ZITU TECH CO LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: BEIJING ZHONGYAO ZITU TECH CO LTD
Filing Date: 2026-01-27
Publication Date: 2026-06-19

Application Information


IPC: G06T7/73; G06V20/17; G06V10/26; G06V10/40; G06V20/64

AI Tagging

Application Domain

Image analysis Three-dimensional object recognition

Technology Topics

Background image Aerial image


Abstract

This application provides a coordinate calculation method and system for matching UAV photos with 3D ground real-world scenes, belonging to the technical field of geographic coordinate calculation. The method includes: acquiring aerial images of the mission area using a UAV's onboard camera; identifying and masking dynamic interference areas in the aerial images by combining semantic segmentation and motion detection to obtain static background images, and extracting features from them to obtain two-dimensional image features; extracting features from a preset static urban real-scene 3D model to obtain 3D model features; matching the 2D image features with the 3D model features to establish a correspondence between 2D image points and 3D model points; based on the correspondence, using the PnP algorithm to calculate the initial exterior orientation elements of the UAV in the absolute geographic coordinate system, and optimizing them using bundle adjustment to obtain the optimized UAV visual pose; and fusing the optimized visual pose with the observation data from the inertial measurement unit to obtain the final 3D position and attitude of the UAV in the absolute geographic coordinate system.


Description

Technical Field

[0001] This application relates to the technical field of geographic coordinate calculation, and in particular to a coordinate calculation method and system for matching UAV photos with three-dimensional ground scenes.

Background Technology

[0002] To calculate the coordinates of a UAV, several methods are employed, including GPS-based positioning, which determines the UAV's position by receiving satellite signals. However, this method suffers from signal instability in complex environments, leading to inaccurate coordinate calculations. Another method uses inertial measurement units (IMUs) to measure the UAV's acceleration and angular velocity using accelerometers and gyroscopes to calculate its position and attitude. However, this method suffers from accumulated errors over time, reducing the reliability of the coordinate results. A third method utilizes ground control points. Control points with known coordinates are established on the ground, and the UAV's coordinates are determined by their relative positions to these control points. However, this method requires pre-positioning control points in the mission area, making it cumbersome. These methods fail to meet the demands for high-precision geographic information acquisition.

[0003] Therefore, there is an urgent need for a coordinate calculation method and system for matching UAV photos with 3D ground real-world scenes.

Summary of the Invention

[0004] To address the aforementioned technical issues, this application provides a coordinate calculation method and system for matching UAV photographs with 3D ground real-world scenes.

[0005] A first aspect of this application provides a coordinate calculation method for matching UAV photographs with 3D ground real-world scenes, including:
Acquiring images with the UAV's onboard camera to obtain aerial images of the mission area, the aerial images including dynamic interference objects and a static background;
Identifying and masking dynamic interference areas in the aerial images by combining semantic segmentation and motion detection to obtain static background images;
Performing feature extraction on the static background image to obtain two-dimensional image features;
Performing feature extraction on a preset static urban real-scene 3D model to obtain 3D model features;
Matching the two-dimensional image features with the 3D model features to establish a correspondence between two-dimensional image points and three-dimensional model points;
Based on the correspondence, using the PnP algorithm to calculate the initial exterior orientation elements of the UAV in the absolute geographic coordinate system, and optimizing the initial exterior orientation elements by bundle adjustment to obtain the optimized UAV visual pose;
Fusing the optimized UAV visual pose with the observation data from the inertial measurement unit to obtain the final three-dimensional position and attitude of the UAV in the absolute geographic coordinate system, wherein the observation data from the inertial measurement unit includes angular velocity and acceleration.

[0006] A second aspect of this application provides a coordinate calculation system for matching UAV photographs with 3D ground real-world scenes, comprising:
An image acquisition module, used to acquire images through the UAV's onboard camera to obtain aerial images of the mission area, the aerial images including dynamic interference objects and a static background;
A static extraction module, used to identify and mask dynamic interference areas in the aerial image by combining semantic segmentation and motion detection to obtain a static background image;
An image extraction module, used to extract features from the static background image to obtain two-dimensional image features;
A feature extraction module, used to extract features from a preset static urban real-scene 3D model to obtain the 3D model features;
A feature matching module, used to match the two-dimensional image features with the 3D model features to establish the correspondence between two-dimensional image points and three-dimensional model points;
A visual pose module, used to calculate the initial exterior orientation elements of the UAV in the absolute geographic coordinate system from the correspondence using the PnP algorithm, and to optimize the initial exterior orientation elements using bundle adjustment to obtain the optimized UAV visual pose;
A position and attitude module, used to fuse the optimized UAV visual pose with the observation data of the inertial measurement unit to obtain the final three-dimensional position and attitude of the UAV in the absolute geographic coordinate system, wherein the observation data of the inertial measurement unit includes angular velocity and acceleration.

[0007] A third aspect of this application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and running on the processor, wherein the processor executes the computer program to implement the steps of the coordinate calculation method for matching UAV photos with three-dimensional ground reality described above.

[0008] In a fourth aspect of this application, a computer-readable storage medium is provided, the computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps of the coordinate calculation method for matching UAV photos with three-dimensional ground reality described above.

[0009] The beneficial effects of the coordinate calculation method and system for matching UAV photos with 3D ground real-world scenes provided in this application are as follows: This application obtains static background images by acquiring aerial images of the mission area and identifying and masking dynamic interference areas, thus avoiding the influence of dynamic interference on subsequent processing; it extracts and matches features from the static background images and the 3D model to establish a correspondence, providing a basis for calculation; it uses the PnP algorithm to calculate and optimize the UAV's visual pose, which can improve the accuracy of pose calculation; and it fuses the optimized visual pose with the observation data of the inertial measurement unit to obtain a more accurate final 3D position and attitude of the UAV.

Attached Figure Description

[0010] Figure 1 is a flowchart of a coordinate calculation method for matching UAV photos with 3D ground real-world scenes, provided in an embodiment of this application; Figure 2 is a structural block diagram of a coordinate calculation system for matching UAV photos with 3D ground real-world scenes, provided in an embodiment of this application; Figure 3 is a schematic block diagram of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0011] In the following description, specific details such as particular system architectures and techniques are set forth for illustrative purposes and not for limitation, in order to provide a thorough understanding of the embodiments of this application. However, those skilled in the art will understand that this application may also be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods have been omitted so as not to obscure the description of this application with unnecessary detail.

[0012] To make the purpose, technical solution, and advantages of this application clearer, the following description is given with reference to Figures 1-3 and specific embodiments.

[0013] Please refer to Figure 1, which is a flowchart of a coordinate calculation method for matching UAV photographs with 3D ground real-world scenes according to an embodiment of this application. The method includes: S101: Acquire images using the UAV's onboard camera to obtain aerial images of the mission area; the aerial images include dynamic interference objects and a static background.

[0014] In this embodiment, the mission area is a pre-planned geographical range for UAV operations, with clearly defined geographical boundaries, and is the target area for aerial image acquisition. The aerial imagery includes dynamic interference objects and a static background. Dynamic interference objects are objects in the aerial imagery whose state or spatial location changes over time and which interfere with static background feature matching; examples include pedestrians and vehicles. The static background consists of ground features and environmental elements in the aerial imagery that have fixed spatial locations and stable shapes, serving as the matching benchmark for coordinate calculation; typical types include building facades, road markings, and fixed streetlights or utility poles.

[0015] S102: By combining semantic segmentation and motion detection, dynamic interference areas in aerial images are identified and masked to obtain static background images.

[0016] In this embodiment, semantic segmentation is a computer vision technique that uses a deep learning model to label each pixel of an aerial image with a category, such as pedestrians, vehicles, buildings, and roads, to achieve pixel-level ground feature classification. This embodiment uses a semantic segmentation network model to identify dynamic objects with specific semantics in the image.

[0017] Specifically, this embodiment employs an improved semantic segmentation network model. To address the need for identifying static features (buildings, roads, bridges, streetlights, etc.) in drone aerial photography scenarios, the feature extraction and segmentation head structure has been optimized. The specific modules are as follows: the input layer receives the original aerial image data; the backbone network extracts multi-scale features from the image; the feature pyramid network fuses multi-scale features to improve the ability to identify small targets; the region proposal network generates candidate regions for ground features; region of interest alignment unifies the feature scale of candidate regions; the segmentation head accurately segments ground features and outputs semantic labels; and the post-processing module filters invalid segmentation results and outputs the final semantic labels.

[0018] The semantic segmentation network model's input data includes raw images captured by the UAV's onboard camera, camera intrinsic parameters, and scene prior information (surveys of ground feature distribution and illumination condition annotations). The data format is fully compatible with the UAV hardware parameters and preprocessing workflow. Output data includes pixel-level semantic label maps, semantic confidence maps (floating-point, confidence range [0,1]), and a set of ground feature bounding boxes (JSON format), directly used for subsequent 2D-3D feature semantic consistency comparison, providing semantic constraints for UAV pose calculation. The training dataset includes 10,000 UAV aerial images (8,000 for training, 1,000 for validation, and 1,000 for testing), covering multiple scenes such as densely built-up areas and low-light conditions. Data augmentation strategies such as random rotation, scaling, and pixel transformation are used to improve generalization ability. The final test set achieved a recall rate of 0.89, with a small target recall rate of 0.87, meeting the semantic recognition requirements for UAV coordinate inference scenarios.

[0019] In this embodiment, motion detection is an analysis technique based on the temporal changes of image sequences. By comparing the pixel grayscale or feature changes between adjacent frames, it identifies objects whose position or shape changes over time, i.e., dynamic targets. Methods include frame differencing, optical flow, and background modeling. The combination is a fusion strategy of semantic segmentation and motion detection: first, semantic segmentation is used to screen candidate regions for dynamic objects at the category semantic level; then, motion detection is used to verify, at the temporal change level, whether each region is actually moving. The two complement each other to improve the accuracy of dynamic interference identification. Masking is a process of shielding and repairing the identified dynamic interference regions: first, the pixel information of the dynamic interference is obscured using a mask; then, an image inpainting algorithm is used to fill the obscured area, making the repaired area visually coherent with the surrounding static background and eliminating the influence of dynamic interference.
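As an illustration of the frame differencing method named above, the following is a minimal sketch and not the embodiment's implementation. It assumes two grayscale frames that have already been aligned to compensate for the UAV's own motion; the function name and the threshold of 25 gray levels are hypothetical.

```python
import numpy as np

def frame_difference_mask(prev_gray, curr_gray, threshold=25):
    """Flag pixels whose grayscale value changed by more than `threshold`.

    Both frames are assumed to be uint8 arrays of the same shape that have
    already been registered to each other (an assumption of this sketch).
    """
    # Widen to a signed type so the subtraction cannot wrap around.
    diff = np.abs(curr_gray.astype(np.int16) - prev_gray.astype(np.int16))
    return diff > threshold

# Toy example: a 2x2 bright object appears in an otherwise static frame.
prev_frame = np.zeros((4, 4), dtype=np.uint8)
curr_frame = prev_frame.copy()
curr_frame[1:3, 1:3] = 200
motion_mask = frame_difference_mask(prev_frame, curr_frame)
```

Optical flow or background modeling would replace the per-pixel difference here while producing the same kind of per-pixel motion evidence.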

[0020] Specifically, the process includes: First, using a semantic segmentation model to identify aerial images and generate a first mask that identifies vehicles and pedestrians. Based on the aerial images, a motion detection algorithm is used to identify moving regions and generate a second mask. Then, the first and second masks are fused to generate a comprehensive mask that identifies potential dynamic interference. This involves first obtaining the semantic category confidence of each pixel in the first mask and the motion saliency of each pixel in the second mask, then performing a fusion decision to generate a comprehensive mask that identifies interference regions. Next, a permanent static structure confidence map generated from a static urban 3D model is obtained. This confidence map is used to identify building facades and road surface areas. Finally, the comprehensive mask and the permanent static structure confidence map are subjected to a logical AND-NOT operation to generate the final dynamic interference mask. This final dynamic interference mask is then used to mask the aerial images, resulting in a static background image. This ensures that only targets that are semantically dynamic and exhibit motion in the time dimension are ultimately identified as dynamic interference requiring masking, greatly improving the system's accuracy and avoiding false detections or incorrect masking caused by stationary vehicles.
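The mask fusion described in this paragraph can be sketched as follows. This is a simplified illustration under stated assumptions: the text does not specify the fusion decision rule, so a per-pixel AND of thresholded semantic confidence and motion saliency is assumed here, and all thresholds and function names are hypothetical. Filling masked pixels with a constant stands in for the image inpainting step.

```python
import numpy as np

def fuse_dynamic_mask(semantic_conf, motion_saliency, static_conf,
                      sem_thresh=0.5, motion_thresh=0.5, static_thresh=0.5):
    """Combine the first mask, second mask and static-structure confidence map."""
    semantically_dynamic = semantic_conf >= sem_thresh   # first mask (vehicles, pedestrians)
    moving = motion_saliency >= motion_thresh            # second mask (motion detection)
    combined = semantically_dynamic & moving             # assumed fusion decision
    permanent_static = static_conf >= static_thresh      # facades, road surface from the 3D model
    return combined & ~permanent_static                  # logical AND-NOT

def apply_mask(image, mask, fill_value=0):
    """Blank out masked pixels (a placeholder for inpainting)."""
    out = image.copy()
    out[mask] = fill_value
    return out

semantic_conf = np.array([[0.9, 0.9, 0.1], [0.9, 0.2, 0.1]])
motion_saliency = np.array([[0.8, 0.1, 0.9], [0.8, 0.1, 0.1]])
static_conf = np.array([[0.0, 0.0, 0.0], [0.9, 0.0, 0.0]])
final_mask = fuse_dynamic_mask(semantic_conf, motion_saliency, static_conf)
static_background = apply_mask(np.full((2, 3), 255, dtype=np.uint8), final_mask)
```

In the toy data, pixel (0, 1) is a vehicle-class pixel with no motion (a parked vehicle) and is therefore left unmasked, while pixel (1, 0) is excluded because the model marks it as permanent static structure.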

[0021] In this embodiment, semantic segmentation technology, based on a deep learning model, identifies dynamic interference such as vehicles and pedestrians and generates masks. Features are extracted only from the static background area outside the mask. These static scene features include building outlines, road signs, and the corner points and edge lines of fixed objects. During processing, the robustness of localization can be ensured through the following enhancements: in the preprocessing stage, a static base model from which dynamic objects have already been filtered out is preferred; in the core coordinate calculation algorithm, robust estimation methods such as RANSAC are integrated to further eliminate outliers caused by dynamic objects or mismatches, ensuring that the solution depends only on correct static feature points; and the coordinates calculated by this method can also be fused, through Kalman filtering, with data from inertial measurement units, GPS, and other sensors to provide continuous position output when visual matching temporarily fails.

[0022] In this embodiment, the static background image is an image product obtained after dynamic interference object identification and masking repair. It retains only static ground features with fixed spatial positions and stable shapes in the aerial image, such as buildings, roads, and fixed facilities, without dynamic object interference.

[0023] S103: Extract features from the static background image to obtain two-dimensional image features.

[0024] In this embodiment, feature extraction is performed on the static background image as follows. First, the static background image is converted into a grayscale image and denoised using a Gaussian filter with a 5×5 convolution kernel and a standard deviation σ=1.6 to suppress noise caused by aerial camera shake. Then, the SIFT algorithm, which is scale-invariant and rotation-invariant, is applied. The initial parameter configuration is a feature point detection threshold of 0.04, an edge threshold of 10, a maximum of 2000 feature points, and 3 scale-space layers. A 6-level scale space is constructed (each level containing 3 blurred images with different standard deviations) to adapt to ground features at different scales. Between blurred images at adjacent scales, the difference-of-Gaussians (DoG) response is calculated and local extrema are detected. Low-quality keypoints are removed by thresholding and edge suppression. Then, a 16×16 pixel neighborhood window centered on each valid feature point is selected and divided into 4×4 sub-regions. The gradient histogram of each sub-region over 8 directions is calculated to generate a 128-dimensional normalized feature descriptor. The final output is the two-dimensional image features, comprising a set of feature points (represented by (x,y) pixel coordinates) and a 128-dimensional feature descriptor matrix. The feature data is stored in binary format, laying the foundation for subsequent feature matching with the three-dimensional model.
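To make the descriptor layout concrete, the following is a deliberately simplified sketch of how a 16×16 window divided into 4×4 sub-regions with 8 orientation bins yields 128 values. It is not a full SIFT implementation: Gaussian weighting, orientation normalization, and bin interpolation are omitted, and the function name is hypothetical.

```python
import numpy as np

def gradient_histogram_descriptor(patch):
    """Build a 128-d descriptor from a 16x16 grayscale patch (simplified)."""
    patch = np.asarray(patch, dtype=np.float64)
    if patch.shape != (16, 16):
        raise ValueError("expected a 16x16 patch")
    grad_y, grad_x = np.gradient(patch)                   # derivatives along rows, columns
    magnitude = np.hypot(grad_x, grad_y)
    orientation = np.arctan2(grad_y, grad_x) % (2 * np.pi)
    # Quantize orientation into 8 bins; the clamp guards against rounding to 2*pi.
    bins = np.minimum((orientation / (2 * np.pi) * 8).astype(int), 7)
    histogram = np.zeros((4, 4, 8))
    for row in range(16):
        for col in range(16):
            histogram[row // 4, col // 4, bins[row, col]] += magnitude[row, col]
    descriptor = histogram.ravel()                        # 4 * 4 * 8 = 128 values
    norm = np.linalg.norm(descriptor)
    return descriptor / norm if norm > 0 else descriptor

# Toy example: a horizontal intensity ramp has gradient direction 0 everywhere.
ramp_patch = np.tile(np.arange(16.0), (16, 1))
ramp_descriptor = gradient_histogram_descriptor(ramp_patch)
```

For the ramp, every sub-region puts all of its gradient energy into the first orientation bin, so 16 of the 128 entries are equal and the rest are zero.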

[0025] Specifically, two-dimensional image feature extraction algorithms can also include ORB, SuperPoint, etc., with corresponding two-dimensional descriptors being binary features or gradient features. During the extraction process, image preprocessing (such as histogram equalization) is performed to mitigate the influence of lighting. At the same time, structural features that are not sensitive to changes in lighting, such as the corners and edges of buildings, are prioritized over volatile texture features, such as the color of lawns.
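The histogram equalization preprocessing mentioned above can be sketched in a few lines. This is a minimal global equalization for 8-bit grayscale images, given as an illustration only; the function name is hypothetical, and the text does not say whether a global or locally adaptive variant is used.

```python
import numpy as np

def equalize_histogram(gray):
    """Globally equalize a uint8 grayscale image using its cumulative histogram."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]          # cumulative count at the darkest occupied level
    total = gray.size
    if total == cdf_min:               # constant image: nothing to stretch
        return gray.copy()
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[gray]                   # apply the lookup table per pixel

low_contrast = np.array([[50, 50], [100, 150]], dtype=np.uint8)
equalized = equalize_histogram(low_contrast)
```

After equalization the darkest occupied level maps to 0 and the brightest to 255, which reduces the dependence of gradient-based features on overall exposure.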

[0026] S104: Extract features from the preset static urban real-scene 3D model to obtain the 3D model features.

[0027] In this embodiment, the real-scene 3D model is acquired in detail with a vehicle-mounted ground 3D laser scanning system: LiDAR and cameras mounted on the vehicle collect street-level 3D data. By emitting a laser beam and measuring its return time, the system calculates the distance between each point and the sensor and thereby obtains the 3D coordinates of the point. The accuracy of the real-scene 3D model is as follows: absolute accuracy (georeferencing accuracy) of 5 to 20 cm, and relative accuracy (internal point cloud accuracy) of 1 to 5 cm. That is, the position of a ground feature in the point cloud deviates from its absolute real-world coordinates (e.g., in WGS-84) by tens of millimeters, while the relative distance error between two points within the point cloud can reach the centimeter or even millimeter level.

[0028] In this embodiment, the coverage area of the real-scene 3D model is flexibly determined according to the specific application scenario. This area can be at the local facility level (e.g., a single building or park), the city level (e.g., the entire built-up area of a city), or even the regional level (e.g., a specific geographical area). The coverage area, spatial accuracy, and update frequency of the real-scene 3D model should meet the accuracy and reliability requirements of UAV positioning in the specific application scenario. The update mechanism varies for different application levels of the real-scene 3D model, with different update strategies and frequencies. Terrain-level models focus on large-scale topographic changes. They can utilize change patches monitored by remote sensing satellites and be updated based on UAV laser point cloud data. City-level models have a relatively higher update frequency, focusing on changes in urban construction and road reconstruction, and are updated regularly using UAV aerial photography. Component-level models require refined, on-demand, and rapid updates for key components such as road facilities and landmark buildings, which relies on specialized, detailed measurement and modeling work.

[0029] In this embodiment, feature extraction is performed on a preset static urban real-scene 3D model as follows. First, the preset static urban real-scene 3D model is loaded in .obj format, including 1.2 million triangular faces together with the corresponding texture maps and geographic coordinate information. Redundant triangular faces are removed using a mesh simplification algorithm, while the 3D geometric information of key structures such as building outlines, road edges, and fixed facilities is retained. Then, the ISS keypoint detection algorithm is used, with the keypoint sampling radius set to 0.5 meters, the neighbor point threshold set to 50, and the curvature threshold set to 0.1. The vertex cloud data of the 3D model is traversed, the shape feature value and curvature of each vertex are calculated, and 3000 3D keypoints with significant geometric structure are selected, represented by (X,Y,Z) coordinates in the absolute geographic coordinate system. Next, for each 3D keypoint, a spherical neighborhood with a radius of 1.0 meter is constructed centered on the keypoint. The normal vectors and curvature features of all vertices within the neighborhood are extracted, and principal component analysis is used to reduce the dimensionality of the neighborhood point cloud, generating a 64-dimensional geometric feature vector. Then, based on the texture map of the 3D model, the texture region corresponding to each keypoint is sampled, and a 128-dimensional texture feature vector is extracted. The geometric feature vector and the texture feature vector are concatenated and then L2-normalized to obtain a 192-dimensional 3D model feature descriptor. The final output includes a set of 3D keypoints (3000 absolute geographic coordinates) and a 192-dimensional feature descriptor matrix, providing support for subsequent matching with 2D image features.
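The final assembly step of the 3D descriptor (concatenating a 64-dimensional geometric vector with a 128-dimensional texture vector and L2-normalizing the result) can be sketched as follows. The function name is hypothetical and the input vectors are random placeholders; how the geometric and texture vectors themselves are computed is outside this sketch.

```python
import numpy as np

def build_model_descriptor(geometric, texture):
    """Concatenate (N, 64) geometric and (N, 128) texture features, then L2-normalize rows."""
    geometric = np.asarray(geometric, dtype=np.float64)
    texture = np.asarray(texture, dtype=np.float64)
    if geometric.shape[-1] != 64 or texture.shape[-1] != 128:
        raise ValueError("expected 64-d geometric and 128-d texture features")
    combined = np.concatenate([geometric, texture], axis=-1)   # (N, 192)
    norms = np.linalg.norm(combined, axis=-1, keepdims=True)
    return combined / np.where(norms > 0, norms, 1.0)          # avoid division by zero

# Placeholder inputs for 3000 keypoints (random values, for shape checking only).
rng = np.random.default_rng(0)
geometric_features = rng.normal(size=(3000, 64))
texture_features = rng.normal(size=(3000, 128))
model_descriptors = build_model_descriptor(geometric_features, texture_features)
```

Because normalization is applied after concatenation, the relative magnitudes of the two parts determine how much each contributes to descriptor distance; rescaling one part beforehand would be one way to rebalance them.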

[0030] S105: Match the features of the two-dimensional image with the features of the three-dimensional model to establish the correspondence between the two-dimensional image points and the three-dimensional model points.

[0031] In this embodiment, two-dimensional image features are structured visual representations extracted from static background images, including two-dimensional feature points (pixel coordinates (x, y)) and two-dimensional feature descriptors, such as 128-dimensional / 512-dimensional vectors. Three-dimensional model features are structured geometric and textural representations extracted from static urban 3D models, including three-dimensional feature points ((X, Y, Z) coordinates in an absolute geographic coordinate system) and three-dimensional feature descriptors, such as 192-dimensional vectors. Feature matching is a process of finding semantically and geometrically consistent correspondences between two-dimensional image features and three-dimensional model features based on the similarity measure of feature descriptors. It determines feature similarity by calculating the distance between descriptors.
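A minimal sketch of distance-based descriptor matching is given below. It assumes that the 2D and 3D descriptors have already been brought into a common space of equal dimension, which the text does not detail, and it uses a nearest-neighbor search with a ratio test as one common way to reject ambiguous matches. The function name and the ratio of 0.8 are hypothetical.

```python
import numpy as np

def match_descriptors(desc_2d, desc_3d, ratio=0.8):
    """Return (image_index, model_index) pairs that pass the ratio test.

    desc_2d: (N, D) array, desc_3d: (M, D) array with M >= 2 and the same D.
    """
    # Pairwise Euclidean distances; fine for small inputs, memory-heavy for large ones.
    dists = np.linalg.norm(desc_2d[:, None, :] - desc_3d[None, :, :], axis=2)
    order = np.argsort(dists, axis=1)
    matches = []
    for i in range(dists.shape[0]):
        best, second = order[i, 0], order[i, 1]
        # Keep the match only if the best candidate is clearly better than the runner-up.
        if dists[i, best] < ratio * dists[i, second]:
            matches.append((i, int(best)))
    return matches

model_desc = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
image_desc = np.array([[0.9, 0.1, 0.0],    # clearly closest to model point 0
                       [0.5, 0.5, 0.0]])   # equally close to points 0 and 1: ambiguous
correspondences = match_descriptors(image_desc, model_desc)
```

The sketch does not enforce a one-to-one mapping; a mutual nearest-neighbor check could be added for that purpose.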

[0032] This embodiment employs a direct 2D-3D matching strategy, avoiding the projection of the 3D model into a 2D view, thus preventing additional calculations and distortions introduced during the projection process, while maintaining geometric consistency. During the matching process, semantic information is used to constrain the matching range, meaning only static elements in the photograph are matched with elements of the same category in the model. A coarse-to-fine matching strategy is adopted: first, global features are used for coarse localization, and then local features are used for fine matching. For environmental changes such as scale, rotation, and illumination, corresponding processing strategies are adopted: For scale changes, scale-insensitive feature descriptors, such as SIFT and ORB through image pyramids, are used. 3D descriptors, such as FPFH and SHOT, are inherently robust to scale changes. At the same time, scale invariance is enhanced through multi-scale feature extraction (image pyramids, multi-radius 3D features). For rotation changes, 2D features such as SIFT, ORB, and SuperPoint are rotation-invariant, while 3D features such as FPFH and SHOT achieve rotation invariance through local reference frames (LRF). For illumination changes, features insensitive to illumination changes (such as ORB binary features and SIFT gradient features) are used. At the same time, the influence of illumination is mitigated through image preprocessing (such as histogram equalization).

[0033] In this embodiment, a 2D image point is the actual image pixel corresponding to a 2D feature point in the 2D image features, i.e., a key ground feature point in the static background image that can be used for matching (e.g., building corners, road marking intersections), uniquely identified by (x, y) pixel coordinates. A 3D model point is the actual geospatial point corresponding to a 3D feature point in the 3D model features, i.e., a key ground feature point in the static urban real-scene 3D model that has the same semantics as the 2D image point, such as the actual geographic coordinates of a building corner or the 3D position of a road marking intersection, uniquely identified by (X, Y, Z) absolute geographic coordinates. The correspondence is a one-to-one mapping relationship between 2D image points and 3D model points established through feature matching, which is the data foundation for subsequently calculating the UAV pose.
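To show how such 2D-3D correspondences determine a camera pose, here is a minimal sketch of one PnP variant, the Direct Linear Transform (DLT). It is an illustration under stated assumptions and not the embodiment's solver: it needs at least 6 non-coplanar, outlier-free correspondences and a known intrinsic matrix `K`, and it performs no refinement. All names and the synthetic data are hypothetical. With real geographic coordinates, subtracting a local origin first would help numerical conditioning.

```python
import numpy as np

def pnp_dlt(points_3d, points_2d, K):
    """Estimate R, t such that pixel ~ K (R X + t), using the DLT."""
    points_3d = np.asarray(points_3d, dtype=np.float64)
    points_2d = np.asarray(points_2d, dtype=np.float64)
    n = len(points_3d)
    if n < 6:
        raise ValueError("DLT needs at least 6 correspondences")
    # Remove the intrinsics: normalized image coordinates (u, v, 1).
    pixels_h = np.column_stack([points_2d, np.ones(n)])
    normalized = (np.linalg.inv(K) @ pixels_h.T).T
    world_h = np.column_stack([points_3d, np.ones(n)])
    A = np.zeros((2 * n, 12))
    for i in range(n):
        X, u, v = world_h[i], normalized[i, 0], normalized[i, 1]
        A[2 * i, 0:4] = X
        A[2 * i, 8:12] = -u * X
        A[2 * i + 1, 4:8] = X
        A[2 * i + 1, 8:12] = -v * X
    # The pose matrix [R|t] (up to scale) spans the null space of A.
    M = np.linalg.svd(A)[2][-1].reshape(3, 4)
    U, S, Vt = np.linalg.svd(M[:, :3])
    if np.linalg.det(U @ Vt) < 0:        # resolve the sign ambiguity of the null vector
        M = -M
        U, S, Vt = np.linalg.svd(M[:, :3])
    R = U @ Vt                           # nearest rotation matrix
    t = M[:, 3] / S.mean()               # undo the unknown scale
    return R, t

# Synthetic check: project known points with a known pose, then recover the pose.
rng = np.random.default_rng(0)
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
angle = 0.3
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.1, -0.2, 5.0])
world_points = rng.uniform(-1.0, 1.0, size=(8, 3))
camera_points = world_points @ R_true.T + t_true
projected = camera_points @ K.T
pixel_points = projected[:, :2] / projected[:, 2:3]
R_est, t_est = pnp_dlt(world_points, pixel_points, K)
camera_position = -R_est.T @ t_est       # camera center in world coordinates
```

In practice a robust solver wrapped in RANSAC, followed by bundle adjustment as the method describes, would take the place of this closed-form estimate.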

[0034] In this embodiment, registration failure is a critical failure point, and the complete response mechanism is as follows: Phase 1: Instantaneous Retry and Parameter Adjustment (lightweight). When registration failure is detected, for example because the number of matching point pairs is insufficient or the reprojection error is excessive, the system first attempts the fastest recovery methods. Adjusting feature matching parameters: relax the matching threshold for feature descriptors to obtain more matching point pairs; increase the number of feature points to be extracted; here the RANSAC algorithm plays a crucial role, as it can find the correct geometric transformation from noisy matching points. Leveraging spatiotemporal continuity: try adjacent frames, since the UAV video stream is continuous. If the current frame fails to match, the system immediately tries again with the next frame.
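The two failure criteria named in this paragraph (too few matching point pairs, excessive reprojection error) can be sketched as a simple check. The thresholds of 12 matches and 3 pixels, and all function names, are hypothetical values chosen for illustration.

```python
import numpy as np

def project_points(points_3d, K, R, t):
    """Project world points into pixel coordinates with pose (R, t) and intrinsics K."""
    camera = points_3d @ R.T + t
    homogeneous = camera @ K.T
    return homogeneous[:, :2] / homogeneous[:, 2:3]

def registration_failed(points_3d, points_2d, K, R, t,
                        min_matches=12, max_mean_error_px=3.0):
    """Return True when the match count or the mean reprojection error is unacceptable."""
    if len(points_3d) < min_matches:
        return True
    errors = np.linalg.norm(project_points(points_3d, K, R, t) - points_2d, axis=1)
    return float(errors.mean()) > max_mean_error_px

K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
model_points = np.array([[0.0, 0.0, 5.0], [1.0, 0.0, 5.0], [0.0, 1.0, 5.0],
                         [1.0, 1.0, 6.0], [-1.0, 0.5, 4.0], [0.5, -1.0, 7.0]])
exact_pixels = project_points(model_points, K, R, t)

ok_case = registration_failed(model_points, exact_pixels, K, R, t, min_matches=4)
shifted_case = registration_failed(model_points, exact_pixels + 10.0, K, R, t, min_matches=4)
too_few_case = registration_failed(model_points, exact_pixels, K, R, t)
```

A retry loop would call such a check after each attempt and fall through to the heavier recovery phases only when it keeps returning True.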

[0035] Phase 2: Intelligent Recovery Strategy (requires more computation). If the lightweight retries fail, the system initiates a more advanced recovery strategy. Enabling dynamic object masking and semantic matching: Cause: the failure is due to numerous temporary objects (e.g., vehicles, pedestrians) that are not present in the model, or to seasonal changes (e.g., lush foliage). Solution: use semantic segmentation to identify and mask these dynamic or volatile areas in real time, forcing the feature extraction algorithm to extract features only from stable static structures, such as building corners, windows, streetlights, and road markings, which improves the matching success rate. Multi-frame information fusion and motion inference: even if absolute localization fails, the system can still use visual odometry or inertial measurement unit data to infer the position and attitude of the current frame from successful localization in previous frames. This inferred pose can serve as the initial iteration value for PnP solving, narrowing the solution range and helping the algorithm converge to the correct solution. Switching to alternate regions or global search: Cause: the UAV has flown into a feature-poor area, such as open grassland or a single-colored wall. Solution: perform a local search centered on the last known location over a larger uncertainty region, without fixating on the current viewpoint; or perform global relocalization: if localization is completely lost (e.g., after flying out of an indoor space), the system must restart localization and quickly match the current image against the entire real-scene 3D model database.

[0036] Phase 3: Degradation Scheme and Sensor Fusion (Backup Strategy). When all vision methods temporarily fail, the system must be able to persist until recovery. Switch to Backup Navigation System: Immediately switch to a fusion navigation scheme based on inertial measurement unit (IMU) and GPS. How it works: The IMU provides short-term, high-frequency attitude and acceleration information for dead reckoning. GPS provides absolute position to correct IMU drift. Limitations: This is a degradation mode. Accuracy decreases over time (especially without GPS), but it maintains stable UAV flight and control, buying time for the vision system to recover. Trigger Safety Strategies: Hover: The simplest and most direct operation. The UAV hovers in place, waiting for operator intervention or attempting repositioning. Return along the original path: Control the UAV to fly in reverse along the flight path until returning to a previously successfully positioned location. Climb: Increasing flight altitude provides a wider and more distinctive field of view, improving the success rate of re-registration. Emergency Landing / Return to Home: If positioning cannot be restored for an extended period and the backup navigation system is unreliable, automatic return to home or finding a safe area for emergency landing should be initiated.
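The IMU and GPS backup scheme can be illustrated with a one-dimensional Kalman filter in which IMU acceleration drives dead reckoning and occasional GPS fixes correct the drift. This is a simplified sketch under stated assumptions: a real system estimates full 3D position and attitude, and the class name and noise values are hypothetical.

```python
import numpy as np

class DeadReckoningFilter:
    """1-D position/velocity Kalman filter: IMU acceleration in, GPS position corrections."""

    def __init__(self, position, velocity, accel_noise=0.5, gps_noise=2.0):
        self.state = np.array([position, velocity], dtype=np.float64)
        self.cov = np.eye(2)
        self.accel_var = accel_noise ** 2
        self.gps_var = gps_noise ** 2

    def predict(self, accel, dt):
        """Dead reckoning step: integrate the measured acceleration over dt."""
        F = np.array([[1.0, dt], [0.0, 1.0]])
        B = np.array([0.5 * dt * dt, dt])
        self.state = F @ self.state + B * accel
        # Uncertainty grows with every prediction (this models IMU drift).
        self.cov = F @ self.cov @ F.T + np.outer(B, B) * self.accel_var

    def correct(self, gps_position):
        """Absolute position update that pulls the estimate back toward GPS."""
        H = np.array([1.0, 0.0])
        innovation = gps_position - H @ self.state
        innovation_var = H @ self.cov @ H + self.gps_var
        gain = self.cov @ H / innovation_var
        self.state = self.state + gain * innovation
        self.cov = (np.eye(2) - np.outer(gain, H)) @ self.cov

nav = DeadReckoningFilter(position=0.0, velocity=1.0)
for _ in range(10):                       # ten seconds without any absolute fix
    nav.predict(accel=0.0, dt=1.0)
position_before = float(nav.state[0])
variance_before = float(nav.cov[0, 0])
nav.correct(gps_position=12.0)
position_after = float(nav.state[0])
variance_after = float(nav.cov[0, 0])
```

The growth of the position variance during prediction mirrors the statement that accuracy decreases over time without GPS; the same correction step could accept a visual pose once registration recovers.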

[0037] In this embodiment, scene changes (e.g., seasons, lighting, new buildings, temporary vehicles) can cause differences in appearance and geometry between real-time UAV images and pre-stored 3D models, leading to the failure of traditional feature matching methods. The method for registering UAV photos with real-world 3D models in changing scenes includes the following steps:
Step 1: Preprocessing stage: Constructing a real-world 3D model with semantic information. This model is formed by fusing multi-temporal data, strengthening static structural elements, and extracting feature descriptions robust to lighting and seasonal changes, such as features based on deep learning.
Step 2: Real-time matching stage: After the UAV takes photos, semantic segmentation is first performed to identify dynamic objects (e.g., vehicles, pedestrians) and static elements (e.g., buildings, roads). Robust features of the photos are extracted and corresponding points are found in the real-world 3D model. During the matching process, semantic information is used to constrain the matching range, i.e., only static elements in the photos are matched with elements of the same category in the model. A coarse-to-fine matching strategy is adopted: first, global features are used for coarse localization, and then local features are used for fine matching.
Step 3: Robust Optimization Stage: Improved robust estimation algorithms (e.g., MAGSAC++) are used to filter matching point pairs and eliminate erroneous matches. Temporal information (e.g., video sequences) is utilized to further optimize the matching results through multi-frame consistency.
Step 4: Model Update Stage: When registration is successful and a permanent change in the scene is detected, the drone image is used as new data to incrementally update the real-world 3D model.
Step 5: Failure Recovery Mechanism: If registration fails, prior positioning provided by GPS or an inertial measurement unit is used to re-match within a large area of the real-world 3D model, or the drone is allowed to fly to a region with richer features before matching. These methods effectively address the problems caused by scene changes, achieving stable and high-precision registration between drone images and the real-world 3D model.

[0038] The matching process matches 2D and 3D feature points based on the distance between descriptors (e.g., Hamming distance for ORB, Euclidean distance for SIFT and FPFH). Nearest neighbor search (e.g., KNN) and the ratio test are used to initially filter matching pairs. Geometric verification (e.g., RANSAC combined with PnP) is then used to eliminate false matches, and the camera pose is solved. To further improve matching accuracy, a three-level false match handling mechanism is employed: the first level uses the KNN ratio test (threshold 0.8) and descriptor distance threshold filtering to reduce candidate false matches; the second level uses the RANSAC+PnP algorithm (200-500 iterations, reprojection error threshold 3-5 pixels) and 3D spatial consistency constraints (e.g., coplanarity test, spatial distance constraints, normal vector consistency) to eliminate false matches; the third level uses bundle adjustment optimization to eliminate residual false inliers.

[0039] S106: Based on the correspondence, the PnP algorithm is used to solve the initial exterior orientation elements of the UAV in the absolute geographic coordinate system. The bundle adjustment method is used to optimize the initial exterior orientation elements to obtain the optimized UAV visual pose.

[0040] In this embodiment, the PnP (Perspective-n-Point) algorithm is a pose estimation algorithm based on a perspective projection model. It solves the position and attitude parameters of the UAV's onboard camera relative to the three-dimensional spatial coordinate system (absolute geographic coordinate system) by using the known correspondence between two-dimensional image points and three-dimensional spatial points. It is an algorithm for camera pose estimation in computer vision.
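As an illustration of the perspective projection model underlying PnP, the following minimal sketch projects a 3D point into the image and measures its reprojection error against an observed 2D point. The function names, the list-based rotation matrix, and the intrinsic values (fx, fy, cx, cy) are assumptions made for the example; lens distortion is ignored.

```python
import math

def project(point_w, R, t, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates with a pinhole model."""
    # Camera-frame coordinates: X_c = R * X_w + t
    xc = [sum(R[i][j] * point_w[j] for j in range(3)) + t[i] for i in range(3)]
    if xc[2] <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division followed by the intrinsic mapping
    return (fx * xc[0] / xc[2] + cx, fy * xc[1] / xc[2] + cy)

def reprojection_error(point_w, pixel, R, t, fx, fy, cx, cy):
    """Euclidean distance (pixels) between the projected and observed point."""
    u, v = project(point_w, R, t, fx, fy, cx, cy)
    return math.hypot(u - pixel[0], v - pixel[1])

# Hypothetical pose (identity rotation, zero translation) and intrinsics
R_identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_zero = [0.0, 0.0, 0.0]
intrinsics = (1000.0, 1000.0, 320.0, 240.0)

uv = project((1.0, 2.0, 10.0), R_identity, t_zero, *intrinsics)
err = reprojection_error((1.0, 2.0, 10.0), (423.0, 444.0), R_identity, t_zero, *intrinsics)
```

A PnP solver searches for the rotation and translation that make this error small across all 2D-3D correspondences.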

[0041] Specifically, the PnP algorithm modules include: an input preprocessing module for data format standardization and noise filtering; a centering and scaling module for reducing numerical computation errors and improving solution stability; a basis vector estimation module for constructing an affine basis for the 3D point set to simplify pose solving; a pose initial estimation module for solving the initial values of the camera rotation matrix and translation vector; a nonlinear optimization module for minimizing reprojection errors and optimizing pose parameters; a pose inverse transformation module for recovering the final pose in the original coordinate system; and a post-processing and verification module for filtering invalid pose solutions to improve solution reliability.

[0042] In this embodiment, the PnP algorithm input includes a 3D reference point set, a 2D matching point set, camera intrinsic parameters, and semantic label information. All input data come from the 3D model of the UAV mission area and the semantically segmented aerial images. The number of point pairs is kept between 300 and 500 to balance accuracy and efficiency. The output data is the UAV visual pose in absolute geographic coordinates, including position and attitude. The algorithm also outputs the reprojection error matrix and pose confidence, which are directly used as input to the visual-inertial fusion module. The PnP algorithm is an analytical algorithm based on geometric constraints, so it requires no dataset training. Its parameters are tuned through scene adaptation experiments: 1000 sets of measured data are collected at flight altitudes of 50-100 m and different ground cover densities, and parameters such as the error threshold and the number of iterations are adjusted iteratively, ultimately achieving a single-frame calculation time of no more than 5 ms and a pose calculation success rate of at least 98% in complex scenes.

[0043] In this embodiment, the absolute geographic coordinate system is a coordinate system used to define the true geographic location of a point in three-dimensional space. The coordinate values (X, Y, Z) directly correspond to the latitude and longitude or planar projection coordinates and altitude of the Earth's surface, and serve as the final reference coordinate system for the UAV's pose. The initial exterior orientation elements are the exterior orientation parameters of the UAV's onboard camera, obtained through preliminary calculation using the PnP algorithm. These parameters include position and attitude parameters and serve as the initial input for optimization.

[0044] In this embodiment, bundle adjustment is a global optimization algorithm based on the least squares principle. It constructs a reprojection error objective function for all 2D image point-3D model point matching pairs, while simultaneously optimizing the camera exterior orientation elements and 3D point coordinates to minimize the overall reprojection error and achieve pose parameter correction.
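The objective described above can be written compactly as follows. This is a sketch in commonly used notation, not a formula taken from the application: x_i denotes the observed 2D image point, X_i the matched 3D model point, R and t the camera rotation and translation (exterior orientation), K the intrinsic matrix, and π the perspective division; all of these symbols are assumptions introduced here.

```latex
\min_{R,\, t,\, \{X_i\}} \; \sum_{i=1}^{N}
  \left\| x_i - \pi\!\left( K \left( R X_i + t \right) \right) \right\|^2,
\qquad
\pi\!\left( [a,\, b,\, c]^{\top} \right) = \left[ a/c,\; b/c \right]^{\top}
```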

[0045] Bundle adjustment employs a reprojection error minimization strategy, constructing a joint optimization objective function that includes camera extrinsic parameters and 3D point positions. This is solved iteratively, with a convergence threshold of 1e-6 and a maximum of 50 iterations. Simultaneously, camera intrinsic parameters can be selectively incorporated for joint optimization to offset 70%-80% of the intrinsic parameter calibration error. The propagation of camera intrinsic parameter calibration errors can be suppressed through the following methods: introducing an intrinsic parameter covariance matrix and using weighted least squares to reduce the influence of high-uncertainty parameters; prioritizing control points in the image center region (where distortion has minimal impact); and increasing the number of control points to 8 or more to dilute the error through redundant information.

[0046] In this embodiment, the optimized UAV visual pose is the final visual pose result of the UAV obtained after bundle adjustment optimization, including the optimized position and the optimized attitude. It has smaller reprojection error and higher accuracy, and can be directly used for data fusion with inertial measurement unit.

[0047] S107: The optimized UAV visual pose is fused with the observation data from the inertial measurement unit to obtain the final three-dimensional position and attitude of the UAV in the absolute geographic coordinate system; the observation data from the inertial measurement unit includes angular velocity and acceleration.

[0048] In this embodiment, the inertial measurement unit (IMU) is a miniature sensor module integrated on the UAV, consisting of a gyroscope and an accelerometer. It outputs real-time observation data of the UAV's angular velocity (rotational motion) and acceleration (translational motion). It has a high sampling frequency and strong anti-interference capability, but its errors accumulate over time; it supplies the inertial-side data for fusion. The observation data are physical quantities collected in real time by the IMU, specifically: angular velocity, the rotational rate about the X, Y, and Z axes of the UAV's body coordinate system, in rad/s; and acceleration, the linear acceleration along the X, Y, and Z axes of the UAV's body coordinate system, in m/s². These are the basic inertial data for calculating the UAV's motion state.

[0049] In this embodiment, data fusion is a filtering-based process (e.g., Kalman filtering) that fuses two complementary sources: the visual pose (high precision, low frequency, subject to jitter) and the inertial measurement unit observation data (high frequency, no jitter, subject to cumulative error). The visual data are used to correct the cumulative error of the inertial measurement unit, while the inertial measurement unit data are used to smooth the jitter of the visual pose, so as to obtain a pose result with high precision and high stability.
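The complementary behavior described above can be illustrated with a one-dimensional Kalman filter. This is a deliberately reduced sketch, not the fusion scheme of the application: it tracks a single position coordinate, assumes the velocity has already been integrated from IMU acceleration, and uses hypothetical noise values q (process) and r (visual measurement).

```python
def kalman_predict(x, p, velocity, dt, q):
    """Propagate the position estimate using IMU-derived velocity."""
    x_pred = x + velocity * dt   # dead-reckoning step
    p_pred = p + q               # uncertainty grows while only the IMU is used
    return x_pred, p_pred

def kalman_update(x_pred, p_pred, z, r):
    """Correct the prediction with a visual position measurement z."""
    gain = p_pred / (p_pred + r)            # Kalman gain in [0, 1]
    x_new = x_pred + gain * (z - x_pred)    # pull the estimate toward the visual fix
    p_new = (1.0 - gain) * p_pred           # uncertainty shrinks after the update
    return x_new, p_new, gain

# Hypothetical values: start at 0 m with variance 1, move at 1 m/s for 0.1 s
x_pred, p_pred = kalman_predict(x=0.0, p=1.0, velocity=1.0, dt=0.1, q=0.01)
x_new, p_new, gain = kalman_update(x_pred, p_pred, z=0.2, r=1.01)
```

Between visual fixes only the predict step runs, which is how the high-rate inertial data bridge the low-rate visual pose; each visual fix then removes the accumulated drift.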

[0050] The data fusion adopts a tightly coupled fusion method, which is achieved through Kalman filtering or nonlinear optimization. The raw data (angular velocity, acceleration) of the inertial measurement unit and the visual observation data are tightly coupled and processed in the same nonlinear optimization framework. Global optimization is performed by combining multi-view geometric constraints and bundle adjustment. The data weight of the inertial measurement unit is set to 0.3 to update the motion state and suppress the instantaneous drift of visual positioning, so that the system state estimation is more consistent and more stable.

[0051] In this embodiment, the final 3D position is the precise spatial coordinates (X, Y, Z) of the UAV in the absolute geographic coordinate system obtained after fusion, in meters. This eliminates visual pose jitter and cumulative errors of the inertial measurement unit, resulting in higher accuracy than single sensor output. The final attitude is the attitude state of the UAV in the absolute geographic coordinate system obtained after fusion, represented by quaternions or Euler angles (roll angle φ, pitch angle ω, yaw angle κ), in rad or °, exhibiting high stability and high accuracy.

[0052] As can be seen from the above, this application obtains static background images by acquiring aerial images of the mission area, identifying and masking areas of dynamic interference, thus avoiding the influence of dynamic interference on subsequent processing; feature extraction and matching of static background images and 3D models to establish a correspondence provides a basis for calculation; using the PnP algorithm to calculate and optimize the UAV's visual pose can improve the accuracy of pose calculation; fusing the optimized visual pose with inertial measurement unit observation data can obtain a more accurate final 3D position and attitude of the UAV.

[0053] In one embodiment of this application, matching two-dimensional image features with three-dimensional model features to establish a correspondence between two-dimensional image points and three-dimensional model points includes: The first stage of matching is performed using a method based on the nearest neighbor distance ratio to obtain initial matching pairs; The initial matching pairs are processed using a random sampling consensus algorithm to obtain the two-dimensional-three-dimensional point correspondence.

[0054] In this embodiment, the nearest neighbor distance ratio method is a preliminary screening strategy for feature matching: for each two-dimensional feature descriptor, the two nearest candidate descriptors are searched among the three-dimensional feature descriptors, and the ratio of the nearest distance to the second nearest distance is calculated. When the ratio is less than a preset threshold, the pair is accepted as a valid initial matching pair, which efficiently eliminates false matches. The first stage of matching is the preliminary screening step of feature matching. The nearest neighbor distance ratio method is used to quickly screen out potential valid matching pairs without involving complex geometric constraints. The goal is to quickly narrow down the matching range and retain high-quality candidates. The initial matching pairs are a temporary mapping set of two-dimensional image points to three-dimensional model points obtained after the first stage of matching, including valid matches and a small number of false matches (outliers), which need to be further refined with subsequent geometric constraints.
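The nearest neighbor distance ratio test can be sketched as follows. This is a minimal brute-force illustration with toy two-dimensional descriptors and Euclidean distance; the function name and the sample data are assumptions, and a real system would use high-dimensional descriptors with an indexed nearest neighbor search.

```python
import math

def ratio_test_matches(desc_2d, desc_3d, ratio=0.8):
    """Return (i, j) pairs where 2D descriptor i unambiguously matches 3D descriptor j."""
    matches = []
    for i, d in enumerate(desc_2d):
        # Distances from this 2D descriptor to every 3D descriptor, nearest first
        dists = sorted((math.dist(d, m), j) for j, m in enumerate(desc_3d))
        if len(dists) < 2:
            continue
        (best, j_best), (second, _) = dists[0], dists[1]
        # Accept only if the nearest candidate is clearly better than the runner-up
        if second > 0 and best / second < ratio:
            matches.append((i, j_best))
    return matches

# Toy descriptors: the first query has one clear neighbor, the second is ambiguous
desc_2d = [(0.0, 0.0), (5.0, 5.0)]
desc_3d = [(0.1, 0.0), (10.0, 10.0), (5.5, 5.0), (4.5, 5.0)]
initial_matches = ratio_test_matches(desc_2d, desc_3d)
```

The second query is rejected because its two nearest candidates are equally close, which is the situation the ratio test is meant to catch.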

[0055] In this embodiment, the random sampling consensus algorithm is a robust geometric constraint algorithm. It constructs an initial model, such as a PnP pose model, by randomly sampling a small number of matching pairs. It calculates the model error of all matching pairs, such as the reprojection error, and filters out inliers (valid matches) with errors less than a threshold. After iterative optimization, it obtains the optimal set of inliers, which serves to remove outliers from the initial matching pairs.
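The sample, score, and keep-best loop of random sampling consensus can be sketched as below. To keep the example self-contained, the PnP pose model is replaced by a much simpler stand-in, a pure 2D translation estimated from a single pair; this substitution, the function name, and the synthetic data are assumptions for illustration only. The default threshold and iteration count are taken from the low end of the ranges quoted earlier in the text (3 pixels, 200 iterations).

```python
import math
import random

def ransac_translation(pairs, threshold=3.0, iterations=200, seed=0):
    """Return indices of pairs consistent with the best 2D translation found."""
    rng = random.Random(seed)   # fixed seed so the sketch is reproducible
    best_inliers = []
    for _ in range(iterations):
        # Minimal sample: one pair fully determines a translation
        (px, py), (qx, qy) = rng.choice(pairs)
        tx, ty = qx - px, qy - py
        # Score the hypothesis by counting pairs with a small residual
        inliers = [
            k for k, ((ax, ay), (bx, by)) in enumerate(pairs)
            if math.hypot(ax + tx - bx, ay + ty - by) < threshold
        ]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return best_inliers

# Synthetic data: 8 pairs follow the translation (10, -5), 2 are gross outliers
pairs = [((float(i), 2.0 * i), (i + 10.0, 2.0 * i - 5.0)) for i in range(8)]
pairs += [((0.0, 0.0), (100.0, 100.0)), ((5.0, 5.0), (-50.0, 30.0))]
inlier_ids = ransac_translation(pairs)
```

With a PnP model, the minimal sample would be several 2D-3D pairs and the residual would be the reprojection error, but the loop structure stays the same.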

[0056] As can be seen from the above, this embodiment uses the nearest neighbor distance ratio method to perform the first stage of matching to obtain the initial matching pairs, which helps to initially screen out the matching feature point pairs. On this basis, the random sampling consensus algorithm is used to process the initial matching pairs, which can eliminate the erroneous matches and obtain a more accurate two-dimensional-three-dimensional point correspondence, providing a reliable foundation for solving the UAV visual pose.

[0057] In one embodiment of this application, the method further includes: when the matching results of two-dimensional image features and three-dimensional model features contain multiple spatially separated candidate matching regions, the multiple candidate matching regions are first filtered based on geometric consistency and semantic consistency to obtain effective candidate regions. The consistency of the effective candidate regions is then verified and optimized using the multi-view geometric constraints of the current frame and several adjacent preceding frames. Based on short-time motion prediction using the inertial measurement unit, pose continuity is verified for the candidate regions that passed multi-view verification to obtain the final matching pairs, which are then used as the initial matching pairs.

[0058] In this embodiment, the spatially separated candidate matching region clusters are multiple non-overlapping sets (clusters) of matching pairs formed by clustering according to spatial coordinates in the matching results. Spatial separation means that the pixel coordinates of the 2D image points in the image and the (X,Y,Z) coordinates of the 3D model points in the geographic space of different clusters do not intersect, which is caused by similar features in the scene (e.g., multiple buildings of the same style). Geometric consistency is the degree to which the geometric constraints of all matching pairs within the candidate matching region clusters are satisfied. By calculating indicators such as the reprojection error distribution of matching pairs within the cluster, the spatial flatness of the 3D point cloud, and the distribution density of 2D image points, it is determined whether the cluster conforms to the geometric rules of the real geographic scene, such as no obvious spatial distortion and concentrated reprojection errors. Semantic consistency is the degree of semantic label matching between 2D image points and 3D model points within the candidate matching region clusters; the semantic labels of 2D image points come from the semantic segmentation results, such as building corners and road markings, while the semantic labels of 3D model points are preset attributes, such as office building corners and main road markings. Semantic consistency means that the two labels are of the same or similar categories. The first screening step is a preliminary filtering process for multiple candidate matching region clusters. Through dual constraints of geometric consistency and semantic consistency, invalid clusters that do not conform to the real-world scenario are eliminated, while potentially valid clusters are retained. 
Valid candidate regions are candidate matching region clusters that, after the first screening, meet preset thresholds for both geometric and semantic consistency.

[0059] In this embodiment, the current frame is the static background image currently being processed and serves as the reference frame for matching. The adjacent preceding frames refer to the static background images of the preceding frames that are temporally continuous with the current frame, including continuous visual information from the same scene. Multi-view geometric constraints are based on the geometric relationships between images of the same scene from different perspectives (the current frame and the preceding frames), such as the fundamental matrix F and the essential matrix E. These constraints require matching pairs to satisfy rules such as coplanarity and epipolar constraints, ensuring that the spatial positions of the matching pairs are consistent (without abrupt changes) in consecutive frames. Consistency verification and optimization utilize multi-view geometric constraints to verify whether matching pairs within the valid candidate region satisfy the spatial consistency of consecutive frames, eliminating erroneous matching pairs that violate the constraints and supplementing missing matching pairs that meet the constraints, thereby optimizing the cluster quality.

[0060] In a specific embodiment, the consistency verification and optimization of the effective candidate regions are performed using the multi-view geometric constraints of the current frame and several adjacent preceding frames. This includes: for each effective candidate region, the initial pose assumption is used as the initial value, and local bundle adjustment is performed within a sliding window including the current frame and several adjacent preceding frames; the total reprojection error of all frames within the sliding window after local bundle adjustment optimization is calculated, and the candidate region corresponding to the pose assumption with the smallest total reprojection error is selected as the preferred result after multi-view verification.

[0061] In this embodiment, short-time motion prediction is based on historical observation data from the inertial measurement unit (IMU), such as the data from the previous 50 ms. The data are integrated to predict the displacement and attitude change of the UAV at the current frame relative to several previous frames, resulting in a predicted pose. Pose continuity verification checks whether the difference between the UAV pose corresponding to the matched pairs within a candidate region after multi-view verification and the IMU-predicted pose is within a preset threshold, such as a position difference less than or equal to 0.5 meters or an attitude difference less than or equal to 1°. Unstable matched pairs with excessively large differences are eliminated, ensuring a smooth pose transition. The final matched pairs are a high-quality set of mappings from 2D image points to 3D model points, obtained after three levels of processing: initial screening, multi-view verification, and pose continuity verification. Having passed these three levels, the final matched pairs are used as the initial matched pairs for subsequent processing and therefore have higher initial quality.
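The pose continuity check can be sketched as follows, using the example thresholds given above (0.5 m and 1°). Several simplifications are assumed here and are not part of the application: the prediction uses a constant-acceleration model with gravity-compensated acceleration already expressed in the world frame, attitude is reduced to a single yaw angle, and both thresholds are required to hold at once.

```python
import math

def predict_position(p0, v0, accel, dt):
    """Constant-acceleration dead reckoning over a short interval dt (seconds)."""
    return tuple(p + v * dt + 0.5 * a * dt * dt for p, v, a in zip(p0, v0, accel))

def is_pose_continuous(visual_pos, predicted_pos, visual_yaw_deg, predicted_yaw_deg,
                       max_pos_diff=0.5, max_att_diff_deg=1.0):
    """True if the visual pose stays within the thresholds of the IMU prediction."""
    pos_diff = math.dist(visual_pos, predicted_pos)
    # Wrap the yaw difference into [-180, 180) so 359.8 deg and 0.3 deg compare as close
    yaw_diff = abs((visual_yaw_deg - predicted_yaw_deg + 180.0) % 360.0 - 180.0)
    return pos_diff <= max_pos_diff and yaw_diff <= max_att_diff_deg

# Hypothetical state: 10 m/s along X at 50 m altitude, predicted 50 ms ahead
predicted = predict_position((0.0, 0.0, 50.0), (10.0, 0.0, 0.0), (0.0, 0.0, 0.0), 0.05)
accepted = is_pose_continuous((0.6, 0.1, 50.0), predicted, 90.4, 90.0)
rejected = is_pose_continuous((2.0, 0.0, 50.0), predicted, 90.4, 90.0)
```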

[0062] In a specific embodiment, based on the short-time motion prediction of the inertial measurement unit, the pose continuity of the candidate region after multi-view verification is checked, including: predicting the pose prior value and its uncertainty range of the current frame based on the UAV pose and the observation data of the inertial measurement unit of the current frame and several adjacent previous frames; comparing the pose of the current frame calculated by the candidate matching region after multi-view verification with the pose prior value and calculating the geometric distance; selecting the region corresponding to the candidate pose that falls within the uncertainty range and is closest to the pose prior value as the final matching region, which is the final matching result.

[0063] As can be seen from the above, this embodiment utilizes semantic segmentation and motion detection to mask dynamic interference areas in aerial images to obtain static background images. Features are extracted and matched between these static background images and static 3D urban scene models to establish a correspondence between 2D image points and 3D model points. After matching using a nearest neighbor distance ratio method and a random sampling consensus algorithm, when multiple candidate matching regions exist, effective candidate regions are selected based on geometric and semantic consistency. Verification and optimization are performed using multi-view geometric constraints, and pose continuity is verified based on short-time motion prediction using inertial measurement units. This improves the accuracy and reliability of matching, reduces the impact of dynamic interference, obtains more accurate initial matching pairs, and enhances the accuracy of UAV visual pose calculation and final 3D position and attitude determination.

[0064] In one embodiment of this application, multiple candidate matching regions are first filtered based on geometric consistency and semantic consistency to obtain valid candidate regions, including: The geometric consistency score is calculated based on the reprojection error distribution, spatial distribution, and inlier ratio of matching point pairs within the candidate matching region. The semantic consistency score is determined by comparing the degree of match between the static features identified in aerial images and the pre-defined semantic information of the corresponding candidate areas in the 3D model. The geometric consistency score and semantic consistency score of each candidate region are fused to obtain a comprehensive score. Candidate regions with a comprehensive score greater than a preset comprehensive score threshold are selected to obtain valid candidate regions.

[0065] In this embodiment, the candidate matching region is a set (cluster) of spatially separated matching pairs obtained through spatial clustering after preliminary matching of 2D image features and 3D model features. Each region includes a set of 2D image point-3D model point mapping relationships and is the object of screening. The geometric consistency score is an index calculated based on the geometric characteristics of matching point pairs within the candidate matching region. It is used to evaluate whether the matching pairs within the region conform to the geometric rules of the real geographic scene, indicating the spatial rationality of the matching pairs. The value range is [0,1], and the higher the score, the stronger the geometric consistency. The reprojection error distribution is the statistical distribution characteristics of the reprojection error (the Euclidean distance between the 2D coordinates of the 3D model points projected by the camera and the actual 2D image point coordinates) of all matching pairs within the candidate region, including the mean, standard deviation, and maximum value of the error, used to determine whether the error is concentrated (concentration indicates good geometric consistency). The spatial distribution is the pixel coordinate distribution of 2D image points in the aerial image and the (X,Y,Z) coordinate distribution of 3D model points in the absolute geographic coordinate system within the candidate region. It requires that the matching points are evenly distributed, without obvious clustering or dispersion, and conform to the spatial structure of real ground features. The inlier ratio is the proportion of the number of matching pairs in a candidate region whose reprojection error is less than a preset error threshold to the total number of matching pairs in that region. The higher the inlier ratio, the more effective matches there are in the region, and the stronger the geometric consistency.
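The statistics that feed the geometric consistency score can be computed as in the sketch below. The application does not state how these quantities are combined into a score in [0,1], so the example stops at the raw statistics; the function name, the 3-pixel inlier threshold, and the sample errors are assumptions.

```python
import statistics

def reprojection_stats(errors, inlier_threshold=3.0):
    """Summarize per-pair reprojection errors (pixels) for one candidate region."""
    inliers = sum(1 for e in errors if e < inlier_threshold)
    return {
        "mean": statistics.fmean(errors),
        "std": statistics.pstdev(errors),   # population standard deviation
        "max": max(errors),
        "inlier_ratio": inliers / len(errors),
    }

# Hypothetical errors for one region: four good pairs and one poor pair
stats = reprojection_stats([1.0, 2.0, 1.0, 2.0, 9.0])
```

A concentrated error distribution shows up as a small standard deviation and a maximum close to the mean, while a high inlier ratio indicates that most pairs in the region are valid.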

[0066] In this embodiment, the semantic consistency score is an index calculated based on the degree of match between the semantics of static features corresponding to two-dimensional image points within the candidate matching area and the preset semantics of the three-dimensional model. It is used to evaluate the semantic rationality of the matching pair, with a value range of [0,1]. The higher the score, the stronger the semantic consistency. Static features identified in aerial images are spatially fixed features identified from the processed static background image through semantic segmentation algorithms, such as buildings, roads, bridges, streetlights, etc. Each static feature corresponds to a unique semantic label, such as office building or main road. The preset semantic information of the candidate area corresponding to the three-dimensional model is the preset semantic label labeled for each feature element when the static urban real scene three-dimensional model is constructed. It is associated with the three-dimensional model points corresponding to the candidate matching area and serves as the benchmark for semantic comparison. The degree of match is the degree of matching between the semantic labels of static features corresponding to two-dimensional image points within the candidate area and the preset semantic labels of the three-dimensional model points. This includes complete match (e.g., office building corner and office building corner), close match (e.g., main road marking and urban main road marking), and excludes completely unrelated (e.g., buildings and trees).

[0067] In this embodiment, the comprehensive score is a final index calculated by weighting the geometric consistency score and semantic consistency score of the same candidate region according to preset weights (e.g., geometric weight 0.6, semantic weight 0.4), and is used to comprehensively evaluate the overall quality of the candidate regions. The preset comprehensive score threshold is a pre-set critical value used to filter valid candidate regions. Candidate regions with a comprehensive score greater than the preset comprehensive score threshold are determined to be valid, while those with a score less than the preset comprehensive score threshold are eliminated. Valid candidate regions are candidate matching regions whose comprehensive score, after fusing geometric consistency and semantic consistency scores, is greater than the preset comprehensive score threshold, and thus possesses both geometric and semantic rationality.
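The weighted fusion and threshold screening can be sketched as follows, using the example weights from the text (geometric 0.6, semantic 0.4). The region names, the individual scores, and the threshold value of 0.65 are hypothetical.

```python
def comprehensive_score(geometric, semantic, w_geo=0.6, w_sem=0.4):
    """Weighted sum of the geometric and semantic consistency scores."""
    if not (0.0 <= geometric <= 1.0 and 0.0 <= semantic <= 1.0):
        raise ValueError("scores must lie in [0, 1]")
    return w_geo * geometric + w_sem * semantic

def select_valid_regions(regions, threshold):
    """Keep the names of regions whose comprehensive score exceeds the threshold."""
    return [name for name, geo, sem in regions
            if comprehensive_score(geo, sem) > threshold]

# (name, geometric score, semantic score) for three hypothetical candidate clusters
candidates = [("cluster_a", 0.9, 0.8), ("cluster_b", 0.4, 0.9), ("cluster_c", 0.7, 0.2)]
valid = select_valid_regions(candidates, threshold=0.65)
```

Here cluster_b is rejected despite strong semantic agreement because its geometric score is weak, and cluster_c is rejected for the opposite reason.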

[0068] As can be seen from the above, this embodiment calculates the geometric consistency score by utilizing the reprojection error distribution, spatial distribution, and inlier ratio of matching point pairs within the candidate matching area. It also combines the semantic consistency score determined by the degree of agreement between static features in aerial images and the corresponding candidate areas of the 3D model with preset semantic information. By fusing the two scores, a comprehensive score is obtained, and candidate areas with a score greater than the preset comprehensive score threshold are selected as valid candidate areas. This approach can effectively select more suitable candidate areas and improve the accuracy and reliability of coordinate estimation.

[0069] In one embodiment of this application, after matching two-dimensional image features with three-dimensional model features to establish a correspondence between two-dimensional image points and three-dimensional model points, the method further includes: Calculate the overall matching error and determine whether the overall error is greater than the preset matching threshold; If the overall error is greater than the preset matching threshold, then the reprojection error threshold in the PnP algorithm parameters is reduced based on the first step length; and / or the threshold of the kernel function in the PnP algorithm parameters is increased based on the second step length. If the overall error is less than or equal to the preset matching threshold, the reprojection error threshold in the PnP algorithm parameters is increased based on the third step length; and / or the threshold of the kernel function in the PnP algorithm parameters is decreased based on the fourth step length.

[0070] In this embodiment, the overall matching error is a quantitative indicator of the overall matching results between 2D image points and 3D model points. It comprehensively represents the geometric consistency deviation of all valid matching pairs and is calculated using the average reprojection error or the root mean square error (RMSE). It serves as the basis for judging the matching quality. The preset matching threshold is a pre-calibrated critical value used to determine whether the matching quality meets the standard. For example, the average reprojection error is 2.5 pixels. This is set based on the accuracy requirements of UAV coordinate estimation, such as a position error less than or equal to 0.5 meters, and serves as the criterion for triggering the adjustment of PnP algorithm parameters. PnP algorithm parameters refer to the parameters used in the PnP algorithm to constrain the accuracy and robustness of pose calculation, including the reprojection error threshold and the kernel function threshold. The parameter values directly affect the accuracy and stability of the pose calculation results.

[0071] Specifically, the reprojection error threshold is a geometric constraint threshold (unit: pixels) used in the PnP algorithm to select valid matching pairs. It represents the maximum allowable Euclidean distance between the coordinates of the 3D model points projected onto the 2D image after PnP calculation and the actual 2D image point coordinates. The smaller the reprojection error threshold, the higher the accuracy of the selected matching pairs. The kernel function threshold is a parameter used in the PnP algorithm to control the influence range of the kernel function, used to balance the weight distribution of matching pairs, for example, giving high weights to core matching pairs and low weights to edge matching pairs. The higher the kernel function threshold, the larger the influence range of the kernel function and the higher the tolerance to noise.

[0072] In this embodiment, the first step length, second step length, third step length, and fourth step length are the adjustment step lengths of the PnP algorithm parameters. Specifically, the first step length is used to reduce the reprojection error threshold, the second step length is used to increase the kernel function threshold, the third step length is used to increase the reprojection error threshold, and the fourth step length is used to reduce the kernel function threshold. The step lengths are calibrated based on parameter sensitivity analysis and the accuracy requirements of the actual scenario, ensuring the stability and effectiveness of parameter adjustment.

[0073] The effective range of the reprojection error threshold is [1.0 pixel, 5.0 pixels], calibrated based on the UAV's onboard camera resolution of 5472×3648 and an aerial photography altitude of 50-100 meters. The effective range of the kernel function threshold is [0.3, 1.0], based on the characteristics of the PnP algorithm kernel function, where 0.3 prioritizes accuracy and 1.0 prioritizes robustness. The preset matching threshold T is 2.5 pixels, corresponding to the accuracy requirement of the UAV's position error being less than or equal to 0.5 meters. The allowable error fluctuation coefficient k is 0.1, and the overall error change corresponding to the step size is less than or equal to T×k=0.25 pixels, ensuring smooth adjustment.

[0074] Specifically, the first step size = min(effective range of the reprojection error threshold × range percentage coefficient, T×k), where the range percentage coefficient is set to 2% to balance adjustment efficiency and stability, and T×k is the maximum allowable error fluctuation, 0.25 pixels, which avoids a sudden decrease in effective matching pairs due to a single adjustment. For example: the effective range of the reprojection error threshold = 5.0 - 1.0 = 4.0 pixels; the step size corresponding to the range percentage = 4.0 × 2% = 0.08 pixels; the step size corresponding to the error constraint = 2.5 × 0.1 = 0.25 pixels; taking the minimum of the two, the first step size = 0.08 pixels.

[0075] The second step size = effective range of the kernel function threshold × β, where β is the range percentage coefficient, taken as 10%. The kernel function threshold is less sensitive to error than the reprojection error threshold, so the step size can be larger. For example: effective range of the kernel function threshold = 1.0 - 0.3 = 0.7; second step size = 0.7 × 10% = 0.07, which is rounded up to 0.1 for easier iterative calculation.

[0076] The third step size = max(effective range of the reprojection error threshold × range percentage coefficient × 1.5, T×k×1.2). The factor of 1.5 widens the adjustment when the error already meets the standard, improving efficiency; the factor of 1.2 on the error fluctuation allows a slight increase in error after the standard is met, in exchange for more matching pairs. For example: effective range of the reprojection error threshold × range percentage coefficient × 1.5 = 4.0 × 2% × 1.5 = 0.12 pixels; T×k×1.2 = 2.5 × 0.1 × 1.2 = 0.3 pixels; taking the maximum of the two, the third step size = 0.3 pixels.

[0077] The fourth step size = the second step size × 0.8, that is, 80% of the corrected second step size, to balance accuracy and robustness. For example: the fourth step size = 0.1 × 0.8 = 0.08.
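The four step sizes can be reproduced with a short sketch. The function and parameter names are hypothetical. Rounding the second step size to one decimal place is an assumption that matches the stated correction from 0.07 to 0.1, and the default values are the ones given in this embodiment.

```python
def pnp_step_sizes(reproj_range=(1.0, 5.0), kernel_range=(0.3, 1.0),
                   T=2.5, k=0.1, reproj_pct=0.02, kernel_pct=0.10):
    """Return the four adjustment step sizes of the PnP parameters."""
    reproj_span = reproj_range[1] - reproj_range[0]   # 4.0 pixels
    kernel_span = kernel_range[1] - kernel_range[0]   # 0.7
    max_fluctuation = T * k                           # 0.25 pixels

    # First step size: reduces the reprojection error threshold.
    first = min(reproj_span * reproj_pct, max_fluctuation)
    # Second step size: increases the kernel function threshold.
    # Assumption: the correction of 0.07 to 0.1 is one-decimal rounding.
    second = round(kernel_span * kernel_pct, 1)
    # Third step size: increases the reprojection error threshold.
    third = max(reproj_span * reproj_pct * 1.5, max_fluctuation * 1.2)
    # Fourth step size: reduces the kernel function threshold.
    fourth = second * 0.8
    return {"first": first, "second": second, "third": third, "fourth": fourth}

steps = pnp_step_sizes()
```

With the default values this gives approximately 0.08, 0.1, 0.3, and 0.08, in agreement with the worked examples.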

[0078] As can be seen from the above, this embodiment first calculates the overall error of matching two-dimensional image features with three-dimensional model features, and then adjusts the reprojection error threshold and kernel function threshold in the PnP algorithm parameters according to the relationship between the overall error and the preset matching threshold. This enables adaptive optimization of the matching process, improves matching accuracy and stability, and makes the solved UAV visual pose more accurate.
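The adaptive adjustment summarized above can be sketched as follows. The function names are hypothetical, both parameters are adjusted in each branch although the method also allows adjusting only one of them, and clamping to the effective ranges is an assumption based on the ranges stated in this embodiment.

```python
def clamp(value, low, high):
    """Limit value to the closed interval [low, high]."""
    return max(low, min(high, value))

def adjust_pnp_params(overall_error, reproj_thr, kernel_thr,
                      T=2.5, steps=(0.08, 0.1, 0.3, 0.08),
                      reproj_range=(1.0, 5.0), kernel_range=(0.3, 1.0)):
    """Return the adjusted (reprojection error threshold, kernel function threshold)."""
    first, second, third, fourth = steps
    if overall_error > T:
        # Matching quality is below standard: tighten the inlier selection
        # and raise the kernel function threshold.
        reproj_thr -= first
        kernel_thr += second
    else:
        # Matching quality meets the standard: relax the inlier selection
        # and lower the kernel function threshold.
        reproj_thr += third
        kernel_thr -= fourth
    return (clamp(reproj_thr, *reproj_range),
            clamp(kernel_thr, *kernel_range))

tightened = adjust_pnp_params(3.0, 3.0, 0.5)   # error above T
relaxed = adjust_pnp_params(2.0, 3.0, 0.5)     # error at or below T
saturated = adjust_pnp_params(2.0, 4.9, 0.32)  # clamped to the range limits
```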

[0079] In one embodiment of this application, the preset matching threshold is adjusted based on the UAV's flight speed, flight altitude, and prior environmental complexity, including: obtaining the flight speed and flight altitude of the UAV, and taking the spatial density of static feature points obtained from the 3D model of the mission area as the prior environmental complexity; calculating the speed influence factor, altitude influence factor, and environmental complexity influence factor separately, where the speed influence factor is positively correlated with flight speed, the altitude influence factor is negatively correlated with flight altitude, and the environmental complexity influence factor is negatively correlated with the spatial density of static feature points; calculating the weighted sum of the speed influence factor, altitude influence factor, and environmental complexity influence factor to obtain the operating condition adjustment coefficient; if the operating condition adjustment coefficient is greater than the preset operating condition threshold, increasing the preset matching threshold based on the first adjustment step size; and if the operating condition adjustment coefficient is less than the preset operating condition threshold, reducing the preset matching threshold based on the second adjustment step size.

[0080] In this embodiment, flight speed is the translational speed of the UAV in the absolute geographic coordinate system (unit: m / s), representing the UAV's motion state. The faster the speed, the more blurred or offset the matching pairs may appear, requiring a more relaxed preset matching threshold. Flight altitude is the altitude of the UAV relative to the ground (unit: m), affecting the resolution of aerial images and the clarity of feature points. The higher the altitude, the sparser the feature points, requiring a more relaxed preset matching threshold. Environmental prior complexity is an indicator of environmental complexity calculated based on the 3D model of the task area. In this embodiment, it is represented by the spatial density of static feature points, indicating the richness of matchable ground features in the scene. The lower the density, the more complex the environment, requiring a more relaxed preset matching threshold. Static feature point spatial density is the number of static feature points per unit geographic space in the 3D model of the task area, such as building corners and road marking intersections, with units of points / m³. The calculation formula is: Static feature point spatial density = total number of static feature points in the task area / volume of the task area.

[0081] In this embodiment, the speed influence factor is a coefficient positively correlated with flight speed, with a range of [1.0, 2.0], representing the degree of influence of flight speed on the matching threshold. The faster the speed, the larger the factor, and the greater the adjustment of the matching threshold. The altitude influence factor is a coefficient negatively correlated with flight altitude, with a range of [0.8, 1.5], representing the degree of influence of flight altitude on the matching threshold. The lower the altitude, the larger the factor, and the greater the adjustment of the matching threshold. The environmental complexity influence factor is a coefficient negatively correlated with the spatial density of static feature points, with a range of [0.9, 1.8], representing the degree of influence of environmental complexity on the matching threshold. The lower the density, the larger the factor, and the greater the adjustment of the matching threshold.

[0082] The operating condition adjustment coefficient is the weighted sum of the speed influence factor, altitude influence factor, and environmental complexity influence factor, with a value range of [2.7, 5.3]. It represents the requirement that the current UAV operating condition places on the matching threshold; the larger the coefficient, the higher the matching threshold needs to be set. The preset operating condition threshold is a critical value used to determine whether the operating condition adjustment coefficient exceeds the normal range. The first adjustment step size is the magnitude of a single increase of the preset matching threshold, applied when the operating condition adjustment coefficient is greater than the preset operating condition threshold, enabling rapid adaptation to complex operating conditions. The second adjustment step size is the magnitude of a single decrease of the preset matching threshold, applied when the operating condition adjustment coefficient is less than the preset operating condition threshold, enabling precise adaptation to simple operating conditions.
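A minimal sketch of the operating condition adjustment coefficient follows. The text fixes only the factor ranges and the direction of each correlation, so the linear mapping, the speed bounds, and the density bounds used here are hypothetical. The altitude bounds reuse the 50-100 meter aerial photography altitude mentioned for calibration, and unit weights are assumed because they reproduce the stated value range of [2.7, 5.3].

```python
def static_feature_density(num_static_points, volume_m3):
    """Static feature point spatial density in points per cubic meter."""
    return num_static_points / volume_m3

def linear_factor(value, v_min, v_max, f_min, f_max, increasing=True):
    """Map value linearly onto [f_min, f_max]; inverted when increasing is False.
    The linear form is an assumption, not taken from the text."""
    t = (value - v_min) / (v_max - v_min)
    t = max(0.0, min(1.0, t))
    if not increasing:
        t = 1.0 - t
    return f_min + t * (f_max - f_min)

def operating_condition_coefficient(speed, altitude, density,
                                    weights=(1.0, 1.0, 1.0),
                                    speed_bounds=(0.0, 20.0),      # m/s, hypothetical
                                    altitude_bounds=(50.0, 100.0), # m
                                    density_bounds=(0.0, 0.5)):    # points/m^3, hypothetical
    """Weighted sum of the speed, altitude, and environmental complexity factors."""
    speed_factor = linear_factor(speed, *speed_bounds, 1.0, 2.0, increasing=True)
    altitude_factor = linear_factor(altitude, *altitude_bounds, 0.8, 1.5, increasing=False)
    complexity_factor = linear_factor(density, *density_bounds, 0.9, 1.8, increasing=False)
    w_speed, w_altitude, w_complexity = weights
    return (w_speed * speed_factor + w_altitude * altitude_factor
            + w_complexity * complexity_factor)

lowest = operating_condition_coefficient(0.0, 100.0, 0.5)
highest = operating_condition_coefficient(20.0, 50.0, 0.0)
density = static_feature_density(50000, 1.0e6)  # 0.05 points/m^3
```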

[0083] Specifically, the basic parameters are set as follows: effective range of the preset matching threshold: T∈[1.5, 4.0] pixels; preset operating condition threshold: K0=3.5 (the critical value that distinguishes complex from simple operating conditions, calibrated through 100 sets of operating condition experiments); allowable threshold fluctuation coefficient: η=0.1; nonlinear coefficient: 0.3 (controls the step size growth rate). The operating condition deviation is defined as D=|K1-K0|, which represents the degree of difference between the current operating condition and the normal operating condition, where K1 is the adjustment coefficient of the current operating condition.

[0084] The first adjustment step size is calculated as follows: when K1 is greater than K0, the operating condition is complex; calculate the operating condition deviation D = K1 - K0 (D is positive because K1 is greater than K0); calculate the nonlinear adjustment coefficient = 1 + nonlinear coefficient × ln(D+1), where ln is the natural logarithm; calculate the adjustment range B = T0 × η × (1 + nonlinear coefficient × ln(D+1)); first adjustment step size = min(B, 0.3 pixels), where the upper limit of 0.3 pixels avoids an excessive step size in extreme operating conditions; the calculated first adjustment step size is rounded down to an integer multiple of 0.05 pixels.

[0085] For example, K1=4.3, D=4.3-3.5=0.8, T0=3.0 pixels, non-linear adjustment coefficient=1+0.3×ln(0.8+1)=1+0.3×0.5878≈1.176; B=3.0×0.1×1.176≈0.353 pixels; first adjustment step=min(0.353,0.3)=0.3 pixels, which is taken as 0.3 pixels in engineering.

[0086] The second adjustment step size is calculated as follows: when K1 is less than K0, the operating condition is simple; calculate the operating condition deviation D = K0 - K1 (D is positive because K1 is less than K0); calculate the nonlinear adjustment coefficient = 1 + nonlinear coefficient × ln(D+1); calculate the adjustment range B = T0 × η × (1 + nonlinear coefficient × ln(D+1)) × 0.8, where the factor of 0.8 makes the second adjustment step size smaller than the first adjustment step size; second adjustment step size = max(B, 0.08 pixels), where the lower limit of 0.08 pixels avoids slow adjustment; the calculated second adjustment step size is rounded down to an integer multiple of 0.05 pixels.

[0087] For example, K1=2.9, D=3.5-2.9=0.6, T0=2.0 pixels: non-linear adjustment coefficient=1+0.3×ln(0.6+1)=1+0.3×0.4700≈1.141; B=2.0×0.1×1.141×0.8≈0.183 pixels; second adjustment step=max(0.183,0.08)=0.183 pixels, which is taken as 0.2 pixels in engineering.
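The two worked examples can be checked with a short sketch. The name `gamma` for the nonlinear coefficient is hypothetical, T0 is read here as the current value of the preset matching threshold, and the zero step for K1 equal to K0 is an assumption, since that case is not described. The final rounding to a 0.05-pixel grid is left out, so the function returns the clamped value of B.

```python
import math

def matching_threshold_step(K1, T0, K0=3.5, eta=0.1, gamma=0.3,
                            upper=0.3, lower=0.08):
    """Adjustment step size of the preset matching threshold, in pixels."""
    D = abs(K1 - K0)                          # operating condition deviation
    nonlinear = 1.0 + gamma * math.log(D + 1.0)
    if K1 > K0:
        # Complex operating condition: first adjustment step size (capped).
        return min(T0 * eta * nonlinear, upper)
    if K1 < K0:
        # Simple operating condition: second adjustment step size (floored).
        return max(T0 * eta * nonlinear * 0.8, lower)
    return 0.0  # assumption: no adjustment when K1 equals K0

first_step = matching_threshold_step(K1=4.3, T0=3.0)
second_step = matching_threshold_step(K1=2.9, T0=2.0)
```

For K1=4.3 and T0=3.0 the cap applies and the step is 0.3 pixels; for K1=2.9 and T0=2.0 the step is about 0.183 pixels before any engineering rounding.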

[0088] As can be seen from the above, this embodiment adjusts the preset matching threshold according to the UAV's flight speed, altitude, and prior environmental complexity, so that the preset matching threshold can adapt to different flight conditions, thereby improving the matching accuracy and the accuracy of coordinate calculation.

[0089] In one embodiment of this application, fusing the optimized UAV visual pose with observation data from the inertial measurement unit to obtain the final three-dimensional position and attitude of the UAV in the absolute geographic coordinate system includes: establishing the state vector and observation model of the fusion filter, where the state vector includes the position error, velocity error, and attitude error of the UAV and the zero-bias error of the inertial measurement unit, and the observation model takes the optimized UAV visual pose as input and outputs a linear observation equation of the state vector; using the angular velocity and acceleration from the inertial measurement unit to update the state vector and its covariance matrix through the error state dynamics equation, obtaining the state prediction value at the current moment; using the observation model to transform the state prediction value into the observation space, obtaining the corresponding predicted observation; taking the optimized UAV visual pose as the actual observation and calculating the residual between the actual observation and the predicted observation; calculating the Kalman gain from the covariance matrix of the state prediction value, the preset observation noise covariance matrix, and the observation model; weighting the residual with the Kalman gain and using it to correct the state prediction value and its covariance matrix, obtaining the optimal state estimate and covariance matrix at the current time; and feeding the position and attitude errors in the optimal state estimate back to the navigation state of the UAV to obtain the final three-dimensional position and attitude output.
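For reference, the steps listed above correspond to the standard discrete-time Kalman filter recursion, written here as a sketch. The symbols are assumptions introduced for illustration: Φ_k is a discrete-time transition matrix derived from the error state dynamics, Q_k is the process noise covariance, H is the linear observation matrix, R is the observation noise covariance, and z_k is the actual observation formed from the optimized UAV visual pose.

```latex
\begin{aligned}
\hat{x}_k^- &= \Phi_k\,\hat{x}_{k-1} \\
P_k^- &= \Phi_k\,P_{k-1}\,\Phi_k^{\top} + Q_k \\
\hat{z}_k &= H\,\hat{x}_k^- \\
r_k &= z_k - \hat{z}_k \\
K_k &= P_k^-\,H^{\top}\left(H\,P_k^-\,H^{\top} + R\right)^{-1} \\
\hat{x}_k &= \hat{x}_k^- + K_k\,r_k \\
P_k &= \left(I - K_k H\right)P_k^-
\end{aligned}
```

The first two lines are the prediction driven by the inertial measurement unit, the next two give the predicted observation and the residual, and the last three give the Kalman gain and the correction of the state and its covariance.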

[0090] In this embodiment, the fusion filter is a state estimator based on Kalman filtering. Its function is to fuse the high accuracy of visual pose with the high real-time performance of the inertial measurement unit, and output stable, drift-free navigation results. This embodiment uses extended Kalman filtering.

[0091] In this embodiment, the error state dynamics equation is a differential equation describing the evolution of the state vector over time, derived from the kinematic model of the inertial measurement unit: Ẋ = F·X + W, where Ẋ is the time derivative of the state vector X, F is the state transition matrix, and W is the process noise (Gaussian white noise). The equation is used to update the state prediction value. The covariance matrix represents the statistical distribution of errors in each dimension of the state vector, including the process covariance matrix Q (describing process noise characteristics) and the state prediction covariance matrix P (describing the uncertainty of the state prediction value), with dimensions consistent with the state vector (15×15). The state prediction value is the estimate of the current state vector updated using observation data from the inertial measurement unit based on the error state dynamics equation, representing the error state prediction before incorporating visual observations. The observation space is the mathematical space where the observation model resides, used to convert the state prediction value into a form comparable to the actual observations. The predicted observations are the observation estimates obtained after the state prediction values are transformed by the observation model, with dimensions consistent with the actual observations, and are used to calculate the residuals.

[0092] In this embodiment, the residual is the difference between the actual observation and the predicted observation, representing the deviation between the predicted state and the actual observation. It serves as the input for Kalman gain calculation and state correction. The observation noise covariance matrix is a positive definite matrix representing the noise characteristics of the actual observation (visual pose), calibrated based on the error statistics of the visual pose, for example a position noise variance of 0.01 m² and an attitude noise variance of 0.001 rad². The Kalman gain is a weighting coefficient matrix (15×6) used to balance the uncertainty of the state prediction against the uncertainty of the observation noise. The larger the gain, the more the observation data is trusted; the smaller the gain, the more the predicted data is trusted. The optimal state estimate is the final error state estimate obtained by correcting the state prediction value with the residual weighted by the Kalman gain, and it has the minimum mean square error property. The navigation state is the navigation output of the UAV, including position and attitude; it is initially obtained by integrating the inertial measurement unit data and is output after correction by the optimal state estimate.
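How the gain balances prediction and observation can be seen in a one-dimensional sketch of the correction step. This is a scalar illustration only: the embodiment uses a 15-dimensional error state with a 6-dimensional observation, and the function name and the predicted variance of 0.04 m² are hypothetical. The observation noise variance of 0.01 m² is the example value given above.

```python
def kalman_correct(x_pred, P_pred, z, R, H=1.0):
    """Scalar Kalman correction step.
    Returns the corrected state, its variance, and the gain."""
    z_pred = H * x_pred            # predicted observation
    residual = z - z_pred          # actual minus predicted observation
    S = H * P_pred * H + R         # residual variance
    K = P_pred * H / S             # Kalman gain
    x_est = x_pred + K * residual  # corrected state estimate
    P_est = (1.0 - K * H) * P_pred # corrected variance
    return x_est, P_est, K

# Predicted position error 0 m with variance 0.04 m^2 (hypothetical),
# visual observation 0.5 m with noise variance 0.01 m^2.
x_est, P_est, K = kalman_correct(x_pred=0.0, P_pred=0.04, z=0.5, R=0.01)
```

Because the prediction is four times more uncertain than the observation, the gain is 0.8 and the corrected estimate moves most of the way toward the visual observation.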

[0093] As can be seen from the above, this embodiment utilizes the state vector and observation model of the fusion filter, combined with the observation data of the inertial measurement unit and the optimized UAV visual pose, and through steps such as state prediction, observation transformation, residual calculation, Kalman gain calculation and state correction, feeds back the position error and attitude error to the UAV navigation state, thereby obtaining a more accurate final three-dimensional position and attitude of the UAV in the absolute geographic coordinate system.

[0094] Corresponding to the coordinate calculation method for matching drone photos with ground 3D reality in the above embodiment, Figure 2 is a structural block diagram of a coordinate calculation system for matching UAV photos with 3D ground reality, provided as an embodiment of this application. For ease of explanation, only the parts relevant to this embodiment are shown. Referring to Figure 2, the coordinate calculation system 20 for matching drone photos with ground 3D reality includes: image acquisition module 21, static extraction module 22, image extraction module 23, feature extraction module 24, feature matching module 25, visual pose module 26, and position and attitude module 27.

[0095] Among them, the image acquisition module 21 is used to acquire images through the UAV's onboard camera to obtain aerial images of the mission area; the aerial images include: dynamic interference objects and static background. The static extraction module 22 is used to identify and mask dynamic interference areas in aerial images by combining semantic segmentation and motion detection to obtain static background images. Image extraction module 23 is used to extract features from static background images to obtain two-dimensional image features; The feature extraction module 24 is used to extract features from a preset static urban real-scene 3D model to obtain 3D model features; The feature matching module 25 is used to match the features of the two-dimensional image with the features of the three-dimensional model to establish the correspondence between the two-dimensional image points and the three-dimensional model points. The visual pose module 26 is used to calculate the initial exterior orientation elements of the UAV in the absolute geographic coordinate system based on the correspondence relationship using the PnP algorithm, and optimize the initial exterior orientation elements using the bundle adjustment method to obtain the optimized UAV visual pose. The position and attitude module 27 is used to fuse the optimized UAV visual pose with the observation data of the inertial measurement unit to obtain the final three-dimensional position and attitude of the UAV in the absolute geographic coordinate system; wherein, the observation data of the inertial measurement unit includes angular velocity and acceleration.

[0096] Figure 3 is a schematic block diagram of an electronic device provided according to an embodiment of this application. As shown in Figure 3, the electronic device 300 in this embodiment may include one or more processors 301, one or more input devices 302, one or more output devices 303, and one or more memories 304. The processors 301, input devices 302, output devices 303, and memories 304 communicate with each other via a communication bus 305. The memories 304 store computer programs, including program instructions. The processors 301 execute the program instructions stored in the memories 304. Specifically, the processors 301 are configured to invoke the program instructions to perform the functions of the modules in the aforementioned system embodiment, for example the functions of the image acquisition module 21, static extraction module 22, image extraction module 23, feature extraction module 24, feature matching module 25, visual pose module 26, and position and attitude module 27 shown in Figure 2.

[0097] It should be understood that, in the embodiments of this application, the processor 301 may be a central processing unit (CPU), or it may be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor may be a microprocessor or any conventional processor.

[0098] Input device 302 may include a touchpad, a fingerprint sensor (for collecting the user's fingerprint information and fingerprint orientation information), a microphone, etc., and output device 303 may include a display (LCD, etc.), a speaker, etc.

[0099] The memory 304 may include read-only memory and random access memory, and provides instructions and data to the processor 301. A portion of the memory 304 may also include non-volatile random access memory. For example, the memory 304 may also store device type information.

[0100] In specific implementations, the processor 301, input device 302, and output device 303 described in the embodiments of this application can execute the implementation methods described in any embodiment of the coordinate calculation method for matching UAV photos with ground three-dimensional real scenes provided in the embodiments of this application, or they can execute the implementation methods of the electronic devices described in the embodiments of this application, which will not be repeated here.

[0101] In another embodiment of this application, a computer-readable storage medium is provided. This computer-readable storage medium stores a computer program, which includes program instructions. When executed by a processor, the program instructions implement all or part of the processes in the methods described above. Alternatively, the computer program can instruct related hardware to complete the process. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. The computer-readable medium can include any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a portable hard drive, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium, etc.

[0102] The computer-readable storage medium can be an internal storage unit of the electronic device in any of the foregoing embodiments, such as a hard disk or memory of the electronic device. The computer-readable storage medium can also be an external storage device of the electronic device, such as a plug-in hard disk, smart media card (SMC), secure digital card (SD), flash card, etc., equipped on the electronic device. Furthermore, the computer-readable storage medium can include both internal and external storage units of the electronic device. The computer-readable storage medium is used to store computer programs and other programs and data required by the electronic device. The computer-readable storage medium can also be used to temporarily store data that has been output or will be output.

[0103] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.

[0104] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working process of the electronic devices and units described above can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.

[0105] In the several embodiments provided in this application, it should be understood that the disclosed electronic devices and methods can be implemented in other ways. For example, the device embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces or units, or it may be an electrical, mechanical, or other form of connection.

[0106] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of the embodiments of this application, depending on actual needs.

[0107] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0108] The above are merely specific embodiments of this application, but the scope of protection of this application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in this application, and these modifications or substitutions should all be covered within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. A coordinate calculation method for matching drone photos with 3D ground reality, characterized in that it comprises: acquiring images using an onboard camera of a UAV to obtain aerial images of the mission area, the aerial images including dynamic interference objects and a static background; identifying and masking dynamic interference areas in the aerial images by combining semantic segmentation and motion detection to obtain static background images; performing feature extraction on the static background images to obtain two-dimensional image features; performing feature extraction on a preset static urban real-scene 3D model to obtain 3D model features; matching the two-dimensional image features with the 3D model features to establish a correspondence between two-dimensional image points and 3D model points; based on the correspondence, using the PnP algorithm to calculate the initial exterior orientation elements of the UAV in the absolute geographic coordinate system, and using bundle adjustment to optimize the initial exterior orientation elements to obtain the optimized UAV visual pose; and fusing the optimized UAV visual pose with the observation data from the inertial measurement unit to obtain the final three-dimensional position and attitude of the UAV in the absolute geographic coordinate system; wherein the observation data from the inertial measurement unit includes angular velocity and acceleration.

2. The coordinate calculation method for matching UAV photos with 3D ground reality according to claim 1, characterized in that the step of matching the two-dimensional image features with the three-dimensional model features to establish a correspondence between two-dimensional image points and three-dimensional model points includes: performing a first stage of matching using a method based on the nearest neighbor distance ratio to obtain initial matching pairs; and processing the initial matching pairs using a random sample consensus (RANSAC) algorithm to obtain the two-dimensional to three-dimensional point correspondence.

3. The coordinate calculation method for matching UAV photos with 3D ground reality according to claim 2, characterized in that it further includes: when the matching results of the two-dimensional image features and the three-dimensional model features contain multiple spatially separated candidate matching regions, performing a first screening of the multiple candidate matching regions based on geometric consistency and semantic consistency to obtain valid candidate regions; performing consistency verification and optimization of the valid candidate regions using the multi-view geometric constraints of the current frame and several adjacent preceding frames; and, based on short-time motion prediction from the inertial measurement unit, verifying pose continuity of the candidate regions that passed multi-view verification to obtain the final matching pairs, which are then used as the initial matching pairs.

4. The coordinate calculation method for matching UAV photos with 3D ground reality according to claim 3, characterized in that the first screening of the multiple candidate matching regions based on geometric consistency and semantic consistency to obtain valid candidate regions includes: calculating a geometric consistency score based on the reprojection error distribution, spatial distribution, and inlier ratio of the matching point pairs in each candidate matching region; determining a semantic consistency score by comparing the degree of match between the static features identified in the aerial images and the preset semantic information of the corresponding candidate region in the 3D model; fusing the geometric consistency score and semantic consistency score of each candidate region to obtain a comprehensive score; and selecting the candidate regions whose comprehensive score is greater than a preset comprehensive score threshold to obtain the valid candidate regions.

5. The coordinate calculation method for matching UAV photos with 3D ground reality according to claim 1, characterized in that, after matching the two-dimensional image features with the three-dimensional model features to establish the correspondence between two-dimensional image points and three-dimensional model points, the method further includes: calculating the overall matching error and determining whether the overall error is greater than a preset matching threshold; if the overall error is greater than the preset matching threshold, reducing the reprojection error threshold in the PnP algorithm parameters based on the first step size, and/or increasing the kernel function threshold in the PnP algorithm parameters based on the second step size; and if the overall error is less than or equal to the preset matching threshold, increasing the reprojection error threshold in the PnP algorithm parameters based on the third step size, and/or decreasing the kernel function threshold in the PnP algorithm parameters based on the fourth step size.

6. The coordinate calculation method for matching UAV photos with 3D ground reality according to claim 5, characterized in that the preset matching threshold is adjusted based on the UAV's flight speed, flight altitude, and prior environmental complexity, including: obtaining the flight speed and flight altitude of the UAV, and taking the spatial density of static feature points obtained from the 3D model of the mission area as the prior environmental complexity; calculating the speed influence factor, altitude influence factor, and environmental complexity influence factor respectively, wherein the speed influence factor is positively correlated with flight speed, the altitude influence factor is negatively correlated with flight altitude, and the environmental complexity influence factor is negatively correlated with the spatial density of static feature points; calculating the weighted sum of the speed influence factor, altitude influence factor, and environmental complexity influence factor to obtain the operating condition adjustment coefficient; if the operating condition adjustment coefficient is greater than the preset operating condition threshold, increasing the preset matching threshold based on the first adjustment step size; and if the operating condition adjustment coefficient is less than the preset operating condition threshold, reducing the preset matching threshold based on the second adjustment step size.

7. The coordinate calculation method for matching UAV photos with 3D ground reality according to claim 1, characterized in that fusing the optimized UAV visual pose with the observation data from the inertial measurement unit to obtain the final three-dimensional position and attitude of the UAV in the absolute geographic coordinate system includes: establishing a state vector and an observation model for the fusion filter, wherein the state vector includes the UAV's position error, velocity error, and attitude error, and the bias error of the inertial measurement unit, and the observation model takes the optimized UAV visual pose as input and outputs a linear observation equation for the state vector; using the angular velocity and acceleration from the inertial measurement unit, propagating the state vector and its covariance matrix through the error-state dynamics equation to obtain the predicted state value at the current moment; using the observation model, transforming the predicted state value into the observation space to obtain the corresponding predicted observation; taking the optimized UAV visual pose as the actual observation, and calculating the residual between the actual observation and the predicted observation; calculating the Kalman gain based on the covariance matrix of the predicted state value, the preset observation noise covariance matrix, and the observation model; weighting the residual with the Kalman gain and using it to correct the predicted state value and its covariance matrix, to obtain the optimal state estimate and covariance matrix at the current moment; and feeding the position error and attitude error in the optimal state estimate back to the navigation state of the UAV to obtain the final three-dimensional position and attitude output.
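The prediction and correction steps of claim 7 follow the standard linear Kalman filter equations. The sketch below is illustrative only, under several assumptions: a 15-dimensional error state ordered as position, velocity, attitude, gyroscope bias, accelerometer bias; a transition matrix `F` and process noise `Q` already derived by the caller from the IMU angular velocity and acceleration; and an observation `z` formed as the difference between the optimized visual pose and the inertial navigation pose (position and small-angle attitude). None of these details are specified in the claim.

```python
import numpy as np

N_STATE = 15  # assumed layout: pos(0:3), vel(3:6), att(6:9), gyro bias(9:12), accel bias(12:15)


def observation_matrix():
    """Linear observation model H: the visual pose observes position and attitude errors."""
    H = np.zeros((6, N_STATE))
    H[0:3, 0:3] = np.eye(3)  # position error
    H[3:6, 6:9] = np.eye(3)  # attitude error
    return H


def predict(x, P, F, Q):
    """Propagate the error state and covariance with a caller-supplied F and Q."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred


def update(x_pred, P_pred, z, H, R):
    """Correct the prediction with the visual pose observation z."""
    z_pred = H @ x_pred                   # predicted observation
    residual = z - z_pred                 # actual minus predicted observation
    S = H @ P_pred @ H.T + R              # innovation covariance
    # Kalman gain K = P H^T S^-1, computed with a solve (S and P_pred are symmetric).
    K = np.linalg.solve(S, H @ P_pred).T
    x_est = x_pred + K @ residual
    P_est = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x_est, P_est


def feed_back_position(nominal_position, x_est):
    """Apply the estimated position error, assuming error = true minus nominal."""
    return nominal_position + x_est[0:3]
```

Attitude feedback is omitted because it requires composing rotations rather than adding vectors. After feedback, an error-state filter normally resets the corrected error components to zero before the next prediction.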

8. A coordinate calculation system for matching UAV photos with 3D ground reality, characterized in that it comprises: an image acquisition module, used to acquire aerial images of the mission area through the UAV's onboard camera, the aerial images including dynamic interference objects and a static background; a static extraction module, used to identify and mask dynamic interference areas in the aerial images by combining semantic segmentation and motion detection to obtain a static background image; an image extraction module, used to extract features from the static background image to obtain two-dimensional image features; a feature extraction module, used to extract features from a preset static urban real-scene 3D model to obtain three-dimensional model features; a feature matching module, used to match the two-dimensional image features with the three-dimensional model features to establish the correspondence between two-dimensional image points and three-dimensional model points; a visual pose module, used to calculate, based on the correspondence and using the PnP algorithm, the initial exterior orientation elements of the UAV in the absolute geographic coordinate system, and to optimize the initial exterior orientation elements using bundle adjustment to obtain the optimized UAV visual pose; and a position and attitude module, used to fuse the optimized UAV visual pose with the observation data of the inertial measurement unit to obtain the final three-dimensional position and attitude of the UAV in the absolute geographic coordinate system, wherein the observation data of the inertial measurement unit includes angular velocity and acceleration.

9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 7.

10. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 7 are implemented.