A 3D Gaussian reconstruction method for removing dynamic objects in unmanned aerial vehicle field scenes
By using a drone equipped with a high-resolution camera and a 3D Gaussian implicit representation model, combined with instance semantic labels and residual statistics across training cycles, the problem of dynamic object recognition in complex outdoor scenes was solved, achieving efficient 3D reconstruction results.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- XINJIANG UNIVERSITY
- Filing Date
- 2026-03-11
- Publication Date
- 2026-06-12
AI Technical Summary
Existing 3D reconstruction methods struggle to handle dynamic objects in complex outdoor scenes and are particularly weak at recognizing irregularly moving objects. Furthermore, traditional methods have high hardware requirements and are difficult to adapt to dynamically changing environments.
By using a drone equipped with a high-resolution camera, a three-dimensional Gaussian implicit representation model is constructed through feature extraction and segmentation of multi-view image sequences. Combined with instance semantic labels and residual statistics across training cycles, dynamic object removal is performed. A depth estimation network is used for depth map completion and confidence weight calibration to optimize the processing strategy for dynamic regions.
It significantly improves the accuracy of distinguishing between dynamic and static objects, enhances the geometric consistency and reliability of 3D reconstruction results, reduces hardware dependence, and improves the integrity and accuracy of 3D models.
Figure CN122199802A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of image analysis technology, specifically a 3D Gaussian reconstruction method for removing dynamic objects in unmanned aerial vehicle (UAV) field scenes.
Background Technology
[0002] Most existing 3D reconstruction methods rely on static scene modeling and neglect the impact of dynamic objects on the reconstruction results. In complex outdoor scenes, dynamic objects such as pedestrians, vehicles, and aircraft coexist with static backgrounds such as buildings, trees, and rocks, which makes the scene diverse and complex. Optical flow methods usually rely on pixel-level motion estimation. Although they can capture the motion trajectories of dynamic objects well, they are prone to misjudgment in outdoor scenes with complex backgrounds and noise. Deep learning-based recognition and segmentation methods perform well in static scenes, but they often cannot accurately determine the motion state of objects, and they are especially weak at recognizing irregularly moving objects in complex scenes.
[0003] Most existing 3D reconstruction technologies use depth information based on structured light or lidar. Although they can provide relatively accurate 3D models, these methods usually have high hardware requirements and are limited in application in large-scale field environments. Traditional 3D reconstruction systems are difficult to adapt to dynamically changing environments. Therefore, there is a need for a 3D reconstruction technology that can adapt to the characteristics of complex and dynamic outdoor scenes, while reducing dependence on hardware and computing resources and effectively mitigating the impact of dynamic objects on the reconstruction results, so as to improve the integrity and accuracy of the 3D model.
Summary of the Invention
[0004] The purpose of this invention is to provide a 3D Gaussian reconstruction method for removing dynamic objects in unmanned aerial vehicle (UAV) field scenes, in order to solve the problems raised in the prior art.
[0005] To address the aforementioned technical problems, this invention provides the following technical solution: a 3D Gaussian reconstruction method for removing dynamic objects in unmanned aerial vehicle (UAV) field scenes, the method comprising:
Step S100: Collect a series of continuous images from multiple perspectives using a high-resolution camera mounted on a drone, extract features from the image sequence, generate a camera pose matrix, segment the images and identify the state of objects in the images, generate static and dynamic candidate sets, and construct a three-dimensional Gaussian implicit representation model.
Step S200: Set the training period, filter the instance IDs that are suspected static candidates, calculate the 3D Gaussian rendered image based on the 3D Gaussian implicit representation model, combine it with the real acquired image to generate the residual map, calculate the judgment threshold, and determine the preliminary dynamic region.
Step S300: Divide the instance IDs based on the instance semantic tags, calculate the calibration coefficient of each semantic group, and determine the dynamic attributes of each instance ID.
Step S400: Calculate the normalized depth value and dynamic adjustment coefficient of the dynamic instance region, and calculate the confidence weight based on the normalized depth value and dynamic adjustment coefficient. Based on the confidence weight, determine the optimization processing strategy for each depth feature in the dynamic instance region.
[0006] Furthermore, step S100 includes:
Step S101: Install a stabilizing gimbal on the front end of the drone and deploy a high-resolution camera on the stabilizing gimbal. Preset the flight path of the drone through the dynamic removal platform and cruise and shoot according to the preset flight path. During the cruise and shooting, simultaneously collect multi-view RGB image sequences and corresponding pose data of the field scene. Align the collected multi-view RGB image sequences and pose data by timestamp, combine them into a shooting record, and upload the shooting record to the dynamic removal platform.
Step S102: Acquire historical shooting records, collect multi-view RGB image sequences from the historical shooting records, extract features from each image, and generate a camera pose matrix.
Step S103: Segment each image, identify dynamic and static objects in the segmented images, and generate static and dynamic candidate sets.
Step S104: Based on the camera pose matrix and static candidate set, process the multi-view image data collected by the UAV and the corresponding sparse 3D point cloud to construct a 3D Gaussian implicit representation model of the scene.
By installing a stabilizing gimbal and deploying a high-resolution camera at the front of the drone, and combining it with a preset flight path for automatic cruise shooting, the impact of drone flight jitter and attitude changes on image quality is effectively reduced, ensuring the consistency of multi-view RGB image sequences in space and time, and providing high-quality basic data for subsequent 3D reconstruction and modeling.
During cruise shooting, multi-view RGB image sequences and corresponding pose data are acquired simultaneously, and shooting records are generated by aligning with timestamps to avoid errors caused by the asynchrony between images and pose information, improve the accuracy of camera pose estimation, and thus enhance the geometric consistency and reliability of 3D reconstruction results. By acquiring historical shooting records and extracting features from historical multi-view RGB images, a camera pose matrix is generated, enabling the current shooting data to be correlated with historical data. This effectively alleviates the problem of insufficient data or limited viewpoints in a single flight, and improves the stability and robustness of camera pose solving.
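The timestamp alignment in step S101 can be illustrated with a minimal Python sketch. The patent does not specify how images and pose samples are paired, so the nearest-neighbor rule, the function name `align_by_timestamp`, and the `max_gap` tolerance below are all assumptions made for illustration.

```python
from bisect import bisect_left


def align_by_timestamp(image_times, pose_times, max_gap=0.05):
    """Pair each image with the nearest pose sample in time.

    image_times: image timestamps in seconds.
    pose_times: pose timestamps in seconds, sorted ascending.
    max_gap: hypothetical tolerance; pairs further apart are dropped.
    Returns a list of (image_index, pose_index) pairs.
    """
    pairs = []
    for i, t in enumerate(image_times):
        k = bisect_left(pose_times, t)
        # Candidates are the pose samples just before and just after t.
        candidates = [j for j in (k - 1, k) if 0 <= j < len(pose_times)]
        if not candidates:
            continue
        best = min(candidates, key=lambda j: abs(pose_times[j] - t))
        if abs(pose_times[best] - t) <= max_gap:
            pairs.append((i, best))
    return pairs
```

Images without a pose sample inside the tolerance are dropped instead of being paired with a stale pose, which is one simple way to avoid the asynchrony errors the paragraph describes.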
[0007] Furthermore, in step S102, generating the camera pose matrix includes the following steps:
Step S102-1: Use the multi-view RGB image sequence as input to the COLMAP platform, perform feature extraction on each RGB frame to obtain the homogeneous two-dimensional pixel coordinates (u, v, 1) of the feature points in the camera pixel plane, and establish feature correspondences between images from different viewpoints through feature matching.
Step S102-2: Based on the multi-view feature matching relationship established in step S102-1, use the incremental structure-from-motion (SfM) algorithm of the COLMAP platform to jointly optimize the three-dimensional spatial points (X, Y, Z, 1) corresponding to the feature points and the camera pose parameters of each frame according to the following formula:
$$ s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} R & t \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} $$
where s is the scale factor, R and t are the rotation matrix and translation vector in the camera extrinsic matrix, respectively, and K is the camera intrinsic matrix. By minimizing the reprojection error of the feature points, the camera pose parameters are optimized to obtain the accurate camera pose of each frame in the world coordinate system.
By using multi-view RGB image sequences as input to the COLMAP platform, feature extraction is performed on each frame to obtain the two-dimensional pixel coordinates of the feature points in the camera pixel plane. Combined with feature matching, feature correspondences between images from different viewpoints are established, thereby constructing stable and reliable geometric constraints between multi-view images and providing an accurate data foundation for subsequent 3D point reconstruction and pose estimation.
Based on the multi-view feature matching relationships, the incremental structure-from-motion algorithm gradually recovers and optimizes the three-dimensional spatial points corresponding to the feature points, so that the points are continuously updated under multi-view constraints. This effectively reduces the impact of mismatches and noise on the reconstruction results and improves the spatial consistency and geometric accuracy of the sparse three-dimensional point cloud. In the pose estimation process, the camera intrinsic matrix, rotation matrix, and translation vector are introduced, and the reprojection error of the feature points is used as the optimization objective, so that the camera pose parameters and 3D spatial points are jointly optimized. This avoids the cumulative error caused by estimating pose and structure separately in traditional methods, thus obtaining more accurate and stable camera pose results.
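As a rough illustration of the projection relation and the reprojection error used as the optimization objective, the following Python sketch projects a world point with given K, R, and t and averages the squared pixel error. It is not COLMAP's implementation. The function names and the plain nested-list matrix format are assumptions chosen to keep the example self-contained.

```python
def project(K, R, t, X):
    """Project a world point X = (X, Y, Z) to pixel coordinates (u, v).

    Implements s * [u, v, 1]^T = K [R | t] [X, Y, Z, 1]^T with 3x3 lists
    for K and R and a length-3 list for t.
    """
    # Camera-frame coordinates: Xc = R @ X + t
    Xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    # Homogeneous pixel coordinates: p = K @ Xc
    p = [sum(K[i][j] * Xc[j] for j in range(3)) for i in range(3)]
    s = p[2]  # the scale factor is the third homogeneous component
    return p[0] / s, p[1] / s


def reprojection_error(K, R, t, points3d, pixels):
    """Mean squared distance between projected and observed pixels."""
    total = 0.0
    for X, (u, v) in zip(points3d, pixels):
        pu, pv = project(K, R, t, X)
        total += (pu - u) ** 2 + (pv - v) ** 2
    return total / len(points3d)
```

In a full SfM pipeline this error would be minimized jointly over all camera poses and 3D points (bundle adjustment). The sketch only evaluates it for one camera.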
[0008] By introducing a scale factor and applying uniform constraints during the optimization process, the spatial scale of images from multiple perspectives is kept consistent, effectively suppressing scale drift and pose accumulation errors that may occur during incremental reconstruction, and improving the global consistency of camera pose in the world coordinate system.
[0009] Furthermore, generating the static and dynamic candidate sets in step S103 includes the following steps:
Step S103-1: Preset image segmentation parameters, segment the multi-view RGB image sequence according to the image segmentation parameters, and assign a unique instance ID to each segmented region.
Step S103-2: Perform semantic recognition on the pixel region corresponding to each instance ID, obtain the semantic category label of the instance, call the preset semantic motion attribute mapping rule, perform attribute matching on the instance semantic label, and divide the semantic space into a static attribute domain and a potential dynamic attribute domain.
Step S103-3: Traverse each instance ID in the full scene instance set. When the semantic tag of an instance hits the potential dynamic attribute domain, add the instance ID to the suspected dynamic instance candidate set. When the semantic tag of an instance hits the static attribute domain, add the instance ID to the suspected static instance candidate set. When the semantic tag of an instance is associated with both the static attribute domain and the potential dynamic attribute domain, determine the instance as a hybrid semantic instance and determine the corresponding motion attribute.
By segmenting multi-view RGB image sequences using preset image segmentation parameters and assigning a unique instance ID to each segmented region, different targets can be clearly distinguished at the pixel level, avoiding the target aliasing problem caused by traditional whole-image or coarse-grained region segmentation, and providing a refined data foundation for subsequent semantic recognition and motion attribute judgment. Semantic recognition is performed on the pixel region corresponding to each instance ID, and the semantic category label of the instance is obtained.
By calling the preset semantic motion attribute mapping rules, the semantic space is divided into static attribute domain and potential dynamic attribute domain, so that the dynamic attribute judgment of the target no longer depends only on short-term motion features, but also combines the prior semantic attributes of the target, effectively reducing the false judgment rate. By traversing the entire scene instance set, and based on the hit status of instance semantic tags in different attribute domains, instances are divided into a suspected dynamic instance candidate set, a suspected static instance candidate set, and a mixed semantic instance set. This enables separate labeling and further judgment of targets with uncertain motion attributes, improving the accuracy and flexibility of target classification in complex scenes.
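The traversal in step S103-3 can be sketched in Python as follows. The contents of the two attribute domains are hypothetical examples drawn from the object types mentioned in the background section; the patent leaves the actual semantic motion attribute mapping rule as a preset.

```python
# Hypothetical attribute domains; the real mapping rule is a preset.
STATIC_DOMAIN = {"ground", "building", "tree", "rock"}
DYNAMIC_DOMAIN = {"pedestrian", "vehicle", "aircraft"}


def partition_instances(instance_labels):
    """Split instance IDs into suspected static, suspected dynamic, and hybrid sets.

    instance_labels: dict mapping instance ID -> set of semantic labels
    found inside that instance's pixel region.
    """
    static, dynamic, hybrid = [], [], []
    for inst_id, labels in instance_labels.items():
        hits_static = bool(labels & STATIC_DOMAIN)
        hits_dynamic = bool(labels & DYNAMIC_DOMAIN)
        if hits_static and hits_dynamic:
            hybrid.append(inst_id)   # needs the dominant-label decision
        elif hits_dynamic:
            dynamic.append(inst_id)
        elif hits_static:
            static.append(inst_id)
    return static, dynamic, hybrid
```

Instances whose labels fall in neither domain are left out of all three lists in this sketch; how such instances should be treated is not stated in the text.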
[0010] Furthermore, the determination of the corresponding motion attribute in step S103-3 includes the following steps: when a single instance segmentation region corresponds to multiple different semantic sub-regions, the pixel area ratio of each semantic sub-region in the instance is counted, and the semantic sub-region with the largest pixel area ratio is selected as the dominant semantic label of the instance. When the pixel area ratio of the dominant semantic sub-region exceeds a preset threshold, the motion attribute of the instance is determined based on the dominant semantic label. Selecting the dominant label from the pixel area ratios of all semantic sub-regions avoids the misjudgment caused by judging semantics from only a local region or a small number of pixels, and makes the overall semantic attribute of the instance more consistent with the real physical scene. In real-world scenarios, targets are often occluded, overlapping, or touching, such as pedestrians and bicycles, or vehicles and background structures. Pixel proportion statistics automatically highlight the main semantic component, effectively suppressing the influence of local interfering semantics on the overall attribute determination and improving adaptability to complex target shapes. Because the motion attribute is determined only when the proportion of the dominant semantic sub-region exceeds the preset pixel area ratio threshold, incorrect decisions are avoided when the semantic distribution is highly scattered or no sub-region clearly dominates, thereby improving the stability and reliability of motion attribute judgment.
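The dominant-label rule above can be sketched in a few lines of Python. The threshold value 0.6 and the label-to-attribute mapping passed in by the caller are illustrative assumptions; the patent only says that the threshold is preset.

```python
def dominant_motion_attribute(sub_region_pixels, label_to_attribute, min_ratio=0.6):
    """Pick the dominant semantic label of a hybrid instance.

    sub_region_pixels: dict mapping semantic label -> pixel count inside
    the instance region.
    label_to_attribute: dict mapping label -> "static" or "dynamic"
    (hypothetical mapping supplied by the caller).
    Returns (label, ratio, attribute); attribute is None when the
    dominant ratio does not exceed min_ratio, i.e. no decision is made.
    """
    total = sum(sub_region_pixels.values())
    label, pixels = max(sub_region_pixels.items(), key=lambda kv: kv[1])
    ratio = pixels / total
    attribute = label_to_attribute[label] if ratio > min_ratio else None
    return label, ratio, attribute
```

For a region with 700 pedestrian pixels and 300 bicycle pixels, the dominant label is "pedestrian" with ratio 0.7, so the instance inherits the pedestrian's motion attribute. With a 550/450 split the ratio stays below the assumed threshold and the sketch returns no decision.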
[0011] Furthermore, in step S104, constructing a 3D Gaussian implicit representation model of the scene includes the following steps:
Step S104-1: Perform preprocessing operations on the multi-view RGB image sequence. The preprocessing operations include noise reduction, contrast enhancement, and sharpness enhancement. Based on the preprocessed multi-view RGB image sequence, combined with camera pose parameters, sparse 3D point cloud data of the scene is generated using SfM.
Step S104-2: Using the sparse 3D point cloud generated in step S104-1 as initialization input, map each 3D point to a 3D Gaussian primitive, wherein the 3D Gaussian primitive is jointly defined by the spatial mean μ, the covariance matrix Σ, the opacity β, and the spherical harmonic color parameter c, and its probability density function is expressed as:
$$ G(x) = \exp\left(-\tfrac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right) $$
Step S104-3: After the three-dimensional Gaussian model is initialized, the three-dimensional Gaussian primitives are projected onto the two-dimensional imaging plane using differentiable rasterization. The Gaussian primitives located in the camera's view frustum are sorted by depth, and pixel-level rendering is completed through alpha blending to establish the mapping between the three-dimensional Gaussian space and the two-dimensional image pixels.
By performing preprocessing operations such as denoising, contrast enhancement, and sharpness enhancement on multi-view RGB image sequences, the signal-to-noise ratio and detail of the original images are effectively improved. This improves the feature matching accuracy and spatial distribution of the SfM sparse 3D point cloud generated from the preprocessed images and camera pose parameters, providing a stable and reliable geometric foundation for the subsequent initialization of the 3D Gaussian model.
Each 3D point in the sparse 3D point cloud generated by SfM is mapped to a 3D Gaussian element, and it is jointly modeled by spatial mean, covariance matrix, opacity parameter and spherical harmonic function color parameter, so that the scene is transformed from a discrete point set representation to a continuous probability density distribution representation, effectively alleviating the hole and discontinuity problems caused by the sparsity of point cloud. By introducing a covariance matrix to describe the spatial scale and anisotropic characteristics of Gaussian elements, and combining opacity and spherical harmonic function color parameters to model appearance information, 3D Gaussian elements can simultaneously depict the geometric structure of a scene and view-dependent color changes, thereby improving the precision and realism of 3D scene representation.
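The depth sorting and alpha blending of step S104-3 can be illustrated for a single pixel with a short Python sketch. It assumes each Gaussian primitive covering the pixel has already been reduced to a depth, an RGB color, and an effective alpha at that pixel. The function name and tuple layout are hypothetical, and real 3D Gaussian rasterizers perform this per tile on the GPU.

```python
def composite_pixel(splats):
    """Front-to-back alpha blending of the splats covering one pixel.

    splats: list of (depth, (r, g, b), alpha) tuples, in any order.
    Returns (color, remaining_transmittance).
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for _, rgb, alpha in sorted(splats, key=lambda s: s[0]):  # near to far
        weight = alpha * transmittance
        color = [c + weight * ch for c, ch in zip(color, rgb)]
        transmittance *= 1.0 - alpha
    return color, transmittance
```

A half-transparent red splat at depth 2 in front of an opaque blue splat at depth 5 yields an equal mix of red and blue with zero remaining transmittance, regardless of the order in which the splats are passed in.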
[0012] Furthermore, step S200 includes:
Step S201: Select several consecutive days as the training period, obtain the shooting records within the training period, filter the set of instance IDs that are suspected static candidates, calculate the 3D Gaussian rendered image through the three-dimensional Gaussian implicit representation model, construct the original residual map by calculating the pixel-level difference between the 3D Gaussian rendered image and the real captured image, and generate the residual map by adopting a linear weighted fusion strategy that combines the L1 norm and the structural similarity index.
Step S202: Set the minimum and maximum thresholds as preset parameters, and calculate the judgment threshold according to the following formula: [formula missing]; where A_p is the judgment threshold at the p-th training iteration, a1 is the minimum threshold, a2 is the maximum threshold, and a is the maximum number of iterations.
Step S203: When the residual value of the calculated residual map exceeds the current judgment threshold, the corresponding region is marked as a preliminary dynamic region, and a binary mask containing noise is generated.
By selecting several consecutive days as the training period and filtering the set of instance IDs suspected to be static candidates based on the shooting records within the training period, the constructed 3D Gaussian implicit representation model can fully learn the long-term stable structure in the scene, effectively reducing the interference of occasional dynamic targets or short-term noise on model training and subsequent dynamic judgment. A 3D Gaussian rendered image is generated from the corresponding viewpoint by the 3D Gaussian implicit representation model and compared with the real acquired image at the pixel level to construct the original residual map.
The residual information can directly reflect the deviation between the real observation and the static 3D scene model, revealing the potential dynamic area from the perspective of 3D consistency. It has a higher discrimination ability than the method based on 2D image difference only. A linear weighted fusion strategy is adopted, which introduces the L1 norm and structural similarity index into the residual map generation process. This allows the residual map to depict both the absolute differences in pixel intensity and the changes in local structure and texture consistency, effectively suppressing the influence of illumination changes and imaging noise on residual calculation and improving the stability of dynamic region response. By setting minimum and maximum thresholds and dynamically calculating the judgment threshold according to the training iteration progress, the dynamic region judgment criteria can be adaptively adjusted as the model training gradually converges, avoiding a large number of false detections caused by model instability in the early stage of training, while improving the detection sensitivity to subtle dynamic changes in the later stage of training.
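The following Python sketch shows one way steps S201 to S203 could fit together. The threshold formula is not available in this text, so the linear schedule from a2 down to a1 is purely an assumption, chosen to match the stated behavior of a strict threshold early in training and a more sensitive one later. The fusion weight `lam` and the use of 1 - SSIM as the structural term are likewise assumptions, and the per-pixel L1 and SSIM maps are taken as given inputs.

```python
def judgment_threshold(p, a, a1, a2):
    """Assumed linear schedule: a2 at iteration 0, a1 at iteration a."""
    frac = min(max(p / a, 0.0), 1.0)
    return a2 - (a2 - a1) * frac


def fused_residual(l1, ssim, lam=0.8):
    """Linear weighted fusion of an L1 term and a structural term.

    l1: absolute pixel difference in [0, 1].
    ssim: local structural similarity in [0, 1] (1 means identical).
    lam: hypothetical fusion weight.
    """
    return lam * l1 + (1.0 - lam) * (1.0 - ssim)


def dynamic_mask(l1_map, ssim_map, threshold, lam=0.8):
    """Binary mask: 1 where the fused residual exceeds the threshold."""
    return [
        [1 if fused_residual(l, s, lam) > threshold else 0
         for l, s in zip(l1_row, ssim_row)]
        for l1_row, ssim_row in zip(l1_map, ssim_map)
    ]
```

The mask produced this way is still noisy, which is consistent with step S203; the later semantic calibration is what cleans it up.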
[0013] Furthermore, step S300 includes:
Step S301: Based on the instance semantic tags, divide the instance IDs into a set of suspected static instances and a set of suspected dynamic instances, and calculate the mean of the residual values within each semantic set as the group residual benchmark for the corresponding semantic group.
Step S302: Calculate the absolute difference between the residual value of each instance ID and the group residual benchmark of its semantic group. When the absolute difference is greater than a preset absolute threshold, the instance ID is determined to be an abnormal instance ID. Calculate the calibration coefficient according to the following formula: [formula missing]; where the quantities involved are the calibration coefficient, the ratio T of the group residual benchmark to the residual value of the abnormal instance ID, the current iteration number t, the preset maximum number of iterations, the residual value of the abnormal instance ID, the group residual benchmark, and the judgment threshold A_t at the current iteration. When the abnormal instance ID is a suspected static instance, the sign is negative; when the abnormal instance ID is a suspected dynamic instance, the sign is positive.
Step S303: Based on the calibration coefficients generated in step S302, the residuals of the abnormal instance IDs are weighted and corrected. The calibrated instance residuals are compared with the judgment threshold. If the instance residual is greater than the judgment threshold, the instance is judged to be dynamic. If the instance residual is less than the judgment threshold, the instance is judged to be static.
By dividing instance IDs into a set of suspected static instances and a set of suspected dynamic instances based on instance semantic tags, and calculating the mean of the residual values within each semantic set as the group residual benchmark, the residual determination no longer relies on a single global threshold, but instead combines the inherent motion characteristics of different semantic categories in the scene for comparative analysis, thereby improving the semantic consistency and rationality of dynamic determination. By calculating the absolute difference between the residual value of each instance ID and the residual benchmark of its semantic group, and combining it with a preset absolute threshold to filter out abnormal instance IDs, the dynamic behavior of individual instances can be evaluated in the context of the same semantic group, effectively avoiding misjudgment problems caused by overall noise fluctuations or local observation errors. For the identified abnormal instance IDs, a calibration coefficient related to the number of training iterations is introduced, and it is jointly calculated by combining the population residual benchmark, the abnormal instance residual value and the current judgment threshold. This allows the calibration amplitude to be dynamically adjusted as the model training gradually converges, avoiding excessive amplification of abnormal residuals in the early training stage, while enhancing the ability to distinguish real dynamic instances in the later stage. During the calibration coefficient calculation process, different sign directions are selected according to the suspected static or suspected dynamic semantic set to which the abnormal instance ID belongs, and the residuals are differentiated and corrected. 
This suppresses the abnormal residuals in static semantic instances and strengthens the abnormal residuals in dynamic semantic instances, thereby effectively reducing the risk of false detection and false negative detection caused by semantic confusion.
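The group benchmark and abnormal-instance test of steps S301 and S302 can be sketched in Python as below. The calibration coefficient itself is deliberately left out, because its formula is not available in this text. The function name and the dictionary-based inputs are assumptions for illustration.

```python
def flag_abnormal_instances(residuals, groups, abs_threshold):
    """Compute group residual benchmarks and flag abnormal instance IDs.

    residuals: dict mapping instance ID -> residual value.
    groups: dict mapping instance ID -> semantic group name,
        e.g. "static" or "dynamic".
    abs_threshold: preset absolute threshold on |residual - benchmark|.
    Returns (benchmark_by_group, abnormal_ids).
    """
    sums, counts = {}, {}
    for inst_id, r in residuals.items():
        g = groups[inst_id]
        sums[g] = sums.get(g, 0.0) + r
        counts[g] = counts.get(g, 0) + 1
    # Group residual benchmark: mean residual within each semantic group.
    benchmark = {g: sums[g] / counts[g] for g in sums}
    abnormal = [
        inst_id for inst_id, r in residuals.items()
        if abs(r - benchmark[groups[inst_id]]) > abs_threshold
    ]
    return benchmark, abnormal
```

Only the flagged instances would then receive a signed calibration (negative for suspected static, positive for suspected dynamic) before being compared with the judgment threshold.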
[0014] Furthermore, step S400 includes:
Step S401: Use a depth estimation network to complete the depth map corresponding to the dynamic instance region, normalize the repaired depth map, calculate the noise variance of each local region in the depth map, aggregate the noise of the entire map to obtain the mean noise of the entire map, and calculate the dynamic adjustment coefficient according to the following formula: [formula missing]; where the quantities involved are the dynamic adjustment coefficient, the preset base adjustment coefficient, the local noise variance, and the mean noise of the entire map.
Step S402: Based on the normalized depth value and the corresponding local noise variance information, calculate the confidence weight according to the following formula: [formula missing]; where the quantities involved are the confidence weight and the normalized depth value d_n. A confidence weight threshold is preset. When the confidence weight is greater than the confidence weight threshold, the original depth feature details are retained. When the confidence weight is less than the confidence weight threshold, the global mean depth feature is used for smooth replacement. The optimized depth features are then concatenated with the dynamic instance region information to construct a geometry-region joint feature. This joint feature is used as a conditional input to the diffusion generation model to generate an RGB image consistent with the texture of the surrounding environment. The generated repaired 2D image sequence is used as a supervision signal to iteratively optimize the 3D Gaussian reconstruction model and construct a static 3D scene model.
A depth estimation network is used to complete the depth map corresponding to the dynamic instance region, effectively repairing the depth holes caused by dynamic target occlusion or missing observations.
On this basis, the depth map is normalized and the noise variance of each local region is statistically analyzed to obtain the mean depth noise from the global level, so that subsequent processing can fully perceive the depth estimation quality and improve the integrity and consistency of geometric information. By incorporating the local noise variance and the mean noise of the entire image into the calculation of the dynamic adjustment coefficient, the adjustment range of the depth feature can adaptively change with the noise level, thereby effectively suppressing unreliable depth information in high-noise regions and fully preserving effective geometric structure in low-noise regions, significantly improving the robustness of depth features in complex dynamic scenes. By combining normalized depth values with local noise variance information to calculate confidence weights, and then performing differential processing on depth features based on preset confidence weight thresholds, the original depth details of the reliable regions are preserved, while the unreliable regions are smoothly replaced by global mean depth features, effectively avoiding structural distortion caused by noise depth in subsequent reconstruction. The optimized depth features are concatenated with dynamic instance region information to construct geometric-region joint features, which are then used as conditional inputs to the diffusion generation model. This allows the generation process to be subject to the dual constraints of geometric structure and region semantics, thereby generating an RGB inpainted image that is highly consistent with the surrounding environment in terms of texture, structure, and spatial continuity.
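The normalization and the threshold-based choice between original and global-mean depth features can be sketched in Python as follows. The formulas for the dynamic adjustment coefficient and the confidence weight are not available in this text, so the sketch takes the confidence weights as an input. Min-max normalization, the flat-list representation, and the function names are assumptions.

```python
def normalize_depth(depth):
    """Min-max normalize a flat list of depth values to [0, 1]."""
    lo, hi = min(depth), max(depth)
    if hi == lo:
        return [0.0] * len(depth)
    return [(d - lo) / (hi - lo) for d in depth]


def apply_confidence(depth_features, weights, weight_threshold):
    """Keep reliable depth features, smooth the rest.

    depth_features: flat list of (normalized) depth features.
    weights: confidence weight per feature, computed elsewhere.
    weight_threshold: preset confidence weight threshold.
    Features whose weight exceeds the threshold are kept unchanged;
    the others are replaced by the global mean depth feature.
    """
    mean_feature = sum(depth_features) / len(depth_features)
    return [
        f if w > weight_threshold else mean_feature
        for f, w in zip(depth_features, weights)
    ]
```

In this sketch the global mean is computed over all features, including the unreliable ones; whether the patent excludes low-confidence values from the mean is not stated.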
[0015] Compared with the prior art, the beneficial effects of the present invention are as follows. By combining instance-level semantic segmentation, semantic motion attribute mapping, and 3D Gaussian implicit representation, the method does not rely only on the appearance changes of a single frame or a short time sequence to judge dynamics, but also introduces residual statistics across training cycles and a semantic grouping calibration mechanism. This effectively reduces the interference of "pseudo-dynamic" factors such as lighting changes, shadows, and wind-blown vegetation on the judgment results, and significantly improves the accuracy of distinguishing dynamic objects from static objects. To address the complex backgrounds, large changes in perspective, and diverse types of dynamic targets commonly encountered in UAV field scenes, the present invention incorporates multi-view residual fusion, adaptive adjustment of the judgment threshold as training progresses, and semantic grouping calibration coefficients. This enhances the robustness of dynamic detection results to changes in flight trajectory, shooting scale, and noise. By using instance-level residual calibration and depth confidence weight constraints, the dynamic instance region is refined. While real dynamic objects are removed, the credible static geometric information is preserved to the maximum extent, thereby significantly improving the integrity and continuity of the final 3D scene model.
In the processing of dynamic instance regions, the normalized depth, local noise variance, and dynamic adjustment coefficient are introduced to jointly calculate the confidence weight, and the depth optimization strategy is adaptively selected based on this weight. At the same time, a diffusion generation model is used for texture completion, so that the generated RGB image is highly consistent with the surrounding environment in color, texture, and structure, avoiding the blurring and artifacts caused by traditional interpolation or simple repair.
Attached Figure Description
[0016] Figure 1 is a flowchart illustrating the 3D Gaussian reconstruction method for removing dynamic objects in a drone field scene according to the present invention.
Detailed Implementation
[0017] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0018] Referring to Figure 1, this invention provides a technical solution: a 3D Gaussian reconstruction method for removing dynamic objects in a drone-based field scene, the method comprising:
Step S100: Collect a series of continuous images from multiple perspectives using a high-resolution camera mounted on a drone, extract features from the image sequence, generate a camera pose matrix, segment the images and identify the state of objects in the images, generate static and dynamic candidate sets, and construct a three-dimensional Gaussian implicit representation model.
Step S100 includes:
Step S101: Install a stabilizing gimbal on the front end of the drone and deploy a high-resolution camera on the stabilizing gimbal. Preset the flight path of the drone through the dynamic removal platform and cruise and shoot according to the preset flight path. During the cruise and shooting, simultaneously collect multi-view RGB image sequences and corresponding pose data of the field scene. Align the collected multi-view RGB image sequences and pose data by timestamp, combine them into a shooting record, and upload the shooting record to the dynamic removal platform.
Step S102: Acquire historical shooting records, collect multi-view RGB image sequences from the historical shooting records, extract features from each image, and generate a camera pose matrix.
Generating the camera pose matrix in step S102 includes the following steps:
Step S102-1: Use the multi-view RGB image sequence as input to the COLMAP platform, perform feature extraction on each RGB frame to obtain the homogeneous two-dimensional pixel coordinates (u, v, 1) of the feature points in the camera pixel plane, and establish feature correspondences between images from different viewpoints through feature matching.
Step S102-2: Based on the multi-view feature matches established in step S102-1, use the incremental structure-from-motion (SfM) algorithm of the COLMAP platform to jointly optimize the three-dimensional spatial points (X, Y, Z, 1) corresponding to the feature points and the camera pose parameters of each frame according to the following formula: $s\,[u, v, 1]^{T} = K\,[R \mid t]\,[X, Y, Z, 1]^{T}$; where s is the scale factor, R and t are the rotation matrix and translation vector of the camera extrinsic matrix, respectively, and K is the camera intrinsic matrix. The camera pose parameters are optimized by minimizing the reprojection error of the feature points, yielding the accurate camera pose of each frame in the world coordinate system.
Step S103: Segment each image, identify dynamic and static objects in the segmented images, and generate static and dynamic candidate sets. Generating the static and dynamic candidate sets in step S103 includes the following steps:
Step S103-1: Preset the image segmentation parameters, segment the multi-view RGB image sequence according to these parameters, and assign a unique instance ID to each segmented region.
Step S103-2: Perform semantic recognition on the pixel region corresponding to each instance ID to obtain the semantic category label of the instance, call the preset semantic-to-motion-attribute mapping rule, match the instance semantic label against it, and divide the semantic space into a static attribute domain and a potential dynamic attribute domain.
Step S103-3: Traverse each instance ID in the full-scene instance set. When the semantic label of an instance falls in the potential dynamic attribute domain, add the instance ID to the suspected dynamic instance candidate set. When the semantic label of an instance falls in the static attribute domain, add the instance ID to the suspected static instance candidate set.
When the semantic label of an instance is associated with both the static attribute domain and the potential dynamic attribute domain, the instance is treated as a hybrid semantic instance and its motion attribute is determined separately. Determining the motion attribute in step S103-3 includes the following steps: when a single instance segmentation region corresponds to several different semantic sub-regions, compute the pixel area ratio of each semantic sub-region within the instance and select the sub-region with the largest ratio as the dominant semantic label of the instance. When the pixel area ratio of the dominant semantic sub-region exceeds a preset threshold, the motion attribute of the instance is determined from the dominant semantic label.
For example, a 4000×3000 pixel RGB image sequence is acquired, and the SAM2 network is combined with an instance-level semantic segmentation network. The input resolution is 4000×3000, the minimum instance area threshold is 500 pixels, and the instance confidence threshold is 0.6. Instance segmentation of a single frame yields the following instances:
Instance ID_01: pixel area 1,245,320, which is 10.38% of the image.
Instance ID_02: pixel area 356,410, which is 2.97% of the image.
Instance ID_03: pixel area 18,450, which is 0.15% of the image.
Instance ID_04: pixel area 22,380, which is 0.19% of the image.
Instance ID_05: pixel area 7,920, which is 0.07% of the image.
Instance ID_06: pixel area 5,640, which is 0.05% of the image.
Semantic recognition on these instance regions gives the following semantic category labels:
Instance ID_01: semantic category label "ground", recognition confidence 0.94.
Instance ID_02: semantic category label "tree", recognition confidence 0.91.
Instance ID_03: semantic category label "person", recognition confidence 0.88.
Instance ID_04: semantic category label "sheep" (animal), recognition confidence 0.86.
Instance ID_05: semantic category label "rock", recognition confidence 0.83.
Instance ID_06: semantic category label "drone shadow", recognition confidence 0.79.
The predefined semantics are as follows. The static attribute domain includes ground, building, tree, rock, and road. The potential dynamic attribute domain includes person, animal, vehicle, and bird. The mixed semantic categories include shadow, treecrown, and watersurface.
Instances ID_01, ID_02, and ID_05 fall in the static attribute domain. Instances ID_03 and ID_04 fall in the potential dynamic attribute domain. Instance ID_06 is associated with both the static and the potential dynamic attribute domains.
The segmented region of instance ID_06 has a total pixel area of 5,640. Further fine-grained semantic segmentation of this region yields two semantic sub-regions:
Sub-region R_06_1: semantic label "shadow", pixel area 3,980, which is 70.6% of the instance.
Sub-region R_06_2: semantic label "ground", pixel area 1,660, which is 29.4% of the instance.
The preset threshold for the dominant semantic pixel ratio is 60%. Since the dominant sub-region "shadow" reaches 70.6%, which exceeds the threshold, the motion attribute of instance ID_06 is determined from its dominant label, and the instance is classified as "static-subordinate dynamic".
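The dominant-label rule of step S103-3 and the percentages in the example above can be checked with a short sketch. This is a minimal illustration, not the patented implementation: the function names and the "unknown" fallback are hypothetical, and only the domain lists, areas, and the 60% threshold come from the example.

```python
from typing import Dict, Tuple

# Attribute domains copied from the example in the text.
STATIC_DOMAIN = {"ground", "building", "tree", "rock", "road"}
DYNAMIC_DOMAIN = {"person", "animal", "vehicle", "bird"}
MIXED_CATEGORIES = {"shadow", "treecrown", "watersurface"}


def image_area_percent(area_px: int, width: int = 4000, height: int = 3000) -> float:
    """Share of the frame covered by an instance, in percent."""
    return 100.0 * area_px / (width * height)


def attribute_domain(label: str) -> str:
    """Map a semantic label to its motion-attribute domain."""
    if label in STATIC_DOMAIN:
        return "static"
    if label in DYNAMIC_DOMAIN:
        return "potential_dynamic"
    if label in MIXED_CATEGORIES:
        return "mixed"
    return "unknown"  # hypothetical fallback, not described in the text


def dominant_label(sub_areas: Dict[str, int],
                   threshold: float = 0.60) -> Tuple[str, float, bool]:
    """Return the largest sub-region label, its area ratio within the
    instance, and whether that ratio exceeds the preset threshold."""
    total = sum(sub_areas.values())
    label = max(sub_areas, key=lambda name: sub_areas[name])
    ratio = sub_areas[label] / total
    return label, ratio, ratio > threshold


# Instance ID_06 from the example: shadow 3,980 px, ground 1,660 px.
label, ratio, decisive = dominant_label({"shadow": 3980, "ground": 1660})
print(label, round(100 * ratio, 1), decisive)  # shadow 70.6 True
```

With the example values, "shadow" dominates at 70.6%, above the 60% threshold, so the dominant label decides the motion attribute of instance ID_06.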
[0019] Step S104: Based on the camera pose matrix and the static candidate set, process the multi-view image data collected by the UAV and the corresponding sparse 3D point cloud to construct a 3D Gaussian implicit representation model of the scene. Constructing the model in step S104 includes the following steps:
Step S104-1: Preprocess the multi-view RGB image sequence; the preprocessing includes noise reduction, contrast enhancement, and sharpness enhancement. From the preprocessed sequence, combined with the camera pose parameters, generate sparse 3D point cloud data of the scene using SfM.
Step S104-2: Using the sparse 3D point cloud generated in step S104-1 as the initialization input, map each 3D point to a 3D Gaussian primitive. Each 3D Gaussian primitive is jointly defined by its spatial mean μ, covariance matrix Σ, opacity β, and spherical harmonic color parameter c, and its probability density function is expressed as: $G(x) = \exp\left(-\tfrac{1}{2}(x - \mu)^{T}\Sigma^{-1}(x - \mu)\right)$;
Step S104-3: After the 3D Gaussian model is initialized, project the 3D Gaussian primitives onto the two-dimensional imaging plane using differentiable rasterization. The Gaussian primitives that lie inside the camera's view frustum are sorted by depth, and pixel-level rendering is completed through alpha blending, which establishes the mapping between the 3D Gaussian space and the 2D image pixels.
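The depth sorting and alpha blending of step S104-3 can be sketched for a single pixel. The text does not give the blending equation, so this sketch assumes the standard front-to-back compositing rule, in which each primitive contributes its color weighted by its opacity and by the transmittance left over from nearer primitives. The function name and the sample values are hypothetical.

```python
from typing import List, Sequence, Tuple


def composite_pixel(depths: Sequence[float],
                    alphas: Sequence[float],
                    colors: Sequence[Sequence[float]]) -> Tuple[List[float], float]:
    """Blend the primitives covering one pixel, nearest first.

    depths: camera-space depth of each primitive at this pixel
    alphas: per-primitive opacity at this pixel, each in [0, 1]
    colors: per-primitive RGB color
    Returns the blended RGB value and the remaining transmittance.
    """
    # Sort primitive indices from near to far.
    order = sorted(range(len(depths)), key=lambda i: depths[i])
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for i in order:
        weight = transmittance * alphas[i]
        pixel = [p + weight * c for p, c in zip(pixel, colors[i])]
        transmittance *= 1.0 - alphas[i]
    return pixel, transmittance


# Two half-transparent primitives: a red one in front of a blue one.
rgb, remaining = composite_pixel(
    depths=[2.0, 1.0],
    alphas=[0.5, 0.5],
    colors=[(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)],
)
print(rgb, remaining)  # [0.5, 0.0, 0.25] 0.25
```

The nearer red primitive receives weight 0.5 and the farther blue one only 0.25, which is why the depth ordering matters before blending.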
[0020] Step S200: Set the training period, filter the instance IDs that are suspected static candidates, compute the 3D Gaussian rendered image from the 3D Gaussian implicit representation model, combine it with the real captured image to generate the residual map, calculate the judgment threshold, and determine the preliminary dynamic region. Step S200 includes:
Step S201: Select several consecutive days as the training period, obtain the shooting records within the training period, and filter the set of instance IDs that are suspected static candidates. Compute the 3D Gaussian rendered image through the 3D Gaussian implicit representation model, construct the original residual map from the pixel-level difference between the rendered image and the real captured image, and generate the residual map by a linear weighted fusion of the L1 norm and the structural similarity index (SSIM).
Step S202: Set the minimum and maximum thresholds as preset parameters, and calculate the judgment threshold according to the following formula: ; where Ap is the judgment threshold at the p-th training iteration, a1 is the minimum threshold, a2 is the maximum threshold, and a is the maximum number of iterations.
Step S203: Where the residual value in the calculated residual map exceeds the current judgment threshold, mark the location as a preliminary dynamic region and generate a binary mask, which still contains noise.
For example, the training period is 5 consecutive days, the daily collection time is 09:00–10:00, 30 views are collected per day, and the total number of training frames is 150. A preliminary stability assessment of each instance ID and instance type over the 5 days gives the following:
Instance ID_01, instance type building: stability rate 100% within 5 days, initially judged static.
Instance ID_02, instance type street light: stability rate 100% within 5 days, initially judged static.
Instance ID_03, instance type construction machinery: stability rate 60% within 5 days, initially judged suspected static.
Instance ID_04, instance type parked vehicle: stability rate 40% within 5 days, initially judged suspected static.
Instance ID_05, instance type pedestrian: stability rate 10% within 5 days, initially judged dynamic.
Instance ID_06, instance type enclosure: stability rate 80% within 5 days, initially judged suspected static.
The final set of suspected static candidate instance IDs obtained after filtering is {ID_03, ID_04, ID_06}.
Example pixel (x=1024, y=560): the real pixel value is 148 and the rendered pixel value is 132, so the L1 difference is 16. The SSIM is 0.82, so the SSIM difference is 0.18. The resulting residual for the example pixel is 9.672.
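The example pixel can be reproduced with a small sketch of the linear weighted fusion in step S201 and the thresholding in step S203. The fusion weights are not stated in the text; the values 0.6 for the L1 term and 0.4 for the SSIM term are an inference, chosen because they reproduce the stated result 9.672. The function names and the sample threshold are hypothetical.

```python
from typing import List


def fused_residual(real: float, rendered: float, ssim: float,
                   w_l1: float = 0.6) -> float:
    """Linear weighted fusion of the L1 difference and the SSIM difference.

    w_l1 = 0.6 is an assumed weight inferred from the worked example,
    not a value given in the text.
    """
    l1_diff = abs(real - rendered)
    ssim_diff = 1.0 - ssim
    return w_l1 * l1_diff + (1.0 - w_l1) * ssim_diff


def preliminary_dynamic_mask(residual_map: List[List[float]],
                             threshold: float) -> List[List[int]]:
    """Binary mask: 1 where the residual exceeds the judgment threshold."""
    return [[1 if r > threshold else 0 for r in row] for row in residual_map]


# Example pixel (x=1024, y=560): real 148, rendered 132, SSIM 0.82.
residual = fused_residual(148, 132, 0.82)
print(round(residual, 3))  # 9.672

# A 1x2 residual map thresholded at a hypothetical judgment threshold of 5.0.
print(preliminary_dynamic_mask([[residual, 1.0]], threshold=5.0))  # [[1, 0]]
```

Under these assumed weights, 0.6 × 16 + 0.4 × 0.18 = 9.672, which matches the example. The judgment threshold is passed in as a parameter because its schedule over iterations is not recoverable from the text.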
[0021] Step S300: Divide the instance IDs based on the instance semantic labels, calculate the calibration coefficient of each semantic group, and determine the dynamic attribute of each instance ID. Step S300 includes:
Step S301: Based on the instance semantic labels, divide the instance IDs into a set of suspected static instances and a set of suspected dynamic instances, and calculate the mean of the residual values within each semantic set as the group residual benchmark of the corresponding semantic group.
Step S302: Calculate the absolute difference between the residual value of each instance ID and the group residual benchmark of its semantic group. When the absolute difference is greater than a preset absolute threshold, the instance ID is determined to be an abnormal instance ID. Calculate the calibration coefficient according to the following formula: ; where the terms denote the calibration coefficient; T, the ratio of the group residual benchmark to the residual value of the abnormal instance ID; t, the current iteration number; t_max, the set maximum number of iterations; the residual value of the abnormal instance ID; the group residual benchmark; and At, the judgment threshold at the current iteration. When the abnormal instance ID is a suspected static instance, the sign is negative; when the abnormal instance ID is a suspected dynamic instance, the sign is positive.
Step S303: Based on the calibration coefficients generated in step S302, apply a weighted correction to the residuals of the abnormal instance IDs, and compare the calibrated instance residuals with the judgment threshold. If the calibrated residual is greater than the judgment threshold, the instance is judged to be a dynamic instance; if it is less than the judgment threshold, the instance is judged to be a static instance.
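The grouping logic of steps S301 to S303 can be sketched without the calibration formula, which is not recoverable from the text. The sketch below covers only the group residual benchmark, the detection of abnormal instance IDs, and the final comparison against the judgment threshold; the calibrated residual is taken as an input. All residual values and thresholds are hypothetical, and treating a residual equal to the threshold as static is an assumption, since the text does not cover that case.

```python
from statistics import mean
from typing import Dict, List, Tuple


def find_abnormal(residuals: Dict[str, float],
                  abs_threshold: float) -> Tuple[float, List[str]]:
    """Return the group residual benchmark (mean residual of the group)
    and the IDs whose residual deviates from it by more than abs_threshold."""
    benchmark = mean(list(residuals.values()))
    abnormal = [
        instance_id
        for instance_id, value in residuals.items()
        if abs(value - benchmark) > abs_threshold
    ]
    return benchmark, abnormal


def classify(calibrated_residual: float, judgment_threshold: float) -> str:
    """Compare a calibrated residual with the judgment threshold.
    Equality is treated as static here, which is an assumption."""
    return "dynamic" if calibrated_residual > judgment_threshold else "static"


# Hypothetical residuals for a suspected static group.
group = {"ID_03": 9.0, "ID_04": 10.0, "ID_06": 17.0}
benchmark, abnormal_ids = find_abnormal(group, abs_threshold=4.0)
print(benchmark, abnormal_ids)  # 12.0 ['ID_06']
print(classify(14.5, judgment_threshold=12.0))  # dynamic
```

Only ID_06 deviates from the benchmark of 12.0 by more than 4.0, so only its residual would be passed on to the calibration of step S302.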
[0022] Step S400: Calculate the normalized depth value and the dynamic adjustment coefficient of the dynamic instance region, and calculate the confidence weight from them. Based on the confidence weight, determine the optimization strategy for each depth feature in the dynamic instance region. Step S400 includes:
Step S401: Use a depth estimation network to complete the depth map corresponding to the dynamic instance region, normalize the repaired depth map, calculate the noise variance of each local region in the depth map, aggregate the noise over the entire map to obtain its mean noise, and calculate the dynamic adjustment coefficient according to the following formula: $k = k_0 \cdot \sigma^{2} / \bar{\sigma}$; where k is the dynamic adjustment coefficient, k_0 is the preset base adjustment coefficient, σ² is the noise variance, and σ̄ is the mean noise of the entire map.
Step S402: Based on the normalized depth value and the corresponding local noise variance, calculate the confidence weight according to the following formula: ; where the terms denote the confidence weight and the normalized depth value d_n. A confidence weight threshold is preset. When the confidence weight is greater than the threshold, the original depth feature details are retained. When the confidence weight is less than the threshold, the global mean depth feature is used as a smooth replacement. The optimized depth features are then concatenated with the dynamic instance to construct a joint geometry-region feature. This joint feature is used as a conditional input to the diffusion generation model to generate an RGB image consistent with the texture of the surrounding environment.
The generated, repaired 2D image sequence is used as a supervision signal to iteratively optimize the 3D Gaussian reconstruction model and construct a static 3D scene model. In this embodiment, a depth estimation network performs a preliminary estimation of the depth map of the dynamic instance region, and a diffusion generation model completes the missing or incomplete depth information.
For example, with a preset base adjustment coefficient of 0.8, a noise variance of 1.12, and a mean noise of 0.54 over the entire map, the calculated dynamic adjustment coefficient is 0.8 × (1.12 / 0.54) ≈ 1.66. As a further example, if the noise variance of pixel R1 is 0.18, the normalized depth value is 0.42, the calculated confidence weight is 0.4, and the preset confidence weight threshold is 0.25, then the confidence weight exceeds the threshold and the original depth details are preserved.
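The arithmetic of the example and the keep-or-replace decision of step S402 can be sketched as follows. The form of the dynamic adjustment coefficient is taken from the worked example (base coefficient times noise variance divided by mean noise). The confidence weight itself is taken as an input, because its formula is not recoverable from the text. The function names and the global mean depth value of 0.5 are hypothetical.

```python
def dynamic_adjustment(base: float, noise_variance: float, mean_noise: float) -> float:
    """Dynamic adjustment coefficient, in the form used by the worked example."""
    return base * (noise_variance / mean_noise)


def select_depth_feature(confidence: float, threshold: float,
                         original_depth: float, global_mean_depth: float) -> float:
    """Keep the original depth detail when the confidence weight exceeds
    the threshold; otherwise fall back to the global mean depth."""
    return original_depth if confidence > threshold else global_mean_depth


# Worked example: base 0.8, noise variance 1.12, mean noise 0.54.
print(round(dynamic_adjustment(0.8, 1.12, 0.54), 2))  # 1.66

# Pixel R1: confidence weight 0.4 against a threshold of 0.25.
# The global mean depth of 0.5 is a placeholder value.
print(select_depth_feature(0.4, 0.25, original_depth=0.42, global_mean_depth=0.5))  # 0.42
```

A pixel with a confidence weight below 0.25 would instead receive the global mean depth, which smooths unreliable regions before the diffusion-based inpainting.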
[0023] It will be apparent to those skilled in the art that the present invention is not limited to the details of the exemplary embodiments described above, and that the invention can be implemented in other specific forms without departing from its spirit or essential characteristics. Therefore, the embodiments should be considered in all respects as exemplary and non-limiting, and the scope of the invention is defined by the appended claims rather than the foregoing description. Thus, all variations falling within the meaning and scope of equivalents of the claims are intended to be included within the present invention. No reference numerals in the claims should be construed as limiting the scope of the claims.
Claims
1. A 3D Gaussian reconstruction method for removing dynamic objects in a drone-based field scene, characterized in that the method comprises:
Step S100: Collect a continuous multi-view image sequence using a high-resolution camera mounted on a drone, extract features from the image sequence, generate a camera pose matrix, segment the images and identify the state of the objects in them, generate static and dynamic candidate sets, and construct a three-dimensional Gaussian implicit representation model;
Step S200: Set the training period, filter the instance IDs that are suspected static candidates, compute the 3D Gaussian rendered image from the 3D Gaussian implicit representation model, combine it with the real captured image to generate the residual map, calculate the judgment threshold, and determine the preliminary dynamic region;
Step S300: Divide the instance IDs based on the instance semantic labels, calculate the calibration coefficient of each semantic group, and determine the dynamic attribute of each instance ID;
Step S400: Calculate the normalized depth value and the dynamic adjustment coefficient of the dynamic instance region, calculate the confidence weight from them, and, based on the confidence weight, determine the optimization strategy for each depth feature in the dynamic instance region.
2. The 3D Gaussian reconstruction method for removing dynamic objects in a drone-based field scene according to claim 1, characterized in that step S100 includes the following steps:
Step S101: Install a stabilizing gimbal at the front end of the drone and mount a high-resolution camera on the gimbal. Preset the flight path of the drone through the dynamic removal platform, and cruise and shoot along the preset flight path. During cruising and shooting, simultaneously collect multi-view RGB image sequences of the field scene and the corresponding pose data. Align the collected multi-view RGB image sequences and pose data by timestamp, combine them into a shooting record, and upload the shooting record to the dynamic removal platform;
Step S102: Acquire historical shooting records, collect the multi-view RGB image sequences from them, extract features from each image, and generate a camera pose matrix;
Step S103: Segment each image, identify dynamic and static objects in the segmented images, and generate static and dynamic candidate sets;
Step S104: Based on the camera pose matrix and the static candidate set, process the multi-view image data collected by the UAV and the corresponding sparse 3D point cloud to construct a 3D Gaussian implicit representation model of the scene.
3. The 3D Gaussian reconstruction method for removing dynamic objects in a drone-based field scene according to claim 2, characterized in that generating the camera pose matrix in step S102 includes the following steps:
Step S102-1: Use the multi-view RGB image sequence as input to the COLMAP platform, perform feature extraction on each RGB frame to obtain the homogeneous two-dimensional pixel coordinates (u, v, 1) of the feature points in the camera pixel plane, and establish feature correspondences between images from different viewpoints through feature matching;
Step S102-2: Based on the multi-view feature matches established in step S102-1, use the incremental structure-from-motion (SfM) algorithm of the COLMAP platform to jointly optimize the three-dimensional spatial points (X, Y, Z, 1) corresponding to the feature points and the camera pose parameters of each frame according to the following formula: $s\,[u, v, 1]^{T} = K\,[R \mid t]\,[X, Y, Z, 1]^{T}$; where s is the scale factor, R and t are the rotation matrix and translation vector of the camera extrinsic matrix, respectively, and K is the camera intrinsic matrix. The camera pose parameters are optimized by minimizing the reprojection error of the feature points, yielding the accurate camera pose of each frame in the world coordinate system.
4. The 3D Gaussian reconstruction method for removing dynamic objects in a drone field scene according to claim 2, characterized in that, The generation of static and dynamic candidate sets in step S103 includes the following steps: Step S103-1: Preset image segmentation parameters, segment the multi-view RGB image sequence according to the image segmentation parameters, and assign a unique instance ID to each segmented region; Step S103-2: Perform semantic recognition on the pixel region corresponding to each instance ID, obtain the semantic category label of the instance, call the preset semantic motion attribute mapping rule, perform attribute matching on the instance semantic label, and divide the semantic space into static attribute domain and potential dynamic attribute domain. Step S103-3: Traverse each instance ID in the full scene instance set. When the semantic tag of an instance hits the potential dynamic attribute domain, add the instance ID to the suspected dynamic instance candidate set. When the semantic tag of an instance hits the static attribute domain, add the instance ID to the suspected static instance candidate set. When the semantic tag of an instance is associated with both the static attribute domain and the potential dynamic attribute domain, determine the instance as a hybrid semantic instance and determine the corresponding motion attribute.
5. The 3D Gaussian reconstruction method for removing dynamic objects in a drone field scene according to claim 4, characterized in that, The determination of the corresponding motion attribute in step S103-3 includes the following steps: when a single instance segmentation region corresponds to multiple different semantic sub-regions, the pixel area ratio of each semantic sub-region in the instance is counted, and the semantic sub-region with the largest pixel area ratio is selected as the dominant semantic label of the instance. When the pixel area ratio of the dominant semantic sub-region exceeds a preset threshold, the motion attribute of the instance is determined based on the dominant semantic label.
6. The 3D Gaussian reconstruction method for removing dynamic objects in a drone-based field scene according to claim 2, characterized in that constructing the 3D Gaussian implicit representation model of the scene in step S104 includes the following steps:
Step S104-1: Preprocess the multi-view RGB image sequence; the preprocessing includes noise reduction, contrast enhancement, and sharpness enhancement. From the preprocessed sequence, combined with the camera pose parameters, generate sparse 3D point cloud data of the scene using SfM;
Step S104-2: Using the sparse 3D point cloud generated in step S104-1 as the initialization input, map each 3D point to a 3D Gaussian primitive. Each 3D Gaussian primitive is jointly defined by its spatial mean μ, covariance matrix Σ, opacity β, and spherical harmonic color parameter c, and its probability density function is expressed as: $G(x) = \exp\left(-\tfrac{1}{2}(x - \mu)^{T}\Sigma^{-1}(x - \mu)\right)$;
Step S104-3: After the 3D Gaussian model is initialized, project the 3D Gaussian primitives onto the two-dimensional imaging plane using differentiable rasterization. The Gaussian primitives that lie inside the camera's view frustum are sorted by depth, and pixel-level rendering is completed through alpha blending, which establishes the mapping between the 3D Gaussian space and the 2D image pixels.
7. The 3D Gaussian reconstruction method for removing dynamic objects in a drone-based field scene according to claim 1, characterized in that step S200 includes the following steps:
Step S201: Select several consecutive days as the training period, obtain the shooting records within the training period, and filter the set of instance IDs that are suspected static candidates. Compute the 3D Gaussian rendered image through the 3D Gaussian implicit representation model, construct the original residual map from the pixel-level difference between the rendered image and the real captured image, and generate the residual map by a linear weighted fusion of the L1 norm and the structural similarity index (SSIM);
Step S202: Set the minimum and maximum thresholds as preset parameters, and calculate the judgment threshold according to the following formula: ; where Ap is the judgment threshold at the p-th training iteration, a1 is the minimum threshold, a2 is the maximum threshold, and a is the maximum number of iterations;
Step S203: Where the residual value in the calculated residual map exceeds the current judgment threshold, mark the location as a preliminary dynamic region and generate a binary mask, which still contains noise.
8. The 3D Gaussian reconstruction method for removing dynamic objects in a drone-based field scene according to claim 7, characterized in that step S300 includes the following steps:
Step S301: Based on the instance semantic labels, divide the instance IDs into a set of suspected static instances and a set of suspected dynamic instances, and calculate the mean of the residual values within each semantic set as the group residual benchmark of the corresponding semantic group;
Step S302: Calculate the absolute difference between the residual value of each instance ID and the group residual benchmark of its semantic group. When the absolute difference is greater than a preset absolute threshold, the instance ID is determined to be an abnormal instance ID. Calculate the calibration coefficient according to the following formula: ; where the terms denote the calibration coefficient; T, the ratio of the group residual benchmark to the residual value of the abnormal instance ID; t, the current iteration number; t_max, the set maximum number of iterations; the residual value of the abnormal instance ID; the group residual benchmark; and At, the judgment threshold at the current iteration. When the abnormal instance ID is a suspected static instance, the sign is negative; when the abnormal instance ID is a suspected dynamic instance, the sign is positive;
Step S303: Based on the calibration coefficients generated in step S302, apply a weighted correction to the residuals of the abnormal instance IDs, and compare the calibrated instance residuals with the judgment threshold. If the calibrated residual is greater than the judgment threshold, the instance is judged to be a dynamic instance; if it is less than the judgment threshold, the instance is judged to be a static instance.
9. The 3D Gaussian reconstruction method for removing dynamic objects in a drone-based field scene according to claim 8, characterized in that step S400 includes the following steps:
Step S401: Use a depth estimation network to complete the depth map corresponding to the dynamic instance region, normalize the repaired depth map, calculate the noise variance of each local region in the depth map, aggregate the noise over the entire map to obtain its mean noise, and calculate the dynamic adjustment coefficient according to the following formula: $k = k_0 \cdot \sigma^{2} / \bar{\sigma}$; where k is the dynamic adjustment coefficient, k_0 is the preset base adjustment coefficient, σ² is the noise variance, and σ̄ is the mean noise of the entire map;
Step S402: Based on the normalized depth value and the corresponding local noise variance, calculate the confidence weight according to the following formula: ; where the terms denote the confidence weight and the normalized depth value d_n. A confidence weight threshold is preset. When the confidence weight is greater than the threshold, the original depth feature details are retained. When the confidence weight is less than the threshold, the global mean depth feature is used as a smooth replacement. The optimized depth features are then concatenated with the dynamic instance to construct a joint geometry-region feature. This joint feature is used as a conditional input to the diffusion generation model to generate an RGB image consistent with the texture of the surrounding environment. The generated, repaired 2D image sequence is used as a supervision signal to iteratively optimize the 3D Gaussian reconstruction model and construct a static 3D scene model.