A target detection method for complex scenes in UAV aerial photography based on the D-FINE model
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CIVIL AVIATION UNIV OF CHINA
- Filing Date
- 2026-04-10
- Publication Date
- 2026-06-26
AI Technical Summary
Existing drone aerial image target detection technologies suffer from low detection accuracy, poor efficiency, and insufficient generalization ability in complex scenarios. They struggle to meet the needs of small target detection, the processing of edge detail information of multi-scale targets, and the adaptability of algorithms in complex scenarios.
An improved D-FINE model is adopted, which improves feature extraction and object detection capabilities by using the LoG-Stem edge enhancement module, complementary feature downsampling module and context-aware module, combined with the matching-aware loss function, thereby achieving a balance between lightweight model deployment and high detection accuracy.
The method improves detection accuracy and efficiency on UAV aerial images in complex scenarios, enhances the model's scene adaptability and real-time performance, and is superior to existing methods.
Figure CN122289986A_ABST
Description
Technical Field
[0001] This invention belongs to the field of image processing and computer vision technology, specifically relating to a method for target detection in complex scenes by drone aerial photography based on the D-FINE model.
Background Technology
[0002] In recent years, unmanned aerial vehicles (UAVs) have played an important role in many fields owing to their flexibility, maneuverability, and accurate target perception. Combining deep learning-based target detection methods with UAV systems has become a focus of current research and application, especially in aerial tracking, risk monitoring, and military operations, where it has significant academic and practical value.
[0003] UAV aerial image target detection has evolved from traditional techniques to deep learning-based techniques. Traditional detection algorithms, such as the Scale-Invariant Feature Transform (SIFT), the Viola-Jones detector (VJDet), the Histogram of Oriented Gradients (HOG), and the Deformable Part Model (DPM), rely on hand-crafted features. These methods are time-consuming and labor-intensive, are sensitive to changes in image scale, and exhibit low accuracy, poor detection efficiency, and insufficient generalization and robustness on small targets captured from an aerial perspective. With the continuous advancement of deep learning, detectors such as Faster R-CNN, RetinaNet, the YOLO series, DETR, RT-DETR, D-FINE, and DEIM, which build on convolutional neural networks (CNNs), can automatically learn multi-level feature representations and recognize targets at different scales, thereby improving the efficiency and accuracy of target detection.
[0004] However, since drones typically perform detection missions at higher altitudes, the targets they capture are easily affected by the limitations of the aerial viewpoint and the interference of complex environments. These factors often lead to problems such as target occlusion, uneven distribution, insufficient resolution, and complex lighting, which in turn cause false detections and missed detections during the detection process.
[0005] Moreover, most current research methods focus on optimizing model performance in a single scenario, rarely taking into account the comprehensive improvement of performance across multiple dimensions, such as small target detection requirements, processing of edge details of multi-scale targets, algorithm adaptability in complex scenarios, and lightweight model deployment. Therefore, there is an urgent need to explore a comprehensive solution to achieve a balance between lightweight model deployment and high detection accuracy, ensuring that UAVs maintain high maneuverability and operational efficiency in complex scenarios.
Summary of the Invention
[0006] To address the aforementioned problems, the present invention aims to provide a target detection method for complex scenes in UAV aerial photography based on the D-FINE model.
[0007] To achieve the above objectives, the UAV aerial photography target detection method based on the D-FINE model provided by this invention includes the following steps performed in sequence:
[0008] Step 1: Improve the D-FINE model, which serves as the baseline model, to construct a target detection model for UAV aerial images;
[0009] Step 2: Select a public dataset and preprocess its original images. Divide the preprocessed images into a training set, a validation set, and a test set in a preset proportion. Train the above UAV aerial image target detection model on the training set, and evaluate and test it on the validation set and test set to obtain the trained UAV aerial image target detection model.
[0010] Step 3: Use a drone to capture multiple video streams, and then preprocess all the frame images in the multiple video streams according to the method in Step 2 to obtain the drone aerial images to be detected.
[0011] Step 4: Input the above-mentioned UAV aerial image to be detected into the trained UAV aerial image target detection model for target detection, and finally output the UAV aerial image target detection result.
[0012] In step 1, the D-FINE model comprises four modules: an input terminal, a backbone network, a neck network, and an output terminal. The input terminal preprocesses the input image. The backbone network uses a convolution module with various convolution kernels to perform convolution and pooling on the preprocessed image, thereby extracting features to obtain feature maps of different sizes. The neck network fuses feature maps of different sizes through a sampling and feature concatenation module. The output terminal employs a decoupled head structure to decouple the classification and regression processes, including positive and negative sample matching and loss calculation.
[0013] The UAV aerial image target detection model uses the aforementioned D-FINE model as its base model. In the backbone network, the LoG-Stem edge enhancement module replaces the Stem module of the D-FINE model; in the neck network, the complementary feature downsampling module replaces the convolutional downsampling module and the context-aware module replaces the feature concatenation module of the D-FINE model; and the loss function of the detection head uses the matchability-aware loss (MAL) function instead of the multi-task combined loss function of the D-FINE model.
[0014] In Step 2, dataset selection, preprocessing, splitting, training, and evaluation are carried out as follows:
[0015] The VisDrone 2019 public dataset was selected. First, the original images in the dataset were uniformly scaled. Then, the pixels of the scaled images were normalized to map the pixel values to a preset range. Next, data augmentation processing, including random flipping, random cropping, color perturbation, or scaling, was performed on the normalized images to obtain preprocessed images of size H×W×3. Finally, the preprocessed images were divided into training, validation, and test sets in a ratio of 7:2:1.
[0016] The preprocessed image is then used as the input image to the backbone network. The input image first enters the LoG-Stem edge enhancement module for convolution and strided downsampling. After channel expansion, a feature map of size H/4×W/4×C1 is output. Then, the feature map is processed sequentially by the four enhancement modules Stage1 to Stage4 for hierarchical feature extraction.
[0017] The processing method of the LoG-Stem edge enhancement module is as follows:
[0018] For an input image I ∈ R^(H×W×3), initial features are first extracted using a 9×9 convolutional layer. Then, a LoG filter with a kernel size of 7×7 and a standard deviation σ=1.0 is used to achieve edge-aware feature representation. The 2D LoG filter at position x=(i,j) is defined as follows:
[0019] (1);
[0020] The expression for a Gaussian filter is:
[0021] (2);
[0022] Where k×k represents the kernel size, and the standard deviation σ is 1.0;
[0023] The output of the LoG filter, after being processed by an activation and normalization layer, is added to the input image via a residual connection.
[0024] (3);
[0025] This is followed by two 3×3 convolutional layers. The second convolutional layer uses a stride of 2 for downsampling, thereby reducing the spatial size to H/2×W/2.
[0026] (4);
[0027] The result is then processed by Gaussian filters with kernel sizes of 9×9 and 5×5, both with a standard deviation σ=0.5. The outputs are summed, normalized, and then passed through a deep residual feature extraction module to finally obtain a feature map at 1/4 resolution.
[0028] (5);
[0029] Finally, the above feature map is input sequentially into the four enhancement modules Stage1 to Stage4 to obtain the multi-scale feature maps {F1, F2, F3}, which are then input into the neck network.
[0030] The neck network first expands the spatial scale of the high-level semantic features of the multi-scale feature map {F1, F2, F3} output by the backbone network through upsampling; then, it uses a complementary feature downsampling module to enhance the semantic expressive power of low-level features through downsampling; then, it uses a context-aware module to fuse features of different scales by feature concatenation or element-wise addition; and it embeds a Transformer layer in the deeper feature paths to establish global dependencies, enhance long-distance information interaction, and improve the recognition of occluded targets; finally, it outputs a multi-scale fused feature map {P1, P2, P3}.
[0031] The processing method of the complementary feature downsampling module is as follows:
[0032] First, the input feature map X is copied into feature maps x1, x2, and x3. Then, feature map x1 is downsampled using a slicing method to obtain a matrix with half the size. This matrix contains the following feature maps:
[0033] ;
[0034] ;
[0035] Where X_ij denotes the feature at position (i,j). Feature maps c1, c2, c3 and c4 are concatenated, increasing the number of channels from C to 4C; then, a 1×1 convolutional layer with a stride of 1 reduces the number of channels from 4C to 2C, yielding feature map y1. This process is named Dcut, as shown in equation (7); for ease of representation, a function called fusion is defined, as shown in equation (6):
[0036] (6);
[0037] (7);
[0038] (8);
[0039] The processing of feature map x2 begins with a 3×3 group convolutional layer with a stride of 1, where the number of groups equals the number of input channels and the number of channels increases from C to 2C; then a 3×3 depthwise convolutional layer with a stride of 2 is used for downsampling, together with the Gaussian Error Linear Unit (GELU) activation function, to obtain feature map y2. This process is named Dconv, as shown in equation (8).
[0040] The feature map x3 is processed using a group convolutional layer that shares parameters with Dconv; the number of channels is also increased from C to 2C, and then a 2×2 max pooling operation with a stride of 2 is used to complete the feature downsampling to obtain the feature map y3, and this process is named Dmax, as shown in equation (9):
[0041] (9);
[0042] Where BN, Concat, Conv, DWConvD, and GConv represent batch normalization, concatenation, convolution, depthwise convolution, and group convolution, respectively; then, by concatenating feature maps y1, y2, and y3, the number of channels of the output feature map is increased to 6C; finally, a 1×1 convolutional layer with a stride of 1 is used to process the feature map, and the number of channels returns to 2C, and the final output is the fused feature map Y as shown in equation (10):
[0043] (10);
[0044] The processing method of the context-aware module is as follows:
[0045] First, two input feature maps Pi of the same dimensions are received and merged into a single fused feature map by concatenation. This fused feature map then enters an attention branch consisting of a global average pooling layer, a linear layer, a ReLU function, a second linear layer, and a Sigmoid activation function. The global average pooling layer extracts global information, while the linear layers and the ReLU function perform feature transformation and non-linear mapping. The Sigmoid activation function then outputs attention weights normalized to the [0,1] interval. These attention weights are split into two weight vectors of equal size, which are multiplied element-wise with the two input feature maps Pi. Finally, each weighted feature map is added element-wise to its corresponding feature map Pi through a residual connection to obtain the multi-scale fused feature maps {P1, P2, P3}.
[0046] Subsequently, the multi-scale fused feature maps {P1, P2, P3} are fed to the output end, where decoupled prediction is performed. The output end includes a classification branch and a regression branch. The classification branch predicts the class probability for each spatial location, and the regression branch predicts the parameters of the target bounding box. The samples participating in the loss calculation are determined by a preset positive and negative sample matching strategy, and the classification loss and localization loss are calculated and jointly optimized. Finally, the predictions are filtered by confidence, overlapping bounding boxes are suppressed, and the final detection results are output, including each target's class label and corresponding location coordinates.
[0047] During training, the performance of the UAV aerial image target detection model is monitored in real time on the validation set, and its parameters are updated using an appropriate optimizer. A preset anchor-box intersection-over-union (IoU) value is used as the criterion that guides the model in learning target bounding boxes, so that the model effectively learns the features of targets in UAV aerial images. Training is complete once the model's performance on the validation set stabilizes. The trained model is then evaluated on the validation set, and the test set is input into the trained model for detection.
[0048] The evaluation metrics include precision, recall, average precision (AP), number of parameters, giga floating-point operations per second (GFLOPS), and frame rate (FPS), as shown in the following formulas:
[0049] (11);
[0050] (12);
[0051] (13);
[0052] (14);
[0053] GFLOPS = Total number of floating-point operations / Execution time (seconds) (15);
[0054] (16);
[0055] Where TP is the number of images in which the model correctly detected the target, FP is the number of images in which the model incorrectly detected the target, FN is the number of images in which the model failed to detect the target, Precision is the precision, Recall is the recall, and T is the detection time per image.
[0056] Compared with existing technologies, the target detection method for complex scenes in UAV aerial photography based on the D-FINE model provided by this invention has the following advantages:
[0057] 1. The LoG-Stem edge enhancement module combines traditional image processing operators (Gaussian smoothing and Laplacian operators) with learnable deep features to specifically address the feature degradation problem in low-quality aerial images, aiming to efficiently and effectively improve the detection capability of challenging targets (such as low-quality, blurred or occluded targets).
[0058] 2. The complementary feature downsampling module fuses multiple feature maps extracted by different downsampling techniques to generate a more robust feature map with complementary feature sets, thereby overcoming the limitations of traditional convolutional downsampling and achieving more accurate and robust analysis of remote sensing images.
[0059] 3. The context-aware module enhances the model's ability to perceive key global contextual information by introducing dynamic channel adaptation, context-aware feature recalibration, and bidirectional feature enhancement.
[0060] 4. By adopting the matching-aware loss function as the classification supervision loss function, the utilization rate of limited positive samples can be improved without affecting the optimization effect of high-quality matching, thus alleviating the training supervision noise problem caused by low-quality matching in dense matching scenarios.
[0061] 5. Overall, it achieves improved real-time performance, enhanced detection accuracy, and improved comprehensive generalization ability with strong scene adaptability, which is superior to existing methods.
Attached Figure Description
[0062] Figure 1 is a flowchart of the UAV aerial photography target detection method based on the D-FINE model provided by the present invention.
[0063] Figure 2 is a structural diagram of the D-FINE model used as the baseline model in this invention.
[0064] Figure 3 is a structural diagram of the UAV aerial image target detection model constructed in this invention.
[0065] Figure 4 is a schematic diagram of the structure of the LoG-Stem edge enhancement module in the UAV aerial image target detection model constructed in this invention.
[0066] Figure 5 is a schematic diagram of the complementary feature downsampling module in the UAV aerial image target detection model constructed in this invention.
[0067] Figure 6 is a schematic diagram of the structure of the context-aware module in the UAV aerial image target detection model constructed in this invention.
Detailed Implementation
[0068] Various embodiments of the present invention will now be clearly and completely described with reference to the accompanying drawings. The embodiments described with reference to the drawings are exemplary and intended to explain the present invention, and should not be construed as limiting the present invention.
[0069] As shown in Figure 1, the target detection method for complex scenes in UAV aerial photography based on the D-FINE model provided by this invention includes the following steps performed in sequence:
[0070] Step 1: Improve the D-FINE model, which serves as the baseline model, to construct a target detection model for UAV aerial images;
[0071] As shown in Figure 2, the D-FINE model integrates the advantages of the DETR series of object detection algorithms and mainly includes four modules: input, backbone, neck, and output. The input preprocesses the input image. The backbone uses a convolution module with various convolution kernels to perform convolution and pooling on the preprocessed image, thereby extracting feature maps of different sizes. The neck fuses feature maps of different sizes through a sampling and feature concatenation module. The output adopts a decoupled head (Decoder) structure to decouple the classification and regression processes, including positive and negative sample matching and loss calculation.
[0072] As shown in Figure 3, the target detection model for UAV aerial images uses the aforementioned D-FINE model as its base model. In the backbone network, the LoG-Stem edge enhancement module replaces the Stem module of the D-FINE model; in the neck network, the complementary feature downsampling module replaces the convolutional downsampling module and the context-aware module replaces the feature concatenation module of the D-FINE model; and the loss function of the detection head uses the Matchability-Aware Loss (MAL) function instead of the multi-task combined loss function of the D-FINE model.
[0073] Step 2: Select a public dataset and preprocess its original images. Divide the preprocessed images into a training set, a validation set, and a test set in a preset proportion. Train the above UAV aerial image target detection model on the training set, and evaluate and test it on the validation set and test set to obtain the trained UAV aerial image target detection model.
[0074] The method is as follows:
[0075] The VisDrone 2019 public dataset was selected. First, the original images in this dataset were uniformly scaled to meet the model's input size requirements. Then, the pixels of the scaled images were normalized, mapping pixel values to a preset range. Next, data augmentation processing, including random flipping, random cropping, color perturbation, or scaling, was performed on the normalized images to obtain data-augmented images of size H×W×3. After the above preprocessing, the original images were mapped from the original pixel space to the normalized tensor space, obtaining preprocessed images that provide basic data for subsequent feature extraction. Finally, the preprocessed images were divided into training, validation, and test sets in a 7:2:1 ratio.
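The normalization and 7:2:1 split described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names, the fixed random seed, the target range [0, 1], and the `img_0000`-style file names are all assumptions, since the text only specifies "a preset range" and the ratio.

```python
import random

def split_dataset(items, ratios=(7, 2, 1), seed=0):
    """Shuffle items reproducibly and split them by integer ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test

def normalize_pixel(value, lo=0.0, hi=1.0):
    """Map an 8-bit pixel value in [0, 255] linearly onto [lo, hi] (assumed range)."""
    return lo + (hi - lo) * value / 255.0

image_ids = [f"img_{i:04d}" for i in range(1000)]  # hypothetical image names
train_ids, val_ids, test_ids = split_dataset(image_ids)
print(len(train_ids), len(val_ids), len(test_ids))  # 700 200 100
```

Splitting by image identifier before any augmentation is applied is one way to keep augmented copies of the same image from landing in different subsets.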
[0076] The preprocessed image is then input into the backbone network. The input image first enters the LoG-Stem edge enhancement module for convolution and strided downsampling. After channel expansion, a feature map of size H/4×W/4×C1 is output. This feature map is then processed sequentially by four enhancement modules (Stage 1 to Stage 4) for hierarchical feature extraction. As the network depth increases, the receptive field is expanded layer by layer, and higher-level semantic information is extracted.
[0077] As shown in Figure 4, the processing method of the LoG-Stem edge enhancement module is as follows:
[0078] The LoG-Stem edge enhancement module uses the dual capabilities of the Laplacian of Gaussian (LoG) filter, namely noise suppression and edge detection, to enhance edge features at the initial stage. This is particularly important for aerial images, on which deep learning feature extractors often underperform. The LoG filter combines Gaussian smoothing with the Laplacian operator, suppressing noise while highlighting regions of rapid intensity change.
[0079] For an input image I ∈ R^(H×W×3), initial features are first extracted using a 9×9 convolutional layer. Then, a LoG filter with a kernel size of 7×7 and a standard deviation σ=1.0 is used to achieve edge-aware feature representation. The 2D LoG filter at position x=(i,j) is defined as follows:
[0080] (1);
[0081] The expression for a Gaussian filter is:
[0082] (2);
[0083] Where k×k represents the kernel size, and in this invention, the standard deviation σ is taken as 1.0 to achieve the best edge detection effect;
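As an illustration of the filters named above, the sketch below builds a 7×7 LoG kernel with σ=1.0 and a normalized Gaussian kernel. Because equations (1) and (2) are not reproduced in this text, the code assumes the standard textbook forms, LoG(i,j) = -(1/(πσ⁴))·(1 - (i²+j²)/(2σ²))·exp(-(i²+j²)/(2σ²)) and G(i,j) = (1/(2πσ²))·exp(-(i²+j²)/(2σ²)); the mean subtraction, the helper names, and the loop-based filtering routine are likewise assumptions made for the demonstration.

```python
import numpy as np

def _grid(k):
    """Integer coordinates (i, j) centred on the middle of an odd k x k kernel."""
    half = k // 2
    ax = np.arange(-half, half + 1, dtype=np.float64)
    return np.meshgrid(ax, ax, indexing="ij")

def gaussian_kernel(k, sigma):
    """Sampled 2D Gaussian, normalised so the weights sum to 1."""
    i, j = _grid(k)
    g = np.exp(-(i**2 + j**2) / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)
    return g / g.sum()

def log_kernel(k=7, sigma=1.0):
    """Sampled Laplacian of Gaussian (assumed textbook form)."""
    i, j = _grid(k)
    r2 = i**2 + j**2
    log = (-(1.0 / (np.pi * sigma**4))
           * (1.0 - r2 / (2.0 * sigma**2))
           * np.exp(-r2 / (2.0 * sigma**2)))
    return log - log.mean()  # zero-sum, so flat regions give zero response

def filter2d_valid(image, kernel):
    """Plain sliding-window correlation without padding (slow, for clarity)."""
    k = kernel.shape[0]
    h, w = image.shape
    out = np.empty((h - k + 1, w - k + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + k, c:c + k] * kernel)
    return out

kernel = log_kernel(7, 1.0)
flat = np.ones((12, 12))          # constant image: no edges
step = np.zeros((12, 12))
step[:, 6:] = 1.0                 # vertical step edge
flat_response = filter2d_valid(flat, kernel)
edge_response = filter2d_valid(step, kernel)
print(np.abs(flat_response).max(), np.abs(edge_response).max())
```

The constant image produces an essentially zero response while the step edge produces a clearly non-zero one, which is the edge-highlighting behaviour the module relies on.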
[0084] The output of the LoG filter is processed by an activation and normalization (AN) layer and then added to the input image via a residual connection.
[0085] (3);
[0086] This residual-like design preserves image details and promotes stable gradient flow. It is followed by two 3×3 convolutional layers, the second of which downsamples with a stride of 2, reducing the spatial size to H/2×W/2.
[0087] (4);
[0088] To further enhance multi-scale features, Gaussian filters with kernel sizes of 9×9 and 5×5 are used, both with a standard deviation σ=0.5. The outputs are summed, normalized, and then processed by the DRFD (Depth Residual Feature Extraction) module to finally obtain a feature map at 1/4 resolution.
[0089] (5);
[0090] This process enriches edge features and integrates multi-scale context, providing a solid foundation for subsequent detection. Finally, the above feature maps are sequentially input into the four enhancement modules Stage1 to Stage4 for processing to obtain multi-scale feature maps {F1, F2, F3}, which are then input into the neck network.
[0091] The neck network first expands the spatial scale of the high-level semantic features of the multi-scale feature map {F1, F2, F3} output by the backbone network through upsampling; then, it uses a complementary feature downsampling module to enhance the semantic expressive power of low-level features through downsampling; then, it uses a context-aware module to fuse features of different scales by feature concatenation or element-wise addition; and it embeds a Transformer layer in the deeper feature paths to establish global dependencies, enhance long-distance information interaction, and improve the recognition of occluded targets; finally, it outputs a multi-scale fused feature map {P1, P2, P3}.
[0092] As shown in Figure 5, the processing method of the complementary feature downsampling module is as follows:
[0093] First, the input feature map X is copied into feature maps x1, x2, and x3. Then, feature map x1 is downsampled using a cut-slice method. This technique effectively reduces the size of the feature map by cutting adjacent pixels while preserving the original feature information, resulting in a matrix with half the size. This matrix contains the following feature maps:
[0094] ;
[0095] ;
[0096] Where X_ij denotes the feature at position (i,j). Feature maps c1, c2, c3 and c4 are concatenated, increasing the number of channels from C to 4C. Then, a 1×1 convolutional layer with a stride of 1 reduces the number of channels from 4C to 2C, yielding feature map y1. This process is named Dcut, as shown in equation (7). This downsampling improves computational efficiency while preserving key features. For ease of representation, a function called fusion is defined, which concatenates multiple features and fuses them into the required number of channels so that diverse feature representations can be combined efficiently, as shown in equation (6).
[0097] (6);
[0098] (7);
[0099] (8);
[0100] The processing of feature map x2 begins with a 3×3 group convolution (GConv) layer with a stride of 1, where the number of groups equals the number of input channels and the number of channels increases from C to 2C. Subsequently, a 3×3 depthwise convolution layer with a stride of 2 is used for downsampling, together with the Gaussian Error Linear Unit (GELU) activation function, to obtain feature map y2. This process is named Dconv, as shown in Equation (8). Depthwise convolution integrates local feature information, thereby improving feature fusion while halving the feature size. By using group convolution and depthwise convolution, the method of this invention requires fewer floating-point operations (FLOPs) than traditional convolutional downsampling.
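The FLOPs comparison for the Dconv branch can be made concrete with a rough count of multiply-accumulate operations (MACs). The sketch below uses the usual estimate H_out × W_out × k² × (C_in / groups) × C_out per convolution and ignores bias, normalization, and activation costs; the feature-map size 64×64 and C=64 are hypothetical values chosen only for the arithmetic, and the "traditional" baseline is assumed to be a single dense 3×3 convolution with stride 2.

```python
def conv_macs(h_out, w_out, k, c_in, c_out, groups=1):
    """Multiply-accumulate count of one k x k convolution layer."""
    assert c_in % groups == 0 and c_out % groups == 0
    return h_out * w_out * k * k * (c_in // groups) * c_out

H, W, C = 64, 64, 64  # hypothetical input feature-map size and channel count

# Assumed baseline: one dense 3x3 convolution, stride 2, C -> 2C channels.
standard = conv_macs(H // 2, W // 2, 3, C, 2 * C)

# Dconv branch as described: 3x3 group conv (groups = C, stride 1, C -> 2C),
# then 3x3 depthwise conv (groups = 2C, stride 2, 2C -> 2C).
group_part = conv_macs(H, W, 3, C, 2 * C, groups=C)
depthwise_part = conv_macs(H // 2, W // 2, 3, 2 * C, 2 * C, groups=2 * C)
dconv = group_part + depthwise_part

print(standard, dconv, standard / dconv)
```

Under these assumptions the dense convolution costs 75,497,472 MACs against 5,898,240 for the Dconv branch, a ratio of 12.8. The count covers only this one branch, not the whole complementary feature downsampling module.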
[0101] The feature map x3 is processed using a group convolutional layer that shares parameters with Dconv; the number of channels is also increased from C to 2C, and then a 2×2 max pooling operation with a stride of 2 is used to complete the feature downsampling to obtain the feature map y3, and this process is named Dmax, as shown in equation (9):
[0102] (9);
[0103] Where BN, Concat, Conv, DWConvD, and GConv represent batch normalization, concatenation, convolution, depthwise convolution, and group convolution, respectively. Then, by concatenating feature maps y1, y2, and y3, the number of channels in the output feature map is increased to 6C; finally, a 1×1 convolutional layer with a stride of 1 is used to process the data, reducing the number of channels back to 2C, and the final output is the fused feature map Y as shown in equation (10):
[0104] (10);
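The cut-slice branch (Dcut) described above can be sketched with array slicing. The assignment of c1 to c4 to the four pixel parities is an assumption, because the defining equations are not reproduced in this text; batch normalization is omitted, and the random matrix `w` merely stands in for learned 1×1 convolution weights.

```python
import numpy as np

def cut_slice(x):
    """Slice a (C, H, W) map into four parity sub-maps -> (4C, H/2, W/2)."""
    c1 = x[:, 0::2, 0::2]  # even rows, even columns
    c2 = x[:, 1::2, 0::2]  # odd rows, even columns
    c3 = x[:, 0::2, 1::2]  # even rows, odd columns
    c4 = x[:, 1::2, 1::2]  # odd rows, odd columns
    return np.concatenate([c1, c2, c3, c4], axis=0)

def pointwise_conv(x, weight):
    """1x1 convolution with stride 1: a linear map over the channel axis."""
    return np.einsum("oc,chw->ohw", weight, x)

rng = np.random.default_rng(0)
C, H, W = 8, 16, 16                               # hypothetical sizes
x1 = rng.standard_normal((C, H, W))
w = rng.standard_normal((2 * C, 4 * C)) * 0.1     # stand-in for learned weights

sliced = cut_slice(x1)            # channels C -> 4C, spatial size halved
y1 = pointwise_conv(sliced, w)    # channels 4C -> 2C
print(sliced.shape, y1.shape)     # (32, 8, 8) (16, 8, 8)
```

Every input value survives the slicing step; it is only moved from the spatial axes to the channel axis, which is why this form of downsampling keeps the original feature information before the 1×1 convolution compresses it.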
[0105] As shown in Figure 6, the processing method of the context-aware module is as follows:
[0106] First, two input feature maps Pi of the same dimensions are received and merged into a single fused feature map by concatenation. This fused feature map then enters an attention branch consisting of a global average pooling (GAP) layer, a linear layer, a ReLU function, a second linear layer, and a Sigmoid activation function. The GAP layer extracts global information, while the linear layers and the ReLU function perform feature transformation and non-linear mapping. The Sigmoid activation function then outputs attention weights normalized to the [0,1] interval. These attention weights are split into two weight vectors of equal size, which are multiplied element-wise with the two input feature maps Pi. Finally, each weighted feature map is added element-wise to its corresponding feature map Pi through a residual connection to obtain the multi-scale fused feature maps {P1, P2, P3}.
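The attention branch of the context-aware module can be sketched in a few lines. This is an illustrative reading of the description rather than the patented implementation: the hidden width (2C/4), the random weights, and the interpretation of the split as two length-C channel-weight vectors, one per input map, are assumptions. How the two recalibrated maps are merged into one output is not specified in the text, so the sketch returns both.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def context_aware(a, b, w1, b1, w2, b2):
    """Recalibrate two (C, H, W) maps with shared channel attention."""
    fused = np.concatenate([a, b], axis=0)       # concatenation -> (2C, H, W)
    pooled = fused.mean(axis=(1, 2))             # global average pooling -> (2C,)
    hidden = np.maximum(w1 @ pooled + b1, 0.0)   # linear layer + ReLU
    weights = sigmoid(w2 @ hidden + b2)          # linear layer + Sigmoid, in [0, 1]
    wa, wb = np.split(weights, 2)                # one (C,) vector per input map
    out_a = a + a * wa[:, None, None]            # weighting + residual connection
    out_b = b + b * wb[:, None, None]
    return out_a, out_b, weights

rng = np.random.default_rng(0)
C, H, W = 8, 10, 10                              # hypothetical sizes
hidden_dim = (2 * C) // 4                        # assumed reduction ratio of 4
a = rng.random((C, H, W)) + 0.1                  # positive stand-in features
b = rng.random((C, H, W)) + 0.1
w1 = rng.standard_normal((hidden_dim, 2 * C)) * 0.1
b1 = np.zeros(hidden_dim)
w2 = rng.standard_normal((2 * C, hidden_dim)) * 0.1
b2 = np.zeros(2 * C)

out_a, out_b, weights = context_aware(a, b, w1, b1, w2, b2)
print(out_a.shape, out_b.shape, weights.shape)   # (8, 10, 10) (8, 10, 10) (16,)
```

Because the weights lie in [0, 1] and are added on top of the identity path, each channel is scaled by a factor between 1 and 2, so the residual connection can only emphasize channels and never zeroes one out.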
[0107] Subsequently, the above multi-scale fused feature maps {P1, P2, P3} are fed to the output end, where decoupled prediction is performed. The output end includes a classification branch and a regression branch. The classification branch predicts the class probability for each spatial location, and the regression branch performs regression prediction on the target bounding box parameters. The samples participating in the loss calculation are determined by the preset positive and negative sample matching strategy, and the classification loss and localization loss are calculated and jointly optimized. Finally, the predictions are filtered by confidence, overlapping bounding boxes are suppressed, and the final detection results are output, including each target's class label and corresponding location coordinates.
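The final filtering step can be illustrated as follows. The text states that predictions are filtered by confidence and overlapping boxes are suppressed but does not name the procedure, so greedy per-class non-maximum suppression is used here as one common choice; the thresholds of 0.5, the (x1, y1, x2, y2) box format, and the sample detections are all assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def postprocess(detections, score_thr=0.5, iou_thr=0.5):
    """Keep confident detections, then greedily drop same-class overlaps.

    detections: list of (box, score, label) tuples.
    """
    candidates = sorted(
        (d for d in detections if d[1] >= score_thr),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, score, label in candidates:
        overlaps = any(
            label == k_label and iou(box, k_box) > iou_thr
            for k_box, _, k_label in kept
        )
        if not overlaps:
            kept.append((box, score, label))
    return kept

detections = [  # hypothetical model outputs
    ((10, 10, 50, 50), 0.90, "car"),
    ((12, 12, 52, 52), 0.80, "car"),            # overlaps the first car heavily
    ((100, 100, 140, 140), 0.70, "pedestrian"),
    ((200, 200, 220, 220), 0.30, "car"),        # below the confidence threshold
]
results = postprocess(detections)
print([(label, score) for _, score, label in results])
# [('car', 0.9), ('pedestrian', 0.7)]
```

The two car boxes have an IoU of about 0.82, so only the higher-scoring one is kept, and the low-confidence detection is discarded before suppression is considered.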
[0108] During training, the performance of the UAV aerial image target detection model is monitored in real time on the validation set, and its parameters are updated using an appropriate optimizer. A preset anchor-box intersection-over-union (IoU) value is used as the criterion that guides the model in learning target bounding boxes, so that the model effectively learns the features of targets in UAV aerial images. Training is complete once the model's performance on the validation set stabilizes. The trained model is then evaluated on the validation set, and the test set is input into the trained model for detection.
[0109] The evaluation metrics include precision, recall, average precision (AP), number of parameters, giga floating-point operations per second (GFLOPS), and frame rate (FPS), as shown in the following formulas:
[0110] (11);
[0111] (12);
[0112] (13);
[0113] (14);
[0114] GFLOPS = Total number of floating-point operations / Execution time (seconds) (15);
[0115] (16);
[0116] Where TP is the number of images in which the model correctly detected the target, FP is the number of images in which the model incorrectly detected the target, FN is the number of images in which the model failed to detect the target, Precision is the precision, Recall is the recall, and T is the detection time per image.
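The metrics can be computed from the quantities defined above as sketched below. Since equations (11) to (14) and (16) are not reproduced in this text, the code assumes the conventional definitions Precision = TP/(TP+FP), Recall = TP/(TP+FN), and FPS = 1/T. The average precision is approximated as the trapezoidal area under a precision-recall curve, which is only one of several conventions in use, and the GFLOPS helper divides equation (15) by 10⁹ to express the result in giga-operations. All numbers in the usage lines are made up for illustration.

```python
def precision(tp, fp):
    """Share of detections that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Share of ground-truth targets that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def average_precision(recalls, precisions):
    """Trapezoidal area under precision-recall points (assumed convention)."""
    pairs = sorted(zip(recalls, precisions))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pairs, pairs[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area

def gflops(total_flops, seconds):
    """Equation (15), scaled to units of 10^9 operations per second."""
    return total_flops / seconds / 1e9

def fps(seconds_per_image):
    """Frame rate from the detection time T per image."""
    return 1.0 / seconds_per_image

p = precision(tp=80, fp=20)
r = recall(tp=80, fn=40)
ap = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5])
print(p, ap, fps(0.02))
```

With these illustrative counts, precision is 0.8, recall is 2/3, and the three-point curve gives an AP of 0.875.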
[0117] Step 3: Use a drone to capture multiple video streams, and then preprocess all the frame images in the multiple video streams according to the method in Step 2 to obtain the drone aerial images to be detected.
[0118] Step 4: Input the above-mentioned UAV aerial image to be detected into the trained UAV aerial image target detection model for target detection, and finally output the UAV aerial image target detection result.
Claims
1. A target detection method for complex scenes in UAV aerial photography based on the D-FINE model, characterized in that the method includes the following steps performed in sequence: Step 1: Improve the D-FINE model, which serves as the baseline model, to construct a UAV aerial image target detection model; Step 2: Select a public dataset and preprocess the original images in it; divide the preprocessed images into a training set, a validation set, and a test set in a preset proportion; then use the training set to train the above UAV aerial image target detection model, and use the validation set and the test set to evaluate and test it, to obtain the trained UAV aerial image target detection model; Step 3: Use a UAV to capture multiple video streams, and then preprocess all the frame images in the multiple video streams according to the method in Step 2 to obtain the UAV aerial images to be detected; Step 4: Input the above UAV aerial images to be detected into the trained UAV aerial image target detection model for target detection, and finally output the UAV aerial image target detection results.
2. The method for target detection in complex scenes by UAV aerial photography based on the D-FINE model according to claim 1, characterized in that: In step 1, the D-FINE model comprises four modules: an input end, a backbone network, a neck network, and an output end. The input end preprocesses the input image. The backbone network uses convolution modules with various convolution kernels to perform convolution and pooling on the preprocessed image, thereby extracting feature maps of different sizes. The neck network fuses the feature maps of different sizes through sampling and feature concatenation modules. The output end employs a decoupled head structure that separates the classification and regression processes and includes positive and negative sample matching and loss calculation. The UAV aerial image target detection model uses the aforementioned D-FINE model as its baseline model, with the following replacements: in the backbone network, the LoG-Stem edge enhancement module replaces the Stem module of the D-FINE model; in the neck network, the complementary feature downsampling module replaces the convolutional downsampling module of the D-FINE model; in the neck network, the context-aware module replaces the feature concatenation module of the D-FINE model; and the loss function of the detection head uses the matching-aware loss function instead of the multi-task combined loss function of the D-FINE model.
3. The method for target detection in complex scenes by UAV aerial photography based on the D-FINE model according to claim 1, characterized in that: In step 2, the method of selecting a public dataset, preprocessing the original images therein, dividing the preprocessed images into training, validation, and test sets proportionally, training the above UAV aerial image target detection model with the training set, and evaluating and testing it with the validation and test sets to obtain the trained UAV aerial image target detection model is as follows: The VisDrone2019 public dataset is selected. First, the original images in the dataset are scaled to a uniform size. Then, the pixels of the scaled images are normalized to map the pixel values to a preset range. The normalized images are then subjected to data augmentation, including random flipping, random cropping, color perturbation, or scaling, to obtain preprocessed images of size H×W×3. Finally, the preprocessed images are divided into training, validation, and test sets in a ratio of 7:2:1.
The preprocessed image is then used as the input image to the backbone network. The input image first enters the LoG-Stem edge enhancement module for convolution and strided downsampling. After channel expansion, a feature map of size H/4×W/4×C1 is output. Then, the feature map is processed sequentially by the four enhancement modules Stage1 to Stage4 for hierarchical feature extraction. The processing method of the LoG-Stem edge enhancement module is as follows: For an input image I∈R^(H×W×3), initial features are first extracted using a 9×9 convolutional layer. Then, a LoG filter with a kernel size of 7×7 and a standard deviation σ=1.0 is used to achieve edge-aware feature representation. The 2D LoG filter at position x=(i,j) is defined as follows: (1); The expression for the Gaussian filter is: (2); where k×k represents the kernel size and the standard deviation σ is 1.0. The output of the LoG filter, after being processed by an activation and normalization layer, is added to the input image via a residual connection: (3); This is followed by two 3×3 convolutional layers, the second of which uses a stride of 2 for downsampling, thereby reducing the size to H/2×W/2: (4); The data is then processed using Gaussian filters with kernel sizes of 9×9 and 5×5, both with a standard deviation σ=0.5. The outputs are summed, normalized, and then processed through a deep residual feature extraction module to finally obtain a 1/4-resolution feature map: (5); Finally, the above feature map is input sequentially into the four enhancement modules Stage1 to Stage4 for processing to obtain the multi-scale feature maps {F1, F2, F3}, which are input into the neck network.
The neck network first expands the spatial scale of the high-level semantic features of the multi-scale feature maps {F1, F2, F3} output by the backbone network through upsampling; then uses the complementary feature downsampling module to enhance the semantic expressiveness of low-level features through downsampling; then uses the context-aware module to fuse features of different scales by feature concatenation or element-wise addition; and embeds a Transformer layer in the deeper feature paths to establish global dependencies, enhance long-range information interaction, and improve the recognition of occluded targets; finally, it outputs the multi-scale fused feature maps {P1, P2, P3}. The processing method of the complementary feature downsampling module is as follows: First, the input feature map X is copied into feature maps x1, x2, and x3. Then, feature map x1 is downsampled by slicing to obtain four feature maps c1, c2, c3, and c4 of half the spatial size, where X_ij denotes the feature at position (i,j). Feature maps c1, c2, c3, and c4 are concatenated, increasing the number of channels from C to 4C; a 1×1 convolutional layer with a stride of 1 then reduces the number of channels from 4C to 2C, giving feature map y1.
This process is named Dcut, as shown in equation (7); for ease of representation, a function called fusion is defined, as shown in equation (6): (6); (7); (8); The processing of feature map x2 begins with a 3×3 group convolutional layer with a stride of 1, in which the number of groups equals the number of input channels and the number of channels increases from C to 2C; a 3×3 depthwise separable convolutional layer with a stride of 2 then performs downsampling, followed by the Gaussian error linear unit (GELU) activation function, to obtain feature map y2. This process is named Dconv, as shown in equation (8). Feature map x3 is processed by a group convolutional layer that shares parameters with Dconv, so the number of channels likewise increases from C to 2C; a 2×2 max pooling operation with a stride of 2 then completes the downsampling to obtain feature map y3. This process is named Dmax, as shown in equation (9): (9); where BN, Concat, Conv, DWConvD, and GConv denote batch normalization, concatenation, convolution, depthwise convolution, and group convolution, respectively. Feature maps y1, y2, and y3 are then concatenated, so that the output feature map has 6C channels; finally, a 1×1 convolutional layer with a stride of 1 restores the number of channels to 2C, and the final output is the fused feature map Y, as shown in equation (10): (10); The processing method of the context-aware module is as follows: First, two input feature maps Pi of the same dimensions are received and merged into a single fused feature map by concatenation. This fused feature map then enters an attention branch consisting of a global average pooling layer, a linear layer, a ReLU function, a linear layer, and a Sigmoid activation function.
The global average pooling layer extracts global information, while the linear layers and the ReLU function perform feature transformation and non-linear mapping. Finally, the Sigmoid activation function outputs attention weights normalized to the [0,1] interval. These attention weights are then split into two weight vectors of equal length, which are applied element-wise to the two input feature maps Pi. Finally, each weighted feature map is added element-wise to the corresponding feature map Pi through a residual connection to obtain the multi-scale fused feature maps {P1, P2, P3}. Subsequently, the multi-scale fused feature maps {P1, P2, P3} are input to the output end for decoupled prediction; the output end includes a classification branch and a regression branch. The classification branch predicts the class probability for each spatial location; the regression branch performs regression prediction on the parameters of the target bounding box; the samples included in the loss calculation are determined based on the preset positive and negative sample matching strategy; and the classification loss and localization loss are calculated and jointly optimized. Finally, the prediction results are filtered by confidence, overlapping bounding boxes are suppressed, and the final detection results are output, including the target class label and its corresponding location coordinates. During the training process described above, the performance of the UAV aerial image target detection model is monitored in real time on the validation set, and an appropriate optimizer is used to update the parameters of the model. At the same time, a preset anchor-box intersection-over-union (IoU) threshold is used as the criterion that guides the UAV aerial image target detection model in learning the bounding boxes of UAV targets.
This ensures that the UAV aerial image target detection model effectively learns the features of aerial targets in UAV aerial images during training. Training continues until the performance of the model on the validation set stabilizes, which completes the training process. After training, the trained UAV aerial image target detection model is evaluated using the validation set, and the test set is then input into the trained model for detection.
4. The method for target detection in complex scenes by UAV aerial photography based on the D-FINE model according to claim 3, characterized in that: The evaluation metrics include precision, recall, average precision (AP), number of parameters, giga floating-point operations per second (GFLOPS), and frames per second (FPS), as defined by the following formulas: Precision = TP / (TP + FP) (11); Recall = TP / (TP + FN) (12); (13); (14); GFLOPS = Total number of floating-point operations / Execution time (seconds) (15); FPS = 1 / T (16); Where TP is the number of images in which the UAV aerial image target detection model correctly detected the UAV target, FP is the number of images in which the model incorrectly detected the UAV target, FN is the number of images in which the model failed to detect the UAV target, Precision and Recall denote the precision and the recall, respectively, and T is the detection time per image.