A YOLOv5-based unmanned aerial vehicle aerial image target detection method

The method introduces dilated convolution and deformable convolution into YOLOv5 to expand the receptive field, adds a cascaded cross self-attention module between the backbone network and the neck, and optimizes the anchor boxes and the detection head. It addresses the insufficient speed and accuracy of UAV image target detection and improves detection of small targets and dense areas.

CN116091946B · Active · Publication Date: 2026-06-30 · CHONGQING UNIV OF POSTS & TELECOMM

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHONGQING UNIV OF POSTS & TELECOMM
Filing Date
2022-12-06
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing drone image target detection algorithms are insufficient in terms of speed and accuracy, especially when detecting small targets and dense areas. The YOLOv5 algorithm has poor accuracy in identifying the location of objects in drone images and has a low recall rate.

Method used

In the YOLOv5 backbone network, dilated convolution and deformable convolution are introduced to expand the receptive field, a context extraction module is constructed, a cascaded cross self-attention module is added to enhance feature extraction, K-means algorithm is used to cluster anchor boxes, the detection head is optimized, and multi-label classification is combined to improve detection accuracy.

Benefits of technology

It improves the speed and accuracy of target detection in UAV images, especially the detection effect of small targets and dense areas, and enhances the robustness to small targets.

✦ Generated by Eureka AI based on patent content.

Figure CN116091946B_ABST

Abstract

This invention claims protection for a target detection method for UAV aerial images based on YOLOv5, belonging to the field of target detection technology. The method includes the following steps: Step 1. Using the YOLOv5 algorithm as the basic model framework, to improve the accuracy of small target detection in urban aerial images, this invention designs a network with multiple contextual feature extraction methods. Step 2. To enhance the network's attention to dense regions, this invention proposes a cascaded cross-attention algorithm, which is added between the backbone network and the three detection heads to further enhance the information of dense regions. Step 3. Through iterative training and parameter updates, the final network model is obtained. Multi-scale prediction is then used to improve the small target detection performance, and finally, the final result is obtained through prediction by the three-scale detection heads. This invention effectively alleviates the problem of contextual information loss, enhances feature extraction capabilities, captures a more diverse feature space, and achieves clearer anchor box localization.

Description

Technical Field

[0001] This invention relates to the fields of computer vision and deep learning, and in particular to a target detection method for drone aerial images based on YOLOv5.

Background Technology

[0002] With the rapid development of UAV technology and deep learning, high-resolution, large-scale UAV image data is becoming increasingly abundant. However, urban UAV images often suffer from problems such as small targets, high resolution, and uneven target distribution. Artificial neural networks have been widely used in the field of UAV image target detection. Most algorithms are based on prior bounding boxes, which perform well on some traditional datasets, but their performance on UAV images is only average. Therefore, UAV image target detection that balances detection speed and accuracy has become a current research hotspot.

[0003] Object detection involves identifying all objects of interest in an image, encompassing two subtasks: object localization and object classification—that is, simultaneously determining the object's category and location. Currently, widely used object detection methods are mainly divided into two categories: one-stage and two-stage. Two-stage methods are region-based algorithms that divide object detection into detection and recognition stages. First, an algorithm or network searches for regions of interest in the image, and then identifies objects within those regions, such as R-CNN, Fast-RCNN, and Faster-RCNN. One-stage methods, on the other hand, are end-to-end algorithms that directly generate the object's category probability and location coordinates using regression principles, achieving both detection and recognition, such as YOLO and SSD. One-stage methods have an advantage in speed compared to two-stage methods, but their accuracy is relatively lower.

[0004] Due to the challenges of single imaging perspective, dense target distribution, and large target scale variations in UAV images, directly applying natural scene target detection methods to UAV image target detection tasks fails to yield satisfactory results. Furthermore, high resolution and large image size increase the computational cost of the algorithm. In recent years, one-stage algorithms have achieved accuracy comparable to two-stage algorithms. The YOLO algorithm series is a representative example of one-stage algorithms, with YOLOv5 offering a balance between speed and accuracy. However, compared to R-CNN series object detection methods, it suffers from lower accuracy in identifying object locations and lower recall. Therefore, designing an algorithm suitable for fast target detection in UAV images while improving detection accuracy for small targets and densely populated areas remains a challenge.

[0005] CN113807464B discloses a target detection method for UAV aerial images based on an improved YOLO V5, belonging to the fields of deep learning and target detection. This method first constructs a relevant dataset using UAV aerial images. Then, it replaces the slicing layer in the Focus module of the YOLO V5 backbone network with a convolutional layer. Next, it further processes image features using the Neck part. Addressing the issues of cluttered target distribution and small target pixel ratio caused by the high-altitude perspective of UAV aerial photography, the method optimizes the network prediction layer by removing large 76×76×255 detector heads and simultaneously adjusting the anchor boxes. Finally, the target detection performance is evaluated by generalized intersection-over-union ratio, average precision, and inference speed. This method achieves fast and accurate target detection in UAV aerial images while improving recognition accuracy and feature extraction performance.

[0006] Patent CN113807464B did not consider contextual information in the image when improving the backbone network, and it only removed the large detection head without optimizing the remaining ones. This invention, in optimizing the backbone network, uses dilated convolution and deformable convolution to expand the receptive field and obtain more comprehensive contextual information to aid the detection of small targets. In addition, in the detection head, detection performance is enhanced through shallow information from the backbone network and self-attention, making the network more attentive to areas containing objects.

Summary of the Invention

[0007] This invention aims to solve the problems of the prior art mentioned above. It proposes a target detection method for UAV aerial images based on YOLOv5. The technical solution of this invention is as follows:

[0008] A target detection method for drone aerial images based on YOLOv5 includes the following steps:

[0009] Step 1: Divide the drone image dataset into a training set and a test set. Preprocess and augment the training set to obtain a complete sample dataset. Then, use the K-means algorithm to cluster the dataset and obtain the size of the anchor boxes.

[0010] Step 2: Based on the YOLOv5 backbone network, a context extraction module is constructed using dilated convolution and deformable convolution to extract features from UAV images and expand the receptive field;

[0011] Step 3: Between the backbone network and the Neck layer, in order to utilize shallow semantic information and make the network focus on dense regions, a cascaded cross self-attention module is constructed for feature enhancement.

[0012] Step 4: Obtain the final model through complete training, use the model to detect test images, and obtain the detection results.

[0013] Furthermore, step 1 involves preprocessing and data augmentation of the training set to obtain a complete sample dataset, and then using the K-means algorithm to cluster and obtain anchors. Specifically, this includes the following steps:

[0014] Step 1.1: Scale and stretch the images in the initial sample dataset to generate 1088×1088 pixel images while maintaining the anchor box aspect ratio;

[0015] Step 1.2: Perform data augmentation on the image data obtained in Step 1.1 by translation, rotation, and adjustment of saturation and exposure to increase sample data and process the feature parameters of the target to be identified.

[0016] Step 1.3: Using the K-means clustering algorithm, perform cluster analysis on the ground-truth target bounding boxes of the training set obtained in Step 1.2; initialize 9 anchor boxes by randomly selecting 9 of the bounding boxes as the initial anchor box values; calculate the IoU (Intersection over Union) value between each bounding box and each anchor box. The IoU is calculated as follows:

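[0017] $$\mathrm{IoU}(b, a) = \frac{|b \cap a|}{|b \cup a|}$$

[0018] where $b$ is a bounding box, $a$ is an anchor box, ∩ represents the intersection, and ∪ represents the union.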

[0019] Then, for each bounding box, the highest IoU value over the anchor boxes is selected, and these values are averaged over all bounding boxes to give the final accuracy value; finally, 9 accurate anchor boxes are obtained as the preset values of the network.
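Steps 1.3 and the accuracy value above can be sketched as follows. This is a minimal illustration, not the patent's implementation. It assumes boxes are compared by width and height only (as if they shared a corner), that cluster centers are updated with the mean, and that `1 - IoU` serves as the distance. The function names and the `seed` parameter are hypothetical.

```python
import random

def wh_iou(box, anchor):
    """IoU of two (width, height) boxes placed at a common corner (assumption)."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """Cluster (w, h) pairs with k-means, assigning each box by highest IoU."""
    rng = random.Random(seed)
    anchors = rng.sample(boxes, k)  # random initial anchors drawn from the boxes
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda i: wh_iou(box, anchors[i]))
            clusters[best].append(box)
        new_anchors = []
        for anchor, members in zip(anchors, clusters):
            if members:
                # Move the anchor to the mean width/height of its cluster.
                w = sum(b[0] for b in members) / len(members)
                h = sum(b[1] for b in members) / len(members)
                new_anchors.append((w, h))
            else:
                new_anchors.append(anchor)  # an empty cluster keeps its anchor
        if new_anchors == anchors:
            break
        anchors = new_anchors
    return sorted(anchors, key=lambda a: a[0] * a[1])

def mean_best_iou(boxes, anchors):
    """Average over all boxes of the best IoU with any anchor (accuracy value)."""
    return sum(max(wh_iou(b, a) for a in anchors) for b in boxes) / len(boxes)
```

With the training-set boxes as `boxes`, `kmeans_anchors(boxes, k=9)` would return nine (width, height) pairs sorted by area, and `mean_best_iou` reports how well they fit the data.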

[0020] Furthermore, step 2 involves constructing multiple context extraction modules to extract features from the UAV images, specifically including:

[0021] 2.1 After the SPP (Spatial Pyramid Pooling Module) of the backbone network, three sets of dilated convolutions are used to extract features from the original feature maps, with dilation rates set to 1, 2, and 3 respectively; the obtained feature maps are then merged using a Concat operation.

[0022] 2.2 Deformable convolution is then used to correct the boundary information of the feature map obtained in step 2.1. Specifically, an additional convolutional layer is built to learn the offset information, and the offsets are used to reposition the convolution sampling positions. Finally, to keep the number of channels the same, a 1×1 convolution is applied for dimensionality reduction, and a skip connection is made for feature fusion.
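To see why the three dilation rates in step 2.1 widen the receptive field, the short sketch below computes the span covered by one kernel at each rate. The 3×3 kernel size is an assumption (the text does not state it), and both helper names are hypothetical.

```python
def effective_kernel(k, dilation):
    """Span, in input pixels, covered by a k-tap kernel with the given dilation."""
    return k + (k - 1) * (dilation - 1)

def tap_offsets(k, dilation):
    """Offsets, relative to the kernel centre, of the input samples that are read."""
    half = k // 2
    return [dilation * (i - half) for i in range(k)]

for rate in (1, 2, 3):  # dilation rates named in step 2.1
    print(rate, effective_kernel(3, rate), tap_offsets(3, rate))
# 1 3 [-1, 0, 1]
# 2 5 [-2, 0, 2]
# 3 7 [-3, 0, 3]
```

Under that assumption, the three branches see 3, 5, and 7 pixels per axis with the same number of weights, which is the multi-scale context that the Concat operation merges.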

[0023] Furthermore, step 3, which involves constructing a cascaded cross-attention mechanism for feature extraction from the UAV image, specifically includes:

[0024] 3.1 The algorithm is implemented on the Ultralytics version of the YOLOv5 network model, with cross self-attention cascaded between the backbone and the detection head. First, three cross-convolutions are used to extract features from the feature map. The cross-convolutions emphasize edge information by mining vertical and horizontal gradient information in parallel, providing information enhancement for the subsequent self-attention. The cross-convolutions are designed using two asymmetric, mutually perpendicular filters, denoted 1×3 and 3×1 respectively. With $F_{in}$ and $F_{out}$ denoting the input and output feature maps, we have

[0025]

[0026] where $k_{1 \times 3}$ and $k_{3 \times 1}$ represent convolution kernels of different sizes.
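The parallel 1×3 and 3×1 branches can be illustrated on a tiny single-channel map. The formula paragraph is blank in this text, so the sketch assumes the two branch outputs are summed. The zero padding, the gradient-style kernel values, and all function names are also assumptions made for illustration.

```python
def conv_1x3(img, k):
    """Horizontal 1x3 filter (cross-correlation) with zero padding."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for j, weight in enumerate(k):
                xx = x + j - 1
                if 0 <= xx < w:
                    out[y][x] += weight * img[y][xx]
    return out

def conv_3x1(img, k):
    """Vertical 3x1 filter (cross-correlation) with zero padding."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for i, weight in enumerate(k):
                yy = y + i - 1
                if 0 <= yy < h:
                    out[y][x] += weight * img[yy][x]
    return out

def cross_conv(img, k_1x3, k_3x1):
    """Run both branches on the same input and add them (assumed combination)."""
    a, b = conv_1x3(img, k_1x3), conv_3x1(img, k_3x1)
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

# A vertical edge: the horizontal branch responds along the edge columns.
image = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1]]
grad = [-1, 0, 1]
print(cross_conv(image, grad, grad)[1])  # [0.0, 1.0, 1.0, -1.0]
```

In the middle row the vertical branch contributes nothing, because the rows above and below are identical, so the response there comes entirely from the horizontal filter.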

[0027] 3.2 For each feature map $I \in \mathbb{R}^{C \times H \times W}$ (C represents the number of channels, H the height of the feature map, and W the width of the feature map; $\mathbb{R}^{C \times H \times W}$ covers all feature maps that pass through this network layer), three feature maps Q, K, and V are first generated independently using cross-convolution, where $Q, K \in \mathbb{R}^{C' \times H \times W}$. C and C′ both represent numbers of channels, and here C′ is set to C / 8. Each Q is reshaped and decomposed into Q_H(B×W, H, C′) and Q_W(B×H, W, C′), which represent the vertical and horizontal decompositions of the Q feature map, respectively. The same process is then performed on K; subsequently, weights are computed along the horizontal and vertical directions to obtain A (Attention).

[0028]

[0029] After obtaining A (Attention) and V (Value), A is processed to obtain A_H(B×W,H,H) and A_W(B×H,W,W), which represent the vertical and horizontal decompositions of the feature map A, respectively. V is also processed to obtain V_H(B×W,C,H) and V_W(B×H,C,W); V_H and V_W represent the vertical and horizontal decompositions of the feature map V, respectively. Out represents the final output feature map.

[0030]

[0031] Finally, the cross self-attention is concatenated to ensure that each point on the feature map can be associated with other points for computation.
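The reason for cascading can be checked with a small reachability sketch. It assumes that one cross self-attention pass lets a position attend only to its own row and column, and that the cascade applies two such passes in series (the number of passes is not stated in this text). The helper names are hypothetical.

```python
def criss_cross_neighbours(pos, height, width):
    """Positions sharing a row or a column with pos (pos itself included)."""
    y, x = pos
    same_column = {(yy, x) for yy in range(height)}
    same_row = {(y, xx) for xx in range(width)}
    return same_column | same_row

def reachable(pos, height, width, passes):
    """Positions whose features can influence pos after the given number of passes."""
    seen = {pos}
    for _ in range(passes):
        seen = set().union(*(criss_cross_neighbours(p, height, width) for p in seen))
    return seen

H, W = 5, 7
one = reachable((0, 0), H, W, passes=1)
two = reachable((0, 0), H, W, passes=2)
print(len(one), len(two), H * W)  # 11 35 35
```

A single pass covers H + W - 1 positions, while the second pass reaches every position on the map, at a lower cost than attending to all H × W positions directly.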

[0032] 3.3 The enhanced features are input into the YOLO detection heads of three scales, corresponding to small, medium and large target objects respectively. The anchor boxes clustered in step 1.3 are used as prior boxes, and the number of predicted object categories is set.

[0033] 3.4 At this point, the entire network framework has been built.

[0034] Furthermore, step 4 involves obtaining the final model through complete training, using the model to perform object detection on the test image, and obtaining the final detection result, specifically including:

[0035] 4.1 Train the network constructed in step 3 on the training set to obtain the network output model;

[0036] 4.2 The network output is downsampled to obtain three multi-scale feature maps. Each cell in a feature map predicts 3 bounding boxes, and each bounding box predicts: (1) the position of the box (4 values: the center coordinates $t_x$ and $t_y$, the box height $b_h$, and the width $b_w$); (2) an objectness prediction (confidence); (3) N categories;
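The size of each prediction tensor follows from the counts in step 4.2. The sketch below assumes the stock YOLOv5 strides of 8, 16, and 32, which this text does not state, and uses a hypothetical class count of 10. The function name is also hypothetical.

```python
def head_shapes(image_size, num_classes, strides=(8, 16, 32), boxes_per_cell=3):
    """Return (grid, grid, channels) for each detection scale."""
    # Each box carries 4 position values, 1 objectness score, and N class scores.
    channels = boxes_per_cell * (4 + 1 + num_classes)
    return [(image_size // s, image_size // s, channels) for s in strides]

print(head_shapes(1088, num_classes=10))
# [(136, 136, 45), (68, 68, 45), (34, 34, 45)]
```

The same arithmetic reproduces the 76×76×255 head mentioned in paragraph [0005]: a 608-pixel input at stride 8 with 80 classes gives 3 × (4 + 1 + 80) = 255 channels on a 76×76 grid.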

[0037] 4.3 Bounding box coordinate prediction: $t_x$, $t_y$, $t_w$, and $t_h$ are the model's predicted outputs; $c_x$ and $c_y$ are the coordinates of a grid cell, for example, for the grid cell in row 0 and column 1, $c_x$ is 0 and $c_y$ is 1; $p_w$ and $p_h$ indicate the size of the bounding box before prediction; $b_x$, $b_y$, $b_w$, and $b_h$ are the center coordinates and size of the predicted bounding box; the loss for the coordinates uses the squared error loss.

[0038] $b_x = \delta(t_x) + c_x$

[0039] $b_y = \delta(t_y) + c_y$

[0040] $b_w = p_w e^{t_w}$

[0041] $b_h = p_h e^{t_h}$

[0042] $P_r(\text{object}) \cdot \mathrm{IOU}(b, \text{object}) = \delta(t_0)$
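A minimal decoding sketch for the prediction equations of step 4.3 follows. It reads δ as the sigmoid function and assumes the YOLOv3-style exponential form for width and height, since the corresponding formula paragraphs are blank in this text. The function names and the example numbers are hypothetical.

```python
import math

def sigmoid(v):
    """Logistic function, squashing any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def decode_box(t, cell, prior):
    """Turn raw outputs (tx, ty, tw, th, t0) into a box expressed in grid units."""
    tx, ty, tw, th, t0 = t
    cx, cy = cell   # coordinates of the grid cell
    pw, ph = prior  # prior (anchor) width and height
    bx = sigmoid(tx) + cx   # centre stays inside the predicting cell
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)  # assumed YOLOv3-style width decoding
    bh = ph * math.exp(th)  # assumed YOLOv3-style height decoding
    return bx, by, bw, bh, sigmoid(t0)

print(decode_box((0.0, 0.0, 0.0, 0.0, 0.0), cell=(3, 4), prior=(2.0, 5.0)))
# (3.5, 4.5, 2.0, 5.0, 0.5)
```

With all raw outputs at zero, the box sits at the center of its cell, keeps the prior's size, and has an objectness of 0.5.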

[0043] 4.4 Category prediction adopts multi-label classification. The category label in the detection result may have two classes at the same time, so a logistic regression layer is needed to perform binary classification for each class. The logistic regression layer mainly uses the sigmoid function, which can constrain the input to the range of 0 to 1. Therefore, when the output of a certain class of an image after feature extraction is constrained by the sigmoid function and is greater than 0.5, it means that it belongs to that class.
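The multi-label rule in step 4.4 can be sketched as an independent sigmoid per class followed by the 0.5 threshold. The class names and logits are made up for illustration, and the function name is hypothetical.

```python
import math

def predict_labels(logits, class_names, threshold=0.5):
    """Return every class whose sigmoid score exceeds the threshold."""
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [name for name, s in zip(class_names, scores) if s > threshold]

print(predict_labels([2.0, 0.3, -1.5], ["car", "van", "truck"]))
# ['car', 'van']
```

Because each class is scored on its own rather than through a softmax, two labels can pass the threshold for the same box, which is the case the paragraph describes.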

[0044] The advantages and beneficial effects of this invention are as follows:

[0045] This invention addresses the problem of low detection accuracy for small targets and unevenly distributed objects in popular UAV image target detection tasks based on deep convolutional neural networks. It proposes a UAV target detection method based on YOLOv5, incorporating multiple contextual information extraction modules and cascaded cross self-attention. In the network structure design stage, YOLOv5 is selected as the baseline algorithm. Dilated convolution and deformable convolution replace traditional convolution to extract multiple contextual features and expand the receptive field. Considering that the features extracted in the backbone network are shallow features with rich semantic information, while the features in the neck are deep features, cascaded cross self-attention is added between the backbone network and the neck to effectively enhance the neck features. Cascaded cross self-attention calculates horizontal and vertical weights and concatenates them to obtain global features, further enhanced by cross-convolution. The extracted features are used as input to the YOLO detection head for prediction at three scales, improving robustness for small target detection. This method achieves good detection results.

Attached Figure Description

[0046] Figure 1 is a network framework diagram of a preferred embodiment of the UAV image target detection method based on YOLOv5 provided by this invention;

[0047] Figure 2 is a schematic diagram of the multi-context information extraction module of the present invention;

[0048] Figure 3 is a schematic diagram of the cascaded cross self-attention of the present invention.

Detailed Implementation

[0049] The technical solutions of the present invention will be clearly and thoroughly described below with reference to the accompanying drawings. The described embodiments are merely some embodiments of the present invention.

[0050] The technical solution of the present invention to solve the above-mentioned technical problems is:

[0051] This invention is based on the YOLOv5 object detection framework, detailed at https://github.com/ultralytics/yolov5. A context extraction module consisting of dilated convolutions and deformable convolutions is added after the SPP module in the backbone network to expand the receptive field. An attention mechanism combining cross-convolutions and self-attention is added between the backbone network and the neck to enhance the network's attention to dense areas.

[0052] The invention will be further described below with reference to the accompanying drawings:

[0053] As shown in Figure 1, the design process of a network framework for a UAV image target detection method based on YOLOv5 includes the following steps:

[0054] A. This design process is performed on the Ultralytics version of the YOLOv5 network model. The YOLOv5 backbone includes a Focus module, an SPP (Spatial Pyramid Pooling) module, and multiple CBS and C3 modules.

[0055] B. After the SPP module at the end of the backbone network, add the multi-context information extraction module shown in Figure 2. First, three sets of dilated convolutions are used to extract features from the original feature map, with dilation rates set to 1, 2, and 3 respectively. The resulting feature maps are then merged using a concat operation.

[0056] C. Further refine the boundary information using deformable convolution on the feature map obtained in the previous step. Specifically, an additional convolutional layer is built to learn the offset information, and the offsets are used to reposition the convolution sampling positions. Finally, to keep the number of channels the same, a 1×1 convolution is applied for dimensionality reduction, and a skip connection is added for feature fusion.

[0057] Furthermore, to fuse the shallow information from the backbone network and the deep information from the neck, and to make the network pay more attention to dense areas, a cross-attention mechanism is implemented between the backbone and the detection head. The network flow design is shown in Figure 3, and the specific implementation steps are as follows:

[0058] A. First, use three cross-convolutions to extract features from the feature map. Cross-convolutions emphasize edge information by mining vertical and horizontal gradient information in parallel, providing information enhancement for the subsequent self-attention. The cross-convolutions are designed using two asymmetric, mutually perpendicular filters, denoted 1×3 and 3×1 respectively. With $F_{in}$ and $F_{out}$ denoting the input and output feature maps, we have

[0059]

[0060] B. For each feature map $I \in \mathbb{R}^{C \times H \times W}$, three feature maps Q, K, and V are first generated independently using cross-convolution, where $Q, K \in \mathbb{R}^{C' \times H \times W}$; C and C′ both represent numbers of channels. Here, we set C′ = C / 8. Each Q is reshaped and decomposed into Q_H(B×W,H,C′) and Q_W(B×H,W,C′). We then perform the same process on K. Finally, we compute weights along the horizontal and vertical directions to obtain A (Attention).

[0061]

[0062] After obtaining A (Attention) and V (Value), we process A to obtain A_H(B×W,H,H) and A_W(B×H,W,W), and we also process V to obtain V_H(B×W,C,H) and V_W(B×H,C,W).

[0063]

[0064] C. Finally, we concatenate the cross self-attention to ensure that each point on the feature map can be associated with other points for computation.

[0065] D. Input the enhanced features into the YOLO detection heads of three scales, corresponding to small, medium and large target objects respectively. Use the anchor boxes clustered in 1.3 as prior boxes and set the number of predicted object categories.

[0066] Furthermore, the final model is obtained through complete training. The model is then used to detect images to be tested, and the final detection results are obtained. The specific steps are as follows:

[0067] A. Train the network constructed in the above steps on the training set to obtain the network output model;

[0068] B. The network output is downsampled to obtain three multi-scale feature maps. Each cell in a feature map predicts three bounding boxes, and each bounding box predicts three things: (1) the position of the box (4 values: the center coordinates $t_x$ and $t_y$, the box height $b_h$, and the width $b_w$); (2) an objectness prediction (confidence); (3) N categories;

[0069] C. Bounding box coordinate prediction: $t_x$, $t_y$, $t_w$, and $t_h$ are the model's predicted outputs. $c_x$ and $c_y$ are the coordinates of a grid cell. For example, if the size of a feature map in a certain layer is 13×13, then there are 13×13 grid cells, and for the grid cell in row 0 and column 1, $c_x$ is 0 and $c_y$ is 1. $p_w$ and $p_h$ indicate the size of the bounding box before prediction. $b_x$, $b_y$, $b_w$, and $b_h$ are the center coordinates and size of the predicted bounding box. The loss for the coordinates uses the squared error loss; $b_x = \delta(t_x) + c_x$

[0070] $b_y = \delta(t_y) + c_y$

[0071] $b_w = p_w e^{t_w}$

[0072] $b_h = p_h e^{t_h}$

[0073] $P_r(\text{object}) \cdot \mathrm{IOU}(b, \text{object}) = \delta(t_0)$

[0074] D. Category prediction employs multi-label classification. In complex scenarios, an object may belong to multiple classes, and the detection results may simultaneously show two class labels. Therefore, a logistic regression layer is needed to perform binary classification for each class. The logistic regression layer primarily uses the sigmoid function, which constrains the input to the range of 0 to 1. Therefore, if the output of a certain class after feature extraction is greater than 0.5 after being constrained by the sigmoid function, it indicates that the image belongs to that class.

[0075] The systems, devices, modules, or units described in the above embodiments can be implemented by computer chips or entities, or by products with certain functions. A typical implementation device is a computer. Specifically, a computer can be, for example, a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or any combination of these devices.

[0076] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0077] The above embodiments should be understood as illustrative only and not as limiting the scope of protection of the present invention. After reading the description of the present invention, those skilled in the art can make various alterations or modifications to the present invention, and these equivalent changes and modifications also fall within the scope defined by the claims of the present invention.

Claims

1. A target detection method for UAV aerial images based on YOLOv5, characterized in that it includes the following steps: Step 1: Divide the drone image dataset into a training set and a test set. Preprocess and augment the training set to obtain a complete sample dataset. Then, use the K-means algorithm to cluster the dataset and obtain the size of the anchor boxes. Step 2: Based on the YOLOv5 backbone network, various context extraction modules are constructed using dilated convolution and deformable convolution to extract features from UAV images and expand the receptive field; Step 3: Between the backbone network and the Neck layer, in order to utilize shallow semantic information and make the network focus on dense regions, a cascaded cross self-attention module is constructed for feature enhancement. Step 4: Obtain the final model through complete training, use the model to detect the test images, and obtain the detection results; Step 3, which involves constructing a cascaded cross-attention mechanism for feature extraction from UAV images, specifically includes: 3.1 The algorithm is implemented on the Ultralytics version of the YOLOv5 network model, with cross self-attention cascaded between the backbone and the detection head. First, three cross-convolutions are used to extract features from the feature map. The cross-convolutions emphasize edge information by mining vertical and horizontal gradient information in parallel, providing information enhancement for the subsequent self-attention. The cross-convolutions are designed using two asymmetric, mutually perpendicular filters, denoted 1×3 and 3×1; $F_{in}$ and $F_{out}$ are the input and output feature maps, and $k_{1 \times 3}$ and $k_{3 \times 1}$ represent convolution kernels of different sizes; 3.2 For each feature map $I \in \mathbb{R}^{C \times H \times W}$, C represents the number of channels, H represents the height of the feature map, W represents the width of the feature map, and $\mathbb{R}^{C \times H \times W}$ covers all feature maps that pass through this network layer. First, three feature maps Q, K, and V are generated independently using cross-convolution, where $Q, K \in \mathbb{R}^{C' \times H \times W}$; C and C′ both represent numbers of channels, and C′ = C / 8 is set here; each Q is reshaped and decomposed into Q_H(B×W,H,C′) and Q_W(B×H,W,C′), which represent the vertical and horizontal decompositions of the Q feature map, respectively; then the same process is performed on K; finally, A is obtained by weighting the horizontal and vertical directions respectively; after obtaining A and V, A is processed to obtain A_H(B×W,H,H) and A_W(B×H,W,W), which represent the vertical and horizontal decompositions of feature map A, respectively; V is also processed to obtain V_H(B×W,C,H) and V_W(B×H,C,W), which represent the vertical and horizontal decompositions of the V feature map, respectively; Out represents the final output feature map; finally, the cross self-attention is concatenated to ensure that each point on the feature map can be associated with other points for computation. 3.3 The enhanced features are input into the YOLO detection heads of three scales, corresponding to small, medium and large target objects respectively. The anchor boxes clustered in step 1.3 are used as prior boxes, and the number of predicted object categories is set. 3.4 At this point, the entire network framework has been built.

2. The target detection method for UAV aerial images based on YOLOv5 according to claim 1, characterized in that Step 1 involves preprocessing and data augmentation of the training set to obtain a complete sample dataset, and then using the K-means algorithm to cluster and obtain anchors. Specifically, this includes the following steps: Step 1.1: Scale and stretch the images in the initial sample dataset to generate 1088×1088 pixel images while maintaining the anchor box aspect ratio; Step 1.2: Perform data augmentation on the image data obtained in Step 1.1 by translation, rotation, and adjustment of saturation and exposure to increase sample data and process the feature parameters of the target to be identified. Step 1.3: Using the K-means clustering algorithm, perform cluster analysis on the ground-truth target bounding boxes of the training set obtained in Step 1.2; initialize 9 anchor boxes by randomly selecting 9 of the bounding boxes as the initial anchor box values; calculate the IoU (Intersection over Union) between each bounding box and each anchor box. The IoU is calculated as follows: $\mathrm{IoU}(b, a) = |b \cap a| / |b \cup a|$, where $b$ is a bounding box, $a$ is an anchor box, ∩ indicates the intersection, and ∪ represents the union; then, for each bounding box, the highest IoU value is selected, and the average value over all bounding boxes is calculated, which is the final accuracy value; finally, 9 accurate anchor boxes are obtained as the preset values of the network.

3. The method for target detection in UAV aerial images based on YOLOv5 according to claim 2, characterized in that Step 2 involves constructing multiple context extraction modules to extract features from UAV images, specifically including: 2.1 After the SPP spatial pyramid pooling module of the backbone network, three sets of dilated convolutions are used to extract features from the original feature maps, with dilation rates set to 1, 2, and 3 respectively; the obtained feature maps are then merged using a Concat operation. 2.2 Continue to use deformable convolution to correct the boundary information of the feature map obtained in step 2.1. Specifically, build an additional convolutional layer to learn the offset information and use the offsets to reposition the convolution sampling positions. Finally, to keep the number of channels the same, a 1×1 convolution is applied for dimensionality reduction and a skip connection is made for feature fusion.

4. The method for target detection in UAV aerial images based on YOLOv5 according to claim 1, characterized in that Step 4 involves obtaining the final model through complete training, using the model to perform object detection on the test images, and obtaining the final detection results, specifically including: 4.1 Train the network constructed in step 3 on the training set to obtain the network output model; 4.2 The network output is downsampled to obtain three multi-scale feature maps. Each cell in a feature map predicts 3 bounding boxes. Each bounding box predicts: (1) the position of the box, i.e., 4 values: the center coordinates $t_x$ and $t_y$, the box height $b_h$, and the width $b_w$; (2) an objectness prediction confidence score; (3) N categories; 4.3 Bounding box coordinate prediction: $t_x$, $t_y$, $t_w$, and $t_h$ are the model's predicted outputs; $c_x$ and $c_y$ are the coordinates of a grid cell, for example, for the grid cell in row 0 and column 1, $c_x$ is 0 and $c_y$ is 1; $p_w$ and $p_h$ indicate the size of the bounding box before prediction; $b_x$, $b_y$, $b_w$, and $b_h$ are the center coordinates and size of the predicted bounding box; the loss for the coordinates uses the squared error loss: $b_x = \delta(t_x) + c_x$; $b_y = \delta(t_y) + c_y$; $b_w = p_w e^{t_w}$; $b_h = p_h e^{t_h}$; $P_r(\text{object}) \cdot \mathrm{IOU}(b, \text{object}) = \delta(t_0)$; 4.4 Category prediction adopts multi-label classification. The category label in the detection result may have two classes at the same time. Therefore, a logistic regression layer is needed to perform binary classification for each class. The logistic regression layer mainly uses the sigmoid function, which can constrain the input to the range of 0 to 1. Therefore, when the output of a certain class of an image after feature extraction is constrained by the sigmoid function and is greater than 0.5, it means that it belongs to that class.