Dynamic Context Relationship Acquisition System for UAV Target Detection
With the dynamic context relationship acquisition system DyCC-Net, execution or skipping of the context relationship acquisition module is selected dynamically. Combined with pseudo-label semi-supervised learning, this addresses the waste of computational resources in UAV target detection algorithms and improves detection performance.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- GUANGZHOU RES INST OF XIAN UNIV OF ELECTRONIC SCI & TECH
- Filing Date
- 2022-11-29
- Publication Date
- 2026-06-30
AI Technical Summary
On platforms with limited computing resources, drone target detection algorithms face a contradiction between huge computational overhead and insufficient computing power. In particular, the use of static architecture for images of varying recognition difficulty leads to a waste of computing resources.
A dynamic context acquisition system (DyCC-Net) for UAV target detection is designed. Through dynamic gate learning, it learns the mapping function from the feature map of the input image to the gate signal, dynamically chooses to execute or skip the context acquisition module, and incorporates a pseudo-label semi-supervised learning strategy to allocate computing resources reasonably.
It reduces the computational cost of UAV target detection, cuts inference time to less than one-third of that of existing methods, and improves detection performance, with AP75 1.97% higher.
Figure CN115830477B_ABST
Description
Technical Field
[0001] This invention belongs to the fields of intelligent unmanned systems, artificial intelligence, and computer vision technology, and relates to a dynamic context relationship acquisition system for target detection in unmanned aerial vehicles.
Background Technology
[0002] Unmanned aerial vehicle (UAV) platforms equipped with visible light cameras have attracted widespread attention. These platforms can be deployed quickly and cost-effectively in various emerging applications, such as aerial photography and aerial video surveillance. There is a strong demand for automatically identifying targets of interest in UAV images or video data. Thanks to advancements in deep neural networks in remote sensing image processing, algorithms for target detection from a UAV perspective (UAV target detection) have made significant progress.
[0003] Mainstream UAV target detection algorithms typically incorporate a contextual information acquisition module. This module gathers information about the surrounding environment of the target to enrich the information about the region containing low-quality targets. While this approach effectively improves the accuracy of existing detection algorithms, deploying it on a UAV platform and implementing onboard processing still presents numerous challenges. The most intractable problem is the contradiction between the algorithm's enormous computational overhead and the extremely limited computing power of the UAV platform.
[0004] Most UAV target detection algorithms use a static architecture. As shown in Figure 1, the context relationship acquisition module is applied to input images of every recognition difficulty. Executing this module for simple images clearly wastes a significant amount of computational resources.
Summary of the Invention
[0005] The purpose of this invention is to provide a dynamic contextual relationship acquisition system for UAV target detection. This system can automatically distinguish images of different difficulty and autonomously select whether to execute the contextual relationship acquisition module, thereby reducing the overhead of the detection algorithm.
[0006] The technical solution adopted in this invention is a dynamic context relationship acquisition system for UAV target detection, comprising a backbone network. The backbone network is connected to a top-down path through a dynamic context collector, and the top-down path is connected to a detection head.
[0007] The invention is further characterized by:
[0008] The dynamic context collector includes a dynamic gate, which learns the mapping function from the feature map of the input image to the gate signal. The dynamic gate includes a gate network and an activation function.
[0009] The gate network consists of a global average pooling layer (GAP), a fully connected layer (FC1), a fully connected layer (FC2), and a ReLU layer connected in sequence;
[0010] The output $\pi$ of the gate network is expressed by the following formula (1):
[0011] $\pi = \mathrm{ReLU}(\mathrm{FC2}(\mathrm{FC1}(\mathrm{GAP}(x_i))))$ (1);
[0012] where $x_i$ is the input to the i-th layer of the FPN. Assume the shape of the input features of GateNet is $H \times W \times C$, where $H$ is the length of the input features, $W$ is the width of the input features, and $C$ is the number of channels of the input features; the shape of the output features is $H' \times W' \times C'$, where $H'$ is the length of the output features, $W'$ is the width of the output features, and $C'$ is the number of channels of the output features.
[0013] The gate network adopts a convolutional layer (Conv) to collect contextual information. The convolutional layer is followed by a global average pooling layer (GAP), which captures the contextual information of the entire image. Finally, a fully connected layer (FC) outputs a two-dimensional vector $\pi$, which is expressed by the following formula (2):
[0014] $\pi = \mathrm{FC}(\mathrm{GAP}(\mathrm{Conv}(x_i)))$ (2).
[0015] The gate network contains a max pooling layer (MaxPool) with a stride of 2 and a convolutional layer (Conv), followed by a global average pooling layer (GAP) and a fully connected layer (FC). Its output $\pi$ is calculated using the following formula (3):
[0016] $\pi = \mathrm{FC}(\mathrm{GAP}(\mathrm{Conv}(\mathrm{MaxPool}(x_i))))$ (3).
[0017] The activation function is a continuous, differentiable function of the class probabilities $\pi$ and predicts the k-dimensional one-hot encoding $y$ using the following formula:
[0018] $y_j = \dfrac{\exp\left((\log \pi_j + g_j)/\tau\right)}{\sum_{l=1}^{k} \exp\left((\log \pi_l + g_l)/\tau\right)}, \quad j = 1, \dots, k$ (4);
[0019] In the formula, $g_1, \dots, g_k$ are independent and identically distributed samples drawn from Gumbel(0,1), and $\tau$ is a temperature parameter.
[0020] The beneficial effects of this invention are as follows. DyCC-Net has been extensively tested on two widely used UAV-captured datasets and compared with state-of-the-art (SOTA) methods, which validates its effectiveness. Experimental results show that DyCC-Net's inference time is less than one-third that of the SOTA model, which reduces the model's computational cost. The results also show that DyCC-Net's AP75 is 1.97% higher than that of the SOTA model.
Description of the Drawings
[0021] Figure 1 is a static architecture diagram of an existing UAV detector;
[0022] Figure 2 is an application principle diagram of the dynamic context relationship acquisition system for UAV target detection of the present invention;
[0023] Figure 3 is a schematic diagram of the dynamic context relationship acquisition system for UAV target detection of the present invention;
[0024] Figure 4 is a schematic diagram of the dynamic gate structure in the dynamic context relationship acquisition system for UAV target detection of the present invention;
[0025] Figure 5 is a flowchart of the pseudo-label generation algorithm in the dynamic context relationship acquisition system for UAV target detection of the present invention;
[0026] Figure 6 shows an image taken by a UAV at an altitude of 100 m above the ground, together with the detection result of DyCC-Net, in an embodiment of the dynamic context relationship acquisition system for UAV target detection of the present invention;
[0027] Figure 7 shows the detection performance of DyCC-Net near the ground in an embodiment of the dynamic context relationship acquisition system for UAV target detection of the present invention.
Detailed Implementation
[0028] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments.
[0029] This invention proposes a Dynamic Context Collection Network (DyCC-Net) for UAV target detection. As shown in Figure 2, DyCC-Net selectively executes the context acquisition module according to the complexity of the input image, thereby achieving input-aware inference. This invention also proposes a pseudo-label-based semi-supervised learning strategy, which uses generated pseudo-labels as supervision signals and allocates computational resources according to the difficulty of the input image. DyCC-Net uses dynamic gates to execute or skip the context acquisition module for different inputs: for simple input images it skips the module, while for complex input images it executes the module.
[0030] Figure 3 illustrates the architecture of DyCC-Net, which consists of four modules. The first module is the backbone network, used for feature extraction. The second is the top-down path, used for multi-scale feature extraction. The third is the Dynamic Context Collector (DyCC) proposed in this invention, which supports input-aware inference. The last is the detection head, used to estimate bounding box positions and classification scores.
[0031] DyCC consists of three components: a dynamic gate, a heavy chain, and a light chain. The dynamic gate, comprising a gate network (GateNet) and an activation function, predicts the gate signal and assigns the appropriate chain, either heavy or light, to the input. DyCC aims to reduce computational cost by allocating different computational resources to simple and complex images. The computational cost of DyCC-Net does not decrease during training, because both chains process simple and complex images alike. During inference, simple inputs run only the light chain and skip the heavy chain, which reduces the computational cost of DyCC-Net.
[0032] The dynamic gate learns the mapping function from the feature map of the input image to the gate signal. The gate signal is an approximately one-hot two-dimensional vector: the heavy chain is selected when its value approaches [0, 1], and the light chain is selected when its value approaches [1, 0]. During the training phase, the outputs of the two chains are multiplied by the two corresponding elements of the gate signal. During the testing phase, if the output of the dynamic gate approaches [1, 0], the heavy chain is skipped and only the light chain is run. The dynamic gate consists of a gate network and a gate activation function.
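The training-time weighting and test-time skipping described in [0032] can be illustrated with a minimal sketch. It is not the patented implementation: the function name `dycc_forward`, the 0.5 decision threshold, and the choice to run only the heavy chain when the gate approaches [0, 1] are assumptions made for illustration, and the two chains are passed in as arbitrary callables.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def dycc_forward(
    x: Vector,
    gate: Sequence[float],
    light_chain: Callable[[Vector], Vector],
    heavy_chain: Callable[[Vector], Vector],
    training: bool,
    threshold: float = 0.5,  # hypothetical cut-off, not given in the source
) -> Vector:
    """Apply a 2-element gate signal [g_light, g_heavy] to the two chains."""
    g_light, g_heavy = gate
    if training:
        # Training: both chains run, and each output is multiplied by its
        # gate element, so no computation is saved in this phase.
        light_out = light_chain(x)
        heavy_out = heavy_chain(x)
        return [g_light * a + g_heavy * b for a, b in zip(light_out, heavy_out)]
    # Inference: a gate near [1, 0] runs only the light chain and skips the
    # heavy chain; otherwise the heavy chain is selected (assumption).
    if g_light >= threshold:
        return light_chain(x)
    return heavy_chain(x)
```

With a gate of [1, 0] at inference the heavy chain is never called, which is where the saving in computational cost comes from.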
[0033] GateNet must not only select the appropriate chain accurately but also keep its own computational cost low. This invention designs three types of GateNet, as shown in Figure 4. The first gate network in Figure 4, "GateNet-I", consists of a global average pooling layer (GAP), two fully connected layers (FC1 and FC2), and a ReLU layer, and outputs a two-dimensional vector $\pi$.
[0034] The computational cost of GateNet-I is approximately that of a light chain. GateNet-II includes a convolutional layer and its computational cost is about 10 times that of a light chain, while GateNet-III includes a pooling layer and its computational cost is close to that of a light chain. In addition, the Gumbel-Softmax function is used to approximate one-hot encoding. The output of GateNet-I is defined by the following formula:
[0035] $\pi = \mathrm{ReLU}(\mathrm{FC2}(\mathrm{FC1}(\mathrm{GAP}(x_i))))$ (1);
[0036] where $x_i$ is the input to the i-th layer of the FPN. Assume the shape of the input features of GateNet is $H \times W \times C$ ($H$ is the length of the input features, $W$ is the width of the input features, and $C$ is the number of channels of the input features), and the shape of the output features is $H' \times W' \times C'$ ($H'$ is the length of the output features, $W'$ is the width of the output features, and $C'$ is the number of channels of the output features). The computational cost of GateNet-I is approximately that of a light chain. Although this cost is almost negligible, the GAP layer directly represents each $H \times W$ feature map with a single value, so the features extracted by GateNet-I lack contextual relationship information.
[0037] Convolutional layers enrich the contextual information in feature maps. The second gate network design ("GateNet-II") employs a convolutional layer (Conv) to capture contextual information. The convolutional layer is followed by a global average pooling layer (GAP), which captures the contextual information of the entire image. Finally, a fully connected layer (FC) outputs a two-dimensional vector $\pi$. The output of GateNet-II is expressed by the following formula:
[0038] $\pi = \mathrm{FC}(\mathrm{GAP}(\mathrm{Conv}(x_i)))$ (2);
[0039] GateNet-III consists of a max pooling layer (MaxPool) with a stride of 2 and a convolutional layer (Conv), followed by a global average pooling layer (GAP) and a fully connected layer (FC). Similarly, the output of GateNet-III is expressed by the following formula:
[0040] $\pi = \mathrm{FC}(\mathrm{GAP}(\mathrm{Conv}(\mathrm{MaxPool}(x_i))))$ (3);
[0041] The computational cost of GateNet-III is similar to that of light chains. Therefore, GateNet-III was used to determine the gate signals in the experiments.
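As a rough illustration of the GateNet-III pipeline (max pooling with a stride of 2, convolution, global average pooling, fully connected layer), the following pure-Python sketch runs the four stages on a tiny feature map. The 2x2 pooling window, the 1x1 (pointwise) convolution, and all function names and weights are assumptions chosen for illustration; the source does not state the kernel sizes.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]


def max_pool_2x2(x: FeatureMap) -> FeatureMap:
    """Max pooling with stride 2 over 2x2 windows (window size assumed)."""
    pooled = []
    for ch in x:
        rows = []
        for r in range(0, len(ch) - 1, 2):
            rows.append([
                max(ch[r][c], ch[r][c + 1], ch[r + 1][c], ch[r + 1][c + 1])
                for c in range(0, len(ch[r]) - 1, 2)
            ])
        pooled.append(rows)
    return pooled


def conv_1x1(x: FeatureMap, weight: List[List[float]]) -> FeatureMap:
    """Pointwise convolution; weight[o][i] mixes input channel i into output o."""
    h, w = len(x[0]), len(x[0][0])
    return [
        [[sum(w_o[i] * x[i][r][c] for i in range(len(x))) for c in range(w)]
         for r in range(h)]
        for w_o in weight
    ]


def global_avg_pool(x: FeatureMap) -> List[float]:
    """Reduce each channel to the mean of all its values."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]


def fully_connected(v: List[float], weight: List[List[float]],
                    bias: List[float]) -> List[float]:
    """Dense layer: one output per weight row."""
    return [sum(w * a for w, a in zip(row, v)) + b
            for row, b in zip(weight, bias)]


def gatenet_iii(x: FeatureMap, conv_w: List[List[float]],
                fc_w: List[List[float]], fc_b: List[float]) -> List[float]:
    """FC(GAP(Conv(MaxPool(x)))): returns the two-dimensional gate vector."""
    return fully_connected(
        global_avg_pool(conv_1x1(max_pool_2x2(x), conv_w)), fc_w, fc_b)
```

Pooling first shrinks the feature map to a quarter of its spatial size before the convolution runs, which is consistent with the statement that GateNet-III costs far less than GateNet-II.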
[0042] This invention uses the Gumbel-Softmax function as the gate activation function so that the parameters of the otherwise non-differentiable decision model can be trained. As shown in Figure 4, the Gumbel-Softmax function is a continuous, differentiable function of the class probabilities $\pi$ and predicts the k-dimensional one-hot encoding $y$ using the following formula:
[0043] $y_j = \dfrac{\exp\left((\log \pi_j + g_j)/\tau\right)}{\sum_{l=1}^{k} \exp\left((\log \pi_l + g_l)/\tau\right)}, \quad j = 1, \dots, k$ (4);
[0044] In the formula, $g_1, \dots, g_k$ are independent and identically distributed (iid) samples drawn from Gumbel(0,1), and $\tau$ is a temperature parameter. At lower $\tau$, the output of the gate activation function approximates one-hot encoding. As $\tau$ increases, the output converges to a uniform distribution. Gumbel-Softmax is a continuous distribution and provides a partial-derivative estimator.
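The Gumbel-Softmax relaxation referred to here can be sketched with the Python standard library. This is a generic version of the standard technique (logarithm of the class probabilities plus Gumbel(0,1) noise, divided by a temperature, then softmax), not code from the patent; the function names are hypothetical and the class probabilities are assumed to be strictly positive.

```python
import math
import random
from typing import List


def sample_gumbel(rng: random.Random) -> float:
    """Draw one Gumbel(0, 1) sample by inverse transform: -log(-log(U))."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_softmax(pi: List[float], tau: float, rng: random.Random) -> List[float]:
    """Relaxed one-hot sample from strictly positive class probabilities pi."""
    logits = [(math.log(p) + sample_gumbel(rng)) / tau for p in pi]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    rng = random.Random(0)
    print(gumbel_softmax([0.7, 0.3], tau=0.1, rng=rng))     # low temperature
    print(gumbel_softmax([0.7, 0.3], tau=1000.0, rng=rng))  # high temperature
```

At a low temperature the samples sit close to one-hot vectors, and at a very high temperature they approach the uniform vector, matching the behaviour described in [0044].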
[0045] This invention proposes a pseudo-label-based semi-supervised learning strategy for path selection, called "pseudo-label learning". Figure 5 describes the pseudo-label generation process. Simple and complex images are assigned pseudo-labels: simple images that existing detectors recognize easily are labeled 1, and complex images that existing detectors struggle to recognize are labeled 0. Compared with unsupervised learning, the proposed GateNet, after training with pseudo-labels, selects the appropriate chain more reasonably and achieves a better efficiency-accuracy balance. In Figure 5, the images within the gray boxes are unlabeled data, which are fed into the trained base model to estimate the average accuracy. The images within the green boxes are positive samples, and the images within the red boxes are negative samples.
[0046] The pseudo-label learning strategy in this invention is divided into two steps: pseudo-label generation, and training with the generated pseudo-labels. First, the baseline model is trained with the object detection annotations. Then, this model is used to run inference on all images in the entire dataset, and an evaluation metric scores the prediction results for each image. Images with values higher than a specified threshold are labeled 1 (simple images), and images with values below the threshold are labeled 0 (complex images). The threshold is set to 0.6 in the experiments. In this way, pseudo-labels are generated for the whole dataset. DyCC-Net is then trained with these generated pseudo-labels together with the original object detection annotations, so that object detection performance is maintained.
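The labeling rule can be sketched as follows. The per-image scores are taken as given (the source does not preserve the name of the evaluation metric), and the function names, file names, and treatment of a score exactly equal to the threshold (labeled 0 here) are assumptions. The gate targets follow the mapping in the claims, where simple images correspond to [1, 0] (light chain) and complex images to [0, 1] (heavy chain).

```python
from typing import Dict, List


def generate_pseudo_labels(scores: Dict[str, float],
                           threshold: float = 0.6) -> Dict[str, int]:
    """Label an image 1 (simple) if its score exceeds the threshold, else 0."""
    return {name: 1 if score > threshold else 0
            for name, score in scores.items()}


def gate_target(pseudo_label: int) -> List[float]:
    """Supervision target for the gate: [1, 0] selects the light chain."""
    return [1.0, 0.0] if pseudo_label == 1 else [0.0, 1.0]


if __name__ == "__main__":
    # Hypothetical per-image scores from a trained baseline detector.
    scores = {"img_001.jpg": 0.85, "img_002.jpg": 0.30}
    labels = generate_pseudo_labels(scores)
    for name, label in labels.items():
        print(name, label, gate_target(label))
```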
[0047] The advantages of this invention are:
[0048] 1. This invention proposes a drone-view detector, called DyCC-Net, that supports input-aware inference. The network skips or runs the context acquisition module according to the complexity of the input, thereby greatly reducing redundant computation and improving inference efficiency.
[0049] 2. This invention designs DyCC, which solves the problem of gradients of discrete variables not being able to propagate backward in the training network by introducing the Gumbel-Softmax function.
[0050] 3. This invention proposes a semi-supervised learning strategy based on pseudo-labels, called "pseudo-label learning". Under the guidance of this strategy, computing power is correctly allocated to various inputs, achieving a better trade-off between efficiency and accuracy.
[0051] Example
[0052] Extensive experiments were conducted on two widely used UAV-captured datasets, VisDrone2021 and UAVDT. The effectiveness of DyCC-Net was validated by comparing its object detection results with those of the top 10 state-of-the-art methods. Experimental results show that DyCC-Net's inference time is less than one-third of that of state-of-the-art detectors, which reduces the model's computational cost. The results further show that DyCC-Net's AP75 is 1.94% higher than that of the state-of-the-art (SOTA) model. The overall experimental results are shown in Table 1 below.
[0053] Table 1
[0054]
[0055] The models compared in this invention have all been published in top journals or conferences in the field of artificial intelligence, including CVPR, ECCV, ICCV, AAAI, PR, and TIP. The models compared are: SSD (Single Shot MultiBox Detector), FRCNN (Faster R-CNN), YOLOv5 (You Only Look Once version 5), DSHNet (Dual Sampler and Head Network), GLSAN (Global-Local Self-Adaptive Network), CLUSTDet (Clustered Detection), CRENet (Cluster Region Estimation Network), TPH-YOLOv5 (Transformer Prediction Heads-YOLOv5), and UFPMP-Det (Unified Foreground Packing Detector).
[0056] Figures 6 and 7 show the target detection performance of DyCC-Net on real-world UAV images. As shown in Figure 6, even when the UAV captures an image 100 meters above the ground, DyCC-Net still effectively detects cars on a distant road. As shown in Figure 7, near the ground DyCC-Net detects people in densely crowded areas. This demonstrates that DyCC-Net can effectively detect and identify targets captured by UAVs.
[0057] Table 2 summarizes the ablation results for the base model and the core modules. The computational cost of the algorithm is measured in FLOPs. YOLOv5+tinyHead+CC denotes YOLOv5 integrated with a prediction head on low-level, high-resolution feature maps and a context relationship acquisition module. The last row shows that the proposed DyCC-Net reduces the computational cost of YOLOv5+tinyHead+CC by approximately 10% (FLOPs: 456.17 G vs. 505.46 G) while achieving detection performance comparable to the state of the art (Recall: 57.01% vs. 57.16%; AP50: 59.72% vs. 59.98%).
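As a check on the figures quoted in this paragraph, the relative FLOPs saving and the absolute gaps in Recall and AP50 can be worked out directly from the quoted numbers (only the arithmetic is added here):

```latex
\begin{align*}
\text{FLOPs reduction} &= \frac{505.46 - 456.17}{505.46}
  = \frac{49.29}{505.46} \approx 0.0975 \approx 9.75\% \\
\text{Recall gap} &= 57.16\% - 57.01\% = 0.15 \text{ percentage points} \\
\text{AP50 gap} &= 59.98\% - 59.72\% = 0.26 \text{ percentage points}
\end{align*}
```

The saving of about 9.75% is what the text rounds to approximately 10%, while both accuracy metrics change by less than 0.3 percentage points.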
[0058] Table 2
[0059]
Claims
1. A dynamic context relationship acquisition system for UAV target detection, characterized in that: it includes a backbone network, which connects to a top-down path via a dynamic context collector, and the top-down path connects to a detection head. The dynamic context collector includes a dynamic gate, which is used to learn the mapping function from the feature map of the input image to the gate signal; the training process of the dynamic gate adopts a pseudo-label semi-supervised learning strategy, which includes the following steps: Step 1: The baseline model is trained with the object detection annotations. Then, the trained baseline model is used to infer the prediction results of all images in the entire dataset. An evaluation metric is used to evaluate the prediction results for each image and to generate pseudo-labels: images with values higher than the threshold of 0.6 are assigned pseudo-label 1 and defined as simple images; images with values below the threshold of 0.6 are assigned pseudo-label 0 and defined as complex images; pseudo-labels are thereby generated for the entire dataset. Step 2: During the dynamic gate training phase, the generated pseudo-labels are used as supervision signals to train the dynamic gate to learn the mapping rules: when the input image is a simple image, the dynamic gate outputs a gate signal close to [1,0], controlling the dynamic context collector to select the light chain; when the input image is a complex image, the dynamic gate outputs a gate signal close to [0,1], controlling the dynamic context collector to select the heavy chain.
2. The dynamic context relationship acquisition system for UAV target detection according to claim 1, characterized in that: The dynamic gate includes a gate network and an activation function.
3. The dynamic context relationship acquisition system for UAV target detection according to claim 2, characterized in that: The gate network consists of a global average pooling layer (GAP), a fully connected layer (FC1), a fully connected layer (FC2), and a ReLU layer connected in sequence; the output $\pi$ of the gate network is expressed by the following formula (1): $\pi = \mathrm{ReLU}(\mathrm{FC2}(\mathrm{FC1}(\mathrm{GAP}(x_i))))$ (1).
4. The dynamic context relationship acquisition system for UAV target detection according to claim 2, characterized in that: The gate network adopts a convolutional layer (Conv) to collect contextual information. The convolutional layer is followed by a global average pooling layer (GAP), which is used to capture the contextual information of the entire image. Finally, a fully connected layer (FC) outputs a two-dimensional vector $\pi$, which is expressed by the following formula (2): $\pi = \mathrm{FC}(\mathrm{GAP}(\mathrm{Conv}(x_i)))$ (2).
5. The dynamic context relationship acquisition system for UAV target detection according to claim 2, characterized in that: The gate network includes a max pooling layer (MaxPool) with a stride of 2 and a convolutional layer (Conv), followed by a global average pooling layer (GAP) and a fully connected layer (FC), and its output $\pi$ is calculated using the following formula (3): $\pi = \mathrm{FC}(\mathrm{GAP}(\mathrm{Conv}(\mathrm{MaxPool}(x_i))))$ (3).
6. The dynamic context relationship acquisition system for UAV target detection according to claim 2, characterized in that: The activation function is a continuous, differentiable function of the class probabilities $\pi$ and predicts the k-dimensional one-hot encoding $y$ using the following formula: $y_j = \dfrac{\exp\left((\log \pi_j + g_j)/\tau\right)}{\sum_{l=1}^{k} \exp\left((\log \pi_l + g_l)/\tau\right)}, \quad j = 1, \dots, k$ (4); In the formula, $g_1, \dots, g_k$ are independent and identically distributed samples drawn from Gumbel(0,1), and $\tau$ is a temperature parameter.