Unmanned aerial vehicle survey image target detection and positioning method based on deep learning
By combining reversible neural networks and capsule networks with the Lie group exponential mapping, a spatiotemporal context graph is constructed and features are aggregated. This addresses the problems of variable target scale and unstable features in UAV survey images, and achieves high-precision and high-reliability target detection and localization.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHINA RAILWAY CHENGDU RES INST OF SCI & TECH CO LTD
- Filing Date
- 2026-05-21
- Publication Date
- 2026-06-19
Figure CN122244734A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer vision technology, specifically a deep learning-based method for target detection and localization in UAV reconnaissance images.
Background Technology
[0002] Unmanned aerial vehicle (UAV) reconnaissance technology rapidly acquires information over large areas from a high-altitude perspective, playing an irreplaceable role in disaster assessment, infrastructure inspection, and other fields. Its core value lies in its ability to automatically identify and locate key targets in complex scenes, providing immediate information for decision-making. However, existing methods face a chain of challenges arising from environmental complexity when dealing with real-world reconnaissance tasks. Reconnaissance images typically contain vast and varied backgrounds, with targets varying greatly in size. Distant vehicles may be only tens of pixels in size, while nearby buildings occupy most of the image. This dramatic scale variation makes it difficult for a single feature extraction strategy to cover all targets. More importantly, the dynamic changes in perspective and lighting during UAV flight cause the apparent features (such as color, texture, and shape) of the same type of target to exhibit unstable fluctuations in continuous images, further amplifying the difficulty for models to learn stable features for multi-scale targets.
[0003] The interplay of scale variability and feature instability makes it difficult to achieve both detection accuracy and positioning reliability, the two core performance parameters, in highly dynamic reconnaissance scenarios. For example, in mountain search and rescue missions, the model may miss distant targets because of their small size, or misjudge them because nearby rock textures resemble clothing features. This coexistence of missed detections and misjudgments reduces the reliability of the target location information output by the automated system, which then fails to meet the application requirements for precise positioning. Therefore, achieving high-precision and high-reliability target recognition and location determination under the scale variability and feature fluctuations inherent in UAV mobile photography has become a key issue in improving the efficiency of autonomous UAV reconnaissance.
Summary of the Invention
[0004] The purpose of this invention is to provide a deep learning-based method for target detection and localization in UAV reconnaissance images, which effectively solves the problems of variable scale, unstable features, and coarse localization in UAV reconnaissance images, and significantly improves the accuracy and reliability of target detection and localization.
[0005] The objective of this invention can be achieved through the following technical solution. This application provides a deep learning-based method for target detection and localization in UAV reconnaissance images, including the following steps:
S1. Input the raw survey images collected by the UAV into the reversible neural network, and generate a high-resolution detail feature map and a low-resolution semantic feature map by alternating forward propagation through additive coupling layers and affine coupling layers.
S2. Align the high-resolution detail feature map with the low-resolution semantic feature map and fuse them element by element. Input the fused result into the primary capsule layer to generate the pose capsule vector set. Calculate the prediction vectors between adjacent capsule layers using the exponential mapping on a Lie group, and update the connection weights through sparse iterative routing to output the candidate target set. The candidate target set includes a category capsule and initial position parameters for each candidate target.
S3. Receive the candidate target set, take each candidate target as a node, calculate the relative position difference and the feature cosine similarity between nodes, establish edges according to thresholds to construct a spatiotemporal context graph, input the spatiotemporal context graph into a multi-layer graph attention network to aggregate the features of neighboring nodes and update the node representations, and output refined bounding boxes through a regression head.
S4. Receive each refined bounding box, sample continuous coordinate points inside it, input the continuous coordinate points into a multilayer perceptron to fit the signed distance field, and solve for the zero level set of the signed distance field to obtain the positioning contour corresponding to each refined bounding box.
S5. Receive each positioning contour and its continuous boundary distance metric, and compare each distance metric with a preset threshold. If the distance metric is greater than the preset threshold, retain the positioning contour as a final detection and positioning result; otherwise discard the contour. Output all retained contours.
[0006] The beneficial effects of this invention are as follows:
By combining the lossless multi-scale feature extraction of reversible neural networks with the pose-equivariant encoding of capsule networks based on the Lie group exponential mapping, this invention addresses the unstable appearance features caused by drastic changes in target scale and by dynamic changes in viewpoint and illumination in UAV reconnaissance images. The reversible neural network generates high-resolution detail feature maps and low-resolution semantic feature maps through alternating forward propagation of additive coupling layers and affine coupling layers, avoiding the information loss of traditional downsampling and thus representing both small distant targets and large nearby targets. The capsule network uses the Lie group exponential mapping to calculate prediction vectors, directly encoding geometric transformation parameters such as rotation and scaling into pose capsules, and updates connection weights through sparse iterative routing. This makes detection robust to viewpoint and illumination changes, improves the accuracy and stability of candidate target detection, and lays a foundation for subsequent refinement.
By constructing a spatiotemporal context graph and aggregating neighbor node features with a multi-layer graph attention network, the problem of coarse bounding box localization caused by mutual occlusion between targets and a lack of contextual association is addressed. With candidate targets as nodes, the relative position difference and the feature cosine similarity between nodes are calculated, and undirected edges are established according to thresholds to form the spatiotemporal context graph. The multi-head attention mechanism of the multi-layer graph attention network then aggregates neighbor node features and updates the node representations. This allows the features of isolated targets to be completed and corrected with information from nearby similar targets, while suppressing isolated false detections. The refined bounding box output by the regression head makes full use of non-local context information, improving the recall rate and localization accuracy, and addressing the coexistence of missed detections and false detections in complex scenarios such as mountain search and rescue.
Subpixel-level positioning contours are obtained by fitting the signed distance field with a deep implicit function and solving for its zero level set, and contours are filtered and enhanced by combining Canny edge detection with morphological filling. This addresses the low reliability of the output position information and its failure to meet the requirements of accurate positioning. Continuous coordinate points are sampled inside each refined bounding box, and the signed distance field is fitted by a multilayer perceptron. The zero level set is then extracted by the Marching Squares algorithm to obtain a positioning contour with subpixel accuracy. Furthermore, Canny edge detection is used to extract the continuous boundary point sequence, and the continuous boundary distance metric is calculated and compared with a preset threshold to retain high-confidence contours. Complete and smooth internal regions are obtained through morphological closing, hole filling, median filtering, and similar operations, yielding highly reliable, subpixel-level target contour output. This provides accurate location information for UAV survey tasks such as disaster assessment and infrastructure inspection, and improves the efficiency of autonomous UAV surveying.
Attached Figure Description
[0007] To better understand and implement this application, the technical solution is described in detail below with reference to the accompanying drawings.
[0008] Figure 1 is a flowchart illustrating the deep learning-based UAV reconnaissance image target detection and localization method provided in Embodiment 1 of this application.
Detailed Implementation
[0009] To further illustrate the technical means and effects adopted by the present invention to achieve its intended purpose, exemplary embodiments will be described in detail below, examples of which are illustrated in the accompanying drawings. In the following description, when referring to the drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application.
[0010] The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The singular forms “a,” “an,” and “the” used herein are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and / or” as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
[0011] The specific implementation, features, and effects of the present invention are described in detail below in conjunction with the accompanying drawings and preferred embodiments.
[0012] Embodiment 1. Referring to Figure 1, this embodiment provides a deep learning-based method for target detection and localization in UAV reconnaissance images. It is suitable for reconnaissance scenarios that place high demands on handling geometric transformations, small targets, and arbitrarily shaped contours, including mountain search and rescue and detailed inspection. The method includes the following steps:
S1. Input the original survey images collected by the UAV into the reversible neural network, and generate high-resolution detail feature maps and low-resolution semantic feature maps by alternating forward propagation through additive coupling layers and affine coupling layers.
When the UAV collects the original survey images, the flight path and shooting parameters (such as flight altitude, speed, and gimbal pitch angle) are planned first. The UAV carries a visible light or multispectral camera and shoots continuously at a fixed frame rate (such as 30 frames / second), while recording timestamps and inertial navigation data (GPS position, attitude angle). The images are stored in a high-resolution format (such as 1920×1080 pixels, 8-bit or 16-bit depth) on the airborne memory card or transmitted in real time via the image transmission link, ensuring that the target area is covered and that the overlap rate meets the requirements of subsequent 3D reconstruction or detection.
[0013] The reversible neural network is a deep learning model that achieves a one-to-one mapping between input and output by constructing reversible transformations. It is usually composed of multiple stacked reversible blocks, each containing an additive coupling layer and an affine coupling layer. During forward propagation, it decomposes the input image into high-resolution detail feature maps and low-resolution semantic feature maps without loss. In the inverse (backward) pass, it uses the reversible structure to reconstruct the original input exactly, thereby avoiding the information loss caused by traditional pooling or strided convolution.
[0014] In the inverse pass, the reversible neural network reconstructs the original input using the inverse transformations of the additive coupling layer and the affine coupling layer: the inverse of the additive coupling layer is subtraction at the corresponding position, and the inverse of the affine coupling layer is subtracting the translation factor and then dividing by the scaling factor. This completely restores the input image information and avoids feature loss. The spatial resolution ratio of the high-resolution detail feature map to the low-resolution semantic feature map is set to 2:1 or 4:1, that is, the size of the low-resolution feature map is 1 / 2 or 1 / 4 of the high-resolution feature map, which facilitates subsequent alignment and fusion. In addition, the number of stacked reversible blocks is set to 4 to 8 according to the input image size. When the image size is greater than 512×512 pixels, 6 to 8 reversible blocks are used to fully extract multi-scale features; when the image size is less than or equal to 512×512 pixels, 4 to 5 reversible blocks are used to balance computational efficiency.
[0015] Furthermore, the alternating forward propagation through the additive coupling layer and the affine coupling layer includes: processing the survey image through the additive coupling layer to obtain a first intermediate feature; inputting the first intermediate feature into the affine coupling layer to obtain a second intermediate feature; and repeating this alternating process at least twice, finally outputting a high-resolution detail feature map and a low-resolution semantic feature map.
[0016] Specifically:
Input splitting: the input image (or the feature map output by the previous layer) is split evenly along the channel dimension into two parts, $x_1$ and $x_2$, each containing half of the channels.
Additive coupling layer: $x_1$ is kept unchanged and is fed into a subnetwork $F$ consisting of two convolutional layers (with batch normalization and ReLU activation in between), which outputs a feature map $F(x_1)$ of the same size as $x_2$. Then $y_2 = x_2 + F(x_1)$ is computed and $y_1 = x_1$ is output directly. Concatenating $y_1$ and $y_2$ along the channel dimension gives the first intermediate feature.
Affine coupling layer: the first intermediate feature is again split along the channel dimension into $y_1$ and $y_2$. $y_1$ is kept unchanged and is fed into another convolutional subnetwork $G$, whose output has twice the number of input channels and is used as the scaling factor $s$ and the translation factor $t$ (both with the same dimensions as $y_2$). Then $z_2 = s \odot y_2 + t$ is computed ($\odot$ denotes element-wise multiplication) and $z_1 = y_1$ is output directly. Concatenating $z_1$ and $z_2$ gives the second intermediate feature.
Repetition: the second intermediate feature is used as the input to the next reversible block, where the additive coupling layer and the affine coupling layer are executed again. This process is repeated at least twice (usually by stacking 4 to 8 reversible blocks). After the last affine coupling layer, no concatenation is performed. Instead, the two output parts are used as the high-resolution detail feature map (preserving more spatial details) and the low-resolution semantic feature map (obtained through stepwise downsampling or channel transformation), respectively.
[0017] Here, $y_1$ and $y_2$ are the two feature parts output by the additive coupling layer: $y_1$ keeps the input unchanged, and $y_2$ is obtained by adding the subnetwork output $F(x_1)$ to $x_2$, which makes the transformation reversible. $z_1$ and $z_2$ are the two feature parts output by the affine coupling layer: $z_1$ keeps the input unchanged, and $z_2$ is obtained by applying an affine transformation to $y_2$ with the scaling factor $s$ and the translation factor $t$, which enhances the feature representation capability. This calculation ensures the reversibility of the neural network and avoids the loss of feature information.
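The coupling computations described above can be illustrated with a minimal NumPy sketch. It is not the patented network: the subnetworks are replaced by small per-pixel linear maps with random weights, and the names `make_subnet`, `F`, `G`, and the `exp(tanh(...))` parameterization that keeps the scaling factor positive are all assumptions introduced here for illustration. The sketch only shows that the additive step (add, undone by subtract) and the affine step (scale and shift, undone by subtracting the shift and dividing by the scale) reconstruct the input exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_subnet(c_in, c_out, hidden=16):
    """Hypothetical stand-in for a convolutional subnetwork:
    two per-pixel linear maps (1x1 'convolutions') with a ReLU in between."""
    w1 = rng.normal(scale=0.1, size=(c_in, hidden))
    w2 = rng.normal(scale=0.1, size=(hidden, c_out))
    return lambda x: np.maximum(x @ w1, 0.0) @ w2

def additive_forward(x, F):
    x1, x2 = np.split(x, 2, axis=-1)          # split along channels
    return np.concatenate([x1, x2 + F(x1)], axis=-1)

def additive_inverse(y, F):
    y1, y2 = np.split(y, 2, axis=-1)
    return np.concatenate([y1, y2 - F(y1)], axis=-1)   # subtraction undoes addition

def affine_params(y1, G):
    raw_s, t = np.split(G(y1), 2, axis=-1)    # G outputs twice the channels
    s = np.exp(np.tanh(raw_s))                # assumption: keep s > 0 so division is safe
    return s, t

def affine_forward(y, G):
    y1, y2 = np.split(y, 2, axis=-1)
    s, t = affine_params(y1, G)
    return np.concatenate([y1, s * y2 + t], axis=-1)

def affine_inverse(z, G):
    z1, z2 = np.split(z, 2, axis=-1)
    s, t = affine_params(z1, G)               # z1 == y1, so s and t are recomputed exactly
    return np.concatenate([z1, (z2 - t) / s], axis=-1)

C = 8
F = make_subnet(C // 2, C // 2)
G = make_subnet(C // 2, C)                    # C outputs = s and t, C/2 channels each
x = rng.normal(size=(4, 4, C))                # toy H x W x C feature map
z = affine_forward(additive_forward(x, F), G)
x_rec = additive_inverse(affine_inverse(z, G), F)
```

Because the unchanged half is passed through as-is, the inverse can recompute `F`, `s`, and `t` from it without ever inverting the subnetworks themselves.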
[0018] Specifically, by using lossless multi-scale feature extraction through reversible neural networks, the problem of losing small target details due to drastic changes in target scale in UAV survey images is solved. This achieves the extraction of both high-resolution details and low-resolution semantics, providing a complete feature foundation for subsequent detection and localization.
[0019] S2. Align the high-resolution detail feature map with the low-resolution semantic feature map and fuse them element by element. Input the fused result into the primary capsule layer to generate a pose capsule vector set. Calculate the prediction vectors between adjacent capsule layers using the exponential mapping on a Lie group and update the connection weights through sparse iterative routing. Output a candidate target set, including the category capsule and initial position parameters for each candidate target.
Furthermore, aligning the high-resolution detail feature map with the low-resolution semantic feature map and fusing them element-wise includes the following. First, the low-resolution semantic feature map is upsampled to the same spatial size as the high-resolution detail feature map using bilinear interpolation. If the number of channels of the two differs, a 1×1 convolution is then used to transform the number of channels of the low-resolution feature map to match that of the high-resolution feature map, giving the aligned low-resolution feature map. Then, the aligned low-resolution feature map and the original high-resolution detail feature map are added element-wise at the same spatial location and channel to obtain the fused feature map. Finally, an optional normalization operation (such as batch normalization or layer normalization) is applied to the sum to prevent the numerical range from becoming too large, and the final fused feature map is output for subsequent network use.
For example, let the high-resolution detail feature map be $F_h \in \mathbb{R}^{H \times W \times C}$ and the low-resolution semantic feature map be $F_l \in \mathbb{R}^{h \times w \times C'}$, where $H$, $W$, and $C$ are the height, width, and number of channels of the high-resolution detail feature map, and $h$, $w$, and $C'$ are the height, width, and number of channels of the low-resolution feature map. The aligned low-resolution feature map is calculated as $\tilde{F}_l = \mathrm{Conv}_{1\times1}(\mathrm{Up}(F_l))$, where $\mathrm{Up}$ denotes bilinear interpolation upsampling and $\mathrm{Conv}_{1\times1}$ is the optional 1×1 convolution (used when $C' \neq C$; otherwise it is an identity mapping). The element-wise addition is $F_{fuse} = F_h + \tilde{F}_l$, and the final normalization is $F_{out} = \mathrm{Norm}(F_{fuse})$, where $\mathrm{Norm}$ denotes batch normalization or layer normalization.
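As a rough illustration of the alignment and fusion step, the following NumPy sketch upsamples a low-resolution map bilinearly, applies a per-pixel linear map in place of the 1×1 convolution when the channel counts differ, adds the two maps, and applies layer normalization over the channel axis. The function names, the align-corners sampling convention, the random weights, and the choice of layer normalization are assumptions made here for the example; the source leaves these details open.

```python
import numpy as np

def bilinear_upsample(f, H, W):
    """Bilinear upsampling of an (h, w, c) map to (H, W, c).
    Assumption: 'align corners' sampling, so corner values are preserved."""
    h, w, _ = f.shape
    ys = np.linspace(0, h - 1, H)
    xs = np.linspace(0, w - 1, W)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]             # vertical interpolation weights
    wx = (xs - x0)[None, :, None]             # horizontal interpolation weights
    top = f[y0][:, x0] * (1 - wx) + f[y0][:, x1] * wx
    bot = f[y1][:, x0] * (1 - wx) + f[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def align_and_fuse(f_high, f_low, w_1x1=None, eps=1e-5):
    H, W, C = f_high.shape
    aligned = bilinear_upsample(f_low, H, W)
    if aligned.shape[-1] != C:
        aligned = aligned @ w_1x1             # 1x1 convolution == per-pixel linear map
    fused = f_high + aligned                  # element-wise addition
    mean = fused.mean(axis=-1, keepdims=True) # layer normalization over channels
    var = fused.var(axis=-1, keepdims=True)
    return (fused - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
f_high = rng.normal(size=(8, 8, 4))           # H x W x C
f_low = rng.normal(size=(4, 4, 6))            # h x w x C' (2:1 ratio, C' != C)
w_1x1 = rng.normal(scale=0.1, size=(6, 4))    # hypothetical 1x1 conv weights
fused = align_and_fuse(f_high, f_low, w_1x1)
up = bilinear_upsample(f_low, 8, 8)
```

When the channel counts already match, `w_1x1` is not used, which corresponds to the identity mapping in the text.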
[0020] Inputting the fused feature map into the primary capsule layer to generate a pose capsule vector set specifically includes the following. The fused feature map obtained after alignment and element-wise addition is used as the input to the primary capsule layer. The primary capsule layer consists of a set of parallel convolutional capsule units. Each convolutional capsule unit contains a convolutional kernel (size k×k, stride s, and D output channels). Unlike ordinary convolution, however, the output of this kernel is not a scalar but a d-dimensional vector (the pose of the capsule). Specifically, N such convolutional capsule units are applied at each spatial location of the input feature map, each producing a d-dimensional vector. The output tensor therefore has shape $H' \times W' \times N \times d$, where $H'$ and $W'$ are determined by the convolution stride and padding. The output tensor is reorganized into a set of capsule vectors: the N d-dimensional vectors at each spatial location are regarded as N pose capsules, and all capsules over the entire feature map constitute the capsule vector set. Each capsule vector represents a partial feature of the candidate target detected at that location; its magnitude represents the probability that the target exists, and its orientation encodes the target's pose parameters (such as position offset and rotation).
[0021] Finally, these primary capsule vectors are normalized, for example with the squashing function $\mathrm{squash}(v) = \frac{\|v\|^2}{1 + \|v\|^2} \cdot \frac{v}{\|v\|}$, where $v$ is the original output vector of the capsule (e.g., the vector obtained by weighted summation in the previous routing iteration) and $\|v\|$ is the magnitude of $v$, a non-negative real number. The squashing function ensures that the magnitude of the vector lies between 0 and 1, giving the final pose capsule vector set for use by subsequent adjacent capsule layers.
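A short NumPy sketch of the squashing function follows, assuming the commonly used form in which a vector of norm n is rescaled to norm n² / (1 + n²) while keeping its direction. The small `eps` term is an addition made here to avoid division by zero for all-zero capsules.

```python
import numpy as np

def squash(v, axis=-1, eps=1e-9):
    """Rescale each vector so its norm lies in [0, 1) and its direction is kept."""
    sq_norm = np.sum(v * v, axis=axis, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm)         # scalar compression factor
    return scale * v / np.sqrt(sq_norm + eps)

capsules = np.array([[3.0, 4.0],              # norm 5   -> squashed norm 25/26
                     [0.03, 0.04],            # norm 0.05 -> squashed norm close to 0
                     [0.0, 0.0]])             # zero capsule stays zero
squashed = squash(capsules)
norms = np.linalg.norm(squashed, axis=-1)
```

Long vectors end up with a magnitude close to 1 and short vectors with a magnitude close to 0, which is what allows the magnitude to be read as an existence probability.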
[0022] Furthermore, outputting the candidate target set includes the following. The pose capsule vector set output by the primary capsule layer is used as the set of bottom capsules. For each bottom capsule $i$ and each high-level capsule $j$, the pose vector $u_i$ of the bottom capsule is transformed to the Lie algebra space through the logarithmic mapping, multiplied by the learnable weight matrix $W_{ij}$, and then mapped back to the Lie group space through the exponential mapping to obtain the prediction vector $\hat{u}_{j|i}$. This prediction vector represents the contribution of the bottom capsule to the target category and pose represented by the high-level capsule. The prediction vector is expressed as $\hat{u}_{j|i} = \exp(W_{ij} \log(u_i))$: the logarithmic mapping $\log$ transforms the pose vector into the Lie algebra, where it is transformed linearly, and the exponential mapping $\exp$ maps the result back to the Lie group, which ensures the equivariance of the prediction vector under geometric transformations such as rotation and scaling. $W_{ij}$ is the learnable weight matrix.
The coupling coefficients $c_{ij}$ from all bottom capsules to the high-level capsules are initialized uniformly. In each routing iteration, the input vector of each high-level capsule is calculated first as the weighted sum of the prediction vectors of all bottom capsules according to the coupling coefficients, $s_j = \sum_i c_{ij} \hat{u}_{j|i}$, and the output vector of the high-level capsule is then obtained through the squashing (compression) function: $v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \cdot \frac{s_j}{\|s_j\|}$, where $s_j$ is the input vector and $\frac{\|s_j\|^2}{1 + \|s_j\|^2}$ is a scalar compression factor that approaches 1 when $\|s_j\|$ is very large and approaches 0 when $\|s_j\|$ is very small.
[0023] The log-prior probability of the coupling coefficient is updated based on the consistency of the dot product between the output vector and the prediction vector, and the coupling coefficient is recalculated using softmax. Meanwhile, a sparsity constraint is introduced after each iteration: only the K largest coupling coefficients corresponding to each bottom capsule are retained, and the rest are set to zero. This iteration process is repeated several times. After the routing iteration is completed, the output vector of each high-level capsule is the category capsule of the candidate target. Its magnitude represents the probability of the existence of the category. The direction of the vector encodes the initial position parameters of the target (such as center coordinates, width and height). All high-level capsules with magnitudes exceeding the preset threshold and their corresponding position parameters are output as the candidate target set.
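The routing loop with the Top-K sparsity step can be sketched in NumPy as below. This is an illustrative reading of the description, not the patented implementation: the array layout `u_hat[i, j]` (prediction vector from bottom capsule i to high-level capsule j), the renormalization of the surviving coefficients so that they sum to 1, and the recomputation of the output vectors from the final sparse coefficients are assumptions made here.

```python
import numpy as np

def squash(s, eps=1e-9):
    sq = np.sum(s * s, axis=-1, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def softmax(b, axis):
    e = np.exp(b - b.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparse_routing(u_hat, num_iters=3, top_k=2):
    """u_hat: (I, J, d) prediction vectors from I bottom to J high-level capsules."""
    I, J, _ = u_hat.shape
    b = np.zeros((I, J))                       # log-prior routing logits
    c = np.full((I, J), 1.0 / J)               # uniform initial coupling coefficients
    for _ in range(num_iters):
        s = np.einsum('ij,ijd->jd', c, u_hat)  # weighted sum per high-level capsule
        v = squash(s)
        b = b + np.einsum('ijd,jd->ij', u_hat, v)   # dot-product agreement
        c = softmax(b, axis=1)
        # Top-K sparsity: zero all but the K largest coefficients of each bottom capsule.
        drop = np.argsort(c, axis=1)[:, :J - top_k]
        np.put_along_axis(c, drop, 0.0, axis=1)
        c = c / c.sum(axis=1, keepdims=True)   # assumption: renormalize survivors to sum to 1
    v = squash(np.einsum('ij,ijd->jd', c, u_hat))   # outputs under the final sparse coupling
    return v, c

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(6, 3, 4))             # 6 bottom capsules, 3 categories, d = 4
v, c = sparse_routing(u_hat, num_iters=3, top_k=2)
keep = np.linalg.norm(v, axis=-1) > 0.5        # hypothetical existence threshold
```

Each row of `c` ends up with exactly `top_k` non-zero entries, so every bottom capsule votes only for the high-level capsules it agrees with most.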
[0024] The pose capsule vector set output by the primary capsule layer is compressed to a magnitude between 0 and 1, which is used directly as the probability that a target exists. When the prediction vector between adjacent capsule layers is calculated by the Lie group exponential mapping, the logarithmic mapping and the exponential mapping rely on the correspondence between the Lie group and its Lie algebra. Specifically, the exponential mapping on the SO(2) or SO(3) group is used for rotation parameters, and the mapping on the affine group is used for scale and translation parameters, so that the prediction vector is equivariant with respect to changes in the UAV's viewpoint. The sparsity constraint in the sparse iterative routing adopts the Top-K strategy, with K preset to 10% to 20% of the number of bottom capsules. The coupling coefficients are L1-regularized after each routing iteration to enhance the sparsity and convergence stability of the routing. In addition, the initial position parameters of each target in the candidate target set include the bounding box center coordinates (x, y), width w, and height h. These parameters are decoded from the last 4 dimensions of the high-level capsule output vector, while the first d-4 dimensions encode the target's rotation angle and scale factor.
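For the rotation parameters, the logarithm, linear transform, exponential sequence can be illustrated on SO(2), where the Lie algebra is one-dimensional and the logarithm of a rotation matrix is simply its angle. In this sketch the learnable weight is reduced to a scalar `w_ij`, and the function names are assumptions introduced here; SO(3) and the affine group would need matrix-valued versions of the same two maps.

```python
import numpy as np

def so2_exp(theta):
    """Exponential map: Lie algebra element (an angle) -> 2x2 rotation matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def so2_log(R):
    """Logarithmic map: rotation matrix -> angle in (-pi, pi]."""
    return np.arctan2(R[1, 0], R[0, 0])

def predict_pose(R_i, w_ij):
    """Prediction exp(w_ij * log(R_i)) for a bottom-capsule rotation R_i."""
    return so2_exp(w_ij * so2_log(R_i))

R_i = so2_exp(0.3)                  # pose of a bottom capsule
w_ij = 2.0                          # hypothetical scalar weight
pred = predict_pose(R_i, w_ij)      # rotation by 0.6 rad

# Rotating the input by 0.2 rad rotates the prediction by w_ij * 0.2 rad.
view_change = so2_exp(0.2)
pred_rotated = predict_pose(view_change @ R_i, w_ij)
expected = so2_exp(w_ij * 0.2) @ pred
```

The last three lines show the sense in which the prediction is equivariant: a viewpoint rotation applied to the input produces a predictable rotation of the output rather than an arbitrary change. The identity holds here because the angles stay inside the (-pi, pi] branch of the logarithm.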
[0025] Specifically, by fusing high-resolution details with low-resolution semantic features and using capsule networks combined with the Lie group exponential mapping to encode pose in an equivariant manner, the problem of unstable appearance features caused by changes in UAV perspective and illumination is addressed, and candidate target detection and initial localization that are robust to geometric transformations such as rotation and scaling are achieved.
[0026] S3. Receive the candidate target set, take each candidate target as a node, calculate the relative position difference and the feature cosine similarity between nodes, establish edges according to thresholds to construct a spatiotemporal context graph, input the graph into the multi-layer graph attention network to aggregate the features of neighboring nodes and update the node representations, and output the refined bounding boxes through the regression head.
Furthermore, calculating the relative position difference and the feature cosine similarity between nodes includes the following. The center point coordinates (x and y) of each target are read from the initial position parameters in the candidate target set. At the same time, a fixed-length feature vector (e.g., 128-dimensional or 256-dimensional) is extracted from the corresponding category capsule. This vector encodes the semantic information of the target (such as category, texture, and shape).
For any two nodes, the Euclidean (straight-line) distance between their center points is calculated and then divided by the diagonal length of the current image (or a preset maximum distance constant) to obtain the normalized relative position difference. This normalization makes the position difference fall between 0 and 1; the smaller the value, the closer the two targets are in image space.
For any two nodes, the dot product of their feature vectors (the sum of the products of corresponding dimensions) and the magnitude of each vector (the square root of the sum of the squares of its dimensions) are calculated. Dividing the dot product by the product of the two magnitudes gives the cosine similarity. This value lies between -1 and 1; the larger the value, the more similar the semantic features of the two targets.
After these calculations, the position difference is compared with a preset position threshold (e.g., 0.3) and the similarity with a preset similarity threshold (e.g., 0.6). When the position difference is less than the position threshold and the similarity is greater than the similarity threshold, an undirected edge is established between the two nodes. All nodes and edges together form the spatiotemporal context graph.
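The edge rule described above translates almost directly into array operations. The following NumPy sketch computes the normalized center distance and the cosine similarity for every node pair and thresholds them into a boolean adjacency matrix. The function name and the toy inputs are made up for illustration, and the sketch assumes that no feature vector is all zeros.

```python
import numpy as np

def build_context_graph(centers, feats, img_w, img_h, pos_thr=0.3, sim_thr=0.6):
    """Return a symmetric boolean adjacency matrix over candidate targets."""
    centers = np.asarray(centers, dtype=float)
    feats = np.asarray(feats, dtype=float)
    diag = np.hypot(img_w, img_h)                       # image diagonal in pixels
    diff = centers[:, None, :] - centers[None, :, :]
    pos = np.linalg.norm(diff, axis=-1) / diag          # normalized distance in [0, 1]
    unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = unit @ unit.T                                 # cosine similarity in [-1, 1]
    adj = (pos < pos_thr) & (sim > sim_thr)             # both conditions must hold
    np.fill_diagonal(adj, False)                        # no self-edges in the raw graph
    return adj

# Toy example on a 1920x1080 frame: two nearby similar targets and one far away.
centers = [(100, 100), (150, 120), (1800, 1000)]
feats = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.0]]
adj = build_context_graph(centers, feats, 1920, 1080)
```

In this example only the first two candidates are linked: the third has identical features to the first but lies too far away, so the position threshold blocks the edge.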
[0027] Furthermore, outputting the refined bounding box through the regression head includes the following. After being input into the multi-layer graph attention network, the feature vector of each node undergoes multi-layer propagation and aggregation to obtain an enhanced representation that incorporates contextual information. This enhanced representation is then input into a regression head consisting of two fully connected layers. The first layer maps the features to an intermediate dimension and applies ReLU activation. The second layer directly outputs the absolute coordinates of the refined bounding box (center point x and y coordinates, width, and height). Unlike the initial position parameters in step S2, this regression head does not depend on an offset from the initial box but independently predicts the final bounding box. Its formula is $\hat{b}_i = \mathrm{FC}_2(\mathrm{ReLU}(\mathrm{FC}_1(h_i^{(L)})))$, where $h_i^{(L)}$ is the output feature of node $i$ after passing through the $L$-layer graph attention network, $\mathrm{FC}_1$ and $\mathrm{FC}_2$ are the two fully connected layers, and the output $\hat{b}_i = (\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i)$ is the refined bounding box, in which $\hat{x}_i$ and $\hat{y}_i$ are the center point coordinates of the refined bounding box and $\hat{w}_i$ and $\hat{h}_i$ are its width and height.
[0028] The multi-layer graph attention network consists of two stacked graph attention layers, each with four attention heads. Each attention head independently calculates the normalized attention coefficients of its neighboring nodes, and the multi-head outputs are concatenated and linearly transformed to reduce the dimensionality to 256 as node update features. The first fully connected layer of the regression head maps the 256-dimensional input to 128 dimensions and connects it to a ReLU activation layer and a Dropout layer with a dropout rate of 0.2. The second fully connected layer directly outputs 4-dimensional refined bounding box coordinates (center point x-coordinate, center point y-coordinate, width, and height). The position threshold and similarity threshold are dynamically adjusted according to the UAV's flight altitude and image resolution. When the flight altitude is below 50 meters, the position threshold is set to 0.2, and when it is above 100 meters, it is set to 0.4. The similarity threshold is fixed at 0.6. After the spatiotemporal context graph is constructed, if the number of nodes in the graph is less than two, the graph attention network is skipped, and the initial position parameters are directly output as refined bounding boxes.
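A compact NumPy sketch of this configuration (two graph attention layers with four heads each, a 256-dimensional node update, and a 256 → 128 → 4 regression head) is given below with random, untrained weights. The attention scoring follows the commonly used GAT form with a LeakyReLU over concatenated node features; that form, the self-loops, the per-head width of 64, and all names are assumptions made here, since the source does not spell them out.

```python
import numpy as np

rng = np.random.default_rng(0)

def gat_head(h, adj, W, a, slope=0.2):
    """One attention head: returns aggregated features and attention weights."""
    z = h @ W
    f = z.shape[1]
    e = (z @ a[:f])[:, None] + (z @ a[f:])[None, :]   # score for [z_i || z_j]
    e = np.where(e > 0, e, slope * e)                 # LeakyReLU
    mask = adj | np.eye(len(h), dtype=bool)           # assumption: add self-loops
    e = np.where(mask, e, -np.inf)                    # non-neighbors get zero weight
    att = np.exp(e - e.max(axis=1, keepdims=True))
    att = att / att.sum(axis=1, keepdims=True)        # softmax over neighbors
    return att @ z, att

def init_layer(d_in, d_head=64, n_heads=4, d_out=256):
    heads = [(rng.normal(scale=0.1, size=(d_in, d_head)),
              rng.normal(scale=0.1, size=2 * d_head)) for _ in range(n_heads)]
    W_out = rng.normal(scale=0.1, size=(n_heads * d_head, d_out))
    return heads, W_out

def gat_layer(h, adj, heads, W_out):
    outs = [gat_head(h, adj, W, a)[0] for W, a in heads]
    return np.concatenate(outs, axis=1) @ W_out       # concatenate heads, project to 256

def regression_head(h, W1, b1, W2, b2):
    hidden = np.maximum(h @ W1 + b1, 0.0)             # dropout (rate 0.2) is off at inference
    return hidden @ W2 + b2                           # (x, y, w, h) per node

def refine_boxes(feats, adj, init_boxes, layers, head):
    if len(feats) < 2:                                # too few nodes: skip the graph network
        return init_boxes
    h = feats
    for heads, W_out in layers:
        h = gat_layer(h, adj, heads, W_out)
    return regression_head(h, *head)

feats = rng.normal(size=(3, 128))
adj = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=bool)
init_boxes = rng.uniform(0, 100, size=(3, 4))
layers = [init_layer(128), init_layer(256)]
head = (rng.normal(scale=0.1, size=(256, 128)), np.zeros(128),
        rng.normal(scale=0.1, size=(128, 4)), np.zeros(4))
boxes = refine_boxes(feats, adj, init_boxes, layers, head)
_, att = gat_head(feats, adj, *layers[0][0][0])
```

With self-loops included, an isolated node (the third one here) attends only to itself, so its representation is transformed but not mixed with unrelated targets.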
[0029] Specifically, by constructing a spatiotemporal context graph and using a multi-layer graph attention network to aggregate the positional and semantic features of neighboring nodes, the problem of coarse bounding box localization caused by mutual occlusion between targets and lack of context is solved, and context-aware refined bounding box output is achieved, which significantly improves localization accuracy.
[0030] S4. Receive each refined bounding box, sample continuous coordinate points inside it, and input the coordinate points into a multilayer perceptron to fit a signed distance field. Solve the zero level set of the signed distance field to obtain the positioning contour corresponding to each refined bounding box. Furthermore, sampling continuous coordinate points within each refined bounding box and inputting them into a multilayer perceptron to fit the signed distance field includes the following. For each refined bounding box, uniform grid sampling points (e.g., with a stride of 2 pixels) are first generated within its internal region. The distance from each sampling point to the nearest ground-truth boundary is calculated to obtain an initial signed distance field. Then, near-boundary regions with absolute distances less than a threshold (e.g., 5 pixels) are identified, and the sampling points are densified within these regions (stride reduced to 0.5 pixels). Regions far from the boundary are either kept sparsely sampled or have some points randomly discarded. Finally, a set of continuous coordinate points $\{(x_k, y_k)\}_{k=1}^{K}$ is obtained within each bounding box, where K is adaptively determined based on the bounding box area (approximately 500 to 2000 points). The coordinates $(x_k, y_k)$ of the sampling points are fed as input into a multilayer perceptron (MLP). This MLP typically contains 3 to 5 fully connected layers, each with 64 to 256 neurons, using the ReLU activation function, and the last layer outputs a single scalar value $\hat{d}_k$. This value represents the predicted signed distance of the point (positive values indicate the point is outside the target, negative values indicate it is inside, and zero values indicate it is on the boundary). During training, the true signed distance (calculated from the labeled contours) is used as the supervision signal, and the loss function is the L2 distance:
$$L_{sdf} = \frac{1}{K}\sum_{k=1}^{K}\left\|\hat{d}_k - d_k\right\|_2^2$$
where K is the total number of sampling points within a single refined bounding box, k is the index of a sampling point, $\hat{d}_k$ is the predicted signed distance value for the k-th sampling point, $d_k$ is the true signed distance value of the k-th sampling point, calculated from the manually annotated target contour, and $\|\cdot\|_2^2$ is the squared L2 norm. In addition, a gradient regularization term can be added to encourage the gradient magnitude of the distance field output by the MLP to be 1 near the boundary.
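The two-stage sampling described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: `circle_sdf` is a hypothetical stand-in for the distance to an annotated boundary, the names `sample_box`, `frange`, and `keep_far` are assumptions, and the adaptive choice of K from the box area is not modeled.

```python
import math
import random

def circle_sdf(x, y, cx=16.0, cy=16.0, r=8.0):
    """Hypothetical ground truth: signed distance to a circle (negative inside)."""
    return math.hypot(x - cx, y - cy) - r

def frange(start, stop, step):
    """Yield start, start + step, ... up to and including stop."""
    n = int(math.floor((stop - start) / step + 1e-9))
    for i in range(n + 1):
        yield start + i * step

def sample_box(box, sdf, coarse=2.0, fine=0.5, band=5.0, keep_far=0.5, seed=0):
    """Return (x, y, d) samples: dense near the boundary, thinned far from it."""
    x0, y0, x1, y1 = box
    rng = random.Random(seed)
    samples = {}  # keyed by coordinate so overlapping refinements are deduplicated
    for gx in frange(x0, x1, coarse):
        for gy in frange(y0, y1, coarse):
            d = sdf(gx, gy)
            if abs(d) < band:
                # Near the boundary: refine the coarse cell centered on this point
                for fx in frange(gx - coarse / 2, gx + coarse / 2, fine):
                    for fy in frange(gy - coarse / 2, gy + coarse / 2, fine):
                        if x0 <= fx <= x1 and y0 <= fy <= y1:
                            samples[(round(fx, 3), round(fy, 3))] = sdf(fx, fy)
            elif rng.random() < keep_far:
                # Far from the boundary: randomly discard part of the coarse points
                samples[(round(gx, 3), round(gy, 3))] = d
    return [(x, y, d) for (x, y), d in samples.items()]
```

Calling `sample_box((0.0, 0.0, 32.0, 32.0), circle_sdf)` returns a list in which most points lie within 5 pixels of the circle, spaced 0.5 pixels apart.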
[0031] For each refined bounding box, the MLP outputs the predicted signed distance values at all sampling points within the box. These values constitute a continuous signed distance field (implicit representation) for the bounding box region. This distance field is used in the next step, zero level set extraction. The overall process is expressed by the following formula: $\hat{d}_k = f_\theta(x_k, y_k)$, where $f_\theta$ denotes the multilayer perceptron, $(x_k, y_k)$ represents the continuous coordinates of the k-th sampling point, and θ represents the learnable parameters of the MLP.
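The training objective for the fitted field can be evaluated as in the sketch below. Several points are assumptions rather than statements of the source: the loss is averaged over the samples, the gradient regularization is written as a squared deviation of the gradient magnitude from 1 and is applied at every sample instead of only near the boundary, the weight `eikonal_weight` is hypothetical, and `f` is any callable standing in for the MLP. A real training loop would obtain the gradient by automatic differentiation; central finite differences are used here only to keep the example dependency-free.

```python
import math

def sdf_loss(f, samples, eikonal_weight=0.1, h=1e-3):
    """Mean squared signed-distance error plus a gradient-magnitude penalty.

    f       : callable (x, y) -> predicted signed distance (stand-in for the MLP)
    samples : list of (x, y, d_true) tuples
    """
    k = len(samples)
    data_term = 0.0
    grad_term = 0.0
    for x, y, d_true in samples:
        data_term += (f(x, y) - d_true) ** 2
        # Central finite differences approximate the gradient of f at (x, y)
        gx = (f(x + h, y) - f(x - h, y)) / (2 * h)
        gy = (f(x, y + h) - f(x, y - h)) / (2 * h)
        grad_term += (math.hypot(gx, gy) - 1.0) ** 2
    return data_term / k + eikonal_weight * grad_term / k
```

An exact signed distance function gives a loss close to zero, while a field scaled by a factor of 2 is penalized by both terms.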
[0032] Furthermore, solving for the zero level set of the signed distance field to obtain the positioning contour corresponding to each refined bounding box specifically includes the following. Within the local area covered by each refined bounding box, the continuous coordinate space is discretized into a high-resolution grid (e.g., a grid of size M×M, where M is 128 or 256). For each grid point $(x_i, y_j)$ on the grid, the pre-trained multilayer perceptron (MLP) is invoked to calculate the signed distance value of the point, which yields the discrete signed distance field $\hat{d}_{i,j} = f_\theta(x_i, y_j)$ over the entire grid, where $f_\theta$ denotes the MLP and $(x_i, y_j)$ represents the coordinates of grid points on the discretized grid. The signed distance values of each cell (composed of four adjacent grid points) are checked one by one. If the signed distance values of the four corner points of a cell differ in sign (i.e., both positive and negative values are present), the target boundary passes through the cell. The intersection points of the boundary and the cell edges are calculated by linear interpolation according to the sign pattern (16 cases in total). These intersection points are connected in order to form line segments. Finally, the line segments in all cells are spliced to form one or more closed contour lines. This contour line is the zero level set, which corresponds to the precise boundary of the target. The extracted initial contour lines are then post-processed: first, Gaussian filtering or spline interpolation is used to remove jagged noise; then, the Douglas-Peucker algorithm is used to reduce the number of redundant vertices while maintaining the contour shape; finally, the positioning contour corresponding to each refined bounding box is obtained and output in the form of an ordered vertex list.
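The cell-by-cell zero crossing search can be sketched as below. This is a simplified illustration under stated assumptions: `sdf` is any callable standing in for the trained MLP, the function name `zero_crossings` is hypothetical, the 16 sign patterns are handled generically by testing each cell edge for a sign change, saddle cells are paired in edge order, and the splicing of segments into closed contours is omitted.

```python
import math

def zero_crossings(sdf, box, m=64):
    """Return line segments approximating the set sdf == 0 on an m x m grid over box."""
    x0, y0, x1, y1 = box
    xs = [x0 + (x1 - x0) * i / (m - 1) for i in range(m)]
    ys = [y0 + (y1 - y0) * j / (m - 1) for j in range(m)]
    val = [[sdf(x, y) for x in xs] for y in ys]  # val[j][i] is the value at (xs[i], ys[j])

    def interp(p, q, vp, vq):
        # Linear interpolation of the zero crossing on the edge from p to q
        t = vp / (vp - vq)
        return (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1]))

    segments = []
    for j in range(m - 1):
        for i in range(m - 1):
            corners = [(xs[i], ys[j]), (xs[i + 1], ys[j]),
                       (xs[i + 1], ys[j + 1]), (xs[i], ys[j + 1])]
            vals = [val[j][i], val[j][i + 1], val[j + 1][i + 1], val[j + 1][i]]
            pts = []
            for a in range(4):
                b = (a + 1) % 4
                # An edge is crossed when its two corners lie on different sides
                if (vals[a] < 0) != (vals[b] < 0):
                    pts.append(interp(corners[a], corners[b], vals[a], vals[b]))
            # 2 crossings give one segment; 4 crossings (saddle) give two segments
            for s in range(0, len(pts) - 1, 2):
                segments.append((pts[s], pts[s + 1]))
    return segments
```

For a circular test field, the segment endpoints land on the circle to within the interpolation error, and the summed segment length approaches the circumference as the grid is refined.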
[0033] Specifically, the extracted initial contour lines undergo post-processing as follows. First, the initial contour lines extracted by the Marching Squares algorithm (composed of a series of discrete pixel-level vertices connected sequentially) are smoothed and denoised with Gaussian filtering. The vertex coordinates are treated as a two-dimensional signal, and a Gaussian kernel with a standard deviation σ = 1.0 to 1.5 (kernel size 5×5 or 7×7) is used to convolve the vertex sequence. That is, the horizontal and vertical coordinates of each vertex are replaced by a weighted average of its own coordinates and those of its neighboring vertices along the contour, with the weights determined by a Gaussian function. This eliminates the sawtooth fluctuations caused by discrete sampling and numerical errors in the signed distance field. Alternatively, cubic spline interpolation can be used: the original vertex sequence is used as control points, a cubic B-spline curve is fitted, and resampling is then performed at equal parameter intervals to obtain a smoother contour line. Second, the Douglas-Peucker algorithm is used to reduce the number of contour vertices: a distance threshold ε is set to 1.0 to 2.0 pixels (adjustable according to the contour size, usually 1.5 pixels). The algorithm recursively finds the vertex farthest from the line connecting the start and end points of the current contour segment. If the distance is greater than ε, the vertex is retained and the segment is split at that vertex and processed recursively; otherwise, all intermediate vertices are discarded. The retained vertices constitute the simplified contour, which significantly reduces redundant vertices while maintaining the overall shape of the contour, thus reducing storage and subsequent processing overhead. 
Finally, the post-processed contour vertices are arranged into an ordered list in their original order (clockwise or counterclockwise), with the coordinates of each vertex being sub-pixel precision (floating-point numbers), which serves as the final positioning contour output corresponding to the refined bounding box.
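The two post-processing steps can be illustrated with the sketch below. It assumes a closed contour for the smoothing step (indices wrap around) and an open polyline for the simplification step; a closed contour would first have to be split at one or two anchor vertices, which is not shown. The function names and the default `radius` are hypothetical.

```python
import math

def gaussian_smooth_closed(points, sigma=1.0, radius=2):
    """Smooth a closed contour by circular 1-D Gaussian averaging of x and y."""
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in offsets]
    total = sum(weights)
    n = len(points)
    out = []
    for i in range(n):
        sx = sy = 0.0
        for k, w in zip(offsets, weights):
            px, py = points[(i + k) % n]  # wrap around: the contour is closed
            sx += w * px
            sy += w * py
        out.append((sx / total, sy / total))
    return out

def point_line_distance(p, a, b):
    """Perpendicular distance from p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    length = math.hypot(bx - ax, by - ay)
    if length == 0.0:
        return math.hypot(px - ax, py - ay)
    return abs((bx - ax) * (ay - py) - (ax - px) * (by - ay)) / length

def douglas_peucker(points, eps=1.5):
    """Simplify an open polyline, keeping vertices farther than eps from the chord."""
    if len(points) < 3:
        return list(points)
    dmax, idx = 0.0, 0
    for i in range(1, len(points) - 1):
        d = point_line_distance(points[i], points[0], points[-1])
        if d > dmax:
            dmax, idx = d, i
    if dmax > eps:
        left = douglas_peucker(points[:idx + 1], eps)
        right = douglas_peucker(points[idx:], eps)
        return left[:-1] + right  # drop the duplicated split vertex
    return [points[0], points[-1]]
```

Because the Gaussian weights are normalized, the smoothing leaves the centroid of the vertex set unchanged, and the simplification keeps only vertices that deviate from the local chord by more than ε.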
[0034] The zero level set (target contour) is represented as follows: $\Gamma = \{(x, y) \in \Omega \mid f_\theta(x, y) = 0\}$, where $f_\theta$ denotes the multilayer perceptron (MLP) and Ω denotes the local region defined by a given refined bounding box. This formula defines the zero level set Γ, which is the target contour: within Ω, it is the set of all continuous coordinates (x, y) at which the output value of the MLP equals zero. The MLP fits a signed distance field, and its output value is positive if the point is outside the target, negative if it is inside, and zero if it is exactly on the target boundary; therefore, Γ is the extracted precise contour line of the target.
[0035] Specifically, by hierarchically sampling continuous coordinate points within the refined bounding box, fitting the signed distance field with a multilayer perceptron, and then solving the zero level set to obtain a sub-pixel positioning contour, the problem that coarse bounding box positioning cannot meet the requirements of accurate contour extraction is solved, and high-precision contour output for targets of arbitrary shape is achieved.
[0036] S5. Receive each positioning contour and its continuous boundary distance measurement, compare each distance measurement with a preset threshold; if the distance measurement is greater than the preset threshold, retain the positioning contour as the final detection and positioning result; if the distance measurement is less than or equal to the threshold, discard the contour; output all retained contours.
[0037] Furthermore, step S5 specifically includes: Receive each positioning contour, use the Canny edge detection algorithm to extract a continuous boundary point sequence for each contour, and calculate the Euclidean distance of each boundary point sequence to obtain the continuous boundary distance metric for the contour. Extracting a continuous boundary point sequence with the Canny edge detection algorithm includes: Let the input image be I and the Canny edge detection operator be Canny(·); the extracted continuous boundary point sequence S is then $S = \mathrm{Trace}(\mathrm{Canny}(I))$, where Canny(I) outputs a binary edge map and Trace(·) represents 8-neighborhood tracing along the edge, recording the coordinates of the boundary points in sequence to obtain an ordered sequence.
[0038] The distance metric of each contour is compared with a preset threshold. If the distance metric is greater than the preset threshold, the contour is retained; otherwise, it is discarded, resulting in a list of retained contours. For each contour in the retained contour list, perform morphological operations to fill the internal region. Then, overlay the filled contour image with the original input image and determine whether the contours in the overlay image meet the continuity condition. If they do, output all retained contours as the final detection and localization result.
[0039] The process involves performing morphological operations to fill the internal region of each contour in the retained contour list. This includes generating a separate binary mask image for each contour, with the same size as the original image. In the mask, all pixels on the contour boundary (i.e., each point in the continuous boundary point sequence extracted by the Canny algorithm) are marked as white (value 1), and all other background pixels are marked as black (value 0). Next, a structuring element is selected for morphological operations, using a 3×3 pixel cross-shaped template, containing only the center point and its four adjacent points (top, bottom, left, and right). This structuring element will act as a "probe" sliding across the image. Then, morphological closing operations are performed, i.e., dilation followed by erosion. During dilation, a cross-shaped structuring element is slid across the binary mask pixel by pixel. Whenever the area covered by the structuring element contains at least one white pixel, the output pixel corresponding to the center of the structuring element is set to white, and the outline boundary expands outward. Tiny holes or cracks that were originally adjacent to the boundary are covered by the expanded white area. After dilation, an image with thickened boundaries and holes initially filled is obtained. Next, an erosion operation is performed: the same cross-shaped structuring element is slid across the dilated image again. The center output pixel is set to white only when all pixels covered by the structuring element are white; otherwise, it is set to black. Erosion shrinks the previously expanded boundary back, but since the hole areas were filled with white during dilation, these areas remain white after erosion, thus filling the holes. After the closing operation, a specialized hole-filling algorithm is applied to process any remaining internal holes that are not completely filled. 
The specific steps are as follows: First, take the complement of the closing operation result, i.e., change white to black and black to white; then, using all pixels on the four boundaries of the image as seeds, flood fill the complement image, marking all black areas connected to the image boundaries as background; the black areas not flooded are the original holes inside the contour; finally, merge the closing operation result with these hole regions (i.e., take the union) to obtain a binary image where the holes are completely filled. Finally, post-processing is performed to remove small burrs or isolated noise generated during the filling process. Median filtering is used: a 3×3 square window is slid across the filled image, and the value of the center pixel of the window is replaced with the median value of all pixels within the window. This effectively eliminates isolated white or black noise. Alternatively, morphological opening operations (erosion followed by dilation) can be used. Using a cross-shaped structuring element, small protrusions are first eroded to remove them, and then the main shape is restored by dilation, making the edges of the inner contour area smoother.
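The closing and hole-filling steps above can be sketched in pure Python as follows. This is an illustration under assumptions: masks are lists of 0/1 rows, pixels outside the image are treated as black during dilation and as white during erosion (a common border convention, not stated in the source), and the final median or opening step is omitted. Note that closing only fills background gaps into which the structuring element cannot fit, which is why the separate flood-fill stage is needed for larger enclosed regions.

```python
from collections import deque

# 3x3 cross-shaped structuring element: center plus its four direct neighbors
CROSS = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]

def _get(img, r, c, outside):
    h, w = len(img), len(img[0])
    return img[r][c] if 0 <= r < h and 0 <= c < w else outside

def dilate(img):
    """White if any pixel under the cross is white."""
    return [[int(any(_get(img, r + dr, c + dc, 0) for dr, dc in CROSS))
             for c in range(len(img[0]))] for r in range(len(img))]

def erode(img):
    """White only if every pixel under the cross is white."""
    return [[int(all(_get(img, r + dr, c + dc, 1) for dr, dc in CROSS))
             for c in range(len(img[0]))] for r in range(len(img))]

def closing(img):
    """Morphological closing: dilation followed by erosion."""
    return erode(dilate(img))

def fill_holes(img):
    """Fill black regions that are not 4-connected to the image border."""
    h, w = len(img), len(img[0])
    background = [[False] * w for _ in range(h)]
    # Seed the flood fill with every black pixel on the four image borders
    queue = deque((r, c) for r in range(h) for c in range(w)
                  if (r in (0, h - 1) or c in (0, w - 1)) and img[r][c] == 0)
    for r, c in queue:
        background[r][c] = True
    while queue:
        r, c = queue.popleft()
        for dr, dc in CROSS[1:]:
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and img[nr][nc] == 0 and not background[nr][nc]:
                background[nr][nc] = True
                queue.append((nr, nc))
    # Union of the mask and every black pixel the flood fill could not reach
    return [[1 if img[r][c] == 1 or not background[r][c] else 0
             for c in range(w)] for r in range(h)]
```

A typical call chain for one contour mask would be `fill_holes(closing(mask))`.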
[0040] After the above steps, the internal region corresponding to each contour is completely and smoothly filled, resulting in the final filled binary image, which can be directly used for subsequent overlay enhancement.
[0041] The continuous boundary distance metric uses the Hausdorff distance, i.e., the maximum, over the points of one sequence, of the minimum distance to the other sequence, computed between the extracted contour boundary point sequence and the real labeled boundary point sequence; the result is then divided by the image diagonal length to normalize it to the [0,1] interval. The preset threshold is set dynamically according to the UAV reconnaissance scenario. For scenarios with high missed-detection costs, such as mountain search and rescue, the threshold is set to 0.2, and for scenarios with high false-detection costs, such as infrastructure inspection, the threshold is set to 0.5. The low and high thresholds in the Canny edge detection algorithm are set to 50 and 150, respectively, and the Gaussian filter kernel size is 5×5. The continuity condition is judged as follows: check whether the internal pixels of the filled contour in the superimposed image are consistent with the edge gradient direction at the corresponding position in the original input image. If more than 80% of the pixel gradient directions in the filled area form an angle of less than 30 degrees with the contour normal, the continuity condition is determined to be met. In addition, if multiple contours in the retained contour list overlap, they are sorted by distance metric in descending order, and only the contour with the highest distance metric is retained to remove overlapping redundancy.
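The normalized Hausdorff metric and the retention rule can be written compactly as below. This is a brute-force sketch with hypothetical function names; the symmetric form (maximum of the two directed distances) is an assumption, and the retention rule simply mirrors the comparison stated in step S5.

```python
import math

def directed_hausdorff(a, b):
    """Maximum over p in a of the minimum Euclidean distance from p to any q in b."""
    return max(min(math.hypot(p[0] - q[0], p[1] - q[1]) for q in b) for p in a)

def normalized_hausdorff(a, b, width, height):
    """Symmetric Hausdorff distance divided by the image diagonal length."""
    d = max(directed_hausdorff(a, b), directed_hausdorff(b, a))
    return d / math.hypot(width, height)

def keep_contour(metric, threshold):
    """Retention rule as stated in step S5: keep when the metric exceeds the threshold."""
    return metric > threshold
```

For two parallel 3-pixel edges that are 4 pixels apart in a 30×40 image, the metric is 4 / 50 = 0.08.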
[0042] Specifically, by calculating the continuous boundary distance metric of the contour and comparing it with a preset threshold, and combining morphological filling and continuity condition judgment, the problems of low confidence, internal holes and discontinuities in the contour output are solved, and a highly reliable, complete and smooth final positioning contour output is achieved.
[0043] Example 2 provides another implementation scheme for a deep learning-based UAV reconnaissance image target detection and localization method. This example solves the scale problem through feature pyramids and deformable convolutions, uses Transformer to achieve efficient parallel detection, uses graph convolutional networks for collaborative refinement, and finally obtains fine contours through subpixel convolutions and CRF. This example is more suitable for scenarios with high requirements for real-time performance, multi-scale changes, and regular target detection (such as vehicle counting and building detection), and complements the technical path of Example 1.
[0044] The specific content includes: Feature extraction based on multi-scale feature pyramids and deformable convolution: The original reconnaissance images collected by UAVs are input into the backbone network (such as ResNet-50) to extract multi-layer feature maps; a bidirectional feature pyramid network is constructed to fuse feature maps of different scales through top-down and bottom-up paths, and a deformable convolution module is introduced at each fusion node to adaptively adjust the sampling position to adapt to the drastic scale changes and geometric deformations of targets in UAV images; a set of multi-scale enhanced feature maps is output. End-to-end object detection and coarse localization based on Transformer: Multi-scale feature maps are flattened and positional encoding is added. The input is then fed into a Transformer encoder-decoder structure. The decoder uses learnable object queries and interacts with the features output by the encoder through a cross-attention mechanism to directly and in parallel predict a set of candidate targets. Each candidate target includes a class confidence score and initial bounding box coordinates. The set-to-set loss function (Hungarian matching) is used for training, eliminating the need for non-maximum suppression post-processing. Collaborative Bounding Box Refinement Based on Lightweight Graph Convolutional Networks: A relationship graph is constructed using candidate targets as nodes and intersection-over-union (IoU) and feature similarity between targets as edge weights; a two-layer graph convolutional network is used, where each node aggregates the bounding box information and features of its neighbors and corrects its own bounding box coordinates through message passing; the refined bounding box and its corresponding category are output. 
Subpixel convolution-based bounding box detail enhancement and contour refinement: For each refined bounding box, subpixel convolutional layers are used to upsample its internal region to obtain a higher resolution feature map; then a small fully convolutional network (FCN) is used to predict the probability that the point belongs to the target's interior pixel by pixel, and the edge is optimized by a conditional random field (CRF); finally, the target segmentation mask or subpixel-level localization contour is output. The final output based on adaptive confidence screening is as follows: calculate the comprehensive confidence of each target (a weighted sum of category confidence and contour consistency), discard candidate targets with confidence below a preset threshold; for the retained targets, output their category, refined bounding box and sub-pixel contour as the final detection and localization results of the UAV survey image.
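The relation graph used for collaborative refinement in this example can be sketched as follows. The source states only that IoU and feature similarity serve as edge weights; combining them as a convex sum with a hypothetical weight `alpha`, the `(x0, y0, x1, y1)` box format, and the function names are all assumptions made for illustration.

```python
import math

def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def cosine(u, v):
    """Cosine similarity of two feature vectors (0.0 if either has zero length)."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def relation_graph(boxes, feats, alpha=0.5):
    """Symmetric weighted adjacency matrix over candidate targets."""
    n = len(boxes)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Assumed combination: convex sum of box overlap and feature similarity
            weight = alpha * iou(boxes[i], boxes[j]) + (1 - alpha) * cosine(feats[i], feats[j])
            w[i][j] = w[j][i] = weight
    return w
```

The resulting matrix can serve as the adjacency input of a graph convolution layer, in which each node aggregates its neighbors' box information in proportion to these weights.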
[0045] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention in any way. Although the present invention has been disclosed above with reference to preferred embodiments, these embodiments do not limit it. Any person skilled in the art can make modifications or alterations to the above-disclosed technical content to create equivalent embodiments without departing from the scope of the present invention. Any simple modifications, equivalent changes, and alterations made to the above embodiments based on the technical essence of the present invention, without departing from its scope, shall still fall within the scope of the present invention.
Claims
1. A deep learning-based method for target detection and localization in UAV reconnaissance images, characterized in that it includes the following steps: S1. Input the raw survey images collected by the UAV into the reversible neural network, and generate high-resolution detail feature maps and low-resolution semantic feature maps by alternately propagating through additive coupling layers and affine coupling layers. S2. Align the high-resolution detail feature map with the low-resolution semantic feature map and fuse them element by element, then input the result into the primary capsule layer to generate a pose capsule vector set. The prediction vector is calculated using an exponential mapping on a Lie group between adjacent capsule layers, and the connection weights are updated by sparse iterative routing to output a candidate target set. The candidate target set includes a category capsule and initial position parameters for each candidate target; S3. Receive the candidate target set, take each candidate target as a node, calculate the relative position difference and feature cosine similarity between nodes, establish edges according to the threshold to construct a spatiotemporal context graph, input the spatiotemporal context graph into a multi-layer graph attention network to aggregate the features of neighboring nodes and update the node representation, and output a refined bounding box through a regression head. S4. Receive each refined bounding box, sample continuous coordinate points inside each refined bounding box, input the continuous coordinate points into a multilayer perceptron to fit the signed distance field, solve the zero level set of the signed distance field, and obtain the positioning contour corresponding to each refined bounding box. S5. Receive each positioning contour and its continuous boundary distance measurement, compare each distance measurement with a preset threshold, if the distance measurement is greater than the preset threshold, retain the positioning contour as the final detection and positioning result, otherwise discard the contour, and output all retained contours.
2. The method for target detection and localization of UAV reconnaissance images based on deep learning according to claim 1, characterized in that: The method involves alternating forward propagation using additive coupling layers and affine coupling layers, including: processing the survey image through additive coupling layers to obtain a first intermediate feature; inputting the first intermediate feature into an affine coupling layer to obtain a second intermediate feature; repeating the above alternating process at least twice to output a high-resolution detail feature map and a low-resolution semantic feature map.
3. The method for target detection and localization in UAV reconnaissance images based on deep learning according to claim 1, characterized in that: Aligning high-resolution detail feature maps with low-resolution semantic feature maps and fusing them element-wise, including: The low-resolution semantic feature map is upsampled to the same spatial size as the high-resolution detail feature map through bilinear interpolation. If the number of channels is inconsistent, convolution is used to transform the number of channels to obtain an aligned low-resolution feature map. Then, the aligned low-resolution feature map and the original high-resolution detail feature map are added element-wise at the same spatial position and channel to obtain a fused feature map. Batch normalization or layer normalization is then performed before output.
4. The method for target detection and localization in UAV reconnaissance images based on deep learning according to claim 1, characterized in that: Output the candidate target set, including: The pose capsule vector set output by the primary capsule layer is used as the bottom-level capsules. For each bottom-level capsule and each top-level capsule, the pose vector of the bottom-level capsule is transformed to the Lie algebra space through a logarithmic mapping, then multiplied with the learnable weight matrix, and then mapped back to the Lie group space through an exponential mapping to obtain the prediction vector. The prediction vector represents the contribution of the bottom-level capsule to the target category and pose represented by the top-level capsule, and due to the characteristics of the Lie group exponential mapping. Initialize the coupling coefficients from all bottom-level capsules to top-level capsules. Calculate the input vector of each top-level capsule in each routing iteration. Obtain the output vector of the top-level capsule through a squashing function. Update the log prior probabilities of the coupling coefficients based on the agreement, measured by the dot product, between the output vector and the prediction vector, and recalculate the coupling coefficients using the softmax function. Introduce sparsity constraints after each iteration and repeat the iteration process several times. After the routing iteration ends, the output vector of each top-level capsule serves as the category capsule for candidate targets. Its magnitude represents the probability of the category's existence, and the vector's direction encodes the initial position parameters. All top-level capsules with magnitudes exceeding a preset threshold and their position parameters constitute the candidate target set.
5. The method for target detection and localization of UAV reconnaissance images based on deep learning according to claim 1, characterized in that: Calculate the relative positional difference between nodes and the feature cosine similarity, including: The coordinates of the center point of each target are read from the candidate target set. The Euclidean distance between the two points is calculated and divided by the length of the image diagonal to obtain the normalized relative position difference. Fixed-length feature vectors are extracted from the corresponding category capsules. The dot product of the two vectors is calculated and divided by the product of their magnitudes to obtain the cosine similarity. The position difference is compared with a preset position threshold, and the similarity is compared with a preset similarity threshold. When the position difference is less than the threshold and the similarity is greater than the threshold, undirected edges are established between the nodes to construct a spatiotemporal context graph.
6. The method for target detection and localization in UAV reconnaissance images based on deep learning according to claim 1, characterized in that: The refined bounding box output by the regression head includes: The enhanced feature representation of each node obtained after propagation and aggregation through a multi-layer graph attention network is input into the regression head. The regression head consists of two fully connected layers. The first layer maps the features to the intermediate dimension and then activates them with ReLU. The second layer directly outputs the absolute coordinate values of the refined bounding box. The absolute coordinate values include the x-coordinate of the center point, the y-coordinate of the center point, the width, and the height.
7. The method for target detection and localization of UAV reconnaissance images based on deep learning according to claim 1, characterized in that: Sampling continuous coordinate points within each refined bounding box, and inputting these continuous coordinate points into a multilayer perceptron to fit the signed distance field, includes: Within each refined bounding box, uniform grid sampling points are first generated with a step size of two pixels. The distance from each sampling point to the nearest real target boundary is calculated. Near-boundary regions with an absolute distance value of less than five pixels are identified, and the sampling step size is reduced to 0.5 pixels within these regions. Regions far from the boundary are kept sparsely sampled, ultimately obtaining a set of continuous coordinate points. The coordinates of the sampling points are input into a multilayer perceptron, which contains multiple fully connected layers. The last layer outputs a single scalar value as the predicted signed distance of the point. A positive value indicates that the point is outside the target, and a negative value indicates that it is inside. The predicted signed distance values of all sampling points within each refined bounding box constitute the continuous signed distance field of that region.
8. The method for target detection and localization in UAV reconnaissance images based on deep learning according to claim 1, characterized in that: Solving for the zero level set of the signed distance field yields the positioning contour corresponding to each refined bounding box, specifically including: Within the local area covered by each refined bounding box, the continuous coordinate space is discretized into a high-resolution grid. For each grid point, a multilayer perceptron is used to calculate the signed distance value to obtain a discrete signed distance field. The signed distance values of the four corner points of each cell are checked. If they differ in sign, it indicates that the target boundary passes through the cell. The intersection points of the boundary and the cell edges are calculated by linear interpolation according to the sign pattern. All intersection points are connected in sequence to form line segments, which are then spliced to form one or more closed contour lines. Gaussian filtering or spline interpolation is applied to the initial contour lines to remove jagged noise. Then, the Douglas-Peucker algorithm is used to reduce redundant vertices, finally obtaining a positioning contour in the form of an ordered vertex list.
9. The method for target detection and localization in UAV reconnaissance images based on deep learning according to claim 1, characterized in that: Step S5 specifically includes: Receive each positioning contour, use the Canny edge detection algorithm to extract a continuous boundary point sequence for each contour, and calculate the Euclidean distance of each boundary point sequence to obtain the continuous boundary distance metric for the contour. The distance metric of each contour is compared with a preset threshold. If the distance metric is greater than the preset threshold, the contour is retained; otherwise, it is discarded, resulting in a list of retained contours. For each contour in the retained contour list, perform morphological operations to fill the internal region. Then, overlay the filled contour image with the original input image and determine whether the contours in the overlay image meet the continuity condition. If they do, output all retained contours as the final detection and localization result.