A deep learning-based unmanned aerial vehicle low-altitude perspective target detection method

By integrating flight control parameters and visual information in low-altitude target detection of UAVs, performing scale analysis and adaptive block processing, and combining deep learning networks for temporal consistency correction, the problems of low detection accuracy and temporal inconsistency are solved, and the robustness and continuity of detection are improved.

CN122244723A · Pending · Publication Date: 2026-06-19 · NANJING TECH UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
NANJING TECH UNIV
Filing Date
2026-01-28
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing deep learning-based UAV target detection methods fail to effectively integrate flight control parameters and visual information, resulting in low detection accuracy and temporal inconsistencies. They are also unable to adapt to scale fluctuations caused by altitude changes, attitude disturbances, and perspective distortion during low-altitude flight, thus affecting detection accuracy and robustness.

Method used

Low-altitude video frames and flight control parameters are collected from the UAV, then preprocessed and encoded, followed by altitude-related scale analysis and adaptive block processing. A deep learning object detection network then performs fast inference, and temporal consistency processing yields the complete object detection results.

Benefits of technology

It improves the detection capability for small and dense targets, suppresses false detections, missed detections, and positioning drift caused by jitter, occlusion, or rapid movement, and enhances the continuity and stability of detection results.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a deep learning-based method for low-altitude target detection from a UAV, relating to the field of computer vision technology. The method includes: acquiring low-altitude video frames and flight control parameters from the UAV; preprocessing the low-altitude video frames to output a preprocessed frame sequence and encoding the flight control parameters into conditional vectors; performing target region focusing processing on the overall target detection results and performing refined detection on local regions to output single-frame target detection results; performing temporal consistency processing based on the single-frame target detection results and the feature information of adjacent video frames to output temporally corrected target detection results; and performing unified coordinate integration and redundancy suppression processing on the temporally corrected target detection results to form complete target detection results. By incorporating flight control parameters into conditional vectors that drive altitude-related scale analysis and adaptive block processing, the method effectively suppresses jitter, missed detections, and positioning drift, and enhances the continuity and stability of the detection results in the temporal dimension.

Description

Technical Field

[0001] This invention relates to the field of computer vision technology, and in particular to a method for low-altitude target detection of unmanned aerial vehicles based on deep learning.

Background Technology

[0002] Against the backdrop of the rapid development of low-altitude remote sensing and intelligent perception technologies, image detection methods based on UAV platforms have become an important direction for interdisciplinary research in computer vision and artificial intelligence. With the significant achievements of deep learning models (such as YOLO, Faster R-CNN, and DETR) in general target detection tasks, related technologies have been widely transferred to UAV vision systems. Especially in application scenarios such as urban inspection, emergency rescue, and agricultural monitoring, UAVs, with their flexible deployment, wide-area coverage, and low-altitude high-resolution imaging capabilities, provide a new data acquisition paradigm for achieving efficient target recognition. Existing deep learning-based UAV target detection methods often directly adopt general image detection frameworks, typically treating UAV video frames as static images. This ignores the drastic scale fluctuations and target deformation caused by factors such as altitude changes, attitude disturbances, and viewpoint distortion during low-altitude flight. When UAVs acquire images at different altitudes or tilt angles, the apparent scale of the same target in the image may vary by orders of magnitude. Traditional detection networks, because they do not explicitly model the mapping relationship between flight parameters (such as relative altitude, pitch angle, roll angle, etc.) and image content, find it difficult to adaptively adjust the receptive field or block segmentation strategy, thus affecting detection accuracy and robustness. In continuous video frame processing, if the inter-frame motion consistency constraint is ignored, it can easily lead to target trajectory breaks or repeated detection, reducing the overall stability of the system.

Summary of the Invention

[0003] In view of the aforementioned existing problems, the present invention is proposed.

[0004] Therefore, this invention provides a deep learning-based method for low-altitude target detection from a UAV to address the problems of low detection accuracy and temporal inconsistency caused by the lack of fusion of flight control parameters and visual information.

[0005] To solve the above-mentioned technical problems, the present invention provides the following technical solution: a deep learning-based method for low-altitude target detection from a UAV, comprising: acquiring low-altitude video frames and flight control parameters of the UAV; preprocessing the low-altitude video frames to output a preprocessed frame sequence and encoding the flight control parameters into conditional vectors; performing altitude-related scale analysis and adaptive block processing on the current low-altitude video frames based on the preprocessed frame sequence and conditional vectors to output an image sub-block sequence; performing fast inference on the image sub-block sequence using a deep learning target detection network to obtain initial target prediction results and mapping the prediction results to the original image coordinate system to form a full-image target detection result; performing target region focusing processing on the full-image target detection result and performing refined detection on local regions to output a single-frame target detection result; performing temporal consistency processing based on the single-frame target detection result and the feature information of adjacent video frames to output a temporally corrected target detection result; and performing unified coordinate integration and redundancy suppression processing on the temporally corrected target detection result to form a complete target detection result.

[0006] As a preferred embodiment of the deep learning-based UAV low-altitude target detection method of the present invention, the specific steps for outputting the preprocessed frame sequence are as follows: the UAV is controlled to capture raw low-altitude video frames using its onboard camera during low-altitude flight, and a time stamp is assigned to each frame; based on the time stamp, the corresponding flight altitude and attitude information are read to form the flight control parameters; image size normalization and pixel distribution normalization are applied to the low-altitude video frames to output preprocessed low-altitude video frames; and the preprocessed low-altitude video frames are organized in chronological order to form the preprocessed frame sequence.

[0007] As a preferred embodiment of the deep learning-based UAV low-altitude target detection method of the present invention, the specific steps for encoding flight control parameters into conditional vectors are as follows: Based on time stamps, the flight control parameters are aligned with pre-processed low-altitude video frames to output an aligned flight control parameter set. Then, unit normalization and range clipping are performed to output a normalized flight control parameter set. Based on the normalized flight control parameter set, a fixed-dimensional parameter vector representation is constructed, and a structured flight control vector is output. The structured flight control vector is subjected to dimensional transformation and feature mapping processing to output a flight control embedding vector. Then, the amplitude is normalized, and a conditional vector is output.

[0008] As a preferred embodiment of the deep learning-based UAV low-altitude target detection method of the present invention, the specific steps for outputting the image sub-block sequence are as follows: Select the current preprocessed low-altitude video frame and its corresponding condition vector from the preprocessed frame sequence; perform an altitude-related scale analysis on the current preprocessed low-altitude video frame based on the condition vector, and generate the scale analysis results; Based on the scale analysis results, adaptive block processing is performed on the current preprocessed low-altitude video frame to generate multiple image sub-blocks. Spatial location information corresponding to the original image coordinate system is assigned to each image sub-block, and the image sub-blocks are associated with the spatial location information to form an image sub-block sequence.

[0009] As a preferred embodiment of the deep learning-based UAV low-altitude target detection method of the present invention, the specific steps for obtaining the initial target prediction result are as follows: The image sub-block sequence is encapsulated according to a unified data format to form target detection data. The target detection data is then input into the feature extraction part of the deep learning target detection network to perform convolutional feature extraction on the image sub-block sequence and output the sub-block feature set. Based on the sub-block feature set, after processing by the feature fusion part, the target prediction part performs target localization and category discrimination calculations, and outputs a set of target prediction results; Confidence calculation and filtering are performed on the target prediction result set, the confidence information is output, and an association with the spatial location information of the image sub-block sequence is established to output the initial target prediction result.

[0010] As a preferred embodiment of the deep learning-based UAV low-altitude target detection method of the present invention, the deep learning target detection network includes a feature extraction part, a feature fusion part, and a target prediction part connected in sequence. The feature extraction part is used to perform feature encoding on the image sub-block sequence to generate multi-layer feature representations. The feature fusion part performs scale alignment and information integration processing on the multi-layer feature representations. The target prediction part outputs the target prediction result based on the fused features.

[0011] As a preferred embodiment of the deep learning-based UAV low-altitude target detection method of the present invention, the specific steps for forming the full-image target detection result are as follows: Based on the spatial location information of image sub-block sequences, coordinate transformation is performed on the target location parameters in the initial target prediction result, and the mapped target location result is output. The target category information and confidence information in the mapped target location results and the initial target prediction results are combined to form a set of target candidate results for the whole map; the overlapping area elimination and redundancy screening processes are performed on the set of target candidate results for the whole map to output the target detection results for the whole map.
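The coordinate mapping and redundancy screening described in [0011] can be sketched as follows. This is a minimal illustration, not the patented implementation: the dictionary layout (`box`, `tile_origin`, `label`, `conf`), the use of greedy per-class non-maximum suppression as the "redundancy screening", and the IoU threshold of 0.5 are all assumptions made for the example.

```python
def to_full_image(pred):
    """Shift a box from sub-block coordinates into original-image coordinates."""
    x0, y0, x1, y1 = pred["box"]
    ox, oy = pred["tile_origin"]  # top-left corner of the source sub-block
    return {**pred, "box": (x0 + ox, y0 + oy, x1 + ox, y1 + oy)}


def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def suppress_redundant(dets, iou_thr=0.5):
    """Greedy per-class NMS: keep the highest-confidence box of each overlapping group."""
    kept = []
    for det in sorted(dets, key=lambda d: d["conf"], reverse=True):
        if all(det["label"] != k["label"] or iou(det["box"], k["box"]) < iou_thr
               for k in kept):
            kept.append(det)
    return kept


# The same car seen in two overlapping sub-blocks, plus one unrelated detection.
initial = [
    {"box": (500, 100, 600, 160), "tile_origin": (0, 0), "label": "car", "conf": 0.9},
    {"box": (0, 100, 88, 160), "tile_origin": (512, 0), "label": "car", "conf": 0.6},
    {"box": (10, 10, 40, 70), "tile_origin": (512, 0), "label": "person", "conf": 0.7},
]
full_image = suppress_redundant([to_full_image(p) for p in initial])
```

In this example the two car boxes overlap heavily once mapped to full-image coordinates, so only the higher-confidence one survives.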

[0012] As a preferred embodiment of the deep learning-based UAV low-altitude target detection method of the present invention, the steps of performing target region focusing processing on the target detection results of the entire image and performing refined detection on local regions to output single-frame target detection results are as follows. Based on the target location distribution and confidence information in the full-image target detection results, a target interest region is generated, and the corresponding local image region is extracted from the current preprocessed low-altitude video frame. High-resolution feature extraction and target detection inference are performed on local image regions to obtain refined target prediction results. The refined target prediction results are then aligned and fused with the prediction information of corresponding regions in the full-image target detection results to form a single-frame target detection result.

[0013] As a preferred embodiment of the deep learning-based UAV low-altitude target detection method of the present invention, the specific steps for outputting the time-corrected target detection result are as follows: Read the preprocessed low-altitude video frames that are adjacent to the current preprocessed low-altitude video frame and combine them to form a set of adjacent frames; Based on the set of adjacent frames, feature representations of target regions corresponding to single-frame target detection results are obtained from the inference process of the deep learning target detection network, and cross-frame association is performed to form cross-frame feature association information; Based on cross-frame feature association information, temporal consistency analysis is performed on the single-frame target detection results to obtain temporal correlation relationships. Then, temporal correction processing is performed on the target position parameters and confidence information in the single-frame target detection results to form temporally corrected target detection results.
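One possible reading of the temporal correction in [0013] is sketched below. The patent text does not specify the association metric or the correction rule, so everything concrete here is an assumption: cosine similarity between per-target feature vectors for cross-frame association, a similarity threshold of 0.7, and a weighted blend (weight `alpha`) of the current and previous box and confidence.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def temporal_correct(current, previous, sim_thr=0.7, alpha=0.6):
    """Blend each current detection with its best feature match in the previous frame.

    Detections without a sufficiently similar match are passed through unchanged.
    """
    corrected = []
    for det in current:
        best = max(previous, key=lambda p: cosine(det["feat"], p["feat"]), default=None)
        if best is None or cosine(det["feat"], best["feat"]) < sim_thr:
            corrected.append(dict(det))
            continue
        box = alpha * np.asarray(det["box"], dtype=float) \
            + (1 - alpha) * np.asarray(best["box"], dtype=float)
        conf = alpha * det["conf"] + (1 - alpha) * best["conf"]
        corrected.append({**det, "box": tuple(box.tolist()), "conf": conf})
    return corrected


previous = [{"box": (100, 100, 140, 140), "conf": 0.8, "feat": [1.0, 0.0]}]
current = [
    {"box": (110, 100, 150, 140), "conf": 0.4, "feat": [0.9, 0.1]},  # same target, jittered
    {"box": (300, 300, 320, 320), "conf": 0.5, "feat": [0.0, 1.0]},  # new target
]
corrected = temporal_correct(current, previous)
```

The blend pulls a jittered box toward its position in the previous frame and lifts a temporarily low confidence, which is the kind of smoothing the paragraph describes; a production system would likely use more than one adjacent frame.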

[0014] As a preferred embodiment of the deep learning-based UAV low-altitude target detection method of the present invention, the complete target detection result is obtained by performing unified coordinate integration processing on the target position parameters in the time-corrected target detection result, and performing redundancy suppression processing on targets with spatial overlapping relationships.

[0015] The beneficial effects of this invention are as follows: by incorporating flight control parameters into a conditional vector that drives altitude-related scale analysis and adaptive block processing, the detection capability for small and dense targets is improved. Furthermore, by fusing feature information from adjacent frames for temporal consistency correction, false detections, missed detections, and positioning drift caused by jitter, occlusion, or rapid movement are effectively suppressed, thereby enhancing the continuity and stability of detection results in the time dimension.

Brief Description of the Drawings

[0016] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the following description of the embodiments will be briefly introduced. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0017] Figure 1 is a flowchart of the deep learning-based method for low-altitude target detection from a UAV.

[0018] Figure 2 is a flowchart of the preprocessing and conditional vector encoding.

[0019] Figure 3 is a flowchart of the scale analysis and adaptive block processing.

[0020] Figure 4 is a flowchart of the temporal consistency processing.

[0021] Figure 5 is a schematic diagram comparing the jitter of the target position before and after temporal consistency correction.

[0022] Figure 6 is a schematic diagram showing the distribution of the number of consecutive detection frames per target before and after temporal consistency correction.

Detailed Description

[0023] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

[0024] Many specific details are set forth in the following description in order to provide a full understanding of the invention. However, the invention may also be practiced in other ways different from those described herein, and those skilled in the art can make similar extensions without departing from the spirit of the invention. Therefore, the invention is not limited to the specific embodiments disclosed below.

[0025] Secondly, the term "one embodiment" or "embodiment" as used herein refers to a specific feature, structure, or characteristic that may be included in at least one implementation of the present invention. The phrase "in one embodiment" appearing in different places in this specification does not necessarily refer to the same embodiment, nor is it a single or selective embodiment that is mutually exclusive with other embodiments.

[0026] Referring to Figures 1-6, one embodiment of the present invention provides a deep learning-based method for detecting low-altitude targets from a UAV, comprising the following steps:

S1. Collect low-altitude video frames and flight control parameters of the UAV, preprocess the low-altitude video frames, output the preprocessed frame sequence, and encode the flight control parameters into a conditional vector.

S1.1. Control the UAV to collect raw low-altitude video frames through the onboard camera during low-altitude flight and assign time stamps.

[0027] Specifically, during low-altitude flight, the drone continuously acquires raw low-altitude video frames using its onboard camera, and assigns a time stamp to each raw low-altitude video frame. The time stamp is used to characterize the acquisition time corresponding to the raw low-altitude video frame and is used for subsequent correlation and reading.

[0028] S1.2. Based on the time stamp, read the corresponding flight altitude and attitude information to form flight control parameters.

[0029] Specifically, based on the time stamp, the flight altitude and attitude information corresponding to the time stamp are retrieved from the flight control record. The flight altitude information is used to characterize the flight altitude state at the time of acquisition of the original low-altitude video frame, and the attitude information is used to characterize the attitude state at the time of acquisition of the original low-altitude video frame. The flight altitude information and attitude information are combined to form flight control parameters, so that the flight control parameters and the original low-altitude video frame are established in a one-to-one correspondence through the time stamp and maintain time consistency.
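The time-stamp lookup in S1.2 can be illustrated with a short sketch. The patent describes retrieving the record whose time stamp corresponds to the frame; the nearest-neighbour match with a tolerance used here is an assumption (flight logs and video rarely share exact time stamps), as are the record field names and the 0.05 s tolerance.

```python
from bisect import bisect_left


def match_flight_record(frame_ts, record_ts, records, tolerance=0.05):
    """Return the flight record closest in time to frame_ts, or None if too far away.

    record_ts must be sorted ascending, and records[i] belongs to record_ts[i].
    """
    i = bisect_left(record_ts, frame_ts)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(record_ts)]
    if not candidates:
        return None
    best = min(candidates, key=lambda j: abs(record_ts[j] - frame_ts))
    if abs(record_ts[best] - frame_ts) > tolerance:
        return None
    return records[best]


# Hypothetical flight control log sampled every 0.1 s.
record_ts = [0.0, 0.1, 0.2, 0.3]
records = [
    {"altitude_m": 30.0, "pitch_deg": -5.0, "roll_deg": 1.0, "yaw_deg": 90.0},
    {"altitude_m": 30.5, "pitch_deg": -4.0, "roll_deg": 0.5, "yaw_deg": 91.0},
    {"altitude_m": 31.0, "pitch_deg": -3.0, "roll_deg": 0.0, "yaw_deg": 92.0},
    {"altitude_m": 31.5, "pitch_deg": -2.0, "roll_deg": 0.0, "yaw_deg": 93.0},
]
params = match_flight_record(0.11, record_ts, records)
```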

[0030] S1.3. Perform image size, pixel distribution and uniform normalization processing on the low-altitude video frames, output preprocessed low-altitude video frames, and organize the preprocessed low-altitude video frames in chronological order to form a preprocessed frame sequence.

[0031] Specifically, when performing image size, pixel distribution, and normalization processing on low-altitude video frames, each low-altitude video frame is read and its image size is normalized to convert it into a consistent image size. Simultaneously, pixel distribution normalization is performed to normalize pixel value ranges and brightness / contrast, outputting preprocessed low-altitude video frames. Based on time stamps, the preprocessed low-altitude video frames are then ordered and organized sequentially, maintaining the correspondence between the preprocessed low-altitude video frames and the time stamps, forming a preprocessed frame sequence.
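A minimal sketch of S1.3 follows, assuming frames arrive as `(time stamp, H x W x 3 uint8 array)` pairs. The 640 x 640 output size, the nearest-neighbour resize, and the simple division by 255 are illustrative choices; the brightness and contrast normalization mentioned above is not modeled here.

```python
import numpy as np


def preprocess_frame(frame, out_hw=(640, 640)):
    """Resize an H x W x 3 uint8 frame (nearest neighbour) and scale pixels to [0, 1]."""
    h, w = frame.shape[:2]
    out_h, out_w = out_hw
    rows = np.arange(out_h) * h // out_h  # source row for each output row
    cols = np.arange(out_w) * w // out_w  # source column for each output column
    resized = frame[rows[:, None], cols[None, :]]
    return resized.astype(np.float32) / 255.0


def build_frame_sequence(stamped_frames, out_hw=(640, 640)):
    """Order (time stamp, frame) pairs chronologically and preprocess each frame."""
    ordered = sorted(stamped_frames, key=lambda item: item[0])
    return [(ts, preprocess_frame(frame, out_hw)) for ts, frame in ordered]
```

Each output entry keeps its time stamp, so the correspondence between preprocessed frames and flight control records is preserved.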

[0032] S1.4. Based on the time stamp, the flight control parameters are aligned with the pre-processed low-altitude video frames to output the aligned flight control parameter set. Then, unit normalization and range clipping are performed to output the normalized flight control parameter set.

[0033] Specifically, based on time stamps, the time stamps corresponding to the preprocessed low-altitude video frames are read one by one, and the flight control parameter records are matched and searched according to the time stamps. The flight altitude and attitude information with consistent time stamps are extracted and associated with the corresponding preprocessed low-altitude video frames to form an aligned flight control parameter set. The flight altitude and attitude information in the aligned flight control parameter set are processed by unit conversion: the flight altitude is uniformly converted to meters, and the attitude angles are uniformly converted to degrees or radians to eliminate differences between units. Numerical normalization is performed on the flight altitude and attitude information respectively. The Min-Max method is used to map values of different dimensions to a unified numerical range. The upper and lower limits used in the Min-Max method are determined based on the flight envelope of the UAV platform or the mission's allowable range. For example, the normalization limits for flight altitude are taken from the minimum and maximum values of the mission's allowable altitude range (2 m to 120 m), the normalization limits for pitch and roll angles are taken from the platform's commonly used attitude range (-45° to 45°), and the normalization limits for yaw angle are taken from the full angle range (-180° to 180°). After numerical normalization, the flight altitude and attitude information are subjected to range clipping. When the normalization result is less than the boundary value corresponding to the lower normalization limit, it is clipped to the lower boundary value. When the normalization result is greater than the boundary value corresponding to the upper normalization limit, it is clipped to the upper boundary value. The normalized flight control parameter set is then output.
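The Min-Max normalization and range clipping can be sketched as follows. The limits are the example values given above; the target range of [0, 1] and the field names are assumptions, since the text only calls for "a unified numerical range".

```python
# Normalization limits taken from the example values in the description.
BOUNDS = {
    "altitude_m": (2.0, 120.0),
    "pitch_deg": (-45.0, 45.0),
    "roll_deg": (-45.0, 45.0),
    "yaw_deg": (-180.0, 180.0),
}


def normalize_flight_params(params, bounds=BOUNDS):
    """Min-Max normalize each parameter to [0, 1], clipping out-of-range values."""
    normalized = {}
    for name, (lo, hi) in bounds.items():
        value = (params[name] - lo) / (hi - lo)
        normalized[name] = min(1.0, max(0.0, value))  # range clipping
    return normalized


example = normalize_flight_params(
    {"altitude_m": 61.0, "pitch_deg": 0.0, "roll_deg": -90.0, "yaw_deg": 180.0}
)
```

Here the roll angle of -90° lies outside the platform's attitude range, so it is clipped to the lower boundary rather than producing a negative value.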

[0034] S1.5. Construct a fixed-dimensional parameter vector representation based on the normalized flight control parameter set, output a structured flight control vector, perform dimensional transformation and feature mapping processing on the structured flight control vector, output a flight control embedding vector, perform amplitude normalization, and output a conditional vector.

[0035] Specifically, the flight altitude and attitude information in the normalized flight control parameter set are read sequentially, and the read flight altitude and attitude information are filled into the corresponding dimension positions of the fixed-dimensional parameter vector to form a structured flight control vector. Dimension transformation and feature mapping processing are performed on the structured flight control vector. By performing linear and nonlinear mapping operations on the structured flight control vector, it is converted into a new vector representation to obtain a flight control embedding vector, which is used to characterize the embedding features of the UAV's flight altitude and attitude state in the feature space. The amplitude of the flight control embedding vector is calculated dimension by dimension and amplitude normalization processing is performed to scale the values of each dimension of the flight control embedding vector to a uniform amplitude range, generating a conditional vector.
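A sketch of S1.5 under stated assumptions: the fixed field order, the two-layer mapping with a ReLU nonlinearity, the layer sizes (4 to 16 to 8), and L2 normalization as the "amplitude normalization" are illustrative choices, not details from the patent. The weights below are random placeholders; in practice they would be learned together with the detector.

```python
import numpy as np

# Assumed fixed positions of each parameter in the structured flight control vector.
FIELD_ORDER = ("altitude_m", "pitch_deg", "roll_deg", "yaw_deg")


def encode_condition_vector(normalized, w1, b1, w2, b2, eps=1e-8):
    """Map normalized flight parameters to a unit-length conditional vector."""
    x = np.array([normalized[k] for k in FIELD_ORDER], dtype=np.float32)
    hidden = np.maximum(0.0, w1 @ x + b1)  # linear mapping followed by ReLU
    embedding = w2 @ hidden + b2           # flight control embedding vector
    return embedding / (np.linalg.norm(embedding) + eps)  # amplitude normalization


rng = np.random.default_rng(0)
w1 = rng.normal(size=(16, 4)).astype(np.float32)
b1 = np.zeros(16, dtype=np.float32)
w2 = rng.normal(size=(8, 16)).astype(np.float32)
b2 = np.zeros(8, dtype=np.float32)

cond = encode_condition_vector(
    {"altitude_m": 0.5, "pitch_deg": 0.5, "roll_deg": 0.0, "yaw_deg": 1.0},
    w1, b1, w2, b2,
)
```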

[0036] S2. Perform altitude-related scale analysis and adaptive block processing on the current UAV low-altitude video frames based on the preprocessed frame sequence and conditional vector, and output the image sub-block sequence.

[0037] S2.1. Select the current preprocessed low-altitude video frame and its corresponding condition vector from the preprocessed frame sequence; perform an altitude-related scale analysis on the current preprocessed low-altitude video frame based on the condition vector, and generate the scale analysis results.

[0038] Specifically, the frame position corresponding to the current processing moment is determined based on the temporal order of the preprocessed frame sequence. The current preprocessed low-altitude video frame is read from the preprocessed frame sequence, and a condition vector consistent with the time identifier is retrieved from the condition vector set based on the time identifier corresponding to the current preprocessed low-altitude video frame. A correspondence is established between the condition vector and the current preprocessed low-altitude video frame. The vector components representing the influence of flight altitude and attitude information in the condition vector are analyzed, and the vector components are used as constraint information for scale analysis. Based on the image size information of the condition vector and the current preprocessed low-altitude video frame, the target scale change trend related to flight altitude information is calculated. The target scale change trend is corrected by combining the viewpoint change corresponding to attitude information, forming a scale analysis result that represents the target scale range, scale level selection criteria, and scale change direction.

[0039] S2.2. Based on the scale analysis results, perform adaptive block processing on the current preprocessed low-altitude video frame to generate multiple image sub-blocks. Assign spatial location information corresponding to the original image coordinate system to the multiple image sub-blocks, and associate the image sub-blocks with the spatial location information to form an image sub-block sequence.
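The block processing in S2.2 can be sketched as a sliding-window split that records each sub-block's origin in the original image coordinate system. This assumes the scale analysis has already been reduced to a single tile side length `tile`; the 20% overlap and the dictionary layout are hypothetical choices for the example.

```python
import numpy as np


def tile_starts(size, tile, step):
    """Start offsets along one axis; the last tile is shifted back to end at the border."""
    starts = list(range(0, size - tile + 1, step))
    if starts[-1] != size - tile:
        starts.append(size - tile)
    return starts


def split_into_sub_blocks(frame, tile, overlap=0.2):
    """Split a frame into overlapping sub-blocks, each tagged with its (x, y) origin."""
    h, w = frame.shape[:2]
    tile_w, tile_h = min(tile, w), min(tile, h)
    step_x = max(1, round(tile_w * (1 - overlap)))
    step_y = max(1, round(tile_h * (1 - overlap)))
    blocks = []
    for y in tile_starts(h, tile_h, step_y):
        for x in tile_starts(w, tile_w, step_x):
            blocks.append({"origin": (x, y),
                           "pixels": frame[y:y + tile_h, x:x + tile_w]})
    return blocks


frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
blocks = split_into_sub_blocks(frame, tile=640)
```

A smaller `tile` at high altitude keeps small targets large relative to the network input, while a larger `tile` at low altitude avoids cutting big targets across many sub-blocks.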

[0040] Specifically, the frame position corresponding to the current processing moment is determined based on the temporal order of the preprocessed frame sequence. The current preprocessed low-altitude video frame is read from the preprocessed frame sequence, and a condition vector consistent with the time identifier is retrieved from the condition vector set based on the time identifier corresponding to the current preprocessed low-altitude video frame. A correspondence is established between the condition vector and the current preprocessed low-altitude video frame. The vector components representing the influence of flight altitude and attitude information in the condition vector are analyzed, and the vector components are used as constraint information for scale analysis and input into the scale analysis process. During the scale analysis process, the imaging parameters of the airborne camera are added to participate in pixel scale estimation. The imaging parameters include camera calibration parameters or field of view parameters. The camera calibration parameters include focal length parameters and principal point parameters, or are represented by equivalent field of view. Based on flight altitude information, camera imaging parameters, and image size information of the current preprocessed low-altitude video frames, the pixel scale change trend of the target in the image is estimated according to the imaging geometry relationship. Specifically, the target pixel scale decreases as the flight altitude increases and increases as the flight altitude decreases. The pixel scale change trend is corrected by combining the viewpoint change corresponding to the attitude information, including adjusting the equivalent imaging viewpoint or equivalent ground projection scale according to the change in line of sight caused by pitch angle and roll angle, thereby compensating for the deviation caused by attitude change in pixel scale estimation. 
A lookup table mapping relationship of "flight altitude-attitude-pixel scale interval" is established based on camera calibration parameters or field of view parameters. During runtime, the corresponding pixel scale interval is queried according to the current flight altitude and attitude information, and the query result is used as the input for scale analysis. The scale analysis results are generated to characterize the target scale range, the basis for scale level selection, and the direction of scale change.
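The imaging-geometry estimate and the lookup table can be illustrated with a pinhole-camera sketch. The assumptions are significant and worth stating: the camera is taken to point straight down when the UAV is level, the tilt from pitch and roll lengthens the line of sight by 1 / (cos pitch · cos roll), foreshortening of the target is ignored, and the physical target size range of 0.5 m to 5 m is a hypothetical example.

```python
import math


def focal_length_px(image_w, hfov_deg):
    """Focal length in pixels from image width and horizontal field of view."""
    return (image_w / 2) / math.tan(math.radians(hfov_deg) / 2)


def estimate_pixel_scale(altitude_m, pitch_deg, roll_deg, image_w, hfov_deg,
                         target_size_m=(0.5, 5.0)):
    """Return the (min, max) expected target size in pixels."""
    f_px = focal_length_px(image_w, hfov_deg)
    cos_tilt = math.cos(math.radians(pitch_deg)) * math.cos(math.radians(roll_deg))
    slant_range = altitude_m / max(cos_tilt, 1e-3)  # distance along the line of sight
    lo, hi = target_size_m
    return (f_px * lo / slant_range, f_px * hi / slant_range)


def build_scale_lut(altitudes, pitches, rolls, image_w, hfov_deg):
    """Precompute pixel scale intervals on a full altitude/pitch/roll grid."""
    return {(a, p, r): estimate_pixel_scale(a, p, r, image_w, hfov_deg)
            for a in altitudes for p in pitches for r in rolls}


def query_scale_lut(lut, altitude_m, pitch_deg, roll_deg):
    """Return the interval stored at the grid point nearest on each axis."""
    key = min(lut, key=lambda k: (abs(k[0] - altitude_m),
                                  abs(k[1] - pitch_deg),
                                  abs(k[2] - roll_deg)))
    return lut[key]


lut = build_scale_lut((10, 30, 60), (0, 20), (0,), image_w=1920, hfov_deg=90)
interval = query_scale_lut(lut, altitude_m=28, pitch_deg=3, roll_deg=1)
```

Consistent with the description, the estimated pixel scale shrinks as altitude increases and grows as altitude decreases; the tilt term is one simple way to apply the attitude correction.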

[0041] S3. Use a deep learning object detection network to quickly infer the image sub-block sequence, obtain the initial object prediction result, and map the prediction result to the original image coordinate system to form the full-image object detection result.

[0042] S3.1. Encapsulate the image sub-block sequence according to a unified data format to form target detection data. Input the target detection data into the feature extraction part of the deep learning target detection network, perform convolutional feature extraction on the image sub-block sequence, and output the sub-block feature set.

[0043] Specifically, when encapsulating the image sub-block sequence according to a unified data format, each image sub-block in the sequence is read sequentially, and the sub-blocks are processed to ensure consistent size and pixel value range. At the same time, the spatial location information of the image sub-block sequence is associated with the sub-blocks and written together with the sub-blocks into a unified data format container to form target detection data. The target detection data is then input into the feature extraction part of the deep learning target detection network. The feature extraction part of the deep learning target detection network performs convolution operations on the target detection data to extract multi-layer feature representations of the image sub-blocks, and collects and organizes the multi-layer feature representations to output a sub-block feature set.

[0044] S3.2. Based on the sub-block feature set, after processing by the feature fusion part, the target prediction part performs target localization and category discrimination calculation, and outputs a target prediction result set.

[0045] Specifically, the feature set of sub-blocks is input into the feature fusion part, which performs scale alignment processing and information integration processing on the feature sets of sub-blocks from different levels to form a fused feature representation. The fused feature representation is then input into the target prediction part, which performs target localization calculation on the fused feature representation to obtain target position parameters, performs category discrimination calculation on the fused feature representation to obtain target category information, and calculates the corresponding confidence information for the target position parameters and target category information, outputting a set of target prediction results.

[0046] S3.3. Perform confidence calculation and filtering on the target prediction result set, output confidence information, and establish a correlation with the spatial location information of the image sub-block sequence to output the initial target prediction result.

[0047] Specifically, the target location parameters and target category information in the target prediction result set are read one by one, and the confidence information is calculated based on the corresponding target location parameters and target category information in the target prediction result set. At the same time, the confidence information is written back to the target prediction result set to form a target prediction result set with confidence information. The target prediction result set with confidence information is filtered to remove target prediction results with confidence information lower than the filtering criteria and retain target prediction results that meet the filtering criteria. The spatial location information of the image sub-block sequence is read and, based on the source image sub-block identifier of the target prediction result set, the retained target prediction results are associated with the spatial location information of the corresponding image sub-block sequence to output the initial target prediction result. The initial target prediction result includes the target location parameters, target category information, target confidence information in the local coordinate system of the corresponding image sub-block, as well as the spatial location information and association identifier corresponding to the source image sub-block of the target.
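The filtering and association in S3.3 might look like the sketch below. The confidence threshold of 0.25 and the field names (`tile_id`, `box`, `label`, `conf`) are hypothetical; boxes stay in the local coordinate system of their source sub-block, as the paragraph above specifies.

```python
def filter_and_associate(predictions, tile_origins, min_conf=0.25):
    """Drop low-confidence predictions and attach each survivor's sub-block origin.

    predictions: dicts with "tile_id", "box" (local coordinates), "label", "conf".
    tile_origins: maps tile_id to the sub-block's (x, y) origin in the original image.
    """
    kept = []
    for pred in predictions:
        if pred["conf"] < min_conf:
            continue  # below the filtering criterion
        kept.append({**pred, "tile_origin": tile_origins[pred["tile_id"]]})
    return kept


tile_origins = {0: (0, 0), 1: (512, 0)}
raw = [
    {"tile_id": 0, "box": (500, 100, 600, 160), "label": "car", "conf": 0.90},
    {"tile_id": 1, "box": (10, 10, 40, 70), "label": "person", "conf": 0.10},
    {"tile_id": 1, "box": (0, 100, 88, 160), "label": "car", "conf": 0.60},
]
initial_predictions = filter_and_associate(raw, tile_origins)
```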

[0048] S3.4. The deep learning object detection network consists of a feature extraction part, a feature fusion part, and an object prediction part connected in sequence. The feature extraction part is used to perform feature encoding on the image sub-block sequence to generate multi-layer feature representations. The feature fusion part performs scale alignment and information integration processing on the multi-layer feature representations. The object prediction part outputs the object prediction result based on the fused features.

[0049] Specifically, a deep learning object detection network is constructed based on a feature extraction part, a feature fusion part, and an object prediction part. The image sub-block sequence is passed as input to the feature extraction part, which performs convolution operations and hierarchical feature encoding on the image sub-block sequence and outputs multi-layer feature representations at different network layers. The multi-layer feature representations are passed to the feature fusion part, which performs scale alignment on the multi-layer feature representations and performs information integration processing on the scale-aligned multi-layer feature representations to form a fused feature representation. The fused feature representations are passed to the object prediction part, which performs object localization and category discrimination calculation based on the fused feature representations and calculates confidence information, outputting the object prediction result. The object prediction result includes the corresponding object location parameters, object category information, and confidence information corresponding to the object location parameters and object category information.

[0050] Furthermore, during the training phase, the deep learning object detection network iteratively updates its parameters by inputting training images containing object annotation information. During training, the network utilizes object localization error information and category discrimination error information to jointly optimize the parameters of feature extraction, feature fusion, and object prediction, resulting in the trained deep learning object detection network.
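One common way to write such a joint objective is a weighted sum of the two error terms. The form below is only a sketch: the description does not say how the localization and category errors are combined, so the weights $\lambda_{\text{loc}}$ and $\lambda_{\text{cls}}$ and all symbols are illustrative assumptions.

```latex
\mathcal{L}(\theta) = \lambda_{\text{loc}}\,\mathcal{L}_{\text{loc}}(\theta) + \lambda_{\text{cls}}\,\mathcal{L}_{\text{cls}}(\theta)
```

Here $\theta$ stands for the parameters of the feature extraction, feature fusion, and object prediction parts taken together, $\mathcal{L}_{\text{loc}}$ for the object localization error, and $\mathcal{L}_{\text{cls}}$ for the category discrimination error.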

[0051] S3.5. Based on the spatial location information of the image sub-block sequence, perform coordinate transformation processing on the target location parameters in the initial target prediction result, and output the mapped target location result.

[0052] Specifically, the target position parameters corresponding to each target in the initial target prediction results are read one by one, and the spatial position information of the image sub-block to which the target belongs in the image sub-block sequence is read simultaneously. Based on the spatial position information of the image sub-block in the original image coordinate system, the target position parameters are transformed from the local coordinate system of the image sub-block to the original image coordinate system. The coordinate conversion is completed by superimposing the corresponding spatial offset of the image sub-block on the position start coordinates and size parameters in the target position parameters. The transformed target position parameters are written into the target prediction results and replace the target position parameters in the original image sub-block coordinate system, and the mapped target position results are output.
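The coordinate transformation can be sketched as a simple translation. The sketch assumes boxes are stored as `(x, y, w, h)` and that sub-blocks are cropped without rescaling, so only the start coordinates shift and the size parameters carry over unchanged; the function name and box layout are hypothetical.

```python
def map_to_image(box, block_origin):
    """Translate a box from sub-block local coordinates to original image coordinates.

    box is (x, y, w, h) in the sub-block; block_origin is the sub-block's
    (x, y) offset in the original image. Assumes no rescaling of the
    sub-block, so width and height are unchanged.
    """
    x, y, w, h = box
    ox, oy = block_origin
    return (x + ox, y + oy, w, h)
```

For example, a box at `(10, 20, 30, 40)` inside a sub-block whose origin is `(640, 0)` maps to `(650, 20, 30, 40)` in the original frame. If sub-blocks were resized before inference, the box would also need to be scaled back before the offset is added.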

[0053] S3.6. Combine the mapped target location results with the target category information and confidence information in the initial target prediction results to form a full-image target candidate result set; perform overlapping region elimination and redundancy screening on the full-image target candidate result set, and output the full-image target detection results.

[0054] Specifically, when combining the mapped target location results with the target category information and confidence information from the initial target prediction results, the target location parameters in the mapped target location results are read one by one, and the corresponding target category information and confidence information are read from the initial target prediction results. The target location parameters, target category information, and confidence information are written into the same target record, with the fields aligned and numbered, to form the full-image target candidate result set. The spatial overlap relationship is calculated for every pair of target records in the full-image target candidate result set, and target pairs with an overlapping relationship are marked. Redundancy filtering is performed on the overlapping target pairs based on the confidence information, retaining the target record with the higher confidence and removing the one with the lower confidence. The remaining records are collected and organized, and the full-image target detection result is output. The full-image target detection result includes the target location parameters in the unified coordinate system of the original low-altitude video frame, together with the corresponding target category information and target confidence information. It is the target set obtained by coordinate mapping of the initial target prediction results from multiple image sub-blocks, followed by overlapping region elimination and redundancy filtering.

[0055] It should be noted that the confidence information is used to characterize the degree of credibility of the target detection result as a real target. The confidence information is generated by the deep learning target detection network when performing target localization and category discrimination calculations, and it corresponds one-to-one with the corresponding target position parameters and target category information. In the target detection result set, the spatial overlap relationship between targets is calculated pairwise based on the target position parameters to determine the target pairs or target groups that have spatial overlap. When the position parameters of multiple targets have spatial overlap, it is considered that multiple targets correspond to the same real target or there is a possibility of duplicate detection. For targets with spatial overlap, redundancy screening is performed based on the confidence information corresponding to the targets. By comparing the confidence levels of overlapping targets, the target detection results with higher confidence are retained, and the target detection results with lower confidence are removed.
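The overlap calculation and confidence-based screening described above resemble greedy non-maximum suppression. The sketch below makes that reading concrete under stated assumptions: overlap is measured by intersection over union (IoU), the threshold is 0.5, and category is ignored. None of these choices is specified in the description, and the record layout is hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0  # boxes do not overlap
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)


def suppress_redundant(records, iou_threshold=0.5):
    """Keep the higher-confidence record of every overlapping pair.

    records is a list of dicts with "box" and "confidence" keys.
    Records are visited from highest to lowest confidence; a record is
    kept only if it does not overlap an already kept one.
    """
    kept = []
    for rec in sorted(records, key=lambda r: r["confidence"], reverse=True):
        if all(iou(rec["box"], k["box"]) < iou_threshold for k in kept):
            kept.append(rec)
    return kept
```

Sorting by confidence first means each comparison only needs to look at records that have already survived, which avoids explicitly marking every overlapping pair.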

[0056] S4. Perform target region focusing processing on the full-image target detection results, and perform refined detection on local regions to output single-frame target detection results.

[0057] S4.1. Based on the target location distribution and confidence information in the full-image target detection results, generate the target interest region and extract the corresponding local image region from the current preprocessed low-altitude video frame.

[0058] Specifically, the target location parameters and confidence information of each target in the full-image target detection results are read one by one. Based on the spatial distribution of the target location parameters in the current preprocessed low-altitude video frame, the target location parameters of spatially adjacent targets are aggregated. At the same time, the target location parameters are filtered and sorted using the confidence information to form the target location range designated for focused processing, which serves as the target interest region. According to the position range of the target interest region in the original image coordinate system, the image content at exactly that position is cropped from the current preprocessed low-altitude video frame to obtain the local image region. The local image region remains spatially consistent with the target interest region and is taken directly from the current preprocessed low-altitude video frame.
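A minimal sketch of the region generation and cropping described above follows. The aggregation rule (merge boxes that lie within a pixel `margin` of each other), the confidence cutoff, and the frame representation as a list of pixel rows are all assumptions made for illustration; the description does not fix any of them.

```python
def boxes_near(a, b, margin):
    """True if two (x, y, w, h) boxes overlap or lie within `margin` pixels."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return (ax - margin < bx + bw and bx - margin < ax + aw and
            ay - margin < by + bh and by - margin < ay + ah)


def union(a, b):
    """Smallest (x, y, w, h) box that contains both input boxes."""
    x0, y0 = min(a[0], b[0]), min(a[1], b[1])
    x1 = max(a[0] + a[2], b[0] + b[2])
    y1 = max(a[1] + a[3], b[1] + b[3])
    return (x0, y0, x1 - x0, y1 - y0)


def regions_of_interest(detections, min_confidence=0.3, margin=16):
    """Aggregate spatially adjacent, sufficiently confident boxes into regions."""
    boxes = [d["box"] for d in detections if d["confidence"] >= min_confidence]
    merged = True
    while merged:  # repeat until no pair of regions can be merged
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if boxes_near(boxes[i], boxes[j], margin):
                    boxes[i] = union(boxes[i], boxes[j])
                    del boxes[j]
                    merged = True
                    break
            if merged:
                break
    return boxes


def crop(frame, roi):
    """Cut the region (x, y, w, h) out of a frame stored as a list of pixel rows."""
    x, y, w, h = roi
    return [row[x:x + w] for row in frame[y:y + h]]
```

Because each region is cropped directly at its original-image coordinates, the crop's origin `(x, y)` is all that is needed later to map refined predictions back into the full frame.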

[0059] S4.2. Perform high-resolution feature extraction and target detection inference on local image regions to obtain refined target prediction results. Align and fuse the prediction information of corresponding regions in the refined target prediction results and the full-image target detection results to form single-frame target detection results.

[0060] Specifically, a local image region is read, and convolutional feature extraction is performed on the image representation of the local image region while maintaining a high pixel resolution to obtain a high-resolution feature representation of the local image region. Then, object detection inference is performed based on the high-resolution feature representation, and a refined object prediction result is output. The prediction information corresponding to the spatial range of the local image region in the full-image object detection result is read, and the refined object prediction result is aligned with the corresponding prediction information to establish a one-to-one correspondence between the objects. Then, the aligned object position parameters, object category information, and confidence information are fused to update the corresponding object records and form a single-frame object detection result. The single-frame object detection result is a complete description of the object detection information within a single video frame.
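The alignment and fusion step can be sketched as below. The description does not state the alignment criterion or the fusion rule, so this sketch assumes both: records are paired by highest IoU above a threshold, fused boxes are confidence-weighted averages, and the category and confidence come from the more confident record. All names are hypothetical, and boxes are assumed to be already expressed in original image coordinates.

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    iw = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    ih = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)


def fuse_records(coarse, refined, iou_threshold=0.5):
    """Pair full-image and refined records one-to-one, then fuse each pair."""
    fused, used = [], set()
    for c in coarse:
        best_j, best_iou = None, iou_threshold
        for j, r in enumerate(refined):
            if j in used:
                continue
            overlap = iou(c["box"], r["box"])
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is None:
            fused.append(dict(c))  # no refined counterpart: keep as is
            continue
        used.add(best_j)
        r = refined[best_j]
        wc, wr = c["confidence"], r["confidence"]
        winner = r if wr >= wc else c
        confidence = max(wc, wr)
        if wc + wr == 0:
            wc = wr = 1.0  # fall back to equal weights
        total = wc + wr
        box = tuple((wc * cv + wr * rv) / total
                    for cv, rv in zip(c["box"], r["box"]))
        fused.append({"box": box, "category": winner["category"],
                      "confidence": confidence})
    # Refined detections with no full-image counterpart are new targets.
    fused.extend(dict(r) for j, r in enumerate(refined) if j not in used)
    return fused
```

Keeping unmatched refined records is a deliberate choice in this sketch: the refined pass exists to find small targets that the full-image pass missed.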

[0061] S5. Based on the target detection results of a single frame, combined with the feature information of adjacent video frames, perform temporal consistency processing and output the temporally corrected target detection results.

[0062] S5.1. Read the pre-processed low-altitude video frames that are adjacent to the current pre-processed low-altitude video frame and combine them to form a set of adjacent frames.

[0063] Specifically, the time stamp corresponding to the current preprocessed low-altitude video frame is located in the preprocessed frame sequence. Then, using the time stamp as an index, the preprocessed low-altitude video frames that are immediately before and immediately after the current preprocessed low-altitude video frame in the preprocessed frame sequence are retrieved. The image content and time stamp of the adjacent preprocessed low-altitude video frames are read respectively. The adjacent preprocessed low-altitude video frames are combined and organized with the current preprocessed low-altitude video frame in chronological order, keeping the correspondence between each preprocessed low-altitude video frame and the time stamp unchanged, to form a set of adjacent frames.
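Retrieving the adjacent frame set by time stamp can be sketched as follows, assuming the preprocessed frame sequence is a list of `(timestamp, image)` pairs already sorted in chronological order. The function name and the pair layout are hypothetical.

```python
def adjacent_frame_set(frames, current_timestamp):
    """Return the previous, current, and next frames in chronological order.

    frames is a list of (timestamp, image) pairs sorted by timestamp.
    At the start or end of the sequence, the missing neighbour is omitted.
    """
    timestamps = [t for t, _ in frames]
    i = timestamps.index(current_timestamp)  # locate the current frame
    lo = max(i - 1, 0)
    hi = min(i + 1, len(frames) - 1)
    return frames[lo:hi + 1]  # slicing preserves each frame's time stamp
```

Because the result is a slice of the original sequence, the correspondence between each frame and its time stamp stays unchanged, as the description requires.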

[0064] S5.2. Based on the set of adjacent frames, the feature representation of the target region corresponding to the single-frame target detection result is obtained from the inference process of the deep learning target detection network, and cross-frame association is performed to form cross-frame feature association information.

[0065] Specifically, the target location parameters in the single-frame target detection result are read and mapped to the corresponding spatial range of each preprocessed low-altitude video frame in the adjacent frame set. Then, each preprocessed low-altitude video frame in the adjacent frame set is sequentially input into the deep learning target detection network, and intermediate feature representations of the spatial range corresponding to the target location parameters are extracted during the inference process of the deep learning target detection network to obtain the feature representation of the target region corresponding to the single-frame target detection result. Similarity calculation is performed on the target region feature representations of different preprocessed low-altitude video frames in the adjacent frame set, and the target region feature representations are paired and associated based on the similarity calculation results. Correspondence is established according to time order to form cross-frame feature association information. The cross-frame feature association information is used to characterize the correspondence between the same target in different preprocessed low-altitude video frames. The cross-frame feature association information includes the target region feature representation of the correspondence between the same target in different preprocessed low-altitude video frames, the similarity evaluation results between the feature representations, and the cross-frame target correspondence identifier established by the similarity evaluation results.

[0066] It should be noted that the target region feature representations are denoted $\mathbf{f}_t$ and $\mathbf{f}_{t+1}$, and the similarity is calculated as:

$$S(\mathbf{f}_t, \mathbf{f}_{t+1}) = \frac{\mathbf{f}_t \cdot \mathbf{f}_{t+1}}{\lVert \mathbf{f}_t \rVert_2 \, \lVert \mathbf{f}_{t+1} \rVert_2 + \varepsilon}$$

where $\mathbf{f}_t$ is the feature representation vector of a given target region at time $t$; $\mathbf{f}_{t+1}$ is the feature representation vector of the corresponding target or candidate target region at time $t+1$; $\mathbf{f}_t \cdot \mathbf{f}_{t+1}$ is the dot product of the two feature representation vectors; $\lVert \mathbf{f}_t \rVert_2$ and $\lVert \mathbf{f}_{t+1} \rVert_2$ are the 2-norms of the respective feature representation vectors; $\varepsilon$ is a numerical stability term used to avoid a zero denominator, which does not take part in the semantic judgment of feature similarity and serves only to keep the computation numerically stable; and $t$ is the current time index.

[0067] It should be noted that the cross-frame feature association index mapping is constructed using the maximum similarity matching method:

$$m_t(i) = \arg\max_{j} S\left(\mathbf{f}_t^{\,i}, \mathbf{f}_{t+1}^{\,j}\right)$$

where $S(\cdot,\cdot)$ is the feature similarity defined in the preceding paragraph; $m_t(i)$ is the index of the target region at time $t+1$ that corresponds to the $i$-th target region at time $t$, and is used to represent the correspondence of target regions in the time dimension; $\mathbf{f}_t^{\,i}$ is the feature representation vector of the $i$-th target region at time $t$; $\mathbf{f}_{t+1}^{\,j}$ is the feature representation vector of the $j$-th target region at time $t+1$; $i$ is the target region index at time $t$; and $j$ is the target region index at time $t+1$.
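The similarity calculation and maximum similarity matching described in the two paragraphs above can be sketched together. The sketch assumes feature representations are plain lists of floats and that `eps` is added to the denominator; the function names and the default `eps` value are hypothetical.

```python
import math


def cosine_similarity(f_a, f_b, eps=1e-8):
    """Dot product divided by the product of 2-norms plus a stability term."""
    dot = sum(a * b for a, b in zip(f_a, f_b))
    norm_a = math.sqrt(sum(a * a for a in f_a))
    norm_b = math.sqrt(sum(b * b for b in f_b))
    return dot / (norm_a * norm_b + eps)  # eps avoids a zero denominator


def match_regions(features_t, features_next):
    """Map each region index at one time to its most similar region at the next.

    Returns a dict {i: j} where j maximizes the similarity between
    features_t[i] and features_next[j]. Regions at the first time are left
    unmatched only if features_next is empty.
    """
    mapping = {}
    for i, f_i in enumerate(features_t):
        scores = [cosine_similarity(f_i, f_j) for f_j in features_next]
        if scores:
            mapping[i] = max(range(len(scores)), key=scores.__getitem__)
    return mapping
```

Note that a plain argmax lets two regions map to the same target in the next frame. A practical system might add a minimum similarity threshold or a one-to-one assignment step, but neither is described in the source.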

[0068] S5.3. Based on cross-frame feature association information, perform temporal consistency analysis on the single-frame target detection results to obtain temporal association relationships, and perform temporal correction processing on the target position parameters and confidence information in the single-frame target detection results to form temporally corrected target detection results.

[0069] Specifically, the system reads the target region correspondence established in time order from the cross-frame feature association information, and reads the target position parameters and confidence information that match the target region correspondence from the single-frame target detection result. Then, based on the cross-frame feature association information, it performs time-series association calculation on the target position parameters between adjacent preprocessed low-altitude video frames to form a time-series association relationship. It performs time-series correction processing on the target position parameters in the single-frame target detection result, and smoothly updates the target position parameters through the time-series association result. It also performs time-series correction processing on the confidence information in the single-frame target detection result, and uniformly updates the confidence information through the time-series association result to form a time-series corrected target detection result. The time-series corrected target detection result includes the time-series corrected target position parameters, the target category information corresponding to the target position parameters, and the uniformly updated target confidence information.
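The description does not specify the smoothing rule used for temporal correction. One simple possibility, shown here purely as an assumption, is to blend the current record with the average of its matched records in adjacent frames using a weight `alpha`. The record layout and the default `alpha` are hypothetical.

```python
def temporally_correct(current, neighbours, alpha=0.6):
    """Blend a detection with its matched detections in adjacent frames.

    current is a dict with "box" (x, y, w, h) and "confidence";
    neighbours is the list of records matched to the same target in
    adjacent frames. alpha weights the current frame; (1 - alpha)
    weights the neighbour average.
    """
    if not neighbours:
        return dict(current)  # nothing to correct against
    n = len(neighbours)
    mean_box = [sum(r["box"][k] for r in neighbours) / n for k in range(4)]
    mean_conf = sum(r["confidence"] for r in neighbours) / n
    box = tuple(alpha * c + (1 - alpha) * m
                for c, m in zip(current["box"], mean_box))
    confidence = alpha * current["confidence"] + (1 - alpha) * mean_conf
    # Category and any other fields are carried over unchanged.
    return {**current, "box": box, "confidence": confidence}
```

With this rule, a box that jumps away from its neighbours for one frame is pulled back toward them, and a detection whose confidence dips momentarily is supported by confident neighbouring detections.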

[0070] Furthermore, as shown in Figure 5, the distributions of the number of consecutive frames in which a target is detected are compared between the control scheme and the scheme of the present invention on the same UAV low-altitude video sequence. The horizontal axis represents the number of consecutive frames in which a target is detected, and the vertical axis represents the corresponding number of targets. In the control scheme, the number of consecutively detected frames is concentrated in a small range, and targets are prone to detection interruption in the video sequence. In the scheme of the present invention, the number of consecutively detected frames is distributed over a larger range, indicating that targets can be stably detected across more consecutive video frames. By fusing the feature information of adjacent video frames and performing temporal consistency correction, the present invention can effectively reduce target detection interruptions caused by jitter, occlusion, or rapid movement, thereby enhancing the continuity and stability of the target detection results in the time dimension. The control scheme uses the same UAV low-altitude video sequence and the same target detection network, but outputs detection results based only on single-frame target detection results, without introducing cross-frame feature association information or performing temporal consistency correction on the target position parameters and confidence information.
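The quantity plotted in Figure 5 can be computed along the following lines. This is a sketch of one plausible reading, in which each target has a per-frame detected or not-detected flag and every unbroken run of detections counts as one entry; the exact counting rule used for the figure is not given in the description, and the names are hypothetical.

```python
from collections import Counter
from itertools import groupby


def run_lengths(detected):
    """Lengths of the unbroken runs of detected frames for one target."""
    return [sum(1 for _ in group) for key, group in groupby(detected) if key]


def run_length_histogram(tracks):
    """Count how many runs of each length occur across all targets.

    tracks maps a target identifier to its per-frame detection flags.
    """
    hist = Counter()
    for flags in tracks.values():
        hist.update(run_lengths(flags))
    return hist
```

A scheme that suffers frequent detection interruptions produces many short runs, while a temporally stable scheme shifts the histogram toward longer runs.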

[0071] As shown in Figure 6, without temporal consistency correction, the target position jitter amplitude is large and fluctuates significantly; after fusing features from adjacent frames and performing temporal consistency correction, the jitter amplitude is significantly reduced and the trend of change is smoother. In the frame interval where the two schemes differ most, the scheme of the present invention shows a pronounced jitter suppression effect compared with the control scheme, indicating that temporal consistency correction can effectively suppress positioning drift and enhance the stability of detection results in the time dimension.

[0072] S6. Perform unified coordinate integration and redundancy suppression on the time-corrected target detection results to form complete target detection results.

[0073] S6.1. The complete target detection result is obtained by performing unified coordinate integration on the target position parameters in the time-corrected target detection result and performing redundancy suppression processing on targets with spatial overlapping relationships.

[0074] Specifically, because temporal consistency correction applies cross-frame smooth updates to the target position parameters, the corrected positions of the same target may merge or overlap, and cross-frame fusion may introduce new duplicate target records. Redundancy suppression is therefore performed again. The target position parameters in the temporally corrected target detection results are read one by one, and their coordinate representation is checked. The target position parameters are uniformly converted to the original image coordinate system, and the record fields are made consistent. After unified coordinate integration, pairwise spatial overlap is calculated for the target position parameters in the temporally corrected target detection results. Target pairs with spatial overlap are marked, and redundancy suppression is performed on the marked pairs based on confidence information: one target record is retained, and the duplicate target record is removed. The retained target records are collected and organized to obtain the complete target detection result. The complete target detection result is thus formed from the temporally corrected target detection results by unified coordinate integration of the target position parameters and redundancy suppression of spatially overlapping targets.

[0075] In summary, this invention constructs a conditional vector from flight control parameters to drive altitude-related scale analysis and adaptive block processing, which improves the detection capability for small and dense targets. It also fuses feature information from adjacent frames for temporal consistency correction, which effectively suppresses false detections, missed detections, and positioning drift caused by jitter, occlusion, or rapid movement, thereby enhancing the continuity and stability of detection results in the time dimension.

[0076] It should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, and all such modifications or substitutions should be covered within the scope of the claims of the present invention.

Claims

1. A deep learning-based method for low-altitude target detection from a UAV, characterized in that the method comprises: collecting low-altitude video frames and flight control parameters of the UAV, preprocessing the low-altitude video frames to output a preprocessed frame sequence, and encoding the flight control parameters into a conditional vector; based on the preprocessed frame sequence and the conditional vector, performing altitude-related scale analysis and adaptive block processing on the current UAV low-altitude video frame to output an image sub-block sequence; using a deep learning target detection network to perform fast inference on the image sub-block sequence to obtain initial target prediction results, and mapping the prediction results to the original image coordinate system to form full-image target detection results; performing target region focusing processing on the full-image target detection results, and performing refined detection on local regions to output single-frame target detection results; performing temporal consistency processing based on the single-frame target detection results and the feature information of adjacent video frames to output temporally corrected target detection results; and performing unified coordinate integration and redundancy suppression processing on the temporally corrected target detection results to form complete target detection results.

2. The deep learning-based UAV low-altitude target detection method as described in claim 1, characterized in that the specific steps for outputting the preprocessed frame sequence are as follows: controlling the UAV to capture raw low-altitude video frames with its onboard camera during low-altitude flight and assigning time stamps; reading, based on the time stamps, the flight altitude and attitude information of the UAV to form the flight control parameters; uniformly normalizing the image size and pixel distribution of the low-altitude video frames to output preprocessed low-altitude video frames; and organizing the preprocessed low-altitude video frames in chronological order to form the preprocessed frame sequence.

3. The deep learning-based UAV low-altitude target detection method as described in claim 1, characterized in that: The specific steps for encoding flight control parameters into conditional vectors are as follows. Based on time stamps, the flight control parameters are aligned with pre-processed low-altitude video frames to output an aligned flight control parameter set. Then, unit normalization and range clipping are performed to output a normalized flight control parameter set. Based on the normalized flight control parameter set, a fixed-dimensional parameter vector representation is constructed, and a structured flight control vector is output. The structured flight control vector is subjected to dimensional transformation and feature mapping processing to output a flight control embedding vector. Then, the amplitude is normalized, and a conditional vector is output.

4. The deep learning-based UAV low-altitude target detection method as described in claim 1, characterized in that the specific steps for outputting the image sub-block sequence are as follows: selecting the current preprocessed low-altitude video frame and its corresponding conditional vector from the preprocessed frame sequence; performing altitude-related scale analysis on the current preprocessed low-altitude video frame based on the conditional vector to generate scale analysis results; performing adaptive block processing on the current preprocessed low-altitude video frame based on the scale analysis results to generate multiple image sub-blocks; and assigning to each image sub-block its spatial location information in the original image coordinate system, and associating the image sub-blocks with the spatial location information to form the image sub-block sequence.

5. The deep learning-based UAV low-altitude target detection method as described in claim 1, characterized in that the specific steps for obtaining the initial target prediction result are as follows: encapsulating the image sub-block sequence according to a unified data format to form target detection data, and inputting the target detection data into the feature extraction part of the deep learning target detection network to perform convolutional feature extraction on the image sub-block sequence and output a sub-block feature set; based on the sub-block feature set, after processing by the feature fusion part, performing target localization and category discrimination calculations in the target prediction part to output a target prediction result set; and performing confidence calculation and filtering on the target prediction result set, outputting confidence information, and establishing an association with the spatial location information of the image sub-block sequence to output the initial target prediction result.

6. The deep learning-based UAV low-altitude target detection method as described in claim 5, characterized in that the deep learning target detection network includes a feature extraction part, a feature fusion part, and a target prediction part connected in sequence; the feature extraction part performs feature encoding on the image sub-block sequence to generate multi-layer feature representations; the feature fusion part performs scale alignment and information integration on the multi-layer feature representations; and the target prediction part outputs the target prediction result based on the fused features.

7. The deep learning-based UAV low-altitude target detection method as described in claim 1, characterized in that the specific steps for generating the full-image target detection result are as follows: performing coordinate transformation on the target location parameters in the initial target prediction result based on the spatial location information of the image sub-block sequence, and outputting the mapped target location result; combining the mapped target location results with the target category information and confidence information in the initial target prediction results to form a full-image target candidate result set; and performing overlapping region elimination and redundancy screening on the full-image target candidate result set to output the full-image target detection results.

8. The deep learning-based UAV low-altitude target detection method as described in claim 7, characterized in that the specific steps for performing target region focusing processing on the full-image target detection results and performing refined detection on local regions to output single-frame target detection results are as follows: generating a target interest region based on the target location distribution and confidence information in the full-image target detection results, and extracting the corresponding local image region from the current preprocessed low-altitude video frame; performing high-resolution feature extraction and target detection inference on the local image region to obtain refined target prediction results; and aligning and fusing the refined target prediction results with the prediction information of the corresponding regions in the full-image target detection results to form the single-frame target detection result.

9. The deep learning-based UAV low-altitude target detection method as described in claim 1, characterized in that: The specific steps for outputting the time-corrected target detection result are as follows. Read the preprocessed low-altitude video frames that are adjacent to the current preprocessed low-altitude video frame and combine them to form a set of adjacent frames; Based on the set of adjacent frames, feature representations of target regions corresponding to single-frame target detection results are obtained from the inference process of the deep learning target detection network, and cross-frame association is performed to form cross-frame feature association information; Based on cross-frame feature association information, temporal consistency analysis is performed on the single-frame target detection results to obtain temporal correlation relationships. Then, temporal correction processing is performed on the target position parameters and confidence information in the single-frame target detection results to form temporally corrected target detection results.

10. The deep learning-based UAV low-altitude target detection method as described in claim 1, characterized in that the complete target detection result is obtained by performing unified coordinate integration on the target position parameters in the time-corrected target detection result and performing redundancy suppression processing on targets with spatial overlapping relationships.