Radar-guided RGB multimodal unmanned aerial vehicle target detection method for ground-based anti-unmanned aerial vehicle scenarios

By utilizing the time alignment and motion compensation of radar point clouds and RGB images in ground-based anti-drone scenarios, a radar prior guidance representation is constructed, which solves the problems of missed detection and false detection in drone detection, achieves efficient multimodal fusion and bird flock suppression, and improves the reliability and real-time performance of detection.

CN122244843A · Pending · Publication Date: 2026-06-19 · ZHONGBEI UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ZHONGBEI UNIV
Filing Date
2026-03-10
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In ground-based anti-drone scenarios, drone target detection is prone to missed detections and false detections. Furthermore, existing multimodal fusion methods lack effective internal guidance mechanisms for the detection network and bird flock interference suppression strategies, making it difficult to meet the requirements of real-time performance and low false alarms.

Method used

By establishing timestamp information between RGB images and radar point cloud data, time alignment and motion compensation are performed to construct radar prior-guided representations. These representations are then used to guide fusion during feature extraction and encoding interaction stages. Combined with radar-visual consistency features, target detection and bird flock suppression are performed, improving the reliability and robustness of detection.

Benefits of technology

It improves the detectability of small targets at long distances, reduces the false alarm rate of bird flocks, and achieves efficient UAV target detection in complex backgrounds, making it suitable for real-time anti-UAV applications.

✦ Generated by Eureka AI based on patent content.


Abstract

A radar-guided RGB multimodal UAV target detection method for ground-based anti-UAV scenarios, belonging to the field of computer vision target detection, includes the following steps: S1: Establishing timestamp information for RGB images and radar point cloud data; S2: Performing motion compensation on the radar point cloud data to obtain aligned point cloud data; S3: Projecting onto the pixel plane of the RGB image to obtain the radar projection point set; S4: Constructing a radar prior guidance representation aligned with the pixels of the RGB image; S5: Guiding the fusion of multi-scale visual features to obtain fused features; S6: Outputting the category and bounding box of candidate targets to obtain candidate detection results; S7: Outputting the final UAV target detection result. This invention constructs a pixel-aligned prior guidance representation of radar point clouds, enabling radar information to participate in detection in an interpretable form, thus improving the detectability of distant and weak targets.

Description

Technical Field

[0001] This invention belongs to the field of computer vision target detection, specifically a radar-guided RGB multimodal drone target detection method for ground-based anti-drone scenarios.

Background Technology

[0002] In ground-based anti-drone scenarios, drones are typically distant, small in scale, and low in contrast, and they appear against complex backgrounds (clouds, buildings, forests, power lines, etc.). Detection that relies solely on RGB vision is prone to missed detections and false detections and has insufficient resistance to occlusion. Meanwhile, targets such as flocks of birds resemble drones in morphology, scale, and motion characteristics, which leads to a high false alarm rate. Millimeter-wave radar offers all-weather operation and strong ranging and velocity measurement capabilities, but its angular resolution is limited and its semantic representation is weak, so categories are difficult to distinguish reliably using radar alone. Existing multimodal fusion methods mostly employ simple feature concatenation or post-detection fusion. They lack effective guidance mechanisms for the key stages of the detection network (feature extraction, cross-scale encoding interaction, decoding query decision-making) and lack targeted suppression strategies for bird flock interference, making it difficult to meet the real-time and low false alarm requirements of anti-drone operations.

Summary of the Invention

[0003] This invention provides a radar-guided RGB multimodal drone target detection method for ground-based anti-drone scenarios, in order to overcome the deficiencies in the prior art.

[0004] This invention is achieved through the following technical solution: A radar-guided RGB multimodal UAV target detection method for ground-based anti-UAV scenarios includes the following steps: S1: Acquire RGB images and corresponding radar point cloud data within the same monitoring area, and establish timestamp information for the RGB images and radar point cloud data; S2: Based on the timestamp information, the RGB image and radar point cloud data are time-aligned, and motion compensation is performed on the radar point cloud data to obtain aligned point cloud data; S3: Based on the calibration parameters of the radar and camera, perform coordinate transformation on the aligned point cloud data and project it onto the pixel plane of the RGB image to obtain the radar projection point set; S4: The radar projection point set is preprocessed and a radar prior guidance representation aligned with the RGB image pixels is constructed; S5: Input the RGB image into the target detection network to extract multi-scale visual features, and inject the radar prior guided representation into the target detection network, thereby guiding the fusion of multi-scale visual features in the feature extraction stage or the coding interaction stage to obtain fused features; S6: Based on the radar prior guidance representation, the decoding query of the target detection network is filtered, and the filtered decoding query and fused features are input into the decoder to output the category and bounding box of the candidate target, and the candidate detection result is obtained. S7: Perform bird flock suppression processing based on the radar-visual consistency features corresponding to the candidate detection results, and output the final UAV target detection results.

[0005] As described above, in the radar-guided RGB multimodal drone target detection method for ground-based anti-drone scenarios, step S1 acquires the RGB image at time $t$ and the corresponding millimeter-wave radar point cloud data. The RGB image is denoted as $I_t \in \mathbb{R}^{H \times W \times 3}$, where $t$ is the sampling time or frame index; $I_t$ is the $t$-th frame RGB image; $\mathbb{R}$ is the real number field; $H$ is the image height; $W$ is the image width; and 3 denotes the three RGB channels. The radar point cloud is denoted as $P_t = \{p_i^t\}_{i=1}^{N_t}$, where $P_t$ is the radar point cloud set of the $t$-th frame; $p_i^t$ is the $i$-th radar point; $N_t$ is the number of points in the $t$-th frame; $\{\cdot\}$ denotes a set; and the subscript $i$ is the point index. Each radar point is represented in polar coordinates as $p_i^t = (r_i, \theta_i, \phi_i, v_{r,i}, s_i)$, where $r_i$ is the distance from the point to the radar; $\theta_i$ is the azimuth; $\phi_i$ is the pitch angle; $v_{r,i}$ is the radial velocity; and $s_i$ is the echo intensity or signal-to-noise ratio.

[0006] As described above, in the radar-guided RGB multimodal UAV target detection method for ground-based anti-UAV scenarios, in step S2 let the camera sampling time be $t_c$ and the radar sampling time be $t_r$. The time difference is defined as $\Delta t = t_c - t_r$, where $\Delta t$ is the time deviation between the camera and radar sampling times; $t_c$ is the camera timestamp; and $t_r$ is the radar timestamp. First-order distance compensation using the radial velocity is $r_i' = r_i + v_{r,i}\,\Delta t$, where $r_i'$ is the distance after compensation; $r_i$ is the distance before compensation; $v_{r,i}$ is the radial velocity measured by the radar; and $\Delta t$ is the time deviation. Platform pose changes are obtained and compensated using a homogeneous transformation matrix. Let the pose transformation of the radar from time $t_r$ to time $t_c$ be $T_{t_r \to t_c}$; then $\tilde{p}_i' = T_{t_r \to t_c}\,\tilde{p}_i$, with $T_{t_r \to t_c} \in SE(3)$ and $\tilde{p}_i = [x_i, y_i, z_i, 1]^{\top}$, where $T_{t_r \to t_c}$ is the $4 \times 4$ homogeneous matrix of the rigid-body pose transformation; $SE(3)$ is the three-dimensional rigid-body motion group; $\tilde{p}_i$ is the homogeneous coordinate of point $i$ in the radar coordinate system; $\tilde{p}_i'$ is the homogeneous coordinate after motion compensation; and the superscript $\top$ denotes transpose. When the radar outputs data in polar coordinates, it must first be converted to Cartesian coordinates in the radar coordinate system: $x_i = r_i \cos\phi_i \cos\theta_i$, $y_i = r_i \cos\phi_i \sin\theta_i$, $z_i = r_i \sin\phi_i$, where $(x_i, y_i, z_i)$ are the three-dimensional coordinates of point $i$ in the radar coordinate system, and $\cos$ and $\sin$ are trigonometric functions.

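[0007] As described above, in the radar-guided RGB multimodal UAV target detection method for ground-based anti-UAV scenarios, step S3 sets the extrinsic parameters from the radar coordinate system to the camera coordinate system as a rotation matrix $R$ and a translation vector $\mathbf{t}$; then $p_i^{c} = R\,p_i^{r} + \mathbf{t}$, where $p_i^{c} = (x_i^c, y_i^c, z_i^c)^{\top}$ is the three-dimensional coordinate of point $i$ in the camera coordinate system; $p_i^{r}$ is the three-dimensional coordinate of point $i$ in the radar coordinate system; $R$ is the rotation matrix; and $\mathbf{t}$ is the translation vector. The camera intrinsic parameter matrix is $K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$, where $K$ is the camera intrinsic parameter matrix; $f_x$, $f_y$ are the focal lengths in the horizontal and vertical directions; and $(c_x, c_y)$ are the principal point coordinates. Pixel projection satisfies $s\,[u_i, v_i, 1]^{\top} = K\,p_i^{c}$, where $(u_i, v_i)$ are the pixel coordinates of point $i$ on the image; $s$ is the scale factor, equal to the depth in the camera coordinate system, $s = z_i^c$; and $z_i^c$ is the depth of the point from the camera. From the above formula, $u_i = f_x\,x_i^n + c_x$ and $v_i = f_y\,y_i^n + c_y$, where $x_i^n = x_i^c / z_i^c$ and $y_i^n = y_i^c / z_i^c$ are the normalized imaging coordinates. The projection points that satisfy $z_i^c > 0$ and $u_i \in [0, W)$, $v_i \in [0, H)$ form the radar projection point set $\Omega_t$, where $\Omega_t$ is the set of radar points projected onto the image plane.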

[0008] The radar-guided RGB multimodal UAV target detection method for ground-based anti-UAV scenarios described above includes the following specific steps in S4:
S4-1: Clutter removal. The radar projection points are filtered, and points that meet any of the following conditions are identified as clutter and removed: $|v_{r,i}| < \tau_v$, $s_i < \tau_s$, or $r_i > r_{\max}$, where $|\cdot|$ is the absolute value; $\tau_v$ is the radial velocity threshold; $\tau_s$ is the echo intensity threshold; and $r_{\max}$ is the upper limit threshold for distance.
S4-2: Pixel weight modeling. A weight $w_i$ is constructed for each retained point from a normalized mapping of its echo intensity, a normalized mapping of its velocity magnitude, and an exponential distance-attenuation term, where $w_i$ is the fusion weight of point $i$; the combination uses non-negative weighting coefficients; $g_s(\cdot)$ is the normalized mapping function for the intensity $s_i$; $g_v(\cdot)$ is the normalized mapping function for the velocity magnitude $|v_{r,i}|$; $\exp(\cdot)$ is the exponential function; and $r_0$ is the distance attenuation scale parameter.
S4-3: Kernel diffusion to generate a dense guidance map. A dense radar prior guidance map is constructed using the maximum response form $G_t(u, v) = \max_i\, w_i \exp\!\left(-\frac{(u - u_i)^2 + (v - v_i)^2}{2\sigma^2}\right)$, where $G_t(u, v)$ is the value of the $t$-th frame radar prior guidance map at pixel $(u, v)$; $\max_i$ takes the maximum over all points $i$; $\sigma$ is the kernel diffusion radius; $(u_i, v_i)$ are the projected pixel coordinates of point $i$; and $(u - u_i)^2 + (v - v_i)^2$ is the squared Euclidean distance in the pixel plane.
S4-4: Multi-channel prior. The density channel is $D_t(u, v) = \sum_i \exp\!\left(-\frac{(u - u_i)^2 + (v - v_i)^2}{2\sigma^2}\right)$, where $D_t(u, v)$ is the point cloud density estimate at pixel $(u, v)$. The velocity channel is $V_t(u, v) = \frac{\sum_i v_{r,i} \exp\!\left(-\frac{(u - u_i)^2 + (v - v_i)^2}{2\sigma^2}\right)}{D_t(u, v) + \epsilon}$, where $V_t(u, v)$ is the radial velocity estimate at pixel $(u, v)$ and $\epsilon$ is a very small positive number used to avoid a zero denominator. Together these form the multi-channel radar prior $M_t(u, v) = [G_t(u, v), D_t(u, v), V_t(u, v)]^{\top}$, where $M_t(u, v)$ is the multi-channel prior vector at pixel $(u, v)$.
S4-5: Normalization. Min-max normalization is applied to the guidance map: $\hat{G}_t = \frac{G_t - \min(G_t)}{\max(G_t) - \min(G_t) + \epsilon}$, where $\hat{G}_t$ is the normalized guidance map; $\min(G_t)$ and $\max(G_t)$ are the minimum and maximum of $G_t$ over all pixels; and $\epsilon$ is a very small positive number used to avoid a zero denominator.

[0009] As described above, in the radar-guided RGB multimodal UAV target detection method for ground-based anti-UAV scenarios, S5 maps the radar prior guidance representation into a gain coefficient map and / or a gating map, and performs element-wise modulation and fusion on the multi-scale visual features. Let the visual feature output by the backbone at the $l$-th scale be $F^{(l)} \in \mathbb{R}^{C_l \times H_l \times W_l}$, where $F^{(l)}$ is the feature tensor at the $l$-th scale; $C_l$ is the number of channels; $H_l$ and $W_l$ are the height and width of the feature map at this scale; and the superscript $(l)$ is the scale index. Downsampling / aligning the radar prior to the same scale yields $M^{(l)} \in \mathbb{R}^{C_m \times H_l \times W_l}$, where $M^{(l)}$ is the radar prior feature aligned to the same scale as $F^{(l)}$, and $C_m$ is the number of radar prior feature channels. A gating / gain coefficient map is generated using a mapping function: $A^{(l)} = \sigma\!\left(\mathrm{Conv}_{1 \times 1}(M^{(l)})\right)$, where $\mathrm{Conv}_{1 \times 1}$ is a $1 \times 1$ convolution, or an equivalent linear mapping, used for channel alignment; $\sigma(\cdot)$ is the activation function; and $A^{(l)}$ is a gate / gain tensor with the same shape as the visual feature. Element-wise gain modulation is then applied: $\tilde{F}^{(l)} = F^{(l)} \odot \left(\mathbf{1} + \gamma A^{(l)}\right)$, where $\tilde{F}^{(l)}$ is the modulated visual feature; $\odot$ is element-wise multiplication; $\mathbf{1}$ is an all-ones tensor of the same shape as $A^{(l)}$; and $\gamma$ is the gain intensity coefficient.

[0010] As described above, in the radar-guided RGB multimodal UAV target detection method for ground-based anti-UAV scenarios, step S5 converts the radar prior guidance representation into an attention bias term and introduces it into the self-attention calculation and / or cross-attention calculation of the target detection network. The features are flattened into a token sequence $X \in \mathbb{R}^{N \times d}$, where $X$ is the token sequence; $N$ is the number of tokens; and $d$ is the token feature dimension. Standard scaled dot-product attention is $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V$, where $Q$, $K$, $V$ are the query, key, and value matrices, respectively; $K^{\top}$ is the transpose of the key matrix; $\sqrt{d}$ is the scaling factor; $\mathrm{softmax}(\cdot)$ applies Softmax normalization to the matrix rows; and $\mathrm{Attn}(\cdot)$ is the attention output. To introduce radar priors, a radar bias is constructed for each token. Let the normalized radar prior mean within the pixel region corresponding to the $j$-th token be $m_j$; then $b_j = \eta\, m_j$, where $b_j$ is the radar bias scalar for the $j$-th token; $\eta$ is the bias coefficient; and $m_j$ is the radar prior statistic of the region corresponding to the token. The bias term is added to the attention weight calculation: $\mathrm{Attn}_r(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}} + \mathbf{1} b^{\top}\right) V$, where $\mathrm{Attn}_r(\cdot)$ is the attention output after incorporating radar priors; $b = [b_1, \dots, b_N]^{\top}$ is the bias vector; $\mathbf{1}$ is a vector consisting entirely of 1s; and $\mathbf{1} b^{\top}$ broadcasts the bias into a bias matrix with the same shape as the attention score matrix.

[0011] As described above, in the radar-guided RGB multimodal UAV target detection method for ground-based anti-UAV scenarios, in S6 let the initial query count of the decoder be $N_q$, with the $i$-th query denoted $q_i$. For the $i$-th candidate, the decoder outputs a class probability vector $\mathbf{p}_i$ and a bounding box $\mathbf{b}_i$. The class probability vector is $\mathbf{p}_i = [p_{i,1}, \dots, p_{i,C}]$, where $C$ is the number of categories and $p_{i,c}$ is the probability that the $i$-th candidate belongs to the $c$-th class. The bounding box is $\mathbf{b}_i = (x_i^{b}, y_i^{b}, w_i^{b}, h_i^{b})$, where $(x_i^{b}, y_i^{b})$ are the coordinates of the center point of the candidate box, and $w_i^{b}$, $h_i^{b}$ are its width and height. Take the set of pixels corresponding to the candidate bounding box region as $\mathcal{B}_i$ and define the radar consistency score $c_i = \frac{1}{|\mathcal{B}_i|} \sum_{(u, v) \in \mathcal{B}_i} \hat{G}_t(u, v)$, where $c_i$ is the radar consistency score of the $i$-th candidate; $\mathcal{B}_i$ is the set of pixel coordinates covered by candidate box $i$; $|\mathcal{B}_i|$ is the number of elements in the set; the sum runs over the pixels within the candidate bounding box; and $\hat{G}_t$ is the normalized radar prior guidance map. The classification uncertainty is $U_i = -\sum_{c=1}^{C} p_{i,c} \ln(p_{i,c} + \epsilon)$, where $U_i$ is the classification uncertainty of the $i$-th candidate; $\ln$ is the natural logarithm; and $\epsilon$ is a very small positive number that prevents $\ln 0$. For overall scoring and Top-K selection, the overall score is defined as $S_i = c_i - \mu\, U_i$, where $S_i$ is the overall score of the $i$-th candidate and $\mu$ is the uncertainty penalty coefficient. The Top-K index set with the highest scores is selected: $\mathcal{Q} = \mathrm{TopK}\!\left(\{S_i\}_{i=1}^{N_q}, K\right)$, where $\mathcal{Q}$ is the set of retained query indexes; $\{S_i\}_{i=1}^{N_q}$ is the score sequence from $1$ to $N_q$; $\mathrm{TopK}(\cdot)$ is the operator that returns the indices of the $K$ largest elements; $N_q$ is the initial number of queries; and $K$ is the number of retained queries, $K \le N_q$.

[0012] The radar-guided RGB multimodal UAV target detection method for ground-based anti-UAV scenarios, as described above, includes the following specific steps in S7:
S7-1: Radial velocity consistency. For candidate $i$, the radial velocity statistic of the radar points within the candidate box region is $\bar{v}_i^{\,\mathrm{rad}}$. A velocity vector is obtained by estimating the target motion from the visual side, and the visual radial velocity component $v_i^{\,\mathrm{vis}}$ is obtained by projecting it onto the line of sight. The speed difference is defined as $\Delta v_i = \left|\bar{v}_i^{\,\mathrm{rad}} - v_i^{\,\mathrm{vis}}\right|$, where $\Delta v_i$ is the difference in radial velocity between radar and vision; $\bar{v}_i^{\,\mathrm{rad}}$ is the candidate's radar radial velocity statistic; $v_i^{\,\mathrm{vis}}$ is the visually estimated radial velocity of the corresponding candidate; and $|\cdot|$ is the absolute value.
S7-2: Trajectory smoothness. Suppose the center position of the same target over $T$ consecutive frames is $\mathbf{c}_k$, $k = 1, \dots, T$. The second-order difference is defined as $\mathbf{a}_k = \mathbf{c}_{k+1} - 2\mathbf{c}_k + \mathbf{c}_{k-1}$, where $\mathbf{c}_k$ is the target center point in the $k$-th frame; $\mathbf{a}_k$ is the second-order difference vector, reflecting the trajectory's "jitter / acceleration change"; and the coefficients $1, -2, 1$ are the weights in the second-order difference. The trajectory jitter index is $J = \frac{1}{T - 2} \sum_{k=2}^{T-1} \|\mathbf{a}_k\|_2$, where $J$ is the trajectory jitter / curvature indicator; $\|\cdot\|_2$ is the L2 norm; $\frac{1}{T-2}$ is the averaging coefficient; and $T$ is the number of frames in the timing window.
S7-3: Point cloud statistical stability. Let the number of radar points within the candidate box area in frame $k$ be $n_k$. The coefficient of variation is $CV = \frac{\mathrm{std}(\{n_k\})}{\mathrm{mean}(\{n_k\}) + \epsilon}$, where $CV$ is the point count stability index; $\mathrm{std}(\cdot)$ is the standard deviation function; $\mathrm{mean}(\cdot)$ is the mean function; $\{n_k\}$ is the point count sequence over $T$ consecutive frames; and $\epsilon$ is a very small positive number used to avoid a zero denominator.
S7-4: Bird flock identification and suppression. The flock score is constructed as $B = \omega_1\, \mathbb{1}[\Delta v > \tau_{\Delta v}] + \omega_2\, \mathbb{1}[J > \tau_J] + \omega_3\, \mathbb{1}[CV > \tau_{CV}]$, where $B$ is the bird flock score; $\omega_1, \omega_2, \omega_3$ are weighting coefficients; $\mathbb{1}[\cdot]$ is the indicator function, which takes the value 1 if the condition is true and 0 otherwise; and $\tau_{\Delta v}, \tau_J, \tau_{CV}$ are the corresponding thresholds. When $B \ge \tau_B$, suppression is performed, and the confidence decay expression is $\mathrm{conf}' = \mathrm{conf} \cdot \exp(-\kappa B)$, where $\mathrm{conf}'$ is the confidence after suppression; $\mathrm{conf}$ is the confidence before suppression; $\kappa$ is the attenuation coefficient; and $\exp(\cdot)$ is the exponential function. When $\mathrm{conf}' < \tau_c$, the candidate is filtered out, where $\tau_c$ is the confidence threshold.
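The bird flock suppression in S7 can be illustrated with a minimal sketch. It assumes that each indicator fires when its cue exceeds its threshold, and that suppression multiplies the confidence by an exponential decay; the function names, default thresholds, and weights are hypothetical and not taken from the patent.

```python
import math

def trajectory_jitter(centers):
    """Mean L2 norm of second-order differences c[k+1] - 2*c[k] + c[k-1]."""
    if len(centers) < 3:
        raise ValueError("need at least 3 frames for a second-order difference")
    norms = []
    for k in range(1, len(centers) - 1):
        ax = centers[k + 1][0] - 2 * centers[k][0] + centers[k - 1][0]
        ay = centers[k + 1][1] - 2 * centers[k][1] + centers[k - 1][1]
        norms.append(math.hypot(ax, ay))
    return sum(norms) / len(norms)

def coefficient_of_variation(counts, eps=1e-6):
    """std / (mean + eps) of the per-frame radar point counts."""
    mean = sum(counts) / len(counts)
    var = sum((c - mean) ** 2 for c in counts) / len(counts)
    return math.sqrt(var) / (mean + eps)

def flock_score(dv, jitter, cv, thresholds, weights):
    """Weighted sum of indicators; assumed direction: cue > threshold."""
    cues = (dv, jitter, cv)
    return sum(w * (1.0 if c > t else 0.0)
               for c, t, w in zip(cues, thresholds, weights))

def suppress(conf, score, score_thr=0.5, kappa=1.0):
    """Decay the confidence exponentially once the flock score reaches score_thr."""
    if score >= score_thr:
        return conf * math.exp(-kappa * score)
    return conf
```

A candidate moving in a straight line with a steady point count yields zero jitter and zero coefficient of variation, so only a radar/vision velocity mismatch can raise its flock score.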

[0013] The radar-guided RGB multimodal UAV target detection method for ground-based anti-UAV scenarios, as described above, uses the following training loss function for the training operation in S7. The total loss is defined as $L = L_{\mathrm{det}} + \lambda_1 L_{\mathrm{rc}} + \lambda_2 L_{\mathrm{bird}}$, where $L$ is the total loss; $L_{\mathrm{det}}$ is the basic detection loss; $L_{\mathrm{rc}}$ is the radar consistency constraint loss; $L_{\mathrm{bird}}$ is the bird flock suppression loss; and $\lambda_1$, $\lambda_2$ are the loss weighting coefficients. The radar consistency constraint loss is $L_{\mathrm{rc}} = -\frac{1}{N} \sum_{i=1}^{N} \ln(c_i + \epsilon)$, where $N$ is the number of candidates or positive samples involved in the calculation; $c_i$ is the radar consistency score of the $i$-th candidate; $\ln$ is the natural logarithm; and $\epsilon$ is a very small positive number.
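As a rough illustration of the loss described above, the following sketch assumes the radar consistency loss is the mean negative log of the consistency scores and that the total loss is a weighted sum of the three terms. The function and parameter names are hypothetical, and the detection and bird flock loss values are passed in as plain numbers rather than computed.

```python
import math

def radar_consistency_loss(scores, eps=1e-9):
    """Mean negative log of radar consistency scores (each assumed in [0, 1])."""
    return -sum(math.log(c + eps) for c in scores) / len(scores)

def total_loss(l_det, l_rc, l_bird, lam_rc=1.0, lam_bird=1.0):
    """Weighted sum: detection loss + radar consistency + bird flock terms."""
    return l_det + lam_rc * l_rc + lam_bird * l_bird
```

Under this form, candidates whose boxes sit on strong radar prior regions (scores near 1) contribute almost nothing, while boxes with little radar support are penalized heavily.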

[0014] The advantages of this invention are as follows. It constructs radar point clouds as pixel-aligned prior guidance representations, enabling radar information to participate in detection in an interpretable form and improving the detectability of small targets at long distances. It integrates radar prior guidance throughout feature extraction, cross-scale encoding interaction, and decoding query decision-making, achieving multi-location guidance within the network and improving robustness in complex backgrounds. It uses radar velocity measurement together with visual temporal consistency to construct a bird flock suppression mechanism, reducing the false alarm rate caused by bird flocks, and is suitable for real-time anti-drone applications.

Attached Figure Description

[0015] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0016] Figure 1 is a flowchart of the present invention; Figure 2 is a schematic diagram of radar point cloud coordinate transformation and pixel projection according to the present invention; Figure 3 is a schematic diagram of the radar prior guidance representation generation process of the present invention; Figure 4 is a schematic diagram of the network structure in which the radar prior guidance is injected into the target detection network; Figure 5 is a schematic diagram of the bird flock suppression process of the present invention; Figure 6 is a diagram of the radar and RGB data acquisition and input operation interface of Embodiment 1 of the present invention; Figure 7 is a diagram of the radar and RGB multimodal data processing and fusion operation interface of Embodiment 2 of the present invention; Figure 8 is a diagram of the UAV target detection and result output operation interface of Embodiment 3 of the present invention.

Detailed Implementation

[0017] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0018] As shown in Figures 1-5, the radar-guided RGB multimodal UAV target detection method for ground-based anti-UAV scenarios includes the following steps: S1: Acquire RGB images and corresponding radar point cloud data within the same monitoring area, and establish timestamp information for the RGB images and radar point cloud data; S2: Based on the timestamp information, time-align the RGB image and radar point cloud data, and perform motion compensation on the radar point cloud data to obtain aligned point cloud data; S3: Based on the calibration parameters of the radar and camera, perform coordinate transformation on the aligned point cloud data and project it onto the pixel plane of the RGB image to obtain the radar projection point set; S4: Preprocess the radar projection point set and construct a radar prior guidance representation aligned with the RGB image pixels; S5: Input the RGB image into the target detection network to extract multi-scale visual features, and inject the radar prior guidance representation into the target detection network, thereby guiding the fusion of multi-scale visual features in the feature extraction stage or the encoding interaction stage to obtain fused features; S6: Based on the radar prior guidance representation, filter the decoding queries of the target detection network, and input the filtered decoding queries and fused features into the decoder to output the category and bounding box of each candidate target, obtaining the candidate detection results; S7: Perform bird flock suppression processing based on the radar-visual consistency features corresponding to the candidate detection results, and output the final UAV target detection results.

[0019] Preferably, in this embodiment, S1 acquires, in a ground-based anti-drone scenario, the RGB image at time $t$ and the corresponding millimeter-wave radar point cloud data. The RGB image is denoted as $I_t \in \mathbb{R}^{H \times W \times 3}$, where $t$ is the sampling time or frame index; $I_t$ is the $t$-th frame RGB image; $\mathbb{R}$ is the real number field; $H$ is the image height (in pixels); $W$ is the image width (in pixels); and 3 denotes the three RGB channels. The radar point cloud is denoted as $P_t = \{p_i^t\}_{i=1}^{N_t}$, where $P_t$ is the radar point cloud set of the $t$-th frame; $p_i^t$ is the $i$-th radar point; $N_t$ is the number of points in the $t$-th frame; $\{\cdot\}$ denotes a set; and the subscript $i$ is the point index. Each radar point is represented in polar coordinates as $p_i^t = (r_i, \theta_i, \phi_i, v_{r,i}, s_i)$, where $r_i$ is the distance from the point to the radar (unit: meters); $\theta_i$ is the azimuth angle (horizontal angle); $\phi_i$ is the pitch angle (vertical angle); $v_{r,i}$ is the radial velocity (velocity along the radar line of sight, unit: meters per second); and $s_i$ is the echo intensity or signal-to-noise ratio (dimensionless or in device units).

[0020] Preferably, in this embodiment, in S2 let the camera sampling time be $t_c$ and the radar sampling time be $t_r$. The time difference is defined as $\Delta t = t_c - t_r$, where $\Delta t$ is the time difference between the camera and radar sampling times (unit: seconds); $t_c$ is the camera timestamp; and $t_r$ is the radar timestamp. First-order distance compensation using the radial velocity is $r_i' = r_i + v_{r,i}\,\Delta t$, where $r_i'$ is the distance after compensation; $r_i$ is the distance before compensation; $v_{r,i}$ is the radial velocity measured by the radar; and $\Delta t$ is the time deviation. Platform pose changes are acquired (e.g., from the turntable angle or an IMU) and compensated using a homogeneous transformation matrix. Let the pose transformation of the radar from time $t_r$ to time $t_c$ be $T_{t_r \to t_c}$; then $\tilde{p}_i' = T_{t_r \to t_c}\,\tilde{p}_i$, with $T_{t_r \to t_c} \in SE(3)$ and $\tilde{p}_i = [x_i, y_i, z_i, 1]^{\top}$, where $T_{t_r \to t_c}$ is the $4 \times 4$ homogeneous matrix of the rigid-body pose transformation (rotation + translation); $SE(3)$ is the three-dimensional rigid-body motion group; $\tilde{p}_i$ is the homogeneous coordinate of point $i$ in the radar coordinate system; $\tilde{p}_i'$ is the homogeneous coordinate after motion compensation; and the superscript $\top$ denotes transpose. When the radar outputs data in polar coordinates, it must first be converted to Cartesian coordinates in the radar coordinate system: $x_i = r_i \cos\phi_i \cos\theta_i$, $y_i = r_i \cos\phi_i \sin\theta_i$, $z_i = r_i \sin\phi_i$, where $(x_i, y_i, z_i)$ are the three-dimensional coordinates of point $i$ in the radar coordinate system, and $\cos$ and $\sin$ are trigonometric functions.
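The time alignment and motion compensation of S2 can be sketched as follows. This is a minimal illustration under stated assumptions: positive radial velocity is taken to mean a receding target, the time offset is camera time minus radar time, and the polar-to-Cartesian convention (x forward, azimuth in the horizontal plane, pitch measured from it) is one common choice, not necessarily the patent's. All function names are hypothetical.

```python
import math

def compensate_range(r, v_r, t_cam, t_radar):
    """First-order range compensation: r' = r + v_r * (t_cam - t_radar)."""
    return r + v_r * (t_cam - t_radar)

def polar_to_cartesian(r, azimuth, pitch):
    """Convert (range, azimuth, pitch) in radians to (x, y, z) in the radar frame."""
    x = r * math.cos(pitch) * math.cos(azimuth)
    y = r * math.cos(pitch) * math.sin(azimuth)
    z = r * math.sin(pitch)
    return (x, y, z)

def apply_rigid_transform(T, p):
    """Apply a 4x4 homogeneous transform T (nested lists) to a 3D point p."""
    ph = (p[0], p[1], p[2], 1.0)
    return tuple(sum(T[i][j] * ph[j] for j in range(4)) for i in range(3))
```

For example, a target at 100 m closing at 10 m/s (radial velocity of -10 m/s under the assumed sign) with a 50 ms camera lag is moved to 99.5 m before projection.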

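[0021] Preferably, in this embodiment, S3 sets the extrinsic parameters from the radar coordinate system to the camera coordinate system as a rotation matrix $R$ and a translation vector $\mathbf{t}$; then $p_i^{c} = R\,p_i^{r} + \mathbf{t}$, where $p_i^{c} = (x_i^c, y_i^c, z_i^c)^{\top}$ is the three-dimensional coordinate of point $i$ in the camera coordinate system; $p_i^{r}$ is the three-dimensional coordinate of point $i$ in the radar coordinate system; $R$ is the rotation matrix; and $\mathbf{t}$ is the translation vector. The camera intrinsic parameter matrix is $K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$, where $K$ is the camera intrinsic parameter matrix; $f_x$, $f_y$ are the focal lengths in the horizontal and vertical directions (in pixels); and $(c_x, c_y)$ are the principal point coordinates (the projection of the optical center in the pixel coordinate system). Pixel projection satisfies $s\,[u_i, v_i, 1]^{\top} = K\,p_i^{c}$, where $(u_i, v_i)$ are the pixel coordinates (horizontal and vertical) of point $i$ on the image; $s$ is the scale factor, equal to the depth in the camera coordinate system, $s = z_i^c$; and $z_i^c$ is the depth of the point from the camera. From the above formula, $u_i = f_x\,x_i^n + c_x$ and $v_i = f_y\,y_i^n + c_y$, where $x_i^n = x_i^c / z_i^c$ and $y_i^n = y_i^c / z_i^c$ are the normalized imaging coordinates. The projection points that satisfy $z_i^c > 0$ and $u_i \in [0, W)$, $v_i \in [0, H)$ form the radar projection point set $\Omega_t$, where $\Omega_t$ is the set of radar points projected onto the image plane.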
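The radar-to-pixel projection of S3 follows the standard pinhole camera model and can be sketched as below. The function name and argument layout are hypothetical; the validity test (positive depth and pixel inside the image) is an assumption about which points are kept in the projection set.

```python
def project_point(p_radar, R, t, fx, fy, cx, cy, width, height):
    """Project a 3D radar-frame point to pixel coordinates.

    R is a 3x3 rotation (nested lists) and t a 3-vector, both radar -> camera.
    Returns (u, v, depth), or None if the point is behind the camera or
    falls outside the image.
    """
    # Rigid transform into the camera frame: p_c = R @ p_r + t
    x_c, y_c, z_c = (
        sum(R[i][j] * p_radar[j] for j in range(3)) + t[i] for i in range(3)
    )
    if z_c <= 0:
        return None
    # Pinhole projection using normalized coordinates x_c/z_c and y_c/z_c
    u = fx * x_c / z_c + cx
    v = fy * y_c / z_c + cy
    if not (0 <= u < width and 0 <= v < height):
        return None
    return (u, v, z_c)
```

With identity extrinsics, a 1000-pixel focal length, and a principal point at (640, 360), a point at (1.0, 0.5, 10.0) m lands at pixel (740, 410).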

[0022] Preferably, the specific operation of S4 in this embodiment includes the following steps:
S4-1: Clutter removal. The radar projection points are filtered, and points that meet any of the following conditions are identified as clutter and removed: $|v_{r,i}| < \tau_v$, $s_i < \tau_s$, or $r_i > r_{\max}$, where $|\cdot|$ is the absolute value; $\tau_v$ is the radial velocity threshold; $\tau_s$ is the echo intensity threshold; and $r_{\max}$ is the upper limit threshold for distance.
S4-2: Pixel weight modeling. A weight $w_i$ is constructed for each retained point from a normalized mapping of its echo intensity, a normalized mapping of its velocity magnitude, and an exponential distance-attenuation term, where $w_i$ is the fusion weight of point $i$ (reflecting how likely the point is to belong to the target area); the combination uses non-negative weighting coefficients; $g_s(\cdot)$ is the normalized mapping function for the intensity $s_i$; $g_v(\cdot)$ is the normalized mapping function for the velocity magnitude $|v_{r,i}|$; $\exp(\cdot)$ is the exponential function; and $r_0$ is the distance attenuation scale parameter (unit: meters).
S4-3: Kernel diffusion to generate a dense guidance map. A dense radar prior guidance map $G_t(u, v)$ (pixel coordinates $(u, v)$, prior strength $G_t$) is constructed using the maximum response form $G_t(u, v) = \max_i\, w_i \exp\!\left(-\frac{(u - u_i)^2 + (v - v_i)^2}{2\sigma^2}\right)$, where $G_t(u, v)$ is the value of the $t$-th frame radar prior guidance map at pixel $(u, v)$; $\max_i$ takes the maximum over all points $i$; $\sigma$ is the kernel diffusion radius (or Gaussian kernel standard deviation, in pixels); $(u_i, v_i)$ are the projected pixel coordinates of point $i$; and $(u - u_i)^2 + (v - v_i)^2$ is the squared Euclidean distance in the pixel plane.
S4-4: Multi-channel prior. The density channel is $D_t(u, v) = \sum_i \exp\!\left(-\frac{(u - u_i)^2 + (v - v_i)^2}{2\sigma^2}\right)$, where $D_t(u, v)$ is the point cloud density estimate at pixel $(u, v)$. The velocity channel (a density-normalized, velocity-weighted average) is $V_t(u, v) = \frac{\sum_i v_{r,i} \exp\!\left(-\frac{(u - u_i)^2 + (v - v_i)^2}{2\sigma^2}\right)}{D_t(u, v) + \epsilon}$, where $V_t(u, v)$ is the radial velocity estimate at pixel $(u, v)$ and $\epsilon$ is a very small positive number used to avoid a zero denominator. Together these form the multi-channel radar prior $M_t(u, v) = [G_t(u, v), D_t(u, v), V_t(u, v)]^{\top}$, where $M_t(u, v)$ is the multi-channel prior vector at pixel $(u, v)$.
S4-5: Normalization. Min-max normalization is applied to the guidance map: $\hat{G}_t = \frac{G_t - \min(G_t)}{\max(G_t) - \min(G_t) + \epsilon}$, where $\hat{G}_t$ is the normalized guidance map; $\min(G_t)$ and $\max(G_t)$ are the minimum and maximum of $G_t$ over all pixels; and $\epsilon$ is a very small positive number used to avoid a zero denominator.
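The kernel diffusion and min-max normalization of S4-3 and S4-5 can be sketched as follows. This illustration assumes a Gaussian kernel (the text allows the diffusion radius to be read as a Gaussian standard deviation) and takes the per-point weights as already computed; the function name and default values are hypothetical, and the brute-force loop is for clarity only.

```python
import math

def radar_guide_map(points, height, width, sigma=2.0, eps=1e-6):
    """Build a normalized dense guidance map from sparse projected points.

    points: list of (u, v, w) with pixel coordinates (u, v) and weight w.
    Each pixel takes the maximum Gaussian-diffused response over all points,
    then the map is min-max normalized to approximately [0, 1].
    """
    guide = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            best = 0.0
            for (u, v, w) in points:
                d2 = (x - u) ** 2 + (y - v) ** 2
                best = max(best, w * math.exp(-d2 / (2.0 * sigma ** 2)))
            guide[y][x] = best
    flat = [g for row in guide for g in row]
    g_min, g_max = min(flat), max(flat)
    return [[(g - g_min) / (g_max - g_min + eps) for g in row] for row in guide]
```

Using the maximum rather than the sum keeps a dense cluster of returns from saturating the map, so a single strong point and a tight group both produce a bounded peak.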

[0023] Preferably, in this embodiment, S5 maps the radar prior guidance representation to a gain coefficient map and / or a gating map, and performs element-wise modulation and fusion on the multi-scale visual features. Let the visual feature output by the backbone at the $l$-th scale be $F^{(l)} \in \mathbb{R}^{C_l \times H_l \times W_l}$, where $F^{(l)}$ is the feature tensor at the $l$-th scale; $C_l$ is the number of channels; $H_l$ and $W_l$ are the height and width of the feature map at this scale; and the superscript $(l)$ is the scale index. Downsampling / aligning the radar prior to the same scale yields $M^{(l)} \in \mathbb{R}^{C_m \times H_l \times W_l}$, where $M^{(l)}$ is the radar prior feature aligned to the same scale as $F^{(l)}$, and $C_m$ is the number of radar prior feature channels ($C_m = 1$ for the single-channel guidance map, or the channel count of the multi-channel prior). A gating / gain coefficient map is generated using a mapping function: $A^{(l)} = \sigma\!\left(\mathrm{Conv}_{1 \times 1}(M^{(l)})\right)$, where $\mathrm{Conv}_{1 \times 1}$ is a $1 \times 1$ convolution, or an equivalent linear mapping, used for channel alignment; $\sigma(\cdot)$ is the activation function (optionally Sigmoid, which makes the output fall within the range $(0, 1)$); and $A^{(l)}$ is a gate / gain tensor with the same shape as the visual feature. Element-wise gain modulation is then applied: $\tilde{F}^{(l)} = F^{(l)} \odot \left(\mathbf{1} + \gamma A^{(l)}\right)$, where $\tilde{F}^{(l)}$ is the modulated visual feature; $\odot$ is element-wise multiplication; $\mathbf{1}$ is an all-ones tensor of the same shape as $A^{(l)}$; and $\gamma$ is the gain intensity coefficient (a non-negative real number).
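The gain modulation described above can be sketched in a few lines. This is a simplified illustration, not the patent's network: a scalar weight and bias stand in for the learned channel-alignment convolution, the prior is a single channel, and the residual form feature × (1 + gain × gate) is assumed. All names are hypothetical.

```python
import math

def sigmoid(x):
    """Logistic function; output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def modulate(feature, prior, weight=1.0, bias=0.0, gain=0.5):
    """Radar-gated residual modulation of a visual feature map.

    feature: C x H x W nested lists; prior: H x W nested lists at the same
    spatial scale. The scalar weight/bias is a stand-in for a learned 1x1
    convolution. Returns feature * (1 + gain * sigmoid(weight * prior + bias)).
    """
    gate = [[sigmoid(weight * m + bias) for m in row] for row in prior]
    return [
        [[f * (1.0 + gain * g) for f, g in zip(f_row, g_row)]
         for f_row, g_row in zip(channel, gate)]
        for channel in feature
    ]
```

The residual form never attenuates a feature below its original value, so regions with no radar return still pass through their visual evidence while radar-supported regions are amplified.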

[0024] Preferably, in this embodiment, S5 converts the radar prior guidance representation into an attention bias term and introduces it into the self-attention calculation and / or cross-attention calculation of the target detection network. The features are flattened into a token sequence $X \in \mathbb{R}^{N \times d}$, where $X$ is the token sequence; $N$ is the number of tokens (e.g., $N = H_l W_l$, or the number of tokens after multi-scale concatenation); and $d$ is the token feature dimension. Standard scaled dot-product attention is $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V$, where $Q$, $K$, $V$ are the query, key, and value matrices (usually of dimension $N \times d$, or the dimension of their linear transformation); $K^{\top}$ is the transpose of the key matrix; $\sqrt{d}$ is the scaling factor; $\mathrm{softmax}(\cdot)$ applies Softmax normalization to the matrix rows; and $\mathrm{Attn}(\cdot)$ is the attention output. To introduce radar priors, a radar bias is constructed for each token. Let the normalized radar prior mean within the pixel region corresponding to the $j$-th token be $m_j$; then $b_j = \eta\, m_j$, where $b_j$ is the radar bias scalar for the $j$-th token; $\eta$ is the bias coefficient; and $m_j$ is the radar prior statistic (such as the mean or maximum value) of the region corresponding to the token. The bias term is added to the attention weight calculation: $\mathrm{Attn}_r(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}} + \mathbf{1} b^{\top}\right) V$, where $\mathrm{Attn}_r(\cdot)$ is the attention output after incorporating radar priors; $b = [b_1, \dots, b_N]^{\top}$ is the bias vector; $\mathbf{1}$ is a vector consisting entirely of 1s; and $\mathbf{1} b^{\top}$ broadcasts the bias into a bias matrix with the same shape as the attention score matrix.
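The radar attention bias can be illustrated with a small pure-Python version of scaled dot-product attention in which one bias scalar per key token is added to every row of the score matrix before the softmax. The function name is hypothetical, and learned projections for queries, keys, and values are omitted.

```python
import math

def biased_attention(Q, K, V, bias):
    """Scaled dot-product attention with an additive per-key bias.

    Q: Nq x d, K: Nk x d, V: Nk x dv (nested lists); bias: length-Nk list.
    Computes softmax(Q K^T / sqrt(d) + 1 b^T) V row by row.
    """
    d = len(Q[0])
    outputs = []
    for q in Q:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) + b_j
            for k, b_j in zip(K, bias)
        ]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        )
    return outputs
```

Because the bias enters before the softmax, a bias of ln 3 on one key triples its unnormalized weight relative to an unbiased key with the same dot-product score.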

[0025] Preferably, in this embodiment, in S6 let the initial query count of the decoder be $N_q$, with the $i$-th query denoted $q_i$. For the $i$-th candidate, the decoder outputs a class probability vector $\mathbf{p}_i$ and a bounding box $\mathbf{b}_i$. The class probability vector is $\mathbf{p}_i = [p_{i,1}, \dots, p_{i,C}]$, where $C$ is the number of categories (including drones and background / other categories) and $p_{i,c}$ is the probability that the $i$-th candidate belongs to the $c$-th class. The bounding box is $\mathbf{b}_i = (x_i^{b}, y_i^{b}, w_i^{b}, h_i^{b})$, where $(x_i^{b}, y_i^{b})$ are the coordinates of the center point of the candidate box (which can be relative coordinates normalized to $[0, 1]$ or pixel coordinates), and $w_i^{b}$, $h_i^{b}$ are the candidate box width and height (which can also be in relative or pixel scale). Take the set of pixels corresponding to the candidate bounding box region as $\mathcal{B}_i$ and define the radar consistency score $c_i = \frac{1}{|\mathcal{B}_i|} \sum_{(u, v) \in \mathcal{B}_i} \hat{G}_t(u, v)$, where $c_i$ is the radar consistency score of the $i$-th candidate; $\mathcal{B}_i$ is the set of pixel coordinates covered by candidate box $i$; $|\mathcal{B}_i|$ is the number of elements in the set; the sum runs over the pixels within the candidate bounding box; and $\hat{G}_t$ is the normalized radar prior guidance map. The classification uncertainty is $U_i = -\sum_{c=1}^{C} p_{i,c} \ln(p_{i,c} + \epsilon)$, where $U_i$ is the classification uncertainty of the $i$-th candidate; $\ln$ is the natural logarithm; and $\epsilon$ is a very small positive number that prevents $\ln 0$. For overall scoring and Top-K selection, the overall score is defined as $S_i = c_i - \mu\, U_i$, where $S_i$ is the overall score of the $i$-th candidate and $\mu$ is the uncertainty penalty coefficient (a non-negative real number). The Top-K index set with the highest scores is selected: $\mathcal{Q} = \mathrm{TopK}\!\left(\{S_i\}_{i=1}^{N_q}, K\right)$, where $\mathcal{Q}$ is the set of retained query indexes; $\{S_i\}_{i=1}^{N_q}$ is the score sequence from $1$ to $N_q$; $\mathrm{TopK}(\cdot)$ is the operator that returns the indices of the $K$ largest elements; $N_q$ is the initial number of queries; and $K$ is the number of retained queries, $K \le N_q$.
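The query filtering of S6 can be sketched as below. The sketch assumes the overall score is the radar consistency score minus a penalty times the classification entropy, and it represents boxes by integer corner bounds (x0, y0, x1, y1) rather than center and size, purely for brevity. All names and defaults are hypothetical.

```python
import math

def box_consistency(guide, box):
    """Mean of the normalized guidance map inside box = (x0, y0, x1, y1), end-exclusive."""
    x0, y0, x1, y1 = box
    vals = [guide[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return sum(vals) / len(vals)

def entropy(probs, eps=1e-9):
    """Classification uncertainty: -sum(p * ln(p + eps))."""
    return -sum(p * math.log(p + eps) for p in probs)

def select_queries(guide, boxes, class_probs, k, mu=0.5):
    """Return (indices of the k best-scoring candidates, all scores)."""
    scores = [
        box_consistency(guide, b) - mu * entropy(p)
        for b, p in zip(boxes, class_probs)
    ]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k], scores
```

With equal class confidence, the candidate whose box overlaps the radar prior is retained first; with equal radar support, the more confident classification wins.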

[0026] Preferably, in this embodiment, if the model provides multiple predictions or a distribution regression, a localization uncertainty can also be defined: $U_i^{\mathrm{loc}} = \mathrm{Var}(B_i)$, where $U_i^{\mathrm{loc}}$ is the localization uncertainty of the $i$-th candidate, and $\mathrm{Var}(\cdot)$ is the variance operator (the variance of the box parameters can be computed over multiple predictions, or the variance of the distribution regression can be taken directly).
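As a small illustration of the multiple-prediction case, the sketch below computes the population variance of each box parameter over several predictions of the same candidate. Summing the four variances into one scalar is an assumption made here for convenience, since the text does not state how the per-parameter variances are combined; the function name and sample boxes are hypothetical.

```python
import statistics

def location_uncertainty(box_samples):
    """Variance of (x, y, w, h) over several predictions of one candidate.

    Returns the per-parameter population variances and their sum
    (the summation is an illustrative choice, not taken from the text).
    """
    per_param = [statistics.pvariance(vals) for vals in zip(*box_samples)]
    return per_param, sum(per_param)

stable = [(10.0, 10.0, 4.0, 4.0)] * 3
jittery = [(10.0, 10.0, 4.0, 4.0), (12.0, 10.0, 4.0, 4.0), (14.0, 10.0, 4.0, 4.0)]

_, u_stable = location_uncertainty(stable)
per_param_jittery, u_jittery = location_uncertainty(jittery)
```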

[0027] Preferably, the specific operation of S7 in this embodiment includes the following steps:
S7-1: Radial velocity consistency. For candidate $i$, a radial velocity statistic (such as the median) of the radar points within the candidate box region is taken as $v_i^{\mathrm{rad}}$. A velocity vector is obtained by estimating the target motion on the visual side, and the visual radial velocity component $v_i^{\mathrm{vis}}$ is obtained by projecting it onto the line of sight. The velocity difference is defined as $\Delta v_i = \left| v_i^{\mathrm{rad}} - v_i^{\mathrm{vis}} \right|$, where $\Delta v_i$ is the radial velocity difference between radar and vision; $v_i^{\mathrm{rad}}$ is the radar radial velocity statistic of the candidate; $v_i^{\mathrm{vis}}$ is the visually estimated radial velocity of the candidate; and $|\cdot|$ is the absolute value.
S7-2: Trajectory smoothness (temporal stability). Suppose the center positions of the same target over $T$ consecutive frames are $\{\mathbf{c}_t\}_{t=1}^{T}$. The second-order difference is defined as $\mathbf{a}_t = \mathbf{c}_{t+1} - 2\mathbf{c}_t + \mathbf{c}_{t-1}$, where $\mathbf{c}_t$ is the target center point in the $t$-th frame; $\mathbf{a}_t$ is the second-order difference vector, reflecting the "jitter / acceleration change" of the trajectory; and the coefficients $(1, -2, 1)$ are the weights in the second-order difference. The trajectory jitter index is defined as $J = \frac{1}{T-2} \sum_{t=2}^{T-1} \left\| \mathbf{a}_t \right\|_2$, where $J$ is the trajectory jitter / curvature index; $\|\cdot\|_2$ is the L2 norm; $\frac{1}{T-2}$ is the averaging coefficient; and $T$ is the number of frames in the temporal window (an integer, $T \ge 3$).
S7-3: Point cloud statistical stability. Let the number of radar points within the candidate box region in frame $t$ be $n_t$. The coefficient of variation is defined as $CV = \frac{\mathrm{std}\left(\{n_t\}_{t=1}^{T}\right)}{\mathrm{mean}\left(\{n_t\}_{t=1}^{T}\right) + \epsilon}$, where $CV$ is the coefficient of variation, used as the point-count stability index; $\mathrm{std}(\cdot)$ is the standard deviation function; $\mathrm{mean}(\cdot)$ is the mean function; $\{n_t\}_{t=1}^{T}$ is the point-count sequence over $T$ consecutive frames; and $\epsilon$ is a very small positive number that prevents the denominator from being 0.
S7-4: Bird flock identification and suppression. The flock score is constructed as $S^{\mathrm{bird}} = \omega_1 \mathbb{I}(\Delta v > \tau_v) + \omega_2 \mathbb{I}(J > \tau_J) + \omega_3 \mathbb{I}(CV > \tau_{CV})$, where $S^{\mathrm{bird}}$ is the flock score; $\omega_1$, $\omega_2$, and $\omega_3$ are weighting coefficients (non-negative real numbers); $\mathbb{I}(\cdot)$ is the indicator function, which takes the value 1 if the condition is true and 0 otherwise; and $\tau_v$, $\tau_J$, and $\tau_{CV}$ are the corresponding thresholds. When the flock score exceeds its threshold, $S^{\mathrm{bird}} > \tau_B$, suppression processing is performed, and the confidence decay expression is $\pi' = \pi \cdot \exp\left(-\kappa S^{\mathrm{bird}}\right)$, where $\pi'$ is the confidence after suppression; $\pi$ is the confidence before suppression (which can be taken as the classification confidence or the probability of the drone class); $\kappa$ is the attenuation coefficient (a non-negative real number); and $\exp(\cdot)$ is the exponential function. When $\pi' < \tau_p$, the candidate is filtered out, where $\tau_p$ is the confidence threshold.
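Steps S7-2 to S7-4 can be sketched in a few lines of pure Python. The sketch assumes that each cue votes for "bird" when it exceeds its threshold and that the confidence is decayed exponentially once the flock score exceeds a score threshold; every threshold, weight, and attenuation value below is a hypothetical placeholder rather than a value from the source.

```python
import math
import statistics

def trajectory_jitter(centers):
    """Mean L2 norm of the second-order differences of (x, y) centers (needs >= 3 frames)."""
    norms = []
    for prev, cur, nxt in zip(centers, centers[1:], centers[2:]):
        ax = nxt[0] - 2 * cur[0] + prev[0]
        ay = nxt[1] - 2 * cur[1] + prev[1]
        norms.append(math.hypot(ax, ay))
    return sum(norms) / len(norms)

def coefficient_of_variation(counts, eps=1e-6):
    """Standard deviation of per-frame radar point counts divided by their mean."""
    return statistics.pstdev(counts) / (statistics.mean(counts) + eps)

def flock_score(dv, jitter, cv, thresholds=(2.0, 3.0, 0.5), weights=(1.0, 1.0, 1.0)):
    """Weighted sum of indicator terms, one per consistency cue."""
    flags = (dv > thresholds[0], jitter > thresholds[1], cv > thresholds[2])
    return sum(w for w, f in zip(weights, flags) if f)

def suppress(conf, score, score_thr=1.5, kappa=0.8):
    """Exponentially decay the confidence when the flock score exceeds its threshold."""
    return conf * math.exp(-kappa * score) if score > score_thr else conf

# Drone-like candidate: straight track, stable point count, matching velocities.
drone_j = trajectory_jitter([(0, 0), (2, 1), (4, 2), (6, 3)])
drone_s = flock_score(abs(10.0 - 9.5), drone_j, coefficient_of_variation([5, 5, 6, 5]))
drone_conf = suppress(0.9, drone_s)

# Bird-like candidate: erratic track, fluctuating point count, mismatched velocities.
bird_j = trajectory_jitter([(0, 0), (5, -4), (1, 6), (8, 0)])
bird_s = flock_score(abs(1.0 - 6.0), bird_j, coefficient_of_variation([1, 9, 2, 12]))
bird_conf = suppress(0.9, bird_s)
```

A candidate whose suppressed confidence falls below the confidence threshold would then be discarded; that final comparison is omitted here.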

[0028] Preferably, the training loss function associated with S7 in this embodiment is defined as follows. The total loss is $L = L_{\mathrm{det}} + \lambda_{rc} L_{rc} + \lambda_{b} L_{\mathrm{bird}}$, where $L$ is the total loss; $L_{\mathrm{det}}$ is the basic detection loss (a combination of classification loss and regression loss); $L_{rc}$ is the radar consistency constraint loss; $L_{\mathrm{bird}}$ is the loss related to bird flock suppression (which can be an auxiliary classification loss or a penalty term for bird flock samples); and $\lambda_{rc}$ and $\lambda_{b}$ are loss weighting coefficients (non-negative real numbers). The radar consistency constraint loss is $L_{rc} = -\frac{1}{M} \sum_{i=1}^{M} \ln(\rho_i + \epsilon)$, where $M$ is the number of candidates or positive samples involved in the calculation; $\rho_i$ is the radar consistency score of the $i$-th candidate; $\ln(\cdot)$ is the natural logarithm; and $\epsilon$ is a very small positive number.
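The loss combination can be illustrated with a minimal sketch that treats the radar consistency loss as the mean negative log of the per-candidate consistency scores. The function names and the weighting values `lam_rc` and `lam_bird` are hypothetical, and the detection and flock losses are passed in as plain numbers because their exact form is not fixed by the text.

```python
import math

def radar_consistency_loss(scores, eps=1e-6):
    """Mean negative log of radar consistency scores (assumed to lie in [0, 1])."""
    return -sum(math.log(s + eps) for s in scores) / len(scores)

def total_loss(det_loss, rc_loss, bird_loss, lam_rc=0.5, lam_bird=0.2):
    """Weighted sum of detection, radar consistency, and flock suppression losses."""
    return det_loss + lam_rc * rc_loss + lam_bird * bird_loss

well_supported = radar_consistency_loss([0.9, 0.8, 0.95])
poorly_supported = radar_consistency_loss([0.1, 0.05, 0.2])
combined = total_loss(1.0, 2.0, 3.0)
```

Positive samples whose boxes lie on strong radar responses contribute a loss close to zero, while boxes with little radar support are penalized heavily.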

[0029] Example 1: Data acquisition and input operation interface. Figure 6 shows the data acquisition and input operation interface according to an embodiment of the present invention. This interface is used to achieve synchronous acquisition and input of RGB image data and radar point cloud data.

[0030] Specifically, the left side of the interface is an RGB image display area, used to display RGB images collected within the monitoring area in real time. This visually reflects the drone and its surrounding environment; the right side of the interface is the radar point cloud display area, used to display the point cloud data collected by the radar in polar coordinates. The radial direction represents the target distance, and the angular direction represents the target azimuth, which can intuitively reflect the spatial distribution of targets in the airspace.

[0031] Below the RGB image and radar point cloud display areas are data input indicators to show that the corresponding data has been connected to the system. The center of the interface features control buttons such as "Start Acquisition," "Pause Acquisition," and "Import Data," allowing operators to start or pause real-time data acquisition or import offline data for analysis and testing as needed.

[0032] The bottom of the interface is the status and log display area, which displays system operating status information, including sensor connection status, data synchronization status, current frame rate, and data buffer duration, so that operators can keep abreast of the system's operation and ensure stable input of multimodal data.

[0033] Example 2: Data processing and multimodal fusion operation interface. Figure 7 shows the data processing and multimodal fusion operation interface according to an embodiment of the present invention. Building on the data acquisition and input shown in Figure 6, this interface further processes and fuses the RGB image and radar point cloud data.

[0034] The input RGB image and radar point cloud are displayed at the top of the interface, and data input indicators show that the current data streams are ready for processing. The central part of the interface houses the multimodal data processing and fusion module, which performs processing steps such as time alignment and motion compensation, radar point cloud clutter removal and weight modeling, coordinate transformation and pixel projection, and kernel function diffusion and normalization, thereby generating a radar prior guidance representation aligned with the RGB image pixels.

[0035] The bottom of the interface displays the fusion results, showing the fused image. Prior radar information is superimposed on the RGB image with saliency enhancement or as a heatmap to highlight potential UAV target areas. The right side of the interface includes control buttons such as "Start Fusion," "Set Parameters," and "Implementation Feedback," allowing operators to control the fusion process and adjust relevant parameters.

[0036] Example 3: Target detection and result output interface. Figure 8 shows the target detection and result output operation interface according to an embodiment of the present invention. This interface is used to perform UAV target detection on the fused multimodal features and output the detection results.

[0037] The top of the interface displays the detection results. A target detection network based on fused features infers from the input image and annotates detected targets in the RGB image with bounding boxes, while also providing target category information. The image exemplifies this by distinguishing between drone targets and bird targets, with drone targets highlighted by their corresponding bounding boxes.

[0038] The bottom of the interface has a control button area with functions such as "View Results", "Adjust Threshold", and "Export Report", which support viewing detection results, adjusting parameters, and saving and exporting detection results for subsequent analysis and decision support.

[0039] Through this interface, operators can intuitively and in real time grasp the detection status of UAV targets in the airspace and interactively manage the system output results.

[0040] Compared with existing technologies, this invention introduces a multimodal collaborative mechanism between radar and RGB throughout the entire process of data acquisition, processing, and detection. This enables radar point cloud information to provide prior guidance for the visual target detection network, effectively overcoming the problem of missed detections and false detections when relying solely on single visual information under conditions of long distance, small targets, and complex backgrounds. Simultaneously, by introducing a target screening and bird flock suppression mechanism based on radar-visual consistency during the detection stage, it significantly reduces the false alarm rate caused by non-UAV targets such as birds, improving the reliability and robustness of UAV target detection. Furthermore, this invention provides a practical operating interface, enabling visualized and interactive operation from multi-source data acquisition and multimodal fusion to target detection and result output, thus possessing stronger engineering practicality and application promotion value.

[0041] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A radar-guided RGB multimodal UAV target detection method for ground-based anti-UAV scenarios, characterized in that it comprises the following steps:
S1: Acquire RGB images and corresponding radar point cloud data within the same monitoring area, and establish timestamp information for the RGB images and radar point cloud data;
S2: Based on the timestamp information, time-align the RGB image and radar point cloud data, and perform motion compensation on the radar point cloud data to obtain aligned point cloud data;
S3: Based on the calibration parameters of the radar and camera, perform coordinate transformation on the aligned point cloud data and project it onto the pixel plane of the RGB image to obtain the radar projection point set;
S4: Preprocess the radar projection point set and construct a radar prior guidance representation aligned with the RGB image pixels;
S5: Input the RGB image into the target detection network to extract multi-scale visual features, and inject the radar prior guidance representation into the target detection network, thereby guiding the fusion of the multi-scale visual features in the feature extraction stage or the encoding interaction stage to obtain fused features;
S6: Based on the radar prior guidance representation, filter the decoding queries of the target detection network, and input the filtered decoding queries and the fused features into the decoder to output the category and bounding box of each candidate target, obtaining the candidate detection results;
S7: Perform bird flock suppression processing based on the radar-visual consistency features corresponding to the candidate detection results, and output the final UAV target detection results.

2. The radar-guided RGB multimodal UAV target detection method for ground-based anti-UAV scenarios according to claim 1, characterized in that: in the ground-based anti-UAV scenario, S1 acquires the RGB image at time $t$ and the corresponding millimeter-wave radar point cloud data; the RGB image is denoted as $I_t \in \mathbb{R}^{H \times W \times 3}$, where $t$ is the sampling time or frame index; $I_t$ is the RGB image of the $t$-th frame; $\mathbb{R}$ is the field of real numbers; $H$ is the image height; $W$ is the image width; and $3$ denotes the three RGB channels; the radar point cloud is denoted as $P_t = \{p_k\}_{k=1}^{N_t}$, where $P_t$ is the radar point cloud set of the $t$-th frame; $p_k$ is the $k$-th radar point; $N_t$ is the number of points in the point cloud of the $t$-th frame; $\{\cdot\}$ denotes a set; and the subscript $k$ is the point index; each radar point is represented in polar coordinates as $p_k = (r_k, \theta_k, \phi_k, v_{r,k}, a_k)$, where $r_k$ is the distance from the point to the radar; $\theta_k$ is the azimuth; $\phi_k$ is the pitch angle; $v_{r,k}$ is the radial velocity; and $a_k$ is the echo intensity or signal-to-noise ratio.

3. The radar-guided RGB multimodal UAV target detection method for ground-based anti-UAV scenarios according to claim 1, characterized in that: in S2, the camera sampling time is set as $t_c$ and the radar sampling time as $t_r$, and the time difference is defined as $\Delta t = t_c - t_r$, where $\Delta t$ is the time deviation between the camera and radar sampling times; $t_c$ is the camera timestamp; and $t_r$ is the radar timestamp; first-order distance compensation is performed using the radial velocity: $r_k' = r_k + v_{r,k}\,\Delta t$, where $r_k'$ is the distance after compensation; $r_k$ is the distance before compensation; $v_{r,k}$ is the radial velocity measured by the radar; and $\Delta t$ is the time deviation; the platform pose change is obtained and compensated using a homogeneous transformation matrix: let the pose transformation of the radar from time $t_r$ to time $t_c$ be $T_{r \to c}$; then $\tilde{\mathbf{x}}_k' = T_{r \to c}\,\tilde{\mathbf{x}}_k$, where $T_{r \to c} \in SE(3)$ is the $4 \times 4$ homogeneous matrix corresponding to the rigid body pose transformation; $SE(3)$ is the three-dimensional rigid body motion group; $\tilde{\mathbf{x}}_k = [x_k, y_k, z_k, 1]^{\top}$ is the homogeneous coordinate of point $p_k$ in the radar coordinate system; $\tilde{\mathbf{x}}_k'$ is the homogeneous coordinate after motion compensation; and the superscript $\top$ denotes transpose; when the radar outputs data in polar coordinates, the data must first be converted to Cartesian coordinates in the radar coordinate system: $x_k = r_k \cos\phi_k \cos\theta_k$, $y_k = r_k \cos\phi_k \sin\theta_k$, $z_k = r_k \sin\phi_k$, where $(x_k, y_k, z_k)$ are the three-dimensional coordinates of point $p_k$ in the radar coordinate system, and $\cos(\cdot)$ and $\sin(\cdot)$ are trigonometric functions.

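4. The radar-guided RGB multimodal UAV target detection method for ground-based anti-UAV scenarios according to claim 1, characterized in that: in S3, the extrinsic parameters from the radar coordinate system to the camera coordinate system are set as a rotation matrix $\mathbf{R}$ and a translation vector $\mathbf{t}$; then $\mathbf{p}_k^{c} = \mathbf{R}\,\mathbf{p}_k^{r} + \mathbf{t}$, where $\mathbf{p}_k^{c} = [X_k^{c}, Y_k^{c}, Z_k^{c}]^{\top}$ are the three-dimensional coordinates of point $p_k$ in the camera coordinate system; $\mathbf{p}_k^{r}$ are the three-dimensional coordinates of point $p_k$ in the radar coordinate system; $\mathbf{R} \in \mathbb{R}^{3 \times 3}$ is the rotation matrix; and $\mathbf{t} \in \mathbb{R}^{3}$ is the translation vector; the camera intrinsic parameter matrix is $K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$, where $K$ is the camera intrinsic parameter matrix; $f_x$ and $f_y$ are the focal lengths in the horizontal and vertical directions; and $(c_x, c_y)$ are the principal point coordinates; the pixel projection satisfies $s\,[u_k, v_k, 1]^{\top} = K\,\mathbf{p}_k^{c}$, where $(u_k, v_k)$ are the pixel coordinates of point $p_k$ on the image; $s$ is the scale factor, equal to the depth in the camera coordinate system, $s = Z_k^{c}$; and $Z_k^{c}$ is the depth of the point from the camera; from the above formula, $u_k = f_x x_n + c_x$ and $v_k = f_y y_n + c_y$, where $x_n = X_k^{c} / Z_k^{c}$ and $y_n = Y_k^{c} / Z_k^{c}$ are the normalized imaging coordinates; the projection points that satisfy $Z_k^{c} > 0$ and $0 \le u_k \le W - 1$, $0 \le v_k \le H - 1$ form the radar projection point set $\mathcal{P}_t^{\mathrm{img}} = \{(u_k, v_k)\}$, where $\mathcal{P}_t^{\mathrm{img}}$ is the set of radar points projected onto the image plane.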

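5. The radar-guided RGB multimodal UAV target detection method for ground-based anti-UAV scenarios according to claim 1, characterized in that: the specific operation of S4 includes the following steps:
S4-1: Clutter removal. The radar projection points are filtered, and points that meet any of the following conditions are identified as clutter and removed: $|v_{r,k}| < v_{\min}$, $a_k < a_{\min}$, or $r_k > r_{\max}$, where $|\cdot|$ is the absolute value; $v_{\min}$ is the radial velocity threshold; $a_{\min}$ is the echo intensity threshold; and $r_{\max}$ is the upper limit threshold for distance;
S4-2: Pixel weight modeling. A weight $w_k$ is constructed for each retained point, calculated as $w_k = \alpha_a f_a(a_k) + \alpha_v f_v(|v_{r,k}|) + \alpha_r \exp(-r_k / r_0)$, where $w_k$ is the fusion weight of point $p_k$; $\alpha_a$, $\alpha_v$, and $\alpha_r$ are non-negative weighting coefficients; $f_a(\cdot)$ is the normalized mapping function for the intensity $a_k$; $f_v(\cdot)$ is the normalized mapping function for the velocity amplitude $|v_{r,k}|$; $\exp(\cdot)$ is the exponential function; and $r_0$ is the distance attenuation scale parameter;
S4-3: Kernel function diffusion to generate a dense guidance map. A dense radar prior guidance map $G_t$ is constructed using the maximum response form: $G_t(u, v) = \max_{k}\; w_k \exp\left(-\frac{(u - u_k)^2 + (v - v_k)^2}{2\sigma^2}\right)$, where $G_t(u, v)$ is the value of the radar prior guidance map of the $t$-th frame at pixel $(u, v)$; $\max_k$ denotes taking the maximum value over all points $p_k$; $\sigma$ is the kernel function diffusion radius; $(u_k, v_k)$ are the projected pixel coordinates of point $p_k$; and $(u - u_k)^2 + (v - v_k)^2$ is the squared Euclidean distance in the pixel plane;
S4-4: Multi-channel prior. A density channel is constructed: $D_t(u, v) = \sum_{k} \exp\left(-\frac{(u - u_k)^2 + (v - v_k)^2}{2\sigma^2}\right)$, where $D_t(u, v)$ is the point cloud density estimate at pixel $(u, v)$; a velocity channel is constructed: $V_t(u, v) = \frac{\sum_{k} |v_{r,k}| \exp\left(-\frac{(u - u_k)^2 + (v - v_k)^2}{2\sigma^2}\right)}{\sum_{k} \exp\left(-\frac{(u - u_k)^2 + (v - v_k)^2}{2\sigma^2}\right) + \epsilon}$, where $V_t(u, v)$ is the radial velocity estimate at pixel $(u, v)$, and $\epsilon$ is a very small positive number used to prevent the denominator from being 0; this yields the multi-channel radar prior $\mathbf{G}_t(u, v) = [G_t(u, v), D_t(u, v), V_t(u, v)]^{\top}$, where $\mathbf{G}_t(u, v)$ is the multi-channel prior vector at pixel $(u, v)$;
S4-5: Normalization. Min-max normalization is performed on the guidance map $G_t$: $\tilde{G}_t(u, v) = \frac{G_t(u, v) - \min(G_t)}{\max(G_t) - \min(G_t) + \epsilon}$, where $\tilde{G}_t$ is the normalized guidance map; $\min(G_t)$ is the minimum value over all pixels of the map $G_t$; $\max(G_t)$ is the maximum value over all pixels of the map $G_t$; and $\epsilon$ is a very small positive number that prevents the denominator from being 0.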

6. The radar-guided RGB multimodal UAV target detection method for ground-based anti-UAV scenarios according to claim 1, characterized in that: S5 maps the radar prior guidance representation to a gain coefficient map and/or a gating map, and performs element-wise modulation and fusion on the multi-scale visual features; let the visual features output by the backbone at the $l$-th scale be $F^{(l)} \in \mathbb{R}^{C_l \times H_l \times W_l}$, where $F^{(l)}$ is the feature tensor at the $l$-th scale; $C_l$ is the number of channels; $H_l$ and $W_l$ are the height and width of the feature map at this scale; and the superscript $(l)$ is the scale index; downsampling / aligning the radar prior to the same scale yields $G^{(l)} \in \mathbb{R}^{C_g \times H_l \times W_l}$, where $G^{(l)}$ is the radar prior feature aligned to the same scale as $F^{(l)}$, and $C_g$ is the number of radar prior feature channels; a gating / gain coefficient map is generated using a mapping function: $M^{(l)} = \phi\left(\mathrm{Conv}_{1 \times 1}\left(G^{(l)}\right)\right)$, where $\mathrm{Conv}_{1 \times 1}(\cdot)$ denotes a $1 \times 1$ convolution, or an equivalent linear mapping, used for channel alignment; $\phi(\cdot)$ is the activation function; and $M^{(l)}$ is a gate / gain tensor with the same shape as the visual feature; element-wise gain modulation is employed: $\hat{F}^{(l)} = F^{(l)} \odot \left(\mathbf{1} + \gamma M^{(l)}\right)$, where $\hat{F}^{(l)}$ is the modulated visual feature; $\odot$ denotes element-wise multiplication; $\mathbf{1}$ is an all-ones tensor with the same shape as $M^{(l)}$; and $\gamma$ is the gain intensity coefficient.

7. The radar-guided RGB multimodal UAV target detection method for ground-based anti-UAV scenarios according to claim 1, characterized in that: S5 converts the radar prior guidance representation into an attention bias term and introduces it into the self-attention and/or cross-attention calculation of the target detection network; the features are flattened into a token sequence $X \in \mathbb{R}^{N \times d}$, where $X$ is the token sequence; $N$ is the number of tokens; and $d$ is the token feature dimension; standard scaled dot-product attention is $\mathrm{Attn}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V$, where $Q$, $K$, and $V$ are the query, key, and value matrices, respectively; $K^{\top}$ is the transpose of the key matrix; $\sqrt{d}$ is the scaling factor; $\mathrm{Softmax}(\cdot)$ denotes Softmax normalization over each matrix row; and $\mathrm{Attn}(\cdot)$ is the attention output; to introduce the radar prior, a radar bias is constructed for each token: let the normalized radar prior mean within the pixel region corresponding to the $i$-th token be $\bar{g}_i$; then $b_i = \beta\,\bar{g}_i$, where $b_i$ is the radar bias scalar of the $i$-th token; $\beta$ is the bias coefficient; and $\bar{g}_i$ is the radar prior statistic of the region corresponding to the token; the bias term is added to the attention weight calculation: $\mathrm{Attn}_r(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^{\top}}{\sqrt{d}} + \mathbf{1} b^{\top}\right) V$, where $\mathrm{Attn}_r(\cdot)$ is the attention output after incorporating the radar prior; $b = [b_1, \dots, b_N]^{\top}$ is the bias vector; $\mathbf{1}$ is the all-ones vector; and $\mathbf{1} b^{\top}$ indicates that the bias is broadcast into a bias matrix with the same shape as the attention score matrix.

8. The radar-guided RGB multimodal UAV target detection method for ground-based anti-UAV scenarios according to claim 1, characterized in that: in S6, the initial number of decoder queries is set to $N_q$, and the $i$-th query is denoted $q_i$; for the $i$-th candidate, the decoder outputs the class probability vector $\mathbf{p}_i$ and the bounding box $B_i$; the class probability vector is $\mathbf{p}_i = [p_{i,1}, p_{i,2}, \dots, p_{i,C}]$, where $C$ is the number of categories, and $p_{i,c}$ is the probability that the $i$-th candidate belongs to the $c$-th class; the bounding box is $B_i = (x_i, y_i, w_i, h_i)$, where $(x_i, y_i)$ are the coordinates of the center point of the candidate box, and $(w_i, h_i)$ are the width and height of the candidate box; taking the set of pixels corresponding to the candidate box region as $\Omega_i$, the radar consistency score is defined as $\rho_i = \frac{1}{|\Omega_i|} \sum_{(u,v) \in \Omega_i} \tilde{G}_t(u, v)$, where $\rho_i$ is the radar consistency score of the $i$-th candidate; $\Omega_i$ is the set of pixel coordinates covered by the candidate box $B_i$; $|\Omega_i|$ is the number of elements in the set $\Omega_i$; $\sum_{(u,v) \in \Omega_i}$ denotes summation over the pixels within the candidate box; and $\tilde{G}_t$ is the normalized radar prior guidance map; the classification uncertainty is $U_i = -\sum_{c=1}^{C} p_{i,c} \ln(p_{i,c} + \epsilon)$, where $U_i$ is the classification uncertainty of the $i$-th candidate; $\ln(\cdot)$ is the natural logarithm; and $\epsilon$ is a very small positive number that prevents $\ln 0$; for the overall score and Top-K selection, the overall score is defined as $S_i = \rho_i - \lambda U_i$, where $S_i$ is the overall score of the $i$-th candidate, and $\lambda$ is the uncertainty penalty coefficient; the Top-K index set with the highest scores is selected: $\mathcal{I} = \mathrm{TopK}\left(\{S_i\}_{i=1}^{N_q}, K\right)$, where $\mathcal{I}$ is the set of retained query indices; $\{S_i\}_{i=1}^{N_q}$ is the score sequence from $1$ to $N_q$; $\mathrm{TopK}(\cdot)$ is the operator that returns the indices of the $K$ largest elements; $N_q$ is the initial number of queries; and $K$ is the number of retained queries, $K \le N_q$.

9. The radar-guided RGB multimodal UAV target detection method for ground-based anti-UAV scenarios according to claim 1, characterized in that: the specific operation of S7 includes the following steps:
S7-1: Radial velocity consistency. For candidate $i$, the radial velocity statistic of the radar points within the candidate box region is taken as $v_i^{\mathrm{rad}}$; a velocity vector is obtained by estimating the target motion on the visual side, and the visual radial velocity component $v_i^{\mathrm{vis}}$ is obtained by projecting it onto the line of sight; the velocity difference is defined as $\Delta v_i = \left| v_i^{\mathrm{rad}} - v_i^{\mathrm{vis}} \right|$, where $\Delta v_i$ is the radial velocity difference between radar and vision; $v_i^{\mathrm{rad}}$ is the radar radial velocity statistic of the candidate; $v_i^{\mathrm{vis}}$ is the visually estimated radial velocity of the candidate; and $|\cdot|$ is the absolute value;
S7-2: Trajectory smoothness. Suppose the center positions of the same target over $T$ consecutive frames are $\{\mathbf{c}_t\}_{t=1}^{T}$; the second-order difference is defined as $\mathbf{a}_t = \mathbf{c}_{t+1} - 2\mathbf{c}_t + \mathbf{c}_{t-1}$, where $\mathbf{c}_t$ is the target center point in the $t$-th frame; $\mathbf{a}_t$ is the second-order difference vector, reflecting the "jitter / acceleration change" of the trajectory; and the coefficients $(1, -2, 1)$ are the weights in the second-order difference; the trajectory jitter index is defined as $J = \frac{1}{T-2} \sum_{t=2}^{T-1} \left\| \mathbf{a}_t \right\|_2$, where $J$ is the trajectory jitter / curvature index; $\|\cdot\|_2$ is the L2 norm; $\frac{1}{T-2}$ is the averaging coefficient; and $T$ is the number of frames in the temporal window;
S7-3: Point cloud statistical stability. Let the number of radar points within the candidate box region in frame $t$ be $n_t$; the coefficient of variation is defined as $CV = \frac{\mathrm{std}\left(\{n_t\}_{t=1}^{T}\right)}{\mathrm{mean}\left(\{n_t\}_{t=1}^{T}\right) + \epsilon}$, where $CV$ is the point-count stability index; $\mathrm{std}(\cdot)$ is the standard deviation function; $\mathrm{mean}(\cdot)$ is the mean function; $\{n_t\}_{t=1}^{T}$ is the point-count sequence over $T$ consecutive frames; and $\epsilon$ is a very small positive number that prevents the denominator from being 0;
S7-4: Bird flock identification and suppression. The flock score is constructed as $S^{\mathrm{bird}} = \omega_1 \mathbb{I}(\Delta v > \tau_v) + \omega_2 \mathbb{I}(J > \tau_J) + \omega_3 \mathbb{I}(CV > \tau_{CV})$, where $S^{\mathrm{bird}}$ is the flock score; $\omega_1$, $\omega_2$, and $\omega_3$ are weighting coefficients; $\mathbb{I}(\cdot)$ is the indicator function, which takes the value 1 if the condition is true and 0 otherwise; and $\tau_v$, $\tau_J$, and $\tau_{CV}$ are the corresponding thresholds; when the flock score exceeds its threshold, $S^{\mathrm{bird}} > \tau_B$, suppression processing is performed, and the confidence decay expression is $\pi' = \pi \cdot \exp\left(-\kappa S^{\mathrm{bird}}\right)$, where $\pi'$ is the confidence after suppression; $\pi$ is the confidence before suppression; $\kappa$ is the attenuation coefficient; and $\exp(\cdot)$ is the exponential function; when $\pi' < \tau_p$, the candidate is filtered out, where $\tau_p$ is the confidence threshold.

10. The radar-guided RGB multimodal UAV target detection method for ground-based anti-UAV scenarios according to claim 1, characterized in that: the training loss function associated with S7 is defined as follows: the total loss is $L = L_{\mathrm{det}} + \lambda_{rc} L_{rc} + \lambda_{b} L_{\mathrm{bird}}$, where $L$ is the total loss; $L_{\mathrm{det}}$ is the basic detection loss; $L_{rc}$ is the radar consistency constraint loss; $L_{\mathrm{bird}}$ is the loss related to bird flock suppression; and $\lambda_{rc}$ and $\lambda_{b}$ are the loss weighting coefficients; the radar consistency constraint loss is $L_{rc} = -\frac{1}{M} \sum_{i=1}^{M} \ln(\rho_i + \epsilon)$, where $M$ is the number of candidates or positive samples involved in the calculation; $\rho_i$ is the radar consistency score of the $i$-th candidate; $\ln(\cdot)$ is the natural logarithm; and $\epsilon$ is a very small positive number.