Unmanned aerial vehicle highway litter identification method based on multi-modal semantic fusion

By using a multimodal fusion drone system, combined with radar, visible light and infrared sensors, efficient, all-weather, and network-wide identification and classification of debris on highways has been achieved. This solves the problem of difficult detection in low-light environments at night in existing technologies and provides a reliable detection method for all-weather and network-wide applications.

CN121744008B · Active · Publication Date: 2026-06-30 · JILIN UNIVERSITY

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: JILIN UNIVERSITY
Filing Date: 2026-02-26
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing highway debris detection technologies suffer from reduced detection rates in low-light conditions at night, fixed deployment methods cannot provide full coverage, ground radar deployment is costly and cannot flexibly respond to emergencies, manual patrols have poor real-time performance, single sensing methods are difficult to deal with non-metallic low-contrast targets, and there is a lack of reliable detection methods that are compatible with all weather conditions, the entire road network, and multiple materials.

Method used

A method for identifying debris on highways using unmanned aerial vehicles (UAVs) based on multimodal semantic fusion is adopted. This method combines radar, visible light, and infrared sensors, and uses a unified three-modal feature encoding and a cross-modal triple attention fusion mechanism. It utilizes uncertainty-guided dynamic weighted decision-making and a dual-branch network for target identification and classification, and achieves precise positioning and automatic alarm through three-dimensional localization.

Benefits of technology

It achieves high-accuracy identification and classification of scattered objects such as iron blocks, wood blocks, and tire skins under conditions without external lighting, reduces the false detection rate and missed detection probability at night, has all-weather inspection capability, realizes normalized and high-frequency inspection and autonomous obstacle avoidance covering the entire road network, and provides real-time alarm and data security.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention relates to a method for identifying debris on highways using unmanned aerial vehicles (UAVs) based on multimodal semantic fusion. The method collects UAV data, radar data, and visible light and infrared thermal imaging data. It employs a unified feature encoding mechanism combining radar, vision, and infrared modes, a cross-modal triple attention fusion mechanism, and uncertainty-guided dynamic weighted decision-making with a dual-branch network. Under conditions without external lighting, it performs preliminary identification and classification of debris. Then, the UAV is switched to a low-altitude, close-range mode. By comparing the physical characteristics in this mode with the features input during target identification and classification, it can accurately determine whether a target is a "confirmed obstacle," outputting the location result and final classification label. Finally, different levels of warnings are issued based on characteristics such as material, size, and shape. This provides a rapidly deployable, low-cost, and highly reliable technical solution for road safety, suitable for routine, high-frequency 24-hour road inspection and early warning.

Description

Technical Field

[0001] This invention belongs to the field of object recognition technology, specifically relating to a method for identifying scattered objects on highways using unmanned aerial vehicles (UAVs) based on multimodal semantic fusion.

Background Technology

[0002] In recent years, with the rapid expansion of China's expressway network and the continuous increase in traffic flow along the routes, traffic accidents caused by road debris have become increasingly frequent. Common debris on expressways includes iron blocks, wood blocks, plastic sheets, and tire fragments. Once these objects appear on the lanes, they can easily cause tire blowouts, collisions, or even chain-reaction rear-end collisions. In low-light conditions especially, debris is difficult to identify clearly with the naked eye or with ordinary surveillance cameras; at the same time, manual patrols have low coverage frequency, many blind spots, and poor inspection safety, and cannot ensure that debris is cleared promptly.

[0003] Currently, four main technical solutions are used for debris detection on highways. The first type relies on fixed camera video monitoring, installing high-definition cameras at bridge entrances, tunnel entrances, or service areas, and using image recognition algorithms to detect obstacles. This type of system is limited by lighting conditions and installation angles, resulting in a significant decrease in detection rate in low-light conditions at night. Furthermore, its fixed deployment cannot cover the entire road, creating blind spots. The second type uses ground-based radar for obstacle detection on some highway sections. While this method offers high detection accuracy, it requires the deployment of numerous devices along the route, leading to extremely high construction and maintenance costs, and it cannot flexibly respond to sudden events or temporary obstacles. The third type involves manual monitoring using fixed highway cameras or patrol vehicle-mounted cameras. Some high-end vehicles are equipped with forward-facing millimeter-wave radar for obstacle detection, but this method suffers from large detection delays, high risks at night, and insufficient real-time performance, making continuous, unmanned monitoring impossible. In recent years, with the development of drone technology and lightweight sensors, drone-based road inspection has gradually become a research hotspot. Some studies have proposed using drones equipped with cameras to perform video detection of the road surface and using deep learning algorithms to identify obstacles; this solution belongs to the fourth type. However, such technologies have significant blind spots at night when dealing with targets with low reflectivity and low contrast (such as black tire debris or dark rubber on the road). These targets are extremely similar in color to asphalt pavement under visible light, making them difficult to segment using edge detection. 
Furthermore, due to their non-metallic nature, they have high electromagnetic absorption rates and small radar cross-sections (RCS) for millimeter-wave radar, making them easily filtered out as background noise. Therefore, relying solely on visual or radar methods is insufficient for accurately detecting such high-risk obstacles.

[0004] Furthermore, at the equipment level, current mainstream solutions are based on airborne visible light cameras, using algorithms to extract and analyze image features to achieve target recognition. To improve positioning accuracy, some systems employ airborne millimeter-wave radar or lidar, detecting the position and motion of obstacles by analyzing the intensity, phase, and Doppler characteristics of echo signals. To enhance detection capabilities in nighttime and low-light environments, a few advanced systems use a multispectral fusion scheme combining infrared thermal imaging cameras and visible light cameras, utilizing differences in thermal radiation to improve the recognition accuracy of low-contrast targets and achieve all-weather adaptability.

[0005] In summary, existing highway debris detection solutions still have significant limitations: fixed video surveillance suffers from a sharp drop in detection rate due to insufficient lighting at night and low-contrast targets; while ground-based radar offers high detection accuracy, it receives only weak reflections from non-metallic debris, which are easily filtered out, and its deployment and maintenance costs are high; manual patrols and vehicle-mounted detection suffer from poor real-time performance, limited coverage, and safety risks associated with nighttime operations; emerging drone inspections also face blind spots in low-reflectivity and low-contrast scenarios, similar to the limitations of vision and radar. Overall, single sensing methods are insufficient to address the combined challenges of "nighttime + non-metallic + low-contrast," and a reliable detection method compatible with all weather conditions, the entire road network, and multiple material targets is lacking.

Summary of the Invention

[0006] In view of the shortcomings and deficiencies of existing technologies, this invention proposes a UAV-based method for identifying debris on highways based on multimodal semantic fusion. This method proposes a unified encoding mechanism of radar-vision-infrared three-mode features, a cross-modal triple attention fusion mechanism, and an uncertainty-guided dynamic weighted decision and dual-branch network. It can identify and classify debris under conditions without external lighting and provide different levels of warnings based on features such as material, size, and shape. The invention also proposes a verification and precise coordinate back-calculation algorithm based on three-dimensional positioning. After the system completes the fusion identification, it back-calculates the precise location of the debris, realizing real-time identification, precise positioning, and automatic alarm. This provides a new technical solution for road safety that is rapidly deployable, low-cost, and highly reliable, and is suitable for routine, high-frequency 24-hour road inspection and early warning.

[0007] To achieve the above objectives, the present invention adopts the following technical solution:

[0008] A method for identifying debris on highways using unmanned aerial vehicles (UAVs) based on multimodal semantic fusion, comprising the following steps:

[0009] Step 1. During the drone's cruise, collect relevant data from the drone, as well as radar data, visible light and infrared thermal imaging data;

[0010] Step 2. Preprocess the radar data to obtain the radar map of candidate targets;

[0011] Step 3. Extract high-dimensional features from candidate target radar maps, visible light images, and infrared images. Use a three-modal unified encoder to map the three to a unified high-dimensional semantic space. Then, use a cross-modal triple attention module to capture the bidirectional complementary relationships between radar-visible light, radar-infrared, and visible light-infrared, respectively, to generate first-level fusion features. Set up parallel material inference networks and morphology estimation networks. Use the obtained first-level fusion features as inputs and concatenate the output data. After being enhanced by a feature enhancement loop, perform a second fusion with the first-level fusion features to form a joint feature representation. Then, use an uncertainty-guided dynamic weighted fusion strategy to perform decision-level processing on the joint features and output the fused features. Finally, input the fused features into a lightweight classification network to complete target recognition and classification. When a suspected target is detected, proceed to step 4.

[0012] Step 4. The UAV switches to low-altitude close-range mode, uses radar ranging data to perform weighted correction on the visually estimated depth, and generates a fused 3D point cloud; then, it uses an ellipsoid fitting algorithm to calculate the physical envelope volume of the target, and combines the UAV's heading angle to perform inverse geographic coordinate calculation, and compares the physical features in low-altitude close-range mode with the features input during target identification and classification in Step 3. If the comparison result is less than the set threshold and the confidence level is greater than the set threshold, the target is determined to be a "confirmed obstacle", the positioning result and final classification label are output, and Step 5 is executed;

[0013] Step 5. Calculate the comprehensive alarm index based on the identification results, automatically generate the alarm level, and execute alarm and data transmission.

[0014] As a preferred embodiment of the present invention, step 2 specifically includes the following steps:

[0015] Step 2.1. Format the raw data;

[0016] Step 2.2. Perform two Fast Fourier Transforms on the acquired intermediate frequency signal to complete the range domain transformation and Doppler domain transformation respectively, generating a range-velocity two-dimensional matrix; compare the radial velocity parsed from the range-velocity two-dimensional matrix with the velocity threshold to eliminate moving vehicles, and perform the next step of spatial consistency screening on the remaining stationary targets;

[0017] Step 2.3. Construct a background echo model based on the range-velocity two-dimensional matrix. Calculate the difference between the current frame's range-velocity spectrum and the background echo model, and apply a dynamic noise threshold. The difference results are analyzed. If the result exceeds the dynamic noise threshold and the location does not belong to the long-term existing region, it is marked as a candidate stationary target.

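[0018] $T(r,v) = \alpha \cdot \frac{1}{N_\Omega} \sum_{i \in \Omega(r,v)} P_i$;

[0019] where $T(r,v)$ is the dynamic noise threshold; $\Omega(r,v)$ is the neighborhood window centered at position $(r,v)$; $\alpha$ is the adjustment coefficient, dynamically updated based on the real-time noise level; $P_i$ is the noise power value of the $i$-th cell of the target detection unit in the current range-velocity spectrum; and $N_\Omega$ is the number of cells within the neighborhood window;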

[0020] Step 2.4. For the candidate stationary targets marked in Step 2.3, perform weighted attenuation on the current range-velocity spectrum to obtain the candidate target radar map;

[0021] ;

[0022] where the quantities are, in order: the current range-velocity spectrum; the background suppression coefficient; the maximum intensity value in the background echo model; the constructed background echo model; and the candidate target radar map.

[0023] As a preferred embodiment of the present invention, step 3 utilizes the RaFormer radar feature extraction network to process the radar spectrum of the candidate target to obtain a high-dimensional radar feature vector; the MobileNeXt-Vit network is used to process the visible light image to obtain a high-dimensional visible light feature vector; and the IR-ViT network is used to process the infrared image to obtain a high-dimensional infrared image feature vector. The cross-modal triple attention module includes radar-visible light bidirectional attention, radar-infrared bidirectional attention, and visible light-infrared bidirectional attention. The radar-visible light bidirectional attention is used to capture the complementary relationship between radar depth information and visual texture; the radar-infrared bidirectional attention is used to compensate for radar material uncertainty using infrared thermal distribution; and the visible light-infrared bidirectional attention is used to enhance visible light contour recognition using infrared structural thermal difference. Then, the results of the three bidirectional attention modules (radar-visible light, radar-infrared, and visible light-infrared) are fused to yield the first-level fusion feature, where the four fusion coefficients take the values 0.3, 0.25, 0.35, and 0.10, respectively.

[0024] As a preferred embodiment of the present invention, in step 3, the material inference network adopts the MaterialINet model, the shape estimation network adopts the ShapeNet model, and the temperature feature vector in the first-level fusion feature is extracted using the Temperature-Net temperature feature extraction network. Finally, the features are combined to obtain the multi-dimensional object features. Then, a lightweight feature enhancement module is used to enhance the features of multidimensional objects. By performing nonlinear transformations and dimensional compression, the enhanced object features are obtained.

[0025] As a preferred embodiment of the present invention, when step 3 employs an uncertainty-guided dynamic weighted fusion strategy to perform decision-level processing on the joint features, Gaussian uncertainty estimation is used to dynamically weight the three-modal confidence scores, and the final fusion output is:

[0026] $w_i = \frac{1/\sigma_i^2}{\sum_{j \in \{R,V,I\}} 1/\sigma_j^2}$;

[0027] $F = w_J F_J + w_R F_R + w_V F_V + w_I F_I$;

[0028] where $1/\sigma_i^2$ is the reciprocal of the variance of the uncertainty of the $i$-th sensor mode; $\sum_j 1/\sigma_j^2$ is the summation of the reciprocal variances over all sensor modes; $R$ denotes the radar, $V$ the visible light camera, and $I$ the infrared camera; $F$ is the fused feature vector; $w_J$ is the fusion weight of the joint features; $F_J$ is the joint feature representation; $w_R$, $w_V$, $w_I$ are the corresponding weight coefficients; and $F_R$, $F_V$, $F_I$ are the radar, visible light, and infrared features in the high-dimensional semantic space, respectively.
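The uncertainty-guided weighting described above can be illustrated with a minimal sketch. It assumes standard inverse-variance weighting, in which each modality's weight is its reciprocal variance divided by the sum of reciprocal variances. The function names, the scalar confidence scores, and the numeric values are hypothetical and are not taken from the patent.

```python
def inverse_variance_weights(variances):
    """Return normalized weights proportional to 1/variance for each modality."""
    if any(v <= 0 for v in variances.values()):
        raise ValueError("variances must be positive")
    precisions = {name: 1.0 / v for name, v in variances.items()}
    total = sum(precisions.values())
    return {name: p / total for name, p in precisions.items()}


def fuse_scores(scores, variances):
    """Weighted sum of per-modality scores using inverse-variance weights."""
    weights = inverse_variance_weights(variances)
    return sum(weights[name] * scores[name] for name in scores)


# Hypothetical night-time readings: the visible-light camera is the least certain.
variances = {"radar": 0.04, "visible": 0.25, "infrared": 0.05}
scores = {"radar": 0.90, "visible": 0.40, "infrared": 0.85}

weights = inverse_variance_weights(variances)
fused = fuse_scores(scores, variances)
```

With these made-up inputs the visible-light channel receives the smallest weight, so the fused score stays close to the radar and infrared scores, which is the behavior the strategy is meant to produce when one modality degrades.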

[0029] As a preferred embodiment of the present invention, in step 3, the lightweight classification network adopts a convolutional feature extraction + fully connected discriminant structure. The lightweight classification network outputs the target category label and confidence score; when the confidence score is greater than a set threshold, the target is judged to be a valid obstacle. In the target recognition and classification process, a temporal frame fusion mechanism is introduced to fuse adjacent frames: across the fused frames, a stability check is performed on targets at the same spatial location. If the stability confidence is greater than a set threshold, the target is confirmed as real debris, and its spatial coordinates and time label are recorded.

[0030] As a preferred embodiment of the present invention, step 4 includes the following steps:

[0031] Step 4.1. Based on the target's position in the image coordinate system output in Step 3, preliminarily determine the target's position in the geographic coordinate system. Based on the target's position, combined with the UAV's current status and preset safety parameters, adjust the flight path so that the UAV is located 10-15 meters above the target area.

[0032] Step 4.2. Multi-view, multi-modal verification data acquisition:

[0033] Step 4.3. Generate a point cloud based on the collected continuous radar data, calculate the visually estimated depth based on image frames from different viewpoints, fuse the depth estimated by radar data with the visually estimated depth, and generate a fused 3D point cloud of the target area.

[0034] Step 4.4. Extract the target centroid and spatial shape boundary from the fused 3D point cloud, combine the UAV heading angle to perform inverse geographic coordinate calculation, and output the latitude and longitude of the target debris;

[0035] Step 4.5. Extract the geometrically invariant features of the fused 3D point cloud, the micro-Doppler features of the radar, and the thermal residual features of the infrared image to construct a complex kernel feature vector; input the complex kernel feature vector into a pre-trained lightweight neural network to calculate the Mahalanobis distance between the current physical features and the features input in the initial classification of Step 3. If the distance is less than the set threshold and the confidence level is greater than the set threshold, the target is determined to be a "confirmed obstacle".

[0036] Step 4.6. For "Confirmed Obstacles", output the fused 3D point cloud model, localization results, and final classification labels.

[0037] Step 4.7. After determining the location of the "confirmed obstacle", if the location overlaps with the current flight path of the UAV or the distance is less than the safety threshold of 3m, the obstacle avoidance vector is calculated in real time based on the three-dimensional positioning results; then, the UAV adjusts its heading angle to make the flight path deviate from the safe distance range of the "confirmed obstacle".
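The physical consistency check of Step 4.5 can be sketched as follows. This is an illustration under stated assumptions: the covariance is taken to be diagonal (so the Mahalanobis distance reduces to a variance-scaled Euclidean distance), and the feature values, per-feature variances, and both thresholds are hypothetical, since the patent does not give them here.

```python
import math


def mahalanobis_diagonal(features, reference, variances):
    """Mahalanobis distance under an assumed diagonal covariance matrix."""
    return math.sqrt(sum((a - b) ** 2 / s
                         for a, b, s in zip(features, reference, variances)))


def is_confirmed_obstacle(features, reference, variances, confidence,
                          distance_threshold=3.0, confidence_threshold=0.8):
    """Confirm only if the distance is small AND the confidence is high."""
    distance = mahalanobis_diagonal(features, reference, variances)
    return distance < distance_threshold and confidence > confidence_threshold


# Hypothetical 3-element vectors: close-range features vs. Step 3 features.
close_range = [1.2, 0.5, 3.0]
initial = [1.0, 0.4, 2.0]
feature_variances = [0.04, 0.01, 1.0]

distance = mahalanobis_diagonal(close_range, initial, feature_variances)
confirmed = is_confirmed_obstacle(close_range, initial, feature_variances, confidence=0.9)
rejected = is_confirmed_obstacle(close_range, initial, feature_variances, confidence=0.5)
```

Requiring both conditions means a target that looks physically consistent but was classified with low confidence is not promoted to a "confirmed obstacle".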

[0038] As a preferred embodiment of the present invention, the comprehensive alarm index in step 5 The expression is:

[0039] ;

[0040] where the terms are, in order: the final confidence level for target identification; the spatial proximity of the target to the drone, which is defined from the horizontal distance from the drone to the target and the maximum effective monitoring distance; the target size weight, i.e., volume; and the material risk factor $M$, which is the weighted sum of material probabilities:

[0041] $M = \sum_k p_k \, r_k$;

[0042] where $p_k$ is the probability output by the material classifier for the $k$-th material class and $r_k$ is the risk coefficient for the $k$-th material type: 1.00 for iron, 0.90 for tire / rubber, 0.6 for wood, and 0.45 for plastic, with one further coefficient of 0.5. The four weighting coefficients of the comprehensive alarm index take values of 0.4, 0.15, 0.2, and 0.25, respectively.
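A small sketch can make the alarm index concrete. Only the material risk factor (a probability-weighted sum of the listed risk coefficients) follows directly from the text. The rest is assumed for illustration: that the index is a weighted sum of the four terms, that the weights 0.4, 0.15, 0.2, and 0.25 apply in the order the terms are listed, that proximity is 1 minus the ratio of distance to maximum monitoring distance, and that the size term has already been normalized to the range 0 to 1. All function and variable names are hypothetical.

```python
RISK = {"iron": 1.00, "tire_rubber": 0.90, "wood": 0.60, "plastic": 0.45}


def material_risk(probabilities):
    """Weighted sum of material probabilities and their risk coefficients."""
    return sum(p * RISK[name] for name, p in probabilities.items())


def alarm_index(confidence, distance_m, max_distance_m, size_score, probabilities,
                weights=(0.4, 0.15, 0.2, 0.25)):
    """Assumed form: weighted sum of confidence, proximity, size, and material risk."""
    proximity = max(0.0, 1.0 - distance_m / max_distance_m)  # assumed proximity form
    w_conf, w_prox, w_size, w_mat = weights                  # assumed weight order
    return (w_conf * confidence + w_prox * proximity
            + w_size * size_score + w_mat * material_risk(probabilities))


# Hypothetical detection: most likely an iron block, 20 m away, mid-sized.
probs = {"iron": 0.7, "tire_rubber": 0.2, "wood": 0.1, "plastic": 0.0}
risk = material_risk(probs)
index = alarm_index(confidence=0.9, distance_m=20.0, max_distance_m=100.0,
                    size_score=0.5, probabilities=probs)
```

Because the material term is a weighted sum rather than a hard label, an uncertain material classification lowers the index gradually instead of switching it between discrete values.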

[0043] As a preferred embodiment of the present invention, the alarm level in step 5 is determined according to a threshold and is divided into three levels. When the alarm is a level 1 prompt alarm, the event is recorded and the data is cached without being actively uploaded. When the alarm is a level 2 warning alarm, the drone issues a voice / light warning signal and sends an event summary to the ground station. When a level 3 severe alarm is triggered, the real-time data stream is transmitted to the highway emergency command center.

[0044] As a further preferred embodiment of the present invention, in step 4, geometrically invariant features of the fused 3D point cloud, micro-Doppler features of the radar, and thermal residual features of the infrared image are extracted to construct a complex feature vector. The geometrically invariant features of the fused 3D point cloud include volume and aspect ratio. The micro-Doppler features of the radar are extracted from the radar time-frequency spectrum and instantaneous frequency as $\mathbf{f}_{mD} = [\sigma_f, E_{sb}, H_{tf}, \kappa_f]^{T}$, where $\sigma_f$ is the standard deviation of the micro-Doppler frequency, $E_{sb}$ is the sideband energy concentration, $H_{tf}$ is the time-frequency spectral entropy, $\kappa_f$ is the kurtosis of the frequency distribution, and $T$ denotes the transpose. The thermal residual features of the infrared image are $\mathbf{f}_{th} = [\Delta T_{max}, \tau, G_T, \sigma_T]^{T}$, where $\Delta T_{max}$ is the maximum temperature rise of the target relative to the background, $\tau$ is the temperature decay time constant, $G_T$ is the intensity of the spatial temperature gradient, and $\sigma_T$ is the standard deviation of temperature fluctuation in the time domain.

[0045] Advantages and beneficial effects of the present invention:

[0046] (1) This invention proposes a multimodal fusion recognition method for debris on highways. By using a unified encoding of radar-visual-infrared features and a cross-modal triple attention fusion mechanism, the physical, texture, and thermal radiation features of the target are deeply fused, effectively solving the problem of insufficient recognition capability of a single sensor under insufficient lighting conditions. On this basis, uncertainty-guided dynamic weighted decision-making and a dual-branch network are added to achieve the classification of debris such as iron blocks, wood blocks, and tire skins under no external lighting conditions. Furthermore, the multi-frame temporal fusion mechanism is used to improve detection stability, thereby improving the accuracy and reliability of debris detection at night and solving the problem of difficulty in recognizing non-reflective targets such as iron blocks, wood blocks, and tire skins at night.

[0047] (2) This invention implements an optimized FMCW range-velocity two-dimensional imaging algorithm on radar, accurately capturing the range and velocity information of the target, distinguishing between falling objects and moving vehicles, and constructing a background echo model to remove interference from fixed facilities such as green belts and guardrails. Furthermore, through a spectral peak detection and noise threshold dynamic adjustment mechanism, the detection threshold is automatically adjusted according to the real-time environment to reduce interference from ground clutter and vehicle wakes, avoiding false detection of high-reflectivity targets and missed detection of low-reflectivity targets caused by fixed thresholds, effectively improving the recognition accuracy in weak echo environments.

[0048] (3) This invention proposes a verification and precise coordinate back-calculation algorithm based on three-dimensional positioning. After the system completes the fusion recognition, the target positioning is achieved through the near-field maneuvering of the UAV and the attitude information of the UAV. By using ellipsoidal fitting positioning and physical consistency secondary verification, the fully automatic and high-precision positioning of the geographic coordinates of the scattered objects is realized, providing precise coordinate input for subsequent ground obstacle clearing or warning equipment control.

[0049] (4) This invention employs a graded alarm and dual-channel data transmission mechanism. After the target is identified, this invention calculates the alarm index by comprehensively considering confidence level, size, distance, and classification, and divides it into three alarm levels. At the same time, the system adopts a dual-channel communication structure of 4G / 5G main link + LoRa backup line to realize breakpoint resume and encrypted data upload. This mechanism ensures the immediacy of alarms and data security in nighttime environments.

[0050] (5) This invention can work stably at night, in low light conditions, and without external lighting, and has all-weather inspection capabilities. The entire process, from data acquisition, fusion recognition, and 3D positioning to alarm transmission, is automated without human intervention. Furthermore, the UAV is highly mobile and has a wide coverage area, enabling routine, high-frequency inspections of the entire highway network. It also incorporates autonomous obstacle avoidance and path planning, ensuring the safety of UAV operations in complex airspace.

Attached Figure Description

[0051] Other objects and results of the invention will become more apparent and readily understood with reference to the following description taken in conjunction with the accompanying drawings. In the drawings:

[0052] Figure 1 is a flowchart of the UAV highway debris identification method based on multimodal semantic fusion provided by the present invention.

Detailed Implementation

[0053] To enable those skilled in the art to better understand the technical solutions and advantages of the present invention, the present application will be described in detail below with reference to the accompanying drawings, but this is not intended to limit the scope of protection of the present invention.

[0054] As shown in Figure 1, this embodiment provides a method for identifying debris on highways using unmanned aerial vehicles (UAVs) based on multimodal semantic fusion. The method includes the following steps:

[0055] Step 1. Data Acquisition:

[0056] The drone cruises above the highway at an altitude of approximately 20-60 meters along a pre-set inspection route, with a flight speed controlled at 3-8 m / s. Simultaneously, the millimeter-wave radar onboard the drone continuously scans the road ahead at a downward angle of 8-15°, acquiring raw radar data such as range and velocity images. At the same time, a front-mounted low-light camera (visible light camera) and an infrared camera (infrared thermal imaging module) simultaneously collect visible light and infrared thermal imaging data, and the flight control system records the drone's attitude, speed, and GPS coordinates in real time to ensure subsequent spatial registration.

[0057] Step 2. Radar signal preprocessing and target detection:

[0058] The acquired raw radar signals are demodulated using FMCW and processed with Fast Fourier Transform (FFT) to form a range-velocity two-dimensional matrix. The system performs constant false alarm rate (CFAR) detection and adaptive noise estimation on this matrix to extract candidate target point clouds with significant reflection characteristics. Simultaneously, through multi-frame time series analysis and target motion consistency judgment, false alarm points caused by ground reflection, guardrails, or vehicle interference are filtered out, thus obtaining preliminary detection results for suspected debris. Through the above radar signal preprocessing, background modeling, dynamic noise thresholding, and CFAR detection steps, the system can stably extract newly emerging stationary anomalous scatterers from ground clutter, fixed echoes from guardrails, and vehicle wake interference under low signal-to-noise ratio conditions. This provides highly reliable candidate debris regions for subsequent multimodal semantic fusion-based UAV highway debris identification.

[0059] Specifically, in this embodiment, Step 2 includes the following steps:

[0060] Step 2.1. Raw data formatting:

[0061] When performing inspection missions, the drone's radar continuously transmits linear frequency modulated continuous wave (FMCW) signals at a frame rate of approximately 20-100 Hz. Each frame contains multiple chirp cycles, and within each cycle the frequency rises linearly from the start frequency to the stop frequency. The bandwidth is approximately 1 GHz, corresponding to a theoretical range resolution of about 15 cm. The radar receiving antenna synchronously receives the echo signals reflected from ground objects. After passing through a mixer and an analog-to-digital converter (ADC), an intermediate frequency (IF) signal is obtained, with the sampling rate typically set between 1 and 2. The acquired IF signal data is buffered frame by frame, forming a two-dimensional data matrix (chirp × sampling point).
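The quoted figure of about 15 cm follows from the standard FMCW relation between sweep bandwidth and range resolution, resolution = c / (2B). The short check below uses that relation; the helper names are hypothetical, and the even bin spacing in `bin_to_range` is an assumption made for illustration.

```python
SPEED_OF_LIGHT = 299_792_458.0  # m/s


def range_resolution(bandwidth_hz):
    """Theoretical FMCW range resolution c / (2B), in metres."""
    return SPEED_OF_LIGHT / (2.0 * bandwidth_hz)


def bin_to_range(k, bandwidth_hz):
    """Range represented by FFT bin k, assuming bins are spaced by the resolution."""
    return k * range_resolution(bandwidth_hz)


delta_r = range_resolution(1e9)       # about 0.15 m for a 1 GHz sweep
tenth_bin = bin_to_range(10, 1e9)     # range of the tenth bin under the same assumption
```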

[0062] Step 2.2. Generation of the distance-velocity two-dimensional matrix:

[0063] The acquired intermediate frequency signal (IF signal) is processed by two Fast Fourier Transform (FFT) processes to complete the range domain transformation and Doppler domain transformation, respectively.

[0064] First, an $N$-point FFT is applied to the sample points within each chirp to obtain the range spectrum (range line). The calculation formula is as follows:

[0065] $X_m(k) = \mathrm{FFT}_N\{x_m(n)\} = \sum_{n=0}^{N-1} x_m(n)\, e^{-j 2\pi k n / N}$

[0066] The range resolution $\Delta R$ corresponding to each frequency index $k$ is:

[0067] $\Delta R = \frac{c}{2B}$

[0068] where $x_m(n)$ is the echo signal of the $m$-th chirp, $\mathrm{FFT}_N$ denotes the Fast Fourier Transform of length $N$, $N$ is the number of points in the range direction, $c$ is the speed of light, and $B$ is the frequency modulation bandwidth.

[0069] Secondly, a second FFT is performed across the range spectra (range lines) of all chirps at the same range cell to obtain the Doppler velocity distribution:

[0070] $S(k,l) = \sum_{m=0}^{M-1} X_m(k)\, e^{-j 2\pi l m / M}$

[0071] where $l$ is the Doppler index and $M$ is the number of chirps per frame. The final result is the range-velocity two-dimensional matrix. The velocity threshold $v_{th}$ is set to 0.3 m / s, and $v_r$ is the radial velocity parsed from the range-velocity two-dimensional matrix. When $|v_r| > v_{th}$, the object is identified as a moving vehicle and eliminated; the remaining stationary targets proceed to the next step of spatial consistency screening. The two Fast Fourier Transforms capture the radial position and velocity of each target, which allows moving vehicles to be distinguished from stationary objects.
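The two-transform processing can be illustrated on a tiny synthetic frame. This is a sketch rather than the patent's implementation: a naive discrete Fourier transform stands in for the FFT so that the example needs only the standard library, and the wavelength, chirp period, frame size, and all function names are assumptions introduced for the example. Only the 0.3 m / s threshold comes from the text.

```python
import cmath
import math


def dft(seq):
    """Naive O(n^2) discrete Fourier transform; a stand-in for an FFT routine."""
    n = len(seq)
    return [sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(seq))
            for k in range(n)]


def range_doppler(frame):
    """frame[m][n] is IF sample n of chirp m; returns rd[k][l] (range bin k, Doppler bin l)."""
    range_lines = [dft(chirp) for chirp in frame]        # first transform, per chirp
    n_range = len(range_lines[0])
    return [dft([line[k] for line in range_lines])       # second transform, per range bin
            for k in range(n_range)]


def doppler_bin_to_velocity(l, n_chirps, wavelength_m, chirp_period_s):
    """Radial velocity for Doppler bin l; bins at or above M/2 are read as negative."""
    signed = l if l < n_chirps / 2 else l - n_chirps
    return signed * wavelength_m / (2.0 * n_chirps * chirp_period_s)


def is_moving(velocity_mps, v_threshold=0.3):
    """Apply the 0.3 m/s velocity threshold from the text."""
    return abs(velocity_mps) > v_threshold


# Synthetic frame: a stationary reflector in range bin 3 and a mover in range bin 5.
N_SAMPLES, N_CHIRPS = 16, 8
frame = [[cmath.exp(2j * math.pi * 3 * n / N_SAMPLES)
          + cmath.exp(2j * math.pi * (5 * n / N_SAMPLES + 2 * m / N_CHIRPS))
          for n in range(N_SAMPLES)]
         for m in range(N_CHIRPS)]
rd = range_doppler(frame)

WAVELENGTH_M = 0.0039    # assumed: roughly a 77 GHz carrier
CHIRP_PERIOD_S = 1e-4    # assumed chirp repetition period
v_static = doppler_bin_to_velocity(0, N_CHIRPS, WAVELENGTH_M, CHIRP_PERIOD_S)
v_mover = doppler_bin_to_velocity(2, N_CHIRPS, WAVELENGTH_M, CHIRP_PERIOD_S)
```

In this toy frame the stationary reflector's energy lands in Doppler bin 0 of range bin 3, while the mover lands in Doppler bin 2 of range bin 5, so thresholding the radial velocity removes the mover and keeps the stationary cell for later screening.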

[0072] Step 2.3. Noise Suppression and Constant False Alarm Rate Detection:

[0073] Since fixed structures such as guardrails present stable strong reflection bands in the radar field of view, and their positions remain almost unchanged across multiple frames, while scattered debris represents new stationary anomalies, this embodiment constructs a background echo model to remove the influence of fixed structures such as guardrails and green belts:

[0074]

[0075] where the left-hand term denotes the average background echo intensity at a given spatial location, i.e., the background echo model. The time window used to calculate the background average is set to 10 frames to balance stability and real-time performance.

[0076] After constructing the background echo model, the current-frame range-velocity spectrum is differenced against the background echo model:

[0077]

[0078] At the same time, a dynamic noise threshold is set using the constant false alarm rate (CFAR) detection algorithm. In this embodiment, the average noise power in the surrounding neighborhood is calculated for each range-velocity cell as a reference threshold, which is used to evaluate the difference result. The calculation formula is as follows:

[0079]

[0080] where the terms denote: the neighborhood window centered on the given position; the adjustment coefficient, which is dynamically updated based on the real-time noise level; and the noise power value of each cell within the detection window of the current range-velocity spectrum. If the difference result exceeds the dynamic noise threshold and the location does not belong to a long-term existing region, the cell is marked as "possible debris", i.e., a candidate stationary target; otherwise, the cell is considered to contain only noise or clutter.
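A minimal sketch of the background differencing and neighborhood-averaged threshold described above follows. The original formulas are not reproduced in this text, so several details are assumptions: the difference is clipped at zero, the threshold is the adjustment coefficient times the mean power of the neighbors (center cell excluded), and `alpha=3.0`, `half_window`, and the `persistent` set of long-term cells are hypothetical names and values.

```python
def background_model(history):
    """Per-cell mean over the frames in `history` (the embodiment uses 10 frames)."""
    n = len(history)
    rows, cols = len(history[0]), len(history[0][0])
    return [[sum(f[i][j] for f in history) / n for j in range(cols)] for i in range(rows)]

def difference(current, background):
    """Background-subtracted spectrum; clipping at zero is an assumption."""
    return [[max(c - b, 0.0) for c, b in zip(cr, br)] for cr, br in zip(current, background)]

def dynamic_threshold(power, i, j, alpha, half_window=1):
    """alpha times the mean power of the cells around (i, j), center excluded."""
    rows, cols = len(power), len(power[0])
    neighbours = [
        power[r][c]
        for r in range(max(0, i - half_window), min(rows, i + half_window + 1))
        for c in range(max(0, j - half_window), min(cols, j + half_window + 1))
        if (r, c) != (i, j)
    ]
    if not neighbours:
        return float("inf")  # a lone cell has no reference noise estimate
    return alpha * sum(neighbours) / len(neighbours)

def candidate_cells(current, history, alpha=3.0, persistent=frozenset()):
    """Cells whose difference exceeds the threshold and are not long-term structures."""
    diff = difference(current, background_model(history))
    return [
        (i, j)
        for i in range(len(diff))
        for j in range(len(diff[0]))
        if (i, j) not in persistent and diff[i][j] > dynamic_threshold(current, i, j, alpha)
    ]
```

In this sketch a guardrail that appears in every history frame cancels out in the difference, while a new stationary return stands out against the local noise estimate.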

[0081] Step 2.4. Generation of candidate target radar map:

[0082] For the candidate stationary targets marked in step 2.3, in order to preserve the original reflection characteristics of the targets while suppressing the fixed background, weighted attenuation is applied to the current range-velocity spectrum:

[0083]

[0084] where the terms denote: the current range-velocity spectrum; the background suppression coefficient, set to 0.7, which controls the degree of background attenuation; the maximum intensity value in the background echo model, used for normalization; the background echo model; and the resulting candidate target radar map.

[0085] Step 3. Multi-source data fusion and identification:

[0086] This step employs a secondary enhancement fusion architecture based on semantic feedback, aiming to address the problem of single sensor failure in complex environments. Specifically, the processing logic of this step includes the following core paths:

[0087] First, the system performs first-level feature fusion. It extracts high-dimensional features from the radar's range-velocity spectrum, visible light image, and infrared image using RaFormer, MobileNeXt-ViT, and IR-ViT networks, respectively. A three-modal unified encoder maps these features to a unified high-dimensional semantic space. A cross-modal triple-attention module (Tri-Attention) captures the bidirectional complementary relationships between radar-visible light, radar-infrared, and visible light-infrared regions, respectively, generating the first-level fused features. In particular, addressing the challenge of identifying black rubber debris (such as tire blowout fragments) at night, the radar-infrared interaction mechanism in the fusion architecture of this invention plays a crucial role. Although tire fragments have weak signals in radar echoes and are almost invisible under visible light, they exhibit significant "thermal residue" characteristics in infrared thermal imaging due to the high heat generated by vehicle friction or the difference in specific heat capacity between rubber and asphalt pavement. The system utilizes the bright thermal radiation areas captured by the infrared mode as attention guides, forcing the network to focus on the weak echo signals at corresponding locations in the radar feature map. By leveraging the thermal saliency of infrared light, weak reflection points suppressed by the radar algorithm are recovered, thereby achieving accurate localization of non-metallic, dim targets.

[0088] Secondly, intermediate-layer semantic inference and object feature generation are performed. This embodiment sets up parallel MaterialINet and ShapeNet networks for material inference and shape estimation, and feeds the first-level fused features into this dual-branch material and morphology network for analysis. The material inference network outputs the probability of each physical material category of the scattered objects (such as metal, wood, or rubber), while the morphology estimation network outputs the three-dimensional length, width, height, and volume parameters of the target. The inferred material information and morphology parameters are then concatenated to generate an explicit object semantic feature vector.

[0089] Subsequently, secondary feedback fusion and dynamic decision-making are performed. This embodiment uses a feature enhancement loop: the object semantic feature vector, which contains explicit physical attributes, is enhanced by a nonlinear transformation, fed back into the feature space, concatenated a second time with the first-level fused features, and jointly encoded to form a joint feature representation. An uncertainty-guided dynamic weighting strategy then performs decision-level processing on the joint features. Physical semantic constraints are thereby introduced explicitly at the feature level, which improves recognition accuracy.

[0090] Finally, the fused features are input into a lightweight classification network to complete target recognition and classification. Stability is verified through a multi-frame temporal fusion mechanism, resulting in the output of high-confidence debris categories and confidence information. This invention introduces physical property constraints on debris at the feature layer, avoiding reliance solely on visual appearance for discrimination. Therefore, even without external lighting, it can stably distinguish non-reflective debris such as iron blocks, wood blocks, and tire linings, effectively reducing false positives and false negatives at night.

[0091] Specifically, in this embodiment, step 3 includes the following steps:

[0092] Step 3.1. Radar High-Dimensional Feature Extraction:

[0093] In this embodiment, the RaFormer radar feature extraction network is used to process the radar map of candidate targets to obtain high-dimensional radar feature vectors:

[0094]

[0095] Step 3.2. Image high-dimensional feature extraction and synchronization control:

[0096] In this embodiment, the vision system carried by the UAV includes a visible light camera (RGB Camera) and an infrared thermal imaging module (IR Sensor), which simultaneously acquire visible light image frames and infrared image frames with a time difference of less than 10 ms, specifically as follows:

[0097]

[0098] where the terms denote the visible light image frame, the infrared image frame, the pixel coordinates, and the timestamp.

[0099] For visible light images, the MobileNeXt-ViT network is used for feature extraction. Its night enhancement module recovers effective texture under low-light conditions, and the network outputs a visible light deep feature vector (high-dimensional feature vector) that mainly contains semantic information such as shape, texture, edges, and structural consistency.

[0100]

[0101] where the terms denote the MobileNeXt-ViT network and the high-dimensional feature vector of the visible light image.

[0102] For infrared images, an IR-ViT network is used to encode temperature texture, thermal contrast, and thermal reflection patterns to distinguish different materials.

[0103]

[0104] where the terms denote the IR-ViT network and the high-dimensional feature vector of the infrared image.

[0105] Step 3.3. Trimodal unified latent space mapping + cross-modal triple attention fusion:

[0106] To enable the features of the three types of sensors to be comparable and fused in the same semantic space, this invention employs a trimodal unified encoder (TU-Encoder) together with a cross-modal triple attention fusion (Tri-Attention) module.

[0107] The trimodal unified encoder first maps the three sets of features to the same high-dimensional space for subsequent processing:

[0108]

[0109] where the terms denote: the representations of the radar, visible light, and infrared features in the unified semantic space; and the linear transformation matrices for the radar, visible light, and infrared features, respectively.

[0110] To maintain cross-modal alignment, that is, to enforce high consistency among the radar, visible light, and infrared features within the unified semantic space so that the three modalities remain comparable and fusible, a semantic consistency loss function is defined as:

[0111]

[0112] Meanwhile, to fully explore the complementary relationship between the three modes, a Tri-Attention module is added, including radar-visible light bidirectional attention that captures the complementary relationship between radar depth information and visual texture, radar-infrared bidirectional attention that uses infrared thermal distribution to compensate for radar material uncertainty, and visible light-infrared bidirectional attention that uses infrared structural thermal difference to enhance visible light contour recognition.

[0113] For radar-visible light bidirectional attention, since radar can provide structural depth and visible light provides fine-grained texture, the spatial edges of radar reflection points are refined using visible light texture, while radar depth is used to compensate for insufficient texture in low-light conditions of visible light. The specific formula is as follows:

[0114] Radar - Visible Light:

[0115]

[0116]

[0117] Visible light - radar:

[0118]

[0119]

[0120] Bidirectional fusion:

[0121]

[0122] where the terms denote: the attention weights from radar to visible light; the attention weights from visible light to radar; the visible light value mapping matrix; the radar value mapping matrix; the visible light key mapping matrix; the radar key mapping matrix; the visible light query mapping matrix; the radar query mapping matrix; the attention normalization function; the scaling factor, which prevents the inner product from becoming too large and destabilizing training; the radar feature representation after visible light attention weighting; the visible light feature representation after radar attention weighting; the radar-visible light fusion feature; and the two fusion weights.
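The bidirectional attention between two modalities can be sketched with plain scaled dot-product cross-attention. This is a minimal illustration under stated assumptions: softmax is taken as the normalization function, the scaling factor is the square root of the key dimension, both modalities carry the same number of tokens, and the parameter dictionary keys and the equal fusion weights of 0.5 are hypothetical.

```python
import math

def matmul(a, b):
    """Matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(matrix):
    """Row-wise softmax with max subtraction for numerical stability."""
    out = []
    for row in matrix:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def cross_attention(query_feats, context_feats, w_q, w_k, w_v):
    """Tokens of `query_feats` attend to tokens of `context_feats`."""
    q = matmul(query_feats, w_q)
    k = matmul(context_feats, w_k)
    v = matmul(context_feats, w_v)
    scale = math.sqrt(len(q[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = [[s / scale for s in row] for row in matmul(q, k_transposed)]
    weights = softmax_rows(scores)
    return weights, matmul(weights, v)

def bidirectional_fuse(radar, visible, params, lam_r=0.5, lam_v=0.5):
    """Weighted sum of radar-attends-visible and visible-attends-radar outputs."""
    _, radar_ctx = cross_attention(radar, visible, params["q_r"], params["k_v"], params["v_v"])
    _, visible_ctx = cross_attention(visible, radar, params["q_v"], params["k_r"], params["v_r"])
    return [
        [lam_r * a + lam_v * b for a, b in zip(r_row, v_row)]
        for r_row, v_row in zip(radar_ctx, visible_ctx)
    ]
```

The same two functions would serve the radar-infrared and visible light-infrared pairs by swapping the inputs and projection matrices.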

[0123] Specifically, this invention addresses the challenge of identifying black rubber debris (such as tire blowout fragments) at night using a radar-infrared dual-attention mechanism. Although tire fragments have weak signals in radar echoes and are almost invisible under visible light, they exhibit significant "thermal residue" characteristics in infrared thermal imaging due to the high heat generated by vehicle friction or the difference in specific heat capacity between rubber and asphalt. The specific formula is as follows:

[0124] Radar-Infrared:

[0125]

[0126]

[0127] Infrared-Radar:

[0128]

[0129]

[0130] Bidirectional fusion:

[0131]

[0132] where the terms denote: the attention weights from radar to infrared; the attention weights from infrared to radar; the infrared value mapping matrix; the infrared key mapping matrix; the infrared query mapping matrix; the radar feature representation after infrared attention weighting; the infrared feature representation after radar attention weighting; the radar-infrared fusion feature; and the two fusion weights.

[0133] Visible light - Infrared:

[0134]

[0135]

[0136] Infrared-Visible Light:

[0137]

[0138] Bidirectional fusion:

[0139]

[0140] where the terms denote: the attention weights from visible light to infrared; the attention weights from infrared to visible light; the visible light feature representation after infrared attention weighting; the infrared feature representation after visible light attention weighting; the visible light-infrared fusion feature; and the two fusion weights.

[0141] In the fusion logic of this invention, the system uses the bright thermal radiation region captured by the infrared mode as attention guidance, forcing the network to focus on the weak echo signal at the corresponding position in the radar feature map. That is, it utilizes the thermal saliency of infrared light to recover weak reflection points suppressed by the radar algorithm, thus enabling accurate location and identification of non-metallic hazardous debris such as tire linings even under extreme conditions of no light and weak echoes. Three bidirectional attention results are obtained sequentially: radar-visible light, radar-infrared, and visible light-infrared, forming three domains:

[0142]

[0143]

[0144] where the operator denotes element-wise enhancement, used to capture higher-order interactions among the three modalities, and the four weighting coefficients take the values 0.3, 0.25, 0.35, and 0.10. In this embodiment, since thermal information and radar reflection are particularly important for material identification at night, the largest weight is given to the radar-infrared domain, while the radar-visible domain and the visible-infrared domain are assigned similar but smaller weights, and the interaction term is conservatively set to 0.1.

[0145] The first-level fusion features are finally obtained:

[0146]

[0147] Step 3.4. The first-level three-modal fusion features are fed into the parallel MaterialINet and ShapeNet models for deep material inference and shape recognition, respectively.

[0148] First, the MaterialINet model is used to learn the synergistic relationship between radar reflection, thermal reflection, and texture in order to output the material category of the scattered objects:

[0149]

[0150] where the output is the material type (iron block, wood block, tire skin, etc.), and the remaining two terms are the parameters of the material classification layer.

[0151] Simultaneously, the tri-domain enhanced features are used to predict 3D morphological parameters: the ShapeNet model estimates the target's shape, size, volume, and boundary integrity to determine the size of the target debris.

[0152]

[0153]

[0154] where the terms denote: the morphological feature vector (volume); the mapping function applied by the ShapeNet model to the fused features; the estimated length, width, and height of the target, respectively; and the boundary integrity, used for morphological reliability compensation.

[0155] To explicitly extract the thermal radiation semantic features of the target, a lightweight temperature feature extraction network, Temperature-Net, is used to extract the temperature feature vector from the three-modal fusion features (first-level fusion features):

[0156]

[0157] where the activation is the rectified linear unit (ReLU) function, and the remaining two terms are the parameters of the temperature feature extraction network.

[0158] The final combination yields multidimensional object features:

[0159]

[0160] Step 3.5. To further enhance the representational capability of the features, the system employs a lightweight feature enhancement module to apply a nonlinear transformation and dimensionality compression to the multidimensional object features:

[0161]

[0162] where the two parameter terms are learnable, and the activation function is the Gaussian error linear unit (GELU).

[0163] The enhanced object features are then fused again with the first-level three-modal fusion features generated in step 3.3 to form a joint feature representation:

[0164]

[0165] Step 3.6. Uncertainty-guided fusion decision-making:

[0166] To improve the reliability of recognition in complex nighttime environments, this invention employs Gaussian uncertainty estimation to dynamically weight the three-modal confidence scores:

[0167]

[0168] where the terms denote: the reciprocal of the uncertainty variance of a given sensor modality; the sum over all sensor modalities, which normalizes the weights in the denominator; and the modality indices for the radar, the visible light camera, and the infrared camera.
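The weighting described above matches standard inverse-variance weighting, which can be sketched as follows. The modality names and the example variances are hypothetical; how each variance is estimated is outside the scope of this sketch.

```python
def inverse_variance_weights(variances):
    """Normalize reciprocal variances so the modality weights sum to 1."""
    for name, var in variances.items():
        if var <= 0:
            raise ValueError(f"variance for {name!r} must be positive")
    inverse = {name: 1.0 / var for name, var in variances.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}

def weighted_confidence(confidences, variances):
    """Combine per-modality confidence scores using the dynamic weights."""
    weights = inverse_variance_weights(variances)
    return sum(weights[name] * confidences[name] for name in weights)

# Hypothetical night-time case: visible light is the least certain modality.
example_variances = {"radar": 0.04, "visible": 0.16, "infrared": 0.08}
example_weights = inverse_variance_weights(example_variances)
```

With these example variances the radar receives the largest weight and the visible light camera the smallest, which is the behavior the uncertainty-guided strategy is meant to produce in low light.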

[0169] The final fusion output is:

[0170]

[0171] where the result is the fused feature vector, and the fusion weight for the joint features is set to 0.4 to balance the contributions of the original modalities and the enhanced semantics.

[0172] Finally, the fused feature set is as follows:

[0173]

[0174] where the count term represents the total number of candidate targets in the current frame.

[0175] Step 3.7. Target Recognition and Classification:

[0176] A structure combining convolutional feature extraction with a fully connected discriminator is used: the fused features are input into a lightweight deep neural network for classification and recognition, and the network outputs the target category label and the confidence score. The identifiable types include iron blocks, wooden blocks, tire linings, and so on. When the confidence level satisfies the confidence threshold, which is set to 0.85, the system determines the target to be a valid obstacle. The specific formula is:

[0177]

[0178] where the terms denote the activation function, the weight matrix, the bias vector, the target category label, and the target identification confidence level.

[0179] Step 3.8. Multi-frame temporal fusion and stability determination:

[0180] To improve the system's detection stability under high-speed flight conditions and reduce occasional false positives, such as those caused by vehicle wake turbulence and raindrop echoes, the system introduces a multi-frame temporal fusion mechanism that performs a stability check on targets at the same spatial location across a window of adjacent frames:

[0181]

[0182] where the frame-level confidence term represents the confidence of a given target in a given frame. If the stabilized confidence is greater than 0.8, the target is confirmed to be real debris, and its spatial coordinates and time label are recorded.
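The stability check can be sketched as a sliding window over per-frame confidences. The original formula and window length are not reproduced in this text, so this sketch assumes a simple mean over the window and a hypothetical default of 5 frames; `target_key` is an assumed identifier for "the same spatial location" (for example, a quantized grid cell).

```python
from collections import defaultdict, deque

class TemporalStabilityChecker:
    """Confirms a target once its mean confidence over a full window exceeds a threshold."""

    def __init__(self, window=5, threshold=0.8):
        self.window = window          # number of adjacent frames (assumed value)
        self.threshold = threshold    # 0.8 as stated in the embodiment
        self._history = defaultdict(lambda: deque(maxlen=window))

    def update(self, target_key, confidence):
        """Record one frame-level confidence; return True when the target is confirmed."""
        history = self._history[target_key]
        history.append(confidence)
        if len(history) < self.window:
            return False              # not enough frames observed yet
        return sum(history) / len(history) > self.threshold
```

A single low-confidence frame caused by a transient echo pulls the window mean down, so the target is not confirmed until the detection has been consistently strong across the window.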

[0183] Step 3.9. Output and Alarm Mechanism:

[0184] Upon identifying a high-confidence obstacle, the system immediately transmits the obstacle's coordinates, category, and confidence level to the ground. Simultaneously, obstacle avoidance behavior is triggered within the UAV control module, and the raw radar and image data are stored for subsequent model optimization and validation. The final data output format is:

[0185]

[0186] where the terms denote: the horizontal pixel coordinate of the target in the image coordinate system; the vertical pixel coordinate of the target in the image coordinate system; the category label of the target; the identification confidence level of the target; and the detection timestamp of the target.

[0187] Step 4. Close-range re-verification and 3D positioning:

[0188] This step implements an air-ground collaborative variable-resolution verification mechanism. Specifically, after the initial long-range screening in step 3 detects a suspected target (the identification result meets the confidence criteria), the UAV automatically switches to a low-altitude, close-range mode: it adjusts its flight attitude, lowers its altitude to within 10 to 20 meters, and switches the radar to high-resolution mode for close-range re-detection. The radar ranging data is used to apply a weighted correction to the visual depth, generating a fused 3D point cloud. Subsequently, an ellipsoid fitting algorithm is used to calculate the target's physical envelope volume, and the UAV's heading angle is used to back-calculate precise geographic coordinates. This mechanism uses high-resolution near-field data to verify the physical properties of scattered objects and locate them accurately, ensuring the spatial accuracy and reliability of the identification results.

[0189] Specifically, in this embodiment, step 4 includes the following steps:

[0190] Step 4.1. Preliminary positioning and flight path adjustment:

[0191] The system takes the target's position in the image coordinate system output in step 3 as a basis and combines it with the radar-measured distance, pitch angle, and azimuth to calculate the initial spatial position of the target in the UAV coordinate system:

[0192]

[0193] Secondly, the attitude transformation matrix obtained from the UAV's inertial measurement unit (IMU) and GPS is used to map this position to the geographic coordinate system:

[0194]

[0195] Based on the resulting geographic coordinates, the current status of the UAV, and preset safety parameters, the system plans a "close-range verification route," which consists of a descent phase and an approach maneuver phase. First, the UAV descends along a smooth trajectory to an altitude of approximately 10-15 meters above the target area. The flight control module automatically adjusts its attitude and speed based on real-time GPS and IMU feedback. Subsequently, based on the target shape estimation parameters, it automatically selects and executes a small-angle hovering or lateral micro-translation maneuver strategy. For suspected elongated targets or scenarios requiring area coverage, a set of parallel scans is performed with a lateral overlap rate of no less than 60%, keeping the target in the center of the field of view while forming an effective imaging baseline.

[0196] Step 4.2. Multi-view verification of data acquisition:

[0197] Once the drone enters the close-range verification area, the system initiates simultaneous multi-sensor data acquisition. Utilizing the spatial displacement generated during flight, the visible light camera continuously captures images from different spatial positions, obtaining image frames from various perspectives. This provides the necessary parallax information for subsequent multi-view stereo vision (MVS) reconstruction. Simultaneously, the radar performs short-range, high-resolution ranging; in low-light conditions, the infrared module is activated to perform secondary feature acquisition on objects with obvious thermal response characteristics. Finally, the acquired multimodal data is synchronously stored according to timestamps.

[0198] Step 4.3. Point Cloud Generation and Spatial Feature Reconstruction:

[0199] For the continuous radar data collected during the close-range verification phase, a sparse point cloud is generated using the radar range-angle spectrum. The three-dimensional coordinates of each detection point are determined through the following relationship:

[0200]

[0201] where the terms denote the radar ranging result, the azimuth angle, and the pitch angle, which is calculated using the UAV's attitude sensor. Next, the ICP algorithm is used to perform inter-frame point cloud registration on all frames of data.

[0202]

[0203]

[0204] where the terms denote: the 3×3 rotation matrix; the 3×1 translation vector; the 1×3 zero vector; the transformed point cloud of a given frame (aligned with the adjacent frame); the transformation matrix from one frame to the next; and the original point cloud of that frame. The transformation matrix between adjacent frames is estimated using the ICP algorithm, and the point clouds from multiple frames are registered to the same coordinate system, yielding a densely fused point cloud that reflects the spatial structure of the target surface.
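The point generation and the homogeneous inter-frame transformation can be sketched as below. The axis convention (x forward, y lateral, z up, angles in radians) is an assumption, since the original relationship is not reproduced in this text, and the ICP estimation itself is not implemented: the rotation and translation are taken as already estimated inputs.

```python
import math

def polar_to_cartesian(rng, azimuth, pitch):
    """Convert (range, azimuth, pitch) to x, y, z under the assumed axis convention."""
    horizontal = rng * math.cos(pitch)
    return (
        horizontal * math.cos(azimuth),
        horizontal * math.sin(azimuth),
        rng * math.sin(pitch),
    )

def make_transform(rotation, translation):
    """Build the 4x4 homogeneous matrix [[R, t], [0, 1]] from a 3x3 R and length-3 t."""
    rows = [list(rotation[i]) + [translation[i]] for i in range(3)]
    return rows + [[0.0, 0.0, 0.0, 1.0]]

def apply_transform(transform, points):
    """Map each (x, y, z) point through the homogeneous transform."""
    out = []
    for x, y, z in points:
        h = (x, y, z, 1.0)
        out.append(tuple(sum(transform[i][j] * h[j] for j in range(4)) for i in range(3)))
    return out

def register_frames(frames, transforms):
    """Bring every frame into a common coordinate system and merge the points.

    transforms[i] is assumed to map frame i directly into the common system.
    """
    merged = []
    for points, transform in zip(frames, transforms):
        merged.extend(apply_transform(transform, points))
    return merged
```

Chaining adjacent-frame transforms into frame-to-common transforms is a matrix product of the 4x4 matrices, which is why the homogeneous form with the zero row is convenient.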

[0205] To further improve accuracy, the system fuses the radar point cloud data with the visual depth estimation results using a weighted average:

[0206]

[0207]

[0208]

[0209] where the terms denote: the multi-view image sequence; the reference image; the function that projects a candidate 3D point onto a given view; the pixel similarity measure; the depth value calculated from radar data; the visually estimated depth value; the fused depth value; and the fusion weight, which is dynamically adjusted according to the signal-to-noise ratio within the range 0.6 to 0.8. The weight is smaller when lighting conditions are good and larger when they are poor. The result is a high-precision fused 3D point cloud model of the target area.
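The weighted depth fusion can be sketched as follows. Two assumptions are made because the text does not settle them: the weight is applied to the radar depth (consistent with it growing as lighting worsens), and it is interpolated linearly from an image signal-to-noise ratio between two hypothetical bounds, `snr_dark` and `snr_bright`.

```python
def radar_weight(image_snr_db, snr_dark=5.0, snr_bright=25.0, w_min=0.6, w_max=0.8):
    """Map image SNR to a radar weight in [w_min, w_max]; darker scenes favor radar."""
    t = (image_snr_db - snr_dark) / (snr_bright - snr_dark)
    t = min(max(t, 0.0), 1.0)            # clamp outside the assumed SNR bounds
    return w_max - t * (w_max - w_min)

def fuse_depth(radar_depth, visual_depth, weight):
    """Weighted average of the radar depth and the visually estimated depth."""
    return weight * radar_depth + (1.0 - weight) * visual_depth
```

For example, with a radar depth of 10.0 m, a visual depth of 12.0 m, and a weight of 0.75, the fused depth is 10.5 m.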

[0210] Step 4.4. Three-dimensional positioning calculation:

[0211] The system extracts the target centroid and spatial shape boundary from the fused 3D point cloud; specifically, it determines the physical envelope volume of the scattered objects through least-squares ellipsoid fitting.

[0212]

[0213] where the terms denote the coordinates of the target center and the lengths of the principal axes of the ellipsoid. The three-dimensional coordinates of the target center are taken as the final positioning result. Based on the drone's pose and GPS geographic coordinates, together with the Earth's radius of 6371000 m, the geographical latitude and longitude of this point are calculated. Given the heading angle of the UAV, the northward distance, the eastward distance, and the target's horizontal coordinates in the UAV body coordinate system, the conversion to the north-east coordinate system is:

[0214]

[0215]

[0216]

[0217] where the terms denote the drone's heading angle, the longitude of the drone, the longitude of the target debris, the latitude of the drone, and the latitude of the target debris.
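The conversion from body-frame offsets to latitude and longitude can be sketched with the usual small-offset spherical approximation. The conventions here are assumptions: the heading is measured clockwise from north, the body x axis points forward, and the body y axis points to the right; only the Earth radius of 6371000 m comes from the text.

```python
import math

EARTH_RADIUS_M = 6371000.0

def body_to_north_east(x_forward, y_right, heading_rad):
    """Rotate body-frame horizontal offsets into north and east distances."""
    north = x_forward * math.cos(heading_rad) - y_right * math.sin(heading_rad)
    east = x_forward * math.sin(heading_rad) + y_right * math.cos(heading_rad)
    return north, east

def offset_to_lat_lon(lat_deg, lon_deg, north_m, east_m):
    """Shift a latitude/longitude by small north/east distances on a spherical Earth."""
    dlat = math.degrees(north_m / EARTH_RADIUS_M)
    dlon = math.degrees(east_m / (EARTH_RADIUS_M * math.cos(math.radians(lat_deg))))
    return lat_deg + dlat, lon_deg + dlon

def locate_target(uav_lat, uav_lon, heading_rad, x_forward, y_right):
    """Latitude and longitude of a target given its body-frame horizontal offsets."""
    north, east = body_to_north_east(x_forward, y_right, heading_rad)
    return offset_to_lat_lon(uav_lat, uav_lon, north, east)
```

The longitude step divides by the cosine of the latitude because meridians converge toward the poles; the approximation is adequate for the short UAV-to-target distances involved but degrades near the poles.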

[0218] Step 4.5. Secondary verification based on physical consistency:

[0219] To prevent false detections caused by the long detection distance in step 3, the system performs a secondary verification based on near-range physical features. The system extracts geometrically invariant features (volume, aspect ratio) from the near-range fused 3D point cloud, as well as micro-Doppler features from near-field radar and thermal residual features from infrared images, to construct a composite feature vector.

[0220] First, micro-Doppler features are extracted from the radar time-frequency spectrum and instantaneous frequency to characterize the micro-motion properties of the target:

[0221]

[0222] where the terms denote the standard deviation of the micro-Doppler frequency (Hz), the sideband energy concentration, the time-frequency spectral entropy, the kurtosis of the frequency distribution, and the transpose operator.

[0223] Simultaneously, thermal residual features are extracted from the infrared image sequence to characterize the target's thermal properties:

[0224]

[0225] where the terms denote the maximum temperature rise (K) of the target relative to the background, the temperature decay time constant, the intensity of the spatial temperature gradient, and the standard deviation of the temperature fluctuation in the time domain.

[0226] Finally, the features are fused to generate a composite feature vector:

[0227]

[0228] where the terms denote the volume, the aspect ratio, the transpose of the micro-Doppler feature vector, and the transpose of the thermal residual feature vector.

[0229] The composite feature vector is input into a pre-trained lightweight neural network, which calculates the Mahalanobis distance between the current physical features and the standard feature distribution (fused feature vector) of the category initially determined in step 3. If the distance is less than a set threshold and the confidence level meets its criterion, the target is marked as a "confirmed obstacle"; otherwise, it is downgraded to a "suspected trace." This mechanism uses near-field high-resolution data to eliminate false targets such as shadows and water stains in two-dimensional images.
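The distance test at the heart of the secondary verification can be sketched as a plain Mahalanobis computation. This covers only the distance and the decision rule, not the pre-trained network; the class mean, covariance, and both thresholds are hypothetical inputs that would come from training data, and the confidence criterion is assumed to be a lower bound.

```python
import math

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [[float(v) for v in matrix[i]] + [float(rhs[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (aug[i][n] - sum(aug[i][k] * x[k] for k in range(i + 1, n))) / aug[i][i]
    return x

def mahalanobis(features, mean, covariance):
    """sqrt((x - mu)^T Sigma^-1 (x - mu)), computed without forming the inverse."""
    delta = [f - m for f, m in zip(features, mean)]
    weighted = solve(covariance, delta)
    return math.sqrt(sum(d * w for d, w in zip(delta, weighted)))

def verify(features, mean, covariance, confidence, max_distance, min_confidence):
    """Apply the two-condition rule: close to the class distribution and confident."""
    close = mahalanobis(features, mean, covariance) < max_distance
    return "confirmed obstacle" if close and confidence >= min_confidence else "suspected trace"
```

Unlike a Euclidean distance, the Mahalanobis distance scales each feature by its spread, so features with very different units (cubic metres, hertz, kelvin) can be compared in one test.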

[0230] Step 4.6. Outputting positioning results and storing the 3D model:

[0231] Finally, the system output is as follows:

[0232]

[0233] where the terms denote the altitude of the target debris, the final category label of the target, and the final identification confidence level of the target.

[0234] Simultaneously, standardized 3D point cloud model files are generated, along with original imagery and radar data indexes. The results are transmitted in real time to the ground monitoring center via 4G / 5G links or dedicated frequency band wireless links for map overlay and visualization.

[0235] Step 4.7. Autonomous Obstacle Avoidance and Path Correction:

[0236] If the target location overlaps with the drone's current flight path or the distance is less than the safety threshold of 3 m, the system calculates the obstacle avoidance vector in real time based on the 3D positioning results:

[0237]

[0238] where the terms denote the preliminary spatial position of the target in the UAV coordinate system and the safety distance coefficient, which is set to 3. The drone then automatically adjusts its heading angle so that the flight path stays outside the safety distance range of the obstacle. This obstacle avoidance control is executed automatically by the flight control system without human intervention.

[0239] Step 5. Alarm and Data Transmission:

[0240] After the system completes the 3D localization and category identification of debris on the highway in step 4, it enters the alarm and transmission phase. Based on the identified debris type and its location, the system automatically generates a corresponding risk level. If the target is determined to be a large obstacle, the system immediately triggers an alarm mechanism, uploading the target image, location coordinates, detection time, and risk category to the road monitoring center via a wireless communication module (4G / 5G link). Upon receiving the alarm information, the road monitoring center can automatically generate an emergency response task, reminding maintenance personnel to promptly clear the debris or set up traffic guidance signs, thereby reducing traffic safety risks.

[0241] This invention introduces a comprehensive alarm index based on material risk factors during the alarm triggering phase. This is achieved through joint modeling of the material, size, distance, and identification confidence level of the scattered objects. This enables differentiated risk assessment and tiered early warning, preventing low-risk targets from frequently triggering high-level alarms.

[0242] Specifically, step 5 includes the following steps:

[0243] Step 5.1. Data Packaging and Encoding:

[0244] After the UAV identifies the target, the identification results and raw data are first structured, packaged, and encrypted. The data packet structure includes a header, timestamp, UAV latitude, longitude, and altitude (GPS-Lat, GPS-Lon, Alt), target ID, target type, identification confidence, radar ranging result, radar measured velocity, image feature vector, thermal feature data, alarm level, and CRC checksum. After data packaging, the system uses AES-128 symmetric encryption on key fields to prevent interception or tampering during wireless transmission.

[0245] Step 5.2. Alarm Triggering and Level Determination:

[0246] The system automatically generates alarm levels based on the identification results. Specifically, the comprehensive alarm index is calculated from the following parameters:

[0247]

[0248] where the terms are as follows. The target identification confidence score (the final target identification confidence score) is the output of the multi-source fusion network. The spatial proximity of the target to the drone is defined in terms of the horizontal distance from the drone to the target and the maximum effective monitoring distance preset by the system. The target size weight is the volume. The material risk factor is defined as a weighted sum of material probabilities:

[0249]

[0250] where the first term is the probability output by the material classifier for the k-th material class, and the second is the risk factor for that material. Iron is assigned 1.00 (high risk: puncture hazard, obvious obstacle), tire/rubber 0.90 (high risk: prone to causing tire blowouts), wood 0.6 (medium risk), plastic 0.45 (lower risk), and other materials 0.5. The four weighting coefficients take the values 0.4, 0.15, 0.2, and 0.25, respectively. The alarm level is then determined from the index by thresholding:

[0251]

[0252] When the alarm is a Level 1 alert, the system records the event and caches the data without actively uploading it; when the alarm is a Level 2 warning, the drone issues a voice / light warning signal and sends an event summary to the ground station; when a Level 3 severe alarm is triggered, real-time data stream transmission and a high-priority communication channel are activated, and the data is pushed to the highway emergency command center.
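The comprehensive alarm index can be sketched as a weighted sum. Several points are assumptions because the formulas are not reproduced in this text: the four weights are applied in the order confidence, proximity, size, material risk; proximity is taken as one minus the ratio of distance to maximum monitoring distance; the size weight is assumed to be pre-normalized to [0, 1]; and the level thresholds are left as caller-supplied parameters since their values are not given.

```python
MATERIAL_RISK = {"iron": 1.00, "rubber": 0.90, "wood": 0.60, "plastic": 0.45, "other": 0.50}
WEIGHTS = (0.40, 0.15, 0.20, 0.25)  # assumed order: confidence, proximity, size, material

def material_risk(probabilities):
    """Probability-weighted sum of per-material risk factors."""
    return sum(
        p * MATERIAL_RISK.get(name, MATERIAL_RISK["other"])
        for name, p in probabilities.items()
    )

def proximity(distance_m, max_distance_m):
    """Assumed form: 1 at the drone, 0 at or beyond the maximum monitoring distance."""
    return min(max(1.0 - distance_m / max_distance_m, 0.0), 1.0)

def alarm_index(confidence, distance_m, max_distance_m, size_weight, probabilities):
    """Weighted combination of confidence, proximity, size, and material risk."""
    w_conf, w_prox, w_size, w_mat = WEIGHTS
    return (
        w_conf * confidence
        + w_prox * proximity(distance_m, max_distance_m)
        + w_size * size_weight
        + w_mat * material_risk(probabilities)
    )

def alarm_level(index, level2_threshold, level3_threshold):
    """Map the index to level 1, 2, or 3 using caller-supplied thresholds."""
    if index >= level3_threshold:
        return 3
    if index >= level2_threshold:
        return 2
    return 1
```

Because the material term uses class probabilities rather than a single hard label, an uncertain classification between iron and plastic yields an intermediate risk instead of jumping between extremes.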

[0253] Step 5.3. Wireless Communication and Data Upload:

[0254] The system uses a dual-link communication mode so that data transmission remains stable in complex terrain and at night. The primary link is a high-speed data channel over a 4G / 5G cellular network, used for real-time reporting of video streams and target data. The secondary link uses a LoRa / Wi-Fi Mesh module for short-range backup communication, and the system switches to it automatically when the primary link signal weakens. In the main upload process, the UAV detects a target, packages the data frame, and pushes it to the cloud message queue via the MQTT protocol; the cloud parses the data, writes it to the database, and synchronizes it to the highway management monitoring terminal, and an ACK signal is returned to confirm successful data reception. Communication latency is generally kept within 100-300 ms, which meets the requirement for real-time alarms.
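The dual-link behavior can be sketched in Python as follows. This is a simplified illustration, not the described implementation: it falls back to the secondary link when no ACK is received, whereas the text describes switching when the primary signal weakens, and the `send` callables stand in for hypothetical cellular and LoRa / Wi-Fi Mesh transports.

```python
def send_with_failover(frame, links, retries=2):
    """Try each link in priority order and return the name of the link that ACKed.

    links: list of (name, send) pairs, where send(frame) returns True when an
    ACK came back. Both the names and the callables are hypothetical.
    """
    for name, send in links:
        for _ in range(retries):
            if send(frame):
                return name
    return None  # no link confirmed reception; the caller should cache the frame
```

A caller would list the cellular link first and the backup link second, so the backup is only used after the primary has failed to confirm the frame.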

[0255] Step 5.4. Cloud-based event management and visualization:

[0256] Once the cloud receives the data, it analyzes the event, extracting the target type, location, confidence level, and alarm level, and simultaneously marks the target location on the highway electronic map. It then combines the drone's flight path with historical detection records to generate an event trajectory line. Finally, the real-time video footage of the target, target category icon, coordinates, alarm level, and historical detection trend chart are displayed on the dispatch terminal's visualization interface. An early warning log is automatically generated for post-event analysis.

[0257] Step 5.5. Local caching and breakpoint retransmission mechanism:

[0258] To handle communication interruptions, the system incorporates a data buffer and a breakpoint-resume mechanism. When communication fails, the drone locally caches the 300 most recent detection data entries, each with a unique number and a CRC checksum. Once communication is restored, unconfirmed data packets are re-uploaded by comparing the numbers. If the cache overflows, the system automatically deletes the oldest record and issues a "cache warning." This mechanism protects against data loss in communication dead zones such as mountainous areas and tunnels, as long as the backlog stays within the cache capacity.
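The caching and retransmission behavior described above might look like the following Python sketch. The class and method names are hypothetical; only the capacity of 300 entries, the unique numbering, the CRC checksum, the oldest-first eviction, and the "cache warning" come from the text.

```python
import zlib
from collections import OrderedDict


class DetectionCache:
    """Bounded cache of unconfirmed detection entries, oldest first."""

    def __init__(self, capacity=300):
        self.capacity = capacity
        self.pending = OrderedDict()  # sequence number -> (payload, crc32)
        self.warnings = []

    def store(self, seq, payload):
        """Cache an entry, evicting the oldest one if the cache is full."""
        if len(self.pending) >= self.capacity:
            self.pending.popitem(last=False)  # delete the oldest record
            self.warnings.append("cache warning")
        self.pending[seq] = (payload, zlib.crc32(payload))

    def ack(self, seq):
        """Drop an entry once the cloud has confirmed it."""
        self.pending.pop(seq, None)

    def unconfirmed(self):
        """Entries to re-upload after the link is restored, in original order."""
        return [(seq, payload) for seq, (payload, _crc) in self.pending.items()]
```

After reconnection, the drone would iterate over `unconfirmed()`, resend each entry, and call `ack()` as confirmations arrive.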

[0259] Step 5.6. Ground-based coordinated response mechanism:

[0260] Once a Level 3 alarm is triggered, staff will dispatch drones or inspection vehicles to the scene for a follow-up inspection, or notify nearby traffic police or maintenance personnel to remove the obstacle, and push notifications such as "Obstacle Ahead" to the variable message signs (VMS). At the same time, based on the target size estimated by radar, lane closures or speed limits will be assessed.

[0261] The present invention also provides an electronic device, comprising: one or more processors and a memory; wherein the memory is used to store one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors implement the above-described method for identifying debris on highways by unmanned aerial vehicles based on multimodal semantic fusion.

[0262] The present invention also provides a computer-readable medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the above-described method for identifying debris on highways by unmanned aerial vehicles based on multimodal semantic fusion.

[0263] Those skilled in the art will understand that all or part of the functions of the various methods / modules in the above embodiments can be implemented by hardware or by computer programs. When all or part of the functions in the above embodiments are implemented by computer programs, the program can be stored in a computer-readable storage medium, which may include: read-only memory, random access memory, disk, optical disk, hard disk, etc., and the above functions can be implemented by executing the program with a computer. For example, the program can be stored in the memory of a device, and when the program in the memory is executed by the processor, all or part of the above functions can be implemented.

[0264] In addition, when all or part of the functions in the above embodiments are implemented by computer programs, the programs can also be stored in storage media such as servers, other computers, disks, optical discs, flash drives, or portable hard drives. They can be downloaded or copied to the memory of the local device, or the system of the local device can be updated. When the program in the memory is executed by the processor, all or part of the functions in the above embodiments can be implemented.

[0265] The above-described specific examples are for illustrative purposes only and are not intended to limit the scope of the invention. Those skilled in the art can make various simple deductions, modifications, or substitutions based on the principles of this invention. Therefore, the scope of protection of this invention should be determined by the scope of the claims.

Claims

1. A method for identifying debris on highways using unmanned aerial vehicles (UAVs) based on multimodal semantic fusion, characterized in that the method includes the following steps:

Step 1. During the UAV's cruise, collect relevant data from the UAV, as well as radar data and visible light and infrared thermal imaging data;

Step 2. Preprocess the radar data to obtain the radar map of candidate targets;

Step 3. Extract high-dimensional features from the candidate target radar maps, visible light images, and infrared images; use a three-modal unified encoder to map the three to a unified high-dimensional semantic space; then use a cross-modal triple attention module to capture the bidirectional complementary relationships between radar and visible light, radar and infrared, and visible light and infrared, respectively, to generate first-level fusion features; set up a material inference network and a morphology estimation network in parallel, use the obtained first-level fusion features as their inputs, and concatenate their outputs; after enhancement by a feature enhancement loop, perform a second fusion with the first-level fusion features to form a joint feature representation; then use an uncertainty-guided dynamic weighted fusion strategy to perform decision-level processing on the joint features and output the fused features; finally, input the fused features into a lightweight classification network to complete target recognition and classification; when a suspected target is detected, proceed to Step 4;

Step 4. The UAV switches to low-altitude close-range mode, uses radar ranging data to perform weighted correction on the visually estimated depth, and generates a fused 3D point cloud; it then uses an ellipsoid fitting algorithm to calculate the physical envelope volume of the target, and combines the UAV's heading angle to perform inverse geographic coordinate calculation; the physical features obtained in low-altitude close-range mode are compared with the features input during target identification and classification in Step 3; if the comparison result is less than the set threshold and the confidence level is greater than the set threshold, the target is determined to be a "confirmed obstacle", the positioning result and final classification label are output, and Step 5 is executed;

Step 5. Calculate the comprehensive alarm index based on the identification results and automatically generate the alarm level.
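The inverse geographic coordinate calculation in Step 4 can be illustrated with a small Python sketch. It assumes a flat-earth approximation over short distances, a heading measured clockwise from north, and a target offset expressed in the UAV body frame as meters forward and meters to the right; none of these conventions are stated in the text, and the function name is hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, adequate for short offsets


def offset_to_latlon(lat_deg, lon_deg, heading_deg, forward_m, right_m):
    """Convert a body-frame offset from the UAV into target latitude/longitude."""
    h = math.radians(heading_deg)
    # Rotate the body-frame offset into north/east components.
    north = forward_m * math.cos(h) - right_m * math.sin(h)
    east = forward_m * math.sin(h) + right_m * math.cos(h)
    dlat = math.degrees(north / EARTH_RADIUS_M)
    dlon = math.degrees(east / (EARTH_RADIUS_M * math.cos(math.radians(lat_deg))))
    return lat_deg + dlat, lon_deg + dlon
```

The approximation degrades near the poles and over long distances, which should not matter at the close ranges used in low-altitude verification.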

2. The method for identifying scattered debris on highways by unmanned aerial vehicles based on multimodal semantic fusion according to claim 1, characterized in that Step 2 specifically includes the following steps:

Step 2.1. Format the raw data;

Step 2.2. Perform two Fast Fourier Transforms on the acquired intermediate frequency signal to complete the range-domain transformation and the Doppler-domain transformation respectively, generating a range-velocity two-dimensional matrix; compare the radial velocity parsed from the range-velocity two-dimensional matrix with the velocity threshold to eliminate moving vehicles, and pass the remaining stationary targets to the spatial consistency screening of the next step;

Step 2.3. Construct a background echo model based on the range-velocity two-dimensional matrix; calculate the difference between the current frame's range-velocity spectrum and the background echo model, and compare the difference with a dynamic noise threshold; if the difference exceeds the dynamic noise threshold and the location does not belong to a long-term existing region, mark it as a candidate stationary target; the dynamic noise threshold is determined by the neighborhood window centered on the position under test, an adjustment coefficient that is dynamically updated based on the real-time noise level, the noise power value of each unit in that window for the target detection unit of the current range-velocity spectrum, and the number of cells within the neighborhood window;

Step 2.4. For the candidate stationary targets marked in Step 2.3, perform weighted attenuation on the current range-velocity spectrum to obtain the candidate target radar map; the attenuation is determined by the current range-velocity spectrum, the background suppression coefficient, the background echo model, and the maximum intensity value in the background echo model.
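The thresholding in Step 2.3 resembles a cell-averaging detector, and a minimal Python sketch under that assumption is given below. The exact threshold formula is not recoverable from the text, so the form used here (adjustment coefficient times the mean noise power of the neighborhood cells) is an assumption, as are the function name, the parameter names, and the default values.

```python
def detect_candidates(diff, power, alpha=3.0, half_window=1, persistent=frozenset()):
    """Mark cells whose background difference exceeds a local noise threshold.

    diff:  2D list, difference between current spectrum and background model.
    power: 2D list, noise power of the current range-velocity spectrum.
    persistent: set of (row, col) cells belonging to long-term existing regions.
    """
    rows, cols = len(diff), len(diff[0])
    hits = set()
    for r in range(rows):
        for c in range(cols):
            # Neighborhood window centered on (r, c), excluding the cell itself.
            neigh = [power[i][j]
                     for i in range(max(0, r - half_window), min(rows, r + half_window + 1))
                     for j in range(max(0, c - half_window), min(cols, c + half_window + 1))
                     if (i, j) != (r, c)]
            if not neigh:
                continue
            threshold = alpha * sum(neigh) / len(neigh)
            if diff[r][c] > threshold and (r, c) not in persistent:
                hits.add((r, c))
    return hits
```

In a real pipeline `alpha` would be updated from the measured noise level, as the claim describes, rather than fixed.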

3. The method for identifying scattered debris on highways by unmanned aerial vehicles based on multimodal semantic fusion according to claim 1, characterized in that Step 3 uses the RaFormer radar feature extraction network to process the radar spectrum of the candidate targets, obtaining high-dimensional radar feature vectors; uses the MobileNeXt-Vit network to process the visible light image, obtaining high-dimensional visible light feature vectors; and uses the IR-ViT network to process the infrared image, obtaining high-dimensional infrared image feature vectors; the cross-modal triple attention module includes radar-visible light bidirectional attention, radar-infrared bidirectional attention, and visible light-infrared bidirectional attention; radar-visible light bidirectional attention is used to capture the complementary relationship between radar depth information and visual texture; radar-infrared bidirectional attention is used to compensate for radar material uncertainties using the infrared thermal distribution; and visible light-infrared bidirectional attention is used to enhance visible light contour recognition using infrared structural thermal differences; finally, the results of the three bidirectional attentions (radar-visible light, radar-infrared, and visible light-infrared) are fused to obtain the first-level fusion feature, where the weighting coefficients of the fusion take the values 0.3, 0.25, 0.35, and 0.10.

4. The method for identifying scattered debris on highways by unmanned aerial vehicles based on multimodal semantic fusion according to claim 1, characterized in that, in step 3, the material inference network uses the MaterialINet model, the shape estimation network uses the ShapeNet model, and the temperature feature vector in the first-level fusion feature is extracted using the Temperature-Net temperature feature extraction network; the resulting features are combined to obtain the multi-dimensional object features; a lightweight feature enhancement module then enhances the multi-dimensional object features by performing nonlinear transformation and dimensional compression, yielding the enhanced object features.

5. The method for identifying scattered debris on highways by unmanned aerial vehicles based on multimodal semantic fusion according to claim 1, characterized in that, in step 3, when decision-level processing of the joint features is performed using the uncertainty-guided dynamic weighted fusion strategy, Gaussian uncertainty estimation is used to dynamically weight the three-modal confidence scores; the weight of each sensor modality is based on the reciprocal of the variance of that modality's uncertainty, normalized by a summation over all sensor modalities, namely radar, visible light camera, and infrared camera; the final fusion output is the fused feature vector, which is expressed in terms of the fusion weight of the joint features, the joint feature representation, the weight coefficients corresponding to the three modalities, and the radar, visible light, and infrared representations in the high-dimensional semantic space.
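A common way to realize uncertainty-guided weighting of this kind is inverse-variance weighting, sketched below in Python. This is an assumption about the form of the weights, not a reproduction of the claimed formula: each modality's weight is taken as the reciprocal of its variance divided by the sum of the reciprocal variances. The function names and dictionary keys are hypothetical.

```python
def inverse_variance_weights(variances):
    """Normalized weights proportional to 1 / variance for each modality."""
    inv = {m: 1.0 / v for m, v in variances.items()}
    total = sum(inv.values())
    return {m: x / total for m, x in inv.items()}


def fuse_scores(scores, variances):
    """Weighted sum of per-modality confidence scores."""
    w = inverse_variance_weights(variances)
    return sum(w[m] * scores[m] for m in scores)
```

With this choice, a modality whose uncertainty doubles contributes half as much as before relative to the others, so a noisy visible light camera at night is automatically down-weighted against radar and infrared.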

6. The method for identifying scattered debris on highways by unmanned aerial vehicles based on multimodal semantic fusion according to claim 1, characterized in that, in step 3, the lightweight classification network uses a structure of convolutional feature extraction followed by a fully connected discriminator, and outputs target category labels and confidence scores; when the confidence score exceeds a set threshold, the target is considered a valid obstacle; during target recognition and classification, a temporal frame fusion mechanism is introduced in which adjacent frames are fused and, within those frames, a stability check is performed on targets at the same spatial location; if the stability confidence is greater than a set threshold, the target is confirmed as real debris, and its spatial coordinates and time label are recorded.
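One possible reading of the temporal stability check is sketched below in Python. The definition of stability confidence used here (the fraction of adjacent frames in which a sufficiently confident detection stays near the first detected position) is an assumption, and all thresholds and names are hypothetical.

```python
def is_stable(detections, min_confidence=0.6, max_shift_m=0.5, min_ratio=0.8):
    """Check whether a target persists at the same location across frames.

    detections: one entry per adjacent frame, either (x_m, y_m, confidence)
    or None when the target was not detected in that frame.
    """
    if not detections:
        return False
    seen = [d for d in detections if d is not None and d[2] >= min_confidence]
    if not seen:
        return False
    x0, y0, _ = seen[0]
    consistent = [d for d in seen
                  if abs(d[0] - x0) <= max_shift_m and abs(d[1] - y0) <= max_shift_m]
    # Stability confidence: share of frames with a consistent detection.
    return len(consistent) / len(detections) >= min_ratio
```

A check of this kind suppresses single-frame false alarms such as glare or transient radar clutter, at the cost of a short confirmation delay.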

7. The method for identifying scattered debris on highways by unmanned aerial vehicles based on multimodal semantic fusion according to claim 1, characterized in that Step 4 includes the following steps:

Step 4.1. Based on the target's position in the image coordinate system output in Step 3, preliminarily determine the target's position in the geographic coordinate system; based on the target's position, combined with the UAV's current status and preset safety parameters, adjust the flight path so that the UAV is located 10-15 meters above the target area;

Step 4.2. Multi-view, multi-modal verification data acquisition;

Step 4.3. Generate a point cloud based on the collected continuous radar data, calculate the visually estimated depth based on image frames from different viewpoints, fuse the depth estimated from radar data with the visually estimated depth, and generate a fused 3D point cloud of the target area;

Step 4.4. Extract the target centroid and spatial shape boundary from the fused 3D point cloud, combine the UAV heading angle to perform inverse geographic coordinate calculation, and output the latitude and longitude of the target debris;

Step 4.5. Extract the geometric invariant features of the fused 3D point cloud, the micro-Doppler features of the radar, and the thermal residual features of the infrared image to construct a complex kernel feature vector; input the complex kernel feature vector into a pre-trained lightweight neural network to calculate the Mahalanobis distance between the current physical features and the features input in the initial classification of Step 3; if the distance is less than the set threshold and the confidence level is greater than the set threshold, the target is determined to be a "confirmed obstacle";

Step 4.6. For a "confirmed obstacle", output the fused 3D point cloud model, localization results, and final classification labels;

Step 4.7. After determining the location of the "confirmed obstacle", if the location overlaps with the current flight path of the UAV or the distance is less than the safety threshold of 3 m, calculate the obstacle avoidance vector in real time based on the three-dimensional positioning results; the UAV then adjusts its heading angle so that the flight path leaves the safe distance range of the "confirmed obstacle".
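The confirmation test of Step 4.5 can be illustrated with the following Python sketch. It computes the Mahalanobis distance directly rather than through a pre-trained network, and it assumes a diagonal covariance (independent features) to stay within the standard library; both are simplifications. The thresholds and function names are hypothetical.

```python
import math


def mahalanobis_diag(x, mean, variances):
    """Mahalanobis distance under a diagonal covariance assumption."""
    return math.sqrt(sum((a - m) ** 2 / v for a, m, v in zip(x, mean, variances)))


def confirm_obstacle(close_features, initial_features, variances,
                     confidence, max_distance=3.0, min_confidence=0.7):
    """Return True when close-range features agree with the initial classification."""
    distance = mahalanobis_diag(close_features, initial_features, variances)
    return distance < max_distance and confidence > min_confidence
```

With a full covariance matrix the per-feature division would be replaced by multiplication with the inverse covariance, which also accounts for correlated features.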

8. The method for identifying scattered debris on highways by unmanned aerial vehicles based on multimodal semantic fusion according to claim 1, characterized in that, in step 5, the comprehensive alarm index is calculated from the final target identification confidence; the spatial proximity of the target to the drone, which is defined from the horizontal distance from the drone to the target and the maximum effective monitoring distance; the target size weight, i.e., volume; and the material risk factor, which is a weighted sum of material probabilities in which the probability output by the material classifier for the k-th material class is weighted by the risk coefficient for the k-th material type; iron takes the value 1.00, tire / rubber takes 0.90, wood takes 0.6, plastic takes 0.45, and other materials take 0.5; the four weighting coefficients of the index take the values 0.4, 0.15, 0.2, and 0.25 respectively.

9. The method for identifying scattered debris on highways by unmanned aerial vehicles based on multimodal semantic fusion according to claim 1, characterized in that, in step 5, the alarm level is determined based on thresholds and is divided into three levels; when the alarm is a Level 1 prompt alarm, the event is recorded and the data is cached without being actively uploaded; when the alarm is a Level 2 warning alarm, the drone issues a voice / light warning signal and sends an event summary to the ground station; when a Level 3 severe alarm is triggered, the real-time data stream is transmitted to the highway emergency command center.

10. The method for identifying scattered debris on highways by unmanned aerial vehicles based on multimodal semantic fusion according to claim 7, characterized in that, in Step 4, the geometric invariant features of the fused 3D point cloud, the micro-Doppler features of the radar, and the thermal residual features of the infrared image are extracted to construct a complex feature vector; the geometric invariant features of the fused 3D point cloud include volume and aspect ratio; the micro-Doppler features of the radar are extracted from the radar time-frequency spectrum and instantaneous frequency and form a column vector consisting of the standard deviation of the micro-Doppler frequency, the sideband energy concentration, the time-frequency spectral entropy, and the frequency distribution kurtosis; the thermal residual features of the infrared image consist of the maximum temperature rise of the target relative to the background, the temperature decay time constant, the intensity of the spatial temperature gradient, and the standard deviation of the temperature fluctuation in the time domain.