Multi-type road disease collaborative detection method and system based on unmanned aerial vehicle aerial image

By constructing an end-to-end multi-task disease detection model and adopting a hybrid training set and a dynamic gradient blocking mechanism, the method addresses the accurate identification and segmentation of multiple types of road diseases, enabling automated disease detection and risk assessment and improving detection accuracy and practicality.

CN122199488A · Pending · Publication Date: 2026-06-12 · ZHONGXIN HANCHUANG (JIANGSU) TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ZHONGXIN HANCHUANG (JIANGSU) TECH CO LTD
Filing Date
2026-03-16
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing technologies are insufficient for the accurate identification and segmentation of various types of road defects, and they are highly dependent on labeled data. They fail to effectively coordinate and optimize defect detection, segmentation, and 3D parameter extraction, resulting in incomplete detection results and insufficient practicality.

Method used

An end-to-end multi-task disease detection model is constructed, employing a hybrid training set, a dual-path feature aggregation module, a progressive alignment module, and an adaptive multi-task detection head. Combined with hybrid supervised training and a dynamic gradient blocking mechanism, it achieves multi-scale feature fusion and dynamic task optimization, and simultaneously outputs disease classification, segmentation, and depth information.

Benefits of technology

It improves the accuracy and real-time performance of detecting various types of road defects, reduces reliance on labeled data, provides comprehensive and reliable decision support, and realizes automated detection and risk assessment of defects.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a multi-type road disease collaborative detection method and system based on unmanned aerial vehicle (UAV) aerial images. A hybrid training set is constructed, and an end-to-end multi-task disease detection model is built. The model uses a dual-path feature aggregation module for bidirectional multi-scale feature fusion, a progressive alignment module for cross-resolution spatial alignment of feature maps, and the three-layer attention mechanism of an adaptive multi-task detection head to dynamically adjust the weights of the classification, segmentation and depth estimation tasks. During training, a hybrid supervision strategy and a dynamic gradient blocking mechanism are used, and the model is optimized with multi-stage loss weight scheduling. The image to be detected is input into the trained model, which synchronously outputs the disease segmentation map, classification result and depth distribution map; quantitative disease evaluation is then carried out through connected component analysis and multi-dimensional parameter extraction. The application achieves integrated detection, accurate segmentation and three-dimensional quantitative evaluation of multiple types of road diseases such as cracks, potholes and subsidence.

Description

Technical Field

[0001] This invention relates to the field of road inspection and maintenance technology, specifically to a collaborative detection method and system for multiple types of road defects based on UAV aerial images.

Background Technology

[0002] With the acceleration of urbanization and the continuous development of transportation networks, the timely detection and assessment of road surface defects (such as cracks, potholes, and subsidence) has become a crucial link in ensuring road safety and improving traffic efficiency. Traditional road defect detection mainly relies on manual inspections or vehicle-mounted detection equipment, which suffer from low efficiency, limited coverage, and strong subjectivity, and struggle to meet the needs of large-scale, high-frequency road condition monitoring. In recent years, the rapid development of drone technology has provided an efficient and flexible solution for road surface image acquisition, significantly improving the breadth and timeliness of data acquisition.

[0003] However, intelligent road defect detection based on UAV imagery still faces numerous technical challenges. First, road defects are diverse in type and complex in form, and are significantly affected by environmental factors such as lighting, shadows, and dirt, making it difficult for a single detection task to accurately identify and segment multiple types of defects. Second, existing methods largely rely on large-scale pixel-level labeled data, which is costly and time-consuming, limiting the widespread application of these models in real-world scenarios. Furthermore, identifying road defects requires not only identifying their type and location but also quantifying their geometric dimensions, depth distribution, and other parameters to support accurate risk assessment and maintenance decisions. Most current methods fail to achieve coordinated optimization of defect detection, segmentation, and 3D parameter extraction. For example, while existing methods like Mask R-CNN can segment defects, they do not integrate depth estimation; multi-task learning frameworks like MTLNet attempt collaborative classification and segmentation but fail to effectively resolve gradient conflicts between tasks, resulting in limitations in the completeness and practicality of the detection results.

[0004] To address the aforementioned issues, various deep learning-based methods have been proposed in existing technologies. For example, some studies employ two-stage object detection frameworks (such as Faster R-CNN and Mask R-CNN) for defect localization and segmentation, but these suffer from high computational complexity and poor real-time performance. Other studies attempt to process classification and segmentation tasks simultaneously using multi-task learning architectures; however, feature conflicts and gradient interference between different tasks have not been effectively resolved, which degrades overall model performance. Furthermore, existing methods often employ unidirectional or simple concatenation strategies for feature fusion, making it difficult to fully integrate multi-scale, cross-resolution road surface features and resulting in poor detection performance for small defects and edge regions.

[0005] Therefore, there is an urgent need for an intelligent method that can efficiently coordinate the detection, segmentation, and three-dimensional parameter extraction of multiple types of road defects, in order to improve detection accuracy, reduce dependence on labeled data, and provide a comprehensive and reliable basis for road maintenance decisions.

Summary of the Invention

[0006] The purpose of this invention is to provide a collaborative detection method and system for multiple types of road defects based on UAV aerial images, in order to solve the problems of incomplete detection of multiple types of road defects, strong dependence on labeled data, insufficient multi-task collaborative optimization, and inaccurate extraction of three-dimensional parameters in the existing technology.

[0007] In a first aspect, embodiments of this application provide a collaborative detection method for multiple types of road defects based on UAV aerial images, the method comprising:

[0008] High-resolution images of road surfaces are collected using drones, and a hybrid training set containing pixel-level disease annotation samples and image-level disease classification labels is constructed.

[0009] An end-to-end multi-task disease detection model is constructed, which includes a feature extraction network, a dual-path feature aggregation module, a progressive alignment module, and an adaptive multi-task detection head. The dual-path feature aggregation module achieves multi-scale feature fusion through a top-down and bottom-up bidirectional path. The progressive alignment module achieves accurate alignment of feature maps at different resolutions through spatial offset learning. The adaptive multi-task detection head includes a parallel classification head, a segmentation head, and a depth estimation head, and dynamically adjusts the weights of each task head through an attention mechanism.

[0010] The multi-task disease detection model is trained using a hybrid supervised training strategy. During backpropagation, a dynamic gradient blocking mechanism is implemented to selectively block gradients backpropagated from the classification head to the feature extraction network in order to reduce the interference of the classification task on the feature learning of the segmentation task. At the same time, a dynamic weighted composite loss function is constructed to calculate the segmentation loss and depth estimation loss for pixel-level labeled samples and the classification loss for image-level labeled samples. The weight ratio of each type of loss is dynamically adjusted according to the training stage during the training process.

[0011] Input the road image to be detected into the trained model, and simultaneously output pixel-level disease segmentation map, disease classification result and disease depth distribution map;

[0012] Based on the output disease segmentation map, connected component analysis and multi-dimensional parameter extraction are performed to obtain information on the number, size, area and depth of diseases, and disease risk level is assessed in combination with preset assessment thresholds.

[0013] The trained disease detection model is deployed to an online intelligent monitoring platform to achieve automated detection, quantitative assessment, and risk warning of road diseases.
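The connected component analysis and parameter extraction step described in [0012] can be illustrated with a minimal sketch. It assumes a binary segmentation mask and a per-pixel depth map given as nested lists, 4-connectivity, and a ground sampling distance `pixel_size_m`; the function names, the depth thresholds, and the use of maximum depth as the grading criterion are hypothetical and are not specified in the application.

```python
from collections import deque


def analyze_defects(mask, depth, pixel_size_m=0.01):
    """Label 4-connected defect regions and summarize each one.

    mask         : list of rows with 0/1 entries (1 = defect pixel)
    depth        : list of rows with per-pixel depth estimates (same shape)
    pixel_size_m : assumed ground sampling distance, metres per pixel
    """
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for r0 in range(h):
        for c0 in range(w):
            if mask[r0][c0] != 1 or seen[r0][c0]:
                continue
            # Breadth-first flood fill collects one connected region.
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            pixels = []
            while queue:
                r, c = queue.popleft()
                pixels.append((r, c))
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if (0 <= nr < h and 0 <= nc < w
                            and mask[nr][nc] == 1 and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            rows = [p[0] for p in pixels]
            cols = [p[1] for p in pixels]
            depths = [depth[r][c] for r, c in pixels]
            regions.append({
                "area_m2": len(pixels) * pixel_size_m ** 2,
                "height_m": (max(rows) - min(rows) + 1) * pixel_size_m,
                "width_m": (max(cols) - min(cols) + 1) * pixel_size_m,
                "max_depth": max(depths),
                "mean_depth": sum(depths) / len(depths),
            })
    return regions


def risk_level(region, depth_thresholds=(0.02, 0.05)):
    """Grade a region by maximum depth; the thresholds are hypothetical."""
    low, high = depth_thresholds
    if region["max_depth"] >= high:
        return "severe"
    if region["max_depth"] >= low:
        return "moderate"
    return "mild"
```

The number of defects is simply `len(regions)`. A production pipeline would more likely rely on an optimized labeling routine from an image processing library; the flood fill here only makes the logic explicit.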

[0014] In a second aspect, embodiments of this application provide a collaborative detection system for multiple types of road defects based on UAV aerial images, applied to the collaborative detection method for multiple types of road defects based on UAV aerial images as described in the first aspect, the system comprising:

[0015] The data acquisition and annotation module is used to collect high-definition images of the road surface using drones and construct a hybrid training set containing pixel-level disease annotation samples and image-level disease classification labels;

[0016] The model building module is used to construct an end-to-end multi-task disease detection model. The model includes a feature extraction network, a dual-path feature aggregation module, a progressive alignment module, and an adaptive multi-task detection head. The dual-path feature aggregation module achieves multi-scale feature fusion through a top-down and bottom-up bidirectional path. The progressive alignment module achieves accurate alignment of feature maps at different resolutions through spatial offset learning. The adaptive multi-task detection head includes a parallel classification head, a segmentation head, and a depth estimation head, and dynamically adjusts the weights of each task head through an attention mechanism.

[0017] The model training module is used to train the multi-task disease detection model using a hybrid supervised training strategy. During the backpropagation process, a dynamic gradient blocking mechanism is implemented to selectively block the gradients backpropagated from the classification head to the feature extraction network. At the same time, a dynamic weighted composite loss function is constructed to dynamically adjust the weight ratio of segmentation loss, classification loss and depth estimation loss according to the training stage.

[0018] The detection and inference module is used to input the road image to be detected into the trained model and simultaneously output pixel-level disease segmentation map, disease classification result and disease depth distribution map;

[0019] The parameter extraction and evaluation module is used to perform connected component analysis and multi-dimensional parameter extraction on the output disease segmentation map to obtain information on the number, size, area and depth of diseases, and to evaluate the disease risk level in combination with preset evaluation thresholds.

[0020] The system deployment and service module is used to deploy the trained disease detection model to the online intelligent monitoring platform to realize automated detection, quantitative assessment and risk warning of road diseases.

[0021] In a third aspect, embodiments of this application provide an electronic device, including:

[0022] processor;

[0023] Memory used to store processor-executable instructions;

[0024] The processor is configured to implement the multi-type road defect collaborative detection method based on UAV aerial images as described in the first aspect when executing the instructions.

[0025] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing a program that instructs a device to perform the multi-type road defect collaborative detection method based on UAV aerial images as described in the first aspect.

[0026] The core idea of this invention lies in constructing an end-to-end multi-task collaborative detection framework. Through four key stages (hybrid training set construction, multi-scale feature fusion, dynamic task optimization, and three-dimensional quantitative evaluation), it achieves integrated identification, segmentation, and depth estimation of various types of road defects. The specific technical approach is as follows. Data level: a hybrid training set is constructed that fuses pixel-level annotations and image-level labels, and multispectral imagery and multi-expert cross-annotation are used to improve data quality and model generalization ability. Model level: a dual-path feature aggregation module and a progressive alignment module are designed to achieve semantic fusion and spatial alignment of multi-scale features, improving the consistency of feature representation. Training level: hybrid supervised training and a dynamic gradient blocking mechanism are employed to alleviate gradient conflicts in multi-task learning and improve model convergence stability. Output level: defect classification, segmentation, and depth information are output simultaneously, and a multi-factor risk assessment system based on the analytic hierarchy process (AHP) is constructed to achieve quantitative grading and decision support for defects.
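To make the AHP-based risk assessment mentioned above more concrete, the following sketch derives factor weights from a pairwise comparison matrix using the geometric-mean approximation and reports Saaty's consistency ratio. The choice of factors (area, depth, count), the comparison values, and the function names are hypothetical; the application does not disclose its comparison matrix or which AHP variant it uses.

```python
import math

# Saaty's random consistency index for small matrix orders.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12}


def ahp_weights(pairwise):
    """Return (weights, consistency_ratio) for a reciprocal comparison matrix."""
    n = len(pairwise)
    # Geometric mean of each row, then normalize so the weights sum to 1.
    geo = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo)
    weights = [g / total for g in geo]
    # Estimate the principal eigenvalue from (A w)_i / w_i.
    aw = [sum(pairwise[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / weights[i] for i in range(n)) / n
    ci = (lambda_max - n) / (n - 1) if n > 1 else 0.0
    ri = RANDOM_INDEX[n]
    cr = ci / ri if ri > 0 else 0.0
    return weights, cr


def risk_score(normalized_factors, weights):
    """Weighted sum of factor values that were already scaled to [0, 1]."""
    return sum(f * w for f, w in zip(normalized_factors, weights))


# Hypothetical judgments over (area, depth, count): area is twice as
# important as depth and four times as important as count.
PAIRWISE = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]
WEIGHTS, CR = ahp_weights(PAIRWISE)
```

A consistency ratio below about 0.1 is conventionally taken to mean the pairwise judgments are acceptably consistent; the matrix above is perfectly consistent by construction.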

[0027] The beneficial effects of this invention include the following. Disease classification, pixel-level segmentation, and depth estimation are achieved synchronously through a multi-task collaborative architecture, supporting integrated detection and quantitative analysis of multiple types of road diseases such as cracks, potholes, and subsidence. The hybrid supervised training strategy and dynamic gradient blocking mechanism reduce dependence on large-scale pixel-level labeled data and improve model convergence speed and stability. The combination of the dual-path feature aggregation module and the progressive alignment module significantly improves the fusion quality of multi-scale, cross-resolution features, enhancing the detection accuracy for small targets and edge diseases. A risk assessment system based on multi-dimensional parameter extraction and the analytic hierarchy process provides quantitative and hierarchical decision support for road maintenance, improving the targeting and timeliness of maintenance work. An online intelligent monitoring platform automates and visualizes the detection process, reducing manual intervention costs and improving the intelligence level of road maintenance management. This invention can be widely applied to the surface condition monitoring and maintenance management of transportation infrastructure such as highways, urban roads, and airport runways.

Attached Figure Description

[0028] Figure 1 is a schematic diagram of a collaborative detection method for multiple types of road defects based on UAV aerial images, provided in one embodiment of this application.

[0029] Figure 2 is an architecture diagram of a collaborative detection system for multiple types of road defects based on UAV aerial images, provided in this application.

[0030] Figure 3 is a schematic diagram of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0031] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of them.

[0032] It should be noted that in the embodiments of this application, "at least one" refers to one or more, and "more than one" refers to two or more. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the specification of this application is for the purpose of describing particular embodiments only and is not intended to be limiting of this application.

[0033] Based on the embodiments described in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0034] Example 1

[0035] Figure 1 is a schematic flowchart illustrating a collaborative detection method for multiple types of road defects based on UAV aerial images, provided as an embodiment of this application. As shown in Figure 1, the method includes:

[0036] S1. High-resolution images of road surfaces are collected using drones, and a hybrid training set containing pixel-level disease annotation samples and image-level disease classification labels is constructed. This provides training data for subsequent multi-task models that balances localization accuracy and classification diversity, reducing reliance on single-type annotations.

[0037] Specifically, in this embodiment, the process of constructing the hybrid training set further includes: acquiring road images containing visible and near-infrared bands using a multispectral imaging system mounted on a UAV; performing geometric correction, radiometric correction, and image registration preprocessing on the images; expanding the samples using data augmentation methods such as random cropping, mirror flipping, multi-angle rotation, brightness and contrast adjustment, noise injection, and meteorological simulation; labeling the diseased areas with polygons or pixel-level masks to form pixel-level labeled samples, and labeling the entire image with disease level and type to form image-level classification labels, and using a multi-expert cross-labeling and consistency verification mechanism to ensure labeling quality.

[0038] Specifically, to construct a high-quality hybrid training set, this example uses a drone equipped with a multispectral imaging system to collect data in a typical urban main road area, acquiring a total of 3000 high-resolution images including visible light (RGB) and near-infrared (NIR) bands (example data; the actual number can be adjusted according to task requirements). Geometric correction is performed on the acquired image sequence to eliminate distortion caused by drone attitude changes, radiometric correction is used to reduce the impact of uneven illumination, and multi-band image registration is completed through feature point matching. In the data augmentation stage, each original image is randomly cropped (scale range 0.7-1.0), horizontally and vertically mirrored, rotated (-15° to +15°), brightness adjusted (±20%), and contrast adjusted (±15%). Gaussian noise (σ=0.01) and simulated rain and fog degradation are added to enhance the model's robustness. During the annotation process, three annotation experts with road engineering backgrounds respectively performed polygon annotation and pixel-level masking for areas with defects such as cracks, potholes, and subsidence. At the same time, they labeled the entire image with the defect level (mild, moderate, and severe) and type combination labels. The annotation results were cross-compared by a consistency verification system. Only when the annotation consistency of at least two experts exceeded 85% (an exemplary threshold that can be adjusted according to annotation quality control requirements) and the type judgment was consistent were they adopted as the final annotation samples. In the end, a high-quality mixed training set containing 2,800 pixel-level annotation samples and 3,000 image-level classification labels was formed.
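The multi-expert consistency check described above might look like the following sketch. The application does not state how annotation consistency is measured; this sketch assumes intersection-over-union (IoU) between two experts' masks, with each mask represented as a set of (row, column) pixels. The function names are hypothetical, and the 0.85 default mirrors the exemplary 85% threshold.

```python
from itertools import combinations


def mask_iou(a, b):
    """IoU between two binary masks given as sets of (row, col) pixels."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def accept_annotation(masks, types, threshold=0.85):
    """Accept a sample if some pair of experts agrees on the defect type
    and their masks overlap with IoU above the threshold."""
    for i, j in combinations(range(len(masks)), 2):
        if types[i] == types[j] and mask_iou(masks[i], masks[j]) > threshold:
            return True
    return False
```

Which of the agreeing masks becomes the final label (for example, their intersection or one expert's mask) is left open here, as the source does not specify it.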

[0039] S2. Construct an end-to-end multi-task disease detection model, which includes a feature extraction network, a dual-path feature aggregation module, a progressive alignment module, and an adaptive multi-task detection head. The dual-path feature aggregation module achieves multi-scale feature fusion through a top-down and bottom-up bidirectional path. This module is mainly used for the fusion and enhancement of multi-scale semantic features to improve the ability to identify diseases of different sizes.

[0040] The multi-task disease detection model of this invention adopts an end-to-end integrated architecture, capable of simultaneously completing disease classification, pixel-level segmentation, and depth estimation in a single forward propagation. Compared with traditional two-stage detection models (such as Mask R-CNN), this model significantly improves inference speed while maintaining high accuracy, making it suitable for real-time UAV inspection scenarios. Specifically, the model uses ResNet-50 as the backbone network to extract multi-scale feature maps C2-C5. The dual-path feature aggregation module transmits semantic information downward from high-level semantic features (C5) and positional information upward from low-level detail features (C2), achieving semantic enhancement and spatial adaptation of cross-scale features through deformable convolution and channel recalibration mechanisms. The progressive alignment module achieves sub-pixel-level alignment of feature maps at different resolutions through a differentiable spatial transformation network and bidirectional optical flow estimation, significantly improving the geometric consistency of feature fusion.
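The bidirectional information flow of the dual-path aggregation can be sketched on toy one-dimensional feature levels. This is only a structural illustration under strong simplifications: plain element-wise addition stands in for the deformable convolution and channel recalibration described in the application, nearest-neighbour repetition is assumed for upsampling, and pairwise averaging is assumed for downsampling. All function names are hypothetical.

```python
def upsample2(x):
    """Nearest-neighbour upsampling by a factor of 2."""
    return [v for v in x for _ in range(2)]


def downsample2(x):
    """Downsample by averaging adjacent pairs (stand-in for a strided conv)."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]


def bidirectional_fuse(levels):
    """Fuse a pyramid ordered from finest (C2) to coarsest (C5).

    Each level must be half the length of the previous one.
    """
    n = len(levels)
    # Top-down path: push semantic information from coarse to fine levels.
    top_down = [None] * n
    top_down[-1] = list(levels[-1])
    for i in range(n - 2, -1, -1):
        up = upsample2(top_down[i + 1])
        top_down[i] = [a + b for a, b in zip(levels[i], up)]
    # Bottom-up path: push positional detail from fine to coarse levels.
    fused = [None] * n
    fused[0] = top_down[0]
    for i in range(1, n):
        down = downsample2(fused[i - 1])
        fused[i] = [a + b for a, b in zip(top_down[i], down)]
    return fused
```

After both passes, every output level has received contributions from every input level, which is the property the bidirectional design is meant to provide.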

[0041] The adaptive multi-task detection head introduces a three-layer attention mechanism, including: channel attention: enhancing feature channels sensitive to diseases (such as edges and textures); spatial attention: capturing the long-distance spatial dependence of disease areas; and task-level attention: dynamically adjusting the weights of each task according to the current input to achieve collaborative optimization of tasks.

[0042] The progressive alignment module achieves precise alignment of feature maps at different resolutions through spatial offset learning. This module is mainly used for spatial geometric alignment of feature maps across resolutions, ensuring accurate positional correspondence of features at different scales. The adaptive multi-task detection head includes parallel classification, segmentation, and depth estimation heads, and dynamically adjusts the weights of each task head through an attention mechanism. A deep learning model with a dual-path feature aggregation module, a progressive alignment module, and an adaptive multi-task detection head is designed. The dual-path aggregation module enhances the model's ability to capture and fuse disease features at different scales; the progressive alignment module achieves spatial consistency alignment of feature maps across resolutions, improving detail perception; the adaptive multi-task detection head dynamically balances the three sub-tasks of classification, segmentation, and depth estimation through an attention mechanism to avoid task conflicts.

[0043] Specifically, in this embodiment, the dual-path feature aggregation module introduces deformable convolution in the top-down path to adapt to the non-uniform deformation of the road surface, integrates a channel recalibration module in the bottom-up path to filter effective features across scales, and enhances the semantic consistency of multi-scale features through cascaded dilated convolution and gated attention mechanism. In the specific implementation of the dual-path feature aggregation module, the top-down path uses deformable convolutional layers to process the C3-C5 feature maps from the backbone network ResNet-50. The kernel offset is dynamically generated through a 3×3 offset learning sub-network to adapt to non-uniform structural features such as road crack expansion and pothole edge deformation. The bottom-up path designs a channel recalibration module. This module first performs global average pooling on the channels from the low-level feature map (C2), and then generates channel attention weights through two fully connected layers (compression ratio of 16) to dynamically suppress noisy channels and enhance effective feature representation across scales. Finally, the outputs of the two paths further expand the receptive field at each feature scale through cascaded dilated convolutions (dilation rates of 1, 2, and 4, respectively), and combine gated attention units (composed of Sigmoid gating and element-wise multiplication) to enhance the semantic consistency of the fused features, enabling the model to simultaneously capture the local details of fine cracks and the global contextual information of large-area subsidence.
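The channel recalibration step (global average pooling followed by two fully connected layers with a compression ratio of 16) resembles a squeeze-and-excitation block. The sketch below assumes a ReLU between the two layers and a sigmoid on the output, which is the common arrangement but is not stated in the application; biases are omitted and the weights are passed in as plain nested lists. The function name is hypothetical.

```python
import math


def channel_recalibrate(feature, w1, w2):
    """Rescale channels with attention weights derived from pooled statistics.

    feature : [C][H][W] nested lists
    w1      : [C // r][C] weights of the compressing layer (r = 16 in the text)
    w2      : [C][C // r] weights of the expanding layer
    Returns (rescaled_feature, gates).
    """
    # Squeeze: global average pooling per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature]
    # Excitation: FC -> ReLU -> FC -> sigmoid (activations are assumptions).
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[[v * g for v in row] for row in ch]
              for ch, g in zip(feature, gates)]
    return scaled, gates
```

With 32 input channels and a ratio of 16, the hidden layer has only 2 units, which is what keeps the extra parameter count small.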

[0044] Specifically, in this embodiment, the progressive alignment module is based on a differentiable spatial transformation network. It learns the non-rigid deformation field between feature maps through a multilayer perceptron, establishes a dense correspondence between multi-resolution features using bidirectional optical flow estimation, and combines these with a cyclic alignment mechanism to achieve sub-pixel-level feature space alignment. The module takes the multi-scale feature maps {F2, F3, F4, F5, F6} output by the feature pyramid as input. First, a differentiable spatial transformation network is constructed, which learns the non-rigid deformation field between feature maps of adjacent scales through a multilayer perceptron containing two hidden layers (dimensions 128 and 64, respectively). Specifically, for feature maps Fi and F(i+1), their correlation feature map is calculated and input into the perceptron to predict a two-dimensional dense displacement field ∆i, which describes the non-linear spatial offset from each position in Fi to the corresponding position in F(i+1). Simultaneously, a bidirectional optical flow estimation branch based on the RAFT structure is introduced. This branch establishes a dense correspondence between high-resolution and low-resolution feature maps by iteratively updating the flow field, and calculates a consistency loss between the forward and backward optical flow to improve matching accuracy. On this basis, a two-stage cyclic alignment mechanism is designed: the first stage uses the learned deformation field to perform a preliminary spatial transformation on Fi, and the second stage performs sub-pixel fine-tuning alignment on the transformed features based on the bidirectional optical flow results. Finally, the feature maps of all scales are aligned in spatial position (alignment error controlled within 0.5 pixels), significantly improving the geometric consistency and semantic coherence of subsequent multi-scale feature fusion.
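The forward-backward consistency idea can be sketched as follows: if the forward flow moves a point p to p + f_fwd(p), the backward flow sampled there should bring it back, so f_fwd(p) + f_bwd(p + f_fwd(p)) should be close to zero. The sketch assumes both flow fields live on the same grid and are stored as nested lists of (dy, dx) tuples, uses bilinear sampling with border clamping, and averages the L2 norm of the residual. The exact loss form and all function names are assumptions, since the application does not define them.

```python
import math


def bilinear(field, y, x):
    """Sample an [H][W] grid of (dy, dx) tuples at fractional (y, x),
    clamping coordinates to the grid border."""
    h, w = len(field), len(field[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    ty, tx = y - y0, x - x0
    out = []
    for k in range(2):
        top = field[y0][x0][k] * (1 - tx) + field[y0][x1][k] * tx
        bot = field[y1][x0][k] * (1 - tx) + field[y1][x1][k] * tx
        out.append(top * (1 - ty) + bot * ty)
    return tuple(out)


def fb_consistency_loss(flow_fwd, flow_bwd):
    """Mean L2 norm of f_fwd(p) + f_bwd(p + f_fwd(p)) over all positions."""
    h, w = len(flow_fwd), len(flow_fwd[0])
    total = 0.0
    for y in range(h):
        for x in range(w):
            dy, dx = flow_fwd[y][x]
            by, bx = bilinear(flow_bwd, y + dy, x + dx)
            total += math.hypot(dy + by, dx + bx)
    return total / (h * w)
```

A residual well below 0.5 pixels at every position would be in line with the alignment error bound quoted in the text, although the application does not say that this is how the bound is measured.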

[0045] Specifically, in this embodiment, the adaptive multi-task detection head adopts a three-layer attention mechanism: channel attention, which allocates channel-level importance through a learnable weight matrix; spatial attention, which captures global context dependencies based on a self-attention mechanism; and task-level attention, which dynamically adjusts the weights of classification, segmentation, and depth estimation through a task relevance matrix. In the implementation, the input features are first processed by the channel attention layer, which uses a learnable weight matrix to linearly transform the feature channels and generates a channel weight vector through a sigmoid activation function, dynamically enhancing feature channels sensitive to disease (such as edge and texture channels) and suppressing redundant information. The spatial attention layer, implemented with a self-attention mechanism, then constructs a global context dependency graph by calculating the similarity matrix between all positions in the feature map, enabling the model to capture long-distance spatial associations of disease areas (such as continuous cracks or distributed potholes). Finally, the task-level attention layer introduces a trainable task relevance matrix (3×3, corresponding to the three tasks of classification, segmentation, and depth estimation), which dynamically generates a task weight vector based on the current input features. For example, it increases the weight of the segmentation task when detecting fine cracks and strengthens the contribution of the depth estimation task when evaluating subsidence. The weights of the three tasks are adaptively allocated and jointly optimized through Softmax normalization, and the head finally outputs a weighted fusion result of classification confidence, segmentation probability map, and depth estimation value.
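The task-level attention can be sketched as a 3×3 relevance matrix applied to an input-dependent score vector, followed by Softmax. How the per-task scores are derived from the input features is not described in the application, so here they are simply passed in; the function names and the example values are hypothetical.

```python
import math

TASKS = ("classification", "segmentation", "depth")


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def task_weights(task_scores, relevance):
    """Mix input-dependent task scores through the trainable 3x3 relevance
    matrix and normalize them into weights that sum to 1."""
    logits = [
        sum(relevance[i][j] * task_scores[j] for j in range(3))
        for i in range(3)
    ]
    return dict(zip(TASKS, softmax(logits)))


# Hypothetical case: an image of fine cracks yields a high segmentation score.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
crack_weights = task_weights([0.0, 2.0, 0.0], IDENTITY)
```

Off-diagonal entries of the relevance matrix would let evidence for one task raise or lower the weight of another, which is one plausible reading of "task relevance".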

[0046] This module achieves dynamic multi-task collaboration through a three-layer attention mechanism: Channel attention: enhances the expression of disease-sensitive feature channels (such as edges and textures); Spatial attention: captures the long-distance spatial dependence of disease areas, improving the ability to identify continuous cracks and distributed pits; Task-level attention: dynamically adjusts the weights of classification, segmentation, and depth estimation tasks based on input features, achieving adaptive task optimization. When detecting minute cracks, the weight of the segmentation task is increased; when assessing subsidence diseases, the contribution of the depth estimation task is strengthened.

[0047] S3. The multi-task disease detection model is trained using a hybrid supervised training strategy. During backpropagation, a dynamic gradient blocking mechanism is implemented to selectively block gradients propagated from the classification head to the feature extraction network, reducing interference from the classification task to the feature learning of the segmentation task. Simultaneously, a dynamically weighted composite loss function is constructed to calculate segmentation and depth estimation losses for pixel-level labeled samples and classification losses for image-level labeled samples. The weight ratios of each loss type are dynamically adjusted according to the training stage. A hybrid supervised training strategy is employed, combining dynamic gradient blocking and a dynamically weighted loss function. Dynamic gradient blocking prevents classification errors from interfering with feature learning in the segmentation task; dynamic weighted loss adjusts the weights of different tasks according to the training stage, strengthening feature extraction early and enhancing classification discrimination later.
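A minimal sketch of the hybrid-supervision loss described above: segmentation and depth losses are averaged only over samples that carry pixel-level labels, and classification loss only over samples that carry image-level labels. The per-sample loss values are assumed to be precomputed, and the dictionary keys and function names are hypothetical.

```python
def mean(values):
    """Average of an iterable, or 0.0 if it is empty."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def composite_loss(batch, w_seg, w_depth, w_cls):
    """Weighted sum of per-task losses over a mixed-supervision batch.

    Each sample is a dict with boolean flags 'has_pixel_label' and
    'has_image_label', plus the loss terms its labels make available.
    """
    seg = mean(s["seg_loss"] for s in batch if s["has_pixel_label"])
    depth = mean(s["depth_loss"] for s in batch if s["has_pixel_label"])
    cls = mean(s["cls_loss"] for s in batch if s["has_image_label"])
    return w_seg * seg + w_depth * depth + w_cls * cls
```

Samples without a given label type contribute nothing to that term rather than a zero loss, so a batch dominated by image-level samples does not dilute the segmentation signal.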

[0048] The hybrid supervised training strategy of this invention has the following advantages: Reduced annotation cost: Only a portion of the samples need pixel-level annotation, while the remaining samples use image-level labels, significantly reducing the manpower and time costs of data annotation. Dynamic gradient blocking mechanism: Through task relevance assessment and soft gradient masking, it selectively suppresses feature interference from classification tasks on segmentation tasks, alleviating gradient conflict problems in multi-task learning. Staged loss scheduling: The training process is divided into three stages: feature learning, task balancing, and discriminative optimization. The weights of each loss are dynamically adjusted, allowing the model to focus on different optimization objectives at different training stages, improving training stability and final performance.

[0049] Specifically, in this embodiment, the hybrid supervised training strategy further includes:

[0050] A task relevance evaluation module is constructed for use during backpropagation. When the difference in feature distribution between the classification and segmentation tasks exceeds a set threshold, a soft gradient masking mechanism is triggered to selectively block the influence of the classification gradient on the feature extraction network. The training process is divided into three stages: (1) the feature learning stage, comprising the first N rounds, which focuses on optimizing the backbone network parameters; (2) the task balancing stage, comprising the middle M rounds, which gradually adjusts the weights of the multi-task loss; and (3) the discriminative optimization stage, comprising the last K rounds, which strengthens the discriminative ability of classification and segmentation. The specific number of rounds can be set according to the training curve. The weight ratio of segmentation loss to classification loss is dynamically adjusted across the stages. A multi-component composite loss function is designed, comprising segmentation loss, depth estimation loss, classification loss, and feature consistency loss, where the feature consistency loss is used to align the distributions of the classification and segmentation tasks in the feature space.

[0051] Specifically, the task relevance evaluation module is an evaluation unit used to quantify the feature differences between classification and segmentation tasks. It monitors the distribution differences between the two tasks in the feature space. When the difference exceeds a preset threshold (e.g., measured by KL divergence or cosine similarity), it determines that the two tasks may have feature learning conflicts, thus triggering a gradient adjustment mechanism. The soft gradient masking mechanism is a masking technique that dynamically adjusts gradient flow during backpropagation. It selectively masks the gradients passed from the classification task to the feature extraction network, mitigating the negative impact of classification errors on segmentation feature learning while still retaining gradient components beneficial to segmentation.

[0052] In the specific implementation of the hybrid supervised training strategy, the task relevance evaluation module is constructed from the classification feature map $F_{cls}$ (dimensions H×W×256) and the segmentation feature map $F_{seg}$ (dimensions H×W×256) output by the backbone network. First, the mean vectors $\mu_{cls}$ and $\mu_{seg}$ of the two feature maps are calculated along the channel dimension. Then, the difference D in their feature distributions is measured by cosine similarity. When D exceeds a preset threshold of 0.3, the soft gradient masking mechanism is triggered: a soft mask matrix M = exp(-5·D) with the same dimensions as the gradient is generated and multiplied element-wise with the gradient of the classification task, $\nabla L_{cls}$; the result is then combined with the gradient of the segmentation task, $\nabla L_{seg}$, in a weighted sum, so that classification gradients are selectively suppressed during backpropagation. Here $L_{cls}$ is the classification loss, $L_{seg}$ is the segmentation loss, and the weighted sum uses a weight matrix.
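The triggering rule and the soft mask can be sketched as follows. This is only an illustration under stated assumptions: the text says the difference D is "measured by cosine similarity" without giving the exact expression, so the sketch assumes D = 1 - cos(mean vectors); the mask is treated as a single scalar applied to every gradient element; and the function names and the scalar blend weights `w_cls` and `w_seg` are hypothetical stand-ins for the weight matrix mentioned in the text.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def soft_mask_scale(mu_cls, mu_seg, threshold=0.3, sharpness=5.0):
    """Return (D, mask). Assumption: D is the cosine distance 1 - cos."""
    d = 1.0 - cosine_similarity(mu_cls, mu_seg)
    if d <= threshold:
        return d, 1.0  # tasks agree enough: classification gradient passes unchanged
    return d, math.exp(-sharpness * d)  # M = exp(-5 * D) from the text

def blend_gradients(g_cls, g_seg, mask, w_cls=1.0, w_seg=1.0):
    """Weighted sum of the masked classification gradient and the segmentation gradient."""
    return [w_cls * mask * gc + w_seg * gs for gc, gs in zip(g_cls, g_seg)]

# Orthogonal mean vectors give D = 1, so the classification gradient is
# scaled by exp(-5), i.e. almost fully suppressed.
d, mask = soft_mask_scale([1.0, 0.0], [0.0, 1.0])
print(d, mask, blend_gradients([1.0, 1.0], [2.0, 2.0], mask))
```

In a deep learning framework the same scaling would typically be applied inside a backward hook on the shared features, but that wiring is outside what the text specifies.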

[0053] The training process is divided into three phases. In the feature learning phase (the first 50 rounds), the segmentation loss weight $\lambda_{seg}$ is set to 3.0 and the classification loss weight $\lambda_{cls}$ is set to 0.2. In the task balancing phase (rounds 51-120), linear scheduling decreases $\lambda_{seg}$ from 3.0 to 1.0 and increases $\lambda_{cls}$ from 0.2 to 1.0. In the discriminative optimization phase (from round 121 onward), the weights are fixed at $\lambda_{seg}$ = 0.8 and $\lambda_{cls}$ = 1.5 to enhance classification discrimination. The composite loss function is configured as follows: the segmentation loss is an equal-weighted combination of Dice loss and weighted cross-entropy with weight coefficients [0.8, 0.15, 0.05]; the depth estimation loss uses a scale-invariant logarithmic error function; the classification loss uses a focal loss with focusing parameter γ=2; and the feature consistency loss is the mean cosine distance between the classification and segmentation feature representations in a multi-scale feature space. The initial weight coefficient of each loss is set in advance and adjusted dynamically during training according to the convergence speed of each sub-task.
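The three-phase schedule can be written as a small piecewise function. This is a sketch, assuming 1-based round numbers and a linear ramp that starts from the phase-one values at round 50 and reaches 1.0 exactly at round 120; the function name `staged_weights` is hypothetical.

```python
def staged_weights(epoch):
    """Return (segmentation_weight, classification_weight) for a 1-based round index."""
    if epoch <= 50:
        # Feature learning phase: emphasize segmentation.
        return 3.0, 0.2
    if epoch <= 120:
        # Task balancing phase: linear interpolation toward 1.0 / 1.0.
        frac = (epoch - 50) / (120 - 50)
        return 3.0 + frac * (1.0 - 3.0), 0.2 + frac * (1.0 - 0.2)
    # Discriminative optimization phase: fixed weights.
    return 0.8, 1.5

for e in (1, 50, 85, 120, 121):
    print(e, staged_weights(e))
```

Note that the weights step from (1.0, 1.0) to (0.8, 1.5) between rounds 120 and 121, which follows directly from the values given in the text.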

[0054] The constructed total loss function $L_{total}$ is as follows:

[0055] $L_{total} = \lambda_{seg}(t)\,L_{seg} + \lambda_{cls}(t)\,L_{cls} + \lambda_{depth}\,L_{depth} + \lambda_{cons}\,L_{cons}$,

[0056] where $\lambda_{seg}(t)$ is the time-dependent weight of the segmentation loss $L_{seg}$, which decays with the number of training epochs by cosine annealing (initial value 3.0 → final value 0.8); $\lambda_{cls}(t)$ is the time-dependent weight of the classification loss $L_{cls}$, which increases according to an inverse cosine function (initial value 0.2 → final value 1.5); $\lambda_{depth}$ is a fixed weight for the depth estimation loss $L_{depth}$ (usually set to 0.5); and $\lambda_{cons}$ is the weight for the feature consistency loss $L_{cons}$ (usually set to 0.2).

[0057] The segmentation loss $L_{seg}$ combines a weighted cross-entropy term and a Dice term, both computed between Y, the true pixel-level disease segmentation annotation (a binary or multi-class mask), and P, the segmentation probability map predicted by the model. The weighted cross-entropy term mitigates class imbalance through class weights (e.g., the weight of crack pixels is set to 2.0, and the weight of background pixels is set to 0.5). The Dice similarity coefficient term enhances the sensitivity of the segmentation to small targets.
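A minimal sketch of such a segmentation loss for the binary case (crack versus background) is shown below. It assumes the two terms are simply added with equal weight and uses the example class weights 2.0 and 0.5 from the text; the exact formula of the original is not reproduced here, and the function names are hypothetical. Masks and probability maps are flattened to plain lists for brevity.

```python
import math

def weighted_bce(y_true, p_pred, w_pos=2.0, w_neg=0.5, eps=1e-7):
    """Class-weighted binary cross-entropy averaged over pixels."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def dice_loss(y_true, p_pred, eps=1e-7):
    """1 - Dice coefficient; sensitive to overlap on small foreground regions."""
    inter = sum(y * p for y, p in zip(y_true, p_pred))
    denom = sum(y_true) + sum(p_pred)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def segmentation_loss(y_true, p_pred):
    """Assumed equal-weight sum of the two terms."""
    return weighted_bce(y_true, p_pred) + dice_loss(y_true, p_pred)

mask = [1, 0, 1, 0]
print(segmentation_loss(mask, [0.9, 0.1, 0.8, 0.2]))  # good prediction, small loss
print(segmentation_loss(mask, [0.2, 0.8, 0.3, 0.9]))  # poor prediction, larger loss
```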

[0058] The depth estimation loss $L_{depth}$ is computed from $d_i$, the true depth value of the i-th sampling point in the diseased area (from laser scanning or stereo vision calibration), and $\hat{d}_i$, the depth value predicted by the model at the corresponding point, over N, the number of pixels with valid true depth annotations in the current training batch. The loss takes the form of a scale-invariant logarithmic error, which is insensitive to changes in absolute scale and is therefore suitable for estimating the relative depth of the road surface.
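One widely used scale-invariant logarithmic error can be sketched as follows. The text does not give the exact expression, so the balancing factor `lam` and the function name are assumptions; with `lam=1.0` the loss is fully invariant to a global rescaling of the predicted depths, while smaller values retain some sensitivity to absolute scale.

```python
import math

def scale_invariant_log_loss(d_true, d_pred, lam=0.5):
    """Mean squared log-difference minus lam times the squared mean log-difference."""
    diffs = [math.log(p) - math.log(t) for t, p in zip(d_true, d_pred)]
    n = len(diffs)
    return sum(x * x for x in diffs) / n - lam * (sum(diffs) ** 2) / (n * n)

true_depth = [1.0, 2.0, 4.0]
doubled = [2.0, 4.0, 8.0]  # same relative shape, twice the scale
print(scale_invariant_log_loss(true_depth, doubled, lam=1.0))  # approximately 0
print(scale_invariant_log_loss(true_depth, doubled, lam=0.5))  # positive
```

Only points with valid ground-truth depth should be passed in, which corresponds to the count N of annotated pixels mentioned above.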

[0059] The classification loss $L_{cls}$ is given by: $L_{cls} = -\sum_{c=1}^{C} \alpha_c\, y_c\, (1 - p_c)^{\gamma} \log p_c$, where C is the total number of disease types (such as cracks, potholes, subsidence, etc.); $y_c$ is the true category label (one-hot vector); $p_c$ is the probability predicted by the model for category c; $\alpha_c$ are class weights, used to balance the differences in sample size between classes; and γ is a focusing parameter (usually set to 2) that reduces the loss contribution of easily classified samples.
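A focal loss for a single image-level label can be sketched as below, using the standard form with per-class weights and focusing parameter γ = 2. The function name and the example probabilities are hypothetical.

```python
import math

def focal_loss(y_onehot, probs, alpha, gamma=2.0, eps=1e-7):
    """Focal loss for one sample: down-weights classes the model already predicts well."""
    loss = 0.0
    for y, p, a in zip(y_onehot, probs, alpha):
        p = min(max(p, eps), 1.0)  # clip to avoid log(0)
        loss += -a * y * (1.0 - p) ** gamma * math.log(p)
    return loss

label = [0, 1, 0]            # true class is index 1 (e.g. "pothole")
alpha = [1.0, 1.0, 1.0]      # uniform class weights for the illustration
easy = focal_loss(label, [0.1, 0.8, 0.1], alpha)
hard = focal_loss(label, [0.5, 0.3, 0.2], alpha)
print(easy, hard)  # the confidently correct sample contributes far less
```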

[0060] The feature consistency loss $L_{cons}$ is given by: $L_{cons} = \frac{1}{L} \sum_{l=1}^{L} \left(1 - \cos(f_{cls}^{l}, f_{seg}^{l})\right)$, where $f_{cls}^{l}$ is the vectorized representation of the classification task on the l-th layer feature map; $f_{seg}^{l}$ is the feature representation of the segmentation task at the same layer; L is the number of feature layers used for alignment (e.g., the last three layers of the backbone network); and the cosine similarity cos(·,·) measures the degree of directional alignment between the two feature distributions.
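The mean cosine distance across layers can be sketched as follows, assuming each layer's feature map has already been flattened into a vector. The function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def feature_consistency_loss(cls_feats, seg_feats):
    """Average of (1 - cosine similarity) over the aligned layers."""
    dists = [1.0 - cosine_similarity(fc, fs) for fc, fs in zip(cls_feats, seg_feats)]
    return sum(dists) / len(dists)

# Three hypothetical layers; identical directions give a loss near 0,
# opposite directions give the maximum value of 2.
layers = [[1.0, 2.0], [0.5, 0.5], [3.0, 1.0]]
flipped = [[-x for x in v] for v in layers]
print(feature_consistency_loss(layers, layers))
print(feature_consistency_loss(layers, flipped))
```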

[0061] Furthermore, the dynamically weighted composite loss function achieves a smooth transition of loss weights through a stage-adaptive weight adjustment mechanism. The segmentation loss weight decays according to a cosine annealing function as training progresses, while the classification loss weight increases according to an inverse cosine function. At the same time, the weight allocation is fine-tuned based on the convergence state of each task in the current batch, and the gradients of each task are normalized and projected to maintain multi-task training stability. In the implementation of this mechanism, the training period t is first defined (total period T=200). The segmentation loss weight function is set to $\lambda_{seg}(t)$ = 3.0×[0.5+0.5×cos(πt / T)], so that it smoothly decays from the initial value of 3.0 to 1.5 at the end of the cycle according to the cosine annealing law; the classification loss weight function is set to $\lambda_{cls}(t)$ = 0.2×[1.5-0.5×cos(πt / T)], so that it grows from the initial value of 0.2 to 1.0 according to the inverse cosine function. After each training batch, the system calculates in real time the rate of change of each task's loss relative to a moving-average baseline: if the rate of decrease of the segmentation loss falls below the threshold of 0.01, $\lambda_{seg}$ is temporarily increased by 10% for 5 batches; if the prediction confidence variance of the classification task exceeds 0.1, $\lambda_{cls}$ is reduced by 8% until the variance returns to normal. During the gradient optimization phase, the raw gradients of the three tasks (segmentation, classification, and depth estimation), $g_{seg}$, $g_{cls}$, and $g_{depth}$, are each normalized in amplitude (divided by their respective L2 norms), and Gram-Schmidt orthogonalization is then performed, projecting the gradients of two tasks with respect to the direction of the third to eliminate conflicting components. The projected gradients are then weighted and fused before being passed to the optimizer. Experiments show that this mechanism reduces the oscillation amplitude of the loss curve during training by about 35%, and improves the model's segmentation mIoU on the test set by 2.1% and its classification accuracy by 1.7%.
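The two weight expressions and the gradient normalization and projection step can be sketched together. The weight functions below evaluate the expressions exactly as printed with T = 200; evaluated literally, the segmentation expression gives 3.0 at t = 0, 1.5 at t = T/2 and 0 at t = T, and the classification expression gives 0.2 at t = 0 and 0.4 at t = T, so reproducing the end values quoted in the text would require different constants. The projection is a Gram-Schmidt-style step that removes the component of one gradient along a reference gradient; applying it only when the two gradients conflict (negative dot product) is an assumption, as are all function names.

```python
import math

T = 200  # total training period from the text

def seg_weight(t, total=T):
    """3.0 * [0.5 + 0.5 * cos(pi * t / T)], as printed."""
    return 3.0 * (0.5 + 0.5 * math.cos(math.pi * t / total))

def cls_weight(t, total=T):
    """0.2 * [1.5 - 0.5 * cos(pi * t / T)], as printed."""
    return 0.2 * (1.5 - 0.5 * math.cos(math.pi * t / total))

def normalize(g):
    """Divide a gradient vector by its L2 norm."""
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g] if norm > 0 else list(g)

def remove_conflict(g, ref):
    """Subtract the component of g along ref when g points against ref."""
    dot = sum(a * b for a, b in zip(g, ref))
    if dot >= 0:
        return list(g)  # no conflict: leave the gradient unchanged
    ref_sq = sum(b * b for b in ref)
    return [a - (dot / ref_sq) * b for a, b in zip(g, ref)]

for t in (0, 100, 200):
    print(t, round(seg_weight(t), 4), round(cls_weight(t), 4))

g_seg = normalize([0.0, 2.0])
g_cls = normalize([1.0, -1.0])
print(remove_conflict(g_cls, g_seg))  # component opposing g_seg is removed
```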

[0062] S4. Input the road image to be detected into the trained model, and simultaneously output pixel-level disease segmentation map, disease classification results, and disease depth distribution map. Through disease detection and multimodal output, the integrated output of disease location, category recognition, and three-dimensional morphology information is achieved, providing a complete data foundation for subsequent quantitative assessment.

[0063] Specifically, in this embodiment, the synchronous output process adopts a multi-branch parallel inference architecture. During a single forward propagation, a classification confidence vector, a pixel-level segmentation probability map, and a disease depth matrix are generated synchronously. The output results undergo cross-task consistency verification and spatial alignment post-processing, ultimately encapsulating the multiple outputs into a structured data object containing disease type, location, morphological parameters, depth information, and a timestamp. In the specific implementation of the synchronous output process, after feature extraction from the input image via the backbone network, it is processed synchronously through three parallel convolutional branches: the classification branch outputs a confidence vector of dimension C (where C is the number of disease categories), the segmentation branch outputs a pixel-level probability map with the same resolution as the original image, and the depth estimation branch outputs a pixel-by-pixel depth matrix. Then, cross-task post-processing is performed: First, based on the disease type with the highest confidence in the classification results, connected regions with a probability of less than 0.3 in the segmentation probability map are filtered out to achieve semantic consistency verification between tasks; then, bilinear interpolation is used to spatially align the depth matrix to ensure that it corresponds to the segmentation map pixel by pixel; finally, the three types of outputs are combined with preset morphological extraction algorithms (such as minimum bounding rectangle fitting and depth histogram statistics) and encapsulated into a JSON-formatted structured object, which includes disease type labels, bounding box coordinates, area, average depth, maximum depth and collection timestamp, forming a complete disease data unit that can be directly used for evaluation and decision-making.
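The packaging of the three outputs into one JSON object can be sketched as follows. This is a simplified illustration: it thresholds individual pixels at 0.3 instead of filtering whole connected regions, uses an axis-aligned bounding box instead of minimum bounding rectangle fitting, and assumes the depth matrix is already aligned with the segmentation map. The function name and all field names are hypothetical.

```python
import json

def package_detection(class_probs, class_names, seg_prob, depth, timestamp,
                      prob_threshold=0.3):
    """Combine classification, segmentation and depth outputs into a JSON string."""
    best = max(range(len(class_probs)), key=class_probs.__getitem__)
    # Keep pixels whose segmentation probability reaches the threshold.
    pixels = [(r, c) for r, row in enumerate(seg_prob)
              for c, p in enumerate(row) if p >= prob_threshold]
    record = {"type": class_names[best], "confidence": class_probs[best],
              "timestamp": timestamp, "bbox": None, "area_px": 0,
              "mean_depth": None, "max_depth": None}
    if pixels:
        rows = [r for r, _ in pixels]
        cols = [c for _, c in pixels]
        depths = [depth[r][c] for r, c in pixels]
        record.update({
            "bbox": [min(rows), min(cols), max(rows), max(cols)],
            "area_px": len(pixels),
            "mean_depth": sum(depths) / len(depths),
            "max_depth": max(depths),
        })
    return json.dumps(record)

print(package_detection(
    [0.1, 0.7, 0.2], ["crack", "pothole", "subsidence"],
    [[0.1, 0.9], [0.2, 0.8]],      # segmentation probabilities
    [[0.0, 2.0], [0.0, 4.0]],      # per-pixel depth values
    "2026-03-16T00:00:00Z"))
```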

[0064] S5. Based on the output disease segmentation map, perform connected component analysis and multi-dimensional parameter extraction to obtain information on the quantity, size, area, and depth of diseases, and combine this with preset assessment thresholds to evaluate the disease risk level. Through parameter extraction and risk assessment, the detection results are transformed into quantifiable engineering parameters and risk assessment indicators to support maintenance decisions.

[0065] Specifically, in this embodiment, the disease parameter extraction and risk assessment include: extracting connected components and semantically grouping the disease segmentation map using morphological operations and multi-scale clustering; extracting four types of parameters—geometric, topological, depth, and evolutionary—from each disease region; constructing a multi-factor assessment system based on the analytic hierarchy process (AHP) that includes structural safety factors, traffic safety factors, and maintenance urgency factors, and classifying the disease risk level into five levels according to the quantitative results; and generating a visualized disease map and a decision report containing repair suggestions and maintenance priorities by combining the spatial distribution and temporal evolution information of the disease.

[0066] Specifically, a mathematical model for the multi-factor risk assessment system is constructed, and a comprehensive risk value R is obtained. The calculation formula is as follows:

[0067] $R = w_1 S + w_2 T + w_3 U$,

[0068] where $w_1$, $w_2$, and $w_3$ are the weights for the structural safety factor S, the traffic safety factor T, and the maintenance urgency factor U, respectively. The calculation formulas for each factor are as follows:

[0069] Structural safety factor : ,

[0070] Traffic safety factors : ,

[0071] Maintenance urgency factors : ,

[0072] where the quantities used in these factors are: the maximum depth of the disease; A, the area affected by the disease; the disease length; ρ, the disease distribution density; the area expansion rate; Age, the duration of disease presence; the corresponding thresholds; and the weights of each sub-item.

[0073] In the implementation of disease parameter extraction and risk assessment, a morphological closing operation is first performed on the segmentation results to fill small voids. Then, multi-scale clustering based on the distance transform merges adjacent disease regions with a spatial distance of less than 10 pixels into semantic units. For each disease unit, four types of parameters are extracted: geometric parameters, including the aspect ratio and area of the bounding rectangle; topological parameters, including the unit spacing and distribution density; depth parameters, namely the maximum depression depth and average depth obtained by registering the depth matrix; and evolutionary parameters, namely the area expansion rate over the past three months, calculated by comparison with historical data. Based on the analytic hierarchy process (AHP), the structural safety factor (weight 0.4, based on depth and area), the traffic safety factor (weight 0.3, based on location and size), and the maintenance urgency factor (weight 0.3, based on expansion rate) are calculated separately. The weighted comprehensive risk value is then divided into five levels (0-0.2 mild, 0.2-0.4 moderate, 0.4-0.6 intermediate, 0.6-0.8 severe, and 0.8-1.0 dangerous). Finally, the system uses a geographic information system to generate a visualized disease map with heat maps, and outputs a structured decision report that includes suggested repair techniques, material usage estimates, and construction priorities.
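The weighted risk value and the five-level grading can be sketched as follows. The sketch assumes each factor has already been normalized to the range 0 to 1, and that a value lying exactly on a boundary (for example 0.2) falls into the higher level; neither point is specified in the text, and the function names are hypothetical.

```python
def comprehensive_risk(structural, traffic, urgency, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of the three factors using the AHP weights from the text."""
    return weights[0] * structural + weights[1] * traffic + weights[2] * urgency

# Upper bound of each level and its name, in ascending order.
LEVELS = [(0.2, "mild"), (0.4, "moderate"), (0.6, "intermediate"),
          (0.8, "severe"), (1.0, "dangerous")]

def risk_level(r):
    """Map a risk value in [0, 1] to one of the five levels."""
    for upper, name in LEVELS:
        if r < upper:
            return name
    return "dangerous"  # r == 1.0 (or above) is treated as the top level

r = comprehensive_risk(structural=0.9, traffic=0.5, urgency=0.2)
print(round(r, 2), risk_level(r))
```

With the example factors, the weighted value is 0.36 + 0.15 + 0.06 = 0.57, which falls in the intermediate level.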

[0074] S6. Deploy the trained road defect detection model to the online intelligent monitoring platform to achieve automated detection, quantitative assessment, and risk warning of road defects. Through system deployment and platform integration, automated defect detection, visual report generation, and risk warning are achieved, improving the intelligence and timeliness of road maintenance.

[0075] Specifically, in this embodiment, during the model deployment phase, the trained disease detection model is optimized for inference using TensorRT and then integrated into an online intelligent monitoring platform based on a microservice architecture. The platform provides web and mobile interfaces, supporting users to upload road images collected by drones. The system automatically calls the model to complete disease detection, segmentation, and depth estimation, returning a visual report containing the disease's location, type, size, depth, and risk level within 5 seconds. Simultaneously, the platform connects to a geographic information system to automatically associate disease data with road markers. When a high-risk disease (such as a pothole deeper than 5cm) is detected, an early warning is automatically pushed via SMS and email, and a maintenance work order containing repair suggestions and budget assessments is generated, forming a closed-loop management process from detection and assessment to early warning.

[0076] Example 2

[0077] As shown in Figure 2, this application provides an architecture diagram of a collaborative detection system for multiple types of road defects based on UAV aerial images, which applies the collaborative detection method for multiple types of road defects based on UAV aerial images described in Embodiment 1. The system includes: a data acquisition and annotation module 210, a model building module 220, a model training module 230, a detection and inference module 240, a parameter extraction and evaluation module 250, and a system deployment and service module 260.

[0078] The data acquisition and annotation module 210 is used to acquire high-definition images of the road surface using drones and construct a hybrid training set containing pixel-level disease annotation samples and image-level disease classification labels.

[0079] The model building module 220 is used to build an end-to-end multi-task disease detection model. The model includes a feature extraction network, a dual-path feature aggregation module, a progressive alignment module, and an adaptive multi-task detection head. The dual-path feature aggregation module achieves multi-scale feature fusion through a top-down and bottom-up bidirectional path. The progressive alignment module achieves accurate alignment of feature maps at different resolutions through spatial offset learning. The adaptive multi-task detection head includes a parallel classification head, a segmentation head, and a depth estimation head, and dynamically adjusts the weights of each task head through an attention mechanism.

[0080] The model training module 230 is used to train the multi-task disease detection model using a hybrid supervised training strategy. During the backpropagation process, a dynamic gradient blocking mechanism is implemented to selectively block the gradients backpropagated from the classification head to the feature extraction network. At the same time, a dynamic weighted composite loss function is constructed to dynamically adjust the weight ratio of segmentation loss, classification loss and depth estimation loss according to the training stage.

[0081] The detection and inference module 240 is used to input the road image to be detected into the trained model and simultaneously output pixel-level disease segmentation map, disease classification result and disease depth distribution map.

[0082] The parameter extraction and evaluation module 250 is used to perform connected component analysis and multi-dimensional parameter extraction on the output disease segmentation map to obtain information on the number, size, area and depth of diseases, and to evaluate the disease risk level in combination with preset evaluation thresholds.

[0083] The system deployment and service module 260 is used to deploy the trained disease detection model to the online intelligent monitoring platform to realize automated detection, quantitative assessment and risk warning of road diseases.

[0084] Figure 3 shows an electronic device provided in one embodiment of this application. As shown in Figure 3, the electronic device includes at least the following components: a processor 301, a memory 300, a communication interface 303, and a bus 302.

[0085] In this embodiment of the application, memory 300 is used to store instructions executable by processor 301; when processor 301 executes the instructions, it implements the method as described in the first aspect.

[0086] In embodiments of this application, a computer-readable storage medium includes instructions that instruct a device to perform the method as described in the first aspect. For example, the instructions instruct the device to perform the process steps of the method shown in Figure 1.

[0087] In one embodiment of this application, the program running on the electronic device may be a program that controls a central processing unit (CPU) or similar device so as to achieve the functions of the above-described embodiments (that is, a program that causes a computer to perform those functions). Information handled by the program is temporarily stored in random access memory (RAM) during processing, subsequently stored in read-only memory such as flash ROM or in a hard disk drive (HDD), and read, modified, and written by the CPU as needed.

[0088] It should be noted that a portion of the electronic device described above can also be implemented using a computer. In this case, the program for implementing the control function can be recorded on a computer-readable recording medium, and the program recorded on the recording medium can be read into the computer and executed.

[0089] It should be noted that the computer mentioned here refers to a computer built into an electronic device, employing hardware including an operating system and peripheral devices. Furthermore, computer-readable recording media refers to removable media such as floppy disks, magneto-optical disks, ROMs, and CD-ROMs, as well as storage systems such as hard drives built into the computer.

[0090] Furthermore, computer-readable recording media can include: media that dynamically stores programs for short periods of time, such as communication lines used when transmitting programs via networks like the Internet or communication lines like telephone lines; and media that store programs for fixed periods of time, such as volatile memory inside a computer that serves as a server or client in this case. In addition, the aforementioned program can be a program used to implement the above-mentioned functions, or it can be a program that can implement the above-mentioned functions by combining them with programs already recorded in the computer.

[0091] Furthermore, the electronic device in the above embodiments can also be implemented as an assembly (system group) composed of multiple systems. Each system constituting the system group can possess some or all of the functions or functional blocks of the electronic device in the above embodiments. As a system group, it is sufficient to have all the functions or functional blocks of the electronic device.

[0092] Those skilled in the art should recognize that the above embodiments are only used to illustrate this application and are not intended to limit this application. Any appropriate changes and variations made to the above embodiments within the essential spirit and scope of this application fall within the scope of protection claimed in this application.

Claims

1. A collaborative detection method for multiple types of road defects based on UAV aerial images, characterized in that, the method includes the following steps: collecting high-resolution images of road surfaces using drones, and constructing a hybrid training set containing pixel-level disease annotation samples and image-level disease classification labels; constructing an end-to-end multi-task disease detection model, which includes a feature extraction network, a dual-path feature aggregation module, a progressive alignment module, and an adaptive multi-task detection head, wherein the dual-path feature aggregation module achieves multi-scale feature fusion through a top-down and bottom-up bidirectional path, the progressive alignment module achieves accurate alignment of feature maps at different resolutions through spatial offset learning, and the adaptive multi-task detection head includes a parallel classification head, a segmentation head, and a depth estimation head, and dynamically adjusts the weights of each task head through an attention mechanism; training the multi-task disease detection model using a hybrid supervised training strategy, wherein during backpropagation a dynamic gradient blocking mechanism selectively blocks gradients backpropagated from the classification head to the feature extraction network in order to reduce the interference of the classification task on the feature learning of the segmentation task, and a dynamically weighted composite loss function is constructed to calculate the segmentation loss and depth estimation loss for pixel-level labeled samples and the classification loss for image-level labeled samples, the weight ratio of each type of loss being dynamically adjusted according to the training stage; inputting the road image to be detected into the trained model, and simultaneously outputting a pixel-level disease segmentation map, a disease classification result, and a disease depth distribution map; performing connected component analysis and multi-dimensional parameter extraction based on the output disease segmentation map to obtain information on the number, size, area, and depth of diseases, and assessing the disease risk level in combination with preset assessment thresholds; and deploying the trained disease detection model to an online intelligent monitoring platform to achieve automated detection, quantitative assessment, and risk warning of road diseases.

2. The method according to claim 1, characterized in that, The process of constructing the hybrid training set further includes: acquiring road images containing visible and near-infrared bands using a multispectral imaging system mounted on a UAV; performing geometric correction, radiometric correction, and image registration preprocessing on the images; expanding the samples using data augmentation methods such as random cropping, mirror flipping, multi-angle rotation, brightness and contrast adjustment, noise injection, and meteorological simulation; labeling the diseased areas with polygons or pixel-level masks to form pixel-level labeled samples, and labeling the entire image with disease level and type to form image-level classification labels, and using a multi-expert cross-labeling and consistency verification mechanism to ensure labeling quality.

3. The method according to claim 1, characterized in that, The dual-path feature aggregation module introduces deformable convolution in the top-down path to adapt to the non-uniform deformation of the road surface, integrates a channel recalibration module in the bottom-up path to filter effective features across scales, and enhances the semantic consistency of multi-scale features through cascaded dilated convolution and gated attention mechanism.

4. The method according to claim 1, characterized in that, The progressive alignment module is based on a differentiable spatial transformation network. It learns the non-rigid deformation field between feature maps through a multilayer perceptron and establishes a dense correspondence between multi-resolution features using bidirectional optical flow estimation. Combined with a cyclic alignment mechanism, it achieves sub-pixel-level feature space alignment.

5. The method according to claim 1, characterized in that, The adaptive multi-task detection head employs a three-layer attention mechanism, including: channel attention that allocates channel-level importance through a learnable weight matrix, spatial attention that captures global contextual dependencies based on a self-attention mechanism, and task-level attention that dynamically adjusts the weights of classification, segmentation, and depth estimation through a task relevance matrix.

6. The method according to claim 1, characterized in that, The hybrid supervised training strategy also includes: In backpropagation, a task relevance evaluation module is constructed. When the difference in feature distribution between the classification task and the segmentation task exceeds a set threshold, a soft gradient masking mechanism is triggered to selectively block the influence of the classification gradient on the feature extraction network. The training process is divided into three stages: feature learning, task balancing, and discrimination optimization. The weight ratio of segmentation loss and classification loss is dynamically adjusted at different stages. The design incorporates a multi-component composite loss function, which includes segmentation loss, depth estimation loss, classification loss, and feature consistency loss. The feature consistency loss is used to align the distribution of classification and segmentation tasks in the feature space.

7. The method according to claim 6, characterized in that, The dynamic weighted composite loss function further achieves a smooth transition of loss weights through a stage-adaptive weight adjustment mechanism. The segmentation loss weights decay according to the cosine annealing function as the training progresses, while the classification loss weights increase according to the inverse cosine function. At the same time, the weight allocation is dynamically fine-tuned according to the convergence state of each task in the current batch, and the gradients of each task are normalized and projected to maintain the stability of multi-task training.

8. The method according to claim 1, characterized in that, The synchronous output process adopts a multi-branch parallel inference architecture. In a single forward propagation, a classification confidence vector, a pixel-level segmentation probability map, and a disease depth matrix are generated synchronously. The output results are then subjected to cross-task consistency verification and spatial alignment post-processing. Finally, the multiple outputs are encapsulated into structured data objects containing disease type, location, morphological parameters, depth information, and timestamps.

9. The method according to claim 1, characterized in that, The extraction of disease parameters and risk assessment include: using morphological operations and multi-scale clustering to extract connected components from the disease segmentation map and group them semantically; extracting four types of parameters from each diseased area: geometric, topological, depth, and evolutionary; constructing a multi-factor assessment system based on the analytic hierarchy process, which includes structural safety factors, traffic safety factors, and maintenance urgency factors, and classifying the disease risk into five levels based on the quantitative results; and generating a visualized disease map and a decision report containing repair suggestions and maintenance priorities by combining information on the spatial distribution and temporal evolution of diseases.

10. A collaborative detection system for multiple types of road defects based on UAV aerial images, which applies the method described in any one of claims 1 to 9, characterized in that, The system includes: a data acquisition and annotation module, used to collect high-definition images of road surfaces using drones and construct a hybrid training set containing pixel-level disease annotation samples and image-level disease classification labels; a model building module, used to construct an end-to-end multi-task disease detection model, wherein the model includes a feature extraction network, a dual-path feature aggregation module, a progressive alignment module, and an adaptive multi-task detection head, the dual-path feature aggregation module achieves multi-scale feature fusion through a top-down and bottom-up bidirectional path, the progressive alignment module achieves accurate alignment of feature maps at different resolutions through spatial offset learning, and the adaptive multi-task detection head includes a parallel classification head, a segmentation head, and a depth estimation head, and dynamically adjusts the weights of each task head through an attention mechanism; a model training module, used to train the multi-task disease detection model using a hybrid supervised training strategy, wherein during backpropagation a dynamic gradient blocking mechanism selectively blocks the gradients backpropagated from the classification head to the feature extraction network, and a dynamically weighted composite loss function is constructed to dynamically adjust the weight ratio of segmentation loss, classification loss, and depth estimation loss according to the training stage; a detection and inference module, used to input the road image to be detected into the trained model and simultaneously output a pixel-level disease segmentation map, a disease classification result, and a disease depth distribution map; a parameter extraction and evaluation module, used to perform connected component analysis and multi-dimensional parameter extraction on the output disease segmentation map to obtain information on the number, size, area, and depth of diseases, and to evaluate the disease risk level in combination with preset evaluation thresholds; and a system deployment and service module, used to deploy the trained disease detection model to the online intelligent monitoring platform to realize automated detection, quantitative assessment, and risk warning of road diseases.