Unmanned aerial vehicle traffic sign recognition method based on improved RT-DETR and CLIP cooperation

By combining an improved RT-DETR model with the CLIP model, the method addresses missed detection of small targets and false detections against complex backgrounds in UAV traffic sign recognition, achieving efficient and accurate traffic sign recognition and raising the intelligence level of UAV inspection.

CN122200436A · Pending · Publication Date: 2026-06-12 · XINJIANG UNIVERSITY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
XINJIANG UNIVERSITY
Filing Date
2026-03-12
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Traditional traffic sign inspection relies on manual labor, which is inefficient. Drone recognition methods are prone to missing small targets under tilted, overhead views, and fine-grained subclass classification is difficult and easily affected by complex backgrounds, resulting in false detections.

Method used

By combining the improved RT-DETR and CLIP models, image distortion is eliminated through geometric normalization, and the real-time detection capability of RT-DETR and the cross-modal semantic matching of CLIP are utilized to achieve efficient and accurate recognition of traffic signs.

Benefits of technology

It significantly improves the accuracy and efficiency of traffic sign recognition by drones, reduces missed detections and false detections, lowers system maintenance costs, and enhances the intelligence level of drone inspections.

Generated by Eureka AI based on patent content.


Abstract

The application discloses an unmanned aerial vehicle traffic sign recognition method based on improved RT-DETR and CLIP cooperation, and belongs to the technical field of computer vision, unmanned aerial vehicle intelligent inspection and intelligent traffic. The method comprises the following steps: preprocessing a traffic scene image with an inclined visual angle acquired by an unmanned aerial vehicle; acquiring a traffic sign candidate region by using improved RT-DETR; constructing a subcategory keyword dictionary and a text prompt according to a national standard of traffic signs, and performing cross-modal similarity matching by using CLIP to determine the subcategory; and outputting a target geographic coordinate in combination with unmanned aerial vehicle position and height information. The method is suitable for unmanned aerial vehicle traffic sign recognition and positioning.

Description

Technical Field

[0001] This invention relates to the fields of computer vision, intelligent inspection of unmanned aerial vehicles (UAVs), and intelligent transportation, and particularly to a UAV traffic sign recognition method based on improved RT-DETR and CLIP collaboration.

Background Technology

[0002] With the rapid development of intelligent transportation systems, road traffic signs, as the core carriers of traffic rules and road condition information, have their integrity and visibility directly affecting driving safety and traffic efficiency. Against this backdrop, accurate and efficient automated inspection and management of massive numbers of traffic signs has become an urgent need.

[0003] Traditional traffic sign inspections primarily rely on manual patrols or vehicle-mounted mobile measurement systems, which still present significant challenges in meeting the demands of large-scale and refined management. Manual patrols are constrained by manpower and time costs, making it difficult to achieve high-frequency, comprehensive, and routine inspections across large-scale road networks, easily leading to blind spots and missed inspections. Furthermore, manual interpretation is highly subjective, with inconsistent assessments of sign conditions such as obstruction, fading, and damage. While vehicle-mounted inspections offer some efficiency improvements, their ability to acquire information in complex areas such as roadside slopes, ramp merging areas, and median strips is limited, making it difficult to obtain stable and complete sign information.

[0004] On the other hand, with the rise of the low-altitude economy, drone technology is gradually being applied to the field of transportation maintenance. In highway inspection, drones have unique advantages: they possess a wide field of view from high altitudes, are not limited by ground terrain or road conditions, and can flexibly traverse complex road sections. Equipped with high-resolution cameras, drones can quickly acquire road images containing rich geographical information, providing a new data source for traffic sign recognition.

[0005] However, relying solely on images captured by drones for recognition faces unique challenges: drones typically operate from high altitudes or at an angle, causing traffic signs to appear extremely small and subject to severe perspective distortion (e.g., circular signs become elliptical); simultaneously, the background often contains complex textures such as vegetation and buildings, making false detections highly likely. Manually reviewing the massive amounts of drone imagery data would be extremely labor-intensive and fail to meet real-time requirements. Therefore, introducing advanced deep learning technology is essential to improving the efficiency and accuracy of traffic sign recognition from a drone's perspective.

[0006] The selected RT-DETR (Real-Time Detection Transformer) is a cutting-edge real-time object detection model. Compared to traditional CNN architectures such as the YOLO series, it introduces a Transformer architecture, which can more effectively capture global contextual information in images. In UAV traffic sign recognition, the improved RT-DETR model can accurately capture small and deformed traffic sign candidate regions in complex backgrounds, effectively solving the problem that traditional detectors easily miss small targets, thus narrowing down the target range for subsequent accurate identification.

[0007] CLIP (Contrastive Language-Image Pre-training) is a powerful vision-language pre-trained model capable of understanding the semantic relationships between images and text. In fine-grained traffic sign recognition, traditional classification models often struggle to distinguish between subclasses that are extremely similar in appearance and to identify new signs not present in the training set. Using the CLIP model, a fine-grained text description (Prompt) can be constructed to directly perform semantic-level verification and classification of image content, thus accurately reading the meaning of signs without OCR and possessing zero-shot capability for recognizing new categories. Combining the improved RT-DETR model with the CLIP model for traffic sign recognition from a UAV perspective offers several advantages.

[0008] First, this combination allows for the complementarity of coarse detection and fine verification, significantly improving recognition accuracy. The improved detection network, with its strong perception of small targets, reduces missed detections; while the CLIP model, with its powerful semantic understanding capabilities, performs fine-grained verification of the detection results, effectively eliminating false targets such as roadside billboards and accurately distinguishing specific marker subclasses, thereby reducing false recognition.

[0009] Secondly, this combination enhances the system's flexibility and scalability. RT-DETR handles localization, while CLIP handles fine-grained classification. When traffic management departments implement new signs, there's no need to retrain the entire detection network; recognition can be achieved simply by adding the corresponding description to CLIP's text library, significantly reducing system maintenance costs.

[0010] Furthermore, considering the unique perspective of drones, this invention incorporates geometric normalization processing into the workflow. By combining drone attitude parameters to correct image distortion, the image features input to the model are made closer to the standard frontal view, further ensuring recognition stability under complex flight conditions such as oblique photography and large-angle overhead views. This is of great significance for building an intelligent traffic inspection system for drones.

Summary of the Invention

[0012] This invention provides a UAV traffic sign recognition method based on improved RT-DETR and CLIP collaboration, addressing the technical problems of low efficiency in traditional manual inspections, easy omission of small targets from oblique, top-down perspectives in existing UAV recognition methods, difficulties in fine-grained subclass classification, and susceptibility to false detections due to complex background interference. Specifically, it includes the following steps:

[0013] The system acquires traffic scene images taken by drone from a tilted perspective and synchronizes flight attitude parameters. It then uses inverse perspective transformation to construct a geometrically normalized view, eliminates top-down perspective distortion, and corrects traffic signs from the tilted perspective to a fitted frontal view to match the view distribution of subsequent models.

[0014] An improved RT-DETR real-time detection network is constructed, which introduces frequency-selective adaptive sampling, learnable gated fusion, and structure-reparameterized convolution in the feature extraction and fusion stages. It performs candidate target detection on preprocessed images and outputs a set of bounding boxes containing small and blurred targets, thus solving the problem of difficult localization.

[0015] Based on the national standard for traffic signs, a fine-grained sub-category keyword dictionary is constructed to generate text prompt templates containing shape, color, and specific numerical semantics. Negative sample prompts containing common background interference are also constructed, and a text prototype embedding library is generated using the CLIP text encoder.

[0016] The cropped candidate image patch is input into the CLIP image encoder to obtain visual features. The cross-modal cosine similarity between the patch and each vector in the text prototype embedding library is calculated. Based on the similarity probability distribution, the accurate identification of specific subclasses and the elimination of false detection targets are achieved.

[0017] By fusing the detection confidence of RT-DETR with the cross-modal matching confidence of CLIP, and combining UAV positioning information, pixel coordinates are mapped to a geographic coordinate system, and geographically labeled recognition results are output.

[0018] The beneficial effects of the technical solution of this invention are as follows: By combining the geometric normalization processing of UAV pose, the problem of top-down deformation is effectively solved; by improving the RT-DETR model architecture, the data structure is clear, significantly improving the perception capability and positioning accuracy of small targets in complex backgrounds, while having low computational load and strong real-time performance; by using CLIP cross-modal semantic matching to replace the traditional classification head, the function of identifying specific numerical subclasses can be realized without a large number of fine-grained annotations, and false detections can be effectively eliminated through semantic differences, greatly improving the intelligence level and operational efficiency of UAV traffic inspection.

Attached Figure Description

[0020] Figure 1 is a flowchart of the UAV traffic sign recognition method based on improved RT-DETR and CLIP collaboration of the present invention.

[0021] Figure 2 is a schematic of the frequency-selective adaptive sampling scheme of the UAV traffic sign recognition method based on improved RT-DETR and CLIP collaboration of the present invention.

[0022] Figure 3 is a schematic of the learnable gated fusion scheme of the UAV traffic sign recognition method based on improved RT-DETR and CLIP collaboration of the present invention.

[0023] Figure 4 is a schematic of the structural reparameterization scheme of the UAV traffic sign recognition method based on improved RT-DETR and CLIP collaboration of the present invention.

[0024] Figure 5 is a schematic of the text prototype library construction and cross-modal matching retrieval scheme of the UAV traffic sign recognition method based on improved RT-DETR and CLIP collaboration of the present invention.

Detailed Implementation

[0025] The present invention aims to overcome the problems of traditional highway traffic sign inspection, which relies mainly on manual or vehicle-mounted inspection: it is time-consuming and labor-intensive, has limited coverage, is significantly affected by obstructions, and has difficulty processing large volumes of drone-collected images in real time.

[0026] The technical solution of this invention is: a UAV traffic sign recognition method based on improved RT-DETR and CLIP collaboration, the steps of which are as follows:

[0027] S1: Acquire a traffic scene image I taken by the UAV from a tilted perspective, and simultaneously acquire the flight and imaging parameters M corresponding to the image. The flight and imaging parameters M include at least one of the camera intrinsic parameters, the UAV attitude angle, and altitude information. Preprocessing the image I yields the preprocessed image I′.

[0028] S11: Perform lens distortion correction and image stabilization on the image, and construct a geometric transformation based on the flight and imaging parameters M to project the tilted viewpoint onto the ground approximate normal view coordinate system to achieve scale normalization and obtain the preprocessed image.
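The geometric transformation in S11 can be illustrated with a pure-rotation homography. The following is a minimal sketch, not the patented procedure: it assumes a pinhole camera with intrinsic matrix K, a tilt described by a single pitch angle about the camera x-axis, and no lens distortion. The function names and the sign convention of the pitch angle are illustrative assumptions.

```python
import numpy as np

def rotation_x(angle_rad):
    """Rotation matrix about the camera x-axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, c, -s],
                     [0.0, s, c]])

def normalizing_homography(K, pitch_rad):
    """Homography H = K R K^-1 that re-renders the image as if the
    camera had been rotated by -pitch (assumption: pure rotation)."""
    R = rotation_x(-pitch_rad)
    return K @ R @ np.linalg.inv(K)

def warp_points(H, pts):
    """Apply a 3x3 homography to an (N, 2) array of pixel coordinates."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]

# Hypothetical intrinsics: focal length 1000 px, principal point (640, 360).
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
H = normalizing_homography(K, np.deg2rad(30.0))
corners = np.array([[0.0, 0.0], [1279.0, 0.0], [1279.0, 719.0], [0.0, 719.0]])
warped_corners = warp_points(H, corners)
```

The same matrix could then be handed to an image-warping routine such as OpenCV's `warpPerspective` to resample the full frame.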

[0029] S12: Perform denoising and contrast enhancement on the preprocessed image to improve the separability of weak texture traffic signs and distant small targets;

[0030] S2: Use RT-DETR for target detection to detect traffic signs in images taken by drones. The RT-DETR model achieves efficient detection and obtains the initial target detection category confidence.

[0031] S21: Train the RT-DETR model using the collected dataset. Input the preprocessed image I′ into the RT-DETR model, which consists of a backbone network, an efficient hybrid encoder, and a Transformer decoder.

[0032] S211: A Frequency Selective Adaptive Sampling (FSAS) module is introduced in the feature extraction stage of the model. FSAS includes Frequency Aware Feature Selection (FFS) to suppress background interference. The spatial feature information is transformed to the frequency domain using Fast Fourier Transform (FFT), and response modeling is performed according to different frequency bands. A set of lightweight convolutions generates a set of weights that are allocated according to different frequencies, resulting in enhanced features after frequency denoising.
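The frequency-domain reweighting described in S211 can be sketched as follows. This is a simplified illustration: the band weights are fixed numbers supplied by the caller, whereas the patent generates them with lightweight convolutions. The band edges, function names, and single-channel input are assumptions made for brevity.

```python
import numpy as np

def radial_frequency(h, w):
    """Radial spatial frequency (cycles/pixel) of every FFT bin."""
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    return np.sqrt(fx ** 2 + fy ** 2)

def frequency_select(feature, band_edges, band_weights):
    """Reweight a single-channel (H, W) feature map per frequency band.

    band_edges: increasing edges of length B + 1 (hypothetical values).
    band_weights: one gain per band, length B.
    """
    h, w = feature.shape
    r = radial_frequency(h, w)
    spectrum = np.fft.fft2(feature)
    gain = np.zeros_like(r)
    for lo, hi, weight in zip(band_edges[:-1], band_edges[1:], band_weights):
        gain[(r >= lo) & (r < hi)] = weight
    # The gain is radially symmetric, so the inverse transform is real.
    return np.real(np.fft.ifft2(spectrum * gain))

rng = np.random.default_rng(0)
feature = rng.normal(size=(16, 16))
edges = [0.0, 0.1, 0.3, 0.8]          # covers the maximum radius of about 0.707
enhanced = frequency_select(feature, edges, [0.5, 1.0, 1.5])
```

Here the low band is attenuated and the high band amplified, which mirrors the stated goal of suppressing background while keeping edge detail.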

[0033] S212: FSAS also includes Radially Adaptive Dilated Deformable Convolution (RADDC), which generates offsets and sampling points based on traffic sign geometric constraints. The sampling points can be adaptively modulated according to the offsets, and geometrically aligned enhanced features are obtained through deformable convolution.

[0034] S213: A learnable gated fusion module (LGF) is set in the neck network to perform channel alignment and scale alignment on adjacent scale features and then fuse them according to the gating coefficient to alleviate cross-scale semantic and detail misalignment.
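A gated fusion of two adjacent scales, as described in S213, might look like the sketch below. It assumes that channel alignment has already been done (both inputs have C channels), that scale alignment is a nearest-neighbor 2x upsampling, and that the gate is a 1x1 convolution followed by a sigmoid. None of these specifics are stated in the patent.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def upsample_nearest(x, factor):
    """Nearest-neighbor upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def gated_fusion(shallow, deep, w_gate, b_gate):
    """Fuse a shallow (C, H, W) map with a deep (C, H/2, W/2) map.

    w_gate: (C, 2C) weights of a hypothetical 1x1 gating convolution.
    b_gate: (C,) bias of the gating convolution.
    """
    deep_up = upsample_nearest(deep, 2)                    # scale alignment
    stacked = np.concatenate([shallow, deep_up], axis=0)   # (2C, H, W)
    logits = np.einsum('oc,chw->ohw', w_gate, stacked) + b_gate[:, None, None]
    gate = sigmoid(logits)                                 # gating coefficient in (0, 1)
    return gate * shallow + (1.0 - gate) * deep_up

rng = np.random.default_rng(0)
shallow = rng.normal(size=(4, 8, 8))
deep = rng.normal(size=(4, 4, 4))
fused = gated_fusion(shallow, deep, rng.normal(size=(4, 8)) * 0.1, np.zeros(4))
```

A gate near 1 keeps shallow detail, and a gate near 0 keeps deep semantics, so the network can decide the mix per channel and per location.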

[0035] S214: In the neck network, a structurally reparameterized convolutional module RMBConv is introduced. During the training phase, a multi-branch convolutional structure is adopted, and during the inference phase, the multi-branch convolutional structure is equivalently folded into a single convolutional operator to improve deployment efficiency.
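The equivalent folding in S214 can be demonstrated numerically. The patent does not list the branches of RMBConv, so this sketch assumes a RepVGG-style layout (a 3x3 branch, a 1x1 branch, and an identity branch, with batch normalization omitted). It shows that the sum of the three branches equals a single 3x3 convolution with a folded kernel.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv3x3(x, kernel):
    """Stride-1, zero-padded 3x3 convolution (cross-correlation).

    x: (C_in, H, W); kernel: (C_out, C_in, 3, 3).
    """
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    windows = sliding_window_view(xp, (3, 3), axis=(1, 2))  # (C_in, H, W, 3, 3)
    return np.einsum('chwij,ocij->ohw', windows, kernel)

def multi_branch(x, k3, k1):
    """Training-time form: 3x3 branch + 1x1 branch + identity branch."""
    return conv3x3(x, k3) + np.einsum('oc,chw->ohw', k1, x) + x

def fold_branches(k3, k1):
    """Inference-time form: fold all branches into one 3x3 kernel."""
    channels = k3.shape[0]
    folded = k3.copy()
    folded[:, :, 1, 1] += k1                # a 1x1 kernel is the center tap
    folded[:, :, 1, 1] += np.eye(channels)  # identity is a unit 1x1 kernel
    return folded

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 10, 10))
k3 = rng.normal(size=(4, 4, 3, 3))
k1 = rng.normal(size=(4, 4))
train_out = multi_branch(x, k3, k1)
infer_out = conv3x3(x, fold_branches(k3, k1))
```

Because convolution is linear, the two outputs agree to floating-point precision, while the folded form needs only one operator at inference time.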

[0036] S215: In the decoding phase, a fixed set of learnable object queries is initialized. These query vectors represent potential target locations and features. Using an attention mechanism, the object queries serve as queries, and the image features output by the hybrid encoder serve as keys and values. Attention weights are calculated to aggregate key feature information in the image, thereby updating the state of the object queries. The calculation formula is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where $Q$ is the object query matrix, $K$ and $V$ are the key and value matrices derived from the image features, and $d_k$ is the dimension of the key vectors. The factor $1/\sqrt{d_k}$ scales the dot products to prevent gradient problems;
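The query update in S215 corresponds to standard scaled dot-product attention. The sketch below is a single-head version without learned projection matrices. The number of queries, the number of encoder tokens, and the feature dimension are arbitrary illustrative values.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for object queries attending to image features.

    Q: (num_queries, d_k); K: (num_tokens, d_k); V: (num_tokens, d_v).
    """
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
queries = rng.normal(size=(5, 32))     # 5 object queries (illustrative)
memory = rng.normal(size=(100, 32))    # 100 encoder tokens (illustrative)
updated, attn = cross_attention(queries, memory, memory)
```

Each row of `attn` is a probability distribution over encoder tokens, so every updated query is a weighted average of image features.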

[0037] S22: The model detects traffic signs in the image and directly outputs the predicted bounding boxes and class confidence through the bipartite graph matching mechanism. It outlines the detected target areas and records the coordinates of the four vertices and the corresponding class of each detection box. RT-DETR avoids the cumbersome non-maximum suppression post-processing operation in traditional detectors through end-to-end prediction.

[0038] S3: Establish a set of traffic sign subcategories S based on the national standard for traffic signs, and construct a keyword dictionary D and a text prompt template for each subcategory. The text prompt template shall contain at least two of the following attributes: shape attribute, color attribute, pattern symbol attribute, and text or number attribute.

[0039] S31: For subcategories containing text or number differences, explicitly introduce corresponding text or number constraint descriptions or placeholder descriptions in the text prompt template, and establish mapping rules between text or numbers and subcategories.

[0040] S32: Input the text prompt templates corresponding to each subcategory into the text encoder of the vision-language pre-trained model to obtain text embedding vectors, and cache the vectors as text embedding libraries after normalization.
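Steps S3 to S32 can be sketched as a small prompt builder plus a cached, normalized embedding library. The dictionary entries, the prompt wording, and the negative prompt are hypothetical examples. `toy_text_encoder` is a deterministic stand-in for a real CLIP text encoder, used only so that the sketch runs without model weights.

```python
import hashlib
import numpy as np

# Hypothetical keyword dictionary D: sub-category -> attributes.
KEYWORDS = {
    "speed_limit_60": {"shape": "circular", "color": "white with a red border", "number": "60"},
    "speed_limit_80": {"shape": "circular", "color": "white with a red border", "number": "80"},
    "no_parking": {"shape": "circular", "color": "blue with a red border and a red slash", "number": None},
}
# Hypothetical negative prompt for common background interference.
NEGATIVE_PROMPTS = {"background_billboard": "a photo of a roadside billboard"}

def build_prompt(attrs):
    """Fill a template with shape, color, and an optional number attribute."""
    text = f"a photo of a {attrs['shape']} traffic sign, {attrs['color']}"
    if attrs.get("number"):
        text += f", showing the number {attrs['number']}"
    return text

def toy_text_encoder(text, dim=64):
    """Stand-in encoder: one pseudo-random vector per string (NOT CLIP)."""
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")
    return np.random.default_rng(seed).normal(size=dim)

def build_embedding_library(prompts, encoder=toy_text_encoder):
    """Encode every prompt, L2-normalize, and return names with a cached matrix."""
    names = list(prompts)
    vectors = np.stack([encoder(prompts[name]) for name in names])
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
    return names, vectors

prompts = {name: build_prompt(attrs) for name, attrs in KEYWORDS.items()}
prompts.update(NEGATIVE_PROMPTS)
names, library = build_embedding_library(prompts)
```

Adding a new sign then only means adding a dictionary entry and re-encoding one prompt, which matches the maintenance argument made in the description.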

[0041] S4: Perform cross-modal similarity classification;

[0042] S41: After scaling the cropped target image, input it into the CLIP image encoder segment for encoding to obtain... ;

[0043] S42: Input the target image patch cropped from the RT-DETR detections into the image encoder to convert it into a visual feature vector. The core of the model lies in calculating the alignment degree between image features and text features in the joint embedding space, i.e., cosine similarity. The calculation formula is:

$$p_i = \frac{\exp\left(\cos(v, t_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\cos(v, t_j)/\tau\right)}$$

where $v$ is the image feature vector output by the image encoder, $t_i$ is the text feature vector corresponding to the $i$-th text description, $\cos(\cdot,\cdot)$ denotes the cosine similarity, $\tau$ is the temperature parameter used to scale the logits, and $N$ is the total number of candidate text categories. Finally, the Softmax function is used for normalization, resulting in a sub-category probability distribution. The category with the highest probability is selected as the semantic recognition result, and the cross-modal confidence score is output.
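The matching step in S42 amounts to a temperature-scaled softmax over cosine similarities. The following sketch assumes the image and text features are plain NumPy vectors. The temperature value 0.01 and the use of the top probability as the cross-modal confidence are assumptions, not values taken from the patent.

```python
import numpy as np

def classify_patch(image_vec, text_library, temperature=0.01):
    """Return (best index, its probability, full distribution).

    image_vec: (d,) visual feature; text_library: (N, d) text features.
    """
    v = image_vec / np.linalg.norm(image_vec)
    t = text_library / np.linalg.norm(text_library, axis=1, keepdims=True)
    logits = (t @ v) / temperature        # cosine similarity divided by tau
    logits = logits - logits.max()        # stabilize the exponentials
    probs = np.exp(logits) / np.exp(logits).sum()
    best = int(np.argmax(probs))
    return best, float(probs[best]), probs

# Toy example: three orthogonal text prototypes, image close to prototype 1.
text_library = np.eye(3)
image_vec = np.array([0.1, 0.9, 0.2])
best, confidence, probs = classify_patch(image_vec, text_library)
```

If the best match is a negative prompt (for example a billboard description), the candidate can be discarded as a false detection.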

[0044] S5: Fuse the detection confidence with the cross-modal confidence (e.g., by weighted fusion) to obtain the final confidence, and output traffic sign recognition results including sub-category, candidate box position, and final confidence score;
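One simple reading of the weighted fusion in S5 is a convex combination of the two scores. The weight `alpha` and its default value are hypothetical, since the patent does not give a specific fusion rule or weight.

```python
def fuse_confidence(det_conf, clip_conf, alpha=0.5):
    """Weighted fusion of detection and cross-modal confidence.

    alpha is a hypothetical weight in [0, 1] on the detection score.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * det_conf + (1.0 - alpha) * clip_conf

result = {
    "sub_category": "speed_limit_60",          # illustrative label
    "box": (412, 188, 446, 222),               # illustrative pixel box
    "confidence": fuse_confidence(0.8, 0.6),
}
```

With both inputs in [0, 1], the fused score also stays in [0, 1], so a single threshold can be applied downstream.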

[0045] S51: Based on the UAV pose and imaging geometry, map the traffic sign recognition results to the geographic coordinate system and output the geographic location information of the traffic sign.
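The mapping in S51 depends on the full UAV pose and camera model. The sketch below covers only the simplest case and rests on several assumptions: a nadir-looking camera, flat ground, the image top aligned with the UAV heading, and a local flat-earth conversion from meters to degrees. All parameter names are illustrative.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS-84 equatorial radius

def pixel_to_geo(u, v, cx, cy, focal_px, altitude_m, lat_deg, lon_deg, yaw_deg=0.0):
    """Map pixel (u, v) to latitude/longitude under the stated assumptions.

    yaw_deg is the heading, measured clockwise from north.
    """
    gsd = altitude_m / focal_px          # ground meters per pixel
    right_m = (u - cx) * gsd
    forward_m = -(v - cy) * gsd          # image v grows downward
    yaw = math.radians(yaw_deg)
    north_m = forward_m * math.cos(yaw) - right_m * math.sin(yaw)
    east_m = forward_m * math.sin(yaw) + right_m * math.cos(yaw)
    lat = lat_deg + math.degrees(north_m / EARTH_RADIUS_M)
    lon = lon_deg + math.degrees(
        east_m / (EARTH_RADIUS_M * math.cos(math.radians(lat_deg))))
    return lat, lon

# Sign detected 100 px above the image center, UAV at 100 m heading north.
sign_lat, sign_lon = pixel_to_geo(640, 260, 640, 360, 1000.0, 100.0, 43.8, 87.6)
```

For oblique views, the offset would instead come from intersecting the pixel's viewing ray with the ground plane using the attitude angles.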

Claims

1. A UAV traffic sign recognition method based on improved RT-DETR and CLIP collaboration, characterized in that:

A traffic scene image I taken by a UAV from a tilted perspective is acquired and preprocessed to obtain I′;

Lens distortion correction and image stabilization are performed on image I, and a geometric transformation is constructed based on flight and imaging parameters M to project the tilted viewpoint onto the ground approximate normal view coordinate system to achieve scale normalization, thus obtaining the preprocessed image I′;

Denoising and contrast enhancement are performed on the preprocessed image I′ to improve the separability of weak-texture traffic signs and distant small targets;

RT-DETR is used for object detection to identify traffic signs in the images captured by the UAV. The RT-DETR model uses a hybrid encoder and Transformer decoder structure to achieve efficient detection and obtain the initial class confidence of object detection;

The RT-DETR model is trained using the collected dataset. The preprocessed image I′ is input into the RT-DETR model, which consists of a backbone network, an efficient hybrid encoder, and a Transformer decoder, to obtain preliminary detection and classification confidence;

In the feature extraction stage of the model, a frequency selective adaptive sampling module (FSAS) is introduced. FSAS includes frequency-aware feature selection (FFS) and radially adaptive dilated deformable convolution (RADDC). FFS is used to suppress background interference, and RADDC is used to align the sampling points with the geometric constraints of traffic signs;

A learnable gated fusion module (LGF) is set in the neck network to perform channel alignment and scale alignment on adjacent scale features and then fuse them according to the gating coefficient to alleviate cross-scale semantic and detail misalignment;

In the neck network, a structurally reparameterized convolutional module RMBConv is introduced. RMBConv adopts a multi-branch convolutional structure during the training phase and folds the multi-branch convolutional structure into a single convolutional operator during the inference phase to improve deployment efficiency;

During the decoding phase, a fixed set of learnable object queries is initialized. These query vectors represent potential target locations and features. Using an attention mechanism, the object queries serve as queries, and the image features output by the hybrid encoder serve as keys and values. Attention weights are calculated to aggregate key feature information from the image, thereby updating the state of the object queries. The calculation formula is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where $Q$ is the object query matrix, $K$ and $V$ are the image feature matrices, and $d_k$ is the dimension of the key vectors. The factor $1/\sqrt{d_k}$ scales the dot products to prevent gradient vanishing or exploding problems. Through this mechanism, the model can focus on salient regions in the image related to traffic signs, thus obtaining the preliminary object detection confidence;

Based on the national standards for traffic signs, a set of traffic sign subcategories S is established, and a keyword dictionary D and a text prompt template are constructed for each subcategory. The text prompt template shall include at least two of the following: shape attribute, color attribute, pattern / symbol attribute, and text or number attribute;

For subcategories containing text or number differences, the corresponding text or number constraint description or placeholder description is explicitly introduced in the text prompt template, and a mapping rule between text or numbers and subcategories is established;

The text prompt templates corresponding to each subcategory are input into the text encoder of the vision-language pre-trained model to obtain text embedding vectors, which are normalized and cached as a text embedding library;

The target image patch cropped from the RT-DETR detections is input into the image encoder and converted into a visual feature vector. The core of the model lies in calculating the alignment between image features and text features in the joint embedding space, i.e., cosine similarity. To obtain the probability of belonging to a specific category, the image feature vector is multiplied by all candidate text feature vectors, divided by the temperature parameter, and finally normalized using the Softmax function. The calculation formula is as follows:

$$p_i = \frac{\exp\left(\cos(v, t_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\cos(v, t_j)/\tau\right)}$$

where $v$ is the image feature vector output by the image encoder, $t_i$ is the text feature vector corresponding to the $i$-th text description, $\cos(\cdot,\cdot)$ denotes the cosine similarity, $\tau$ is the temperature parameter used to scale the logits, and $N$ is the total number of candidate text categories. After normalization, the sub-category probability distribution is obtained. The category with the highest probability is selected as the semantic recognition result, and the cross-modal confidence score is output;

The detection confidence and the cross-modal confidence are fused by weighted fusion to obtain the final confidence, and traffic sign recognition results including sub-category, candidate box position, and final confidence score are output;

The traffic sign recognition results are mapped to a geographic coordinate system based on the UAV pose and imaging geometry, and the geographic location information of the traffic sign is output.

2. The method according to claim 1, characterized in that: RT-DETR's efficient hybrid encoder transforms input image data into a multi-scale feature representation that the model can understand. Through the Attention-based Intra-scale Feature Interaction (AIFI) and CNN-based Cross-scale Feature Fusion (CCFM) modules, it captures key geometric features and semantic information of traffic signs in the image, so that the model can effectively understand and process complex background data from the perspective of drones. The Transformer decoder in RT-DETR decodes the feature vectors generated by the encoder into the final predicted bounding box and primary class confidence, achieving accurate localization of traffic sign targets.

3. The method according to claim 2, characterized in that: In the feature extraction process, frequency selective adaptive sampling (FSAS) is incorporated into the RT-DETR model. Fast Fourier transform is used to analyze the frequency domain information of the feature map. Based on different frequency responses, weights are selected to enhance the weight of high-frequency edge features and suppress low-frequency background noise, thereby improving the model's ability to perceive small target traffic signs.

4. The UAV traffic sign recognition method based on improved RT-DETR and CLIP collaboration according to claim 3, characterized in that: In the process of cross-scale feature fusion, a learnable gated fusion module (LGF) is integrated into the neck network of the RT-DETR model. The spatial weights of adjacent scale features are calculated through the gating mechanism, and the fusion ratio of deep semantic features and shallow detail features is dynamically adjusted, thereby alleviating the problem of feature loss of small targets in deep networks and improving the alignment effect of multi-scale features.

5. The UAV traffic sign recognition method based on improved RT-DETR and CLIP collaboration according to claim 4, characterized in that: In the feature fusion construction stage, a structurally reparameterized convolution module (RMBConv) is introduced. During the training stage, a multi-branch parallel structure is adopted to enrich the feature solution space. During the inference stage, the multi-branch parameters are equivalently folded into a single convolution operator, thereby improving the model's ability to express traffic sign features while ensuring the real-time inference speed of the UAV.

6. The method according to claim 5, characterized in that: The CLIP model's image encoder converts cropped candidate image patches into visual feature vectors; simultaneously, its text encoder converts constructed text prompt templates into text embedding vectors with the same dimensions as the visual feature vectors. By calculating the cosine similarity between the two in the joint feature space, it indicates the degree of matching between the image content and the text description, thus performing fine-grained semantic classification of traffic signs.

7. The UAV traffic sign recognition method based on improved RT-DETR and CLIP collaboration according to claim 6, characterized in that: During the recognition process, the images captured by the drone are preprocessed. The preprocessing first reads the image files stored on the drone or ground station, and simultaneously reads the drone's attitude angle information, which is then loaded into the image processing module. Through the inverse perspective transformation algorithm, the perspective distortion caused by the drone's overhead shooting is eliminated, and the trapezoidal road surface area is corrected to a rectangle to improve the geometric regularity of traffic signs. Then, the contrast is enhanced by the contrast-limited adaptive histogram equalization algorithm, making traffic signs in shadows or strong light more prominent, which facilitates the subsequent RT-DETR feature extraction and CLIP semantic recognition. At the same time, the image quality is evaluated, and frames with excessive motion blur are removed.