A power grid target detection method based on an improved RT-DETR model

By improving the adaptive spatially aware convolutional module and feature enhancement pyramid of the RT-DETR model, and combining IoU-aware query selection and power grid target space prior, the problem of large target scale differences and feature loss of small targets in power grid inspection is solved, achieving higher detection accuracy and faster convergence speed, and improving the robustness and efficiency of power grid target detection.

CN121904047B · Active · Publication Date: 2026-06-16 · HEFEI UNIV OF TECH +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HEFEI UNIV OF TECH
Filing Date
2026-03-24
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Existing RT-DETR models are ill-suited to the challenges of large differences in target scale, loss of features for small targets, insufficient targeting of detection heads, and low training and inference efficiency in power grid inspection scenarios, resulting in insufficient detection accuracy and poor robustness.

Method used

The RT-DETR model is improved by adopting an adaptive spatially aware convolutional module (ASPC) and a feature enhancement pyramid (PEFP). It combines IoU-aware query selection and power grid target space priors to optimize the backbone network, detection head, and training strategy, thereby improving detection accuracy and efficiency.

Benefits of technology

It achieves higher detection accuracy, faster convergence speed and better real-time performance, adapts to multi-scale changes in power grid targets, improves the recall rate and positioning accuracy of small targets, and solves the problems of receptive field incompatibility and low training and inference efficiency in existing technologies.

Generated by Eureka AI based on patent content.


Abstract

The application discloses a power grid target detection method based on an improved RT-DETR model and relates to the technical field of target detection. The improved RT-DETR model is used to perform power grid target detection, and the improvements include: improving the backbone network by replacing standard convolution in the backbone network with an adaptive spatially aware convolution module, which dynamically adjusts the receptive field to adapt to power grid target scale changes through multi-branch dilated convolution, a scale attention mechanism and deformable convolution; designing a feature enhancement pyramid that integrates high-frequency feature protection, cross-scale bidirectional fusion and spatially constrained attention, strengthening small target features and optimizing small target feature representation by using power grid target spatial priors; and improving the detection head by introducing IoU-aware query selection and power grid target spatial priors, which improves small target recall rate and positioning accuracy and accelerates convergence of the decoder. The application realizes higher detection accuracy, faster convergence speed and better real-time performance.

Description

Technical Field

[0001] This invention relates to the field of target detection technology, and in particular to a power grid target detection method based on an improved RT-DETR model.

Background Technology

[0002] Power grid inspection is a crucial link in ensuring the safe and stable operation of the power system. Timely and accurate detection of the operating status of power equipment (such as insulators, transformers, and towers) is of great significance for preventing faults and improving operation and maintenance efficiency. With the development of artificial intelligence technology, deep learning-based object detection algorithms have been widely used in power grid inspection tasks.

[0003] Currently, mainstream object detection methods can be divided into three categories: traditional machine vision methods, deep learning-based two-stage methods, and single-stage methods. Traditional methods rely on manual feature extraction, have poor adaptability, and struggle to handle complex scenarios; two-stage methods have high detection accuracy but slow inference speed, making it difficult to meet the needs of real-time inspection; single-stage methods, such as RT-DETR (Real-Time DEtection TRansformer), have achieved a good balance between detection accuracy and efficiency, and have become a research hotspot for current power grid inspection tasks.

[0004] However, power grid inspection scenarios have distinctive characteristics: large differences in target scale, a large number of small targets that are easily affected by background interference, and fixed spatial layout patterns among equipment. For example, large substations coexist with small-sized fittings. In transmission line engineering, fittings is the general term for the metal or alloy components used to connect, fix, protect, and support the various parts of a transmission line (conductors, insulators, towers, etc.); they are indispensable basic accessories in the power line system. Existing RT-DETR models, employing fixed receptive field convolution, traditional feature pyramid structures, and general-purpose detection heads, struggle to adapt to these characteristics, resulting in insufficient detection accuracy and poor robustness in power grid target detection tasks. Therefore, it is urgent to optimize the RT-DETR architecture to suit the characteristics of power grid targets and improve its detection performance in power grid inspection scenarios.

Summary of the Invention

[0005] To overcome the shortcomings of the prior art, this invention provides a power grid target detection method based on an improved RT-DETR model. An improved RT-DETR model suitable for power grid target detection is designed to solve the problems of inappropriate receptive field, loss of small target features, insufficient targeting of the detection head, and low training and inference efficiency in the prior art, thereby achieving higher detection accuracy, faster convergence speed, and better real-time performance.

[0006] To achieve the above objectives, the present invention adopts the following technical solution, including:

[0007] A power grid target detection method based on an improved RT-DETR model is proposed. The existing RT-DETR model is improved, and the improved RT-DETR model is trained using a dataset for power grid target detection.

[0008] The dataset contains images of K types of power grid targets, including large targets with an average pixel area exceeding a set threshold for large targets and small targets with an average pixel area below a set threshold for small targets.

[0009] Improvements to the existing RT-DETR model include:

[0010] The backbone network is improved by replacing the standard convolution in the backbone network with an adaptive spatially aware convolution module. The adaptive spatially aware convolution module dynamically adjusts the receptive field to adapt to the scale changes of the power grid target through multi-branch dilated convolution, scale attention mechanism and deformable convolution. After the input image is fed into the backbone network, each downsampled feature map is passed through the adaptive spatially aware convolution module to obtain an enhanced multi-scale feature map.

[0011] The feature enhancement pyramid is designed to integrate high-frequency feature protection, cross-scale bidirectional fusion, and spatially constrained attention. It enhances the features of small targets and optimizes the feature representation of small targets by utilizing prior knowledge of the power grid target space. The output of the backbone network is then passed through the feature enhancement pyramid to obtain a feature-enhanced multi-scale feature map.

[0012] The detection head is improved by introducing IoU-aware query selection and power grid target space prior. The IoU-aware query selection obtains the initial query vector from the feature-enhanced multi-scale feature map. The decoder corrects the query vector based on the power grid target space prior and adopts a hierarchical decoding strategy to output the power grid target detection result.

[0013] Preferably, the processing procedure of the adaptive spatially aware convolutional module is as follows:

[0014] (1) Multi-branch dilated convolution: The input feature map is processed in parallel by multiple branches, and the output is a multi-branch feature map, which is adapted to the scale range of the dataset from large to small targets; the multi-branch corresponds to multiple convolutional layers with different dilation rates; the input feature map is a downsampled feature map of the input image obtained by passing it through the backbone network;

[0015] (2) Scale attention weight generation: Global average pooling is performed on the input feature map, and the weights of each branch are generated through a fully connected layer. After normalization, the weight vector is obtained.

[0016] (3) Multi-branch feature fusion: The multi-branch feature maps are weighted and summed according to the normalized weights to obtain the fused feature map;

[0017] (4) Deformable convolution: predict the dynamic offset of sampling points through convolutional layers, and perform deformable convolution on the fused feature map based on the predicted dynamic offset of sampling points to obtain the feature map after deformable convolution.

[0018] (5) Perform residual connection between the output feature map of deformable convolution and the fused feature map to obtain the feature map enhanced by the adaptive spatially aware convolution module;

[0019] The input image is fed into the backbone network and then downsampled sequentially to obtain multi-scale downsampled feature maps. Each downsampled feature map is then processed by an adaptive spatially aware convolutional module to obtain an enhanced multi-scale feature map.
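Steps (2) and (3) above (scale attention weight generation and multi-branch feature fusion) can be illustrated with a minimal pure-Python sketch. It assumes softmax as the normalization, as in the embodiment described later; the fully connected parameters `fc_w` and `fc_b` and the three stand-in branch outputs are hypothetical values chosen for illustration, not learned weights or real dilated-convolution outputs.

```python
import math

def global_average_pool(feature_map):
    """feature_map: list of C channels, each an H x W list of lists. Returns C channel means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]

def fully_connected(vec, weights, bias):
    """weights holds one row per branch, each row of length C."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_branches(branch_maps, branch_weights):
    """Weighted sum of same-shaped branch outputs (each C x H x W)."""
    C = len(branch_maps[0])
    H = len(branch_maps[0][0])
    W = len(branch_maps[0][0][0])
    return [[[sum(w * b[c][y][x] for w, b in zip(branch_weights, branch_maps))
              for x in range(W)] for y in range(H)] for c in range(C)]

# Toy input feature map: C=2 channels, 2x2 spatial size.
x = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]]]

# Hypothetical FC parameters for 3 branches (learned in practice).
fc_w = [[0.5, 0.1], [0.2, 0.2], [-0.3, 0.4]]
fc_b = [0.0, 0.0, 0.0]

pooled = global_average_pool(x)
weights = softmax(fully_connected(pooled, fc_w, fc_b))

# Stand-ins for the outputs of the three dilated-convolution branches.
branches = [
    x,
    [[[2.0 * v for v in row] for row in ch] for ch in x],
    [[[0.0 for _ in row] for row in ch] for ch in x],
]
fused = fuse_branches(branches, weights)
```

Because the weights come from a softmax, they are positive and sum to 1, so the fused map is a convex combination of the branch outputs.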

[0020] Preferably, for the enhanced multi-scale feature maps output by the adaptive spatially aware convolutional module after sequential downsampling in the backbone network, the processing procedure of the feature enhancement pyramid is as follows:

[0021] (1) High-frequency feature protection:

[0022] High-frequency enhancement is performed on the shallow feature map corresponding to the small target features to enhance the high-frequency details of the small targets in the dataset, resulting in a high-frequency enhanced shallow feature map.

[0023] High-pass filtering is applied to the shallow feature map to retain only high-frequency details and filter out low-frequency background information, resulting in a high-pass filtered shallow feature map.

[0024] The shallow feature map enhanced by high frequency and the shallow feature map after high-pass filtering are concatenated, and the high-frequency attention weight map of the shallow feature is calculated by using an activation function.

[0025] Based on the high-frequency attention weight map, weighted summation and residual connections are performed to obtain the final shallow feature map after high-frequency feature protection.

[0026] (2) Design two paths, one from top to bottom and one from bottom to top, to perform cross-scale bidirectional fusion:

[0027] The top-down approach uses upsampling to refine the semantic information of high-level semantic features, i.e., the semantic information of large targets, step by step.

[0028] The bottom-up approach uses downsampling to enhance the detailed features of the low-level objects, i.e., the detailed features of small targets, across scales.

[0029] Learnable fusion weights are introduced to balance the contributions of each scale, and multi-scale cross-scale bidirectional fusion feature maps are obtained based on the fusion weights.

[0030] (3) Spatial Constrained Attention: Based on the multi-scale cross-scale bidirectional fusion feature map, a multi-scale attention map is generated. The attention map is multiplied with the corresponding scale cross-scale bidirectional fusion feature map and then connected through residuals to obtain the feature-enhanced multi-scale feature map.
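The high-frequency feature protection of step (1) above can be sketched on a single-channel feature map. The text does not specify the filters or the fusion, so everything concrete here is an assumption: the high-pass filter is taken as the input minus a 3×3 box blur, the enhancement is an unsharp-mask style boost with a hypothetical factor `beta`, and a sigmoid over a weighted sum of the two maps stands in for the concatenation followed by an activation function.

```python
import math

def box_blur(img):
    """3x3 mean filter with edge replication (a simple low-pass filter)."""
    H, W = len(img), len(img[0])
    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), H - 1)
                    xx = min(max(x + dx, 0), W - 1)
                    acc += img[yy][xx]
            out[y][x] = acc / 9.0
    return out

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def high_frequency_protect(shallow, beta=1.0, w_enh=1.0, w_hp=1.0):
    """Returns (protected map, high-pass map, attention map).

    beta, w_enh and w_hp are hypothetical parameters for illustration.
    """
    H, W = len(shallow), len(shallow[0])
    low = box_blur(shallow)
    # High-pass filtering: keep only what the low-pass filter removed.
    high_pass = [[shallow[y][x] - low[y][x] for x in range(W)] for y in range(H)]
    # High-frequency enhancement: boost details on top of the input.
    enhanced = [[shallow[y][x] + beta * high_pass[y][x] for x in range(W)] for y in range(H)]
    # Attention weight map from both maps (stand-in for concat + activation).
    attn = [[sigmoid(w_enh * enhanced[y][x] + w_hp * high_pass[y][x])
             for x in range(W)] for y in range(H)]
    # Weighted summation of the two maps plus a residual connection.
    out = [[shallow[y][x] + attn[y][x] * enhanced[y][x] + (1.0 - attn[y][x]) * high_pass[y][x]
            for x in range(W)] for y in range(H)]
    return out, high_pass, attn

# A single bright pixel behaves like a small target on a flat background.
impulse = [[0.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 0.0]]
protected, hp, attention = high_frequency_protect(impulse)
```

On a perfectly flat input the high-pass map is zero, so only the residual and the attention-weighted copy of the input remain, while an isolated bright pixel is amplified.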

[0031] Preferably, for the feature-enhanced multi-scale feature map output by the feature enhancement pyramid, the IoU-aware query selection is used to obtain the initial query vector from the feature-enhanced multi-scale feature map, and the processing procedure is as follows:

[0032] First, anchor points are generated on each layer of the feature map;

[0033] For each anchor point, a feature vector is extracted from the feature map of the corresponding layer through the region of interest alignment operation, which serves as a candidate query vector. The candidate query vectors of all layers constitute a candidate query vector set.

[0034] Input the candidate query vector set into the prediction head to obtain the classification score and predicted bounding box of each candidate query vector, and calculate the IoU value between each predicted bounding box and the corresponding ground truth bounding box in the dataset.

[0035] Design a joint scoring function that integrates classification scores and IoU values; sort the joint scores in descending order and select the top K candidate query vectors as the initial query vectors.
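The selection in the last step can be sketched as follows. The text does not give the joint scoring function, so the geometric blend `cls_score ** (1 - alpha) * iou ** alpha` and the value `alpha = 0.5` are assumptions for illustration only; the candidate identifiers and scores are made up.

```python
def joint_score(cls_score, iou, alpha=0.5):
    """Hypothetical joint score blending classification confidence and IoU."""
    return (cls_score ** (1.0 - alpha)) * (iou ** alpha)

def select_top_k(candidates, k, alpha=0.5):
    """candidates: list of (query_id, cls_score, iou). Returns the k best query ids."""
    ranked = sorted(candidates, key=lambda c: joint_score(c[1], c[2], alpha), reverse=True)
    return [c[0] for c in ranked[:k]]

# Made-up candidates: q0 and q3 are confident but poorly localized.
candidates = [
    ("q0", 0.90, 0.30),
    ("q1", 0.70, 0.80),
    ("q2", 0.60, 0.60),
    ("q3", 0.95, 0.10),
]
initial_queries = select_top_k(candidates, k=2)
cls_only = [c[0] for c in sorted(candidates, key=lambda c: c[1], reverse=True)[:2]]
```

With these numbers, ranking by classification score alone would keep `q3` and `q0`, whereas the joint score keeps `q1` and `q2`, the candidates whose boxes are also well localized.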

[0036] Preferably, the decoder corrects the query vector based on prior knowledge of the power grid target space, as shown below:

[0037] ;

[0038] where the terms denote, respectively, the spatial prior distribution map, the weighting coefficient, the initial query vector, and the corrected query vector, i.e., the initial query vector incorporating prior spatial information, which serves as the input to the decoder.

[0039] Preferably, the decoder employs a hierarchical decoding strategy for power grid target detection, as detailed below:

[0040] Large targets are coarsely decoded based on the deep feature maps: semantic complementarity is achieved through downsampling alignment and channel concatenation to obtain a fused feature map for large target detection.

[0041] Select, from the corrected query vectors, the query vectors related to large targets to obtain the set of large-target-related query vectors;

[0042] The fused large target detection feature map and the large target-related query vector set are input into the Transformer decoder to obtain the attention output features of the large target coarse decoding stage;

[0043] The feature vectors after attention interaction are used by the prediction head to output the class probability and coarse bounding box of the large target;

[0044] The prediction results of large targets are filtered by confidence, and bounding boxes with confidence scores greater than or equal to a set confidence threshold are retained; the coarse bounding boxes of large targets at the feature map scale are mapped back to the original image scale to obtain the coarse localization bounding boxes of large targets at the original image size;

[0045] Based on the coarse positioning bounding box of the large target in the original image, the area is expanded outward by a set ratio as the spatial anchor point area, and then the small target is finely decoded within this spatial anchor point area.

[0046] The spatial anchor point regions at the original image scale are mapped to the shallow feature map scale to obtain the spatial anchor point regions at the feature map scale; only the regional features corresponding to the spatial anchor point regions in the shallow feature map are retained, and the background features outside the regions are removed to obtain the cropped shallow focused feature map.

[0047] Based on the cropped shallow focused feature map, the aligned features are fused by upsampling alignment, channel concatenation and detail enhancement convolution to obtain a fused feature map for small target detection.

[0048] Select, from the corrected query vectors, the query vectors related to small targets to obtain the set of small-target-related query vectors;

[0049] The fused small target detection feature map and the small target-related query vector set are input into the Transformer decoder to obtain the attention output features of the small target fine decoding stage;

[0050] The feature vector after attention interaction is used by the prediction head to output the class probability and precise bounding box of the small target;

[0051] The prediction results of small targets are filtered by confidence, and bounding boxes with confidence scores greater than or equal to a set confidence threshold are retained; the accurate bounding boxes of small targets at the feature map scale are mapped back to the original image scale to obtain the accurate localization bounding boxes of small targets at the original image scale.

[0052] Based on prior spatial information, the large target on which each small target depends is identified, thus obtaining the target association relationships.
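The geometric part of the hierarchical decoding (expanding a coarse large-target box into a spatial anchor region, mapping it to the shallow feature map scale, and associating small targets with large ones) can be sketched as below. The expansion ratio of 0.2, the 640×640 image size, the stride of 4, and the centre-containment rule used for association are all hypothetical choices; the text only states that a set ratio and prior spatial information are used.

```python
def expand_box(box, ratio, img_w, img_h):
    """box = (x1, y1, x2, y2) in original-image pixels.

    Each side is pushed outward by ratio * box size and clipped to the image.
    """
    x1, y1, x2, y2 = box
    dw, dh = (x2 - x1) * ratio, (y2 - y1) * ratio
    return (max(0.0, x1 - dw), max(0.0, y1 - dh),
            min(float(img_w), x2 + dw), min(float(img_h), y2 + dh))

def to_feature_scale(box, stride):
    """Map original-image coordinates to a feature map with the given stride."""
    return tuple(v / stride for v in box)

def associate(small_boxes, large_regions):
    """For each small box, return the index of the first region containing its centre, else None."""
    owners = []
    for sx1, sy1, sx2, sy2 in small_boxes:
        cx, cy = (sx1 + sx2) / 2.0, (sy1 + sy2) / 2.0
        owner = None
        for idx, (x1, y1, x2, y2) in enumerate(large_regions):
            if x1 <= cx <= x2 and y1 <= cy <= y2:
                owner = idx
                break
        owners.append(owner)
    return owners

large_box = (100.0, 200.0, 300.0, 400.0)           # coarse box of a large target
region = expand_box(large_box, ratio=0.2, img_w=640, img_h=640)
region_fm = to_feature_scale(region, stride=4)     # region on the shallow feature map
small_boxes = [(320.0, 410.0, 336.0, 426.0), (500.0, 50.0, 510.0, 60.0)]
owners = associate(small_boxes, [region])
```

Here the first small box lies just outside the coarse large-target box but inside the expanded region, so it is associated with that large target; the second lies far away and gets no owner.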

[0053] Preferably, the target detection results output by hierarchical decoding are post-processed and optimized. Invalid predictions are filtered out and the detection results are corrected by class-aware NMS deduplication and spatial relationship constraints. Finally, a complete detection result containing target category, target attribute, bounding box and confidence is formed. The target detection results output by hierarchical decoding include coarse localization results of large targets, fine localization results of small targets and target association relationships.
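Class-aware NMS, as opposed to class-agnostic NMS, only suppresses overlapping boxes that share the same category. A minimal greedy sketch is given below; the IoU threshold of 0.5 and the example detections are hypothetical, and the spatial relationship constraints mentioned in the text are not modelled.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def class_aware_nms(dets, iou_thr=0.5):
    """dets: list of dicts with keys 'box', 'score', 'cls'.

    Greedy NMS in descending score order; a detection is dropped only if it
    overlaps an already kept detection of the same class above iou_thr.
    """
    kept = []
    for d in sorted(dets, key=lambda d: d["score"], reverse=True):
        if all(k["cls"] != d["cls"] or iou(k["box"], d["box"]) <= iou_thr for k in kept):
            kept.append(d)
    return kept

detections = [
    {"box": (0.0, 0.0, 10.0, 10.0), "score": 0.9, "cls": "insulator"},
    {"box": (1.0, 1.0, 10.0, 10.0), "score": 0.8, "cls": "insulator"},  # duplicate
    {"box": (1.0, 1.0, 10.0, 10.0), "score": 0.7, "cls": "shackle"},    # other class
]
kept = class_aware_nms(detections)
```

The second insulator box overlaps the first with IoU 0.81 and is removed, while the shackle box at the same location survives because it belongs to a different class.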

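[0054] Preferably, the improved RT-DETR model employs multi-task collaborative training, with a total loss function $L_{total}$:

[0055] $L_{total} = \lambda_1 L_{box} + \lambda_2 L_{focal} + \lambda_3 L_{spatial} + \lambda_4 L_{freq} + \lambda_5 L_{attr}$, where $\lambda_1$ to $\lambda_5$ are the loss weights of the respective terms;

[0056] $L_{box}$ is the improved sum of GIoU loss and L1 loss:

[0057] $L_{box} = \lambda_{GIoU} L_{GIoU}(\hat{b}, b) + \lambda_{L1} L_{L1}(\hat{b}, b)$, with $L_{L1}(\hat{b}, b) = \sum_{t=1}^{4} |\hat{b}_t - b_t|$;

[0058] where $\lambda_{GIoU}$ is the weight of the GIoU loss; $L_{GIoU}$ is the generalized intersection-over-union loss; $\lambda_{L1}$ is the weight of the L1 loss; $L_{L1}$ is the L1 loss; $\hat{b}$ is the target bounding box predicted by the model; $b$ is the ground-truth target bounding box labeled in the dataset; $\hat{b}_t$ is the $t$-th component of the predicted bounding box $\hat{b}$; $b_t$ is the $t$-th component of the ground-truth bounding box $b$; the bounding box contains 4 components, namely the center coordinates, width, and height of the bounding box;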
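A weighted sum of GIoU loss and L1 loss over boxes in (center x, center y, width, height) form can be sketched as follows. The weights `w_giou=2.0` and `w_l1=5.0` are hypothetical defaults, not values from the source, and the sketch assumes both boxes have strictly positive width and height.

```python
def to_corners(b):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = b
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)

def giou_loss(pred, gt):
    """1 - GIoU, where GIoU = IoU - (enclosing area - union) / enclosing area."""
    p, g = to_corners(pred), to_corners(gt)
    inter_w = max(0.0, min(p[2], g[2]) - max(p[0], g[0]))
    inter_h = max(0.0, min(p[3], g[3]) - max(p[1], g[1]))
    inter = inter_w * inter_h
    union = pred[2] * pred[3] + gt[2] * gt[3] - inter
    # Smallest axis-aligned box enclosing both boxes.
    enclose = (max(p[2], g[2]) - min(p[0], g[0])) * (max(p[3], g[3]) - min(p[1], g[1]))
    giou = inter / union - (enclose - union) / enclose
    return 1.0 - giou

def l1_loss(pred, gt):
    """Sum of absolute differences over the 4 box components."""
    return sum(abs(p - g) for p, g in zip(pred, gt))

def box_loss(pred, gt, w_giou=2.0, w_l1=5.0):
    """Weighted sum of GIoU loss and L1 loss (weights are hypothetical)."""
    return w_giou * giou_loss(pred, gt) + w_l1 * l1_loss(pred, gt)

same = box_loss((0.5, 0.5, 0.2, 0.2), (0.5, 0.5, 0.2, 0.2))
disjoint_giou = giou_loss((1.0, 1.0, 2.0, 2.0), (5.0, 1.0, 2.0, 2.0))
```

Unlike plain IoU, the GIoU term still provides a useful signal for non-overlapping boxes: the disjoint pair above has IoU 0 but a GIoU loss of 4/3, which shrinks as the boxes move closer.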

[0059] $L_{focal}$ is the focal loss:

[0060] $L_{focal} = -\sum_{c=1}^{K} \alpha_c (1 - p_c)^{\gamma} \log(p_c)$;

[0061] where $K$ is the total number of categories of power grid targets; $\alpha_c$ is the class-balanced weight for the $c$-th class; $p_c$ is the probability predicted by the model that the target belongs to the $c$-th class; and $\gamma$ is the adjustment coefficient for easy and difficult samples;
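The effect of the class-balanced weight and the easy/difficult adjustment coefficient can be seen from a single focal term. The sketch below evaluates one such term for a predicted probability `p` of the annotated class; how the terms are aggregated over classes and samples is left out, and the values `alpha=1.0` and `gamma=2.0` are hypothetical.

```python
import math

def focal_term(p, alpha=1.0, gamma=2.0):
    """One focal-loss term: -alpha * (1 - p)**gamma * log(p), for 0 < p <= 1."""
    return -alpha * (1.0 - p) ** gamma * math.log(p)

easy = focal_term(0.9)   # confidently correct prediction
hard = focal_term(0.1)   # poorly classified sample
plain_ce = focal_term(0.5, gamma=0.0)  # gamma = 0 reduces to cross-entropy
```

With `gamma=2.0`, the easy sample contributes roughly three orders of magnitude less than the hard one, which is how the loss shifts training effort toward difficult samples; a larger `alpha` for a rare class scales its term up proportionally.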

[0062] $L_{spatial}$ is the spatial relationship consistency loss:

[0063] ;

[0064] where N is the number of targets detected in a single image, and the two quantities compared are the spatial relationship value between the i-th target and the j-th target predicted by the model and the true spatial relationship value between the i-th target and the j-th target labeled in the dataset;

[0065] $L_{freq}$ is the frequency domain fidelity loss:

[0066] ;

[0067] where the terms denote, respectively, the Fourier transform operation, the spatial domain feature map output by the model, and the spatial domain feature map generated based on the real annotations in the dataset;
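A frequency domain fidelity loss compares the Fourier transforms of the predicted and reference feature maps. The sketch below uses a naive 2D discrete Fourier transform and the mean absolute difference of the spectra; the choice of an L1-style distance and the averaging are assumptions, since the text does not state which norm is used.

```python
import cmath

def dft2(img):
    """Naive 2D discrete Fourier transform of an H x W real-valued map."""
    H, W = len(img), len(img[0])
    out = [[0j] * W for _ in range(H)]
    for u in range(H):
        for v in range(W):
            acc = 0j
            for y in range(H):
                for x in range(W):
                    angle = -2.0 * cmath.pi * (u * y / H + v * x / W)
                    acc += img[y][x] * cmath.exp(1j * angle)
            out[u][v] = acc
    return out

def frequency_fidelity_loss(pred, target):
    """Mean absolute difference between the two spectra (assumed L1-style distance)."""
    fp, ft = dft2(pred), dft2(target)
    H, W = len(pred), len(pred[0])
    return sum(abs(fp[u][v] - ft[u][v]) for u in range(H) for v in range(W)) / (H * W)

impulse = [[1.0, 0.0], [0.0, 0.0]]   # sharp detail: flat spectrum of ones
blank = [[0.0, 0.0], [0.0, 0.0]]
loss_missing_detail = frequency_fidelity_loss(blank, impulse)
loss_identical = frequency_fidelity_loss(impulse, impulse)
```

A single sharp pixel spreads its energy evenly over all frequencies, so a prediction that misses it is penalized at every frequency bin, including the high-frequency ones that carry small-target detail. The naive transform is quadratic in the number of pixels and is only meant for small illustrative inputs.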

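[0068] $L_{attr}$ is the classification loss for the target attribute:

[0069] $L_{attr} = -\frac{1}{M} \sum_{m=1}^{M} \left[ w_{pos}\, y_m \log(p_m) + w_{neg}\, (1 - y_m) \log(1 - p_m) \right]$;

[0070] where $M$ is the total number of instances with defect annotations in a single image; $y_m$ is the true defect label of the $m$-th instance; $p_m$ is the probability predicted by the model that the $m$-th instance has a defect; $w_{pos}$ is the positive sample weight; and $w_{neg}$ is the negative sample weight.

[0071] Each loss term in the total loss function is scaled by its corresponding loss weight;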
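A weighted binary cross-entropy is one common way to realize an attribute (defect) classification loss with separate positive and negative sample weights, and the total loss is then a weighted sum of the individual terms. The sketch below assumes that form; the averaging over instances, the weight values, and the loss-term names in the dictionaries are hypothetical.

```python
import math

def attribute_loss(labels, probs, w_pos=2.0, w_neg=1.0, eps=1e-12):
    """Weighted binary cross-entropy averaged over annotated instances.

    labels: 0/1 defect labels; probs: predicted defect probabilities.
    """
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1.0 - p))
    return total / len(labels)

def total_loss(terms, weights):
    """Weighted sum of named loss terms."""
    return sum(weights[name] * value for name, value in terms.items())

missed_defect = attribute_loss([1], [0.5])     # positive sample, weight 2.0
false_alarm = attribute_loss([0], [0.5])       # negative sample, weight 1.0
combined = total_loss({"box": 1.0, "attr": 2.0}, {"box": 0.5, "attr": 0.25})
```

With `w_pos` larger than `w_neg`, an uncertain prediction on a defective instance costs twice as much as the same uncertainty on a normal one, which counteracts the scarcity of defect samples.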

[0072] The improved RT-DETR model adopts a progressive training process: the first stage freezes the backbone network and trains only the feature enhancement pyramid and the detection head; the second stage unfreezes the backbone network and performs end-to-end tuning; the third stage optimizes for difficult samples in the dataset, including small targets, occluded targets, and foggy samples.

[0073] The present invention also provides a readable storage medium having a computer program stored thereon, which, when executed, implements the aforementioned power grid target detection method based on an improved RT-DETR model.

[0074] The present invention also provides an electronic device, which includes a processor, a memory, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the aforementioned power grid target detection method based on an improved RT-DETR model.

[0075] The advantages of this invention are:

[0076] (1) This invention optimizes the RT-DETR model for the characteristics of power grid targets, and provides an improved RT-DETR model suitable for power grid target detection, which improves its detection performance in power grid inspection scenarios, solves the problems of inappropriate receptive field, loss of small target features, insufficient targeting of detection head and low training and inference efficiency in the existing technology, and achieves higher detection accuracy, faster convergence speed and better real-time performance.

[0077] (2) This invention improves the backbone network and proposes an adaptive spatially aware convolutional module (ASPC). Through multi-branch dilated convolution, scale attention mechanism and deformable convolution, the receptive field is dynamically adjusted to adapt to the changes in the target scale of the power grid.

[0078] (3) The present invention designs a feature enhancement pyramid (PEFP), which integrates high-frequency feature protection, cross-scale bidirectional fusion and spatially constrained attention, enhances small target features and utilizes spatial priors to optimize feature representation.

[0079] (4) The present invention improves the detection head structure, introduces IoU-aware query selection and power grid target space prior, improves small target recall rate and positioning accuracy, and accelerates decoder convergence.

[0080] (5) The present invention optimizes the training strategy (multi-task collaborative training, progressive training) and inference process (adaptive inference acceleration, dedicated post-processing) to balance detection accuracy and efficiency.

Attached Figure Description

[0081] Figure 1 is an architecture diagram of the improved RT-DETR model of the present invention.

[0082] Figure 2 is an architecture diagram of the existing RT-DETR model.

[0083] Figure 3 is a flowchart of a power grid target detection method based on an improved RT-DETR model according to the present invention.

Detailed Implementation

[0084] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0085] The architecture of the existing RT-DETR model is shown in Figure 2. An analysis of existing RT-DETR models reveals the following core problems when they are directly applied to power grid target detection: the fixed receptive field convolution of the original backbone network cannot adapt to the large scale changes of power grid targets, resulting in insufficient global context capture for large targets and loss of detailed features for small targets; traditional feature pyramids suffer information loss during cross-scale feature fusion, particularly in preserving high-frequency detailed features of small targets, and fail to utilize the prior spatial layout of power equipment; the initial query of the detection head relies solely on classification scores, leading to low recall for small targets; the decoder lacks prior spatial knowledge of power grid targets, resulting in slow convergence and insufficient localization accuracy; the training process is not optimized for power grid scenarios, exhibiting class imbalance and poor generalization ability in complex scenarios; and the inference stage does not dynamically allocate computational resources based on scenario complexity, so its efficiency needs improvement.

[0086] Therefore, this invention provides an improved RT-DETR model suitable for power grid target detection, as shown in Figure 1, which includes:

[0087] An improved backbone network is proposed, which uses an adaptive spatially aware convolutional module (ASPC) to dynamically adjust the receptive field to adapt to changes in the target scale of the power grid through multi-branch dilated convolution, scale attention mechanism and deformable convolution. The input image is fed into the backbone network and then downsampled sequentially to obtain multi-scale downsampled feature maps. Each downsampled feature map is then processed by the adaptive spatially aware convolutional module to obtain an enhanced multi-scale feature map.

[0088] The Feature Enhancement Pyramid (PEFP) is designed to integrate high-frequency feature protection, cross-scale bidirectional fusion, and spatially constrained attention to enhance the features of small targets and optimize feature representation using spatial priors. The enhanced multi-scale feature map output by the backbone network is then passed through the Feature Enhancement Pyramid to obtain the enhanced multi-scale feature map.

[0089] The detection head structure is improved by introducing IoU-aware query selection and power grid target space prior, which enhances the recall rate and localization accuracy of small targets and accelerates decoder convergence. The training strategy (multi-task collaborative training, progressive training) and inference process (adaptive inference acceleration, dedicated post-processing) are optimized to balance detection accuracy and efficiency. Specifically, IoU-aware query selection is used to obtain the initial query vector from the feature-enhanced multi-scale feature map. The decoder corrects the query vector based on the power grid target space prior and employs a hierarchical decoding strategy to output the power grid target detection results.

[0090] This embodiment uses a drone and a camera to collect images of a 500kV high-voltage transmission line in a certain area, and constructs the InsPLAD dataset. The dataset contains 10,607 images, covering 17 types of power equipment (power grid targets), with a total of 28,933 instances, demonstrating good diversity and annotation quality.

[0091] The InsPLAD dataset contains large targets with an average pixel area exceeding 700² px (such as helical dampers, glass insulators, tower signage, polymer insulators, and iron hoops) as well as small targets with an average pixel area of only about 150² px (such as various miniature shackles and Stockbridge dampers). Unlike traditional datasets that only label hardware in general terms, the InsPLAD dataset employs an extremely refined classification system. The category names themselves contain spatial location and size boundary information. Spatial location distinction: Taking polymer insulators as an example, the InsPLAD dataset does not uniformly label their auxiliary hardware as shackles, but rather clearly subdivides it into three independent categories based on physical connection location: upper shackle, lower shackle, and tower shackle. Size boundary classification: Taking glass insulators as an example, the InsPLAD dataset does not uniformly label their auxiliary hardware as shackles, but rather clearly subdivides it into three independent categories based on boundary size: large shackle, small shackle, and tower shackle.

[0092] The object detection subsets in the InsPLAD dataset are shown in Table 1 below:

[0093] Table 1 Target Detection Subset

[0094]

[0095] The target attribute (defect or anomaly) detection subset in the InsPLAD dataset is shown in Table 2 below:

[0096] Table 2 Target Attribute Detection Subset

[0097]

[0098] The InsPLAD dataset in this embodiment covers multiple environments and multi-angle scenarios. The improved RT-DETR model is trained using the InsPLAD dataset and combined with the environment adaptation strategy in the inference stage, so that the model can maintain stable performance under different shooting angles, weather conditions and terrain scenarios. It has strong adaptability to multiple scenarios and is suitable for all scenarios of power grid inspection.

[0099] The improvements to the RT-DETR model in this invention specifically include the following:

[0100] 1. Improve the backbone network

[0101] Adaptive Spatial Aware Convolutional Module (ASPC) is used to replace some of the standard convolutions in the backbone network. The ASPC module dynamically adjusts the receptive field to adapt to the changes in the target scale of the power grid through multi-branch dilated convolution, scale attention mechanism and deformable convolution, and the module parameters are optimized based on the multi-scale sample characteristics of the InsPLAD dataset.

[0102] The processing procedure of the Adaptive Spatial Aware Convolutional Module (ASPC) is as follows:

[0103] (1) Multi-branch dilated convolution: The input feature map is processed in parallel by multiple branches, and the output is a multi-branch feature map, which is adapted to the scale range of large to small objects in the InsPLAD dataset. In this invention, there are three branches, namely three 3×3 convolutional layers with dilation rates of 1, 2, and 3, respectively. The input feature map is the downsampled feature map obtained by passing the original input image through the ResNet-50 (backbone network) of the model.

[0104] (2) Scale attention weight generation: Global average pooling is performed on the input feature map, and the weights of each branch are generated through the fully connected layer. After softmax normalization, the weight vector is obtained, so that the model pays more attention to the small target features in the InsPLAD dataset.

[0105] By capturing global information, learning the importance of branches, and normalizing weights, the model automatically assigns higher weights to branches that can effectively extract features of small targets. When fusing features from multiple branches, this allows the model to highlight the details of small targets and to mitigate the suppression of small-target features by large targets.

[0106] The specific process of generating scale attention weights is as follows: global average pooling is performed on the input feature map to compress the complex two-dimensional image features (H×W×C) into a one-dimensional global feature vector (1×1×C); the one-dimensional global feature vector is input into the fully connected layer, and finally an initial weight vector with a dimension equal to the number of branches is output; softmax normalization is performed on the initial weight vector output by the fully connected layer to obtain the normalized weight vector [ω1,ω2,ω3], which satisfies ω1+ω2+ω3=1, and each weight is between 0 and 1. ω1, ω2, and ω3 are the weights of the three branches, respectively.

[0107] This invention specifically designs a branch with a dilation rate of 1, which is a key physical channel for capturing high-frequency details of small targets. During training, because the InsPLAD dataset contains a large number of small targets whose details are easily lost, the model discovers that when encountering such small target features, increasing the weight value corresponding to the branch with a dilation rate of 1 leads to more accurate detection and lower loss. Therefore, the trained fully connected layer can learn this pattern. Once global average pooling indicates that the image contains complex texture details, the fully connected layer generates a high weight biased towards the branch with the small dilation rate. This significantly enhances the detail information of small targets in the final fused feature map, while relatively suppressing background noise, thus solving the problem of easily lost small target features.

[0108] (3) Multi-branch feature fusion: The multi-branch feature maps are weighted and summed according to the learned weights to obtain the fused feature map. The formula is:

[0109] $F_{fuse} = \sum_{i} \omega_i F_i$,

[0110] where $\omega_i$ is the weight of the i-th branch and $F_i$ is the feature map output by the i-th branch; i=1,2,3 in this embodiment.

[0111] (4) Deformable convolution: the dynamic offsets of the sampling points are predicted by a 1×1 convolutional layer, $\Delta p = f_{offset}(F_{fuse}; \theta)$,

[0112] where $\theta$ denotes the learnable parameters of the model and $f_{offset}$ is the offset prediction function. Based on the predicted dynamic offsets $\Delta p$ of the sampling points, deformable convolution is performed on the fused feature map $F_{fuse}$ to obtain the deformable convolution feature map:

[0113] $F_{dcn}(p_0) = \sum_{k=1}^{K} w_k \cdot F_{fuse}(p_0 + p_k + \Delta p_k)$;

[0114] where $F_{dcn}$ is the output feature map of the deformable convolution; $p_0$ is a position coordinate on the deformable convolution output feature map; $p_k$ is the fixed original offset of the k-th sampling point; $w_k$ is the convolution kernel weight of the k-th sampling point; K is the total number of sampling points of the convolution kernel (i.e., the size of the convolution kernel), with K=1 in this embodiment; k is the index of the sampling points of the convolution kernel; and $\Delta p_k$ is the dynamically predicted offset for the k-th sampling point.

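[0115] (5) Through a residual connection between the output feature map of the deformable convolution and the fused feature map, the final output feature map enhanced by the adaptive spatially aware convolutional module is obtained as follows: $F_{out} = F_{dcn} + F_{fuse}$.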
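Steps (3) to (5) can be illustrated on a single-channel feature map. The sketch below is a simplified assumption-laden example, not the module itself: it uses plain Python lists, bilinear interpolation for fractional sampling positions (the usual choice for deformable convolution), zero values outside the map, and hand-written offsets in place of offsets predicted by a 1×1 convolution. All function names are hypothetical.

```python
import math

def fuse_branches(branch_maps, weights):
    """Step (3): weighted sum of same-sized H x W branch maps."""
    h, w = len(branch_maps[0]), len(branch_maps[0][0])
    return [[sum(wt * fm[y][x] for wt, fm in zip(weights, branch_maps))
             for x in range(w)] for y in range(h)]

def bilinear_sample(fmap, y, x):
    """Read fmap at a fractional (y, x); positions outside the map count as 0."""
    h, w = len(fmap), len(fmap[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                value += (1 - abs(y - yy)) * (1 - abs(x - xx)) * fmap[yy][xx]
    return value

def deformable_conv(fmap, kernel_offsets, kernel_weights, dynamic_offsets):
    """Step (4): dynamic_offsets[y][x][k] is the (dy, dx) shift of sampling point k."""
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for k, ((py, px), wk) in enumerate(zip(kernel_offsets, kernel_weights)):
                dy, dx = dynamic_offsets[y][x][k]
                out[y][x] += wk * bilinear_sample(fmap, y + py + dy, x + px + dx)
    return out

def residual_add(a, b):
    """Step (5): element-wise sum of two same-sized maps."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]

# Toy data: three 2x2 branch outputs and K=1 sampling point, as in the embodiment.
branches = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]], [[2.0, 2.0], [2.0, 2.0]]]
fused = fuse_branches(branches, [0.5, 0.25, 0.25])
offsets = [[[(0.0, 0.5)] for _ in range(2)] for _ in range(2)]  # shift half a pixel right
dcn = deformable_conv(fused, [(0, 0)], [1.0], offsets)
enhanced = residual_add(dcn, fused)
```

With zero offsets and a unit kernel weight the deformable step returns the fused map unchanged, which is a convenient sanity check for the sampling code.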

[0116] The ASPC module of this invention combines the multi-scale sample characteristics of the InsPLAD dataset, dynamically adjusts the receptive field and frequency domain weight allocation, and takes into account both the global context of large targets and the detailed features of small targets, thus solving the problem of large differences in the scale of power grid targets.

[0117] After preprocessing, the input image is fed into the backbone network (ResNet-50) and downsampled sequentially to obtain multi-scale downsampled feature maps. The corresponding downsampling ratios are 1/2, 1/4, 1/8, and 1/16, respectively. Each downsampled feature map is then processed by the ASPC module to obtain the enhanced multi-scale feature maps P2, P3, P4, and P5.

[0118] 2. Design Feature Enhancement Pyramid (PEFP)

[0119] The original feature pyramid is replaced by PEFP. PEFP integrates high-frequency feature protection, cross-scale bidirectional fusion and spatial constraint attention. First, high-frequency feature protection strengthens the edge information of small targets that are easily lost. Then, through cross-scale bidirectional fusion and spatial constraint attention mechanism, spatial priors (such as the co-occurrence relationship of power equipment) are injected into the features to strengthen the features of small targets and optimize the feature representation by using spatial priors. This is suitable for the characteristics of high proportion of small targets and fixed spatial layout in the InsPLAD dataset.

[0120] (1) High-frequency feature protection:

[0121] High-frequency enhancement is performed on the shallow feature maps $P_j$ (j=2,3; P2 and P3 correspond to small target features in the InsPLAD dataset) to strengthen the high-frequency details of small targets (such as shackles and Stockbridge dampers) in the InsPLAD dataset, resulting in the high-frequency-enhanced shallow feature map $P_j^{enh}$.

[0122] High-pass filtering (such as the Sobel operator or the Laplacian operator) is applied to the shallow feature map $P_j$ (j=2,3) to retain only high-frequency details (edges, textures) and filter out low-frequency background information, giving the high-pass filtered feature map $P_j^{hp}$.

[0123] The high-frequency attention weight map of the shallow features is calculated as $A_j = \sigma(Conv_{1\times1}(Concat(P_j^{enh}, P_j^{hp})))$,

[0124] where $Conv_{1\times1}$ denotes a convolution operation with a kernel size of 1×1, used for feature fusion and channel adjustment; $Concat$ denotes channel concatenation; and $\sigma$ is the sigmoid activation function.

[0125] By weighted summation based on the high-frequency attention weight map and a residual connection, the final shallow feature map after high-frequency feature protection is obtained.
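The high-frequency protection steps can be sketched on a single-channel map. Several choices below are assumptions made only for illustration: a 3×3 Laplacian kernel with zero padding as the high-pass filter, two scalar coefficients and a bias standing in for the 1×1 convolution over the concatenated channels, the unenhanced map used in place of the enhanced one, and the output formed as the input plus the gated high-pass response.

```python
import math

# A common 4-neighbour Laplacian kernel (one possible high-pass filter).
LAPLACIAN = [[0, -1, 0], [-1, 4, -1], [0, -1, 0]]

def high_pass(fmap, kernel=LAPLACIAN):
    """3x3 filtering with zero padding; flat interior regions give 0."""
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for ky in range(3):
                for kx in range(3):
                    yy, xx = y + ky - 1, x + kx - 1
                    if 0 <= yy < h and 0 <= xx < w:
                        out[y][x] += kernel[ky][kx] * fmap[yy][xx]
    return out

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def protect_high_frequency(fmap, w_feat=1.0, w_hp=1.0, bias=0.0):
    """Gate the high-pass response with a sigmoid weight, then add it back (residual)."""
    hp = high_pass(fmap)
    out = []
    for row_f, row_h in zip(fmap, hp):
        out_row = []
        for f, g in zip(row_f, row_h):
            a = sigmoid(w_feat * f + w_hp * g + bias)  # stand-in for the 1x1 convolution
            out_row.append(f + a * g)                  # assumed form of the residual sum
        out.append(out_row)
    return out

flat = [[1.0] * 5 for _ in range(5)]
spike = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
protected = protect_high_frequency(spike)
```

The isolated spike, which plays the role of a small-target detail, is amplified, while the interior of the flat map produces no high-pass response.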

[0126] (2) Design two paths, one from top to bottom and one from bottom to top, to perform cross-scale bidirectional fusion:

[0127] The top-down path (P5→P4→P3→P2) refines the high-level semantic features (corresponding to the semantic information of large targets in the InsPLAD dataset) level by level, for z = 2, 3, 4, 5, using an upsampling operation and a convolution operation with a kernel size of 3×3; P2 and P3 are the shallow feature maps after high-frequency feature protection.

[0128] The bottom-up path (P2→P3→P4→P5) enhances the low-level detail features (corresponding to small object detail features in the InsPLAD dataset) across scales, for z = 2, 3, 4, 5, using a downsampling operation.

[0129] Learnable fusion weights are introduced to balance the contributions of each scale, and the multi-scale cross-scale bidirectional fused feature maps are obtained based on the fusion weights;

[0130] the fusion weights are learnable and can be adjusted based on the detection difficulty of targets at different scales in the InsPLAD dataset; in the fusion formula, GAP represents global average pooling, and MLP represents a multilayer perceptron.

[0131] (3) Spatial Constraint Attention: Multi-scale attention maps are generated using spatial prior information obtained from bounding box position statistics of the object detection annotations in the InsPLAD dataset (such as the connection relationship between insulator strings and shackles, and the coexistence relationship between tower body signs and tower structures). Each attention map is multiplied with the cross-scale bidirectional fused feature map of the corresponding scale to enhance the response of the target region and suppress background interference. Through residual connections, the final multi-scale feature map enhanced by the feature enhancement pyramid is obtained.

[0132] The Feature Enhancement Pyramid (PEFP) of this invention protects high-frequency features and performs cross-scale bidirectional fusion, specifically adapting to the high proportion of small targets in the InsPLAD dataset, reducing the loss of small target features and significantly improving the detection accuracy of small targets; spatial constraint attention enhances the response of target regions, improving the recall rate of small targets.

[0133] 3. Improve the detection head

[0134] This invention is based on the original Transformer decoder of the RT-DETR model and incorporates fine annotations (small target bounding boxes) from the InsPLAD dataset to make two core improvements: introducing IoU-aware query selection and power grid target space priors to improve the recall rate and localization accuracy of small targets and accelerate decoder convergence.

[0135] (1) IoU-aware query selection:

[0136] Initial query vectors (adapted to multi-scale targets in the InsPLAD dataset) are obtained from the enhanced multi-scale feature maps of layers z=2, 3, 4, 5, and a multi-scale dedicated query set is constructed. Query selection considers both the classification score and the IoU value between the predicted and ground-truth bounding boxes to improve recall for small targets, especially the detection of defects and anomalies in the InsPLAD dataset. IoU-aware query selection ensures that the selected queries are not only high-confidence but also relatively accurately localized.

[0137] The process by which IoU-aware query selection obtains the initial query vectors is as follows:

[0138] First, in each layer of feature maps Above, dense anchor points are generated according to a preset step size. The number of anchor points is related to the feature map size, and the formula is:

[0139] ,

[0140] where the symbols denote, in order: the width and height of the z-th layer feature map; the preset width and preset height of the anchor points on the z-th layer; the center coordinates of an anchor point on the z-th layer feature map; and the set of anchor points on the z-th layer feature map.

[0141] For each anchor point, a feature vector is extracted from the corresponding layer's feature map using the RoI Align (Region of Interest) operation, serving as the initial candidate query. The formula is as follows:

[0142] ,

[0143] where the anchor point belongs to the z-th layer and the extracted feature vector is the one corresponding to that anchor point; the candidate query vectors of all layers together constitute the candidate query vector set.

[0144] The set of candidate query vectors is input to the prediction head (a lightweight FFN) to obtain the classification score and the predicted bounding box of each candidate query vector q. The IoU value between each predicted bounding box and its corresponding ground-truth bounding box in the dataset is then calculated; the IoU value is 0 if the candidate query vector has no matching ground-truth bounding box (background anchor point).

[0145] A joint scoring function is designed that integrates classification scores and IoU values to highlight queries with both high confidence and high localization accuracy. The formula is as follows:

[0146] ,

[0147] where the coefficient applied to the IoU term is the IoU weighting coefficient; the candidate query vectors are sorted in descending order of joint score, and the top K query vectors are selected as the initial query vectors.
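The selection step can be sketched as follows. The exact joint scoring function is not legible in this text, so the example assumes a simple weighted sum, classification score plus `iou_weight` times IoU, and matches each prediction to its best-overlapping ground-truth box; both are illustrative assumptions, as are the function names and toy values.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def select_queries(candidates, gt_boxes, top_k, iou_weight=0.5):
    """candidates: list of (cls_score, pred_box). Returns indices of the top_k joint scores."""
    scored = []
    for idx, (cls_score, box) in enumerate(candidates):
        # Background candidates (no ground-truth overlap) get an IoU of 0.
        best_iou = max((iou(box, gt) for gt in gt_boxes), default=0.0)
        scored.append((cls_score + iou_weight * best_iou, idx))  # assumed weighted sum
    scored.sort(key=lambda item: -item[0])
    return [idx for _, idx in scored[:top_k]]

candidates = [
    (0.9, (0, 0, 10, 10)),         # confident but far from any ground truth
    (0.6, (20, 20, 30, 30)),       # less confident, perfectly localized
    (0.7, (100, 100, 110, 110)),   # background
]
ground_truth = [(20, 20, 30, 30)]
selected = select_queries(candidates, ground_truth, top_k=2)
```

With the IoU term switched off (`iou_weight=0`), the ranking falls back to classification score alone, which shows how the IoU term promotes the well-localized candidate.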

[0148] (2) Decoder based on prior modeling of power grid target space:

[0149] The initial query vector is corrected by incorporating spatial prior information statistically collected from the InsPLAD dataset (spatial prior information obtained from the statistical analysis of bounding box positions based on object detection annotations, such as the connection relationship between insulator strings and shackles, and the coexistence relationship between tower identification signs and tower structures).

[0150] ,

[0151] where the terms are: the spatial prior distribution map, generated based on the spatial relationship statistics of tower identification signs, poles, insulator strings and shackles in the InsPLAD dataset; the weighting coefficients; the initial query vector; and the corrected query vector, i.e., the initial query vector incorporating prior spatial information, which serves as the input to the decoder.

[0152] Simultaneously, a hierarchical decoding strategy is employed for power grid target detection, as detailed below:

[0153] Large targets such as towers are coarsely decoded based on the deep feature maps; semantic complementarity is achieved through downsampling alignment and channel concatenation, as shown in the formula:

[0154] ,

[0155] where Downsample is a bilinear interpolation downsampling operation, the concatenation operator denotes channel concatenation, and the result is the fused feature map dedicated to large target detection.

[0156] Query vectors related to large targets are selected from the corrected query vectors to reduce the interference of small target queries on coarse decoding and improve decoding speed. The set of query vectors related to large targets is as follows:

[0157] ,

[0158] where the threshold is the coarse decoding query filtering threshold, and the probability term is the spatial prior probability of the query vector q.

[0159] The fused feature map for large target detection and the set of query vectors related to large targets are input to the Transformer decoder, which uses a self-attention mechanism to match the query vectors with the deep features. The formula is as follows:

[0160] ,

[0161] where the key and value vectors are obtained through linear transformation; the remaining terms are the activation function, the dimension of the key vector, and the adjacency matrix of spatial relationships for large targets (elements are 1 only between large target pairs, such as tower-polymer insulator, glass insulator-iron hoop, etc., and 0 for the rest).

[0162] The feature vectors after attention interaction are passed to the lightweight prediction head (FFN), which outputs the class probabilities and coarse bounding boxes for large targets.

[0163] The probability of the large target category is:

[0164] ,

[0165] where the terms are: the confidence level of the large target category corresponding to the query vector q; the feature vector of the query vector q output from the coarse decoding stage of the large target; the classification feedforward network for the coarse decoding stage of the large target; and the activation function;

[0166] The rough bounding box of the large target is as follows:

[0167] , ,

[0168] where the terms are: the center coordinates of the bounding box; the width and height of the bounding box; and the bounding box regression feedforward network for the coarse decoding stage of the large target.

[0169] The large target prediction results are filtered by confidence: bounding boxes with a confidence level greater than or equal to a set confidence threshold are retained, and low-confidence background bounding boxes are removed, using the following formula:

[0170] .

[0171] The retained coarse bounding boxes of the large targets are mapped from feature map scale coordinates to the original image scale. The formula is:

[0172] ,

[0173] where the scaling term is the downsampling factor of the feature map relative to the original image, and the result is the coarse localization bounding box of the large target at the scale of the original image.

[0174] Taking the coarse localization bounding box of the large target at the original image scale as the basis, the box is extended outwards by 20% to serve as the spatial anchor point area. The formula is:

[0175] ,

[0176] Small targets are subsequently decoded only inside the spatial anchor point area, which reduces unnecessary calculations.

[0177] The spatial anchor point area at the original image scale is mapped to the scale of the shallow feature maps. The formula is:

[0178] ,

[0179] where the scaling term is the downsampling factor of the z-th layer feature map relative to the original image, and the result is the spatial anchor point area at the feature map scale, used to constrain the feature extraction range. Only the regional features of the shallow feature map that correspond to this area are retained, giving the cropped shallow focused feature map, which is more detailed while still taking local semantics into account.
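The coordinate handling in this part of the hierarchical decoding (feature map scale to image scale, 20% expansion, then back to a shallow feature map scale) can be sketched as below. The box layout (cx, cy, w, h), the reading of "20%" as scaling width and height by 1.2 about the box center, the clipping to the image, and the strides 16 and 4 are all assumptions made for the example.

```python
def to_image_scale(box, stride):
    """(cx, cy, w, h) on a feature map -> original-image pixels."""
    return tuple(v * stride for v in box)

def expand_box(box, ratio=0.2, image_size=None):
    """Grow width and height by `ratio` about the center; return (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    w, h = w * (1 + ratio), h * (1 + ratio)
    x1, y1, x2, y2 = cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2
    if image_size is not None:  # optionally clip to the image bounds
        img_w, img_h = image_size
        x1, y1 = max(0.0, x1), max(0.0, y1)
        x2, y2 = min(float(img_w), x2), min(float(img_h), y2)
    return (x1, y1, x2, y2)

def to_feature_scale(region, stride):
    """(x1, y1, x2, y2) in image pixels -> coordinates on a feature map with this stride."""
    return tuple(v / stride for v in region)

coarse_box = (10.0, 10.0, 5.0, 5.0)                 # large target on a stride-16 map
image_box = to_image_scale(coarse_box, 16)          # box in original-image pixels
anchor_region = expand_box(image_box, ratio=0.2, image_size=(640, 640))
shallow_region = to_feature_scale(anchor_region, 4)  # region on a stride-4 shallow map
```

Only features inside `shallow_region` would then be kept for fine decoding of small targets; the cropping itself is omitted here.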

[0180] Based on the cropped shallow focused feature map, upsampling is performed to align the feature sizes. The aligned features are then fused using channel concatenation and detail-enhancing convolution to highlight key details such as edges and textures of small objects. The formula is as follows:

[0181] ,

[0182] where the sampling operator indicates the upsampling operation used for alignment, and the result is the fused feature map dedicated to small target detection.

[0183] Query vectors related to small targets are selected from the corrected query vectors, giving the set of query vectors related to small targets,

[0184] where the threshold is the fine decoding query filtering threshold, and the probability term is the spatial prior probability of the query vector q.

[0185] The fused feature map for small target detection and the set of query vectors related to small targets are input to the Transformer decoder, which enhances the interaction between detailed features and small target queries, as shown in the formula:

[0186] ,

[0187] where the key and value vectors are obtained through linear transformation; the output is the attention output features of the fine decoding stage of small targets; and the adjacency matrix is that of spatial relationships for small targets (only the elements representing reasonable relationships between small targets and their corresponding large targets, and between small targets, are 1, such as glass insulators and small shackles, etc.; the rest are 0).

[0188] The feature vectors after attention interaction are passed to the fine-grained prediction head (FFN), which outputs the class probabilities and precise bounding boxes of small targets.

[0189] The probability of the small target category is ,

[0190] where the terms are: the confidence level of the small target category corresponding to the query vector q; the feature vector of the query vector q output by the fine decoding stage of the small target; and the classification feedforward network for the fine decoding stage of the small target;

[0191] The precise bounding box of the small target is , ,

[0192] where the terms are: the center coordinates of the bounding box; the width and height of the bounding box; and the bounding box regression feedforward network for the fine decoding stage of the small target.

[0193] The prediction results for small targets are filtered by confidence: bounding boxes with a confidence level greater than or equal to a set confidence threshold are retained, and low-confidence background bounding boxes are removed, using the following formula:

[0194] .

[0195] The retained precise bounding boxes of the small targets are mapped from feature map scale coordinates to the original image scale. The formula is:

[0196] ,

[0197] where the scaling term is the downsampling factor of the feature map relative to the original image, and the result is the precise localization bounding box of the small target at the scale of the original image.

[0198] Based on prior spatial information, the dependent larger target of each smaller target is identified, thus obtaining the target correlation relationship, as shown in the formula:

[0199] ,

[0200] where BigObj represents the set of large targets located by coarse decoding, and the probability term is the spatial correlation probability between the small target corresponding to the query vector q and the large target obj. Associating each small target with a large target ensures that the detection results conform to the physical dependence of small targets on large targets.
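One plausible reading of the association step is that each small target is linked to the large target with the highest spatial correlation probability. The sketch below assumes exactly that (an arg-max over the large target set); the dictionary layout and the target names are hypothetical.

```python
def associate(small_to_big_prob):
    """{small_id: {big_id: probability}} -> {small_id: most probable big_id}.

    Small targets with no candidate large target are left unassociated.
    """
    return {small: max(probs, key=probs.get)
            for small, probs in small_to_big_prob.items() if probs}

correlation = {
    "shackle_1": {"tower_A": 0.2, "insulator_string_B": 0.7},
    "damper_1": {"tower_A": 0.6, "insulator_string_B": 0.1},
    "orphan_1": {},  # no large target nearby
}
links = associate(correlation)
```

A small target that ends up with no link is a natural candidate for the spatial-relationship check in post-processing.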

[0201] The target detection results output by hierarchical decoding (coarse localization results for large targets, fine localization results for small targets, and target correlation results) will be directly sent to the post-processing stage. Through two core operations, category-aware NMS deduplication and spatial relationship constraints, invalid predictions will be filtered out and unreasonable detection results will be corrected. Finally, a complete detection result containing target category, target attributes, accurate bounding box and confidence level will be formed to ensure that the output data meets the accuracy and practicality requirements of power grid inspection.

[0202] 4. Optimize training strategy (based on InsPLAD dataset characteristics)

[0203] (1) Multi-task collaborative training: Combining the characteristics of object detection annotation and image-level classification annotation in the InsPLAD dataset, a total loss function is designed:

[0204] ;

[0205] in,

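[0206] The bounding box loss is the weighted sum of the improved GIoU loss and the L1 loss (adapted to the accurate bounding boxes in the InsPLAD dataset):

[0207] $L_{box} = \lambda_{GIoU} L_{GIoU}(\hat{b}, b) + \lambda_{L1} L_{L1}(\hat{b}, b)$, with $L_{L1}(\hat{b}, b) = \sum_{t=1}^{4} |\hat{b}_t - b_t|$;

[0208] where $\lambda_{GIoU}$ is the weight of the GIoU loss; $L_{GIoU}$ is the generalized intersection over union loss; $\lambda_{L1}$ is the weight of the L1 loss; $L_{L1}$ is the L1 loss; $\hat{b}$ is the target bounding box predicted by the model; $b$ is the ground-truth bounding box labeled in the dataset; $\hat{b}_t$ is the t-th component of the predicted bounding box; $b_t$ is the t-th component of the ground-truth bounding box; the bounding box contains 4 components, namely the center coordinates, width, and height of the bounding box.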
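A weighted sum of a GIoU loss and an L1 loss over (cx, cy, w, h) boxes can be sketched as follows. This uses the standard GIoU definition rather than any improved variant, assumes boxes with positive width and height, and the default weights `w_giou` and `w_l1` are placeholders, not values from this document.

```python
def to_corners(box):
    """(cx, cy, w, h) -> (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def giou(pred, target):
    """Standard generalized IoU in [-1, 1]; boxes must have positive area."""
    ax1, ay1, ax2, ay2 = to_corners(pred)
    bx1, by1, bx2, by2 = to_corners(target)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull

def box_loss(pred, target, w_giou=2.0, w_l1=5.0):
    """Weighted sum of (1 - GIoU) and the L1 distance over the 4 box components."""
    l_giou = 1.0 - giou(pred, target)
    l_l1 = sum(abs(p - t) for p, t in zip(pred, target))
    return w_giou * l_giou + w_l1 * l_l1

same = box_loss((5.0, 5.0, 2.0, 2.0), (5.0, 5.0, 2.0, 2.0))
apart = box_loss((1.0, 1.0, 2.0, 2.0), (5.0, 1.0, 2.0, 2.0), w_giou=1.0, w_l1=1.0)
```

Unlike plain IoU, the GIoU term still changes as two disjoint boxes move closer together or further apart, so the loss stays informative before the boxes overlap.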

[0209] Focal loss (to address class imbalance in the InsPLAD dataset):

[0210] ;

[0211] where the terms are: the total number of categories of power grid targets; the class-balanced weight of each class; the probability predicted by the model that the target belongs to that class; and the adjustment coefficient for easy and difficult samples.
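The effect of the class-balanced weight and the easy/difficult adjustment coefficient can be seen in a small sketch of the standard focal loss for one sample with a known true class. The exact formula used here is not legible in this text, so the standard form is an assumption, as are the names `class_weights` and `gamma`.

```python
import math

def focal_loss(probs, target_class, class_weights, gamma=2.0):
    """Standard focal loss for one sample: -alpha_c * (1 - p_c)^gamma * log(p_c)."""
    p = min(max(probs[target_class], 1e-12), 1.0)  # clamp to avoid log(0)
    return -class_weights[target_class] * (1.0 - p) ** gamma * math.log(p)

weights = [1.0, 1.0, 3.0]   # a rare class gets a larger balancing weight
easy = focal_loss([0.9, 0.05, 0.05], 0, weights)
hard = focal_loss([0.5, 0.3, 0.2], 0, weights)
rare = focal_loss([0.3, 0.2, 0.5], 2, weights)
```

Raising `gamma` shrinks the contribution of well-classified samples faster than that of poorly classified ones, which is the lever the third training stage uses on difficult samples.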

[0212] Spatial relation consistency loss (calculated based on spatial relation labels in the InsPLAD dataset):

[0213] ;

[0214] where N is the number of targets (power equipment) detected in a single image, and the two spatial relationship terms are, respectively, the spatial relationship value between the i-th target and the j-th target predicted by the model and the true spatial relationship between the i-th target and the j-th target labeled in the InsPLAD dataset.

[0215] Frequency domain fidelity loss (to address the problem of high-frequency detail loss in downsampling of small targets):

[0216] ;

[0217] where the operator is the Fourier transform, applied to the spatial domain feature map output by the model and to the spatial domain feature map generated based on the real annotations of the InsPLAD dataset.

[0218] Classification loss for target attribute:

[0219] ;

[0220] where M represents the total number of device instances with defect annotations in a single image; each instance m has a true defect label and a probability, predicted by the model, that the instance has a defect. The weight for positive samples (defective samples) is calculated from the total number of samples and the number of defective samples, and is used to compensate for the scarcity of defective samples; the weight for negative samples (normal samples) is calculated from the number of normal samples, to avoid the loss being dominated by normal samples.
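A class-weighted binary cross-entropy of this kind can be sketched as below. The weight formulas are not legible in this text, so the example assumes inverse-frequency weights (total count divided by the count of each class) and averages over the M instances; both are assumptions, and the function names are hypothetical.

```python
import math

def inverse_frequency_weights(labels):
    """Assumed weighting: total / defective for positives, total / normal for negatives."""
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    w_pos = n / n_pos if n_pos else 0.0
    w_neg = n / n_neg if n_neg else 0.0
    return w_pos, w_neg

def attribute_loss(labels, probs, w_pos, w_neg, eps=1e-12):
    """Weighted binary cross-entropy averaged over the annotated instances."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep the logarithms finite
        total -= w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1.0 - p)
    return total / len(labels)

labels = [1, 0, 0, 0]                 # one defective instance among four
probs = [0.5, 0.1, 0.1, 0.1]          # predicted defect probabilities
w_pos, w_neg = inverse_frequency_weights(labels)
loss = attribute_loss(labels, probs, w_pos, w_neg)
```

With these weights the single defective instance contributes as much weight as the three normal instances combined, so the rarer class is not drowned out.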

[0221] (2) A progressive training process is adopted: In the first stage, the backbone network is frozen and only PEFP and the detection head are trained (using the category labels and bounding box annotations of the InsPLAD dataset to quickly learn the basic features of power equipment); in the second stage, the backbone network is unfrozen and end-to-end fine-tuning is performed (combining the spatial relationship labels of the InsPLAD dataset to optimize feature extraction and fusion); in the third stage, the difficult samples (small targets, occluded targets, and foggy samples) in the InsPLAD dataset are optimized to improve the robustness of the model.

[0222] In the third stage, the focus is on optimizing the difficult samples in the InsPLAD dataset. An online hard sample mining strategy is adopted: based on the loss value distribution of the previous stage, the 20% of samples with the highest loss values are selected (mainly extremely small shackles, severely occluded towers, and foggy, blurred images). Targeted data augmentation is performed on these samples (such as copy-paste augmentation for small targets, random erasing for occluded targets, and contrast attenuation synthesis for foggy samples), and the focal loss adjustment coefficient for easy and difficult samples is increased at the same time. This forces the model to focus on correcting prediction errors for high-difficulty samples, thereby improving the model's robustness under complex conditions.
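Selecting the highest-loss 20% of samples can be sketched in a few lines. The per-sample loss dictionary, the rounding rule, and the guarantee of at least one selected sample are assumptions made for the example.

```python
def select_hard_samples(losses, fraction=0.2):
    """losses: {sample_id: loss}. Returns the ids with the highest losses.

    The count is round(len * fraction), with a minimum of one sample.
    """
    count = max(1, round(len(losses) * fraction))
    return sorted(losses, key=losses.get, reverse=True)[:count]

previous_stage_losses = {
    "img_a": 0.10, "img_b": 0.90, "img_c": 0.30, "img_d": 0.80, "img_e": 0.20,
    "img_f": 0.05, "img_g": 0.40, "img_h": 0.15, "img_i": 0.25, "img_j": 0.35,
}
hard_ids = select_hard_samples(previous_stage_losses)
```

The returned ids would then be routed to the targeted augmentations listed above before the next training epoch.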

[0223] 5. Optimize the reasoning process

[0224] (1) Adaptive reasoning acceleration:

[0225] The computation path is dynamically selected based on image complexity (based on scene complexity thresholds from samples in the InsPLAD dataset). Lightweight paths are used for simple scenes (such as clear, unobstructed views at eye level, accounting for 60% of the InsPLAD test set), while the full network is used for complex scenes (such as foggy, oblique views with occlusion, accounting for 40%). The number of decoder queries is dynamically adjusted (30 queries for simple scenes and 100 queries for complex scenes) to balance speed and accuracy.

[0226] The lightweight path refers to reducing the number of Transformer decoder layers to 1, the number of query vectors to 30, and disabling deformable convolutional branches in the ASPC module during inference. The complete network refers to enabling all layers of the Transformer decoder (6 layers in this embodiment), selecting 100 query vectors, and enabling all branches of the ASPC module.
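The two inference paths can be expressed as a small configuration switch. The layer counts, query counts, and deformable-branch setting below come from the text; the scalar `complexity` score and its `threshold` are hypothetical, since how scene complexity is measured is not specified here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InferenceConfig:
    decoder_layers: int       # number of Transformer decoder layers to run
    num_queries: int          # number of decoder query vectors
    deformable_branch: bool   # whether the ASPC deformable convolution is enabled

LIGHTWEIGHT = InferenceConfig(decoder_layers=1, num_queries=30, deformable_branch=False)
FULL = InferenceConfig(decoder_layers=6, num_queries=100, deformable_branch=True)

def choose_path(complexity, threshold):
    """Pick the full network for scenes whose complexity score exceeds the threshold."""
    return FULL if complexity > threshold else LIGHTWEIGHT

clear_scene = choose_path(complexity=0.2, threshold=0.5)
foggy_scene = choose_path(complexity=0.8, threshold=0.5)
```

Keeping both configurations as immutable values makes it straightforward to log which path was taken for each image.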

[0227] (2) Post-processing optimization:

[0228] Category-aware Non-Maximum Suppression (NMS) is used for deduplication. The deduplication threshold is set based on the target scale characteristics of the InsPLAD dataset: the IoU threshold for large targets is set to 0.7, and the IoU threshold for small targets is set to 0.5. When detecting the same target, the model often outputs multiple predicted bounding boxes with slightly different positions (for example, for the same insulator, the model may predict 3 bounding boxes: slightly to the left, slightly to the right, and centered). If the IoU between bounding box A and bounding box B exceeds the IoU threshold, bounding boxes A and B are treated as duplicates. Category-aware NMS retains the bounding box with the highest score and deletes the remaining duplicate bounding boxes of the same category.
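A greedy category-aware NMS with scale-dependent thresholds can be sketched as follows. The thresholds 0.7 and 0.5 come from the text; the detection tuple layout, the `large_classes` set, and the class names are assumptions for illustration.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def class_aware_nms(detections, large_classes, large_thr=0.7, small_thr=0.5):
    """detections: list of (class_name, score, box). Boxes only suppress their own class."""
    kept = []
    for det in sorted(detections, key=lambda d: d[1], reverse=True):
        cls, _, box = det
        thr = large_thr if cls in large_classes else small_thr
        # Keep the box unless a higher-scoring box of the same class overlaps it too much.
        if all(k[0] != cls or iou(box, k[2]) <= thr for k in kept):
            kept.append(det)
    return kept

detections = [
    ("insulator", 0.9, (0, 0, 10, 10)),
    ("insulator", 0.8, (1, 0, 11, 10)),     # IoU about 0.82 with the first: duplicate
    ("insulator", 0.7, (20, 0, 30, 10)),
    ("tower", 0.6, (0, 0, 10, 10)),         # same place as an insulator, different class
    ("tower", 0.5, (2.5, 0, 12.5, 10)),     # IoU 0.6 with the other tower
]
kept = class_aware_nms(detections, large_classes={"tower"})
```

The second tower box survives because 0.6 is below the large-target threshold of 0.7; it would be removed if towers were treated as small targets.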

[0229] Using the spatial relationship constraints between power devices obtained statistically from the InsPLAD dataset (e.g., insulator strings cannot exist detached from conductors), physically impossible detection results (e.g., isolated insulator string bounding boxes) are eliminated.

[0230] As shown in Figure 3, the present invention provides a power grid target detection method based on an improved RT-DETR model. This method utilizes the improved RT-DETR model to detect power grid targets in an input image, and includes the following steps:

[0231] S1, After preprocessing, the input image is fed into the improved backbone network to extract multi-scale downsampled feature maps. Each downsampled feature map is enhanced by an adaptive spatially aware convolutional module (ASPC) to adapt the receptive field and obtain an enhanced multi-scale feature map that is adapted to multi-scale targets.

[0232] S2, the enhanced multi-scale feature map is input into the Feature Enhancement Pyramid (PEFP), and high-frequency feature protection (for small target details), cross-scale bidirectional fusion and spatial constraint attention optimization (using spatial priors) are performed in sequence to obtain the feature-enhanced multi-scale feature map.

[0233] S3: The enhanced multi-scale feature map is input into the improved detection head, and an initial query vector is generated through IoU-aware query selection, which is then decoded by the Transformer decoder.

[0234] S4, the decoding result is post-processed and optimized (category-aware NMS + spatial relationship constraints), and the final detection result (target category, bounding box, confidence score, target attribute) is output.

[0235] The above are merely preferred embodiments of the present invention and are not intended to limit the scope of the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A power grid target detection method based on an improved RT-DETR model, characterized in that, An improved RT-DETR model was developed and trained using a dataset for power grid target detection. The dataset contains images of K types of power grid targets, including large targets with an average pixel area exceeding a set threshold for large targets and small targets with an average pixel area below a set threshold for small targets. Improvements to the existing RT-DETR model include: The backbone network is improved by replacing the standard convolution in the backbone network with an adaptive spatially aware convolution module. The adaptive spatially aware convolution module dynamically adjusts the receptive field to adapt to the scale changes of the power grid target through multi-branch dilated convolution, scale attention mechanism and deformable convolution. After the input image is fed into the backbone network, each downsampled feature map is passed through the adaptive spatially aware convolution module to obtain an enhanced multi-scale feature map. The feature enhancement pyramid is designed to integrate high-frequency feature protection, cross-scale bidirectional fusion, and spatially constrained attention. It enhances the features of small targets and optimizes the feature representation of small targets by utilizing prior knowledge of the power grid target space. The output of the backbone network is then passed through the feature enhancement pyramid to obtain a feature-enhanced multi-scale feature map. The detection head is improved by introducing IoU-aware query selection and power grid target space prior. The IoU-aware query selection obtains the initial query vector from the feature-enhanced multi-scale feature map. The decoder corrects the query vector based on the power grid target space prior and adopts a hierarchical decoding strategy to output the power grid target detection result. 
The decoder employs a hierarchical decoding strategy for power grid target detection, as detailed below: Based on deep feature map coarse decoding of large targets, semantic complementarity is achieved by downsampling alignment and channel concatenation to obtain a fused feature map for large target detection. Filter out the specific query vectors related to the large target from the revised query vectors to obtain the specific query vector set related to the large target; The fused large target detection feature map and the large target-related query vector set are input into the Transformer decoder to obtain the attention output features of the large target coarse decoding stage; The feature vectors after attention interaction are used by the prediction head to output the class probability and coarse bounding box of the large target; The prediction results of large targets are filtered by confidence, and bounding boxes with confidence scores greater than or equal to a set confidence threshold are retained; the coarse bounding boxes of large targets at the feature map scale are mapped back to the original image scale to obtain the coarse localization bounding boxes of large targets at the original image size; Based on the coarse positioning bounding box of the large target in the original image, the area is expanded outward by a set ratio as the spatial anchor point area, and then the small target is finely decoded within this spatial anchor point area. The spatial anchor point regions at the original image scale are mapped to the shallow feature map scale to obtain the spatial anchor point regions at the feature map scale; only the regional features corresponding to the spatial anchor point regions in the shallow feature map are retained, and the background features outside the regions are removed to obtain the cropped shallow focused feature map. 
Based on the cropped shallow focused feature map, the aligned features are fused by upsampling alignment, channel concatenation and detail enhancement convolution to obtain a fused feature map for small target detection. Filter out the specific query vectors related to the small target from the revised query vectors to obtain a set of specific query vectors related to the small target; The fused small target detection feature map and the small target-related query vector set are input into the Transformer decoder to obtain the attention output features of the small target fine decoding stage; The feature vector after attention interaction is used by the prediction head to output the class probability and precise bounding box of the small target; The prediction results of small targets are filtered by confidence, and bounding boxes with confidence scores greater than or equal to a set confidence threshold are retained; the accurate bounding boxes of small targets at the feature map scale are mapped back to the original image scale to obtain the accurate localization bounding boxes of small targets at the original image scale. Based on prior spatial information, the dependent larger target of each smaller target is identified, thus obtaining the target correlation relationship.

2. The power grid target detection method based on the improved RT-DETR model according to claim 1, characterized in that, The processing procedure of the adaptive spatially aware convolutional module is as follows: (1) Multi-branch dilated convolution: The input feature map is processed in parallel by multiple branches, and the output is a multi-branch feature map, which is adapted to the scale range of the dataset from large to small targets; the multi-branch corresponds to multiple convolutional layers with different dilation rates; the input feature map is a downsampled feature map of the input image obtained by passing it through the backbone network; (2) Scale attention weight generation: Global average pooling is performed on the input feature map, and the weights of each branch are generated through a fully connected layer. After normalization, the weight vector is obtained. (3) Multi-branch feature fusion: The multi-branch feature maps are weighted and summed according to the normalized weights to obtain the fused feature map; (4) Deformable convolution: predict the dynamic offset of sampling points through convolutional layers, and perform variable convolution on the fused feature map based on the predicted dynamic offset of sampling points to obtain the feature map after deformable convolution. (5) Perform residual connection between the output feature map of deformable convolution and the fused feature map to obtain the feature map enhanced by the adaptive spatially aware convolution module; The input image is fed into the backbone network and then downsampled sequentially to obtain multi-scale downsampled feature maps. Each downsampled feature map is then processed by an adaptive spatially aware convolutional module to obtain an enhanced multi-scale feature map.

3. The power grid target detection method based on the improved RT-DETR model according to claim 1, characterized in that the enhanced multi-scale feature maps, output from the backbone network after sequential downsampling and the adaptive spatially aware convolutional modules, are processed by the feature enhancement pyramid as follows:
(1) High-frequency feature protection: high-frequency enhancement is performed on the shallow feature map corresponding to small-target features, to enhance the high-frequency details of the small targets in the dataset and obtain a high-frequency-enhanced shallow feature map; high-pass filtering is applied to the shallow feature map to retain only high-frequency details and filter out low-frequency background information, obtaining a high-pass-filtered shallow feature map; the high-frequency-enhanced shallow feature map and the high-pass-filtered shallow feature map are concatenated, and the high-frequency attention weight map of the shallow features is calculated using an activation function; based on the high-frequency attention weight map, weighted summation and a residual connection are performed to obtain the final shallow feature map after high-frequency feature protection;
(2) Cross-scale bidirectional fusion: two paths are designed, one top-down and one bottom-up; the top-down path uses upsampling to refine, step by step, the semantic information of the high-level features, i.e., the semantic information of large targets; the bottom-up path uses downsampling to enhance, across scales, the detail of the low-level features, i.e., the detail features of small targets; learnable fusion weights are introduced to balance the contribution of each scale, and the multi-scale cross-scale bidirectional fusion feature maps are obtained based on the fusion weights;
(3) Spatial constraint attention: multi-scale attention maps are generated from the multi-scale cross-scale bidirectional fusion feature maps; each attention map is multiplied with the cross-scale bidirectional fusion feature map of the corresponding scale, and a residual connection is applied to obtain the feature-enhanced multi-scale feature maps.
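Two ingredients of this claim, high-pass filtering and fusion with learnable weights, can be sketched in plain Python on 1D signals. Both choices below are assumptions for illustration only: the high-pass filter is taken to be "signal minus local mean", and the fusion weights are normalized by clipping at zero and dividing by their sum, a scheme known from BiFPN-style fusion. The claim itself fixes neither the filter nor the normalization, and the function names are hypothetical.

```python
def high_pass(x, radius=1):
    """Subtract a local mean (box blur) from each sample.

    Flat, low-frequency regions go to zero; sharp local detail survives.
    """
    n = len(x)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        local_mean = sum(x[lo:hi]) / (hi - lo)
        out.append(x[i] - local_mean)
    return out

def normalized_fusion(features, raw_weights, eps=1e-4):
    """Fuse same-length feature vectors using non-negative, normalized weights.

    raw_weights stand in for learnable parameters; negative values are
    clipped to zero so that no scale contributes with a flipped sign.
    """
    pos = [max(w, 0.0) for w in raw_weights]
    total = sum(pos) + eps  # eps guards against division by zero
    length = len(features[0])
    return [sum(p * f[i] for p, f in zip(pos, features)) / total
            for i in range(length)]
```

A constant background is removed entirely by `high_pass`, while an isolated spike is kept, which mirrors the stated aim of retaining small-target detail and suppressing low-frequency background.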

4. The power grid target detection method based on the improved RT-DETR model according to claim 1, characterized in that, for the feature-enhanced multi-scale feature maps output by the feature enhancement pyramid, IoU-aware query selection is used to obtain the initial query vectors from the feature-enhanced multi-scale feature maps, the processing procedure being as follows: first, anchor points are generated on each layer of the feature maps; for each anchor point, a feature vector is extracted from the feature map of the corresponding layer through a region-of-interest alignment operation and serves as a candidate query vector, the candidate query vectors of all layers constituting a candidate query vector set; the candidate query vector set is input into the prediction head to obtain the classification score and predicted bounding box of each candidate query vector, and the IoU value between each predicted bounding box and the corresponding ground-truth bounding box in the dataset is calculated; a joint scoring function integrating the classification scores and the IoU values is designed; the joint scores are sorted in descending order, and the top K candidate query vectors are selected as the initial query vectors.
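The selection step can be sketched as follows. The claim does not state the form of the joint scoring function, so the geometric blend `cls_score ** alpha * iou ** (1 - alpha)` used here is purely an assumption, as are the dictionary keys, the `alpha` parameter, and the corner-format boxes `(x1, y1, x2, y2)`.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def select_top_k(candidates, k, alpha=0.5):
    """Return the indices of the k candidates with the highest joint score.

    candidates: list of dicts with keys 'cls_score', 'pred_box', 'gt_box'
    (a hypothetical layout). alpha balances classification against IoU.
    """
    scored = []
    for idx, c in enumerate(candidates):
        iou = box_iou(c["pred_box"], c["gt_box"])
        joint = (c["cls_score"] ** alpha) * (iou ** (1.0 - alpha))
        scored.append((joint, idx))
    scored.sort(key=lambda t: t[0], reverse=True)  # descending joint score
    return [idx for _, idx in scored[:k]]
```

With a blend of this kind, a candidate that is confidently classified but poorly localized ranks below one that is well localized with a slightly lower classification score, which is the behavior IoU-aware selection is meant to encourage.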

5. The power grid target detection method based on the improved RT-DETR model according to claim 1, characterized in that the decoder corrects the query vector based on the power grid target spatial prior according to a correction formula (not reproduced in this text), whose terms are: the spatial prior distribution map; the weighting coefficients; the initial query vector; and the corrected query vector, i.e., the initial query vector incorporating the prior spatial information, which serves as the input to the decoder.

6. The power grid target detection method based on the improved RT-DETR model according to claim 1, characterized in that the target detection results output by hierarchical decoding, which include the coarse localization results of large targets, the fine localization results of small targets, and the target association relationships, are post-processed and optimized: invalid predictions are filtered out, and the detection results are corrected by class-aware NMS deduplication and spatial relationship constraints, finally forming a complete detection result containing target category, target attribute, bounding box, and confidence.
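Class-aware NMS means that a box can only suppress lower-scoring boxes of the same category. A minimal greedy sketch is given below; the tuple layout, the IoU threshold of 0.5, and the example category names used in the note that follows are illustrative assumptions, not values taken from the claim.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def class_aware_nms(dets, iou_thresh=0.5):
    """Greedy NMS that only compares detections sharing a category.

    dets: list of (category, score, (x1, y1, x2, y2)) tuples (assumed layout).
    Returns the surviving detections in descending score order.
    """
    kept = []
    for det in sorted(dets, key=lambda d: d[1], reverse=True):
        category, _, box = det
        duplicate = any(
            k_cat == category and box_iou(box, k_box) > iou_thresh
            for k_cat, _, k_box in kept
        )
        if not duplicate:
            kept.append(det)
    return kept
```

Under this rule, a component mounted on a larger structure (say, an insulator on a tower) is not discarded merely because the two boxes overlap heavily, since they belong to different categories.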

7. The power grid target detection method based on the improved RT-DETR model according to claim 1, characterized in that the improved RT-DETR model employs multi-task collaborative training, with a total loss function that combines the following loss terms, each with a corresponding loss weight (the formulas and their symbols are not reproduced in this text):
the weighted sum of the improved GIoU loss and the L1 loss, whose terms are: the weight of the GIoU loss; the generalized intersection-over-union (GIoU) loss; the weight of the L1 loss; the L1 loss; the target bounding box predicted by the model; the ground-truth target bounding box labeled in the dataset; the t-th component of the predicted bounding box; and the t-th component of the ground-truth bounding box; the bounding box contains 4 components, namely the center coordinates, width, and height of the bounding box;
the focal loss, whose terms are: the total number of power grid target categories; the class-balancing weight of each category; the probability predicted by the model that the target belongs to that category; and the adjustment coefficient for easy and hard samples;
the spatial relationship consistency loss, where N is the number of targets detected in a single image, and whose terms are: the spatial relationship value between the i-th target and the j-th target predicted by the model; and the true spatial relationship value between the i-th target and the j-th target labeled in the dataset;
the frequency-domain fidelity loss, whose terms are: the Fourier transform operation; the spatial-domain feature map output by the model; and the spatial-domain feature map generated from the ground-truth annotations in the dataset;
the target attribute classification loss, where M is the total number of instances with defect annotations in a single image, and whose terms are: the true defect label of the m-th instance; the probability predicted by the model that the m-th instance has a defect; the positive sample weight; and the negative sample weight;
the improved RT-DETR model adopts a progressive training process: the first stage freezes the backbone network and trains only the feature enhancement pyramid and the detection head; the second stage unfreezes the backbone network and performs end-to-end fine-tuning; the third stage optimizes for the difficult samples in the dataset, including small targets, occluded targets, and foggy samples.
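Since the formulas of this claim did not survive in the text, the sketch below uses the standard textbook forms of three of the named terms: GIoU plus L1 for boxes given as (center x, center y, width, height), focal loss for the true class, and a weighted binary cross-entropy for the defect attribute. Whether the patent's "improved" variants match these forms is unknown; all function names, parameter names, and default weights are assumptions.

```python
import math

def to_corners(box):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def giou_loss(pred, gt):
    """1 - GIoU for two boxes in (cx, cy, w, h) format."""
    a, b = to_corners(pred), to_corners(gt)
    inter = (max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
             * max(0.0, min(a[3], b[3]) - max(a[1], b[1])))
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # Smallest axis-aligned box enclosing both inputs.
    enclose = ((max(a[2], b[2]) - min(a[0], b[0]))
               * (max(a[3], b[3]) - min(a[1], b[1])))
    giou = inter / union - (enclose - union) / enclose
    return 1.0 - giou

def box_loss(pred, gt, w_giou=1.0, w_l1=1.0):
    """Weighted sum of GIoU loss and L1 loss over the 4 box components."""
    l1 = sum(abs(p - g) for p, g in zip(pred, gt))
    return w_giou * giou_loss(pred, gt) + w_l1 * l1

def focal_loss(p_true, class_weight=1.0, gamma=2.0):
    """Standard focal loss for the predicted probability of the true class."""
    return -class_weight * (1.0 - p_true) ** gamma * math.log(p_true)

def attribute_loss(labels, probs, w_pos=1.0, w_neg=1.0):
    """Weighted binary cross-entropy averaged over annotated instances."""
    total = 0.0
    for y, p in zip(labels, probs):
        total -= w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1.0 - p)
    return total / len(labels)
```

A total loss in the spirit of the claim would then be a weighted sum of these values together with the spatial relationship and frequency-domain terms, which are left out here because their exact definitions are not recoverable from the text.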

8. A readable storage medium, characterized in that it stores a computer program which, when executed, implements the power grid target detection method based on the improved RT-DETR model as described in any one of claims 1 to 7.

9. An electronic device, characterized in that it includes a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the power grid target detection method based on the improved RT-DETR model as described in any one of claims 1 to 7.