A Defect Identification Method for Power Transmission and Transformation Unmanned Aerial Vehicles During Loitering Flights Based on a Homogeneous Model

By adopting the SwinL model and multi-expert routing technology, enhancing the FPN framework, cascading detection heads, and spatial adaptive attention mechanism, the problems of computational redundancy and performance degradation in defect identification of power transmission and transformation UAVs were solved, achieving efficient identification and accurate detection of defects in all scenarios.

CN121170635B · Active · Publication Date: 2026-06-30 · SICHUAN SHUJU INTELLIGENT MFG TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: SICHUAN SHUJU INTELLIGENT MFG TECH CO LTD
Filing Date: 2025-09-03
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing technologies for defect identification in power transmission and transformation drones suffer from computational redundancy, reduced identification performance, and poor identification results for different categories, especially for defect categories with few defect samples and discrete feature distributions.

Method used

The SwinL model is used as the backbone network of the isomorphic model. It is pre-trained through device detection tasks and feature mask reconstruction tasks. Combined with multi-expert routing technology, enhanced FPN framework, cascaded detection heads and parallel auxiliary tasks, and spatial scale adaptive attention mechanism, multi-dimensional auxiliary detection head model and environment perception head model are generated to achieve accurate identification of different defects.

Benefits of technology

By eliminating redundant calculations through the shared device identification model backbone and dynamically allocating expert models for targeted identification, the feature expression capability is enhanced, thereby improving the robustness of component and instance-level defect detection and the accuracy of environmentally conscious defect identification.

✦ Generated by Eureka AI based on patent content.

Figure CN121170635B_ABST

Abstract

This invention relates to a defect identification method for power transmission and transformation drone patrols based on a homogeneous model, belonging to the field of power inspection. It addresses the problems of unbalanced defect identification indicators and redundant model parameters in existing solutions for different equipment types. First, this invention uses a detection model with a large number of parameters to refine the backbone and equipment identification capabilities. The model is pre-trained using equipment detection tasks, with the backbone serving as a common baseline feature representation model for all defect identifications. Second, candidate regions are divided by equipment area. Based on the identity information of the candidate regions, multi-expert routing is used to apply different tasks to different equipment objects for defect identification. Finally, for classification tasks, a spatial adaptive classification head model is proposed; for detection tasks, a local and joint detection head model is proposed; and for scene perception tasks, an environmental perception head model is proposed.

Description

Technical Field

[0001] This invention relates to the field of power line inspection and provides a method for identifying defects during power transmission and transformation drone patrols based on a homogeneous model.

Background Technology

[0002] Defect identification in power transmission and transformation has shifted from manual inspections to drone inspections. Drones have addressed the manpower shortage caused by the rapidly increasing mileage of power lines, but they have also created a need for analyzing the inspection images. Because drones take massive amounts of data, this presents a significant challenge for maintenance teams in subsequent defect identification.

[0003] To address this issue, deep learning-based defect identification solutions have been gradually applied to various defects. For equipment such as fittings, vibration dampers, and towers, solutions have evolved from pure CNN techniques (such as YOLOv8) to DETR series models (such as the CO-DETR model) and large-scale models (such as the Guangming large model). While these models have shown excellent identification performance for certain defects, they perform very poorly for defect categories with limited sample sizes and discrete distributions of associated features, exhibiting a severe imbalance in performance.

[0004] For defect identification across different devices, existing technologies typically take one of two approaches: identifying defects device by device, or using a single deep learning model to identify all defects across all devices. The former involves significant computational redundancy, with over 70% of the computation being ineffective and repetitive. The latter, which integrates all defects into a single model, suffers performance degradation on categories that are difficult or ill-suited to the model's task type, which in turn lowers the recognition performance of other categories.

Summary of the Invention

[0005] The purpose of this invention is to solve the technical problems of computational redundancy, performance degradation of aggregate recognition, and huge differences in recognition effects among different categories caused by model schemes in defect identification of power transmission and transformation drones.

[0006] To achieve the above objectives, the present invention employs the following technical means:

[0007] This invention provides a method for identifying defects during unmanned aerial vehicle (UAV) patrols in power transmission and transformation based on an isomorphic model, characterized by comprising:

[0008] Step 1: The SwinL model is used as the backbone network of the isomorphic model. It is pre-trained through the device detection task and the feature mask reconstruction task. The pre-trained device recognition model is used as the baseline model for all defect recognition.

[0009] Step 2: For the baseline model, use multi-expert routing technology to distribute the local feature representations output by the backbone network to the corresponding expert models;

[0010] Step 3: For component, instance-level defects and local feature representations, use the enhanced FPN framework to fuse multi-level features of the backbone network to obtain a spatial adaptive classification head model and output the spatial adaptive classification result;

[0011] Step 4: For equipment-level defects, a multi-dimensional auxiliary detection head model is generated through cascaded detection heads and parallel auxiliary task technology, and the local defect identification results are output for the identification of local defects such as corrosion, displacement and detachment.

[0012] Step 5: For environmental perception defects, generate an environmental perception head model through a spatial scale adaptive attention mechanism, and output global defect identification results, such as vibration damper slippage.

[0013] In the above scheme, the SwinL model is used as the backbone network of the isomorphic model, and it is pre-trained through device detection tasks and feature mask reconstruction tasks. The loss function for pre-training is:

[0014] $L_{total} = L_d(g(f(X)), Y) + \lambda \cdot \lVert (h(f(X) \odot M) - x) \odot (1 - M) \rVert_1$

[0015] Where f is the backbone network encoder function, g is the detector head, h is the mask reconstruction model, M is the binary mask matrix, representing the region to be masked, and λ is the weight that balances the losses of the two tasks.

[0016] In the above scheme, step 2 includes the following steps:

[0017] Step 2.1: Based on the defect characteristics of the power transmission and transformation equipment, obtain the equipment-route mapping relationship table, wherein the equipment-route mapping relationship table is a one-to-many mapping relationship;

[0018] Step 2.2: Assign the device regions output by the baseline model to the corresponding expert models according to the device-routing mapping relationship to obtain the backbone network feature distributor.

[0019] In the above scheme, the feature distributor adaptively scales the task area size according to the feature region label and routing type, and divides the identification results into target device identification results and non-target device identification results;

[0020] The target device identification results are then fused using a predefined device aggregation method.

[0021] The non-target device identification results are summarized into a global non-target device set. After excluding target devices, a device aggregation method is applied and hyperparameters are set for filtering. The target device identification results and non-target device identification results are merged to obtain the device identification results.

[0022] In the above scheme, step 3 uses an enhanced FPN framework to fuse multi-level features of the backbone for component, instance-level defects, and local feature representations, resulting in a spatially adaptive classification head model. The fusion process of the enhanced FPN framework includes:

[0023] $fpn_i = \mathrm{CNN}(\mathrm{WindowAttn}(b_{out_i}))$

[0024] $fpn_{i-1} = fpn_{i-1} + \mathrm{Interpolate}(fpn_i),\ i > 1$

[0025] $FPN_{out} = \mathrm{Concat}([\mathrm{AP}(\mathrm{LCNN}(fpn_0)), \ldots, \mathrm{AP}(\mathrm{LCNN}(fpn_{cn}))])$

[0026] Here, $b_{out_i}$ represents the i-th level output feature of the backbone network, CNN represents a convolutional neural network, WindowAttn represents a window attention mechanism, Interpolate represents an interpolation function, AP represents an adaptive pooling operator, LCNN represents a convolutional layer module, cn represents the number of intermediate layers in the backbone network, and $fpn_0$ represents the 0-th level feature of the backbone after processing by CNN and WindowAttn; and

[0027] The classification model output is obtained using fully connected convolutional techniques, and the calculation process is as follows:

[0028] $category = \mathrm{Sigmoid}(\mathrm{FC}(FPN_{out}))$

[0029] Here, FC represents a fully connected layer, and Sigmoid represents an activation function.

[0030] In the above scheme, step 4 includes:

[0031] Step 4.1: Using cascade technology, multiple detectors are connected in series on the detector head to obtain cascaded detector predictions. The processing of the first n-1 layers of the decoder is as follows:

[0032] $DeLn_{0:n-1} = \mathrm{Transformer}(last_{out}, Q, Vis)$

[0033] The processing of the m prediction layers in the cascade part is as follows:

[0034] $DeLn_{n-1:m+n-1} = \mathrm{Transformer}(last_{out}, last'_{out}, Q, Vis)$

[0035] Here, $last_{out}$ represents the output of the previous layer of the main branch, Q represents the query array of the main branch, Vis represents the visual features output by the backbone network, and $last'_{out}$ represents the gradient-detached copy of the previous layer's output;

[0036] Step 4.2: Using a parallel auxiliary task, the weights of the decoder in the shared detector are shared, and the object detection loss is calculated using different types of objective functions to obtain the optimized shared weight decoder. The training objective is:

[0037]

[0038] Where X represents the input image, Y represents the detection label, f represents the backbone network encoder function, MSA represents the multi-scale adapter, $MT_i$ represents the loss function for the corresponding task, and $\lambda_i$ represents the weighting parameter.

[0039] In the above scheme, step 5 includes:

[0040] Step 5.1: For the dynamic attention mechanism, a hybrid technique of window attention and global attention is used to obtain a spatial scale adaptive attention module, where the latent variables of the image patch are represented as:

[0041] $h(patch) = \mathrm{FC}(\mathrm{PC}(\mathrm{DWConv}(x_{patch})))$

[0042] Window attention is calculated as follows:

[0043]

[0044] The number of regions is predicted to be:

[0045] k = DF(h)

[0046] The spatial adaptive attention module is:

[0047]

[0048] Where $x_{patch}$ represents an image patch, DWConv represents depthwise convolution, PC represents pointwise convolution, FC represents a fully connected layer, $FC_q$ and $FC_k$ denote the fully connected layers used for feature mining, d represents the feature dimension, DF represents the distributed prediction function, AG represents the region feature fusion operator, and k represents the number of regions selected.

[0049] Step 5.2: Replace the shallow layer of the QWEN2.5-VL model encoder with the spatial scale adaptive attention module to obtain the encoder of the environment perception model;

[0050] Step 5.3: Combine the semantic decoder to obtain the environment-aware head model.

[0051] Because the present invention employs the above-mentioned technical means, it has the following beneficial effects:

[0052] 1. This invention solves the technical problems of redundant model parameters and redundant computational resources in the identification of defects of different types of equipment in the prior art by using the SwinL model as the backbone network of the isomorphic model in step 1, and performing pre-training through device detection tasks and feature mask reconstruction tasks. The pre-trained device identification model is used as the baseline model for all defect identification. This achieves the effect of eliminating redundant computation for defect identification in all scenarios by sharing the device identification model backbone.

[0053] 2. This invention solves the technical problem of unbalanced indicators caused by the inability of a single task model to adapt to the identification needs of different defect types in the prior art by using multi-expert routing technology in step 2 to distribute the local feature representation output by the backbone network to the corresponding expert model. It achieves the effect of dynamically allocating expert models for targeted defect identification according to the characteristics of equipment defects.

[0054] 3. This invention, through step 3, uses an enhanced FPN framework to fuse multi-level features of the backbone network to obtain a spatially adaptive classification head model. This solves the technical problem in existing technologies where the limited number of learnable parameters in classification models restricts feature representation capabilities. It achieves the effect of enhancing the feature mining capability of classification models for component and instance-level defects by fusing multi-level features.

4. This invention, through step 4, uses a cascaded detection head and parallel auxiliary task technology to generate a multi-dimensional auxiliary detection head model. This solves the technical problem in existing technologies where a single learning objective leads to insufficient detection rate and accuracy of the detection model. It achieves the effect of improving the robustness and recognition accuracy of the detection model for device-level defects through cascaded prediction and multi-task assistance.

5. This invention, through step 5, uses a spatial scale adaptive attention mechanism to generate an environment-aware head model. This solves the technical problem in existing technologies where non-aggregated feature defects (such as environment-aware defects) have low recognition accuracy due to weak semantic association. It achieves the effect of enhancing environmental semantic perception capabilities through dynamic attention mechanisms to improve the overall defect recognition effect.

Attached Figure Description

[0055] Figure 1: Multi-expert routing diagram;

[0056] Figure 2: Diagram illustrating the classification experts;

[0057] Figure 3: Detection expert routing diagram;

[0058] Figure 4: Schematic diagram of the spatial scale adaptive attention encoder structure.

Detailed Implementation

[0059] The embodiments of the present invention will be described in detail below. Although the present invention will be described and illustrated in conjunction with some specific embodiments, it should be noted that the present invention is not limited to these embodiments. On the contrary, any modifications or equivalent substitutions made to the present invention should be covered within the scope of the claims of the present invention.

[0060] Furthermore, to better illustrate the present invention, numerous specific details are set forth in the following detailed embodiments. Those skilled in the art will understand that the present invention can be practiced without these specific details.

[0061] A method for defect identification during unmanned aerial vehicle (UAV) patrols in power transmission and transformation based on an isomorphic model includes:

[0062] Step 1: Use the SwinL model as the backbone of the isomorphic model, and pre-train it through the device detection task and feature mask reconstruction task. Use the pre-trained device recognition model as the baseline model for all defect recognition.

[0063] Step 2: For the baseline model, use multi-expert routing technology to distribute the local feature representations output by the backbone to the corresponding expert models;

[0064] Step 3: For component, instance-level defects and local feature representations, use the enhanced FPN framework to fuse multi-level features of the backbone to obtain a spatial adaptive classification head model and output the spatial adaptive classification result;

[0065] Step 4: Based on the equipment-level defects and regional feature representations, auxiliary branching techniques are used to obtain a multi-dimensional auxiliary detection head model for the identification of local defects such as corrosion, displacement, and detachment;

[0066] Step 5: For environmental perception defects and scene feature representations, use a dynamic attention mechanism to obtain an environmental perception head model for global defect identification, such as vibration damper slippage.

[0067] In the above scheme, step 1 includes the following steps:

[0068] Step 1.1 Based on the pre-training, use industry data and multi-task fine-tuning techniques to train a model that identifies all categories of equipment in the power industry. This equipment identification model serves as the baseline model for all defect identifications. (Sharing the backbone of the equipment identification model eliminates redundant computations for defect identification across all scenarios.)

[0069] In the baseline model, SwinL is used as the backbone, followed by two different types of pre-training tasks. The first is a detection pre-training model, employing near-optimal detection schemes in the neck and head parts, such as the CO-DETR decoder and auxiliary head. High-precision labeled device data from the industry is used as training data, allowing the model to learn the ability to recognize devices. The second task is a feature masking and reconstruction task. Some feature points are randomly masked from the backbone's feature representation, and the remaining feature points are used to reconstruct the image using a MIM-like model. This allows the model to learn semantic features beyond device-specific features, preventing overfitting to devices and ensuring defect recognition has transferability.

[0070] For the input image X, after backbone encoding, the encoded features are fed to the detection head for device detection; then, through sparse sampling of the encoded features, they are fed to a mask reconstruction model, such as MAGE, for image reconstruction. Its training follows the formula below:

[0071] $L_{total} = L_d(g(f(X)), Y) + \lambda \cdot \lVert (h(f(X) \odot M) - x) \odot (1 - M) \rVert_1$

[0072] Where f is the backbone encoder function, g is the detector head, h is the mask reconstruction model, M is the binary mask matrix, representing the region to be masked, and λ is the weight that balances the losses of the two tasks.
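The pre-training objective above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: features and images are flat lists of equal length, the encoder f, detector head g, and reconstruction model h are identity stand-ins, and the detection loss is a simple L1 distance. None of these choices come from the patent; only the structure of the sum does.

```python
from typing import Callable, List

Vector = List[float]


def masked_l1(recon: Vector, target: Vector, mask: Vector) -> float:
    """L1 norm of (recon - target), counted only where mask == 0."""
    return sum(abs(r - t) * (1.0 - m) for r, t, m in zip(recon, target, mask))


def pretrain_loss(
    x: Vector,
    y: Vector,
    mask: Vector,
    f: Callable[[Vector], Vector],
    g: Callable[[Vector], Vector],
    h: Callable[[Vector], Vector],
    det_loss: Callable[[Vector, Vector], float],
    lam: float,
) -> float:
    """L_total = L_d(g(f(X)), Y) + lam * ||(h(f(X) * M) - x) * (1 - M)||_1."""
    feats = f(x)
    kept = [v * m for v, m in zip(feats, mask)]  # element-wise product f(X) * M
    return det_loss(g(feats), y) + lam * masked_l1(h(kept), x, mask)


# Hypothetical stand-ins: identity networks and an L1 detection loss.
def identity(v: Vector) -> Vector:
    return list(v)


def l1(a: Vector, b: Vector) -> float:
    return sum(abs(p - q) for p, q in zip(a, b))


loss = pretrain_loss(
    x=[1.0, 2.0, 3.0, 4.0],
    y=[1.0, 2.0, 3.0, 5.0],
    mask=[1.0, 0.0, 1.0, 0.0],
    f=identity, g=identity, h=identity,
    det_loss=l1, lam=0.5,
)
print(loss)
```

With these toy inputs the detection term contributes 1.0 and the reconstruction term contributes 0.5 × 6.0, since only the two positions where the mask is 0 are counted.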

[0073] In the above scheme, step 2 includes the following steps:

[0074] Step 2.1 Based on the equipment identification model and backbone feature representation, a backbone feature distributor is obtained using multi-expert routing technology; (In power transmission and transformation defect identification, the appropriate task type is selected for routing distribution according to the equipment category)

[0075] In the power industry, we can register equipment for defect identification tasks based on factors such as equipment type and function. During registration, equipment can be bound to multiple expert models. After defect identification, the final decision-making module judges the defect report. A schematic diagram is shown in Figure 1.

[0076] Step 2.1.1 For the power transmission and transformation equipment, based on the characteristics of the defects contained in the equipment, obtain the equipment-routing mapping relationship table;

[0077] In defect identification of power transmission and transformation equipment, a single piece of equipment often has multiple defects to identify. For example, for the suspension clamp unit, we need to focus on hull corrosion, bolt loosening, and hull misalignment. Hull corrosion defects are characterized by localized features, which can be mapped to both a classification head model (classification expert routing) and a detection head model (detection expert routing). Hull misalignment defects have two types: significant misalignment and suspected misalignment. The former can be handled by both classification and detection expert routing, while the latter can only be handled by a perception head model (perception expert routing). Similarly, we can handle other equipment defects, obtaining a one-to-many "equipment-route" mapping table. After obtaining the mapping table, we configure the aggregation method for the task identification results of each equipment, such as intersection or union.
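A one-to-many mapping table with a per-equipment aggregation method might look like the following sketch. The equipment names, expert names, and the choice of union or intersection for each entry are hypothetical and serve only to show the data structure.

```python
from typing import Dict, List, Set

# Hypothetical one-to-many "equipment-route" table; names are illustrative only.
ROUTES: Dict[str, List[str]] = {
    "suspension_clamp": ["classification", "detection", "perception"],
    "vibration_damper": ["detection", "perception"],
}

# Hypothetical aggregation method configured per equipment type.
AGGREGATION: Dict[str, str] = {
    "suspension_clamp": "union",
    "vibration_damper": "intersection",
}


def route(equipment: str) -> List[str]:
    """Return the expert routes registered for an equipment type."""
    return ROUTES.get(equipment, [])


def aggregate(equipment: str, expert_outputs: Dict[str, Set[str]]) -> Set[str]:
    """Fuse the defect sets reported by each routed expert."""
    results = [expert_outputs.get(name, set()) for name in route(equipment)]
    if not results:
        return set()
    if AGGREGATION.get(equipment, "union") == "intersection":
        return set.intersection(*results)
    return set.union(*results)


clamp = aggregate("suspension_clamp", {
    "classification": {"corrosion"},
    "detection": {"corrosion", "bolt_loosening"},
    "perception": {"suspected_misalignment"},
})
damper = aggregate("vibration_damper", {
    "detection": {"slippage", "corrosion"},
    "perception": {"slippage"},
})
print(sorted(clamp), sorted(damper))
```

Union favors recall (any expert can raise a defect), while intersection favors precision (all routed experts must agree), which is one reason the aggregation method is worth configuring per equipment type.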

[0078] Step 2.1.2 Based on the device identification results of the baseline model, assign expert models to the corresponding device regions according to the predefined device-route mapping relationship to obtain the backbone feature distributor.

[0079] The feature distributor adaptively scales the task area size based on the obtained feature region labels and routing types, performing defect identification on the device within the area. The identification results are categorized into target device and non-target device identification results. For target devices, a predefined device aggregation method is used to fuse the identification results. Non-target devices are grouped into a global non-target device set, and target devices are excluded from this set to prevent duplication. The device aggregation method is applied to the global non-target device set, and hyperparameters are set to filter the fusion targets, resulting in the final supplementary identification. The target device identification results and non-target device identification results are merged to obtain the isomorphic model of the device identification result.

[0080] In the above scheme, step 3 includes the following steps:

[0081] Step 3.1 For the backbone feature representation, use FPN technology to obtain the fused feature representation;

[0082] The backbone has *n* intermediate layer outputs, and traditional classification models use the results of the last layer for feature reprocessing. This approach leads to the loss of basic shallow features, which are crucial for semantic discrimination. In detection models, structures like FPN and PAN are introduced to fuse features from different levels for dense prediction. Inspired by detection models, we use a simple FPN structure to fuse these features. Its structure is the same as the traditional FPN, but it uses large-kernel convolutions to expand the receptive field of the corresponding layers and window attention to enhance local feature representations. Using two operators allows for feature mining from different dimensions and increases the number of model parameters. Therefore, we increase the number of modules in the traditional FPN from 2 *n* to 3 *n*, ensuring the model has sufficient parameters to learn classification features.

[0083] $fpn_i = \mathrm{CNN}(\mathrm{WindowAttn}(b_{out_i}))$

[0084] Here, $fpn_i$ is obtained by directly processing the i-th level output of the backbone. The fusion process is as follows:

[0085] $fpn_{i-1} = fpn_{i-1} + \mathrm{Interpolate}(fpn_i),\ i > 1$

[0086] Interpolate is an interpolation function that rescales the i-th level fpn features to the spatial size of the (i-1)-th level fpn features (upsampling); the fused features are then obtained by element-wise addition. The final output undergoes the following processing:

[0087] $FPN_{out} = \mathrm{Concat}([\mathrm{AP}(\mathrm{LCNN}(fpn_0)), \ldots, \mathrm{AP}(\mathrm{LCNN}(fpn_{cn}))])$

[0088] In this context, AP stands for Adaptive Pooling Operator, and LCNN stands for Layer Convolutional Module.
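The fusion and pooling steps can be illustrated on 1-D toy features. This is a minimal sketch, assuming identity stand-ins for CNN, WindowAttn, and LCNN, nearest-neighbour interpolation, and average pooling; the patent does not specify these details, and real features would be multi-channel 2-D maps.

```python
from typing import List

Feature = List[float]


def interpolate(feat: Feature, size: int) -> Feature:
    """Nearest-neighbour resize of a 1-D feature to `size` elements."""
    n = len(feat)
    return [feat[i * n // size] for i in range(size)]


def adaptive_avg_pool(feat: Feature, out_size: int) -> Feature:
    """Average-pool a 1-D feature into `out_size` roughly equal bins."""
    n = len(feat)
    pooled = []
    for j in range(out_size):
        start = j * n // out_size
        end = max(start + 1, (j + 1) * n // out_size)
        pooled.append(sum(feat[start:end]) / (end - start))
    return pooled


def enhanced_fpn(backbone_outs: List[Feature], pool_size: int = 1) -> Feature:
    """Top-down fusion followed by per-level pooling and concatenation."""
    # Stand-in for CNN(WindowAttn(b_out_i)): an identity copy (assumption).
    fpn = [list(b) for b in backbone_outs]
    # Fusion for i > 1, as in the formula: fpn[i-1] += Interpolate(fpn[i]).
    for i in range(len(fpn) - 1, 1, -1):
        up = interpolate(fpn[i], len(fpn[i - 1]))
        fpn[i - 1] = [a + b for a, b in zip(fpn[i - 1], up)]
    # Stand-in for LCNN: identity. Then AP per level and Concat.
    out: Feature = []
    for level in fpn:
        out.extend(adaptive_avg_pool(level, pool_size))
    return out


fused = enhanced_fpn([[1.0, 1.0, 1.0, 1.0], [2.0, 2.0], [4.0]])
print(fused)
```

Because the formula restricts fusion to i > 1, level 0 is left unfused in this sketch; adaptive pooling is what lets levels of different spatial size be concatenated into one fixed-length vector.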

[0089] Step 3.2 For the fused feature representation, use fully connected convolutional technology to obtain the classification model output;

[0090] After obtaining $FPN_{out}$ from the FPN, the processing formula is:

[0091] $category = \mathrm{Sigmoid}(\mathrm{FC}(FPN_{out}))$

[0092] The FC function performs a fully connected operation on the FPN output, the Sigmoid function smooths the values, and finally, softmax is applied to the category to obtain the predicted probability for each category.
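As a rough illustration of this head, the sketch below applies one linear layer, a sigmoid, and then a softmax over the categories. The weights, bias, and input vector are made-up toy values, and the single linear layer is an assumption about what "FC" denotes here.

```python
import math
from typing import List


def fully_connected(x: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """One linear layer: out[c] = sum_j weights[c][j] * x[j] + bias[c]."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(fpn_out: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """category = Sigmoid(FC(FPN_out)), then softmax over categories."""
    category = [sigmoid(z) for z in fully_connected(fpn_out, weights, bias)]
    return softmax(category)


# Toy values (hypothetical): 3 pooled features, 2 defect categories.
probs = classify(
    [1.0, 6.0, 4.0],
    weights=[[0.1, 0.0, 0.0], [0.0, 0.1, 0.0]],
    bias=[0.0, 0.0],
)
print(probs)
```

Note that the sigmoid bounds each score to (0, 1) before the softmax, so the resulting probabilities are less peaked than a softmax over raw logits would be.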

[0093] A diagram illustrating the classification expert routing is shown in Figure 2.

[0094] In the above scheme, step 4 includes the following steps:

[0095] Step 4.1 For the so-called auxiliary branching technique, the cascade technique is used to connect more detection heads in series on the detection head to obtain cascaded detection head prediction;

[0096] Recently, the DETR head has undergone significant improvements, resulting in greatly enhanced prediction performance and consistently ranking among the top in object detection leaderboards. To enhance recall capability for one-to-one output, this patent employs cascade technology. On an n-layer decoder, the configuration of the last layer used for the output layer is serially stacked m times, thus yielding an m+n-1 layer cascade decoder.

[0097] The first n-1 layers of the decoder process are as follows:

[0098] $DeLn_{0:n-1} = \mathrm{Transformer}(last_{out}, Q, Vis)$

[0099] Where Q is the query array of the main branch, $last_{out}$ is the output of the previous layer of the main branch, Vis is the visual feature output from the backbone, and Transformer is the cross-attention module. The m prediction layers in the cascade part are processed as follows:

[0100] $DeLn_{n-1:m+n-1} = \mathrm{Transformer}(last_{out}, last'_{out}, Q, Vis)$

[0101] $last'_{out}$ is the gradient-detached copy of the previous layer's output. Detaching the gradient helps the current layer stabilize quickly during training. At the same time, so that both gradients and the current layer's loss can propagate between layers, one branch passes gradients to the current layer and updates its parameters, while another branch is used to update the parameters of the previous layer. That is, the last m prediction layers use look-forward-twice.

[0102] During training, all m+n-1 layers of the decoder need to be aligned with the labels, but during the inference phase, only the last m layers participate in prediction (see Figure 3).
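The layer bookkeeping of the cascade can be made concrete with a small sketch. It only models which layers exist and which take part in inference; the layer kinds and the helper name `build_cascade` are hypothetical, and gradient detachment is not modelled because it needs an autograd framework.

```python
from typing import List, Tuple


def build_cascade(n: int, m: int) -> List[Tuple[str, bool]]:
    """Return (layer kind, used at inference) for an m+n-1 layer cascade decoder.

    The first n-1 layers are ordinary decoder layers; the output-layer
    configuration of the original n-layer decoder is stacked m times.
    """
    if n < 1 or m < 1:
        raise ValueError("n and m must be positive")
    plain = [("decoder", False)] * (n - 1)
    cascade = [("prediction", True)] * m
    return plain + cascade


layers = build_cascade(n=6, m=2)
supervised = len(layers)  # every layer is aligned with labels in training
inference = [kind for kind, used in layers if used]
print(supervised, len(inference))
```

For a 6-layer decoder with the output layer stacked twice, seven layers are supervised during training and two contribute predictions at inference.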

[0103] Step 4.2 For the so-called auxiliary branch technique, a parallel auxiliary task is used to share the weights of the decoder in the detector, and the target detection loss is calculated using different types of objective functions to obtain the optimized shared weight decoder;

[0104] Step 4.1 primarily focuses on increasing the number of target alignment iterations to enhance the representational potential of one-to-one predictions. In the detection field, dense prediction is a crucial means of preserving recall, complementing the shortcomings of traditional one-to-one methods. The CO-DETR paper has already demonstrated that using parallel hybrid one-to-many tasks, while sharing encoder weights, significantly improves the model's recognition capabilities. Inspired by CO-DETR, we integrate the ATSS, FCOS, Faster R-CNN, RetinaNet, and CornerNet detection heads. Through multi-head alignment, the instability of Hungarian matching in one-to-one predictions is effectively resolved, ensuring rapid model convergence and improving learning capabilities.

[0105] Let MT = (ATSS, FCOS, Faster R-CNN, RetinaNet, CornerNet) represent different task sets. Using weight parameters λ = (λ1, λ2, λ3, λ4, λ5) to harmonize different types of target alignment losses, the training objective for the parallel task-assisted task can be designed as follows:

[0106]

[0107] Here, $MT_i$ is the loss function for the corresponding task, X is the input image, Y is the detection label, f is the backbone encoder function, and MSA is the multi-scale adapter, used to adapt features of different scales for different tasks. With the assistance of these one-to-many detection heads, a large amount of object detection supervision is injected into the one-to-one branch through the parallel branches and simply accumulates there, helping the DETR head to converge quickly and stably.
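The formula for this objective is not reproduced in the text, so the following sketch assumes the simplest form consistent with the symbol definitions: a weighted sum of per-task losses computed on adapted backbone features. The two toy loss functions merely stand in for real detection-head losses such as ATSS or FCOS, and the identity adapter stands in for MSA.

```python
from typing import Callable, Dict, List

Feature = List[float]
LossFn = Callable[[Feature, Feature], float]


def auxiliary_objective(
    features: Feature,
    labels: Feature,
    task_losses: Dict[str, LossFn],
    weights: Dict[str, float],
    adapter: Callable[[Feature], Feature],
) -> float:
    """Assumed form: sum_i lambda_i * MT_i(MSA(f(X)), Y)."""
    adapted = adapter(features)
    return sum(weights[name] * loss(adapted, labels) for name, loss in task_losses.items())


# Toy stand-ins (hypothetical), not the actual head losses.
def l1_loss(pred: Feature, target: Feature) -> float:
    return sum(abs(p - t) for p, t in zip(pred, target))


def squared_loss(pred: Feature, target: Feature) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, target))


total = auxiliary_objective(
    features=[1.0, 2.0],
    labels=[0.0, 2.0],
    task_losses={"atss_like": l1_loss, "fcos_like": squared_loss},
    weights={"atss_like": 0.5, "fcos_like": 2.0},
    adapter=lambda feats: list(feats),  # identity stand-in for MSA
)
print(total)
```

Because every auxiliary loss is computed from the same shared features, each task's gradient reaches the shared weights, which is the mechanism the passage credits for faster convergence.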

[0108] In the above scheme, step 5 includes the following steps:

[0109] Step 5.1 For the aforementioned dynamic attention mechanism, a hybrid technique of window attention and global attention is used to obtain a spatial scale adaptive attention module; (combining global and local attention, and combining self-attention and cross attention, to generate a spatial scale adaptive attention structure)

[0110] Since the introduction of self-attention in the Transformer paper, attention modules have undergone a long period of development, with the QKV computation mechanism still dominating the foundational modules of deep learning. To address the quadratic growth in the computational cost of Transformer modules with sequence length, researchers have recently developed architectures such as Mamba, WindowAtten, and DLformer. Mamba introduced a state space model, reducing operator complexity to linear. WindowAtten performs the Transformer computation within smaller windows of the image, balancing efficiency and performance. In the QWEN2.5-VL model, the encoder mixes Full Attention and Window Attention to enhance global perception and to provide a shortcut for encoding features with sparse relational semantics. DLformer's architecture predicts remaining lifetime from fine-grained to coarse-grained levels. Because coarse-grained feature representations are fused with fine-grained features, it achieves feature reuse and weights the contributions of different time steps to improve performance.

[0111] Inspired by the above three technologies, the basic module of the environmental perception model of this invention first encodes each patch of the image into a hidden space, where the latent variables of the patch are represented as follows:

[0112] $h(patch) = \mathrm{FC}(\mathrm{PC}(\mathrm{DWConv}(x_{patch})))$

[0113] DWConv is a depthwise convolution, and PC is a pointwise convolution. After two convolutions, a local feature map with 1 channel is obtained. Flatten is used to expand the space, and finally a fully connected layer is used to obtain the latent space feature representation.

[0114] Construct a global patchAtten using the hidden representation of the patch:

[0115]

[0116] Each FC performs feature mining to form a different feature representation, and patchAtten is obtained with the self-attention calculation formula. A further controllable parameter k is needed, giving the number of patches gathered for the subsequent attention calculation. k is calculated as follows:

[0117] k = DF(h)

[0118] DF is a distribution prediction function. Similar to the DFL technique, it predefines a set of bias terms and predicts a probability for each of them, from which the number of regions is obtained. Based on the output k and patchAtten, each patch selects its top-k other patches to update the features of the current region, thus obtaining a cross-region spatial adaptive attention module.
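One possible reading of the DF function is sketched below: a set of candidate region counts is fixed in advance, a probability is predicted for each, and k is taken as the rounded expected value. The candidate values, the use of an expectation, the rounding, and the name `predict_k` are all assumptions for illustration; the text only states that probabilities over predefined terms are predicted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_k(logits, candidates):
    """Assumed DF sketch: expected value of predefined candidates, rounded, at least 1."""
    probs = softmax(logits)
    expected = sum(p * c for p, c in zip(probs, candidates))
    return max(1, round(expected))


# Hypothetical candidate region counts; the logits would come from a layer applied to h.
candidates = [1, 2, 4, 8]
k_uniform = predict_k([0.0, 0.0, 0.0, 0.0], candidates)   # expectation 3.75, rounds to 4
k_peaked = predict_k([100.0, 0.0, 0.0, 0.0], candidates)  # mass almost entirely on 1
```

Taking an expectation keeps the prediction differentiable up to the final rounding, which is one reason distribution-based regression is often used for quantities like this.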

[0119]

[0120] Here, AG is an operator that fuses the top-k region features, for example by adding them channel-wise and then scaling the result. This yields the spatial scale adaptive attention module.
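The top-k selection and the AG fusion can be sketched as follows. Because the formulas of [0115] and [0119] are not reproduced in this text, several choices here are assumptions: the FC projections are left out (treated as identity), similarity is a scaled dot product, AG is a channel-wise sum scaled by 1/k, and the fused features are added to the current patch as a residual. The row-wise softmax is skipped because it does not change the ranking used for selection.

```python
import math


def attention_scores(h):
    """Scaled dot-product similarity between every pair of patch latents."""
    d = len(h[0])
    return [[sum(a * b for a, b in zip(hi, hj)) / math.sqrt(d) for hj in h]
            for hi in h]


def aggregate(features):
    """Assumed AG: add the selected features channel-wise, then scale by 1/count."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def topk_patch_update(h, k):
    """Update each patch with the fused features of its top-k other patches."""
    k = max(1, min(k, len(h) - 1))  # needs at least two patches
    scores = attention_scores(h)
    updated = []
    for i, hi in enumerate(h):
        others = [j for j in range(len(h)) if j != i]
        top = sorted(others, key=lambda j: scores[i][j], reverse=True)[:k]
        fused = aggregate([h[j] for j in top])
        updated.append([a + b for a, b in zip(hi, fused)])  # assumed residual update
    return updated


# Four hypothetical 2-dimensional patch latents.
latents = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
updated_k1 = topk_patch_update(latents, 1)
updated_k2 = topk_patch_update(latents, 2)
```

With k = 1, the first patch is updated only by its most similar neighbour (the second patch); with k = 2 the third patch also contributes, with half the weight each.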

[0121] Step 5.2: Replace the shallow layers of the QWEN2.5-VL encoder with the spatial scale adaptive attention module to obtain the encoder of the environment perception model;

[0122] The encoder uses apAtten layers in its first half, the local attention encoding layers of the QWEN2.5-VL model in the subsequent layers, and a global attention encoding layer as the last layer. This gives an encoder that strengthens shallow feature association and environmental semantic awareness; its structure is shown in Appendix 4.
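The layer arrangement of [0122] can be written down as a simple schedule. This is a sketch only: the function name, the layer labels, the rounding of "first half" for odd depths, and the example depth of 8 are assumptions, not values taken from the text or from the QWEN2.5-VL configuration.

```python
def build_layer_schedule(num_layers):
    """Return one attention type per encoder layer, following [0122]."""
    if num_layers < 3:
        raise ValueError("need at least 3 layers to use all three attention types")
    half = num_layers // 2  # first half: spatial scale adaptive attention
    return (["apAtten"] * half
            + ["window"] * (num_layers - half - 1)  # local attention layers
            + ["global"])                           # last layer: global attention


schedule = build_layer_schedule(8)
print(schedule)
```

For a depth of 8 this yields four apAtten layers, three window attention layers, and one global attention layer.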

[0123] Step 5.3: Combine the encoder of the environment perception model with a semantic decoder to obtain the environment perception head model; (a conventional operation)

[0124] The encoder first performs several rounds of feature mining across spatial regions, then fully mines visual semantics with a self-attention mechanism, and finally refines these features with global attention. In the decoding stage, a mature visual decoder is used to obtain a discriminative analysis of the devices in the environment and their attributes.

Claims

1. A method for identifying defects during unmanned aerial vehicle (UAV) patrols in power transmission and transformation based on an isomorphic model, characterized in that it includes:
Step 1: The SwinL model is used as the backbone network of the isomorphic model and is pre-trained through a device detection task and a feature mask reconstruction task; the pre-trained device recognition model is used as the baseline model for all defect recognition;
Step 2: For the baseline model, multi-expert routing technology is used to distribute the local feature representations output by the backbone network to the corresponding expert models;
Step 3: For component- and instance-level defects and local feature representations, the enhanced FPN framework is used to fuse multi-level features of the backbone network to obtain a spatial adaptive classification head model, and the spatial adaptive classification result is output;
Step 4: For equipment-level defects, a multi-dimensional auxiliary detection head model is generated using cascaded detection heads and parallel auxiliary task technology, and the local defect identification results are output;
Step 5: For environmental perception defects, an environmental perception head model is generated through a spatial scale adaptive attention mechanism, and global defect identification results are output;
wherein Step 2 includes:
Step 2.1: Based on the defect characteristics of the power transmission and transformation equipment, obtain the equipment-route mapping relationship table, wherein the equipment-route mapping relationship table is a one-to-many mapping;
Step 2.2: According to the equipment-route mapping relationship table, assign the device regions output by the baseline model to the corresponding expert models to obtain the backbone network feature distributor; the feature distributor adaptively scales the task area size based on the feature region label and routing type, and divides the identification results into target device identification results and non-target device identification results; the target device identification results are fused using a predefined device aggregation method; the non-target device identification results are collected into a global non-target device set, to which, after target devices are excluded, a device aggregation method is applied with hyperparameters set for filtering; the target device identification results and non-target device identification results are then merged to obtain the device identification results;
and wherein, in Step 3, for component-level defects and local feature representations, the enhanced FPN framework is used to fuse multi-level features of the backbone network to obtain the spatial adaptive classification head model; the fusion process of the enhanced FPN framework is expressed by a formula whose terms denote, respectively: the level output features of the backbone network; CNN, a convolutional neural network; the window attention mechanism; the interpolation function; the adaptive pooling operator; LCNN, a layer convolutional module; and cn, the number of intermediate layers in the backbone network; and the classification model output is obtained using fully connected convolutional techniques by a calculation in which FC represents a fully connected layer and Sigmoid represents an activation function.

2. The method according to claim 1, characterized in that: the SwinL model is used as the backbone network of the isomorphic model and is pre-trained using the device detection task and the feature mask reconstruction task; the loss function for pre-training is expressed by a formula whose terms denote, respectively: the backbone network encoder function; a detection head; a mask reconstruction model; M, a binary mask matrix representing the region to be masked; and the weighted average of the losses of the two tasks.

3. The method according to claim 1, characterized in that Step 4 includes:
Step 4.1: Using cascade technology, multiple detectors are connected in series at the detection head to obtain cascaded detector predictions; the processing of the first n-1 layers of the decoder and the processing of the m prediction layers in the cascade part are each expressed by a formula, whose terms denote, respectively: the output of the previous level of the main branch; Q, the query array of the main branch; the visual features output by the backbone network; and the gradient-truncated copy of the previous layer's output;
Step 4.2: Using parallel auxiliary tasks, the weights of the decoder in the detector are shared, and the object detection loss is calculated using different types of objective functions to obtain the optimized shared-weight decoder; the training objective is expressed by a formula whose terms denote, respectively: the input image; the detection label; the backbone network encoder function; MSA, the multi-scale adapter; the loss function of the corresponding task; and the weighting parameter.

4. The method according to claim 1, characterized in that Step 5 includes:
Step 5.1: For the dynamic attention mechanism, window attention and global attention are combined to obtain a spatial scale adaptive attention module, where the latent variable of an image patch is represented as $h_{patch} = \mathrm{FC}(\mathrm{PC}(\mathrm{DWConv}(x_{patch})))$, the number of regions is predicted as $k = \mathrm{DF}(h)$, and the window attention and the spatial adaptive attention module are each expressed by a formula; here $x_{patch}$ represents an image patch, DWConv indicates depthwise convolution, PC indicates pointwise convolution, FC indicates a fully connected layer, the fully connected layers in the window attention are used for feature mining, DF represents the distribution prediction function, AG represents the region feature fusion operator, and k indicates the number of regions selected; the window attention formula additionally uses the feature dimension;
Step 5.2: Replace the shallow layers of the QWEN2.5-VL model encoder with the spatial scale adaptive attention module to obtain the encoder of the environment perception model;
Step 5.3: Combine the encoder with the semantic decoder to obtain the environment perception head model.