A power transmission channel hidden danger identification method and system based on a hybrid backbone network
By combining the dynamic denoising DINO model and the MambaVision module into a hybrid backbone network, the adaptability and accuracy issues of power transmission channel hazard identification technology in complex environments are solved, enabling efficient identification and positioning of dynamic targets and improving the intelligent inspection capabilities of the power system.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- STATE GRID JIANGSU ELECTRIC POWER CO LTD NANJING POWER SUPPLY COMPANY
- Filing Date
- 2026-02-26
- Publication Date
- 2026-06-19
Abstract
Description
Technical Field
[0001] This invention belongs to the interdisciplinary field of computer vision and power engineering and relates to visual recognition of hidden dangers in power transmission channels. In particular, it relates to a method and system for identifying hidden dangers in power transmission channels based on a hybrid backbone network. The technical solution integrates the DINO model with dynamic denoising and MambaVision with enhanced time series processing to accurately identify hidden dangers in power transmission channels.
Background Technology
[0002] With the continuous development of power systems and the increasing complexity of transmission lines, the detection of hidden dangers in transmission channels has become a crucial link in ensuring the safe and stable operation of power systems. Transmission lines often face various potential hazards, such as insulator spontaneous explosion, foreign object hanging, tree encroachment, and bird damage, which may lead to serious power accidents. Therefore, how to efficiently and accurately detect and identify hidden dangers in transmission channels is a significant challenge in current intelligent monitoring of power systems.
[0003] Current methods for identifying potential hazards in power transmission channels primarily rely on target detection techniques based on deep learning. While these techniques have achieved some success, they suffer from insufficient adaptability and reduced accuracy in complex environments. Current approaches employing a Transformer-based visual model, which gains powerful spatial feature extraction capabilities through pre-training on large datasets and then fine-tuning in downstream tasks, demonstrate excellent performance in single-frame image target detection. However, these methods have two inherent limitations: First, they struggle to effectively model temporal information, failing to utilize the dynamic evolution patterns between consecutive frames to aid in identifying temporal hazards such as floating object movement and conductor galloping. They also lack the ability to reason about and compensate for single-frame occlusion and false detections through temporal context. Second, the models' robustness and generalization ability remain insufficient when dealing with real-world scenarios such as drastic changes in lighting, extreme weather, and complex background interference, leading to a significant decrease in detection accuracy in complex and ever-changing real-world environments.
[0004] To overcome these problems, this invention proposes a novel method for identifying potential hazards in power transmission channels that combines the DINO model with dynamic denoising and the MambaVision module with enhanced time series processing. The aim is to leverage both the global spatial perception advantage of the Transformer and the efficient long-range time series modeling capability of the State Space Model (SSM) to comprehensively improve the overall performance of the detection system.
Summary of the Invention
[0005] The purpose of this invention is to overcome the shortcomings of existing technologies, such as difficulty in detecting small targets in complex environments and poor adaptability to dynamic scenes, and to provide a method and system for identifying potential hazards in power transmission channels based on a hybrid backbone network.
[0006] First, a dataset of potential hazards in power transmission channels was constructed. This dataset, comprising 2749 images, is derived from on-site inspection data from multiple provincial power companies under the State Grid Corporation of China, comprehensively covering different seasons, weather conditions, and day / night scenarios. The dataset focuses on typical hazards such as insulator spontaneous explosions, loose hardware, bird nests within the transmission channel, plastic film, excessively tall trees (tree obstructions), and construction machinery. Labeling was performed using the LabelMe tool, and the labeling format was uniformly converted to the COCO dataset standard format. To improve the model's generalization ability and robustness under adverse weather and complex backgrounds, this invention systematically implemented data augmentation, including random flipping, rotation, scaling, color adjustment, noise addition, and simulated occlusion. Finally, all images were randomly divided into training, validation, and test sets in a 7:2:1 ratio.
[0007] Subsequently, the cutting-edge DINO detection model was enhanced and improved for power transmission scenarios. While the DINO model possesses powerful end-to-end detection capabilities and excellent small target detection performance, its native backbone network has limitations in processing dynamic targets and temporal evolution features in power transmission channel images. Therefore, the core innovation of this invention lies in designing a novel CSP-MambaVision hybrid backbone network to replace the traditional backbone network, providing a more powerful multi-scale feature representation for the DINO detection head.
[0008] The improved model architecture mainly consists of the CSP-MambaVision backbone network, the Feature Pyramid Network (FPN), and the DINO encoder-decoder. The backbone network adopts a four-stage progressive structure, with each stage integrating convolutional layers and the CSP-MambaVision Block module. This innovatively combines the local detail extraction capability of convolution with the long-range dependency and temporal dynamic modeling capabilities of state-space sequence models. In the core CSP-MambaVision Block, this invention introduces a cross-stage partial connection concept, dividing the feature map into deep processing branches and direct connection branches, which are then concatenated and fused. This achieves comprehensive utilization of multi-scale information, significantly improving the representation capability of targets at different scales, from macroscopic iron towers to microscopic cracks.
[0009] To verify the effectiveness of the improved scheme, gradient descent optimization was performed on the training set, and the training process was monitored and hyperparameters were adjusted using the validation set. Finally, the model performance was evaluated on an independent test set. Key metrics such as precision, recall, mean average precision (mAP), computational cost (FLOPs), and number of model parameters were calculated to comprehensively ensure its accuracy, efficiency, and deployment feasibility in the task of identifying potential hazards in power transmission channels.
[0010] Finally, the optimal model weights obtained from training are deployed into the monitoring system of the power transmission channel for automatic identification and location of potential hazards such as insulator spontaneous explosions and foreign object suspension. This system can adapt to complex changes in the field environment, providing reliable technical support for the intelligent inspection and safe and stable operation of power facilities.
[0011] This invention is achieved through the following technical solution:
[0012] A method for identifying potential hazards in power transmission channels based on a hybrid backbone network includes the following steps:
[0013] S1. Collect images of potential hazards in power transmission channels, preprocess the images, construct a dataset of potential hazards in power transmission channels, and divide the dataset into a training set, a validation set, and a test set.
[0014] S2. Construct the DINO model, which includes a CSP-MambaVision hybrid backbone network, a pyramid network FPN, and a Transformer encoder-decoder. The CSP-MambaVision hybrid backbone network extracts multi-scale features from the power transmission channel hazard dataset. The extracted multi-scale features are then fused and enhanced by the pyramid network FPN. The enhanced features are serialized. The encoder of the Transformer encoder-decoder deeply fuses global features through a self-attention mechanism, gathers image context information to distinguish similar targets, and the decoder interacts with object queries through cross-attention to focus on the potential target location and generate preliminary prediction results.
[0015] S3. Combining denoising training and Top-K feature selection optimization, the preliminary prediction results are optimized, and the hazard bounding box and hazard category prediction results are output.
[0016] S4. The DINO model is trained, validated, and tested using divided training, validation, and test sets, thereby optimizing the DINO model.
[0017] Step S1 specifically includes the following:
[0018] 1.1 Data Collection: The images of potential hazards in power transmission channels are derived from on-site inspection data, comprehensively covering the power transmission channel environment in different seasons, under different weather conditions, and by day and night, totaling 2749 images;
[0019] 1.2 Image Annotation
[0020] The collected images were annotated using the LabelMe annotation tool to complete the bounding box annotation and type label definition of typical hidden dangers in the power transmission channel, and the annotation results were converted from the native LabelMe format to the standard format of the COCO dataset.
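The format conversion described above can be sketched as follows. This is a minimal illustration, assuming rectangle annotations stored in the usual LabelMe JSON keys (`shapes`, `label`, `points`, `shape_type`, `imagePath`, `imageWidth`, `imageHeight`); the function name `labelme_to_coco` and the category names in the usage note are hypothetical and do not come from the patent.

```python
def labelme_to_coco(labelme_docs, category_names):
    """Convert LabelMe-style dicts (rectangle shapes only) into one COCO-style dict."""
    cat_ids = {name: i + 1 for i, name in enumerate(category_names)}
    coco = {
        "images": [],
        "annotations": [],
        "categories": [{"id": cid, "name": name} for name, cid in cat_ids.items()],
    }
    ann_id = 1
    for image_id, doc in enumerate(labelme_docs, start=1):
        coco["images"].append({
            "id": image_id,
            "file_name": doc["imagePath"],
            "width": doc["imageWidth"],
            "height": doc["imageHeight"],
        })
        for shape in doc["shapes"]:
            if shape.get("shape_type") != "rectangle":
                continue  # this sketch handles bounding boxes only
            xs = [p[0] for p in shape["points"]]
            ys = [p[1] for p in shape["points"]]
            x, y = min(xs), min(ys)
            w, h = max(xs) - x, max(ys) - y
            coco["annotations"].append({
                "id": ann_id,
                "image_id": image_id,
                "category_id": cat_ids[shape["label"]],
                "bbox": [x, y, w, h],  # COCO boxes are [x_min, y_min, width, height]
                "area": w * h,
                "iscrowd": 0,
            })
            ann_id += 1
    return coco
```

Calling `labelme_to_coco([doc], ["bird_nest", "plastic_film"])` on one parsed LabelMe file returns a dictionary that can be written out with `json.dump`. The corner points are normalized with `min`/`max` because a LabelMe rectangle may be drawn from any corner.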
[0021] 1.3 Data Augmentation and Partitioning
[0022] Data augmentation strategies were implemented on the power transmission channel hazard dataset, including geometric transformations such as random horizontal and vertical flipping, small-angle rotation, random scaling, and cropping. At the same time, complex and variable lighting conditions and weather effects were simulated by adjusting image brightness, contrast, saturation, and adding Gaussian noise. The 2749 images were randomly divided into training, testing, and validation sets in a 7:2:1 ratio.
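A random 7:2:1 partition of this kind can be sketched as below. The function name `split_dataset`, the fixed seed, and the floor-based rounding are assumptions made for illustration; the three returned subsets follow the order in which the ratio is stated.

```python
import random


def split_dataset(items, ratios=(7, 2, 1), seed=0):
    """Shuffle items and cut them into three subsets following the given ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    total = sum(ratios)
    n_first = len(items) * ratios[0] // total
    n_second = len(items) * ratios[1] // total
    first = items[:n_first]
    second = items[n_first:n_first + n_second]
    third = items[n_first + n_second:]  # the remainder absorbs rounding leftovers
    return first, second, third
```

With 2749 images and floor rounding, the subsets contain 1924, 549, and 276 images respectively.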
[0023] The specific working steps of the DINO model are as follows:
[0024] (1) Multi-scale feature extraction stage
[0025] First, the input feature map $X$, where $H$ is its height and $W$ its width, is fed into the CSP-MambaVision hybrid backbone network. The CSP-MambaVision hybrid backbone network processes the image using a hierarchical downsampling structure, progressively downsampling while increasing the number of channels, and ultimately outputs a set of multi-scale feature maps, denoted $\{F_i\}$. Subsequently, these multi-scale features are fused and enhanced by the Feature Pyramid Network (FPN) to generate a set of feature maps for detection, denoted $\{P_i\}$;
[0026] (2) Transformer encoding and decoding stage
[0027] First, the feature maps output by the Feature Pyramid Network (FPN) pass through a projection layer that unifies their channel dimensions. They are then flattened and concatenated into a sequence of feature tokens, to which positional encoding is added. This sequence is input into the Transformer encoder, which uses a multi-layer self-attention mechanism to deeply fuse and enhance global features. This allows each feature point to incorporate contextual information from the entire image, effectively distinguishing between similar-looking normal devices and potential hazards. Subsequently, the Transformer decoder receives a set of learnable object queries and the enhanced features from the encoder. Through a cross-attention mechanism, each object query interacts with the global features, gradually focusing on the potential target location. Finally, the queries are decoded into a series of preliminary prediction outputs.
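The flattening step can be illustrated with simple bookkeeping: once every pyramid level has the same channel dimension, each spatial position becomes one token, and the levels are laid end to end. The sketch below only tracks token counts and per-level offsets; the function name `flatten_levels` and the level sizes in the usage note are hypothetical.

```python
def flatten_levels(shapes):
    """shapes: list of (height, width) per pyramid level.

    Returns (total_tokens, level_start_index), where level_start_index[i]
    is the offset of level i inside the concatenated token sequence.
    """
    starts, total = [], 0
    for h, w in shapes:
        starts.append(total)
        total += h * w  # one token per spatial position
    return total, starts
```

For three assumed levels of size 80x80, 40x40, and 20x20, the sequence holds 8400 tokens with level offsets 0, 6400, and 8000.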
[0028] (3) Prediction, optimization and output stage
[0029] The DINO model introduces advanced mechanisms such as denoising training and hybrid query selection. During the training phase, a portion of noisy ground truth bounding boxes are added to the input of the decoder, and the model learns to reconstruct the real target in the presence of interference. At the same time, the model selects the most significant Top-K feature points from the features output by the encoder to initialize the anchor boxes. Finally, the output processed by the decoder is passed through the prediction head to generate the final target bounding box and specific category prediction results.
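The two mechanisms named above can be sketched in a few lines. This is an illustrative approximation only: the noise form (shifting the box center and rescaling its size by bounded random factors), the scale values, and the names `noise_box` and `top_k_indices` are assumptions, not the patent's implementation.

```python
import random


def noise_box(box, shift_scale=0.4, size_scale=0.4, rng=None):
    """Jitter a (cx, cy, w, h) ground truth box: shift the center, rescale the size."""
    rng = rng or random.Random()
    cx, cy, w, h = box
    cx += rng.uniform(-1, 1) * shift_scale * w / 2  # center stays inside the box
    cy += rng.uniform(-1, 1) * shift_scale * h / 2
    w *= 1 + rng.uniform(-1, 1) * size_scale
    h *= 1 + rng.uniform(-1, 1) * size_scale
    return (cx, cy, w, h)


def top_k_indices(scores, k):
    """Indices of the k highest-scoring encoder positions, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

During training, the noised boxes would be fed to the decoder alongside the regular queries with the original boxes as reconstruction targets, while the selected top-K positions would seed the initial anchor boxes.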
[0030] The CSP-MambaVision hybrid backbone network adopts a progressive structure consisting of four stages, Stage 1 to Stage 4. Each stage sequentially includes convolutional layers and a CSP-MambaVision Block module, achieving progressive deepening and dimensional transformation of the input features. The input to the CSP-MambaVision hybrid backbone network is a preprocessed power transmission channel image. The image first enters Stage 1, where the initial convolutional layer performs initial feature extraction and spatial downsampling. Subsequently, the CSP-MambaVision Block module captures the spatial correlation and preliminary long-distance dependency of the features. Stage 2 receives the output features from Stage 1, further reduces the spatial size of the feature map and increases the channel dimension through convolutional layers to learn richer feature representations, and its CSP-MambaVision Block module strengthens the capture of correlations between features in different regions. Stages 3 and 4 follow the same logic, combining convolutional downsampling with deep semantic modeling by the CSP-MambaVision Block module. As the receptive field of the CSP-MambaVision hybrid backbone network continues to expand, it eventually gains the ability to understand image content from a global perspective, thereby accurately identifying a wide range of hidden dangers and encoding their high-level semantic attributes.
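The four-stage progression of shrinking spatial size and growing channel count can be made concrete with a shape calculation. The strides (4, 8, 16, 32), the base width of 64 channels, and the doubling rule below are assumptions typical of hierarchical backbones; the text does not specify these values.

```python
def backbone_feature_shapes(height, width, base_channels=64, strides=(4, 8, 16, 32)):
    """Return (channels, height, width) of the feature map after each stage."""
    shapes = []
    for stage, stride in enumerate(strides):
        channels = base_channels * 2 ** stage  # assumed: channels double per stage
        shapes.append((channels, height // stride, width // stride))
    return shapes
```

For a hypothetical 640x640 input this yields maps of 160x160, 80x80, 40x40, and 20x20 with 64, 128, 256, and 512 channels, which is the kind of multi-scale set a feature pyramid consumes.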
[0031] The CSP-MambaVision Block module, serving as the core processing unit in each stage, is combined with the Cross-Stage Partial Network (CSPNet). By concatenating the outputs of multiple MambaVision Block modules with the original feature maps, the model can comprehensively utilize multi-scale and multi-level information, as detailed below:
[0032] Given an input feature map $X_{in} \in \mathbb{R}^{H \times W \times C}$, where $H$ is the height, $W$ is the width, and $C$ is the number of channels, the CSP-MambaVision Block module first applies a convolutional layer (Conv) for dimensionality adaptation to obtain an intermediate feature map $X_{mid}$:
[0033] $$X_{mid} = \mathrm{Conv}(X_{in})$$
[0034] where $X_{in}$ represents the input feature map;
[0035] Subsequently, a split operation divides the intermediate feature map $X_{mid}$ into two feature branches, following the CSPNet approach of passing some features through directly and processing the others in depth. The formula is as follows:
[0036] $$X_{deep}, X_{direct} = \mathrm{Split}(X_{mid})$$
[0037] where $X_{deep}$ and $X_{direct}$ represent the two feature branches. $X_{deep}$ enters the MambaVision Block modules for deep feature extraction; $X_{direct}$ serves as the direct connection branch, preserving the basic information of the original features for subsequent fusion with the deeply processed features;
[0038] The feature $X_{deep}$ is processed sequentially by n MambaVision Block modules, where n is flexibly configurable. After the first MambaVision Block module, the feature is updated to $Y_1$, which is then input into the next MambaVision Block module. After processing by the i-th MambaVision Block module, the i-th level time-aware multi-scale feature $Y_i$ is output. As i increases, the MambaVision Block module's ability to capture local fine-grained associations of features and to model global long-term temporal dependencies is enhanced layer by layer. The level of abstraction and the semantic information content of the features gradually improve, ultimately resulting in the deeply processed branch feature $Y_n$;
[0039] Finally, the direct connection branch feature $X_{direct}$ and the n deep processing branch features $Y_1, \ldots, Y_n$ output by the MambaVision Block modules are concatenated along the channel dimension, using the following formula:
[0040] $$X_{concat} = \mathrm{Concat}(X_{direct}, Y_1, \ldots, Y_n)$$
[0041] where $X_{concat}$ is the concatenated feature.
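The split, sequential processing, and concatenation described above can be traced with a toy data-flow sketch. Here a feature map is reduced to a plain list with one value per channel, and each block is any callable; which half of the channels becomes the deep branch, and the name `csp_block`, are assumptions for illustration.

```python
def csp_block(channels, blocks):
    """channels: list with one entry per channel; blocks: list of callables."""
    half = len(channels) // 2
    deep, direct = channels[:half], channels[half:]  # assumed: first half goes deep
    outputs = []
    for block in blocks:
        deep = block(deep)  # each block consumes the previous block's output
        outputs.append(deep)
    # Concatenate the direct branch with every block output along the channel axis.
    fused = list(direct)
    for out in outputs:
        fused.extend(out)
    return fused
```

With n blocks and an input of c channels, the fused output has (n + 1) * c / 2 channels, so a following projection layer would normally bring the width back to the size the next stage expects.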
[0042] The structure of the MambaVision Block module is as follows:
[0043] Given the input feature map $X^{l-1} \in \mathbb{R}^{H \times W \times C}$, where $H$ is the height, $W$ is the width, and $C$ is the number of channels, the output of the MambaVision Block module is calculated as follows:
[0044] $$\hat{X}^{l} = \mathrm{Mixer}(\mathrm{Norm}(X^{l-1})) + X^{l-1}$$
[0045] $$X^{l} = \mathrm{MLP}(\mathrm{Norm}(\hat{X}^{l})) + \hat{X}^{l}$$
[0046] where $\hat{X}^{l}$ represents the output after the input feature map $X^{l-1}$ is processed by the MambaVision Mixer module, and $X^{l}$ represents the output after $\hat{X}^{l}$ is processed by the MLP (Multilayer Perceptron). Norm and Mixer denote the layer normalization and token mixing modules, respectively; Norm uses layer normalization. Of the N layers, the first N / 2 layers use MambaVisionMixer, and the last N / 2 layers use the self-attention mechanism.
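The layer arrangement stated above (token mixers in the first half of the layers, self-attention in the second half) and the normalize, mix, then MLP ordering can be sketched as follows. The residual additions are an assumption based on the usual pre-norm block design, and the names `mixer_schedule` and `residual_block` are hypothetical.

```python
def mixer_schedule(num_layers):
    """Token mixer type for each of the N layers: first half Mamba, second half attention."""
    if num_layers % 2:
        raise ValueError("this sketch assumes an even number of layers")
    half = num_layers // 2
    return ["mamba_mixer"] * half + ["self_attention"] * half


def residual_block(x, mixer, mlp, norm):
    """One block on a scalar stand-in for a feature map (assumed pre-norm residual form)."""
    x = x + mixer(norm(x))  # token mixing sub-layer
    return x + mlp(norm(x))  # channel MLP sub-layer
```

Placing attention in the later layers follows the stated split: the early layers mix tokens through the scan-based mixer, and the final layers recover global context through self-attention.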
[0047] The MambaVisionMixer module replaces causal convolution with regular convolution. It adds a symmetric branch without an SSM (State Space Model), which includes an additional convolutional layer and a SiLU (Sigmoid Linear Unit) activation function to compensate for content loss caused by the sequential constraints of the SSM. Finally, the outputs of the two branches are concatenated and projected through a final linear layer, as detailed below:
[0048] The input image first undergoes a linear transformation, then is processed through convolution and activation functions. The MambaVisionMixer module has two parallel paths: one path uses regular convolution for feature processing, and the other path uses an SSM model for temporal modeling. Finally, the outputs of the two paths are concatenated, and the information from both is fused through a linear transformation to obtain the output features. This process is described by the following formula:
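[0049] Given the input $X_{in}$, the output $X_{out}$ of the MambaVision Mixer module is calculated using the following formulas:
[0050] $$X_1 = \mathrm{Scan}\left(\sigma\left(\mathrm{Conv}\left(\mathrm{Linear}\left(C, \tfrac{C}{2}\right)(X_{in})\right)\right)\right)$$
[0051] $$X_2 = \sigma\left(\mathrm{Conv}\left(\mathrm{Linear}\left(C, \tfrac{C}{2}\right)(X_{in})\right)\right)$$
[0052] $$X_{out} = \mathrm{Linear}\left(\tfrac{C}{2}, C\right)\left(\mathrm{Concat}(X_1, X_2)\right)$$
[0053] In the formulas, $X_1$ and $X_2$ represent the outputs of the two branches; $\mathrm{Linear}(C_{in}, C_{out})$ denotes a linear layer with input dimension $C_{in}$ and output dimension $C_{out}$; Scan is the selective scanning operation; $\sigma$ is the SiLU activation function, which introduces nonlinearity; and Conv and Concat represent one-dimensional convolution and concatenation, respectively, used to handle spatial and temporal features.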
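To give a feel for the scanning operation and the SiLU activation named above, the sketch below runs a scalar state space recurrence over a sequence. It is a deliberately simplified stand-in and not the MambaVision implementation: the real operation works on vectors with learned, discretized parameters, and the names `selective_scan`, `a_fn`, `b_fn`, and `c_fn` are hypothetical.

```python
import math


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1 + math.exp(-x))


def selective_scan(xs, a_fn, b_fn, c_fn):
    """Scalar recurrence h_t = a(x_t) * h_{t-1} + b(x_t) * x_t, y_t = c(x_t) * h_t.

    The coefficients are functions of the current input, which is what makes
    the scan "selective": each token decides how much state to keep or write.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a_fn(x) * h + b_fn(x) * x
        ys.append(c_fn(x) * h)
    return ys
```

With constant coefficients, for example `a_fn=lambda x: 0.5`, the recurrence reduces to an exponentially decaying running sum, which shows how earlier tokens keep influencing later outputs.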
[0054] The self-attention mechanism is calculated according to the following formula:
[0055] $$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_h}}\right)V$$
[0056] where Attention represents the attention mechanism, and $Q$, $K$, and $V$ represent the query, key, and value, respectively. $d_h$ represents the number of attention heads; the product of $Q$ and $K^{T}$ is scaled by $\sqrt{d_h}$ to mitigate the numerical offset problem caused by dimensionality growth. Softmax is a normalization function that maps the attention weights to the [0,1] interval and ensures that the weights sum to 1. $QK^{T}$ denotes a matrix multiplication operation, and $\sqrt{\cdot}$ denotes the square root operation.
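The attention computation can be written out in plain Python for small matrices. This is a minimal sketch of scaled dot-product attention; the scaling value is passed in explicitly as `scale_dim` (a hypothetical parameter name) because the text describes it as the number of attention heads, whereas common formulations use the per-head key dimension.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v, scale_dim):
    """q: n x d, k: m x d, v: m x d_v, all as lists of lists."""
    out = []
    for q_row in q:
        # Scaled similarity between this query and every key.
        scores = [
            sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(scale_dim)
            for k_row in k
        ]
        weights = softmax(scores)  # non-negative, sums to 1
        # Weighted sum of the value rows.
        out.append([
            sum(w * v_row[j] for w, v_row in zip(weights, v))
            for j in range(len(v[0]))
        ])
    return out
```

When two keys are identical, their scores match, the softmax weights are equal, and the output is the plain average of the corresponding value rows.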
[0057] The divided training set, validation set, and test set are input into the DINO model according to the following steps:
[0058] First, the training set is input into the DINO model for training. The network parameters are iteratively optimized through gradient descent, enabling the DINO model to learn image features and target representations until the training loss converges. During training, the accuracy and mAP metrics are periodically calculated based on the validation set to evaluate performance, and the model's hyperparameters are adjusted accordingly. Finally, after training and parameter tuning are completed, the model is independently validated using the test set, and the final metrics of detection accuracy and computational efficiency are output, along with the optimal weights on the validation set.
[0059] A power transmission channel hazard identification system based on a hybrid backbone network includes:
[0060] Data acquisition module: Collects images of potential hazards in power transmission channels, preprocesses these images, constructs a dataset of potential hazards in power transmission channels, and divides the dataset into training, validation, and test sets;
[0061] DINO Model Construction Module: The DINO model includes a CSP-MambaVision hybrid backbone network, a pyramid network FPN, and a Transformer encoder-decoder. The CSP-MambaVision hybrid backbone network extracts multi-scale features from the power transmission channel hazard dataset. These extracted multi-scale features are then fused and enhanced by the pyramid network FPN. The enhanced features are then serialized. The encoder of the Transformer encoder-decoder deeply fuses global features through a self-attention mechanism, aggregating image context information to distinguish similar targets. The decoder interacts with object queries through cross-attention, focusing on potential target locations to generate preliminary prediction results.
[0062] Prediction result optimization module: Combining denoising training and Top-K feature selection optimization, the module optimizes the preliminary prediction results and outputs the hazard bounding box and hazard category prediction results.
[0063] Model optimization module: The DINO model is optimized by training, validating, and testing the model using divided training, validation, and test sets.
[0064] A computer device includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program to implement the steps of the method for identifying potential hazards in power transmission channels based on a hybrid backbone network.
[0065] A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of a method for identifying potential hazards in power transmission channels based on a hybrid backbone network.
[0066] The advantages of this invention are: (1) Enhanced spatiotemporal modeling capability
[0067] This invention innovatively combines the powerful temporal modeling capabilities of the State-Space Model (SSM) with the local perception advantages of the Convolutional Neural Network (CNN) through a self-designed CSP-MambaVision hybrid backbone network. Compared to traditional feature extraction networks, this invention can effectively capture the evolution process and long-range dependencies of dynamic targets (such as floating objects and foreign object damage) in power transmission channels, ensuring that it can simultaneously process the structural details of static equipment such as insulators and fittings in images. The introduction of the CSPNet architecture further ensures the efficient fusion of multi-scale features, enabling the model to handle complex visual tasks ranging from macroscopic scenes to microscopic defects, providing richer and more robust feature representations for subsequent accurate identification.
[0068] (2) High-precision detection capability based on global context awareness
[0069] The model of this invention uses a Transformer encoder-decoder structure, which enables accurate global modeling and enhances it in the temporal dimension through a cross-attention mechanism. The Transformer encoder uses a self-attention mechanism to model the global context of the input image, capturing long-range correlation information within the image. This allows the model not only to understand local features but also to analyze the target comprehensively from a global perspective. The Transformer decoder uses the cross-attention mechanism so that each target query interacts with global features and focuses on the accurate location of the potential target. Combined with the introduced denoising training and hybrid query selection mechanism, the model learns during training to reconstruct the true target in the presence of noise, thus maintaining high-precision localization and classification capabilities even under interference such as rain, snow, and shadows.
[0070] (3) Practical value of the project
[0071] This invention fully considers the complexity of real power transmission channel scenarios in its data construction, model design, and training strategies. By employing inspection data from multiple sources, time periods, and weather conditions, and applying targeted data augmentation, the model adapts well to various changes in the real environment. At the same time, it balances the high efficiency of the Mamba module in sequence modeling with the accuracy of the Transformer, maintaining detection performance while also taking the model's inference efficiency into account. This lays a solid foundation for subsequent engineering deployment on edge devices and greatly enhances the practical value of this technology in intelligent power system inspection.
Attached Figure Description
[0072] Figure 1 is a flowchart of the technical solution of the present invention;
[0073] Figure 2 is a diagram of the overall system architecture;
[0074] Figure 3 is a diagram of the overall architecture of the backbone network;
[0075] Figure 4 is a structural diagram of the CSP-MambaVision module;
[0076] Figure 5 is a structural diagram of the MambaVision Mixer module.
Detailed Implementation
[0077] As shown in Figures 1-5, a method for identifying potential hazards in power transmission channels based on a hybrid backbone network includes the following steps:
[0078] 1. Dataset Construction and Preprocessing
[0079] 1.1 Data Collection
[0080] The dataset of this invention is derived from on-site inspection data from multiple provincial power companies of the State Grid Corporation of China. It comprehensively covers the transmission channel environment under different seasons, weather conditions (such as sunny, rainy, foggy, and snowy), and day-night cycles, totaling 2,749 images. The dataset focuses on typical potential hazards such as insulator spontaneous explosions, loose hardware, bird nests in the channel, plastic film, excessively tall trees (tree obstructions), construction machinery, and smoke and fire below.
[0081] 1.2 Image Annotation
[0082] The annotation process was implemented using the LabelMe annotation tool. After defining the bounding boxes and type labels for typical hidden dangers in the power transmission channel, the annotation results were converted from the native LabelMe format to the COCO dataset standard format. This format fully records the spatial location and category attributes of each target instance, providing standardized and high-quality supervision signals for subsequent model training.
[0083] 1.3 Data Augmentation and Partitioning
[0084] To comprehensively improve the model's generalization ability and robustness, this invention implements a systematic data augmentation strategy during the training phase. This invention employs geometric transformations including random horizontal and vertical flipping, small-angle rotation, random scaling, and cropping. Simultaneously, by adjusting image brightness, contrast, saturation, and adding Gaussian noise, it effectively simulates complex and variable lighting conditions and weather effects. Furthermore, the introduction of random rectangular occlusion aims to train the model's ability to recognize partially occluded or incomplete targets, thereby better addressing real-world scenarios such as vegetation occlusion and mutual occlusion between equipment. The 2749 images were randomly divided into training, testing, and validation sets in a 7:2:1 ratio.
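Geometric augmentation has to transform the annotations together with the pixels. The sketch below shows the box arithmetic for flips and a simple rectangular occlusion on a tiny row-major image; boxes use the COCO `[x, y, width, height]` layout, and the function names are hypothetical.

```python
def hflip_bbox(bbox, image_width):
    """Mirror a [x, y, w, h] box for a horizontally flipped image."""
    x, y, w, h = bbox
    return [image_width - x - w, y, w, h]


def vflip_bbox(bbox, image_height):
    """Mirror a [x, y, w, h] box for a vertically flipped image."""
    x, y, w, h = bbox
    return [x, image_height - y - h, w, h]


def occlude(image, top, left, height, width, fill=0):
    """Return a copy of a 2D image (list of rows) with a rectangle overwritten."""
    out = [row[:] for row in image]
    for r in range(top, min(top + height, len(out))):
        for c in range(left, min(left + width, len(out[r]))):
            out[r][c] = fill
    return out
```

Flipping twice returns the original box, which is a convenient sanity check when wiring such transforms into a data pipeline. The occlusion leaves the annotations unchanged, so the model still has to predict the full box of a partly hidden target.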
[0085] 1.4 Evaluation Indicators
[0086] The evaluation metrics are based on the confusion matrix, which includes: TP (predicted as a positive sample and actually a positive sample), FP (predicted as a positive sample but actually a negative sample), and FN (predicted as a negative sample but actually a positive sample).
[0087] (1) Precision ($P$):
[0088] Based on the confusion matrix, the precision $P$ is calculated as follows:
[0089] $$P = \frac{TP}{TP + FP}$$
[0090] (2) Recall ($R$):
[0091] Based on the confusion matrix, the recall $R$ is calculated as follows:
[0092] $$R = \frac{TP}{TP + FN}$$
[0093] (3) Mean average precision ($mAP$):
[0094] From the precision $P$ and recall $R$, the Precision-Recall ($P$-$R$) curve can be drawn, and the average precision ($AP$) is obtained by integrating the area under the $P$-$R$ curve:
[0095] $$AP = \int_{0}^{1} P(R)\,dR$$
[0096] The mean of the average precision across all categories ($mAP$) is calculated as follows:
[0097] $$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$
[0098] In the formulas, $\int$ denotes integration and $N$ represents the number of detected categories.
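These metrics translate directly into code. The sketch below approximates the area under the precision-recall curve with the trapezoidal rule over points sorted by increasing recall; this is an assumption made for simplicity, since benchmark tools such as the COCO evaluator use interpolated precision at fixed recall thresholds instead.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Fraction of actual positives that were found."""
    return tp / (tp + fn) if tp + fn else 0.0


def average_precision(recalls, precisions):
    """Trapezoidal area under a P-R curve; points must be sorted by recall."""
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        area += width * (precisions[i] + precisions[i - 1]) / 2
    return area


def mean_average_precision(ap_per_class):
    """Mean of the per-category AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

For example, 8 true positives with 2 false positives give a precision of 0.8, and the same 8 true positives with 8 missed targets give a recall of 0.5.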
[0099] (4) Computational cost (FLOPs):
[0100] It is an indicator that measures the total number of floating-point operations in the core layers such as convolution and fully connected layers during the forward propagation of an object detection model when processing an image in a single pass. It is used to quantify the computational complexity of the model.
[0101] (5) Number of Model Parameters:
[0102] This represents the total number of all trainable parameters that make up the neural network, used to quantify the size and complexity of the model.
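Both quantities can be estimated layer by layer. The sketch below uses the standard parameter counts for a square-kernel convolution and a linear layer, and counts each multiply-add as two floating-point operations; that convention is an assumption, since some tools report multiply-accumulates instead, which halves the figure.

```python
def conv2d_params(c_in, c_out, kernel, bias=True):
    """Trainable parameters of a standard 2D convolution with a square kernel."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


def conv2d_flops(c_in, c_out, kernel, out_h, out_w):
    """Forward-pass operations; one multiply-add counts as two, bias ignored."""
    return 2 * kernel * kernel * c_in * c_out * out_h * out_w


def linear_params(d_in, d_out, bias=True):
    """Trainable parameters of a fully connected layer."""
    return d_in * d_out + (d_out if bias else 0)
```

A 3x3 convolution from 3 to 64 channels has 1792 parameters. Its cost grows with the output resolution, which is why early high-resolution stages tend to dominate FLOPs while late wide stages dominate the parameter count.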
[0103] 2. System Overall Framework and Workflow
[0104] This invention addresses the core challenges of power transmission channel hazard identification: complex environments, diverse targets, low accuracy in small target detection, and poor adaptability to dynamic scenes. It proposes a power transmission channel hazard identification method based on the DINO model and MambaVision temporal enhancement. The invention selects the DINO model as the baseline model for power transmission channel hazard detection. DINO is an end-to-end Transformer detector that copes better with complex environments and improves the detection accuracy of small targets in power transmission channel hazard identification tasks. The core innovation of the system lies in the design of a novel CSP-MambaVision hybrid backbone network to replace the traditional backbone network, providing a more powerful multi-scale feature representation for the Transformer-based DINO detection head. The overall system architecture is shown in Figure 1 and mainly includes the following three core stages:
[0105] (1) Multi-scale feature extraction stage
[0106] First, the input feature map $X$, where $H$ is its height and $W$ its width, is fed into the CSP-MambaVision hybrid backbone network. The image is processed using a hierarchical downsampling structure, progressively downsampling while increasing the number of channels. This combines the local detail awareness of convolution with the long-range dependency and temporal dynamic modeling capabilities of the State-Space Model (SSM). Convolutional layers effectively capture the subtle texture features of components such as insulators and bolts, while the SSM models the evolution of dynamic processes such as bird flight and tree branch swaying in a power line corridor. The final output is a set of multi-scale feature maps, denoted $\{F_i\}$. Subsequently, these multi-scale features are fused and enhanced through a Feature Pyramid Network (FPN) to generate a set of feature maps for detection, denoted $\{P_i\}$, which better characterize targets at different scales.
[0107] (2) Transformer encoding and decoding stage
[0108] First, the feature pyramid output by the FPN passes through a projection layer that unifies the channel dimensions. The features are then flattened and concatenated into a sequence of feature tokens, to which positional encoding is added before the sequence is input into the Transformer encoder. The encoder uses a multi-layer self-attention mechanism to deeply fuse and enhance global features, enabling each feature point to incorporate contextual information from the entire image, effectively distinguishing between similar-looking normal devices and potential hazards. Subsequently, the Transformer decoder receives a set of learnable object queries and the enhanced features from the encoder. Through a cross-attention mechanism, each object query interacts with the global features, gradually focusing on the potential target location, and is ultimately decoded into a series of preliminary prediction outputs.
[0109] (3) Prediction, optimization and output stage
[0110] The DINO framework introduces advanced mechanisms such as denoising training and hybrid query selection to improve performance. During training, a portion of ground truth (GT) boxes with noise are added to the decoder input. The model learns to reconstruct the true target even with interference, significantly improving the model's detection stability and generalization ability in complex noisy environments such as rain, snow, and shadows. Simultaneously, the model selects the top-K most significant feature points from the encoder output to initialize anchor boxes, providing a better starting point for the decoder. Finally, the decoder output is passed through the prediction head to generate the final target bounding box (e.g., defining the defect region and foreign object range) and specific category prediction results (e.g., insulator self-explosion, foreign object suspension, etc.).
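The two mechanisms described above can be sketched in a few lines. The example assumes boxes in `(cx, cy, w, h)` form and a uniform noise model bounded by a fraction of the box size. The names `jitter_box` and `top_k_indices` and the noise scale are hypothetical choices for illustration, not the parameters used by the invention.

```python
import random


def jitter_box(box, scale, rng):
    """Return a noisy copy of a ground-truth box (cx, cy, w, h)."""
    cx, cy, w, h = box
    cx += rng.uniform(-scale, scale) * w   # shift the centre by at most scale * w
    cy += rng.uniform(-scale, scale) * h   # shift the centre by at most scale * h
    w *= 1 + rng.uniform(-scale, scale)    # resize the width by at most +/- scale
    h *= 1 + rng.uniform(-scale, scale)    # resize the height by at most +/- scale
    return (cx, cy, w, h)


def top_k_indices(scores, k):
    """Indices of the k highest-scoring encoder tokens, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


rng = random.Random(0)
noisy = jitter_box((100.0, 100.0, 40.0, 20.0), 0.2, rng)
selected = top_k_indices([0.1, 0.9, 0.4, 0.7], 2)
```

During training the decoder would be asked to recover the original box from `noisy`; at inference only the top-K selection is used to initialize the anchor boxes.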
[0111] 3. Backbone Network Design
[0112] 3.1 CSP-MambaVision Backbone Network
[0113] In the identification of potential hazards in power transmission channels, commonly used feature extraction backbone networks (such as ResNet and VGG) rely heavily on feature extraction from static images. They struggle to capture the temporal dynamics and evolution of potential hazards (such as floating objects and animal activity) and show clear limitations under complex environmental changes, which makes them poorly suited to real-time, accurate monitoring in practical applications.
[0114] The feature extraction network designed in this invention innovatively integrates the core ideas of Cross-Stage Partial Network (CSPNet) with the temporal modeling advantages of MambaVision. This enables efficient and comprehensive extraction of hidden danger features from power transmission channel images. Its specific structure and workflow are as follows:
[0115] The feature extraction network employs a progressive structure consisting of four stages (Stage 1 to Stage 4). Each stage consists of convolutional layers followed by a CSP-MambaVision Block, and this combination progressively deepens the input features and transforms their dimensions. The network input is a preprocessed image of a power transmission corridor. The image first enters Stage 1, where an initial convolutional layer performs initial feature extraction and spatial downsampling; the CSP-MambaVision Block then captures the spatial correlation and preliminary long-distance dependencies of the features. Stage 2 receives the output features of Stage 1 and uses convolutional layers to further reduce the spatial size of the feature map and increase the channel dimension, learning richer feature representations. The subsequent CSP-MambaVision Block strengthens the capture of correlations between features in different regions, which helps locate small potential hazards (such as bird nests) against the vast background of the power transmission corridor and suppresses interference from complex backgrounds. Stages 3 and 4 follow the same logic, using convolutional downsampling and the deep semantic modeling of the CSP-MambaVision Block to keep expanding the network's receptive field. The network can ultimately interpret image content from a global perspective, accurately identifying large-scale hidden dangers (such as the intrusion of large construction machinery) and encoding their high-level semantic attributes.
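As a rough illustration of the four-stage progression, the sketch below tabulates the feature-map size per stage. The stem stride of 4, the base width of 64 channels, and the doubling rule are assumptions made for the example; the text does not state the actual strides or channel counts, and `stage_shapes` is a hypothetical helper.

```python
def stage_shapes(height, width, stem_stride=4, base_channels=64, num_stages=4):
    """List (stage, H, W, C) for a backbone that halves resolution and doubles channels."""
    shapes = []
    stride, channels = stem_stride, base_channels
    for stage in range(1, num_stages + 1):
        shapes.append((stage, height // stride, width // stride, channels))
        stride *= 2      # each later stage downsamples by a further factor of 2
        channels *= 2    # and doubles the channel dimension
    return shapes


shapes = stage_shapes(640, 640)
```

Under these assumed settings a 640 x 640 input yields maps from 160 x 160 down to 20 x 20, which shows why the early stages suit small hazards and the late stages suit large ones.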
[0116] 3.2 CSP-MambaVision Block Module
[0117] As the core processing unit of each stage, the CSP-MambaVision Block incorporates the core idea of the Cross-Stage Partial Network (CSPNet). By concatenating the outputs of multiple MambaVision Blocks with the original feature maps, the model can make comprehensive use of multi-scale and multi-level information, which is particularly important for power transmission scenes that contain both macroscopic structures (towers) and microscopic defects (insulator cracks).
[0118] Given an input feature map $X_{in} \in \mathbb{R}^{H \times W \times C}$ (where $H$ is the height, $W$ is the width, and $C$ is the number of channels), the feature map is input into the CSP-MambaVision Block module, where it first undergoes dimensionality adaptation through a convolutional layer (Conv) to obtain an intermediate feature map:
[0119]
[0120] Subsequently, a split operation is performed on the intermediate feature map, dividing it into two feature branches. This follows the CSPNet principle of passing some features through directly while processing the others in depth. The formula is as follows:
[0121]
[0122] Of the two branches, the first proceeds to the subsequent MambaVision Blocks for deep feature extraction; the second serves as a direct-connection branch, preserving the basic information of the original features for subsequent fusion with the deeply processed features.
[0123] The first branch is processed sequentially by n MambaVision Blocks. After the first MambaVision Block, the feature is updated and then passed to the next MambaVision Block, and so on through all n blocks (n is a flexibly configurable number of branches that controls the depth of feature extraction in terms of "temporal-spatial interaction and multi-scale expression"). The "Mixer→MLP→Self-Attention→MLP" process is repeated: after processing by the i-th MambaVision Block, the i-th level temporal-aware multi-scale feature is output. As i increases, the blocks' ability to capture local fine-grained associations and to model global long-term temporal dependencies is progressively enhanced, and the abstraction level and semantic content of the features gradually increase. This makes the structure particularly suitable for modeling temporal evolution patterns between consecutive frames, such as conductor galloping or falling foreign objects. Finally, the deeply processed branch features are obtained.
[0124] Finally, the direct-connection branch feature and the n deep-processing branch features output by the MambaVision Blocks are concatenated along the channel dimension to give $X_{concat}$, using the following formula:
[0125]
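The split, process, and concatenate flow of the block can be sketched as below. This is one plausible reading rather than the authors' exact formulation: it assumes an equal channel split, treats every one of the n block outputs as a branch to concatenate, and omits the initial convolution. The name `csp_block` is hypothetical, and the callables passed in merely stand in for MambaVision Blocks.

```python
def csp_block(channels, blocks):
    """Split channels in two, run one half through the blocks, concatenate everything."""
    half = len(channels) // 2
    deep, direct = channels[:half], channels[half:]  # processing branch, direct branch
    outputs = []
    for block in blocks:
        deep = block(deep)       # each block refines the previous block's output
        outputs.append(deep)     # keep every intermediate level for the fusion
    fused = list(direct)         # the direct branch is carried over untouched
    for out in outputs:
        fused.extend(out)        # concatenation along the channel dimension
    return fused


def double(xs):
    """Toy stand-in for a MambaVision Block."""
    return [2 * x for x in xs]


result = csp_block([1, 2, 3, 4], [double, double])
```

The direct branch keeps the original values while the deep branch contributes one group of channels per block, so the output width grows with n under this reading.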
[0126] 3.3 Internal Structure of MambaVision Block
[0127] 3.3.1 Overall Module Processing Flow
[0128] The MambaVision Block is key to implementing deep feature modeling. This module combines the MambaVisionMixer module with the Transformer concept. Specifically, given an input feature map $X \in \mathbb{R}^{H \times W \times C}$ (where $H$ is the height, $W$ is the width, and $C$ is the number of channels), the output of layer $n$ of this module is calculated as follows:
[0129] $$\hat{X}^{n} = \mathrm{Mixer}(\mathrm{Norm}(X^{n-1})) + X^{n-1}$$
[0130] $$X^{n} = \mathrm{MLP}(\mathrm{Norm}(\hat{X}^{n})) + \hat{X}^{n}$$
[0131] Here, $\hat{X}^{n}$ is the output after processing by the token mixer, $X^{n}$ is the output after MLP (multilayer perceptron) processing, and Norm and Mixer denote the chosen normalization and token mixing modules, respectively. For generality, this invention uses layer normalization for Norm. Of the N layers, the first N / 2 layers use MambaVisionMixer as the token mixer, while the last N / 2 layers use a self-attention mechanism. This arrangement allows the model to first perform temporal feature abstraction and long-range dependency modeling efficiently, and then use self-attention to refine and consolidate spatial context relationships, balancing computational efficiency and model performance.
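The layer arrangement described above (MambaVisionMixer in the first N / 2 layers, self-attention in the last N / 2) can be written as a small helper. The function name `mixer_schedule` and the string labels are illustrative assumptions, as is the choice to reject odd layer counts.

```python
def mixer_schedule(num_layers):
    """Return the token mixer used by each of the N layers of one block."""
    if num_layers % 2:
        raise ValueError("expected an even number of layers")
    half = num_layers // 2
    return ["mamba_mixer"] * half + ["self_attention"] * half


schedule = mixer_schedule(4)
```

A model builder could iterate over this list and instantiate the corresponding mixer for each layer.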
[0132] 3.3.2 MambaVisionMixer Module
[0133] The MambaVisionMixer module within the MambaVision Block is crucial for achieving deep feature modeling. It improves upon the original Mamba mixer to make it more suitable for vision tasks, and is designed to enhance the spatiotemporal modeling capabilities of object detection models, particularly when handling the temporal variations of dynamic objects.
[0134] The core improvement is to replace causal convolution with regular convolution: causal convolution only allows unidirectional information flow, which is both unnecessary and limiting for visual tasks. In addition, the module adds a symmetric branch without the state-space model (SSM), consisting of an additional convolutional layer and a sigmoid linear unit (SiLU) activation function, to compensate for content lost because of the sequential constraints of the SSM. Finally, the outputs of the two branches are concatenated and projected through a final linear layer. This combined design ensures that the final features integrate both sequential and spatial information, drawing on the strengths of both branches.
[0135] Specifically, the input feature first undergoes a linear transformation and is then processed by convolution and the SiLU activation function. The module has two parallel paths: one path uses regular convolution for feature processing, and the other path uses the SSM for temporal modeling. Finally, the outputs of the two paths are concatenated, and their information is fused through a linear transformation to obtain the output features. This process is described by the following formulas:
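[0136] Given an input $X_{in}$, the MambaVisionMixer output $X_{out}$ can be calculated using the following formulas:
[0137] $$X_{1} = \mathrm{Scan}\left(\sigma\left(\mathrm{Conv}\left(\mathrm{Linear}\left(C, \tfrac{C}{2}\right)(X_{in})\right)\right)\right)$$
[0138] $$X_{2} = \sigma\left(\mathrm{Conv}\left(\mathrm{Linear}\left(C, \tfrac{C}{2}\right)(X_{in})\right)\right)$$
[0139] $$X_{out} = \mathrm{Linear}\left(\tfrac{C}{2}, C\right)\left(\mathrm{Concat}(X_{1}, X_{2})\right)$$
[0140] In the formulas, $X_{1}$ and $X_{2}$ are the outputs of the two branches; $\mathrm{Linear}(C_{in}, C_{out})(\cdot)$ denotes a linear layer with input dimension $C_{in}$ and output dimension $C_{out}$; Scan is the selective scanning operation; and $\sigma$ is the SiLU activation function, which introduces non-linearity. Conv and Concat denote one-dimensional convolution and concatenation, respectively, and are used to handle spatial and temporal features.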
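The difference between causal and regular convolution mentioned above can be seen on a one-dimensional toy sequence. The sketch assumes zero padding and a sliding dot product without kernel flipping (cross-correlation, as deep-learning frameworks usually implement convolution); `conv1d` and `silu` are illustrative helpers, not the module's actual code.

```python
import math


def silu(x):
    """Sigmoid linear unit: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def conv1d(seq, kernel, causal):
    """Same-length 1-D sliding dot product with zero padding.

    causal=True pads only on the left, so each output sees past inputs only;
    causal=False pads on both sides, so each output also sees later inputs.
    """
    k = len(kernel)
    left = k - 1 if causal else (k - 1) // 2
    right = (k - 1) - left
    padded = [0.0] * left + list(seq) + [0.0] * right
    return [sum(kernel[j] * padded[i + j] for j in range(k)) for i in range(len(seq))]


impulse = [0, 0, 1, 0, 0]
causal_out = conv1d(impulse, [1, 1, 1], causal=True)
regular_out = conv1d(impulse, [1, 1, 1], causal=False)
```

With the causal variant the impulse only influences later positions, whereas the regular variant spreads it symmetrically in both directions, which suits image tokens that have no inherent ordering.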
[0141] 3.3.3 Self-attention mechanism
[0142] The last N / 2 layers of the MambaVision Block employ a standard multi-head self-attention mechanism. This mechanism captures the complex spatial dependencies between potential hazards and their surroundings (such as wires, towers, and background trees) more precisely, improving the handling of complex scenes and dynamic targets. The calculation follows the formula below:
[0143] $$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q \cdot K^{T}}{\sqrt{d_{h}}}\right) \cdot V$$
[0144] where $Q$, $K$, and $V$ represent the query, key, and value, respectively; $d_{h}$ represents the number of attention heads; the product of $Q$ and $K^{T}$ is scaled by $\sqrt{d_{h}}$ to mitigate the numerical offset problem caused by dimensionality growth; Softmax is a normalization function that maps the attention weights to the [0, 1] interval and ensures that they sum to 1; and "$\cdot$" indicates matrix multiplication.
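A minimal single-head sketch of scaled dot-product attention on nested lists is given below. The scaling quantity is passed in explicitly as `scale_dim`; the example follows the common convention of using the key dimension for it, which is an assumption of this sketch. The function names are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v, scale_dim):
    """Scaled dot-product attention: softmax(q k^T / sqrt(scale_dim)) v."""
    scale = math.sqrt(scale_dim)
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]
        weights = softmax(scores)  # non-negative weights that sum to 1
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


q = [[1.0, 0.0]]
k = [[1.0, 0.0], [1.0, 0.0]]
v = [[2.0], [4.0]]
result = attention(q, k, v, scale_dim=2)
```

Because both keys match the query equally, the two values receive equal weight and the output is their mean.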
[0145] 4. Model Training and Testing
[0146] After the DINO model improvements are completed, the divided training, validation, and test sets are fed to the model according to the following procedure:
[0147] First, the training set is input into the improved DINO model for training. The network parameters are iteratively optimized by gradient descent so that the model learns image features and target representations, until the training loss converges. During training, performance is evaluated periodically on the validation set using metrics such as accuracy and mAP, and the model's hyperparameters (such as learning rate, regularization factor, and batch size) are adjusted accordingly. Finally, after training and parameter tuning are completed, the model is evaluated independently on the test set; the final metrics, such as detection accuracy and computational efficiency, are output, and the weights that performed best on the validation set are retained.
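The data partition and the choice of final weights can be sketched as follows. The split uses the 7:2:1 ratio and the 2749-image count stated in claim 2, read in the order training, testing, validation; the integer rounding, the fixed seed, and the helper names `split_dataset` and `select_best_checkpoint` are assumptions made for the example. The validation scores are made-up placeholder values, not reported results.

```python
import random


def split_dataset(items, seed=0):
    """Shuffle and split 7:2:1 into (train, test, val) using integer arithmetic."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 7 // 10
    n_test = n * 2 // 10
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    val = shuffled[n_train + n_test:]  # the remainder goes to validation
    return train, test, val


def select_best_checkpoint(val_history):
    """Pick the (epoch, val_map) pair with the highest validation mAP; earliest wins ties."""
    best = None
    for epoch, val_map in val_history:
        if best is None or val_map > best[1]:
            best = (epoch, val_map)
    return best


train, test, val = split_dataset(range(2749))
best = select_best_checkpoint([(1, 0.40), (2, 0.55), (3, 0.55), (4, 0.50)])  # placeholder scores
```

Keeping the test set out of checkpoint selection, as the procedure above describes, is what makes the final test metrics an independent estimate.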
Claims
1. A method for identifying potential hazards in power transmission channels based on a hybrid backbone network, characterized in that it includes the following steps: S1. Collect images of potential hazards in power transmission channels, preprocess the images, construct a dataset of potential hazards in power transmission channels, and divide the dataset into a training set, a validation set, and a test set. S2. Construct the DINO model, which includes a CSP-MambaVision hybrid backbone network, a feature pyramid network (FPN), and a Transformer encoder-decoder. The CSP-MambaVision hybrid backbone network extracts multi-scale features from the power transmission channel hazard dataset; the extracted multi-scale features are fused and enhanced by the FPN; the enhanced features are serialized; the encoder of the Transformer encoder-decoder deeply fuses global features through a self-attention mechanism, gathering image context information to distinguish similar targets; and the decoder, through cross-attention between object queries and the encoded features, focuses on potential target locations and generates preliminary prediction results. S3. Combining denoising training and Top-K feature selection, the preliminary prediction results are optimized, and the hazard bounding boxes and hazard category prediction results are output. S4. The DINO model is trained, validated, and tested using the divided training, validation, and test sets, thereby optimizing the DINO model.
2. The method for identifying potential hazards in power transmission channels based on a hybrid backbone network according to claim 1, characterized in that step S1 specifically includes the following: 1.1 Data source: the images of potential hazards in power transmission channels are derived from on-site inspection data and comprehensively cover the power transmission channel environment in different seasons, under different weather conditions, and by day and night, totaling 2749 images. 1.2 Image annotation: the collected images are annotated using the LabelMe annotation tool to complete the bounding box annotation and type label definition of typical hidden dangers in the power transmission channel, and the annotation results are converted from the native LabelMe format to the standard format of the COCO dataset. 1.3 Data augmentation and partitioning: data augmentation strategies are applied to the power transmission channel hazard dataset, including geometric transformations such as random horizontal and vertical flipping, small-angle rotation, random scaling, and cropping; in addition, complex and variable lighting conditions and weather effects are simulated by adjusting image brightness, contrast, and saturation and by adding Gaussian noise. The 2749 images are randomly divided into training, testing, and validation sets in a 7:2:1 ratio.
3. The method for identifying potential hazards in power transmission channels based on a hybrid backbone network according to claim 1, characterized in that the specific working steps of the DINO model are as follows: (1) Multi-scale feature extraction stage: first, the input image, of height $H$ and width $W$, is input to the CSP-MambaVision hybrid backbone network, which processes it using a hierarchical downsampling structure, progressively downsampling while increasing the number of channels, and ultimately outputs a set of multi-scale feature maps; subsequently, these multi-scale features are rapidly fused and enhanced by the Feature Pyramid Network (FPN) to generate a set of feature maps for detection. (2) Transformer encoding and decoding stage: first, the feature maps output by the FPN pass through a projection layer that unifies the channel dimensions, are flattened and concatenated into a serialized sequence of feature tokens, have positional encoding added, and are input into the Transformer encoder. The Transformer encoder performs deep fusion and enhancement of global features through a multi-layer self-attention mechanism, so that each feature point gathers the contextual information of the entire image and normal devices with similar appearances can be distinguished from potential hazards. Subsequently, the Transformer decoder receives a set of learnable object queries and the enhanced features from the encoder; through a cross-attention mechanism, each object query interacts with the global features, gradually focuses on a potential target location, and is finally decoded into a series of preliminary prediction outputs. (3) Prediction, optimization and output stage: the DINO model introduces advanced mechanisms such as denoising training and hybrid query selection. During the training phase, a portion of noisy ground truth bounding boxes is added to the input of the decoder, and the model learns to reconstruct the real target in the presence of interference. At the same time, the model selects the Top-K most significant feature points from the features output by the encoder to initialize the anchor boxes. Finally, the output of the decoder is passed through the prediction head to generate the final target bounding boxes and specific category prediction results.
4. The method for identifying potential hazards in power transmission channels based on a hybrid backbone network according to claim 3, characterized in that the CSP-MambaVision hybrid backbone network adopts a progressive structure consisting of four stages, Stage 1 to Stage 4. Each stage consists of convolutional layers followed by a CSP-MambaVision Block module, progressively deepening the input features and transforming their dimensions. The input to the CSP-MambaVision hybrid backbone network is a preprocessed power transmission channel image. The image first enters Stage 1, where the initial convolutional layer performs initial feature extraction and spatial downsampling; subsequently, the CSP-MambaVision Block module captures the spatial correlation and preliminary long-distance dependencies of the features. Stage 2 receives the output features of Stage 1 and uses convolutional layers to further reduce the spatial size of the feature map and increase the channel dimension, learning richer feature representations; the subsequent CSP-MambaVision Block module strengthens the capture of correlations between features in different regions. Stages 3 and 4 continue the same logic, using convolutional downsampling and deep semantic modeling by the CSP-MambaVision Block module so that the receptive field of the CSP-MambaVision hybrid backbone network continues to expand; the network eventually gains the ability to understand image content from a global perspective, thereby accurately identifying large-scale hidden dangers and encoding their high-level semantic attributes.
5. The method for identifying potential hazards in power transmission channels based on a hybrid backbone network according to claim 4, characterized in that the CSP-MambaVision Block module, serving as the core processing unit of each stage, incorporates the Cross-Stage Partial Network (CSPNet): by concatenating the outputs of multiple MambaVision Block modules with the original feature maps, the model can make comprehensive use of multi-scale and multi-level information, as detailed below. Given an input feature map $X_{in} \in \mathbb{R}^{H \times W \times C}$, where $H$ is the height, $W$ is the width, and $C$ is the number of channels, the feature map is input into the CSP-MambaVision Block module and first undergoes dimensionality adaptation through a convolutional layer (Conv) to obtain an intermediate feature map, where $X_{in}$ represents the input feature map. Subsequently, a split operation is performed on the intermediate feature map to divide it into two feature branches, following the CSPNet approach of passing some features through directly while processing the others in depth: the first branch proceeds to the MambaVision Block modules for deep feature extraction, and the second serves as a direct-connection branch, preserving the basic information of the original features for subsequent fusion with the deeply processed features. The first branch is processed sequentially by n MambaVision Block modules, where n is a flexibly configurable number of branches: after the first MambaVision Block module the feature is updated and then passed to the next MambaVision Block module, and after processing by the i-th MambaVision Block module the i-th level time-aware multi-scale feature is output. As i increases, the MambaVision Block modules' ability to capture local fine-grained associations of features and to model global long-term temporal dependencies is enhanced layer by layer, and the level of abstraction and semantic information content of the features gradually improve, ultimately yielding the deeply processed branch features. Finally, the direct-connection branch feature and the n deep-processing branch features output by the MambaVision Block modules are concatenated along the channel dimension, where $X_{concat}$ denotes the features after concatenation.
6. The method for identifying potential hazards in power transmission channels based on a hybrid backbone network according to claim 5, characterized in that the structure of the MambaVision Block module is as follows: given the input feature map $X \in \mathbb{R}^{H \times W \times C}$, where $H$ is the height, $W$ is the width, and $C$ is the number of channels, the output of layer $n$ of the MambaVision Block module is calculated as $\hat{X}^{n} = \mathrm{Mixer}(\mathrm{Norm}(X^{n-1})) + X^{n-1}$ and $X^{n} = \mathrm{MLP}(\mathrm{Norm}(\hat{X}^{n})) + \hat{X}^{n}$, where $\hat{X}^{n}$ represents the output of the input feature map after processing by the MambaVisionMixer module, $X^{n}$ represents the output after MLP processing, MLP stands for multilayer perceptron, and Norm and Mixer denote the chosen normalization and token mixing modules, respectively. Norm adopts layer normalization. Of the N layers, the first N / 2 layers use MambaVisionMixer, and the last N / 2 layers use the self-attention mechanism.
7. The method for identifying potential hazards in power transmission channels based on a hybrid backbone network according to claim 6, characterized in that the MambaVisionMixer module replaces causal convolution with regular convolution and adds a symmetric branch without the state-space model (SSM), which includes an additional convolutional layer and a sigmoid linear unit (SiLU) activation function to compensate for content loss caused by the sequential constraints of the SSM; finally, the outputs of the two branches are concatenated and projected through a final linear layer, as detailed below. The input feature first undergoes a linear transformation and is then processed through convolution and the activation function. The MambaVisionMixer module has two parallel paths: one path uses regular convolution for feature processing, and the other path uses the SSM for temporal modeling. Finally, the outputs of the two paths are concatenated, and their information is fused through a linear transformation to obtain the output features. This process is described by the following formulas: given an input $X_{in}$, the output $X_{out}$ of the MambaVisionMixer module is calculated as $X_{1} = \mathrm{Scan}(\sigma(\mathrm{Conv}(\mathrm{Linear}(C, \tfrac{C}{2})(X_{in}))))$, $X_{2} = \sigma(\mathrm{Conv}(\mathrm{Linear}(C, \tfrac{C}{2})(X_{in})))$, and $X_{out} = \mathrm{Linear}(\tfrac{C}{2}, C)(\mathrm{Concat}(X_{1}, X_{2}))$. In the formulas, $X_{1}$ and $X_{2}$ represent the outputs of the two branches; $\mathrm{Linear}(C_{in}, C_{out})(\cdot)$ denotes a linear layer with input dimension $C_{in}$ and output dimension $C_{out}$; Scan is the selective scanning operation; $\sigma$ is the SiLU activation function, which introduces non-linearity; and Conv and Concat denote one-dimensional convolution and concatenation, respectively, used to handle spatial and temporal features.
8. The method for identifying potential hazards in power transmission channels based on a hybrid backbone network according to claim 7, characterized in that the self-attention mechanism is calculated according to the following formula: $\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q \cdot K^{T}}{\sqrt{d_{h}}}\right) \cdot V$, where Attention represents the attention mechanism; $Q$, $K$, and $V$ represent the query, key, and value, respectively; $d_{h}$ represents the number of attention heads; the product of $Q$ and $K^{T}$ is scaled to mitigate the numerical offset problem caused by dimensionality growth; Softmax is a normalization function that maps the attention weights to the [0, 1] interval and ensures that they sum to 1; "$\cdot$" represents matrix multiplication; and $\sqrt{d_{h}}$ represents the square root of the number of attention heads.
9. The method for identifying potential hazards in power transmission channels based on a hybrid backbone network according to claim 1, characterized in that the divided training set, validation set, and test set are input into the DINO model according to the following steps: first, the training set is input into the DINO model for training, and the network parameters are iteratively optimized by gradient descent so that the DINO model learns image features and target representations, until the training loss converges; during training, the accuracy and mAP metrics are calculated periodically on the validation set to evaluate performance, and the hyperparameters of the model are adjusted accordingly; finally, after training and parameter tuning are completed, the model is evaluated independently on the test set, the final detection accuracy and computational efficiency metrics are output, and the weights that performed best on the validation set are obtained.
10. A power transmission channel hazard identification system based on a hybrid backbone network, characterized in that it includes: a data acquisition module, which collects images of potential hazards in power transmission channels, preprocesses these images, constructs a dataset of potential hazards in power transmission channels, and divides the dataset into training, validation, and test sets; a DINO model construction module, in which the DINO model includes a CSP-MambaVision hybrid backbone network, a feature pyramid network (FPN), and a Transformer encoder-decoder, wherein the CSP-MambaVision hybrid backbone network extracts multi-scale features from the power transmission channel hazard dataset, the extracted multi-scale features are fused and enhanced by the FPN, the enhanced features are serialized, the encoder of the Transformer encoder-decoder deeply fuses global features through a self-attention mechanism, aggregating image context information to distinguish similar targets, and the decoder, through cross-attention between object queries and the encoded features, focuses on potential target locations to generate preliminary prediction results; a prediction result optimization module, which combines denoising training and Top-K feature selection to optimize the preliminary prediction results and outputs the hazard bounding boxes and hazard category prediction results; and a model optimization module, which optimizes the DINO model by training, validating, and testing it using the divided training, validation, and test sets.
11. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the method for identifying potential hazards in power transmission channels based on a hybrid backbone network as described in any one of claims 1-9.
12. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the method for identifying potential hazards in power transmission channels based on a hybrid backbone network as described in any one of claims 1-9.