A multi-modal cooperative communication tower corrosion detection method
By employing a multimodal collaborative corrosion detection method, and utilizing improvements to keyframe preprocessing and the backbone network, combined with a gated cross-attention mechanism, the problem of redundant information in multimodal detection is solved, achieving efficient corrosion detection and improving detection accuracy and robustness.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HEILONGJIANG UNIV
- Filing Date
- 2026-03-23
- Publication Date
- 2026-06-19
AI Technical Summary
Existing multimodal corrosion detection methods tend to introduce redundant information when incorporating video data, resulting in high computational overhead. Furthermore, the detection results depend on the fine-grained structure and boundary features in the image, making it difficult to achieve efficient fusion while ensuring computational efficiency.
A multimodal collaborative corrosion detection method is adopted. Video data is preprocessed to extract keyframes, the LS module and the EAM edge enhancement module are introduced into the backbone network, and a gated cross-attention mechanism is combined with a deformable attention structure to selectively filter video modal information, so that the image modality receives focused enhancement while the video modality provides lightweight assistance.
The method improves the practicality and detection accuracy of multimodal fusion, reduces computational burden, and enhances texture contrast and structural detail expression in corroded areas. It is suitable for corrosion feature recognition under complex lighting and occlusion conditions, and improves detection robustness.
Figure CN122244760A_ABST
Description
Technical Field
[0001] This invention belongs to the field of artificial intelligence and corrosion detection technology, and specifically relates to a multimodal collaborative method for corrosion detection of communication towers.
Background Technology
[0002] With the continuous advancement of infrastructure construction in China, the scale of 5G communication networks and power systems keeps expanding, and the number of communication towers, which serve as a fundamental support structure, has increased dramatically. As of the end of June 2025, the number of tower sites in China reached 2.119 million, an increase of 25,000 compared with the end of the previous year. With so many towers, improper maintenance can cause property damage and personal injury, so ensuring their long-term stable operation is a significant maintenance challenge. In complex outdoor environments, communication tower structures are susceptible to corrosion from factors such as rainwater erosion and temperature changes. If corrosion is not detected and assessed in time, the consequences range from unstable equipment operation to structural safety hazards, including serious accidents such as component detachment, communication interruptions, and tower collapse, and in extreme cases casualties. Corrosion monitoring of communication towers has therefore become a crucial issue that urgently needs to be addressed.
[0003] Corrosion problems are particularly prominent in high-altitude and cold regions. Communication towers are exposed to complex natural environments characterized by low temperatures, freeze-thaw cycles, and frequent snowfall. These factors combined lead to rapid corrosion development and diverse corrosion forms. Due to the long winters, tower corrosion is often highly concealed by ice cover. Traditional inspection methods, primarily relying on manual patrols, are not only costly in terms of manpower and pose significant safety risks, but are also severely limited by weather and geographical conditions, making it difficult to conduct regular and detailed inspections of the widely distributed towers. Therefore, their ability to support corrosion detection of communication towers is limited.
[0004] In recent years, the development of deep learning and computer vision has made automatic corrosion detection methods based on image and video data a research hotspot. For example, existing image target detection methods are mostly based on convolutional neural networks or Transformer architectures, including models such as the YOLO series, Faster R-CNN, and DETR. These methods can improve detection efficiency to some extent and reduce reliance on manual labor, but their performance is easily affected by shooting angle, lighting changes, and occlusion factors. In addition to image detection, existing technologies have also proposed corrosion detection methods based on video temporal analysis. For example, 3D convolutional neural networks or temporal Transformer architectures are used to model continuous frames to extract the spatiotemporal features of continuous frames and capture the dynamic changes of the corrosion area. However, these methods generally face problems such as large data volume, high computational complexity, and high requirements for hardware and bandwidth. Overall, although single-modal corrosion detection methods have improved accuracy and efficiency compared to manual inspection, there is still room for further optimization in their detection performance and applicability in engineering applications.
[0005] To overcome the limitations of single-modal data, some studies have introduced multimodal detection methods that combine images and videos, using temporal information to assist image recognition and object detection. Existing multimodal fusion methods mainly follow two technical routes: one is based on the Transformer architecture, uniformly encoding multimodal information and achieving deep feature fusion through cross-modal attention mechanisms; the other is based on convolutional neural network frameworks, fusing or weighting feature maps from different modalities at the network layers. These methods verify, to some extent, the auxiliary role of video temporal information in image detection tasks. However, when incorporating video data, existing multimodal methods often use raw consecutive frames directly for feature fusion, which easily introduces redundant information and noise and significantly increases computational overhead. In corrosion detection tasks in particular, the detection results rely more heavily on fine-grained structures and boundary features in the image. How to achieve efficient fusion of multimodal features that relies primarily on image information, makes full use of the effective supplementary information provided by video, and preserves computational efficiency remains a key problem that urgently needs to be solved.
Summary of the Invention
[0006] The purpose of this invention is to address the problem that existing methods easily introduce redundant information when incorporating video data, leading to high computational overhead. Therefore, a multimodal collaborative method for detecting corrosion of communication towers is proposed.
[0007] The technical solution adopted by the present invention to solve the above-mentioned technical problems is: a multimodal collaborative method for detecting corrosion of communication towers, the method specifically including the following steps:
[0008] Step 1: Acquire image and video data of the area to be detected on the communication tower, respectively;
[0009] Step 2: Preprocess the video data obtained in Step 1 to extract keyframes related to the image data from the video data;
[0010] Next, the image obtained in Step 1 and the extracted keyframes are resized to a uniform size, and the resized image and keyframes are then normalized separately.
[0011] Step 3: Use the normalized image and normalized keyframes from Step 2 as input to the corrosion detection model;
[0012] The corrosion detection model includes a backbone network, an encoder, a decoder, and a detection head;
[0013] Step 4: Output the corrosion detection results of the area to be tested through the detection head of the corrosion detection model.
[0014] Furthermore, in step two, the video data obtained in step one is preprocessed to extract keyframes related to the image data from the video data. The specific process is as follows:
[0015] First, the video data is split into frames. Frames are then sampled from the resulting frame sequence at equal intervals, and the sampled frames form a keyframe sequence with temporal continuity.
[0016] Calculate the feature similarity between the image obtained in step one and each keyframe, and determine the keyframe corresponding to the largest feature similarity.
[0017] Using the identified keyframe as the center, select the two nearest neighbor keyframes forward and backward in the keyframe sequence, resulting in a total of five keyframes, including the center.
[0018] Furthermore, the working process of the backbone network is as follows:
[0019] Within the backbone network, the input image is sequentially passed through the Stem module, the first LS module, the first EAM module, the first downsampling module, the second LS module, the second EAM module, the second downsampling module, the third LS module, the third downsampling module, and the MSA module. The output of the MSA module is used as the output of the backbone network.
[0020] Furthermore, the operation process of the first LS module is as follows:
[0021] Within the first LS module, the input features first pass through a first depthwise separable convolutional layer with a kernel size of 3×3.
[0022] The output of the first depthwise separable convolutional layer is then residually connected to the input features of the first LS module to obtain residual connection result a. The residual connection result a is then used as the input of the first SE block.
[0023] The output of the first SE block is used as the input of the first FFN layer. Then, the output of the first FFN layer is residually connected with the output of the first SE block to obtain the residual connection result b. The residual connection result b is used as the input of the first LS convolutional layer.
[0024] The output of the first LS convolutional layer is residually connected with the residual connection result b to obtain the residual connection result c.
[0025] The residual connection result c is used as the input of the second FFN layer, and the output of the second FFN layer is then residually connected with the residual connection result c to obtain the residual connection result d.
[0026] The residual connection result d is used as the output of the first LS module.
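The residual structure of the LS module described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation: "residual connection" is read as element-wise addition, the SE block uses a two-layer bottleneck with ReLU and sigmoid (the usual Squeeze-and-Excitation form), and the names `se_block`, `ls_module`, `dw_conv`, `ffn1`, `ls_conv`, and `ffn2` are hypothetical stand-ins for the layers named in the text.

```python
import numpy as np

def se_block(x, w1, w2):
    """Squeeze-and-Excitation on a (C, H, W) feature map (assumed standard form)."""
    squeezed = x.mean(axis=(1, 2))                   # squeeze: global average per channel
    hidden = np.maximum(w1 @ squeezed, 0.0)          # bottleneck + ReLU
    scale = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))     # sigmoid channel weights
    return x * scale[:, None, None]                  # excite: rescale each channel

def ls_module(x, dw_conv, se, ffn1, ls_conv, ffn2):
    """Data flow of one LS module; each sublayer is passed in as a callable."""
    a = x + dw_conv(x)      # result a: depthwise separable conv + module input
    s = se(a)               # SE block applied to a
    b = s + ffn1(s)         # result b: first FFN + SE output
    c = b + ls_conv(b)      # result c: LS convolution + b
    d = c + ffn2(c)         # result d: second FFN + c
    return d                # d is the module output
```

Passing the sublayers as callables keeps the sketch independent of any particular convolution or FFN implementation; only the order of operations and the four residual additions are taken from the description.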
[0027] Furthermore, the working process of the first EAM module is as follows:
[0028] Within the first EAM module, the output of the first LS module and the Sobel edge map are used as the input of the first FE unit. The output of the first FE unit is then used as the input of the first convolutional layer. Finally, the output of the first convolutional layer is residually connected with the output of the first LS module to obtain the residual connection result e.
[0029] The residual connection result e and the Sobel edge map are used as inputs to the second FE unit, and the output of the second FE unit is used as input to the second convolutional layer. The output of the second convolutional layer is then residually connected with the output of the first LS module to obtain the residual connection result f.
[0030] The residual connection result f is used as the output of the first EAM module.
[0031] Furthermore, the operation process of the first FE unit is as follows:
[0032] The Sobel edge map $E$ is passed through the third convolutional layer and the LeakyReLU activation function layer to output the edge feature map $F_e$;
[0033] $F_e = \mathrm{LeakyReLU}(\mathrm{Conv}_3(E))$
[0034] $w = \sigma\left(\mathrm{MLP}(\mathrm{GAP}(F_e)) + \mathrm{MLP}(\mathrm{GMP}(F_e))\right)$
[0035] where $\mathrm{GAP}$ denotes global average pooling, $\mathrm{GMP}$ denotes global max pooling, $\mathrm{MLP}$ denotes a shared multilayer perceptron, $\sigma$ denotes the activation function, and $w$ denotes the channel attention vector that is finally output;
[0036] The feature map $F$ output by the first LS module is multiplied element-wise with the edge feature map $F_e$ and with the channel attention vector $w$, respectively, and the two products are then added element-wise to obtain the final output $F_{out}$ of the first FE unit:
[0037] $F_{out} = F \odot F_e + F \odot w$
[0038] where $\odot$ denotes element-wise multiplication.
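The FE unit computation can be illustrated with a small NumPy sketch. Several details are assumptions made only for illustration, since the text does not fix them: the Sobel edge map is taken as the gradient magnitude with zero padding, the third convolutional layer is modeled as a 1×1 convolution (`conv_w`), the pooling operations are applied to the edge feature map, the shared MLP uses one hidden ReLU layer, and the final activation is a sigmoid. All function and parameter names are hypothetical.

```python
import numpy as np

def sobel_edge_map(gray):
    """Gradient magnitude of a 2-D image using 3x3 Sobel kernels (zero padding)."""
    kx = np.array([[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]])
    ky = kx.T
    h, w = gray.shape
    padded = np.pad(gray.astype(float), 1)
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            patch = padded[dy:dy + h, dx:dx + w]
            gx += kx[dy, dx] * patch
            gy += ky[dy, dx] * patch
    return np.hypot(gx, gy)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fe_unit(feat, edge, conv_w, mlp_w1, mlp_w2):
    """feat: (C, H, W) LS-module output; edge: (H, W) Sobel edge map."""
    # Assumed 1x1 convolution lifting the edge map to C channels, then LeakyReLU.
    edge_feat = leaky_relu(conv_w[:, None, None] * edge[None, :, :])

    def shared_mlp(v):
        return mlp_w2 @ np.maximum(mlp_w1 @ v, 0.0)

    gap = edge_feat.mean(axis=(1, 2))   # global average pooling
    gmp = edge_feat.max(axis=(1, 2))    # global max pooling
    attn = sigmoid(shared_mlp(gap) + shared_mlp(gmp))   # channel attention vector
    # Two element-wise products, then element-wise sum.
    return feat * edge_feat + feat * attn[:, None, None]
```

In a trained model the convolution and MLP weights would be learned; here they are plain arrays so the data flow can be checked in isolation.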
[0039] Furthermore, the encoder includes a first gated cross-attention module, a second gated cross-attention module, a first deformable encoding module, a second deformable encoding module, a third deformable encoding module, and a fourth deformable encoding module, and the working process within the encoder is as follows:
[0040] The image modality embedding vector is paired with the embedding vector of each keyframe selected from the video modality, giving a total of five groups of embedding vectors. Each group of embedding vectors is passed through the gated cross-attention layer in the first gated cross-attention module to obtain the output of the gated cross-attention layer.
[0041] The calculation process of the gated cross-attention layer within the first gated cross-attention module is as follows:
[0042] A query vector $Q$ is generated by linearly mapping the embedding vector of the image modality. For each keyframe selected from the video modality, its embedding vector is linearly mapped to generate the corresponding key vector and value vector. The key vector corresponding to the $i$-th keyframe selected from the video modality is denoted $K_i$, and the value vector corresponding to the $i$-th keyframe is denoted $V_i$;
[0043] Calculate the gating weight corresponding to the $i$-th keyframe selected from the video modality:
[0044] $g_i = \sigma(W_g s_i)$
[0045] where $s_i$ denotes the similarity between the embedding vector of the image modality and the embedding vector of the $i$-th keyframe selected from the video modality, $W_g$ denotes a learnable weight matrix, $g_i$ denotes the gating weight corresponding to the $i$-th keyframe selected from the video modality, and $\sigma$ is the sigmoid function;
[0046] $Z_1 = \sum_{i=1}^{5} g_i A_i$
[0047] where $A_i$ denotes the cross-attention between the image modality and the $i$-th keyframe selected from the video modality, and $Z_1$ denotes the output of the gated cross-attention layer within the first gated cross-attention module;
[0048] The image modality embedding vector is added to the output of the gated cross-attention layer in the first gated cross-attention module to obtain the image modality embedding vector after the first fusion of the video modality:
[0049] $X_1 = X_0 + Z_1$
[0050] where $X_0$ denotes the original embedding vector of the image modality and $X_1$ denotes the image modality embedding vector after the first fusion of the video modality;
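A minimal NumPy sketch of one gated cross-attention layer follows. It is meant only to make the data flow concrete and rests on assumptions not stated in the text: the cross-attention is scaled dot-product attention, the image/keyframe similarity is the cosine similarity of mean-pooled embeddings, the learnable gate weight is reduced to a scalar `w_gate`, and the gated results are summed over the five keyframes. The function and parameter names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(img, frames, Wq, Wk, Wv, w_gate):
    """img: (N, d) image tokens; frames: list of (M, d) keyframe tokens.

    Returns the fused image embedding (N, d) and the per-keyframe gates.
    """
    q = img @ Wq                                   # query from the image modality
    d = q.shape[-1]
    fused = np.zeros_like(q)
    gates = []
    for f in frames:
        k, v = f @ Wk, f @ Wv                      # key/value from one keyframe
        attn = softmax(q @ k.T / np.sqrt(d)) @ v   # cross-attention result
        # Assumed similarity: cosine of mean-pooled embeddings.
        a, b = img.mean(axis=0), f.mean(axis=0)
        sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        g = 1.0 / (1.0 + np.exp(-w_gate * sim))    # sigmoid gate in (0, 1)
        gates.append(g)
        fused += g * attn                          # keyframes with low gates contribute less
    return img + fused, gates                      # residual addition to the image embedding
```

Because each gate lies strictly between 0 and 1, a keyframe that is weakly related to the image is attenuated rather than removed outright; a hard threshold on the gate would be a further design choice that the text does not describe.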
[0051] The image modality embedding vector $X_1$ after the first fusion of the video modality is normalized to obtain the normalized result $\hat{X}_1$. The normalized result $\hat{X}_1$ is then passed through the third FFN layer, the output of the third FFN layer is residually connected with the normalized result $\hat{X}_1$, and the residual connection result is normalized to obtain $X_1'$;
[0052] $X_1'$ is paired with the embedding vector of each keyframe selected from the video modality, giving a total of five groups of embedding vectors. Each group of embedding vectors is then passed through the gated cross-attention layer within the second gated cross-attention module to obtain the output $Z_2$ of that layer;
[0053] $X_1'$ and $Z_2$ are residually connected to obtain the image modality embedding vector $X_2$ after the second fusion of the video modality:
[0054] $X_2 = X_1' + Z_2$
[0055] The image modality embedding vector $X_2$ after the second fusion of the video modality is normalized to obtain the normalized result $\hat{X}_2$. The normalized result $\hat{X}_2$ is then passed through the fourth FFN layer, the output of the fourth FFN layer is residually connected with the normalized result $\hat{X}_2$, and the residual connection result is normalized to obtain $X_2'$;
[0056] Position encoding is performed on $X_2'$, and the position encoding result is used as the input to the first deformable encoding module. Within the first deformable encoding module, the input first passes through a deformable self-attention unit; the output of the deformable self-attention unit is then residually connected with $X_2'$, and the residual connection result is normalized to obtain the normalized result $Y_1$;
[0057] $Y_1$ is then used as the input to the fifth FFN layer, the output of the fifth FFN layer is residually connected with $Y_1$, and the residual connection result is normalized to obtain the normalized result $Y_1'$, which is the output of the first deformable encoding module;
[0058] The output of the first deformable coding module is used as the input of the second deformable coding module, the output of the second deformable coding module is used as the input of the third deformable coding module, the output of the third deformable coding module is used as the input of the fourth deformable coding module, and the output of the fourth deformable coding module is used as the output of the encoder.
[0059] Furthermore, the method for obtaining the image modality embedding vector and the embedding vector of each keyframe selected from the video modality is as follows:
[0060] The features extracted from the image modal through the backbone network are tokenized, and the tokenization results are position-encoded to obtain the image modal embedding vector.
[0061] Each keyframe selected from the video modality has its features extracted by the backbone network and then tokenized. The tokenization results of each keyframe feature are then subjected to positional and temporal encoding to obtain the embedding vector of each keyframe selected from the video modality.
[0062] Furthermore, the working process of the decoder is as follows:
[0063] Step S1: Initialize the target query vector $Q_0$;
[0064] Step S2: Initialize the decoding module index $l = 1$;
[0065] Step S3: Use the target query vector $Q_{l-1}$ as the input of the $l$-th decoding module to obtain the target query vector $Q_l$ output by the $l$-th decoding module;
[0066] The working process of the $l$-th decoding module is as follows:
[0067] Step S31: Use the target query vector $Q_{l-1}$ as the input to the self-attention layer, residually connect the output of the self-attention layer with the target query vector $Q_{l-1}$, and then normalize the residual connection result;
[0068] Step S32: Perform position encoding on the normalization result from step S31;
[0069] Step S33: Use the position encoding result from step S32 and the position-encoded encoder output as inputs to the deformable cross-attention unit. Then residually connect the output of the deformable cross-attention unit with the normalization result from step S31, and normalize the residual connection result to obtain the normalized result $C_l$;
[0070] Step S4: Pass the normalization result $C_l$ from step S33 through the sixth FFN layer, residually connect the output of the sixth FFN layer with the normalization result $C_l$ from step S33, and then normalize the residual connection result to obtain the target query vector $Q_l$;
[0071] Step S5: Determine whether $l = L$ is satisfied, where $L$ is the total number of decoding modules;
[0072] If $l = L$ is satisfied, the target query vector $Q_l$ output by the $l$-th decoding module is used as the output of the decoder;
[0073] If $l = L$ is not satisfied, let $l = l + 1$ and return to step S3.
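The decoder iteration in steps S1 to S5 can be sketched as a simple loop. This is an illustrative sketch under assumptions: "standardization" is read as layer normalization over the feature dimension, position encoding is modeled as adding a precomputed array `pos`, the encoder output `memory` is assumed to be position-encoded already, and the attention and FFN sublayers are passed in as callables. All names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token over its feature dimension (assumed meaning of 'standardize')."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def decoding_module(q, memory, pos, self_attn, cross_attn, ffn):
    """One decoding module: steps S31, S32, S33 and S4."""
    s = layer_norm(q + self_attn(q))              # S31: self-attention + residual + norm
    p = s + pos                                   # S32: position encoding (assumed additive)
    c = layer_norm(s + cross_attn(p, memory))     # S33: deformable cross-attention + residual + norm
    return layer_norm(c + ffn(c))                 # S4: FFN + residual + norm

def decode(q0, memory, modules, pos):
    """S1-S5: feed the query through each decoding module in turn."""
    q = q0                                        # S1: initial target query
    for module in modules:                        # S2/S5: iterate until the last module
        q = decoding_module(q, memory, pos, **module)   # S3
    return q                                      # output of the last decoding module
```

Each entry of `modules` is a dict with the keys `self_attn`, `cross_attn`, and `ffn`, so different modules can hold different weights while sharing the same control flow.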
[0074] Furthermore, the working process of the deformable cross-attention unit is as follows:
[0075] Step 1: Combine the encoder outputs after position encoding into a single feature map;
[0076] Step 2: For each feature vector in the position encoding result in step S32, after passing each feature vector through the first linear layer, the reference point position corresponding to each feature vector in the position encoding result in step S32 in the feature map is obtained respectively.
[0077] Step 3: Process the feature vectors according to the reference point position corresponding to each feature vector in the feature map in the position encoding result of step S32;
[0078] For any feature vector in the position encoding result of step S32:
[0079] Step 31: Pass the feature vector through the second linear layer, the third linear layer, and the fourth linear layer respectively, and use the second linear layer, the third linear layer, and the fourth linear layer to predict three sampling offsets of the feature vector; wherein, the three sampling offsets predicted by each linear layer include the offset direction and the offset corresponding to the offset direction;
[0080] Step 32: Pass the integrated feature map from Step 1 through the fifth linear layer, the sixth linear layer, and the seventh linear layer respectively, and use the fifth linear layer, the sixth linear layer, and the seventh linear layer to output the value features respectively;
[0081] Step 33: The reference point corresponding to the feature vector in the feature map is offset in the value feature output by the fifth linear layer according to the three sampling offsets predicted by the second linear layer, so as to obtain the position of the three sampling points after the offset of the reference point. The feature vectors corresponding to the position of the three sampling points after the offset in the value feature output by the fifth linear layer are used to form the first set of feature vectors.
[0082] The reference point corresponding to the feature vector in the feature map is offset in the value feature output by the sixth linear layer according to the three sampling offsets predicted by the third linear layer, so as to obtain the position of the three sampling points after the offset of the reference point. The feature vectors corresponding to the position of the three sampling points after the offset are used to form the second set of feature vectors in the value feature output by the sixth linear layer.
[0083] The reference point corresponding to the feature vector in the feature map is offset in the value feature output by the seventh linear layer according to the three sampling offsets predicted by the fourth linear layer, so as to obtain the position of the three sampling points after the offset of the reference point. The feature vectors corresponding to the position of the three sampling points after the offset are used to form the third set of feature vectors in the value feature output by the seventh linear layer.
[0084] Step 34: Pass the feature vector through the eighth, ninth, and tenth linear layers respectively;
[0085] The output of the eighth linear layer is passed through the first Softmax activation function layer, and the attention weights of the three feature vectors in the first set of feature vectors are output by the first Softmax activation function layer.
[0086] The output of the ninth linear layer is passed through the second Softmax activation function layer, and the attention weights of the three feature vectors in the second set of feature vectors are output by the second Softmax activation function layer.
[0087] The output of the tenth linear layer is passed through the third Softmax activation function layer, and the attention weights of the three feature vectors in the third set of feature vectors are output by the third Softmax activation function layer.
[0088] Step 35: Use the attention weights output by the first Softmax activation function layer to perform a weighted summation on the first set of feature vectors;
[0089] The attention weights output by the second Softmax activation function layer are used to perform a weighted summation of the second set of feature vectors;
[0090] The attention weights output by the third Softmax activation function layer are used to perform a weighted summation of the third set of feature vectors;
[0091] Step 36: Aggregate the three sets of weighted summation results, and then pass the aggregated results through the eleventh linear layer;
[0092] Step 37: Process each feature vector in the position encoding result according to steps 31 to 36, and use the output of the eleventh linear layer as the output of the deformable cross-attention unit.
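The deformable cross-attention unit described above (three branches, three sampling points per branch) can be sketched for a single query vector as follows. The sketch makes several assumptions that the text leaves open: the reference point is obtained by a sigmoid of the first linear layer and scaled to the feature-map size, offsets are expressed as (dy, dx) pairs, sampling at fractional positions uses bilinear interpolation with border clamping, and "aggregation" of the three weighted sums is concatenation. The parameter dictionary keys (`W_ref`, `W_off`, `W_val`, `W_att`, `W_out`) are hypothetical names for the eleven linear layers.

```python
import numpy as np

def bilinear(fmap, y, x):
    """Sample an (H, W, C) map at a fractional (y, x), clamping to the border."""
    H, W, _ = fmap.shape
    y = float(np.clip(y, 0, H - 1))
    x = float(np.clip(x, 0, W - 1))
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * fmap[y0, x0] + wx * fmap[y0, x1]
    bottom = (1 - wx) * fmap[y1, x0] + wx * fmap[y1, x1]
    return (1 - wy) * top + wy * bottom

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def deformable_cross_attention(q, fmap, p):
    """q: (d,) query feature vector; fmap: (H, W, C) integrated encoder feature map."""
    H, W, _ = fmap.shape
    ref = 1.0 / (1.0 + np.exp(-(p["W_ref"] @ q)))        # step 2: reference point in [0, 1]^2
    ref_yx = ref * np.array([H - 1, W - 1])
    branches = []
    for h in range(3):
        offsets = (p["W_off"][h] @ q).reshape(3, 2)      # step 31: three (dy, dx) offsets
        values = fmap @ p["W_val"][h]                    # step 32: value features
        weights = softmax(p["W_att"][h] @ q)             # step 34: three attention weights
        samples = np.stack([bilinear(values, *(ref_yx + o)) for o in offsets])  # step 33
        branches.append(weights @ samples)               # step 35: weighted summation
    return p["W_out"] @ np.concatenate(branches)         # step 36: aggregate + output layer
```

Only nine positions of the feature map are read per query (three branches times three points), which is the source of the computational saving compared with dense cross-attention over all positions.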
[0093] The beneficial effects of this invention are:
[0094] The corrosion detection framework constructed in this invention achieves focused enhancement of the image modality and lightweight assistance from the video modality in its overall structure, thereby improving the practicality and detection accuracy of multimodal fusion. First, the backbone network introduces an LS module and an EAM edge enhancement module, giving the model stronger contextual structure awareness and corrosion boundary enhancement in the early feature extraction stage and significantly enhancing the texture contrast and structural detail expression of the corrosion region. Second, in the multimodal fusion stage, a gated cross-attention mechanism selectively filters the video modality, retaining only video frame information that is highly correlated with the image modality. This suppresses redundant information in the auxiliary modality while enhancing the useful information, reducing computational burden and improving the consistency and discriminative power of multimodal features. Third, this invention uses a deformable attention structure to complete deep semantic fusion, enabling the model to identify corrosion features stably under complex lighting, occlusion, and long-distance shooting conditions. The method of this invention can effectively improve the accuracy and robustness of corrosion detection without significantly increasing computational costs, and is suitable for corrosion detection tasks in complex high-altitude scenarios such as communication towers.
Attached Figure Description
[0095] Figure 1 is a corrosion-free image;
[0096] Figure 2 is a lightly corroded image;
[0097] Figure 3 is a moderately corroded image;
[0098] Figure 4 is a severely corroded image;
[0099] Figure 5 is the 26th frame of the video segment;
[0100] Figure 6 is the 76th frame of the video segment;
[0101] Figure 7 is the 126th frame of the video segment;
[0102] Figure 8 is a flowchart of the data preprocessing process;
[0103] Figure 9 is the overall network architecture diagram of the present invention;
[0104] Figure 10 is a schematic diagram of the LS convolutional module;
[0105] Figure 11 is a schematic diagram of the FE module;
[0106] Figure 12 is a schematic diagram of the deformable cross-attention module;
[0107] Figure 13 is a diagram of the encoder-decoder network architecture.
Detailed Implementation
[0108] Specific Implementation Method 1: This implementation method for multimodal collaborative corrosion detection of communication towers balances the accuracy and efficiency of multimodal detection, thereby achieving a more accurate and stable assessment of the corrosion status of communication towers. The method specifically includes the following steps:
[0109] Step 1: Acquire image and video data of the area to be detected on the communication tower, that is, for each location to be detected, obtain image and video data of the same location.
[0110] Step 2: Preprocess the video data obtained in Step 1 to extract keyframes related to the image data from the video data;
[0111] Specifically, for video data, since the backbone network takes RGB images as input and a complete video sequence contains a large amount of redundant information, the complete video sequence cannot be fed directly into the backbone network. Instead, keyframes with high modal correlation to the image are selected from the video as auxiliary input. As shown in Figure 8:
[0112] First, the video data is split into frames. Frames are then sampled from the resulting frame sequence at equal intervals, and the sampled frames form a keyframe sequence with temporal continuity.
[0113] The present invention samples at equal intervals with 24 frames between consecutive samples, so the extracted keyframes are the 1st, 26th, 51st, 76th, and so on, frames of the video sequence.
[0114] In order to select the keyframes that are most relevant to the image modality in terms of structure and appearance, the feature similarity between the image obtained in step one and each keyframe is calculated (the features are extracted by a CNN model or a ResNet model), and the keyframe corresponding to the largest feature similarity is determined.
[0115] Using the identified keyframe as the center, select two nearest neighbor keyframes forward and backward in the keyframe sequence, resulting in a total of five keyframes including the center (it should be noted that if there are fewer than two keyframes before or after the center, keyframes in the opposite direction are used to make up the five frames).
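The keyframe sampling and five-frame window selection described above can be sketched in plain Python. The sketch assumes cosine similarity as the feature similarity measure (the text only says "feature similarity") and takes the feature vectors as given, for example from a CNN or ResNet feature extractor. The function names are hypothetical.

```python
import math

def sample_keyframe_indices(num_frames, skipped=24):
    """0-based indices of equally spaced frames.

    With 24 frames skipped between samples this yields the 1st, 26th, 51st, ...
    frames in 1-based numbering.
    """
    return list(range(0, num_frames, skipped + 1))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_window(image_feat, keyframe_feats, window=5):
    """Positions in the keyframe sequence of the `window` keyframes around the best match."""
    sims = [cosine_similarity(image_feat, f) for f in keyframe_feats]
    center = max(range(len(sims)), key=sims.__getitem__)
    n = len(keyframe_feats)
    if n <= window:
        return list(range(n))
    half = window // 2
    # Clamp the window so that missing neighbors on one side are
    # replaced by extra keyframes from the opposite side.
    start = min(max(center - half, 0), n - window)
    return list(range(start, start + window))
```

Clamping the window start implements the boundary rule: when the best-matching keyframe is near the start or end of the sequence, the window shifts inward so that five keyframes are still returned.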
[0116] Next, the image obtained in Step 1 and the extracted keyframes are resized to a uniform size (the uniform size in this invention is 512×512×3), and the resized image and keyframes are then normalized separately.
[0117] This processing method not only ensures a high correlation between image and video modalities, but also effectively reduces redundant data.
[0118] Step 3: Use the normalized image and normalized keyframes from Step 2 as input to the corrosion detection model;
[0119] The corrosion detection model includes a backbone network, an encoder, a decoder, and a detection head;
[0120] To achieve corrosion detection of communication towers, the corrosion detection model needs to fully extract key features of the corroded area and effectively fuse information from different sources. The corrosion detection model of this invention can complete feature extraction and detection prediction of the corroded area within an end-to-end framework. The overall network structure is shown in Figure 9. The various parts of the corrosion detection model are explained below:
[0121] I. Backbone Network: The backbone network is primarily responsible for feature extraction from the original input data, providing effective feature representations for subsequent feature enhancement and multimodal fusion, thus achieving efficient feature expression while ensuring computational efficiency. To enhance the feature representation capability of the overall structure of the communication tower and the contextual information of the corrosion area, this invention improves the backbone network. The improved backbone network is beneficial for capturing the structural relationships between different components of the tower, providing a more suitable representation for subsequent feature fusion. Furthermore, considering that corrosion areas typically have blurred boundaries and irregular shapes, this invention introduces an Edge-Augmented Module (EAM module) in the first two stages of the backbone network. The EAM module can enhance the edge contrast information in the feature map, improving the ability to identify the boundaries of the corrosion area.
[0122] The working process of the backbone network is as follows:
[0123] Within the backbone network, the input image is sequentially passed through the Stem module (a module consisting of three convolutional layers connected in sequence), the first LS module, the first EAM module, the first downsampling module, the second LS module, the second EAM module, the second downsampling module, the third LS module, the third downsampling module, and the MSA module (Multi-Head Self-Attention). The output of the MSA module is used as the output of the backbone network.
[0124] The backbone network applies the same feature extraction process to data of both modalities, namely the image and the keyframes selected from the video.
[0125] (1) The working process of each LS module is the same. The following explanation takes the first LS module as an example. The working process of the first LS module is as follows:
[0126] Within the first LS module, the input features first pass through a first depthwise separable convolutional layer with a kernel size of 3×3.
[0127] The output of the first depthwise separable convolutional layer is then residually connected with the input features of the first LS module to obtain residual connection result a. The residual connection result a is then used as the input of the first SE block (Squeeze-and-Excitation block).
[0128] The output of the first SE block is used as the input of the first FFN layer. Then, the output of the first FFN layer is residually connected with the output of the first SE block to obtain the residual connection result b. The residual connection result b is used as the input of the first LS convolutional layer.
[0129] The output of the first LS convolutional layer is residually connected with the residual connection result b to obtain the residual connection result c.
[0130] The residual connection result c is used as the input of the second FFN layer, and the output of the second FFN layer is then residually connected with the residual connection result c to obtain the residual connection result d.
[0131] The residual connection result d is used as the output of the first LS module.
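The data flow of the LS module described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the names `ls_module` and `add` and the five layer arguments are hypothetical, each layer is passed in as a plain Python callable on a flat list of floats, and every residual connection is assumed to be element-wise addition.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def add(x: Vector, y: Vector) -> Vector:
    """Residual connection, assumed here to be element-wise addition."""
    return [a + b for a, b in zip(x, y)]


def ls_module(x: Vector, dw_conv: Layer, se_block: Layer,
              ffn1: Layer, ls_conv: Layer, ffn2: Layer) -> Vector:
    a = add(x, dw_conv(x))   # result a: 3x3 depthwise separable conv + input
    s = se_block(a)          # Squeeze-and-Excitation block on result a
    b = add(s, ffn1(s))      # result b: first FFN + SE output
    c = add(b, ls_conv(b))   # result c: LS convolution + result b
    d = add(c, ffn2(c))      # result d: second FFN + result c
    return d                 # output of the LS module
```

Passing the layers as arguments keeps the sketch independent of any particular convolution or FFN implementation; only the order of operations and the residual connections are taken from the description.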
[0132] In lightweight convolutional neural networks, traditional convolutions typically rely on fixed-size kernels to perform feature mixing, and their computational form can be expressed as:
[0133]
[0134] In the formula, the terms denote, respectively: the neighborhood of the input feature map centered at a given position, whose extent equals the convolution kernel size; the aggregation weight matrix; the convolution operation; and the output feature map at that position. The formula can be read as applying a convolution kernel of fixed size to the neighborhood centered at each position of the input. Because the kernel size is limited, lightweight models struggle to exploit wide-range contextual information; furthermore, because the convolution weights are fixed, the model cannot dynamically adjust feature aggregation according to the input content.
[0135] Furthermore, since feature extraction capability directly affects the effectiveness of subsequent feature enhancement and multimodal fusion, this invention introduces LS convolution into the backbone network to address the limited receptive field and insufficient feature mixing capability. The overall process of LS convolution is shown in Figure 10. By decoupling feature extraction into two stages, large-kernel perception and small-kernel aggregation, multi-scale context modeling is achieved without significantly increasing computational overhead. The overall process is expressed as follows:
[0136]
[0137] In the formula, the terms denote the aggregation weight matrix produced by large-kernel perception and the local window on which the small-kernel aggregation stage performs its convolution. The large-kernel perception stage operates on a larger context region to acquire wide-area spatial semantic information; the small-kernel aggregation stage operates on a local region to model key details finely.
[0138] Specifically, in the large kernel perception module, the feature map is first subjected to pointwise convolution to compress the number of channels, so as to reduce the amount of computation while ensuring the feature representation ability. Then, a large kernel depthwise separable convolution is used to extract features, expand the receptive field and enhance the ability to capture long-distance spatial dependencies. Finally, pointwise convolution is used to map to an appropriate number of channels. The whole process can be described as follows:
[0139]
[0140] In the formula, the terms denote the pointwise convolution, the depthwise separable convolution, the receptive field size of large-kernel perception (set to 7 in this invention), and the generated weights. Large-kernel perception generates a weight feature map, which is reshaped into dynamic weights and supplied to the small-kernel aggregation module for dynamic convolution. The remaining terms denote the number of groups and the kernel size of the small-kernel aggregation module; in this invention, the kernel size of the small-kernel aggregation module is set to 3.
[0141] In the small-kernel aggregation module, the channels of the input feature map are divided into groups, which achieves efficient feature aggregation while reducing computational complexity and storage overhead. Each group has a corresponding dynamic convolution kernel generated by the large-kernel perception module. The small-kernel aggregation module uses these dynamic kernels to perform grouped dynamic convolution and adaptive feature aggregation on local regions. Its calculation form is as follows:
[0142]
[0143] In the formula, the terms denote: the channel of the input feature map being processed and the channel group to which it belongs; the small-kernel neighborhood centered at the current position; the reshaped weights; and the aggregated feature representation.
[0144] By combining broad context awareness with local fine-grained aggregation, LS convolution effectively compensates for the shortcomings of traditional lightweight convolution in terms of receptive field range and feature mixing ability, thereby improving the robustness of feature representation and providing a feature foundation for subsequent multimodal feature fusion and object detection.
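The two-stage idea of LS convolution can be illustrated with a deliberately reduced toy. The sketch below is one-dimensional, with a single channel and a single group, and all names (`conv1d_same`, `ls_conv_1d`, `pw_in`, `dw_large`, `pw_out`) are hypothetical. It assumes zero padding at the borders and models the pointwise convolutions as scalar multiplications; it is meant only to show that the weights applied to the small 3-wide window are generated from a larger 7-wide context, not to reproduce the module of the invention.

```python
from typing import List


def conv1d_same(x: List[float], kernel: List[float]) -> List[float]:
    """Zero-padded 1-D filtering that keeps the input length."""
    r = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - r
            if 0 <= j < len(x):
                acc += w * x[j]
        out.append(acc)
    return out


def ls_conv_1d(x: List[float], pw_in: float,
               dw_large: List[float], pw_out: List[float]) -> List[float]:
    # Large-kernel perception: pointwise -> large depthwise -> pointwise.
    hidden = conv1d_same([pw_in * v for v in x], dw_large)  # 7-wide context
    # One small kernel (len(pw_out) taps) is generated per position.
    dynamic = [[p * h for p in pw_out] for h in hidden]

    # Small-kernel aggregation: apply each position's own kernel locally.
    r = len(pw_out) // 2
    y = []
    for i in range(len(x)):
        acc = 0.0
        for k, w in enumerate(dynamic[i]):
            j = i + k - r
            if 0 <= j < len(x):
                acc += w * x[j]
        y.append(acc)
    return y
```

Unlike an ordinary convolution, the kernel used at each position here depends on the input content, which is the property the description attributes to LS convolution.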
[0145] (2) To enhance the model's ability to perceive the boundaries of corroded regions, this invention introduces an EAM module in the first two stages of the backbone network. This module extracts edge information from the input image using the Sobel operator to generate a Sobel edge map. The Sobel edge map is then used as an auxiliary input to the FE unit. The FE unit uses this boundary information to enhance boundary features in both the spatial and channel dimensions, strengthening the boundary contrast in the feature map and thereby preserving corrosion edge information as far as possible. The overall process is shown in Figure 11.
[0146] The first EAM module and the second EAM module operate in the same way. Taking the first EAM module as an example, the operation process of the first EAM module is as follows:
[0147] Within the first EAM module, the output of the first LS module and the Sobel edge map are used as the input of the first FE unit. The output of the first FE unit is then used as the input of the first convolutional layer (an ordinary convolution, that is, neither a depthwise separable convolution nor a pointwise convolution). The output of the first convolutional layer is then residually connected with the output of the first LS module to obtain the residual connection result e.
[0148] The residual connection result e and the Sobel edge map are used as inputs to the second FE unit, and the output of the second FE unit is used as input to the second convolutional layer (still a normal convolution). The output of the second convolutional layer is then residually connected with the output of the first LS module to obtain the residual connection result f.
[0149] The residual connection result f is used as the output of the first EAM module.
[0150] First, it should be noted that the Sobel edge map is generated as follows:
[0151] The Sobel horizontal operator and the Sobel vertical operator are applied to the image currently input to the corrosion detection network to obtain a single-channel Sobel edge map:
[0152]
[0153] In the formula, the terms denote the Sobel horizontal operator, the Sobel vertical operator, the image currently input to the corrosion detection network, and the convolution operation.
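The generation of a single-channel Sobel edge map can be sketched as follows. The way the horizontal and vertical responses are combined is not legible in the text above, so this sketch assumes the common gradient magnitude (the square root of the sum of squares). It also assumes a grayscale image stored as a 2-D list, sets border pixels to zero, and uses the hypothetical name `sobel_edge_map`.

```python
import math
from typing import List

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_edge_map(image: List[List[float]]) -> List[List[float]]:
    """Return the gradient magnitude at each interior pixel (borders stay 0)."""
    h, w = len(image), len(image[0])
    edges = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0.0
            # The kernels are applied without flipping (cross-correlation);
            # this changes only the sign of gx and gy, not the magnitude.
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * pixel
                    gy += SOBEL_Y[dy][dx] * pixel
            edges[y][x] = math.hypot(gx, gy)  # assumed combination rule
    return edges
```

For a vertical step edge the horizontal response dominates and the vertical response vanishes, so the edge map highlights exactly the column where the intensity changes.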
[0154] The working process of the first FE unit is explained below with reference to Figure 11:
[0155] The Sobel edge map is passed in sequence through the third convolutional layer (also an ordinary convolutional layer) and the LeakyReLU activation function layer. The third convolutional layer performs feature mapping and size adjustment so that the spatial size and number of channels of the edge map are consistent with those of the current main feature map, and the LeakyReLU activation function layer outputs the edge feature map:
[0156]
[0157] Using the edge feature map instead of the raw Sobel edge map effectively avoids the optimization difficulties caused by directly fusing the original edge map.
[0158] Building on this, and considering that different channels differ in importance when representing the boundaries of corroded regions, this invention introduces a channel attention mechanism to model edge features. The channel attention mechanism first aggregates the overall information of each channel, evaluates the importance of each channel and assigns weights, and then adjusts the feature response intensity according to these weights to highlight key boundary information. The entire process can be described as follows:
[0159]
[0160] In the formula, the terms denote global average pooling, global max pooling, the shared multilayer perceptron, the activation function, and the resulting channel attention vector. The computed channel attention weights highlight the key information channels.
[0161] Finally, the edge feature map and the feature map output by the first LS module are fused and enhanced: the feature map output by the first LS module is multiplied element-wise with the edge feature map and with the channel attention vector, respectively, and the two products are then added element-wise to obtain the final output of the first FE unit:
[0162]
[0163] In the formula, the operator symbol denotes element-wise multiplication;
[0164] Through the above design, the EAM module achieves joint enhancement of the feature map in terms of spatial location and channel dimension without changing the feature map size or the number of channels. While maintaining the integrity of the original semantic information, it effectively improves the feature expression capability at the boundary of the corroded region.
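The fusion performed inside an FE unit can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of the invention: feature maps are nested lists indexed as `[channel][row][column]`, the channel attention is assumed to pool the edge feature map, the unspecified activation function is assumed to be a sigmoid, the shared multilayer perceptron is passed in as a callable, and the names `channel_attention` and `fe_fuse` are hypothetical.

```python
import math
from typing import Callable, List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][column]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def channel_attention(edge_feat: FeatureMap,
                      shared_mlp: Callable[[List[float]], List[float]]) -> List[float]:
    """One attention weight per channel from average- and max-pooled statistics."""
    avg = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in edge_feat]
    mx = [max(max(row) for row in ch) for ch in edge_feat]
    return [sigmoid(a + m) for a, m in zip(shared_mlp(avg), shared_mlp(mx))]


def fe_fuse(feat: FeatureMap, edge_feat: FeatureMap, attn: List[float]) -> FeatureMap:
    """Output = feat * edge_feat + feat * attn (attn broadcast over space)."""
    return [
        [[f * e + f * attn[c] for f, e in zip(f_row, e_row)]
         for f_row, e_row in zip(f_ch, e_ch)]
        for c, (f_ch, e_ch) in enumerate(zip(feat, edge_feat))
    ]
```

The first product emphasizes spatial positions where the edge response is strong, while the second rescales whole channels, which corresponds to the spatial and channel enhancement described above.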
[0165] After feature extraction of the backbone network, feature map representations of each keyframe in both the image and video modalities are obtained. To facilitate subsequent feature modeling, the feature maps are unfolded in spatial dimensions and transformed into a sequence of embedding vectors of uniform dimensions through linear mapping.
[0166] Specifically, the features extracted from the image modality by the backbone network are tokenized, and the tokenization result is position-encoded to obtain the image modality embedding vector.
[0167] For each keyframe selected from the video modality, the features extracted by the backbone network are tokenized. The tokenization result of each keyframe is then given positional and temporal encoding to obtain the embedding vector of that keyframe.
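Tokenization and encoding can be sketched as follows. The form of the positional and temporal encodings is not specified above, so this sketch assumes both are additive vectors looked up from tables, omits the linear mapping to a uniform dimension, and uses the hypothetical names `tokenize`, `embed`, `pos_table`, and `time_vec`.

```python
from typing import List, Optional

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][column]


def tokenize(feature_map: FeatureMap) -> List[List[float]]:
    """Unfold the spatial dimensions: one token (channel vector) per location."""
    channels = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    return [
        [feature_map[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


def embed(tokens: List[List[float]],
          pos_table: List[List[float]],
          time_vec: Optional[List[float]] = None) -> List[List[float]]:
    """Add a positional vector per token and, for keyframes, a temporal vector."""
    out = []
    for i, token in enumerate(tokens):
        vec = [t + p for t, p in zip(token, pos_table[i])]
        if time_vec is not None:  # image tokens carry no temporal encoding
            vec = [v + t for v, t in zip(vec, time_vec)]
        out.append(vec)
    return out
```

Calling `embed` without `time_vec` corresponds to the image modality, while passing a distinct `time_vec` for each keyframe corresponds to the video modality.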
[0168] II. Encoder: The encoder takes the embedding vector sequences as input and performs multimodal feature fusion. The image embedding vector retains its corresponding spatial structure information, while the video embedding vectors additionally carry temporal identifiers that represent the temporal relationship between different video frames. The first two layers of the encoder introduce a gated cross-attention mechanism that prioritizes image features while selectively introducing video features, suppressing the interference of redundant temporal information with the corrosion detection task. Furthermore, because the video modality typically contains a large amount of background information and redundant temporal information unrelated to the corrosion target, a gating mechanism is applied to the cross-modal attention output to adaptively adjust the weights of the keyframe features in the video modality. The gating weights are determined by the correlation between the image embedding vector and the keyframe embedding vectors in the video modality: keyframe features with low correlation are assigned smaller weights, weakening their influence on the update of the image modality features; conversely, keyframe features with high correlation receive larger weights. The gated cross-modal features are combined with the image modality embedding vector through a residual connection, supplementing the image modality embedding vector with video information without altering the main features of the image modality.
[0169] Specifically, the encoder includes a first gated cross-attention module, a second gated cross-attention module, a first deformable encoding module, a second deformable encoding module, a third deformable encoding module, and a fourth deformable encoding module, and the working process within the encoder is as follows:
[0170] The image modality embedding vector is paired with the embedding vector of each keyframe selected from the video modality, giving five groups of embedding vectors in total. Each group of embedding vectors is passed through the gated cross-attention layer in the first gated cross-attention module to obtain the output of the gated cross-attention layer. The calculation process of the gated cross-attention layer in the first gated cross-attention module is as follows:
[0171] A query vector is generated by linearly mapping the image modality embedding vector. The embedding vector of each keyframe selected from the video modality is linearly mapped to generate the key vector and the value vector corresponding to that keyframe; the key vector and value vector are indexed by the keyframe to which they belong;
[0172] The gating weight corresponding to each keyframe selected from the video modality is calculated:
[0173]
[0174] In the formula, the similarity term denotes the similarity between the image modality embedding vector and the embedding vector of the given keyframe selected from the video modality (the dot product of the two embedding vectors is computed first and then passed through the Sigmoid function, whose output is used as the similarity). The remaining terms denote the learnable weight matrix, the Sigmoid function, and the gating weight of the given keyframe, which quantifies the contribution of that keyframe's features to the current image feature update;
[0175]
[0176] In the formula, the terms denote the cross-attention between the image modality and the given keyframe selected from the video modality, and the output of the gated cross-attention layer within the first gated cross-attention module;
[0177] The image modality embedding vector is added to the output of the gated cross-attention layer in the first gated cross-attention module to obtain the image modality embedding vector after the first fusion of video modalities:
[0178]
[0179] In the formula, the terms denote the original image modality embedding vector and the image modality embedding vector after the first fusion of video modalities.
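The gated cross-attention step can be sketched in a reduced form. The exact formulas are not legible in the text above, so the sketch makes several assumptions that should not be read as the method of the invention: the image modality is represented by a single vector, each keyframe is summarized by the mean of its tokens when computing similarity, the learnable weight matrix is reduced to a scalar `gate_w`, the query, key, and value projections are omitted, and the gated attention outputs of the keyframes are summed. All function names are hypothetical.

```python
import math
from typing import List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(q: List[float], keys: List[List[float]],
                    values: List[List[float]]) -> List[float]:
    """Scaled dot-product attention of one query over a keyframe's tokens."""
    weights = softmax([dot(q, k) / math.sqrt(len(q)) for k in keys])
    return [sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))]


def gated_fusion(z_img: List[float], frames: List[List[List[float]]],
                 gate_w: float = 1.0) -> Tuple[List[float], List[float]]:
    fused = [0.0] * len(z_img)
    gates = []
    for tokens in frames:
        summary = [sum(t[d] for t in tokens) / len(tokens)
                   for d in range(len(z_img))]
        similarity = sigmoid(dot(z_img, summary))  # dot product, then Sigmoid
        gate = sigmoid(gate_w * similarity)        # gating weight of this keyframe
        gates.append(gate)
        attended = cross_attention(z_img, tokens, tokens)
        fused = [f + gate * a for f, a in zip(fused, attended)]
    # Residual connection with the original image embedding.
    return [z + f for z, f in zip(z_img, fused)], gates
```

In this sketch a keyframe whose summary points in the same direction as the image embedding receives a larger gate than one pointing away from it, mirroring the selective introduction of video features described above.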
[0180] The image modality embedding vector after the first fusion of video modalities is normalized. The normalized result is passed through the third FFN layer, the output of the third FFN layer is residually connected with the normalized result, and the residual connection result is normalized again to obtain the output of the first gated cross-attention module;
[0181] The output of the first gated cross-attention module (which plays the role that the image modality embedding vector played as the input of the gated cross-attention layer in the first gated cross-attention module) is paired with the embedding vector of each keyframe selected from the video modality, giving five groups of embedding vectors in total. Each group is passed through the gated cross-attention layer in the second gated cross-attention module to obtain the output of that layer. The calculation is similar to that of the gated cross-attention layer in the first gated cross-attention module;
[0182] The output of the first gated cross-attention module and the output of the gated cross-attention layer in the second gated cross-attention module are residually connected to obtain the image modality embedding vector after the second fusion of video modalities:
[0183]
[0184] The image modality embedding vector after the second fusion of video modalities is normalized. The normalized result is passed through the fourth FFN layer, the output of the fourth FFN layer is residually connected with the normalized result, and the residual connection result is normalized again to obtain the output of the second gated cross-attention module;
[0185] Position encoding is applied to the output of the second gated cross-attention module, and the position encoding result is used as the input of the first deformable encoding module. Within the first deformable encoding module, the input first passes through a deformable self-attention unit; the output of the deformable self-attention unit is then residually connected with the output of the second gated cross-attention module, and the residual connection result is normalized;
[0186] The normalized result is then used as the input of the fifth FFN layer, the output of the fifth FFN layer is residually connected with that normalized result, and the residual connection result is normalized again to obtain the output of the first deformable encoding module.
[0187] The output of the first deformable encoding module is used as the input of the second deformable encoding module, the output of the second deformable encoding module is used as the input of the third deformable encoding module, the output of the third deformable encoding module is used as the input of the fourth deformable encoding module, and the output of the fourth deformable encoding module is used as the encoder output. It should be noted that, because the inputs of the second, third, and fourth deformable encoding modules are the outputs of the first, second, and third deformable encoding modules respectively, the residual connection after the deformable self-attention unit in each of these modules is made between the output of that unit and the output of the preceding deformable encoding module.
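The structure shared by the deformable encoding modules (attention, residual connection, normalization, FFN, residual connection, normalization) can be sketched as follows. The sketch assumes that the normalization is layer normalization over a single vector, omits position encoding, and passes the deformable self-attention unit and the FFN in as callables; the names `layer_norm`, `deformable_encoding_module`, and `run_encoder_stack` are hypothetical.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a vector to zero mean and unit variance (assumed form)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add(x: Vector, y: Vector) -> Vector:
    return [a + b for a, b in zip(x, y)]


def deformable_encoding_module(x: Vector, self_attention: Layer, ffn: Layer) -> Vector:
    h = layer_norm(add(x, self_attention(x)))  # attention + residual + norm
    return layer_norm(add(h, ffn(h)))          # FFN + residual + norm


def run_encoder_stack(x: Vector, modules: List[Tuple[Layer, Layer]]) -> Vector:
    """Chain the modules: each output is the input of the next module."""
    for self_attention, ffn in modules:
        x = deformable_encoding_module(x, self_attention, ffn)
    return x
```

With four `(self_attention, ffn)` pairs in `modules`, the loop reproduces the chaining of the first to fourth deformable encoding modules described above.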
[0188] After processing by the gated cross-attention modules within the encoder, the image modality embedding vector has fully absorbed the discriminative information of the keyframes related to the corrosion target in the video modality, while redundant or irrelevant temporal features are effectively suppressed. In the subsequent deformable encoding process of the encoder, only the updated image modality embedding vector is modeled, and the video modality embedding vectors are no longer propagated, thereby reducing computational load and avoiding noise accumulation.
[0189] III. Decoder: In the decoding stage, a predefined learnable target query vector is introduced as the input to the decoder. In each decoding layer, the target query vector is first modeled with a self-attention mechanism to establish the relationships and division of labor among different candidate targets. The target query vector then serves as the query and interacts, through a deformable cross-attention mechanism, with the image embedding vector output by the encoder (that is, the encoder output), so that the target query vector adaptively focuses on key feature regions related to the corrosion target. After multiple layers of decoding iterations, the final output target query vector has a clear target orientation and can characterize the semantic information of specific corrosion targets.
[0190] Specifically, as shown in Figure 13, the decoder operates as follows:
[0191] Step S1: Initialize the target query vector;
[0192] Step S2: Initialize the index of the current decoding module;
[0193] Step S3: Use the target query vector as the input of the current decoding module to obtain the target query vector output by the current decoding module;
[0194] The working process of the current decoding module is as follows:
[0195] Step S31: Use the target query vector as the input of the self-attention layer, residually connect the output of the self-attention layer with the target query vector, and then normalize the residual connection result;
[0196] Step S32: Perform position encoding on the normalized result from step S31;
[0197] Step S33: Use the position encoding result from step S32 and the position-encoded encoder output as inputs to the deformable cross-attention unit. Then residually connect the output of the deformable cross-attention unit with the normalized result from step S31, and normalize the residual connection result;
[0198] The working process of the deformable cross-attention unit is as follows:
[0199] For any feature vector in the position encoding result of step S32, the reference point location of the feature vector is first predicted, and multiple sets of sampling offsets and attention weights are generated based on this reference point. The feature map is weighted and aggregated only at a limited number of sampling locations in the neighborhood of the reference point, focusing the modeling on key spatial locations related to the corroded region. With the deformable cross-attention mechanism, each embedding vector establishes attention relationships with only a small number of adaptively selected spatial locations. Computational complexity is therefore significantly reduced while contextual modeling capability is maintained, and the model can stably focus on key structural and texture information related to the corroded region. As shown in Figure 12, the specific steps are as follows:
[0200] Step 1: Combine the encoder outputs after position encoding into a single feature map;
[0201] Step 2: Pass each feature vector in the position encoding result of step S32 through the first linear layer to obtain the reference point position in the feature map corresponding to that feature vector;
[0202] Step 3: Process each feature vector in the position encoding result of step S32 according to its corresponding reference point position in the feature map;
[0203] For any feature vector in the position encoding result of step S32:
[0204] Step 31: Pass the feature vector through the second linear layer, the third linear layer, and the fourth linear layer respectively. Each of these linear layers predicts three sampling offsets for the feature vector, and each sampling offset comprises an offset direction and the offset magnitude along that direction;
[0205] Step 32: Pass the integrated feature map from Step 1 through the fifth linear layer, the sixth linear layer, and the seventh linear layer respectively, and use the fifth linear layer, the sixth linear layer, and the seventh linear layer to output the value features respectively;
[0206] Step 33: In the value feature output by the fifth linear layer, the reference point corresponding to the feature vector is shifted according to the three sampling offsets predicted by the second linear layer to obtain the positions of three sampling points. The feature vectors at these three sampling point positions in the value feature output by the fifth linear layer form the first set of feature vectors.
[0207] In the value feature output by the sixth linear layer, the reference point corresponding to the feature vector is shifted according to the three sampling offsets predicted by the third linear layer to obtain the positions of three sampling points. The feature vectors at these three sampling point positions in the value feature output by the sixth linear layer form the second set of feature vectors.
[0208] In the value feature output by the seventh linear layer, the reference point corresponding to the feature vector is shifted according to the three sampling offsets predicted by the fourth linear layer to obtain the positions of three sampling points. The feature vectors at these three sampling point positions in the value feature output by the seventh linear layer form the third set of feature vectors.
[0209] Step 34: Pass the feature vector through the eighth, ninth, and tenth linear layers respectively;
[0210] The output of the eighth linear layer is passed through the first Softmax activation function layer, and the attention weights of the three feature vectors in the first set of feature vectors are output by the first Softmax activation function layer.
[0211] The output of the ninth linear layer is passed through the second Softmax activation function layer, and the attention weights of the three feature vectors in the second set of feature vectors are output by the second Softmax activation function layer.
[0212] The output of the tenth linear layer is passed through the third Softmax activation function layer, and the attention weights of the three feature vectors in the third set of feature vectors are output by the third Softmax activation function layer.
[0213] Step 35: Use the attention weights output by the first Softmax activation function layer to perform a weighted summation on the first set of feature vectors;
[0214] The attention weights output by the second Softmax activation function layer are used to perform a weighted summation of the second set of feature vectors;
[0215] The attention weights output by the third Softmax activation function layer are used to perform a weighted summation of the third set of feature vectors;
[0216] Step 36: Aggregate the three sets of weighted summation results, and then pass the aggregated results through the eleventh linear layer;
[0217] Step 37: Process each feature vector in the position encoding result according to steps 31 to 36, and use the output of the eleventh linear layer as the output of the deformable cross-attention unit.
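One branch of the sampling procedure in steps 31 to 35 can be sketched as follows. This is a simplified illustration, not the unit of the invention: it covers a single branch with three sampling points, treats the value feature as a 2-D map of scalars, takes the offsets and attention logits as given inputs rather than predicting them with linear layers, and assumes bilinear interpolation with clamping at the borders for fractional sampling positions, which the description does not specify. The names `bilinear_sample` and `deformable_sample` are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bilinear_sample(fmap: List[List[float]], x: float, y: float) -> float:
    """Read fmap at a fractional (x, y) position, clamped to the map borders."""
    h, w = len(fmap), len(fmap[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = fmap[y0][x0] * (1 - fx) + fmap[y0][x1] * fx
    bottom = fmap[y1][x0] * (1 - fx) + fmap[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def deformable_sample(value_map: List[List[float]],
                      ref: Tuple[float, float],
                      offsets: List[Tuple[float, float]],
                      logits: List[float]) -> float:
    """Weighted sum of the values sampled at ref + offset for each offset."""
    weights = softmax(logits)  # attention weights over the sampling points
    rx, ry = ref
    samples = [bilinear_sample(value_map, rx + dx, ry + dy) for dx, dy in offsets]
    return sum(w * s for w, s in zip(weights, samples))
```

Because only three positions are read per branch, the cost of one query does not grow with the size of the feature map, which is the efficiency argument made in the description.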
[0218] Step S4: Pass the normalized result from step S33 through the sixth FFN layer, residually connect the output of the sixth FFN layer with the normalized result from step S33, and then normalize the residual connection result to obtain the target query vector output by the current decoding module;
[0219] Step S5: Determine whether the termination condition on the decoding module index is met;
[0220] If it is met, the target query vector output by the current decoding module is used as the output of the decoder;
[0221] If it is not met, update the decoding module index and return to step S3.
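The control flow of steps S1 to S5 can be sketched as follows. The sketch assumes that the termination condition is reaching the last decoding module and that the index is increased by one on each pass, since the exact condition is not legible above. Every layer is passed in as a callable, and the names `decoding_module` and `run_decoder` are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def add(x: Vector, y: Vector) -> Vector:
    return [a + b for a, b in zip(x, y)]


def decoding_module(q: Vector, memory: Vector,
                    self_attn: Callable[[Vector], Vector],
                    pos_enc: Callable[[Vector], Vector],
                    cross_attn: Callable[[Vector, Vector], Vector],
                    ffn: Callable[[Vector], Vector],
                    norm: Callable[[Vector], Vector]) -> Vector:
    h = norm(add(q, self_attn(q)))             # step S31
    p = pos_enc(h)                             # step S32
    h2 = norm(add(h, cross_attn(p, memory)))   # step S33
    return norm(add(h2, ffn(h2)))              # step S4


def run_decoder(queries: Vector, memory: Vector,
                modules: List[Callable[[Vector, Vector], Vector]]) -> Vector:
    index = 0                                          # step S2
    while True:
        queries = modules[index](queries, memory)      # steps S3 and S4
        if index == len(modules) - 1:                  # step S5 (assumed condition)
            return queries
        index += 1                                     # move to the next module
```

Each entry of `modules` is expected to wrap `decoding_module` with its own layers, so that the output queries of one module become the input queries of the next.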
[0222] IV. Detection Head: The detection head receives the target query vectors output by the decoder and feeds them separately into the classification branch and the regression branch to complete the classification and location regression of corrosion targets. Each target query vector output by the decoder corresponds to a different location in the image to be detected; that is, the detection head outputs the detection results for different locations in the image to be detected.
[0223] Specifically, the classification branch is used to predict the probability that a target query vector output by the decoder belongs to each corrosion category, while the regression branch is used to predict the spatial location and extent of the corroded region in the image.
[0224] Experimental Section
[0225] Dataset Construction: To verify the practical effectiveness of the method in real-world engineering scenarios, a multimodal dataset for corrosion detection of communication towers was constructed. All data in this dataset were collected on-site from actual communication towers and comprise 1174 high-resolution still images and 78 continuous video sequences. During data collection, manual inspection was combined with drone aerial photography to collect data comprehensively from different areas of the towers, covering various structural locations and corrosion morphologies. As shown in Figures 1, 2, 3, and 4, the corrosion levels of the towers in the dataset include no corrosion, light corrosion, moderate corrosion, and severe corrosion, which ensures that the data are representative of engineering application scenarios. The corrosion samples of varying degrees shown in Figures 1 to 4 exhibit clear corrosion patches and irregular corrosion boundaries. In addition to image data, the dataset includes video sequences taken by drones flying around the tower, recording the structural changes of the same corroded area as seen from different perspectives. As shown in Figures 5, 6, and 7, video frames at different times can effectively compensate for the information loss caused by limited viewpoint, uneven lighting, or local occlusion in a single image, providing important dynamic information for subsequent multimodal fusion analysis. The corrosion detection model of this invention was validated using the image and video data acquired in this way, demonstrating the effectiveness of the method.
[0226] The above examples of the present invention are merely illustrative of the computational model and process of the present invention, and are not intended to limit the implementation of the present invention. Those skilled in the art will recognize that other variations or modifications can be made based on the above description. It is impossible to exhaustively list all possible implementations here. Any obvious variations or modifications derived from the technical solutions of the present invention are still within the scope of protection of the present invention.
Claims
1. A multimodal collaborative method for detecting corrosion in communication towers, characterized in that, The method specifically includes the following steps: Step 1: Acquire image and video data of the area to be detected on the communication tower, respectively; Step 2: Preprocess the video data obtained in Step 1 to extract keyframes related to the image data from the video data; Next, the images and keyframes obtained in step one are processed. Specifically, the images and keyframes obtained in step one are adjusted to a uniform size, and then the adjusted images and keyframes are normalized respectively. Step 3: Use the normalized image and normalized keyframes from Step 2 as input to the corrosion detection model; The corrosion detection model includes a backbone network, an encoder, a decoder, and a detection head; Step 4: Output the corrosion detection results of the area to be tested through the detection head of the corrosion detection model.
2. The multimodal collaborative corrosion detection method for communication towers according to claim 1, characterized in that, In step two, the video data obtained in step one is preprocessed to extract keyframes related to the image data from the video data. The specific process is as follows: First, the video data is divided into frames. Then, each frame obtained from the frame division is extracted at equal intervals. The extracted frames are used to form a keyframe sequence with temporal continuity. Calculate the feature similarity between the image obtained in step one and each keyframe, and determine the keyframe corresponding to the largest feature similarity. Using the identified keyframe as the center, select the two nearest neighbor keyframes forward and backward in the keyframe sequence, resulting in a total of five keyframes, including the center.
3. The multimodal collaborative corrosion detection method for communication towers according to claim 2, characterized in that the working process of the backbone network is as follows: Within the backbone network, the input image passes sequentially through the Stem module, the first LS module, the first EAM module, the first downsampling module, the second LS module, the second EAM module, the second downsampling module, the third LS module, the third downsampling module, and the MSA module. The output of the MSA module is used as the output of the backbone network.
4. The multimodal collaborative corrosion detection method for communication towers according to claim 3, characterized in that the working process of the first LS module is as follows: Within the first LS module, the input features first pass through a first depthwise separable convolutional layer with a kernel size of 3×3. The output of the first depthwise separable convolutional layer is residually connected with the input features of the first LS module to obtain residual connection result a. Residual connection result a is used as the input of the first SE block, and the output of the first SE block is used as the input of the first FFN layer. The output of the first FFN layer is residually connected with the output of the first SE block to obtain residual connection result b. Residual connection result b is used as the input of the first LS convolutional layer, and the output of the first LS convolutional layer is residually connected with residual connection result b to obtain residual connection result c. Residual connection result c is used as the input of the second FFN layer, and the output of the second FFN layer is residually connected with residual connection result c to obtain residual connection result d. Residual connection result d is used as the output of the first LS module.
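The chain of residual connections in claim 4 can be hard to follow in prose, so a minimal wiring sketch may help. The five sub-layers are passed in as plain callables and features are flat lists of numbers; this representation and the names `ls_module` and `add` are assumptions made purely for illustration, not the layers themselves.

```python
def add(x, y):
    """Element-wise sum, standing in for a residual connection."""
    return [a + b for a, b in zip(x, y)]


def ls_module(x, dwconv, se_block, ffn1, ls_conv, ffn2):
    """Wire the sub-layers in the order given in claim 4."""
    a = add(dwconv(x), x)    # residual connection result a
    s = se_block(a)          # SE block output
    b = add(ffn1(s), s)      # residual connection result b (with the SE output)
    c = add(ls_conv(b), b)   # residual connection result c
    d = add(ffn2(c), c)      # residual connection result d
    return d


# Stand-in sub-layers chosen only so the data flow can be checked by hand.
out = ls_module(
    [1, 2],
    dwconv=lambda v: [2 * t for t in v],
    se_block=lambda v: list(v),
    ffn1=lambda v: [t + 1 for t in v],
    ls_conv=lambda v: [0 for _ in v],
    ffn2=lambda v: [-t for t in v],
)
```

Note that the second residual connection uses the SE block output rather than result a, which is the one place where the wiring departs from a plain stack of residual blocks.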
5. The multimodal collaborative corrosion detection method for communication towers according to claim 4, characterized in that the working process of the first EAM module is as follows: Within the first EAM module, the output of the first LS module and the Sobel edge map are used as the inputs of the first FE unit. The output of the first FE unit is used as the input of the first convolutional layer, and the output of the first convolutional layer is residually connected with the output of the first LS module to obtain residual connection result e. Residual connection result e and the Sobel edge map are used as the inputs of the second FE unit, and the output of the second FE unit is used as the input of the second convolutional layer. The output of the second convolutional layer is residually connected with the output of the first LS module to obtain residual connection result f. Residual connection result f is used as the output of the first EAM module.
6. The multimodal collaborative corrosion detection method for communication towers according to claim 5, characterized in that the working process of the first FE unit is as follows: The Sobel edge map is passed through the third convolutional layer and the LeakyReLU activation function layer to output the edge feature map. A channel attention vector is then computed using global average pooling, global max pooling, a shared multilayer perceptron, and an activation function. The feature map output by the first LS module is multiplied element-wise with the edge feature map and with the channel attention vector respectively, and the two products are added element-wise to obtain the final output of the first FE unit:
$$F_{\mathrm{out}} = F \otimes E + F \otimes M$$
where $F$ is the feature map output by the first LS module, $E$ is the edge feature map, $M$ is the channel attention vector, $F_{\mathrm{out}}$ is the final output of the first FE unit, and $\otimes$ indicates element-wise multiplication.
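As a rough sketch of the FE unit in claim 6, the fusion step can be written in a few lines. The claim text does not make clear which feature map is pooled to produce the channel attention vector, nor the internals of the shared multilayer perceptron; this sketch assumes the edge feature map is pooled, a ReLU hidden layer, and a sigmoid as the activation function. Feature maps are nested lists indexed as [channel][row][column], and all function names are hypothetical.

```python
import math


def shared_mlp(vec, w1, w2):
    """Two-layer perceptron shared by both pooled vectors (ReLU is assumed)."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_attention(feat, w1, w2):
    """One attention value per channel from average- and max-pooled statistics."""
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]
    mx = [max(map(max, ch)) for ch in feat]
    logits = [a + m for a, m in
              zip(shared_mlp(avg, w1, w2), shared_mlp(mx, w1, w2))]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]  # sigmoid (assumed)


def fe_unit(ls_feat, edge_feat, w1, w2):
    """Element-wise F*E + F*M, with M broadcast over each channel."""
    # Assumption: the attention vector is derived from the edge feature map.
    attn = channel_attention(edge_feat, w1, w2)
    return [[[f * e + f * m for f, e in zip(f_row, e_row)]
             for f_row, e_row in zip(f_ch, e_ch)]
            for f_ch, e_ch, m in zip(ls_feat, edge_feat, attn)]


# Two channels, one row, two columns; zero MLP weights make every gate 0.5.
ls_feat = [[[1.0, 2.0]], [[3.0, 4.0]]]
edge_feat = [[[0.0, 1.0]], [[1.0, 0.0]]]
fused = fe_unit(ls_feat, edge_feat, w1=[[0.0, 0.0]], w2=[[0.0], [0.0]])
```

With zero weights the attention term contributes half of each LS feature value, so the effect of the edge term and the attention term can be read off separately in the result.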
7. The multimodal collaborative corrosion detection method for communication towers according to claim 6, characterized in that the encoder includes a first gated cross-attention module, a second gated cross-attention module, a first deformable encoding module, a second deformable encoding module, a third deformable encoding module, and a fourth deformable encoding module, and the working process within the encoder is as follows:
The image modality embedding vector is paired with the embedding vector of each keyframe selected from the video modality, giving a total of five groups of embedding vectors. Each group of embedding vectors is passed through the gated cross-attention layer in the first gated cross-attention module to obtain the output of the gated cross-attention layer.
The calculation process of the gated cross-attention layer within the first gated cross-attention module is as follows: A query vector is generated by linearly mapping the embedding vector of the image modality. For each keyframe selected from the video modality, its embedding vector is linearly mapped to generate the key vector and the value vector corresponding to that keyframe. The gating weight corresponding to each selected keyframe is then calculated from the similarity between the embedding vector of the image modality and the embedding vector of that keyframe, using a learnable weight matrix and the sigmoid function. The output of the gated cross-attention layer within the first gated cross-attention module is obtained from the gating weights and the cross-attention between the image modality and each selected keyframe.
The original image modality embedding vector is added to the output of the gated cross-attention layer in the first gated cross-attention module to obtain the image modality embedding vector after the first fusion of the video modality. This embedding vector is normalized, and the normalized result is passed through the third FFN layer; the output of the third FFN layer is residually connected with the normalized result, and the residual connection result is normalized again.
The result is paired with the embedding vector of each keyframe selected from the video modality, again giving a total of five groups of embedding vectors, and each group is passed through the gated cross-attention layer within the second gated cross-attention module to obtain the output of that layer. A residual connection is then applied to obtain the image modality embedding vector after the second fusion of the video modality. This embedding vector is normalized, and the normalized result is passed through the fourth FFN layer; the output of the fourth FFN layer is residually connected with the normalized result, and the residual connection result is normalized again. Position encoding is applied to the result, and the position encoding result is used as the input to the first deformable encoding module.
Within the first deformable encoding module, the input first passes through a deformable self-attention unit; a residual connection is applied to the output of the deformable self-attention unit, and the residual connection result is normalized. The normalized result is used as the input of the fifth FFN layer; a residual connection is applied to the output of the fifth FFN layer, and the residual connection result is normalized to give the output of the first deformable encoding module.
The output of the first deformable encoding module is used as the input of the second deformable encoding module, the output of the second deformable encoding module is used as the input of the third deformable encoding module, the output of the third deformable encoding module is used as the input of the fourth deformable encoding module, and the output of the fourth deformable encoding module is used as the output of the encoder.
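A small numerical sketch may make the gated cross-attention layer of claim 7 more concrete. The exact gating and aggregation formulas are not recoverable from the text here, so this sketch makes several assumptions: the similarity is the cosine of mean-pooled token embeddings, the learnable weight is reduced to a single scalar `gate_w`, the gate is the sigmoid of the weighted similarity, and the layer output is the gate-weighted sum of the per-keyframe cross-attention results. All function and parameter names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def linear(tokens, weight):
    """Bias-free linear map (weight is out_dim x in_dim) applied per token."""
    return [[dot(row, t) for row in weight] for t in tokens]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(q, k, v):
    """Single-head scaled dot-product attention from image tokens to one keyframe."""
    scale = math.sqrt(len(q[0]))
    out = []
    for qi in q:
        weights = softmax([dot(qi, kj) / scale for kj in k])
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


def mean_cosine(a_tokens, b_tokens):
    """Cosine similarity of mean-pooled embeddings (assumed similarity)."""
    a = [sum(col) / len(a_tokens) for col in zip(*a_tokens)]
    b = [sum(col) / len(b_tokens) for col in zip(*b_tokens)]
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0


def gated_cross_attention(img, frames, wq, wk, wv, gate_w):
    """Return the fused image embedding and the per-keyframe gates."""
    q = linear(img, wq)
    fused = [[0.0] * len(wv) for _ in img]
    gates = []
    for frame in frames:
        k, v = linear(frame, wk), linear(frame, wv)
        gate = 1.0 / (1.0 + math.exp(-gate_w * mean_cosine(img, frame)))
        attended = cross_attention(q, k, v)
        # Assumed aggregation: gate-weighted sum over keyframes.
        fused = [[f + gate * a for f, a in zip(f_row, a_row)]
                 for f_row, a_row in zip(fused, attended)]
        gates.append(gate)
    # Add the original image embedding to the layer output, as in the claim.
    out = [[x + f for x, f in zip(x_row, f_row)] for x_row, f_row in zip(img, fused)]
    return out, gates


identity = [[1.0, 0.0], [0.0, 1.0]]
img = [[1.0, 0.0]]
frames = [[[1.0, 0.0]], [[-1.0, 0.0]]]  # one similar, one dissimilar keyframe
out, gates = gated_cross_attention(img, frames, identity, identity, identity, 10.0)
```

In this toy case the similar keyframe receives a gate close to 1 and the dissimilar keyframe a gate close to 0, which is the filtering behavior the gating is meant to provide. The claim uses five keyframes; two are used here only to keep the example short.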
8. The multimodal collaborative corrosion detection method for communication towers according to claim 7, characterized in that the image modality embedding vector and the embedding vector of each keyframe selected from the video modality are obtained as follows: The features extracted from the image modality by the backbone network are tokenized, and the tokenization result is position-encoded to obtain the image modality embedding vector. For each keyframe selected from the video modality, features are extracted by the backbone network and then tokenized, and the tokenization result of each keyframe is subjected to position encoding and temporal encoding to obtain the embedding vector of that keyframe.
9. The multimodal collaborative corrosion detection method for communication towers according to claim 8, characterized in that the decoder operates as follows:
Step S1: Initialize the target query vector.
Step S2: Initialize the decoding module index.
Step S3: Use the current target query vector as the input of the current decoding module to obtain the target query vector output by that decoding module, where the working process of the decoding module is as follows:
Step S31: Use the target query vector as the input of the self-attention layer, residually connect the output of the self-attention layer with the target query vector, and normalize the residual connection result.
Step S32: Perform position encoding on the normalized result from Step S31.
Step S33: Use the position encoding result from Step S32 and the position-encoded encoder output as inputs to the deformable cross-attention unit; residually connect the output of the deformable cross-attention unit with the normalized result from Step S31; and normalize the residual connection result.
Step S4: Pass the normalized result from Step S33 through the sixth FFN layer, residually connect the output of the sixth FFN layer with the normalized result from Step S33, and normalize the residual connection result to obtain the target query vector output by the current decoding module.
Step S5: Determine whether the termination condition on the decoding module index is met. If it is met, the target query vector output by the current decoding module is used as the output of the decoder; if it is not met, update the decoding module index and return to Step S3.
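The control flow of claim 9 might be sketched as a simple loop over decoding modules. The termination condition in Step S5 is not legible in the text, so this sketch assumes the loop stops after the last module in a list; the sub-layers are passed in as callables, and the names `decoder_module` and `run_decoder` are hypothetical.

```python
def add(x, y):
    """Element-wise sum, standing in for a residual connection."""
    return [a + b for a, b in zip(x, y)]


def decoder_module(query, memory, self_attn, pos_enc, cross_attn, ffn, norm):
    """One decoding module; memory is the position-encoded encoder output."""
    n1 = norm(add(self_attn(query), query))      # Step S31
    p = pos_enc(n1)                              # Step S32
    n2 = norm(add(cross_attn(p, memory), n1))    # Step S33
    return norm(add(ffn(n2), n2))                # Step S4


def run_decoder(query, memory, modules):
    """Steps S2 to S5, assuming termination after the last module."""
    if not modules:
        raise ValueError("at least one decoding module is required")
    index = 0                                    # Step S2
    while True:
        query = modules[index](query, memory)    # Steps S3 and S4
        if index == len(modules) - 1:            # Step S5 (assumed condition)
            return query
        index += 1


# Stand-in sub-layers chosen only so the data flow can be checked by hand.
def toy_module(q, m):
    return decoder_module(
        q, m,
        self_attn=lambda x: [0 for _ in x],
        pos_enc=lambda x: x,
        cross_attn=lambda p, mem: [mem for _ in p],
        ffn=lambda x: list(x),
        norm=lambda x: x,
    )


decoded = run_decoder([1, 2], 1, [toy_module, toy_module])
```

The residual connection in Step S33 uses the normalized result of Step S31 rather than the position-encoded result of Step S32, which the sketch preserves.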
10. The multimodal collaborative corrosion detection method for communication towers according to claim 9, characterized in that the working process of the deformable cross-attention unit is as follows:
Step 1: Combine the position-encoded encoder outputs into a single feature map.
Step 2: Pass each feature vector in the position encoding result of Step S32 through the first linear layer to obtain the reference point position in the feature map corresponding to that feature vector.
Step 3: Process each feature vector in the position encoding result of Step S32 according to its reference point position in the feature map. For any feature vector in the position encoding result of Step S32:
Step 31: Pass the feature vector through the second linear layer, the third linear layer, and the fourth linear layer respectively, each of which predicts three sampling offsets for the feature vector, where each sampling offset comprises an offset direction and an offset magnitude along that direction.
Step 32: Pass the combined feature map from Step 1 through the fifth linear layer, the sixth linear layer, and the seventh linear layer respectively, each of which outputs value features.
Step 33: In the value features output by the fifth linear layer, offset the reference point corresponding to the feature vector according to the three sampling offsets predicted by the second linear layer to obtain the positions of three sampling points; the feature vectors at these three positions in the value features output by the fifth linear layer form the first set of feature vectors. In the value features output by the sixth linear layer, offset the reference point according to the three sampling offsets predicted by the third linear layer to obtain the positions of three sampling points; the feature vectors at these three positions in the value features output by the sixth linear layer form the second set of feature vectors. In the value features output by the seventh linear layer, offset the reference point according to the three sampling offsets predicted by the fourth linear layer to obtain the positions of three sampling points; the feature vectors at these three positions in the value features output by the seventh linear layer form the third set of feature vectors.
Step 34: Pass the feature vector through the eighth linear layer, the ninth linear layer, and the tenth linear layer respectively. The output of the eighth linear layer is passed through the first Softmax activation function layer, which outputs the attention weights of the three feature vectors in the first set. The output of the ninth linear layer is passed through the second Softmax activation function layer, which outputs the attention weights of the three feature vectors in the second set. The output of the tenth linear layer is passed through the third Softmax activation function layer, which outputs the attention weights of the three feature vectors in the third set.
Step 35: Use the attention weights output by the first Softmax activation function layer to perform a weighted summation of the first set of feature vectors; use the attention weights output by the second Softmax activation function layer to perform a weighted summation of the second set of feature vectors; and use the attention weights output by the third Softmax activation function layer to perform a weighted summation of the third set of feature vectors.
Step 36: Aggregate the three weighted summation results, and pass the aggregated result through the eleventh linear layer.
Step 37: Process each feature vector in the position encoding result according to Steps 31 to 36, and use the output of the eleventh linear layer as the output of the deformable cross-attention unit.
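The sampling and aggregation in Steps 33 to 36 of claim 10 can be illustrated for a single query vector. In this sketch the offsets and attention logits are passed in directly, standing in for the outputs of the second to fourth and eighth to tenth linear layers. Several details are assumptions not stated in the claim: sampling positions are rounded to the nearest grid cell and clamped to the map (bilinear interpolation would be another option), the three weighted sums are aggregated by concatenation, and all names are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample(value_map, x, y):
    """Nearest-cell lookup with clamping at the borders (assumed scheme)."""
    height, width = len(value_map), len(value_map[0])
    xi = min(max(int(round(x)), 0), width - 1)
    yi = min(max(int(round(y)), 0), height - 1)
    return value_map[yi][xi]


def deformable_cross_attention(ref, offsets, logits, value_maps, w_out):
    """ref: (x, y); offsets/logits/value_maps: one entry per branch."""
    branch_outputs = []
    for branch_offsets, branch_logits, value_map in zip(offsets, logits, value_maps):
        weights = softmax(branch_logits)                        # Step 34
        samples = [sample(value_map, ref[0] + dx, ref[1] + dy)  # Step 33
                   for dx, dy in branch_offsets]
        dim = len(samples[0])
        # Step 35: weighted sum of the three sampled feature vectors.
        branch_outputs.extend(
            sum(w * s[c] for w, s in zip(weights, samples)) for c in range(dim))
    # Step 36: aggregate (concatenation assumed), then the final linear layer.
    return [sum(w * h for w, h in zip(row, branch_outputs)) for row in w_out]


# 3x3 value map whose single-channel entry encodes its own position as 10*y + x.
grid = [[[float(10 * y + x)] for x in range(3)] for y in range(3)]
result = deformable_cross_attention(
    ref=(1.0, 1.0),
    offsets=[[(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
             [(-1.0, -1.0)] * 3,
             [(5.0, 5.0)] * 3],          # far outside the map, so clamped
    logits=[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 2.0, 3.0]],
    value_maps=[grid, grid, grid],
    w_out=[[1.0, 1.0, 1.0]],
)
```

Because each query attends to only three sampled locations per branch rather than to every location in the feature map, the cost per query is independent of the feature map size, which is the usual motivation for deformable attention.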