Transformer-based dual-branch photovoltaic panel fault identification method

By constructing a GAN-VAE collaborative generation model and a dual-branch feature fusion module, the problems of multi-scale feature extraction and class imbalance in photovoltaic panel fault identification are solved, realizing high-precision identification and real-time detection of photovoltaic panel faults, and supporting intelligent operation and maintenance of photovoltaic power plants.

CN121640174B · Active · Publication Date: 2026-06-30 · HOHAI UNIV +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: HOHAI UNIV
Filing Date: 2025-12-10
Publication Date: 2026-06-30

IPC: G06V10/764; G06V10/82; G06V10/774; G06V10/44; G06V10/424; G06V10/52; G06V10/54; G06V10/77; G06V10/80; G06N3/045; G06N3/0455; G06N3/0464; G06N3/0475; G06N3/048; G06N3/094

Technology Topics: Algorithm; Data pre-processing

AI Technical Summary

Technical Problem

Existing photovoltaic panel fault identification methods have shortcomings in multi-scale feature extraction, global-local collaborative modeling, class imbalance handling, and sample quality control, making it difficult to achieve high accuracy, high robustness, and high generalization ability.

Method used

A Transformer-based dual-branch photovoltaic panel fault identification method is adopted. High-quality enhanced samples are generated by constructing a GAN-VAE collaborative generation model. The improved CNN feature extraction module and the Swin Transformer encoder are combined to extract and fuse features, realizing deep interactive modeling of local texture and global semantics, and solving the class imbalance problem.

Benefits of technology

It significantly improves the accuracy of photovoltaic panel fault identification, and can accurately classify various faults such as hot spots, junction box overheating, dust obstruction, and micro-cracks, meeting real-time detection needs and providing efficient and reliable intelligent operation and maintenance support for photovoltaic power plants.

Abstract

This invention proposes a Transformer-based dual-branch photovoltaic (PV) panel fault identification method, comprising: S1: obtaining raw PV panel data and preprocessing it to obtain a sample set; S2: constructing a GAN-VAE collaborative generation model to generate high-quality enhanced samples, then combining the raw PV panel data from S1 with the high-quality enhanced samples and dividing them into training and test sets; S3: constructing a dual-branch PV panel fault identification model, training it on the training set to obtain a trained model, and testing it on the test set; S4: using the trained model for fault identification. Through a bidirectional cross-attention mechanism, a GAN-VAE collaborative generation model, and element-level interaction plus attention weighting, this invention accurately classifies faults such as hotspots, junction box overheating, dust obstruction, and micro-cracks in PV panel infrared images.

Description

Technical Field

[0001] This invention belongs to the technical field of photovoltaic panel fault identification methods, specifically relating to a Transformer-based dual-branch photovoltaic panel fault identification method.

Background Technology

[0002] Photovoltaic panel fault identification is a key technology to ensure the safe and efficient operation of photovoltaic power plants and improve power generation revenue. It is of great significance in new energy monitoring, operation and maintenance decision-making and defect early warning.

[0003] Fault identification methods based on convolutional neural networks (CNNs) are widely used in infrared image analysis to automatically locate anomalies such as hotspots and cracks through multi-layer feature extraction. However, the inherent limitation of the local receptive field of CNNs makes it difficult to model array-level global dependencies, resulting in insufficient representation capabilities for faults and long-range anomaly patterns.

[0004] To overcome the limitations of CNNs, some studies have introduced the Transformer architecture, whose self-attention mechanism captures global contextual information and significantly improves the understanding of complex fault modes. However, although Transformers excel at sequence modeling, they perceive local texture details (such as cracks or junction box faults) poorly, and their computational complexity grows quadratically with input size, so they face large parameter counts and training difficulties when processing high-resolution infrared images. Furthermore, existing dual-branch (CNN+Transformer) fusion methods often employ simple concatenation or additive fusion and lack deep interactive modeling of local and global features, making it difficult to fully exploit the complementary advantages of the two branches.

[0005] More importantly, traditional data augmentation (such as rotation and flipping) cannot generate semantically novel fault variants, while sample augmentation methods based on generative adversarial networks (GANs) can increase diversity, but the generated images often have blurriness, artifacts or structural distortions, making them difficult to use directly for high-precision recognition tasks.

[0006] In existing technologies, solutions to class imbalance mostly focus on loss function weighting or oversampling strategies. However, the former easily degrades majority-class performance, while the latter is limited by the number of real samples and cannot fundamentally solve the problem of small-sample learning. In recent years, some studies have attempted to combine variational autoencoders (VAEs) and GANs for sample generation, but they mostly adopt joint training architectures, which suffer from gradient conflicts and mode collapse, so the quality of the generated samples is unstable and fails to meet requirements. In summary, existing photovoltaic panel fault identification methods still have significant shortcomings in multi-scale feature extraction, global-local collaborative modeling, class imbalance handling, and sample quality control, making it difficult to simultaneously achieve high accuracy, high robustness, and high generalization ability. Therefore, this invention proposes a Transformer-based dual-branch photovoltaic panel fault identification method.

Summary of the Invention

[0007] To address the problems existing in the prior art, this invention proposes a Transformer-based dual-branch photovoltaic panel fault identification method.

[0008] To achieve the above objective, the present invention adopts the following technical solution:

[0009] A Transformer-based dual-branch photovoltaic panel fault identification method includes the following steps:

[0010] S1: Obtain raw photovoltaic panel data and perform data preprocessing to obtain a sample set after preprocessing;

[0011] S2: Based on the sample set, construct a GAN-VAE collaborative generation model to generate high-quality enhanced samples; then combine the original photovoltaic panel data from S1 with the high-quality enhanced samples, and divide them into training and test sets; specifically, this includes the following steps:

[0012] S2.1: Independently train a generative adversarial network (GAN);

[0013] S2.2: Independently train a variational autoencoder (VAE);

[0014] S2.3: Using the pre-trained VAE, refine the texture, sharpen the edges, and enhance the hotspot contours of the coarse samples generated by the GAN, outputting high-quality enhanced samples $X_{aug}$ with realistic details and faithful structure;

[0015] S2.4: Combine the detail-enhanced samples $X_{aug}$ with the original photovoltaic panel data and divide them into training and test sets.

[0016] S3: Construct a dual-branch photovoltaic panel fault identification model and train it on the training set to obtain a trained dual-branch photovoltaic panel fault identification model; specifically including the following steps:

[0017] S3.1: Improved CNN feature extraction module: feed the input feature map into the improved CNN feature extraction module and output the multi-scale enhanced features $X_{out}$;

[0018] S3.2: Transformer feature extraction module: use a Swin Transformer encoder to extract global contextual features from the input feature map and output a global representation sequence $Y$;

[0019] S3.3: From the global representation sequence $Y$ and the multi-scale enhanced features $X_{out}$, generate adaptively enhanced feature sequences and element-level interaction features, concatenate them, and feed the concatenated features into the depthwise separable fusion network to obtain the joint representation $F_{fuse}$, in which local texture and global semantics are tightly coupled;

[0020] S3.4: Repeat S3.1-S3.3 multiple times. In each iteration, the joint representation $F_{fuse}$ from the previous layer replaces the input feature map as the input to S3.2, which outputs a new global representation sequence $Y$; the $X_{out}$ from the previous layer is fed into S3.1 to obtain new multi-scale enhanced features $X_{out}$; and S3.3 then yields a new joint representation $F_{fuse}$. After multiple iterations, the final joint representation $F_{fuse}$ is output;

[0021] S3.5: Feed the final $F_{fuse}$ obtained through iteration into the classification head to identify the fault type;

[0022] S3.6: Using the training set, iteratively optimize the classification head with the loss function;

[0023] S3.7: Test the dual-branch photovoltaic panel fault identification model on the test set;

[0024] S4: The trained dual-branch photovoltaic panel fault identification model can accurately classify multiple faults such as hot spots, junction box overheating, dust obstruction, and micro-cracks from the infrared images of the photovoltaic panel.

[0025] Furthermore, in step S2.1,

[0026] Establish the loss functions for the generator and discriminator:

[0027]

[0028] In the formula, $L_D$ denotes the loss function of the discriminator $D$; $D(x)$ denotes the probability that the discriminator $D$ assigns to a real sample; $\mathbb{E}_{x \sim D_{real}}$ denotes the expectation as $x$ traverses the sample set in S1; $z \sim N(0,1)$ denotes random noise drawn from the standard normal distribution; $G(z)$ denotes the fault sample output by the generator $G$ for the random-noise input $z$; $D(G(z))$ denotes the probability that the discriminator $D$ assigns to the generated fault sample; $\mathbb{E}_{z \sim N(0,1)}$ denotes the expectation over the random noise $z$; $L_G$ denotes the loss function of the generator $G$. The generator $G$ is trained on the sample set in S1 and frozen after convergence; the frozen generator is denoted $G_{best}$.

[0029] Furthermore, in step S2.2,

[0030] Establish the loss function for the variational autoencoder (VAE):

[0031]

[0032] In the formula, $L_{VAE}$ denotes the loss function of the variational autoencoder (VAE); $Z_a$ denotes the latent vector;

[0033] the remaining symbols denote, in order: the Gaussian posterior distribution output by the VAE; the generation probability of the VAE's decoder; the expected probability of reconstructing the original sample, i.e., the reconstruction term; the KL annealing coefficient; and the KL divergence, i.e., the KL regularization term; where,

[0034]

[0035] In the formula, $\mu$ denotes the mean of the Gaussian posterior distribution; $\sigma^2$ denotes the variance of the Gaussian posterior distribution; $N(\mu, \sigma^2)$ denotes the Gaussian distribution with mean $\mu$ and variance $\sigma^2$; the summation runs over all latent dimensions of the VAE; and the last symbol denotes the latent variables of the VAE. After training to convergence on the sample set in S1, a frozen encoder and a frozen decoder are obtained, denoted $Enc_{best}$ and $Dec_{best}$, respectively. The frozen generator $G_{best}$, the frozen encoder $Enc_{best}$, and the frozen decoder $Dec_{best}$ together constitute the trained GAN-VAE collaborative generation model.

[0036] Furthermore, step S2.3 includes the following steps:

[0037] S2.3.1: Sample random noise $z_1 \sim N(0,1)$ for the trained generative adversarial network (GAN);

[0038] S2.3.2: Use the frozen generator $G_{best}$ to output an initial synthetic sample;

[0039] S2.3.3: Input the initial synthetic sample into the frozen encoder $Enc_{best}$ to extract its latent representation, i.e.

[0040] ;

[0041] S2.3.4: Reconstruct with the frozen decoder $Dec_{best}$ to obtain the detail-enhanced sample, i.e.

[0042] where $X_{aug}$ denotes the detail-enhanced sample and $\odot$ denotes element-wise multiplication.

[0043] Furthermore, step S3.1 includes the following steps:

[0044] S3.1.1: Divide the input feature map of the training set along the channel dimension into four sub-features $X_g$, where $g \in [1,4]$ is an integer; $B, C, H, W$ denote batch size, number of channels, height, and width, respectively.

[0045] S3.1.2: Apply depthwise separable convolutions with different receptive fields to each sub-feature $X_g$ to generate the multi-scale local features $B_g$; the formula is:

[0046]

[0047] In the formula, $Conv_{1\times1}$ denotes a 1×1 standard convolution; $DWConv_{k\times k}$ denotes a depthwise separable convolution with kernel size k×k and dilation rate $d$; $B_g$ denotes the multi-scale local features extracted by the g-th branch;

[0048] S3.1.3: Apply an independent channel attention mechanism to the multi-scale local features $B_g$ extracted by each branch to generate channel-weighted features; the formula is:

[0049]

[0050] In the formula, AvgPool and MaxPool denote global average pooling and global max pooling, respectively; MLP is a shared two-layer 1×1 convolutional network; $\sigma_s$ denotes the Sigmoid activation function; of the two remaining symbols, the first denotes the channel attention map of the g-th branch and the second denotes the features after channel-attention enhancement, which highlight the key channel responses related to photovoltaic panel faults;

[0051] S3.1.4: Concatenate the four sets of channel-weighted features to restore the full channel dimension and obtain the fused features $F$;

[0052]

[0053] In the formula, Concat denotes concatenation of the channel-weighted features;

[0054] S3.1.5: Apply a spatial attention mechanism to the fused features $F$ to generate the multi-scale enhanced features $X_{out}$; the formula is:

[0055]

[0056] In the formula, $AvgPool_{channel}$ denotes channel-wise average pooling; $MaxPool_{channel}$ denotes channel-wise max pooling; $M$ denotes the aggregated feature map; $A_s$ denotes the spatial attention map, which focuses on the spatial locations of fault regions in infrared images of photovoltaic panels; $X_{out}$ denotes the final multi-scale enhanced features; $Conv_{7\times7}$ denotes a 7×7 standard convolution; $[\,\cdot\,]$ denotes concatenation along the channel dimension.
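The formulas for S3.1.3 and S3.1.5 are not reproduced in the text above. The following NumPy sketch assumes the widely used CBAM-style form, which matches the symbol definitions but is not confirmed by the source: channel attention $\sigma_s(MLP(AvgPool(B_g)) + MLP(MaxPool(B_g)))$ multiplied onto $B_g$, and spatial attention $A_s = \sigma_s(Conv_{7\times7}([AvgPool_{channel}(F); MaxPool_{channel}(F)]))$ with $X_{out} = A_s \odot F$. The function names, the random weights, the ReLU between the two MLP layers, and the omission of the depthwise separable convolutions of S3.1.2 are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(b, w1, w2):
    """b: (C, H, W). w1 (C_r, C) and w2 (C, C_r) play the shared two-layer 1x1-conv MLP."""
    avg = b.mean(axis=(1, 2))                      # global average pooling -> (C,)
    mx = b.max(axis=(1, 2))                        # global max pooling -> (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)   # ReLU in between is an assumption
    a = sigmoid(mlp(avg) + mlp(mx))                # channel attention map, (C,)
    return a[:, None, None] * b                    # channel-weighted features

def spatial_attention(f, kernel):
    """f: (C, H, W). kernel: (2, 7, 7) weights of the assumed 7x7 convolution."""
    m = np.stack([f.mean(axis=0), f.max(axis=0)])  # aggregated map M, (2, H, W)
    padded = np.pad(m, ((0, 0), (3, 3), (3, 3)))   # zero padding keeps H x W
    h, w = f.shape[1:]
    conv = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            conv[i, j] = np.sum(padded[:, i:i + 7, j:j + 7] * kernel)
    a_s = sigmoid(conv)                            # spatial attention map A_s
    return a_s[None] * f                           # X_out = A_s * F (element-wise)

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8, 8))                    # one feature map, C=16
groups = np.split(x, 4, axis=0)                    # S3.1.1: four sub-features
weighted = [
    channel_attention(g, rng.normal(size=(2, 4)), rng.normal(size=(4, 2)))
    for g in groups
]
fused = np.concatenate(weighted, axis=0)           # S3.1.4: restore channel dimension
x_out = spatial_attention(fused, rng.normal(size=(2, 7, 7)) * 0.1)
```

Because both attention maps lie in (0, 1), each stage can only attenuate a response, never amplify it; whether the patented module adds a residual path is not stated in the text.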

[0057] Further, step S3.2 includes the following steps:

[0058] S3.2.1: The Patch Embedding module divides the input feature map into several non-overlapping 4×4 image patches and maps each image patch into a high-dimensional embedding vector;

[0059] S3.2.2: Input the embedding vectors into the Swin Transformer encoder and implement global context modeling through multiple Swin Transformer Blocks. In the l-th Swin Transformer Block, the features are updated using a shifted-window self-attention mechanism, with the following formula:

[0060]

[0061] In the formula, LN denotes layer normalization; SW-MSA denotes shifted-window multi-head self-attention; MLP denotes a two-layer fully connected network; the three feature-sequence symbols denote, in order, the input feature sequence of the l-th layer (the output of layer l-1), the feature sequence obtained after that input has undergone attention interaction in the l-th layer, and the output feature sequence of the l-th layer;

[0062] Odd-numbered Swin Transformer Blocks use SW-MSA to enhance cross-window information interaction;

[0063] Even-numbered Swin Transformer Blocks use the standard window multi-head self-attention mechanism, which keeps the windows fixed and performs no shift operation.

[0064] Both odd-numbered and even-numbered layers are calculated using the attention calculation formula, namely:

[0065]

[0066] In the formula, $Q, K, V$ denote the query, key, and value matrices, respectively; $d_h$ denotes the dimension per head; $B$ denotes the learnable relative position bias, which characterizes the relative positional relationship between pixels within the window; Softmax denotes the Softmax activation function;
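The attention formula is not reproduced in the text above. Given the definitions of $Q$, $K$, $V$, $d_h$, and $B$, a plausible reading is the standard Swin Transformer form $Attention(Q,K,V) = Softmax(QK^{T}/\sqrt{d_h} + B)V$. The sketch below assumes that form for a single window and a single head; the window size, head dimension, and random inputs are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(q, k, v, bias):
    """q, k, v: (N, d_h) tokens of one window and one head; bias: (N, N) position term B."""
    d_h = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_h) + bias   # scaled dot product plus position bias
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
n_tokens, d_h = 16, 8                        # hypothetical 4x4 window, head dimension 8
q, k, v = (rng.normal(size=(n_tokens, d_h)) for _ in range(3))
bias = rng.normal(size=(n_tokens, n_tokens)) * 0.1
out, weights = window_attention(q, k, v, bias)
```

The shifted and non-shifted blocks would share this computation and differ only in how the tokens are grouped into windows before it is applied.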

[0067] S3.2.3: After several Swin Transformer Blocks, insert a Patch Merging layer. For the feature sequence of the l-th layer, adjacent 2×2 blocks are concatenated and reduced in dimension by a linear projection, so that the spatial resolution of the feature map is halved and the number of channels is doubled; the result replaces the feature sequence of the l-th layer. The formula is:

[0068]

[0069] In the formula, the left-hand side denotes the output feature sequence of layer l+1, i.e., the result of the Patch Merging operation; the outer operator denotes the linear transformation; Concat denotes concatenation along the channel dimension; the four concatenated terms are the features of the l-th layer at the four pixels of an adjacent 2×2 block, where $(2i,2j)$, $(2i,2j+1)$, $(2i+1,2j)$, and $(2i+1,2j+1)$ are the coordinates of the corresponding pixel positions;

[0070] S3.2.4: The updated feature sequence from the last layer in S3.2.3 is aggregated using global average pooling to generate the global representation sequence $Y$.
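A minimal sketch of the Patch Merging step of S3.2.3 followed by the pooling of S3.2.4 is given below. It assumes a channels-last layout, even height and width, and a plain matrix as the linear projection; the normalization layer that published Swin Transformer implementations place before the projection is omitted, and all shapes and weights are hypothetical.

```python
import numpy as np

def patch_merging(x, weight):
    """x: (H, W, C) with even H and W; weight: (4C, 2C) linear projection."""
    x00 = x[0::2, 0::2]   # pixels at (2i, 2j)
    x01 = x[0::2, 1::2]   # pixels at (2i, 2j+1)
    x10 = x[1::2, 0::2]   # pixels at (2i+1, 2j)
    x11 = x[1::2, 1::2]   # pixels at (2i+1, 2j+1)
    merged = np.concatenate([x00, x01, x10, x11], axis=-1)   # (H/2, W/2, 4C)
    return merged @ weight                                    # (H/2, W/2, 2C)

def global_average_pool(x):
    """Average over both spatial axes, leaving one value per channel."""
    return x.mean(axis=(0, 1))

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 8, 6))     # hypothetical 8x8 map with 6 channels
w = rng.normal(size=(24, 12))         # 4C -> 2C
merged = patch_merging(feat, w)
y = global_average_pool(merged)
```

Halving the resolution while doubling the channels keeps later blocks cheaper, since window attention cost scales with the number of tokens.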

[0071] Furthermore, step S3.3 includes the following steps:

[0072] S3.3.1: Project the multi-scale enhanced features $X_{out}$ to the same channel dimension as $Y$; the formula is:

[0073]

[0074] In the formula, $X'$ denotes the projected feature sequence; $Conv_{1\times1}$ denotes a 1×1 standard convolution; $Y'$ denotes the reshaped feature sequence; Reshape denotes a reshaping operation;

[0075] S3.3.2: Apply global average pooling followed by adaptive enhancement to obtain expressive feature sequences; the formula is:

[0076]

[0077] In the formula, Linear denotes fully connected layer processing; Expand denotes broadcast expansion; the two remaining symbols denote the two feature sequences after adaptive enhancement;

[0078] S3.3.3: To achieve cross-branch information interaction, a bidirectional windowed cross-attention mechanism is introduced with window size M=4; both adaptively enhanced feature sequences are partitioned into local windows and padded to an integer multiple of the window size;

[0079]

[0080] In the formula, $H_p$ denotes the height after partitioning; $H_b$ denotes the height before partitioning; $W_p$ denotes the width after partitioning; $W_b$ denotes the width before partitioning;
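The padding formula is not reproduced in the text above. Padding to an integer multiple of the window size is usually written $H_p = \lceil H_b / M \rceil \cdot M$ and $W_p = \lceil W_b / M \rceil \cdot M$; the sketch below assumes exactly that, and the helper names are hypothetical.

```python
import math

def padded_size(h_b, w_b, m=4):
    """Smallest height and width that are multiples of the window size m."""
    h_p = math.ceil(h_b / m) * m
    w_p = math.ceil(w_b / m) * m
    return h_p, w_p

def num_windows(h_b, w_b, m=4):
    """Number of non-overlapping m x m windows after padding."""
    h_p, w_p = padded_size(h_b, w_b, m)
    return (h_p // m) * (w_p // m)
```

For example, a 14×14 feature map with M=4 would be padded to 16×16 and split into 16 windows, while a 28×28 map needs no padding.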

[0081] S3.3.4: Generate query, key, and value features respectively; the formula is:

[0082]

[0083] In the formula, DSC denotes a depthwise separable convolution module, used for efficient extraction of local context; $DSC_Q$, $DSC_K$, and $DSC_V$ denote the depthwise separable convolution modules that produce the query, key, and value, respectively; $Q_Y$, $K_Y$, and $V_Y$ denote the query, key, and value features obtained from the adaptively enhanced feature sequence of the Transformer branch; $Q_X$, $K_X$, and $V_X$ denote the query, key, and value features obtained from the adaptively enhanced feature sequence of the CNN branch;

[0084] S3.3.5: Perform bidirectional cross-attention; the formula is:

[0085]

[0086] In the formula, the scaling term is the dimension per head; Softmax denotes the Softmax activation function; the two left-hand sides denote the attention output from the Transformer branch to the CNN branch and the attention output from the CNN branch to the Transformer branch, respectively;
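The cross-attention formula is not reproduced in the text above. One common reading, assumed here and not confirmed by the source, is that each branch queries the other: the Transformer-to-CNN output uses $Q_X$ with $K_Y$, $V_Y$, and the CNN-to-Transformer output uses $Q_Y$ with $K_X$, $V_X$, both with scaled dot-product attention inside one window. Token counts, head dimension, and inputs are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    """Scaled dot-product attention where q comes from one branch and k, v from the other."""
    d_h = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d_h)) @ v

def bidirectional_cross_attention(q_x, k_x, v_x, q_y, k_y, v_y):
    # Assumed pairing: CNN queries read Transformer keys/values, and vice versa.
    o_y_to_x = cross_attention(q_x, k_y, v_y)   # Transformer branch -> CNN branch
    o_x_to_y = cross_attention(q_y, k_x, v_x)   # CNN branch -> Transformer branch
    return o_y_to_x, o_x_to_y

rng = np.random.default_rng(0)
n, d_h = 16, 8                                   # one 4x4 window, head dimension 8
q_x, k_x, v_x, q_y, k_y, v_y = (rng.normal(size=(n, d_h)) for _ in range(6))
o_y_to_x, o_x_to_y = bidirectional_cross_attention(q_x, k_x, v_x, q_y, k_y, v_y)
```

Each output row is a convex combination of the other branch's value rows, which is what lets local texture tokens borrow global context and the reverse.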

[0087] S3.3.6: After projection, the attention outputs are restored to their spatial structure and the padding is removed, using the following formula:

[0088]

[0089] In the formula, the two left-hand sides denote the output after attention interaction from the Transformer branch to the CNN branch and the output after attention interaction from the CNN branch to the Transformer branch, respectively; DSC denotes depthwise separable convolution;

[0090] S3.3.7: Compute spatial element-level interaction features;

[0091]

[0092] S3.3.8: Concatenate the bidirectional attention output with the interaction features, and then reduce the dimensionality using a 1×1 convolution:

[0093]

[0094] S3.3.9: Concatenate the four path features and feed them into the multi-layer depthwise separable fusion network:

[0095]

[0096] In the formula, FusionNet consists of the following sequence: IDSC→BN→GELU→DSC→BN→GELU→$Conv_{1\times1}$→BN→GELU. IDSC is a pointwise convolution followed by a depthwise convolution, while DSC is a depthwise convolution followed by a pointwise convolution; IDSC represents a spatial convolution operation with dynamic parameters; BN denotes batch normalization; GELU denotes the GELU activation function.

[0097] The beneficial effects that can be achieved by adopting the above-mentioned technology are:

[0098] 1. This invention captures the local texture features and the global semantic features of infrared images separately through a dual-branch recognition model and fuses them through deep interaction using a bidirectional cross-attention mechanism. It combines the CNN branch's four-branch depthwise separable convolutions with channel and spatial attention, which accurately perceive fine-grained details such as hot spots and junction box faults, with the Swin Transformer encoder's strong modeling of long-range dependencies such as open circuits and short circuits, significantly improving the recognition accuracy of component anomalies.

[0099] 2. This invention proposes a GAN-VAE collaborative generation model to address dataset imbalance. In the first stage, the GAN-VAE collaborative generation model utilizes independently trained GAN generators to generate semantically novel fault variants. In the second stage, independently trained VAE decoders perform detail reconstruction and texture enhancement, generating samples with realistic details and structural fidelity. After screening, these samples are evenly injected into the preprocessed dataset according to class, constructing a balanced dataset and effectively alleviating the problem of small sample class imbalance. Furthermore, the decoupled generation process has minimal impact on model inference speed, meeting the requirements of real-time detection.

[0100] 3. The dual-branch feature fusion module of this invention utilizes a bidirectional cross-attention mechanism and multi-path context modeling to perform element-level interaction and attention weighting between the multi-scale local texture features output by the CNN and the global sequence features output by the Transformer. This promotes deep information exchange between the two branches, enabling the model to effectively leverage the semantic diversity and detail fidelity of the enhanced dataset to achieve accurate end-to-end fault classification. The final recognition model, trained on the constructed balanced dataset, accurately classifies various faults such as hotspots, junction box overheating, dust obstruction, and micro-cracks in photovoltaic panel infrared images, significantly outperforming existing single-branch or shallow fusion methods. This provides efficient and reliable technical support for the intelligent operation and maintenance of large-scale photovoltaic power plants.

Description of the Drawings

[0101] Figure 1 shows the overall implementation process of the Transformer-based dual-branch photovoltaic panel fault identification method;

[0102] Figure 2 shows the specific structure of the GAN-VAE collaborative generation model;

[0103] Figure 3 shows samples generated by the GAN-VAE collaborative generation model, where (A), (B), and (C) are samples under short-circuit, dust band, and open-circuit conditions, respectively;

[0104] Figure 4 is a detailed module diagram of the dual-branch photovoltaic panel fault identification method provided in this embodiment of the invention;

[0105] Figure 5 shows radar charts comparing the models (Vision Transformer, Swin Transformer, ResNet50, CoatNet, and the proposed model) on Dataset 1 across six categories, where (a), (b), (c), and (d) compare accuracy, recall, F1 score, and false positive rate (FPR), respectively;

[0106] Figure 6 shows the precision-recall (PR) curves of each model;

[0107] Figure 7 shows radar charts comparing the models on Dataset 2 across six categories, where (a), (b), (c), and (d) compare accuracy, recall, F1 score, and false positive rate (FPR), respectively.

Detailed Implementation

[0108] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0109] As shown in Figure 1, the Transformer-based dual-branch photovoltaic panel fault identification method includes the following steps:

[0110] S1: Obtain photovoltaic panel data and perform data preprocessing. This includes the following steps:

[0111] S1.1: Acquire photovoltaic panel data, crop it, and categorize it according to fault type;

[0112] S1.2: In each data category, the cropped data is color-mapped, transforming a single-band grayscale image into a three-channel heatmap; the formula is:

[0113]

[0114] In the formula, $t = I_{norm}(i,j)$ denotes the normalized gray value at pixel $(i,j)$ of the original data before mapping; $I_{RGB}(i,j)$ denotes the three-channel heatmap; $f_R(t)$ denotes the red-channel heatmap; $f_G(t)$ denotes the green-channel heatmap; $f_B(t)$ denotes the blue-channel heatmap.

[0115] S1.3: First, resample each type of data to a preset size (i.e., 112×112 pixels), and then aggregate them to form a sample set.
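The colour mapping of S1.2 can be pictured as $I_{RGB}(i,j) = (f_R(t), f_G(t), f_B(t))$ with $t = I_{norm}(i,j)$. The source does not give the channel functions, so the piecewise-linear ramps below (a jet-like blue-to-red palette) and the min-max normalization are purely hypothetical stand-ins chosen to make the sketch runnable.

```python
import numpy as np

def normalize(gray):
    """Min-max normalize a single-band image to [0, 1] (normalization choice is assumed)."""
    g = gray.astype(np.float64)
    span = g.max() - g.min()
    return (g - g.min()) / span if span > 0 else np.zeros_like(g)

# Hypothetical channel ramps: blue peaks for cold pixels, red for hot pixels.
def f_r(t):
    return np.clip(1.5 - np.abs(4.0 * t - 3.0), 0.0, 1.0)

def f_g(t):
    return np.clip(1.5 - np.abs(4.0 * t - 2.0), 0.0, 1.0)

def f_b(t):
    return np.clip(1.5 - np.abs(4.0 * t - 1.0), 0.0, 1.0)

def to_heatmap(gray):
    """Map an (H, W) grayscale image to an (H, W, 3) heatmap with values in [0, 1]."""
    t = normalize(gray)
    return np.stack([f_r(t), f_g(t), f_b(t)], axis=-1)

gray = np.arange(16, dtype=np.uint8).reshape(4, 4)   # toy 4x4 thermal image
rgb = to_heatmap(gray)
```

In practice the resampling to 112×112 pixels in S1.3 would follow this mapping; that step is left out here because the source does not name an interpolation method.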

[0116] S2: Construct a GAN-VAE collaborative generation model, employing a two-stage, decoupled collaborative mechanism to generate high-quality new samples. Specifically, this includes the following steps:

[0117] S2.1: Independently train a generative adversarial network (GAN);

[0118] Instead of the traditional cross-entropy loss, the least-squares (MSE) loss is used to train the generator $G$ and the discriminator $D$ on the sample set in S1, which improves the clarity of the generated samples and the stability of training. The loss functions of the generator $G$ and the discriminator $D$ are:

[0119]

[0120] In the formula, $L_D$ denotes the loss function of the discriminator $D$, which measures the discriminator's ability to distinguish real samples from generated samples; $x \sim D_{real}$ indicates that the training samples come from the sample set in S1; $D(x)$ denotes the probability that the discriminator $D$ assigns to a real sample; $\mathbb{E}$ denotes the mathematical expectation, so $\mathbb{E}_{x \sim D_{real}}$ is the expectation as $x$ traverses the sample set in S1; $z \sim N(0,1)$ denotes random noise drawn from the standard normal distribution; $G(z)$ denotes the fault sample output by the generator $G$ for the random-noise input $z$; $D(G(z))$ denotes the probability that the discriminator $D$ assigns to the generated fault sample; $\mathbb{E}_{z \sim N(0,1)}$ is the expectation over the random noise $z$; $L_G$ denotes the loss function of the generator $G$, which measures the generator's ability to produce realistic samples that fool the discriminator.

[0121] The generator $G$ is trained on the sample set in S1 and frozen after convergence; the frozen generator is denoted $G_{best}$.
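The loss formulas are not reproduced in the text above. Since [0118] specifies a least-squares loss, the sketch below assumes the common LSGAN convention with target 1 for real samples and 0 for generated ones: $L_D = \tfrac{1}{2}\mathbb{E}_x[(D(x)-1)^2] + \tfrac{1}{2}\mathbb{E}_z[D(G(z))^2]$ and $L_G = \tfrac{1}{2}\mathbb{E}_z[(D(G(z))-1)^2]$. The factor 1/2, the targets, and the function names are assumptions; expectations are replaced by batch means over lists of discriminator outputs.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)

def discriminator_loss(d_real, d_fake):
    """d_real: D(x) on real samples; d_fake: D(G(z)) on generated samples."""
    real_term = 0.5 * mean((d - 1.0) ** 2 for d in d_real)   # push D(x) toward 1
    fake_term = 0.5 * mean(d ** 2 for d in d_fake)           # push D(G(z)) toward 0
    return real_term + fake_term

def generator_loss(d_fake):
    """The generator wants the discriminator to score its samples as real (1)."""
    return 0.5 * mean((d - 1.0) ** 2 for d in d_fake)
```

Unlike the cross-entropy loss, the squared error still gives a gradient to generated samples that the discriminator already classifies confidently, which is the usual argument for its training stability.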

[0122] S2.2: Independently train a variational autoencoder (VAE).

[0123] Specifically, a variational autoencoder (VAE) (including encoder and decoder) is trained on the sample set in S1. The loss function of the VAE consists of a reconstruction term and a regularization term, and is expressed as follows:

[0124]

[0125] In the formula, $L_{VAE}$ denotes the loss function of the variational autoencoder (VAE); $Z_a$ denotes the latent variables; the remaining symbols denote, in order: the Gaussian posterior distribution output by the VAE; the generation probability of the VAE's decoder; the expected probability of reconstructing the original sample, i.e., the reconstruction term; the KL annealing coefficient; and the KL divergence, i.e., the KL regularization term; $x$ denotes a sample from the sample set;

[0126] where,

[0127] In the formula, $\mu$ denotes the mean of the Gaussian posterior distribution; $\sigma^2$ denotes the variance of the Gaussian posterior distribution; $N(\mu, \sigma^2)$ denotes the Gaussian distribution with mean $\mu$ and variance $\sigma^2$; the summation runs over all latent dimensions of the VAE; and the last symbol denotes the latent variables of the VAE.

[0128] After training to convergence on the sample set in S1, a frozen encoder and a frozen decoder are obtained, denoted $Enc_{best}$ and $Dec_{best}$, respectively;

[0129] The frozen generator $G_{best}$, the frozen encoder $Enc_{best}$, and the frozen decoder $Dec_{best}$ together constitute the trained GAN-VAE collaborative generation model;
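The VAE loss is described as a reconstruction term plus a KL regularization term weighted by an annealing coefficient, but the formula is not reproduced in the text above. The sketch below assumes the standard closed form of the KL divergence between $N(\mu, \sigma^2)$ and $N(0, 1)$, summed over latent dimensions, a squared-error stand-in for the reconstruction term, and a linear warm-up for the annealing coefficient. All three choices, and the parameterization by log-variance, are assumptions.

```python
import math

def kl_divergence(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)) summed over latent dimensions; log_var = log(sigma^2)."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))

def kl_anneal(step, warmup_steps):
    """Annealing coefficient rising linearly from 0 to 1 (schedule is assumed)."""
    return min(1.0, step / warmup_steps)

def vae_loss(x, x_rec, mu, log_var, beta):
    """Reconstruction term (squared error stand-in) plus beta-weighted KL term."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_rec))
    return recon + beta * kl_divergence(mu, log_var)
```

Starting with a small coefficient lets the decoder learn to reconstruct before the latent space is pulled toward the prior, which is the usual motivation for annealing.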

[0130] S2.3: Using the pre-trained VAE model, the rough samples generated by the GAN are refined in texture, sharpened at the edges, and enhanced in hotspot contours, outputting high-quality enhanced samples $X_{aug}$ with realistic details and faithful structure. Specifically, this includes the following steps:

[0131] S2.3.1: For the trained generative adversarial network (GAN), sample random noise $z_1 \sim N(0,1)$;

[0132] S2.3.2: Use the frozen generator $G_{best}$ to output the initial synthetic sample;

[0133] S2.3.3: Input the initial synthetic sample into the frozen encoder $Enc_{best}$ to extract its latent representation;

[0134] S2.3.4: Reconstruct through the frozen decoder $Dec_{best}$ to obtain the detail-enhanced sample. In the formula, $X_{aug}$ denotes the detail-enhanced sample, and $\odot$ denotes element-wise multiplication.
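Steps S2.3.1 to S2.3.4 can be sketched as a short pipeline. The three functions below are toy stand-ins for the frozen networks, not the trained models, and the latent code is assumed to be drawn by the usual reparameterization (mean plus standard deviation multiplied element-wise by noise), which is one plausible reading of the element-wise multiplication mentioned above.

```python
import math
import random

def generator_best(z1):
    """Stand-in for the frozen generator G_best: maps noise to a coarse sample."""
    return [math.tanh(v) for v in z1]

def encoder_best(x):
    """Stand-in for the frozen encoder Enc_best: returns (mu, log_var) per latent dimension."""
    mu = [0.5 * v for v in x]
    log_var = [-2.0 for _ in x]
    return mu, log_var

def decoder_best(z2):
    """Stand-in for the frozen decoder Dec_best: maps a latent code back to sample space."""
    return [math.tanh(2.0 * v) for v in z2]

def augment_once(dim, rng):
    z1 = [rng.gauss(0.0, 1.0) for _ in range(dim)]   # S2.3.1: z1 ~ N(0, 1)
    x_coarse = generator_best(z1)                     # S2.3.2: initial synthetic sample
    mu, log_var = encoder_best(x_coarse)              # S2.3.3: latent representation
    eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # Assumed reparameterization: z2 = mu + sigma (element-wise *) eps
    z2 = [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]
    return decoder_best(z2)                           # S2.3.4: detail-enhanced sample

if __name__ == "__main__":
    x_aug = augment_once(dim=4, rng=random.Random(0))
    print(len(x_aug))
```

Because all three networks are frozen, this stage involves no gradient updates; it is pure sampling and can be run offline to build the augmented set.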

[0135] S2.4: The detail-enhanced samples $X_{aug}$ are pooled with the original data, and the pooled data are divided into a training set and a test set at a ratio of 8:2.
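A minimal sketch of the 8:2 division follows. The shuffling, the fixed seed, and the function name are assumptions for illustration; the text does not say whether the split is random or stratified by class.

```python
import random

def split_8_2(samples, seed=0):
    """Shuffle the pooled samples and split them into (train, test) at an 8:2 ratio."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    cut = (len(items) * 8) // 10   # integer arithmetic avoids float rounding
    return items[:cut], items[cut:]

if __name__ == "__main__":
    original = ["orig_%d" % i for i in range(6)]
    augmented = ["aug_%d" % i for i in range(4)]
    train, test = split_8_2(original + augmented)
    print(len(train), len(test))
```

For strongly imbalanced fault classes, a per-class (stratified) split is a common alternative, though it is not described in the text.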

[0136] S3: Construct a dual-branch photovoltaic panel fault identification model. Based on the training set in S2, train the dual-branch photovoltaic panel fault identification model. The dual-branch photovoltaic panel fault identification model includes a CNN feature extraction module, a Transformer feature extraction module, and a dual-branch feature fusion module. End-to-end fault identification is achieved through feature extraction and fusion. Specifically, the following steps are included:

[0137] S3.1: The improved CNN feature extraction module outputs the multi-scale enhanced features $X_{out}$. Specifically, this includes the following steps:

[0138] S3.1.1: The input feature map of the training set is divided along the channel dimension into four sub-features $X_g$, where $g \in [1,4]$ is an integer, and $B$, $C$, $H$, $W$ denote the batch size, number of channels, height, and width, respectively.

[0139] S3.1.2: Depthwise separable convolutions with different receptive fields are applied to each sub-feature $X_g$ to generate the multi-scale local features $B_g$. The formula is:

[0140]

[0141] In the formula, $Conv_{1\times1}$ denotes a 1×1 standard convolution; $DWConv_{k\times k}$ denotes a depthwise separable convolution with kernel size $k \times k$ and dilation rate $d$; $B_g$ denotes the multi-scale local features extracted by the $g$-th branch, which are used to capture fault textures of different sizes in infrared images of photovoltaic panels.
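To illustrate how the four branches obtain different receptive fields, the sketch below splits the channel indices into four groups and computes the effective extent of a dilated kernel, $k + (k-1)(d-1)$. The per-branch kernel sizes and dilation rates in `BRANCH_SETTINGS` are hypothetical; the text does not list the actual values.

```python
def split_channels(num_channels, groups=4):
    """Return the channel index ranges of the sub-features X_1 .. X_groups."""
    if num_channels % groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    size = num_channels // groups
    return [range(g * size, (g + 1) * size) for g in range(groups)]

def effective_kernel(k, d):
    """Effective spatial extent of a k x k kernel with dilation rate d."""
    return k + (k - 1) * (d - 1)

# Hypothetical (kernel size, dilation rate) per branch; not specified in the text.
BRANCH_SETTINGS = [(3, 1), (3, 2), (5, 1), (5, 2)]

if __name__ == "__main__":
    for g, (chans, (k, d)) in enumerate(zip(split_channels(64), BRANCH_SETTINGS), start=1):
        print(g, chans.start, chans.stop, effective_kernel(k, d))
```

Dilation enlarges the covered area without adding weights, which is why combining it with small kernels is a cheap way to cover both small hot spots and larger shaded regions.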

[0142] S3.1.3: An independent channel attention mechanism is applied to the multi-scale local features $B_g$ extracted by each branch to generate channel-weighted features, as shown in the formula:

[0143]

[0144] In the formula, $AvgPool$ and $MaxPool$ denote global average pooling and global max pooling, respectively; $MLP$ denotes a shared two-layer 1×1 convolutional network (with a dimensionality reduction ratio of 2); $\sigma_s$ denotes the Sigmoid activation function; the attention map is the channel attention map of the $g$-th branch; and the weighted output is the feature after channel attention enhancement, used to highlight key channel responses related to photovoltaic panel faults;
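The channel attention described above can be sketched in plain Python, assuming it follows the usual pattern: both pooled vectors pass through the shared bottleneck, the results are summed, and a Sigmoid produces per-channel weights. The ReLU between the two layers and the toy weights are assumptions; on a pooled tensor a 1×1 convolution reduces to a matrix product, which is what `matvec` computes.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(w, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]

def shared_mlp(v, w1, w2):
    """Two-layer bottleneck shared by both pooled vectors."""
    hidden = [max(0.0, h) for h in matvec(w1, v)]  # ReLU is an assumed intermediate activation
    return matvec(w2, hidden)

def channel_attention(feat, w1, w2):
    """feat: list of C channels, each an H x W nested list. Returns (weights, weighted feature)."""
    flat = [[p for row in ch for p in row] for ch in feat]
    avg = [sum(c) / len(c) for c in flat]   # global average pooling
    mx = [max(c) for c in flat]             # global max pooling
    logits = [a + m for a, m in zip(shared_mlp(avg, w1, w2), shared_mlp(mx, w1, w2))]
    weights = [sigmoid(v) for v in logits]
    weighted = [[[p * w for p in row] for row in ch] for ch, w in zip(feat, weights)]
    return weights, weighted

if __name__ == "__main__":
    feat = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]  # C = 2, H = W = 2
    w1 = [[0.5, -0.5]]      # C -> C/2 (reduction ratio 2)
    w2 = [[1.0], [-1.0]]    # C/2 -> C
    weights, _ = channel_attention(feat, w1, w2)
    print(weights)
```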

[0145] S3.1.4: Concatenate the four sets of channel-weighted features to restore the complete channel dimension and obtain the fused feature $F$;

[0146]

[0147] In the formula, $Concat$ denotes concatenation of the channel-weighted features;

[0148] S3.1.5: A spatial attention mechanism is applied to the fused feature $F$ to generate the multi-scale enhanced features $X_{out}$. The formula is:

[0149]

[0150] In the formula, $AvgPool_{channel}$ denotes average pooling along the channel dimension; $MaxPool_{channel}$ denotes max pooling along the channel dimension; $M$ denotes the aggregated feature map; $A_s$ denotes the spatial attention map, used to focus on the spatial location of fault areas (such as hot spot centers and crack paths) in the infrared image of a photovoltaic panel; $X_{out}$ denotes the multi-scale enhanced features that are finally output; $Conv_{7\times7}$ denotes a 7×7 standard convolution; $[;]$ denotes concatenation along the channel dimension;

[0151] S3.2: The Transformer feature extraction module uses a Swin Transformer encoder to extract global context features from the input feature map. This includes the following steps:

[0152] S3.2.1: The Patch Embedding module divides the input feature map into non-overlapping 4×4 patches and maps each patch to a high-dimensional embedding vector;
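The patch partition in S3.2.1 can be sketched as follows for a single-channel grid. The function name is illustrative, and the learned linear projection that maps each flattened patch to its embedding vector is omitted.

```python
def patchify(image, patch=4):
    """Split an H x W grid (nested lists) into non-overlapping patch x patch blocks, each flattened."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("height and width must be multiples of the patch size")
    tokens = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            tokens.append([image[i + di][j + dj] for di in range(patch) for dj in range(patch)])
    return tokens

if __name__ == "__main__":
    image = [[r * 112 + c for c in range(112)] for r in range(112)]
    tokens = patchify(image)
    print(len(tokens), len(tokens[0]))
```

For a 112×112 input, the partition yields a 28×28 grid of patches, so the encoder starts from 784 tokens.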

[0153] S3.2.2: Input the embedding vectors into the Swin Transformer encoder, and implement global context modeling through multiple Swin Transformer Blocks to extract long-range dependencies between faults (such as hot spots, shadows, and cracks) across regions of the photovoltaic array.

[0154] Specifically, in the $l$-th Swin Transformer Block, the input feature sequence first undergoes LayerNorm normalization and is then updated using a shifted-window self-attention mechanism, with the following formula:

[0155]

[0156] In the formula, $LN$ denotes layer normalization; SW-MSA denotes the shifted-window multi-head self-attention mechanism; $MLP$ denotes a two-layer fully connected network (with GELU as the intermediate activation function), which models the semantic association between long-distance faults in photovoltaic panel images (such as the potential causal relationship between hot spots on edge modules and temperature anomalies in the central region); the block input is the feature sequence of layer $l-1$; the intermediate sequence is the feature sequence of layer $l$ after the input feature sequence has undergone attention interaction; and the block output is the output feature sequence of layer $l$;

[0157] Odd-numbered Swin Transformer blocks use SW-MSA to enhance cross-window information interaction;

[0158] Even-numbered Swin Transformer blocks use the standard Window Multi-Head Self-Attention (W-MSA) mechanism, which keeps the windows fixed and performs no shifting.
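The alternation between shifted and fixed windows can be sketched with a cyclic shift of the feature grid before window partitioning. The shift of half a window is the convention of the original Swin Transformer and is an assumption here, as are the function names.

```python
def cyclic_shift(grid, shift):
    """Roll a 2-D grid up and to the left by `shift` positions, wrapping around the edges."""
    h, w = len(grid), len(grid[0])
    return [[grid[(i + shift) % h][(j + shift) % w] for j in range(w)] for i in range(h)]

def window_ids(h, w, window):
    """Index of the window each position falls into after regular partitioning."""
    return [[(i // window) * (w // window) + (j // window) for j in range(w)] for i in range(h)]

def block_shift(layer_index, window):
    """Per the text, odd-numbered blocks use SW-MSA and even-numbered blocks use W-MSA."""
    return window // 2 if layer_index % 2 == 1 else 0

if __name__ == "__main__":
    ids = window_ids(8, 8, window=4)
    shifted = cyclic_shift(ids, block_shift(layer_index=1, window=4))
    # After the shift, each new top-left window mixes positions from several original windows.
    print(sorted({v for row in shifted[:4] for v in row[:4]}))
```

Mixing positions from neighbouring windows in every other block is what lets information propagate across window boundaries without computing attention over the whole image.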

[0159] Both odd-numbered and even-numbered layers are calculated using the attention calculation formula, namely:

[0160]

[0161] In the formula, $Q$, $K$, $V$ denote the query, key, and value matrices, respectively; $d_h$ denotes the dimension per head; $B$ is the learnable relative position bias, which characterizes the relative positional relationship between pixels within the window and effectively enhances the ability to detect structured faults in the photovoltaic array (such as the boundaries between modules); $Softmax$ denotes the softmax activation function.
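Assuming the attention formula has the form used in the Swin Transformer, $Softmax(QK^T/\sqrt{d_h} + B)V$, which is consistent with the symbols defined above, a single head inside a single window can be sketched as follows. The function names and toy inputs are illustrative.

```python
import math

def softmax(row):
    m = max(row)                      # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def window_attention(q, k, v, bias):
    """Softmax(Q K^T / sqrt(d_h) + B) V for one head inside one window.

    q, k, v: N x d_h nested lists; bias: N x N nested list (position bias B).
    """
    d_h = len(q[0])
    scale = 1.0 / math.sqrt(d_h)
    out = []
    for i, qi in enumerate(q):
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) + bias[i][j]
                  for j, kj in enumerate(k)]
        probs = softmax(scores)
        out.append([sum(p * vj[c] for p, vj in zip(probs, v)) for c in range(len(v[0]))])
    return out

if __name__ == "__main__":
    eye = [[1.0, 0.0], [0.0, 1.0]]
    zero_bias = [[0.0, 0.0], [0.0, 0.0]]
    print(window_attention(eye, eye, eye, zero_bias))
```

Because attention is restricted to the tokens of one window, the cost grows with the number of windows rather than with the square of the total token count.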

[0162] By alternating between W-MSA and SW-MSA, the Swin Transformer achieves a global receptive field while maintaining linear computational complexity, making it suitable for mining global fault patterns in infrared images of large photovoltaic panels.

[0163] S3.2.3: To meet the downsampling requirement, a Patch Merging layer is inserted after every few Swin Transformer Blocks;

[0164] For the feature sequence of the $l$-th layer, adjacent 2×2 blocks are concatenated and then reduced in dimension by linear projection, so that the spatial resolution of the feature map is halved and the number of channels is doubled; the result replaces the feature sequence of the $l$-th layer. The formula is:

[0165]

[0166] In the formula, the left-hand side is the output feature sequence of layer $l+1$, i.e., the result of the Patch Merging operation; the projection is a linear transformation; $Concat$ denotes concatenation along the channel dimension; the four concatenated terms are the four adjacent positions of a 2×2 block in the feature sequence of layer $l$, where $(2i,2j)$, $(2i,2j+1)$, $(2i+1,2j)$ and $(2i+1,2j+1)$ are the coordinates of the corresponding pixel positions;
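A minimal sketch of Patch Merging as described: the four neighbours of each 2×2 block are concatenated along the channel dimension (giving 4C channels) and then projected linearly to 2C channels. The projection matrix below is a toy example, not a trained weight.

```python
def patch_merge(feat, weight):
    """feat: H x W x C nested lists (H and W even). weight: (2C) x (4C) projection matrix.

    Returns an (H/2) x (W/2) x (2C) nested list.
    """
    h, w = len(feat), len(feat[0])
    merged = []
    for i in range(h // 2):
        row = []
        for j in range(w // 2):
            # Concatenate (2i,2j), (2i,2j+1), (2i+1,2j), (2i+1,2j+1) along the channel axis.
            cat = (feat[2 * i][2 * j] + feat[2 * i][2 * j + 1]
                   + feat[2 * i + 1][2 * j] + feat[2 * i + 1][2 * j + 1])
            row.append([sum(wk * ck for wk, ck in zip(wrow, cat)) for wrow in weight])
        merged.append(row)
    return merged

if __name__ == "__main__":
    feat = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]  # H = W = 4, C = 1
    weight = [[1.0, 1.0, 1.0, 1.0], [1.0, 0.0, 0.0, 0.0]]              # 4C -> 2C
    out = patch_merge(feat, weight)
    print(len(out), len(out[0]), len(out[0][0]))
```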

[0167] S3.2.4: The updated feature sequence from the last layer in S3.2.3 is aggregated using global average pooling to generate the global representation sequence $Y$.

[0168] S3.3: Fusion process. This includes the following steps:

[0169] S3.3.1: The multi-scale enhanced features $X_{out}$ are projected to the same channel dimension as $Y$; the formula is:

[0170]

[0171] In the formula, $X'$ denotes the projected feature sequence; $Conv_{1\times1}$ denotes a 1×1 standard convolution; $Y'$ denotes the reshaped feature sequence; $Reshape$ denotes a reshaping operation;

[0172] S3.3.2: To enhance the expressive power of the two sequences, adaptive enhancement is applied after global average pooling to obtain expressive feature sequences; the formula is:

[0173]

[0174] In the formula, $Linear$ denotes fully connected layer processing; $Expand$ denotes broadcast expansion; the two outputs are the feature sequences of $X'$ and $Y'$ after adaptive enhancement, respectively;

[0175] S3.3.3: To achieve cross-branch information interaction, a bidirectional windowed cross-attention mechanism is introduced: with a window size of $M=4$, both adaptively enhanced feature sequences are partitioned into local windows and padded to an integer multiple of the window size;

[0176]

[0177] In the formula, $H_p$ denotes the height after partitioning; $H_b$ denotes the height before partitioning; $W_p$ denotes the width after partitioning; $W_b$ denotes the width before partitioning.
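The padding step can be sketched as rounding each spatial size up to the next multiple of the window size and then cutting the padded grid into windows. Zero filling and padding on the bottom and right are assumptions; the text only states that the features are filled to the integer boundary.

```python
def padded_size(h_b, w_b, window=4):
    """Smallest multiples of the window size that are >= (h_b, w_b)."""
    h_p = -(-h_b // window) * window   # ceiling division using integers only
    w_p = -(-w_b // window) * window
    return h_p, w_p

def pad_and_partition(grid, window=4, fill=0.0):
    """Pad an H_b x W_b grid (assumed bottom/right placement) and cut it into windows."""
    h_b, w_b = len(grid), len(grid[0])
    h_p, w_p = padded_size(h_b, w_b, window)
    padded = [row + [fill] * (w_p - w_b) for row in grid]
    padded += [[fill] * w_p for _ in range(h_p - h_b)]
    windows = []
    for i in range(0, h_p, window):
        for j in range(0, w_p, window):
            windows.append([padded[i + di][j:j + window] for di in range(window)])
    return windows

if __name__ == "__main__":
    grid = [[1.0] * 6 for _ in range(5)]
    print(padded_size(5, 6), len(pad_and_partition(grid)))
```

The padding has to be removed again after attention, which is the purpose of the later step that restores the spatial structure.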

[0178] S3.3.4: Generate query, key, and value features respectively; the formula is:

[0179]

[0180] In the formula, $DSC$ denotes a depthwise separable convolution module, used for efficient extraction of local context; $DSC_Q$, $DSC_K$ and $DSC_V$ denote the depthwise separable convolution modules that produce the query, key, and value, respectively; $Q_Y$, $K_Y$ and $V_Y$ are the query, key, and value features obtained from the adaptively enhanced feature sequence of the Transformer branch; $Q_X$, $K_X$ and $V_X$ are the query, key, and value features obtained from the adaptively enhanced feature sequence of the CNN branch.

[0181] S3.3.5: Perform bidirectional cross-attention; the formula is:

[0182]

[0183] In the formula, the scaling dimension is the dimension per head; $Softmax$ denotes the softmax activation function; the first output is the attention output from the Transformer branch to the CNN branch, which lets long-range dependencies guide edge details; the second output is the attention output from the CNN branch to the Transformer branch, which lets local textures feed back into global semantics;

[0184] S3.3.6: After projection, the attention outputs are restored to their spatial structure and the padding is removed, using the following formula:

[0185]

[0186] In the formula, the first result is the output after the attention interaction from the Transformer branch to the CNN branch; the second result is the output after the attention interaction from the CNN branch to the Transformer branch; $DSC$ denotes depthwise separable convolution;

[0187] S3.3.7: Compute spatial element-level interaction features.

[0188]

[0189] S3.3.8: Concatenate the bidirectional attention output with the interaction features, and then reduce the dimensionality using a 1×1 convolution:

[0190]

[0191] S3.3.9: The four path features are concatenated and input into a multi-layer depthwise separable fusion network:

[0192]

[0193] In the formula, $FusionNet$ consists of the sequence IDSC→BN→GELU→DSC→BN→GELU→$Conv_{1\times1}$→BN→GELU, where IDSC is a pointwise convolution followed by a depthwise convolution, while DSC is a depthwise convolution followed by a pointwise convolution; IDSC represents a spatial convolution operation with dynamic parameters; BN denotes batch normalization; GELU denotes the activation function; $F_{fuse}$ denotes the joint representation output by the depthwise separable fusion network, in which local texture and global semantics are highly coupled.
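The appeal of building the fusion network from DSC and IDSC layers can be seen from a weight count. The sketch below compares a standard convolution with the two orderings described above, ignoring biases; the channel sizes in the example are arbitrary illustrations, not values from the patent.

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution."""
    return k * k * c_in * c_out

def dsc_params(c_in, c_out, k):
    """Depthwise k x k (one filter per input channel) followed by pointwise 1 x 1."""
    return k * k * c_in + c_in * c_out

def idsc_params(c_in, c_out, k):
    """Pointwise 1 x 1 followed by depthwise k x k on the c_out output channels."""
    return c_in * c_out + k * k * c_out

if __name__ == "__main__":
    c_in, c_out, k = 64, 128, 3
    print(conv_params(c_in, c_out, k), dsc_params(c_in, c_out, k), idsc_params(c_in, c_out, k))
```

With these example sizes, both separable variants need roughly an eighth of the weights of the standard convolution.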

[0194] The model is divided into four stages. The $F_{fuse}$ of each stage is used as the input for Transformer feature extraction in the next stage, and the result of the fourth stage is fed into the classification head to obtain the probability distribution over the fault types.

[0195] S3.4: Repeat S3.1-S3.3 multiple times. In each iteration, the joint representation $F_{fuse}$ of the previous layer is input into S3.2 in place of the input feature map for feature extraction, outputting a new global representation sequence $Y$; the $X_{out}$ of the previous layer is input into S3.1 for feature extraction to obtain new multi-scale enhanced features $X_{out}$; and a new joint representation $F_{fuse}$ is obtained through S3.3. After multiple iterations, the final joint representation $F_{fuse}$ is output;

[0196] S3.5: The final $F_{fuse}$ obtained through iteration is input into the classification head to identify the fault type;

[0197] S3.6: Using the training set, iteratively optimize the classification head with the loss function;

[0198] S3.7: Test the dual-branch photovoltaic panel fault identification model using the test set;

[0199] S4: The trained dual-branch photovoltaic panel fault identification model can accurately classify multiple faults such as hot spots, junction box overheating, dust obstruction, and micro-cracks from the infrared images of the photovoltaic panel.

[0200] Example Analysis:

[0201] The following example demonstrates the simulation verification of this invention:

[0202] The experimental hardware platform was a GeForce RTX3090 (24GB), and the experimental environment was Windows 10 with Python 3.6 and PyTorch 1.12. The research data sources were the PVF-10 dataset (Dataset1) provided by wangbo et al. (2024) and the Infrared Solar Modules dataset (Dataset2) provided by Raptor Maps.

[0203] The method identifies six types of photovoltaic faults, and the capability of each model is tested on Dataset1 and Dataset2. The six fault types selected for the experiments are shown in Table 1.

[0204] Table 1. Dataset Categories and Features

[0205]

[0206] The original infrared images were pseudo-color mapped and resampled to 112×112 pixels according to the aspect ratio of the photovoltaic panels. Minority-class samples were selected and sequentially input into a generative adversarial network (LSGAN) and a variational autoencoder (VAE) for two-stage training. After training, the parameters of the generator G and the decoder Dec were frozen; the structure is shown in Figure 2. The generator produces semantically novel samples, and the decoder enhances their details and textures, achieving high-quality data augmentation. The resulting samples are visualized in Figure 3. The generated samples and the original data are merged in an 8:2 ratio to form the enhanced training set Daug, which is then input into the recognition model (HCTFNet) for end-to-end training; the network structure is shown in Figure 4. To comprehensively evaluate model performance, this invention is compared with mainstream deep learning models (Vision Transformer, Swin Transformer, ResNet50, CoatNet) on the validation set. Metrics include Precision, Recall, F1-score, and FPR (false positive rate). Results for each category are recorded in Tables 2–4, the overall performance is summarized in Table 5, and the results are plotted in Figure 5. To further analyze the robustness of the model under different confidence thresholds, precision-recall curves (PR curves) were plotted, as shown in Figure 6.
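For reference, the four reported metrics can be computed per class from one-vs-rest confusion counts, as in the sketch below. The labels in the example are toy values chosen for illustration and are not results from the experiments.

```python
def one_vs_rest_counts(y_true, y_pred, cls):
    """Confusion counts (tp, fp, fn, tn) treating `cls` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    tn = len(y_true) - tp - fp - fn
    return tp, fp, fn, tn

def binary_metrics(tp, fp, fn, tn):
    """Precision, Recall, F1-score and false positive rate; 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}

if __name__ == "__main__":
    y_true = ["hot", "hot", "dust", "ok", "ok", "ok"]   # toy labels
    y_pred = ["hot", "dust", "dust", "ok", "hot", "ok"]
    print(binary_metrics(*one_vs_rest_counts(y_true, y_pred, "hot")))
```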

[0207] Table 2 Comparison Results of Dust Belt and Junction Box Models

[0208]

[0209] Table 3 Comparison Results of Hotspot and Short Circuit Models

[0210]

[0211] Table 4 Comparison Results of Open Circuit and Short Circuit Models

[0212]

[0213] Table 5 Overall Comparison Results of Each Model

[0214]

[0215] Figure 6 shows the PR curves of each model on Dataset1. The average precision (AP) of the method of this invention is significantly higher than that of the comparison models in all categories, especially for minority fault classes such as dust belts and microcracks.

[0216] Dataset 2 was used to verify the model's generalization ability. The models were trained on Dataset 2 following the same procedure as for Dataset 1 and then compared. The comparison results are shown in Table 6 and Figure 7.

[0217] Table 6 Comparison results of models in Dataset 2

[0218]

[0219] As can be clearly seen from the tables and figures, the method proposed in this invention achieves the highest scores across all four metrics in both datasets. Furthermore, the AP value of the method proposed in this invention is also the highest in the precision-recall curves. Other methods, however, consistently perform slightly worse than this model across different categories.

[0220] In summary, the proposed GAN-VAE collaborative generation + dual-branch recognition framework achieves the best performance in terms of precision, recall, F1 score, and false positive rate on two independent datasets. In particular, it demonstrates strong robustness and generalization ability in the recognition of minority classes and complex texture faults (such as dust belts and microcracks), verifying the engineering application value of the method.

[0221] Based on the above-described preferred embodiments of the present invention, and through the foregoing description, those skilled in the art can make various changes and modifications without departing from the inventive concept. The technical scope of this invention is not limited to the contents of the specification, but must be determined according to the scope of the claims.

Claims

1. A Transformer-based dual-branch photovoltaic panel fault identification method, characterized in that it includes the following steps:
S1: Obtain raw photovoltaic panel data and perform data preprocessing to obtain a preprocessed sample set;
S2: Based on the sample set, construct a GAN-VAE collaborative generation model to generate high-quality enhanced samples; then combine the original photovoltaic panel data from S1 with the high-quality enhanced samples, and divide them into a training set and a test set; specifically, this includes the following steps:
S2.1: Independently train a generative adversarial network (GAN);
S2.2: Independently train a variational autoencoder (VAE);
S2.3: Using the pre-trained VAE model, the rough samples generated by the GAN are refined in texture, sharpened at the edges, and enhanced in hotspot contours, outputting high-quality enhanced samples $X_{aug}$ with realistic details and faithful structure;
S2.4: The detail-enhanced samples $X_{aug}$ are pooled with the original photovoltaic panel data and divided into a training set and a test set;
S3: Construct a dual-branch photovoltaic panel fault identification model, and train it on the training set to obtain a trained dual-branch photovoltaic panel fault identification model; specifically, this includes the following steps:
S3.1: Input the input feature map into the improved CNN feature extraction module, and output the multi-scale enhanced features $X_{out}$;
S3.2: The Transformer feature extraction module uses a Swin Transformer encoder to extract global context features from the input feature map and output a global representation sequence $Y$;
S3.3: From the global representation sequence $Y$ and the multi-scale enhanced features $X_{out}$, generate the adaptively enhanced feature sequences and the element-level interaction features, concatenate these features, and input them into a depthwise separable fusion network to obtain a joint representation $F_{fuse}$ in which local texture and global semantics are highly coupled;
S3.4: Repeat S3.1-S3.3 multiple times; in each iteration, the joint representation $F_{fuse}$ of the previous layer is input into S3.2 in place of the input feature map for feature extraction, outputting a new global representation sequence $Y$; the $X_{out}$ of the previous layer is input into S3.1 for feature extraction to obtain new multi-scale enhanced features $X_{out}$; and a new joint representation $F_{fuse}$ is obtained through S3.3; after multiple iterations, the final joint representation $F_{fuse}$ is output;
S3.5: The final $F_{fuse}$ obtained through iteration is input into the classification head to identify the fault type;
S3.6: Using the training set, iteratively optimize the classification head with the loss function;
S3.7: Test the dual-branch photovoltaic panel fault identification model using the test set;
S4: Through the trained dual-branch photovoltaic panel fault identification model, multiple types of faults, such as hot spots, junction box overheating, dust obstruction, and micro-cracks, are accurately classified from the infrared images of photovoltaic panels.

2. The Transformer-based dual-branch photovoltaic panel fault identification method according to claim 1, characterized in that, in step S2.1, the loss functions of the generator and the discriminator are established:
In the formula, $L_D$ is the loss function of the discriminator $D$; $D(x)$ is the probability value that the discriminator $D$ assigns to a real sample; the first expectation is the expected value when $x$ traverses the training-set samples in S1; $z \sim N(0,1)$ denotes random noise drawn from the standard normal distribution; $G(z)$ is the fault sample output by the generator $G$ when random noise $z$ is its input; $D(G(z))$ is the probability value that the discriminator $D$ assigns to the generated fault sample; the second expectation is the expected value when $z$ is random noise; $L_G$ is the loss function of the generator $G$;
the generator $G$ is trained on the sample set in S1 and frozen after convergence, and the frozen generator is denoted as $G_{best}$.

3. The Transformer-based dual-branch photovoltaic panel fault identification method according to claim 2, characterized in that, in step S2.2, the loss function of the variational autoencoder (VAE) is established:
In the formula, $L_{VAE}$ is the loss function of the variational autoencoder (VAE); the posterior term is the Gaussian posterior distribution output by the VAE; the likelihood term is the generation probability of the VAE's decoder; the expectation term is the expected probability of reconstructing the original sample, i.e., the reconstruction term; the coefficient of the KL term is the KL annealing coefficient; the KL divergence is the KL regularization term; where,
In the formula, $\mu$ is the mean of the Gaussian posterior distribution; $\sigma^2$ is the variance of the Gaussian posterior distribution; the posterior is the Gaussian distribution with mean $\mu$ and variance $\sigma^2$; the summation runs over all latent-variable dimensions of the VAE; and $z_a$ denotes the latent variable of the VAE;
after training to convergence on the sample set in S1, a frozen encoder and a frozen decoder are obtained; the frozen encoder is denoted as $Enc_{best}$ and the frozen decoder is denoted as $Dec_{best}$; the frozen generator $G_{best}$, the frozen encoder $Enc_{best}$, and the frozen decoder $Dec_{best}$ together constitute the trained GAN-VAE collaborative generation model.

4. The Transformer-based dual-branch photovoltaic panel fault identification method according to claim 3, characterized in that step S2.3 includes the following steps:
S2.3.1: For the trained generative adversarial network (GAN), sample random noise $z_1 \sim N(0,1)$;
S2.3.2: Use the frozen generator $G_{best}$ to output the initial synthetic sample;
S2.3.3: Input the initial synthetic sample into the frozen encoder $Enc_{best}$ to extract its latent representation;
S2.3.4: Reconstruct through the frozen decoder $Dec_{best}$ to obtain the detail-enhanced sample; in the formula, $X_{aug}$ denotes the detail-enhanced sample, and $\odot$ denotes element-wise multiplication.

5. The Transformer-based dual-branch photovoltaic panel fault identification method according to claim 4, characterized in that step S3.1 includes the following steps:
S3.1.1: The input feature map of the training set is divided along the channel dimension into four sub-features $X_g$, where $g \in [1,4]$ is an integer, and $B$, $C$, $H$, $W$ denote the batch size, number of channels, height, and width, respectively;
S3.1.2: Depthwise separable convolutions with different receptive fields are applied to each sub-feature $X_g$ to generate the multi-scale local features $B_g$; the formula is:
In the formula, $Conv_{1\times1}$ denotes a 1×1 standard convolution; $DWConv_{k\times k}$ denotes a depthwise separable convolution with kernel size $k \times k$ and dilation rate $d$; $B_g$ denotes the multi-scale local features extracted by the $g$-th branch;
S3.1.3: An independent channel attention mechanism is applied to the multi-scale local features $B_g$ extracted by each branch to generate channel-weighted features, as shown in the formula:
In the formula, $AvgPool$ and $MaxPool$ denote global average pooling and global max pooling, respectively; $MLP$ denotes a shared two-layer 1×1 convolutional network; $\sigma_s$ denotes the Sigmoid activation function; the attention map is the channel attention map of the $g$-th branch; the weighted output is the feature after channel attention enhancement, used to highlight key channel responses related to photovoltaic panel faults;
S3.1.4: Concatenate the four sets of channel-weighted features to restore the complete channel dimension and obtain the fused feature $F$;
In the formula, $Concat$ denotes concatenation of the channel-weighted features;
S3.1.5: A spatial attention mechanism is applied to the fused feature $F$ to generate the multi-scale enhanced features $X_{out}$; the formula is:
In the formula, $AvgPool_{channel}$ denotes average pooling along the channel dimension; $MaxPool_{channel}$ denotes max pooling along the channel dimension; $M$ denotes the aggregated feature map; $A_s$ denotes the spatial attention map, used to focus on the spatial location of fault areas in infrared images of photovoltaic panels; $X_{out}$ denotes the multi-scale enhanced features that are finally output; $Conv_{7\times7}$ denotes a 7×7 standard convolution; $[;]$ denotes concatenation along the channel dimension.

6. The Transformer-based dual-branch photovoltaic panel fault identification method according to claim 5, characterized in that step S3.2 includes the following steps:
S3.2.1: The Patch Embedding module divides the input feature map into non-overlapping 4×4 image patches and maps each patch to a high-dimensional embedding vector;
S3.2.2: The embedding vectors are input into the Swin Transformer encoder, and global context modeling is implemented through multiple Swin Transformer Blocks; in the $l$-th Swin Transformer Block, the update is performed using a shifted-window self-attention mechanism, with the following formula:
In the formula, $LN$ denotes layer normalization; SW-MSA denotes the shifted-window multi-head self-attention mechanism; $MLP$ denotes a two-layer fully connected network; the block input is the feature sequence of layer $l-1$; the intermediate sequence is the feature sequence of layer $l$ after the input feature sequence has undergone attention interaction; the block output is the output feature sequence of layer $l$;
odd-numbered Swin Transformer blocks use SW-MSA to enhance cross-window information interaction; even-numbered Swin Transformer blocks use the standard Window Multi-Head Self-Attention mechanism, which keeps the windows fixed and performs no shifting; both odd-numbered and even-numbered layers use the same attention formula, namely:
In the formula, $Q$, $K$, $V$ denote the query, key, and value matrices, respectively; $d_h$ denotes the dimension per head; $B$ is the learnable relative position bias, which characterizes the relative positional relationship between pixels within the window; $Softmax$ denotes the softmax activation function;
S3.2.3: After every few Swin Transformer Blocks, a Patch Merging layer is inserted; for the feature sequence of the $l$-th layer, adjacent 2×2 blocks are concatenated and then reduced in dimension by linear projection, so that the spatial resolution of the feature map is halved and the number of channels is doubled, and the result replaces the feature sequence of the $l$-th layer; the formula is:
In the formula, the left-hand side is the output feature sequence of layer $l+1$, i.e., the result of the Patch Merging operation; the projection is a linear transformation; $Concat$ denotes concatenation along the channel dimension; the four concatenated terms are the four adjacent positions of a 2×2 block in the feature sequence of layer $l$, indexed by the coordinates of the corresponding pixel positions;
S3.2.4: The updated feature sequence from the last layer in S3.2.3 is aggregated using global average pooling to generate the global representation sequence $Y$.

7. The Transformer-based dual-branch photovoltaic panel fault identification method according to claim 6, characterized in that step S3.3 includes the following steps:
S3.3.1: The multi-scale enhanced features $X_{out}$ are projected to the same channel dimension as $Y$; the formula is:
In the formula, $X'$ denotes the projected feature sequence; $Conv_{1\times1}$ denotes a 1×1 standard convolution; $Y'$ denotes the reshaped feature sequence; $Reshape$ denotes a reshaping operation;
S3.3.2: Apply global average pooling followed by adaptive enhancement to obtain expressive feature sequences; the formula is:
In the formula, $Linear$ denotes fully connected layer processing; $Expand$ denotes broadcast expansion; the two outputs are the feature sequences of $X'$ and $Y'$ after adaptive enhancement, respectively;
S3.3.3: To achieve cross-branch information interaction, a bidirectional windowed cross-attention mechanism is introduced: with a window size of $M=4$, both adaptively enhanced feature sequences are partitioned into local windows and padded to an integer multiple of the window size;
In the formula, $H_p$ denotes the height after partitioning; $H_b$ denotes the height before partitioning; $W_p$ denotes the width after partitioning; $W_b$ denotes the width before partitioning;
S3.3.4: Generate the query, key, and value features respectively; the formula is:
In the formula, $DSC$ denotes a depthwise separable convolution module, used for efficient extraction of local context; $DSC_Q$, $DSC_K$ and $DSC_V$ denote the depthwise separable convolution modules that produce the query, key, and value, respectively; $Q_Y$, $K_Y$ and $V_Y$ are the query, key, and value features obtained from the adaptively enhanced feature sequence of the Transformer branch; $Q_X$, $K_X$ and $V_X$ are the query, key, and value features obtained from the adaptively enhanced feature sequence of the CNN branch;
S3.3.5: Perform bidirectional cross-attention; the formula is:
In the formula, the scaling dimension is the dimension per head; $Softmax$ denotes the softmax activation function; the first output is the attention output from the Transformer branch to the CNN branch; the second output is the attention output from the CNN branch to the Transformer branch;
S3.3.6: After projection, the attention outputs are restored to their spatial structure and the padding is removed, using the following formula:
In the formula, the first result is the output after the attention interaction from the Transformer branch to the CNN branch; the second result is the output after the attention interaction from the CNN branch to the Transformer branch; $DSC$ denotes depthwise separable convolution;
S3.3.7: Compute the spatial element-level interaction features;
S3.3.8: Concatenate the bidirectional attention outputs with the interaction features, and then reduce the dimensionality using a 1×1 convolution;
S3.3.9: The four path features are concatenated and input into a multi-layer depthwise separable fusion network:
In the formula, $FusionNet$ consists of the sequence IDSC→BN→GELU→DSC→BN→GELU→$Conv_{1\times1}$→BN→GELU, where IDSC is a pointwise convolution followed by a depthwise convolution, while DSC is a depthwise convolution followed by a pointwise convolution; BN denotes batch normalization; GELU denotes the activation function.