Triple attention generative adversarial network driven infrared and visible image fusion method

By using a triple attention generative adversarial network-driven approach, deep fusion of infrared and visible light images was achieved, solving the problems of insufficient feature interaction and rigid fusion strategies in existing technologies. The generated images clearly present infrared targets and visible light textures under low light conditions, improving the accuracy of target detection and recognition.

CN120543392B | Active | Publication Date: 2026-06-23 | SHENYANG UNIVERSITY OF TECHNOLOGY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHENYANG UNIVERSITY OF TECHNOLOGY
Filing Date
2025-05-14
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing infrared and visible light image fusion methods suffer from insufficient feature interaction, rigid fusion strategies, and one-sided loss functions, making it difficult to effectively preserve the saliency of infrared targets and the texture of visible light under low illumination conditions.

Method used

A triple attention generative adversarial network, including channel, spatial and point attention mechanisms, is adopted. Combined with a nested connection generator and a dual discriminator, and optimized through a multimodal loss function, deep fusion of infrared and visible light images is achieved.

Benefits of technology

The generated fused image visually presents infrared targets and visible light details clearly, enhancing information richness and structural similarity, and significantly improving the accuracy of target detection and recognition. It is suitable for fields such as UAV night patrol, remote sensing imaging, and medical imaging.

✦ Generated by Eureka AI based on patent content.

Abstract

The application discloses an infrared and visible light image fusion method driven by a triple attention generative adversarial network. Through the collaborative design of a triple attention mechanism, a nested connection generator, a dual discriminator, and a multi-modal loss function, an efficient infrared and visible light image fusion framework is constructed. The triple attention mechanism strengthens feature representation from multiple dimensions. Channel attention dynamically adjusts the weight of each channel, so that the model focuses more on key information channels such as infrared thermal radiation features and visible light textures. Spatial attention generates a weight tensor by analyzing the spatial distribution of the feature map, effectively highlighting regions of interest such as target outlines and edges. Point attention mines the subtle correlations between pixels in the feature map and strengthens the interaction of local details. Together, the three realize the complementation and enhancement of features at the channel, spatial, and point levels, thereby significantly improving the model's ability to capture complex features in multi-modal images.

Description

Technical Field

[0001] This invention relates to the fields of image processing and computer vision technology, and in particular to an infrared-visible image fusion method driven by a triple attention generative adversarial network.

Background Technology

[0002] In nighttime campus patrol scenarios using drones, visible light images suffer from low contrast and blurred details due to low lighting conditions, while infrared images, although capable of capturing thermal radiation information, lack textural detail. Existing fusion strategies struggle to achieve an effective balance between these two aspects. Traditional image fusion methods (such as multi-scale transformation) rely on manually designed rules, easily losing modality-specific features. While deep learning-based methods (such as FusionGAN and DDcGAN) improve fusion performance through adversarial training, they still have the following shortcomings:

[0003] (1) Insufficient feature extraction:

[0004] Existing network structures lack sufficient interaction mining of channel, spatial and local detail features, resulting in incomplete preservation of key information in fused images (such as infrared target saliency and visible light texture).

[0005] (2) Overly simple fusion strategy:

[0006] Existing methods rely on fixed fusion rules (such as simple weighting or concatenation) and lack a trainable adaptive mechanism, so they struggle to cope with modal differences in complex scenarios.

[0007] (3) Limitations of loss function design:

[0008] Focusing only on pixel intensity or single-modal features without simultaneously considering the texture details and structural information of multimodal images results in limited quality of the fused image.

[0009] Therefore, there is an urgent need for an end-to-end deep learning framework that can achieve deep fusion and efficient preservation of multimodal features through a trainable attention mechanism and an optimized loss function.

Summary of the Invention

[0010] In view of the shortcomings of the prior art, the purpose of this invention is to provide an infrared and visible light image fusion method driven by a triple attention generative adversarial network, which aims to solve the problems of insufficient feature interaction, rigid fusion strategy and one-sided loss function in the existing infrared and visible light image fusion.

[0011] To achieve the above objectives, the present invention adopts the following technical solution:

[0012] An infrared-visible image fusion method driven by a triple attention generative adversarial network includes:

[0013] Step 1: Obtain the registered infrared and visible light images in the same scene, perform preprocessing, unify the number of channels in the visible light and infrared images, and stitch them together into a multi-channel image;

[0014] Step 2: Set up a feature extraction network including a triple attention module to extract features from multi-channel images and generate a single-channel fused feature map;

[0015] Step 3: Transform the single-channel fusion feature map generated in Step 2 through a convolutional layer to generate a final fusion image that includes infrared target saliency and rich visible light texture;

[0016] Step 4: Based on steps 1-3, the fused image is distinguished from the real infrared and visible light images by a dual discriminator. The distribution difference between the generated image and the real image is calculated by the loss based on the least squares method, which guides the generator to learn the feature representation that conforms to the dual-modal distribution, ensuring that the fused image retains both the saliency of the infrared target and the texture details of the visible light.

[0017] Furthermore, in step 1, registered infrared and visible light images of the same scene are acquired, both with the same resolution. Image sources include devices such as drone sensors and cameras, or publicly available datasets. If the visible light image is an RGB three-channel image, it is converted to a single-channel grayscale image, ensuring both the visible light and infrared images are single-channel to unify the number of channels. Subsequently, both images are normalized using a formula... The pixel value range is adjusted to [-1, 1], and finally the normalized infrared and visible light images are stitched together along the channel dimension to form a 2×H×W input tensor as the initial input to the generator.

[0018] Furthermore, in step 2, the triple attention module of the feature extraction network includes channel attention, spatial attention, and point attention. Channel attention generates channel weights through bilinear layers, ReLU activation function, and Sigmoid function to dynamically enhance key feature channels. Spatial attention extracts spatial features based on average pooling and max pooling, and generates spatial weight tensors through convolutional layers and Sigmoid function to focus on regions of interest. Point attention analyzes the contextual relevance of each pixel in the feature map, enhances the interaction of local detail features through residual transformation, and captures subtle texture differences. After independent processing, the feature map is stitched together to form a multi-view enhanced feature representation.

[0019] Furthermore, in step 2, the feature extraction network also includes an encoder and a decoder. The feature map processed by the triple attention module is input to the encoder, which consists of 5 residual blocks. Multi-scale features are extracted step by step through downsampling, with the number of channels in each layer increasing to achieve deep encoding of modal difference features. The high-level features output by the encoder are input to the UNet++ simplified nested connection decoder. Through upsampling and connection with the corresponding layer of the encoder, the low-level details and high-level semantics are progressively restored, and finally a single-channel fused feature map is generated.

[0020] Furthermore, in step 3, the feature map output in step 2 is converted by the last convolutional layer into a fused image with pixel values in the range [-1, 1]. If normalization is used in the preprocessing, the pixel values need to be restored to the original range [0, 255] by inverse normalization, using the following formula:

[0021]

[0022] Here, $I_{fused}$ on the left-hand side denotes the fused-image pixel values restored to the original range [0, 255] after inverse normalization; $I_{fused}$ on the right-hand side denotes the pixel values output by the last convolutional layer, in the range [-1, 1]; max(I) denotes the maximum pixel value of the original image, and min(I) denotes the minimum pixel value of the original image.
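
The formula of paragraph [0021] is not reproduced in this text. As a hedged reconstruction only, the unique linear map that sends [-1, 1] to [min(I), max(I)] and agrees with the definitions in [0022] would read as follows. The superscript "out" is introduced here purely to tell the two sides apart and is not the patent's notation.

```latex
I_{fused}^{\mathrm{out}} = \frac{I_{fused} + 1}{2}\,\bigl(\max(I) - \min(I)\bigr) + \min(I)
```

With min(I) = 0 and max(I) = 255, this sends -1 to 0 and 1 to 255.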

[0023] The final output is a fused image that combines infrared target saliency with rich visible texture.

[0024] Furthermore, in step 4, when using the least squares method, the discriminator's loss function is determined by the mean square error loss between the discriminator output and the preset target label. This ensures that the discriminator can effectively distinguish between real and generated images. The discriminator loss function is as follows:

[0025]

[0026] Here, MSE(·) denotes the mean squared error loss, $D_{real}$ denotes the discriminator's output prediction for the real image, $D_{fake}$ denotes the discriminator's output prediction for the generated image, N denotes the dimension of the vector, and $L_D$ denotes the total loss of the discriminator.
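
The formula of paragraph [0025] is likewise missing from this text. A form consistent with the symbols defined in [0026], and with the target labels 1 (real) and 0 (generated) stated in the detailed implementation, might be written as below. This is an assumption about the layout of the original formula, not a quotation of it.

```latex
L_D = \mathrm{MSE}(D_{real}, 1) + \mathrm{MSE}(D_{fake}, 0),
\qquad
\mathrm{MSE}(x, y) = \frac{1}{N} \sum_{n=1}^{N} (x_n - y)^2
```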

[0027] Furthermore, the generator loss function is mainly divided into two parts. The first part is the adversarial loss, which is determined based on the discriminator's output. The second part is the content loss, which includes pixel loss and other losses. The samples generated by the generator are input into two different branches of the discriminator, namely $D_i$ and $D_v$, and the adversarial losses for the infrared and visible light modalities are calculated separately. The total adversarial loss of the generator is the sum of the losses of these two branches. The formula for the adversarial loss is as follows:

[0028] $L_{adv} = \mathrm{MSE}(D_v(\mathrm{fake}), 1) + \mathrm{MSE}(D_i(\mathrm{fake}), 1)$

[0029] Here, $D_v(\mathrm{fake})$ denotes the visible light discriminator's judgment of the fused image.

[0030] $D_i(\mathrm{fake})$ denotes the infrared discriminator's judgment of the fused image, MSE(·) denotes the mean squared error loss, and $L_{adv}$ denotes the adversarial loss.

[0031] Furthermore, the pixel loss is measured by the Frobenius norm, which measures the pixel difference between the generated image and the original image, using the following formula:

[0032]

[0033] Here, Fused denotes the fused image, $Input_{ir}$ denotes the original infrared image, $Input_{vis}$ denotes the original visible light image, $\|\cdot\|_F$ denotes the Frobenius norm, $\omega_p$ denotes the balance coefficient in the loss, and $L^{ir}_{pixel}$ and $L^{vis}_{pixel}$ denote the pixel losses for infrared and visible light, respectively;

[0034] The gradient loss measures the difference in the gradient domain between the generated image and the real image using the L1 norm, and the formula is as follows:

[0035]

[0036] Here, Fused denotes the fused image, $Input_{ir}$ denotes the original infrared image, $Input_{vis}$ denotes the original visible light image, Grad(·) denotes the gradient calculation of the image, $\|\cdot\|_1$ denotes the L1 norm, $\omega_g$ denotes the balance coefficient in the loss, and $L^{ir}_{grad}$ and $L^{vis}_{grad}$ denote the gradient losses for infrared and visible light, respectively;

[0037] The content loss is the weighted sum of the two:

[0038] $L_{con} = L_{pixel} + \lambda L_{grad}$

[0039] Here, $L_{con}$ denotes the content loss, $L_{pixel}$ denotes the pixel loss, $L_{grad}$ denotes the gradient loss, and $\lambda$ denotes the balance coefficient between the two losses.

[0040] The total loss of the generator is:

[0041] $L_G = L_{adv} + L_{con}$

[0042] Here, $L_G$ denotes the total loss of the generator, $L_{adv}$ denotes the adversarial loss of the generator, and $L_{con}$ denotes the content loss.

[0043] The technical solution adopted in this invention has the following beneficial effects:

[0044] This invention constructs an efficient infrared and visible light image fusion framework through the collaborative design of a triple attention mechanism, a nested connection generator, a dual discriminator, and a multimodal loss function. The triple attention mechanism enhances feature representation from multiple dimensions: channel attention dynamically adjusts the weights of each channel, making the model more focused on key information channels such as infrared thermal radiation features and visible light texture; spatial attention generates weight tensors by analyzing the spatial distribution of feature maps, effectively highlighting regions of interest such as target contours and edges; and point attention mines the subtle correlations of each pixel in the feature map, strengthening the interaction of local details. The combination of the three achieves feature complementarity and enhancement at the channel, spatial, and point levels, significantly improving the model's ability to capture complex features in multimodal images. The nested connection generator leverages multi-scale feature extraction from residual blocks in the encoder and dense cross-layer connections in the decoder to integrate high-level semantic features while preserving low-level details. This achieves a progressive recovery from coarse-grained structure to fine-grained texture, avoiding the information loss problem caused by single-scale processing in traditional networks. Dual discriminators are designed for infrared and visible light modalities respectively. Adversarial training forces the generator to learn feature representations that conform to the true distribution of both modalities, ensuring that the fused image retains both the target saliency of the infrared image and the rich texture of the visible light image, avoiding the modality bias problem that may be caused by a single discriminator. 
The loss function system organically combines adversarial loss, pixel loss, and gradient loss to constrain the visual realism of the generated image while taking into account pixel intensity and texture details in the gradient domain. This allows the fusion result to achieve higher-quality information integration while maintaining the original features of infrared and visible light. Experiments show that the proposed framework significantly outperforms several advanced methods in fusion performance on publicly available datasets. The generated images not only visually present clearer details of infrared targets and visible light scenes, such as the outlines of heat sources and the texture of the surrounding environment in nighttime scenes, but also demonstrate outstanding performance in key indicators such as information richness and structural similarity. This effectively solves the problems of poor visible light image quality under low-light conditions and incomplete feature preservation in traditional fusion methods. In practical applications, this method provides clearer fused images for UAV nighttime patrols, enabling subsequent target detection and recognition tasks to more accurately locate and analyze key information in the scene, significantly improving monitoring efficiency and security in complex environments. It also exhibits strong adaptability in remote sensing imaging, medical imaging, and other fields, providing an innovative solution for the deep fusion and practical application of multimodal images, with significant technical value and broad engineering application prospects.

Attached Figure Description

[0045] Figure 1 is a diagram of the triple attention mechanism model, including the structural designs of channel attention (a), spatial attention (b), and point attention (c);

[0046] Figure 2 is a diagram of the generator structure;

[0047] Figure 3 is a diagram of the dual discriminator structure;

[0048] Figure 4 shows the qualitative analysis results of the attention ablation experiment;

[0049] Figure 5 shows the qualitative analysis of the fusion results on the TNO dataset.

Detailed Implementation

[0050] To make the objectives, technical solutions, and effects of this invention clearer and more explicit, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

[0051] An infrared-visible image fusion method driven by a triple attention generative adversarial network is implemented through the following core modules and steps:

[0052] 1) Image acquisition and preprocessing

[0053] ① Data Input: Acquire registered infrared images (single channel, denoted as $I_{ir}$) and visible light images (single or three channels, denoted as $I_{vis}$) of the same scene. Both have the same resolution and originate from devices such as drone sensors and cameras, or from publicly available datasets (such as the TNO dataset).

[0054] ② Preprocessing: If the visible light image is an RGB three-channel image, it is converted into a single-channel grayscale image to ensure that both the visible light and infrared images are single-channel to unify the number of channels. Then, both are normalized using the formula... The pixel value range is adjusted to [-1, 1], and finally the normalized infrared and visible light images are stitched together along the channel dimension to form a 2×H×W input tensor as the initial input to the generator.
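
As an illustration of the preprocessing in [0053] and [0054], the following NumPy sketch converts an RGB visible light image to grayscale, scales both images to [-1, 1], and stacks them into a 2×H×W tensor. The normalization formula itself is elided in the text, so min-max scaling is assumed here (it matches the max(I) and min(I) terms used for inverse normalization). The BT.601 grayscale weights and all function names are likewise assumptions made for this sketch.

```python
import numpy as np


def to_grayscale(rgb):
    """Convert an H x W x 3 RGB image to H x W grayscale.

    The BT.601 luma weights are an assumption; the source names no conversion.
    """
    weights = np.array([0.299, 0.587, 0.114])
    return rgb.astype(np.float64) @ weights


def normalize(img):
    """Min-max scale pixel values to [-1, 1] (assumed form of the elided formula)."""
    img = img.astype(np.float64)
    lo, hi = img.min(), img.max()
    if hi == lo:
        # A constant image has no dynamic range; map it to zeros.
        return np.zeros_like(img)
    return 2.0 * (img - lo) / (hi - lo) - 1.0


def build_generator_input(ir, vis):
    """Return the 2 x H x W generator input: channel 0 infrared, channel 1 visible."""
    if vis.ndim == 3:
        vis = to_grayscale(vis)
    if ir.shape != vis.shape:
        raise ValueError("registered images must share the same resolution")
    return np.stack([normalize(ir), normalize(vis)], axis=0)
```

Calling `build_generator_input(ir, vis)` on a pair of registered H×W (or H×W×3) arrays yields the tensor that would be passed to the generator.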

[0055] 2) Feature extraction and fusion (generator processing)

[0056] The generator structure is shown in Figure 2. The concatenated multi-channel image is processed sequentially by the core modules of the generator:

[0057] ① Triple Attention Feature Enhancement: The triple attention mechanism is shown in Figure 1. Channel attention, spatial attention, and point attention are processed independently, and the generated feature maps are then concatenated to form a multi-dimensional enhanced feature representation. Specifically, channel attention generates channel weights through bilinear layers, ReLU, and sigmoid to dynamically enhance key features in the infrared thermal radiation channel and the visible light texture channel; spatial attention extracts spatial features using average pooling and max pooling to generate spatial weight tensors, thereby focusing on regions of interest such as target contours and edges; and point attention analyzes the contextual correlation of pixels and enhances local detail interactions through residual transformation to capture subtle texture differences.

[0058] ② Multi-scale feature extraction of residual encoder: The feature map processed by the attention module is input to the encoder consisting of 5 residual blocks. Multi-scale features are extracted step by step through downsampling (such as convolution or pooling operations). The number of channels in each layer increases (such as 64→128→256→512) to achieve deep encoding of modal difference features.

[0059] ③ Dense Decoder Feature Recovery and Fusion: The high-level features output by the encoder are input into the simplified nested connected decoder of UNet++. Through upsampling and connection with the corresponding layer of the encoder, the low-level details and high-level semantics are progressively recovered, and finally a single-channel fused feature map is generated.
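
The channel and spatial branches described in item ① above can be sketched in NumPy as follows. This is a rough illustration under several assumptions: the "bilinear layers" are read as two stacked linear layers applied to globally average-pooled features, the spatial convolution uses a small zero-padded kernel of unspecified size, and the weights are passed in rather than learned. Point attention is omitted because the text does not give enough detail to sketch it faithfully.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(feat, w1, w2):
    """Reweight the channels of a C x H x W feature map.

    Assumed pipeline: global average pool -> linear -> ReLU -> linear -> sigmoid.
    w1 has shape (C_hidden, C) and w2 has shape (C, C_hidden).
    """
    squeezed = feat.mean(axis=(1, 2))        # one value per channel
    hidden = np.maximum(w1 @ squeezed, 0.0)  # ReLU
    weights = sigmoid(w2 @ hidden)           # channel weights in (0, 1)
    return feat * weights[:, None, None]


def spatial_attention(feat, kernel):
    """Reweight the spatial positions of a C x H x W feature map.

    Average and max pooling over channels give a 2 x H x W map, which a
    2 x k x k kernel (k odd, zero padding assumed) reduces to H x W logits.
    """
    pooled = np.stack([feat.mean(axis=0), feat.max(axis=0)])
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))
    h, w = feat.shape[1:]
    logits = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            logits[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return feat * sigmoid(logits)[None, :, :]


def attend(feat, w1, w2, kernel):
    """Process both branches independently, then concatenate along channels."""
    return np.concatenate(
        [channel_attention(feat, w1, w2), spatial_attention(feat, kernel)], axis=0
    )
```

Because both branches multiply the input by sigmoid weights, each output element is a scaled-down copy of the corresponding input element; the concatenation doubles the channel count in this two-branch sketch.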

[0060] 3) Fused image generation and post-processing

[0061] ① Image Reconstruction: The feature map output by the decoder is converted by the last convolutional layer (1×1 convolution) into a fused image $I_{fused}$ with pixel values in the range [-1, 1].

[0062] ② Inverse Normalization: If normalization is used in preprocessing, inverse normalization is required to restore the pixel values to the original range [0, 255]. The formula is:

[0063] ③ Output results: Generate a final fused image containing infrared target saliency and rich visible light texture, which can be used for subsequent target detection, recognition and other tasks.
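
The inverse normalization of item ② can be sketched as below. Since the formula is not reproduced in the text, the sketch assumes the linear map from [-1, 1] back to [min(I), max(I)], with defaults of 0 and 255; the clipping step and the rounding to 8-bit integers are additional assumptions for producing a displayable image.

```python
import numpy as np


def denormalize(fused, orig_min=0.0, orig_max=255.0):
    """Map generator output in [-1, 1] back to [orig_min, orig_max].

    Values outside [-1, 1] are clipped first. The uint8 cast assumes the
    target range lies within [0, 255].
    """
    fused = np.clip(np.asarray(fused, dtype=np.float64), -1.0, 1.0)
    restored = (fused + 1.0) / 2.0 * (orig_max - orig_min) + orig_min
    return np.rint(restored).astype(np.uint8)
```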

[0064] 4) Dual discriminator adversarial training (training phase)

[0065] The dual discriminator structure is shown in Figure 3. During the training phase, the two discriminators (convolutional neural networks with identical structures) respectively distinguish the fused image $I_{fused}$ from the real infrared image and the real visible light image:

[0066] The discriminator outputs a single-channel probability value, and calculates the distribution difference between the generated image and the real image through least squares loss. This guides the generator to learn feature representations that conform to a bimodal distribution, ensuring that the fused image retains both the saliency of infrared targets and the texture details of visible light.

[0067] Loss function optimization:

[0068] 1) Discriminator Loss Function: When the discriminator processes real samples, its output probability of classifying them as real should approach 1. Therefore, the target label of the real samples is set to 1, and the mean squared error loss between the discriminator output and the target label is calculated accordingly. For generated samples, the discriminator's output probability of classifying them as real should approach 0, so the target label of the generated samples is set to 0, and the corresponding mean squared error loss is calculated.

[0069] The primary goal of the least squares loss function is to enable the discriminator to better distinguish between real and generated samples by minimizing the mean squared error. Compared to the traditional cross-entropy loss, this loss function offers the advantage of providing a more stable training process and generating clearer images.

[0070] When using the least squares method, the discriminator's loss function is determined by the mean square error loss between the discriminator output and the preset target label. This ensures that the discriminator can effectively distinguish between real and generated images. The formula for the discriminator's loss function is shown in (1):

[0071]

[0072] Here, MSE(·) denotes the mean squared error loss, $D_{real}$ denotes the discriminator's output prediction for the real image, $D_{fake}$ denotes the discriminator's output prediction for the generated image, N denotes the dimension of the vector, and $L_D$ denotes the total loss of the discriminator.
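
A minimal sketch of the least squares discriminator loss described in [0068] to [0072], assuming formula (1) is the sum of the two mean squared error terms with target labels 1 and 0. The function names are illustrative, and in the dual discriminator setting the same loss would presumably be evaluated once per discriminator.

```python
import numpy as np


def mse(pred, target):
    """Mean squared error between a prediction vector and a scalar target label."""
    pred = np.asarray(pred, dtype=np.float64)
    return float(np.mean((pred - target) ** 2))


def discriminator_loss(d_real, d_fake):
    """Real samples are pushed toward label 1, generated samples toward label 0."""
    return mse(d_real, 1.0) + mse(d_fake, 0.0)
```

A discriminator that outputs 0.5 for everything scores 0.25 + 0.25 = 0.5, while a perfect one scores 0.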

[0073] 2) Generator Loss Function: It mainly consists of two parts. The first part is the adversarial loss, which is determined based on the discriminator's output. The second part is the content loss, which is composed of pixel loss and other losses.

[0074] Within the least squares (LS) framework, the generator's adversarial loss is determined by the mean squared error between the discriminator's output for the generated samples and the target label for real samples. The generator sends the generated samples to the discriminator, which classifies them and outputs probability values. Since the generator aims for its samples to closely resemble real ones, the target probability is set to 1, and the mean squared error is calculated to obtain the loss value. By minimizing this loss, the generator adjusts its parameters to generate more realistic samples.

[0075] In practical implementations, the samples generated by the generator are input into two different branches of the discriminator, namely $D_i$ and $D_v$, and the adversarial losses for the infrared and visible light modalities are calculated separately. The total adversarial loss of the generator is the sum of the losses of these two branches. The adversarial loss is defined as shown in (2):

[0076] $L_{adv} = \mathrm{MSE}(D_v(\mathrm{fake}), 1) + \mathrm{MSE}(D_i(\mathrm{fake}), 1)$ (2)

[0077] Here, $D_v(\mathrm{fake})$ denotes the visible light discriminator's judgment of the fused image.

[0078] $D_i(\mathrm{fake})$ denotes the infrared discriminator's judgment of the fused image, MSE(·) denotes the mean squared error loss, and $L_{adv}$ denotes the adversarial loss.
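
Formula (2) translates directly into code. The sketch below takes the two discriminators' outputs for the fused image as plain vectors; the argument names are illustrative.

```python
import numpy as np


def mse(pred, target):
    """Mean squared error between a prediction vector and a scalar target label."""
    pred = np.asarray(pred, dtype=np.float64)
    return float(np.mean((pred - target) ** 2))


def adversarial_loss(d_v_fake, d_i_fake):
    """Generator adversarial loss: both discriminators should output 1 for the fused image."""
    return mse(d_v_fake, 1.0) + mse(d_i_fake, 1.0)
```

The loss is zero only when both the visible light and infrared discriminators are fully fooled, so neither modality can be satisfied at the expense of the other.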

[0079] Next is the pixel loss within the content loss. It is quantified using the Frobenius norm and measures the pixel-level difference between the generated image and the original images. Since visible light images usually have high contrast, the losses between the generated image and each of the two original images are calculated separately and assigned different weights, which controls how much pixel intensity information each original image contributes to the generated image. The pixel loss expression is shown in (3).

[0080]

[0081] Here, Fused denotes the fused image, $Input_{ir}$ denotes the original infrared image, $Input_{vis}$ denotes the original visible light image, $\|\cdot\|_F$ denotes the Frobenius norm, $\omega_p$ denotes the balance coefficient in the loss, and $L^{ir}_{pixel}$ and $L^{vis}_{pixel}$ denote the pixel losses for infrared and visible light, respectively.
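
Formula (3) is not reproduced in this text, so the following is only one possible reading of the definitions in [0081]: each modality's pixel loss is the Frobenius norm of the difference to the fused image, and the balance coefficient is applied to the visible light term. Whether the original squares or normalizes the norm, and which term carries the coefficient, are assumptions of this sketch.

```python
import numpy as np


def pixel_loss(fused, input_ir, input_vis, omega_p):
    """Weighted Frobenius-norm pixel loss (assumed form of formula (3))."""
    fused = np.asarray(fused, dtype=np.float64)
    l_ir = np.linalg.norm(fused - input_ir, ord="fro")    # infrared pixel loss
    l_vis = np.linalg.norm(fused - input_vis, ord="fro")  # visible light pixel loss
    return float(l_ir + omega_p * l_vis)
```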

[0082] Gradient loss measures the difference between the generated image and the real images in the gradient domain, reflecting gradient-level features. Since infrared images contain texture details, the gradient losses between the generated image and each of the two original images are calculated separately and assigned different weights, which controls how much texture detail each original image contributes to the generated image. The gradient loss expression is shown in (4).

[0083]

[0084] Here, Fused denotes the fused image, $Input_{ir}$ denotes the original infrared image, $Input_{vis}$ denotes the original visible light image, Grad(·) denotes the gradient calculation of the image, $\|\cdot\|_1$ denotes the L1 norm, $\omega_g$ denotes the balance coefficient in the loss, and $L^{ir}_{grad}$ and $L^{vis}_{grad}$ denote the gradient losses for infrared and visible light, respectively.
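
Formula (4) is also missing here. The sketch below assumes the same layout as the pixel loss, with the L1 norm of the gradient difference per modality and the balance coefficient on the visible light term. The text does not say how Grad(·) is computed, so forward finite differences are used as a stand-in; a Sobel or Laplacian operator would be an equally plausible choice.

```python
import numpy as np


def grad(img):
    """Gradient magnitude |dx| + |dy| from forward differences (assumed operator)."""
    img = np.asarray(img, dtype=np.float64)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, :-1] = np.diff(img, axis=1)  # horizontal differences
    gy[:-1, :] = np.diff(img, axis=0)  # vertical differences
    return np.abs(gx) + np.abs(gy)


def gradient_loss(fused, input_ir, input_vis, omega_g):
    """Weighted L1 gradient loss (assumed form of formula (4))."""
    g_fused = grad(fused)
    l_ir = np.sum(np.abs(g_fused - grad(input_ir)))    # infrared gradient loss
    l_vis = np.sum(np.abs(g_fused - grad(input_vis)))  # visible light gradient loss
    return float(l_ir + omega_g * l_vis)
```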

[0085] We derive the content loss by weighted summation of the pixel loss and gradient loss. The expression for the content loss is defined as shown in (5):

[0086] $L_{con} = L_{pixel} + \lambda L_{grad}$ (5)

[0087] Here, $L_{con}$ denotes the content loss, $L_{pixel}$ denotes the pixel loss, $L_{grad}$ denotes the gradient loss, and $\lambda$ denotes the balance coefficient between the two losses.

[0088] In summary, the total loss of the generator is the sum of the adversarial loss and the content loss. The expression for the total loss is defined as shown in (6):

[0089] $L_G = L_{adv} + L_{con}$ (6)

[0090] Here, $L_G$ denotes the total loss of the generator, $L_{adv}$ denotes the adversarial loss of the generator, and $L_{con}$ denotes the content loss.

[0091] Specifically, the implementation effect of the present invention is verified through the following embodiments:

[0092] (1) Experimental platform and data preparation

[0093] 1) Hardware and Framework: The experiment is based on the PyTorch deep learning framework and runs on a server equipped with an NVIDIA V100 GPU.

[0094] 2) Dataset Construction:

[0095] ① Training data: 30,000 pairs of infrared and visible light images with a resolution of 128×128 were selected from publicly available urban scene data, covering scenes such as low light at night and complex backgrounds.

[0096] ② Test data: The TNO public dataset is used, which contains 32 pairs of multimodal images, covering diverse scenes such as natural scenes and man-made targets, to verify the fusion performance of the model in real complex environments.

[0097] (2) Network training parameter settings

[0098] 1) Optimization strategy: Use the Adam optimizer to update the generator and discriminator parameters simultaneously. Set the initial learning rate to 1E-4, the batch size to 4, and the total number of training rounds to 50 to ensure that the model achieves a balance between convergence and training efficiency.

[0099] 2) Input processing: The input infrared and visible light images are normalized and then input into the generator's triple attention module after channel splicing.

[0100] (3) Evaluation indicators and testing methods

[0101] 1) Quantitative evaluation indicators: Seven classic indicators are used to comprehensively evaluate the fusion quality, including:

[0102] ① Entropy (En): Measures the richness of image information; the higher the value, the more effective information it contains.

[0103] ② Sum of the Correlations of Differences (SCD): Evaluates the degree of modal difference preservation between the original image and the fused image, reflecting the ability to balance infrared and visible light features;

[0104] ③ Multiscale structural similarity (MS-SSIM): Measures the structural similarity of images from multiple spatial scales, reflecting the preservation effect of texture and edge details;

[0105] In addition, it includes mutual information (MI) and pixel-level fusion quality index (FMI_pixel), which comprehensively evaluate from the dimensions of information sharing and feature alignment.
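
Of the indicators listed above, entropy (En) is the simplest to state precisely: it is the Shannon entropy of the gray-level histogram. A short sketch for 8-bit images follows; the use of base-2 logarithms (bits) and 256 bins is the usual convention and is assumed here, since the text does not spell it out.

```python
import numpy as np


def image_entropy(img):
    """Shannon entropy, in bits, of the gray-level histogram of an 8-bit image."""
    levels = np.asarray(img, dtype=np.uint8).ravel()
    hist = np.bincount(levels, minlength=256)
    p = hist / hist.sum()
    p = p[p > 0]  # empty bins contribute nothing to the sum
    return float(-np.sum(p * np.log2(p)))
```

A constant image has entropy 0, an image split evenly between two gray levels has entropy 1 bit, and the maximum for 8-bit data is 8 bits.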

[0106] 2) Comparison method: Five advanced fusion algorithms were selected for comparison, including NestFuse, RFN-Nest, DDcGAN, SeAFusion, and U2-Fusion. All comparison models used open source code and maintained the original parameter settings.

[0107] (4) Ablation Experiment Design

[0108] To verify the effectiveness of the triple attention mechanism, a comparative experiment was designed: under the same training parameters, a TA-NcGAN model containing the complete triple attention module (denoted as "with attention") and a simplified model with the attention module removed (denoted as "without attention") were trained separately, and the differences were analyzed through qualitative visual comparison and quantitative indicators. The qualitative analysis results are shown in Figure 4: the fused image from the model without attention is generally brighter, with reduced contrast, blurred texture details, and a significant loss of key information such as the infrared heat source region. In terms of quantitative metrics, the model with attention outperforms the model without attention in all six metrics, including En, SCD, and MS-SSIM, demonstrating the crucial role of the triple attention mechanism in feature enhancement and information preservation. The quantitative comparison data are shown in Table 1.

[0109] Table 1 Ablation Comparison of Attention Modules

[0110] Table 1. Quantitative Comparison of Attention Module Ablation

[0111]

[0112] (5) Comparison of experimental results and analysis

[0113] 1) Qualitative analysis: The qualitative fusion results are shown in Figure 5. In typical scenarios of the TNO dataset, such as nighttime streets, the fused images generated by TA-NcGAN clearly preserve the outlines of thermal targets in the infrared images, such as pedestrians and vehicle engines, as well as the texture details in the visible light images, such as road surface patterns and building windows. The contrast and color reproduction are superior to those of the comparison methods. For example, in low-light scenes the comparison methods may produce blurred edges of thermal targets or overly smoothed visible light textures, while the fusion results of TA-NcGAN accurately present the target location and surrounding environmental details, providing a visual effect closer to the real scene.

[0114] 2) Quantitative Analysis: Statistical results on 32 pairs of test images show that TA-NcGAN outperforms most comparative methods in the En metric, indicating that its fused images contain richer multimodal information; the MS-SSIM metric is significantly superior, demonstrating its outstanding advantages in structural similarity and detail preservation; the SCD metric is within a reasonable range, proving that the model effectively balances the modal differences between infrared and visible light, avoiding excessive dominance of a single modal feature. Overall, TA-NcGAN ranks highly in many of the seven evaluation metrics, and its overall performance surpasses existing state-of-the-art methods. The quantitative analysis results are shown in Table 2.

[0115] Table 2. Quantitative Analysis of Fusion Results on TNO Dataset

[0116]

[0117] In summary, addressing the technical challenges of infrared and visible light image fusion in low-light scenarios, this invention proposes the TA-NcGAN framework. Through the collaborative design of a triple attention mechanism, a nested connection generator, a dual discriminator, and a multimodal loss function, it achieves deep fusion and efficient preservation of multimodal features. The triple attention mechanism dynamically enhances key features from the channel, spatial, and point dimensions; the nested generator achieves progressive integration of multi-scale features; the dual discriminator ensures the fused image conforms to a bimodal distribution; and the multimodal loss function balances pixel intensity and texture detail. Experiments show that the framework significantly outperforms existing methods on publicly available datasets, generating images that combine infrared target saliency with visible light texture richness, exhibiting outstanding information preservation and structural similarity. Ablation experiments validate the core role of the attention mechanism, effectively improving target detection accuracy in scenarios such as UAV night patrols. This method provides a new path for multimodal image fusion, possessing broad application potential in remote sensing, medical imaging, and other fields, combining theoretical innovation with engineering practical value.
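The three attention branches summarized above can be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions, not the trained network: the random weights, the reduction ratio `r`, the two-element `mix` vector standing in for a learned convolution kernel, and the tanh-based residual transformation in the point branch are all hypothetical choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """x: (C, H, W). Global average pool -> linear -> ReLU -> linear -> sigmoid."""
    squeeze = x.mean(axis=(1, 2))              # (C,)
    hidden = np.maximum(w1 @ squeeze, 0.0)     # (C // r,)
    weights = sigmoid(w2 @ hidden)             # (C,), one weight per channel
    return x * weights[:, None, None]

def spatial_attention(x, mix):
    """Average- and max-pool across channels, mix the two maps, apply sigmoid.
    `mix` is an assumed stand-in for a learned convolution kernel."""
    pooled = np.stack([x.mean(axis=0), x.max(axis=0)])     # (2, H, W)
    weights = sigmoid(np.tensordot(mix, pooled, axes=1))   # (H, W)
    return x * weights[None, :, :]

def point_attention(x, w):
    """Per-pixel channel transform (a 1x1 convolution) plus a residual connection."""
    transformed = np.tanh(np.einsum("dc,chw->dhw", w, x))
    return x + transformed

def triple_attention(x, params):
    """Run the three branches independently and concatenate along channels."""
    branches = [
        channel_attention(x, params["w1"], params["w2"]),
        spatial_attention(x, params["mix"]),
        point_attention(x, params["wp"]),
    ]
    return np.concatenate(branches, axis=0)    # (3C, H, W)

rng = np.random.default_rng(0)
C, H, W, r = 8, 16, 16, 4
params = {
    "w1": rng.normal(size=(C // r, C)),
    "w2": rng.normal(size=(C, C // r)),
    "mix": rng.normal(size=2),
    "wp": rng.normal(size=(C, C)) * 0.1,
}
features = rng.normal(size=(C, H, W))
enhanced = triple_attention(features, params)
```

Concatenating the branches triples the channel count, so a downstream encoder would need to accept 3C input channels in this arrangement.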

[0118] Other embodiments of the invention will readily occur to those skilled in the art upon consideration of the specification and practice of the solutions disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention that follow the general principles of the invention and include common knowledge or customary techniques in the art not disclosed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of the invention are indicated by the claims.

Claims

1. A method for infrared and visible light image fusion driven by a triple attention generative adversarial network, characterized in that it comprises:

Step 1: Obtain registered infrared and visible light images of the same scene, perform preprocessing, unify the number of channels of the visible light and infrared images, and concatenate them into a multi-channel image;

Step 2: Set up a feature extraction network including a triple attention module to extract features from the multi-channel image and generate a single-channel fused feature map;

In Step 2, the triple attention module of the feature extraction network includes channel attention, spatial attention, and point attention. Channel attention generates channel weights through bilinear layers, a ReLU activation function, and a sigmoid function to dynamically enhance key feature channels. Spatial attention extracts spatial features based on average pooling and max pooling, and generates a spatial weight tensor through a convolutional layer and a sigmoid function to focus on regions of interest. Point attention analyzes the contextual relevance of each pixel in the feature map, enhances the interaction of local detail features through a residual transformation, and captures subtle texture differences. The three are processed independently and then concatenated to form a multi-view enhanced feature representation;

In Step 2, the feature extraction network also includes an encoder and a decoder. The feature map processed by the triple attention module is input to the encoder, which consists of 5 residual blocks. Multi-scale features are extracted step by step through downsampling, with the number of channels increasing at each layer, to achieve deep encoding of modal difference features. The high-level features output by the encoder are input to the simplified UNet++ nested connection decoder. Through upsampling and connection with the corresponding encoder layers, low-level details and high-level semantics are progressively restored, and finally a single-channel fused feature map is generated;

Step 3: Transform the single-channel fused feature map generated in Step 2 through a convolutional layer to generate a final fused image that includes infrared target saliency and rich visible light texture;

Step 4: Based on Steps 1-3, the fused image is distinguished from the real infrared and visible light images by a dual discriminator. The distribution difference between the generated image and the real images is calculated by a loss based on the least squares method, which guides the generator to learn a feature representation that conforms to the dual-modal distribution, ensuring that the fused image retains both the saliency of the infrared target and the texture details of the visible light;

The generator loss function consists mainly of two parts: the first part is the adversarial loss, which is determined from the discriminator's output; the second part is the content loss, including the pixel loss and other losses;

The samples generated by the generator are input into the two branches of the discriminator, $D_v$ and $D_i$, and the adversarial losses for the visible light and infrared modalities are calculated separately. The total adversarial loss of the generator is the sum of the losses of these two branches. The formula for the adversarial loss is as follows: [formula]; where $D_v(\mathrm{fake})$ denotes the discriminator's judgment of the fused image in the visible light modality, $D_i(\mathrm{fake})$ denotes the discriminator's judgment of the fused image in the infrared modality, $\mathrm{MSE}(\cdot)$ denotes the mean square error loss, and $L_{adv}$ denotes the adversarial loss;

The pixel loss uses the Frobenius norm to measure the pixel difference between the generated image and the original images. The formula is as follows: [formula]; where $Fused$ denotes the fused image, $Input_{ir}$ the original infrared image, $Input_{vis}$ the original visible light image, $\|\cdot\|_F$ the Frobenius norm, $\omega_p$ the balance coefficient in the loss, and $L^{ir}_{pixel}$ and $L^{vis}_{pixel}$ the pixel losses for infrared and visible light, respectively;

The gradient loss uses the L1 norm to measure the difference in the gradient domain between the generated image and the real images. The formula is as follows: [formula]; where $Fused$ denotes the fused image, $Input_{ir}$ the original infrared image, $Input_{vis}$ the original visible light image, $\nabla$ the gradient calculation of the image, $\|\cdot\|_1$ the L1 norm, $\omega_g$ the balance coefficient in the loss, and $L^{ir}_{grad}$ and $L^{vis}_{grad}$ the gradient losses for infrared and visible light, respectively;

The content loss is the weighted sum of the two: [formula]; where $L_{con}$ denotes the content loss, $L_{pixel}$ the pixel loss, $L_{grad}$ the gradient loss, and $\lambda$ the balance coefficient between the two losses;

The total loss of the generator is: [formula]; where $L_G$ denotes the total loss of the generator, $L_{adv}$ the adversarial loss of the generator, and $L_{con}$ the content loss.
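The formulas of claim 1 are not reproduced in this text, so the following NumPy sketch shows only one plausible way the generator loss could be assembled from the terms the claim names. Everything specific here is an assumption: the target label 1.0 for the adversarial term, the placement of the weights `w_p`, `w_g`, and `lam`, the unweighted sum of adversarial and content loss, and the forward-difference gradient operator.

```python
import numpy as np

def mse(pred, target):
    """Mean square error between discriminator outputs and a scalar target label."""
    return float(np.mean((np.asarray(pred, dtype=np.float64) - target) ** 2))

def adversarial_loss(d_v_fake, d_i_fake, real_label=1.0):
    """Sum of the two branch losses; the target label 1.0 is an assumption."""
    return mse(d_v_fake, real_label) + mse(d_i_fake, real_label)

def frobenius(a, b):
    """Frobenius norm of the pixel difference between two 2-D images."""
    return float(np.linalg.norm(a - b))

def gradient(img):
    """Forward-difference gradients (assumed operator); the last row/column is zero."""
    gx = np.diff(img, axis=1, append=img[:, -1:])
    gy = np.diff(img, axis=0, append=img[-1:, :])
    return gx, gy

def l1_gradient_diff(a, b):
    """L1 norm of the difference between the gradients of two images."""
    (ax, ay), (bx, by) = gradient(a), gradient(b)
    return float(np.abs(ax - bx).sum() + np.abs(ay - by).sum())

def generator_loss(fused, ir, vis, d_v_fake, d_i_fake,
                   w_p=1.0, w_g=1.0, lam=1.0):
    """Assumed assembly: L_G = L_adv + (L_pixel + lam * L_grad)."""
    l_adv = adversarial_loss(d_v_fake, d_i_fake)
    l_pixel = frobenius(fused, ir) + w_p * frobenius(fused, vis)
    l_grad = l1_gradient_diff(fused, ir) + w_g * l1_gradient_diff(fused, vis)
    l_con = l_pixel + lam * l_grad
    return l_adv + l_con
```

In this arrangement the pixel term pulls the fused image toward the intensities of both sources, while the gradient term pulls its edges toward theirs; the balance coefficients decide which modality dominates.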

2. The infrared-visible image fusion method driven by a triple attention generative adversarial network according to claim 1, characterized in that, in Step 1, registered infrared and visible light images of the same scene are acquired, both with the same resolution. Image sources include drone sensors, cameras, and other devices, or publicly available datasets. If the visible light image is an RGB three-channel image, it is converted to a single-channel grayscale image, so that both the visible light and infrared images are single-channel and the number of channels is unified. Subsequently, both images are normalized by the formula [formula], which adjusts the pixel value range to [-1, 1]. Finally, the normalized infrared and visible light images are concatenated along the channel dimension to form a 2×H×W input tensor as the initial input to the generator.
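The preprocessing of claim 2 can be sketched as follows. The normalization formula is not reproduced in this text, so min-max scaling to [-1, 1] is assumed here (it is consistent with the max(I) and min(I) terms mentioned in claim 3); the grayscale weights and the function names are likewise assumptions.

```python
import numpy as np

def to_gray(rgb):
    """H x W x 3 RGB -> H x W grayscale, using ITU-R BT.601 luma weights (assumed)."""
    return rgb[..., :3].astype(np.float64) @ np.array([0.299, 0.587, 0.114])

def normalize(img):
    """Min-max normalization to [-1, 1] (assumed formula)."""
    img = img.astype(np.float64)
    lo, hi = img.min(), img.max()
    if hi == lo:                       # constant image: avoid division by zero
        return np.zeros_like(img)
    return 2.0 * (img - lo) / (hi - lo) - 1.0

def make_generator_input(ir, vis):
    """Stack the normalized single-channel images into a 2 x H x W tensor."""
    if vis.ndim == 3:                  # RGB visible image: convert to grayscale first
        vis = to_gray(vis)
    return np.stack([normalize(ir), normalize(vis)], axis=0)
```

Channel 0 holds the infrared image and channel 1 the visible light image; the order is arbitrary but must stay fixed between training and inference.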

3. The infrared-visible image fusion method driven by a triple attention generative adversarial network according to claim 1, characterized in that, in Step 3, the feature map output in Step 2 is converted by the last convolutional layer into a fused image with pixel values in the range [-1, 1]. If normalization was used in preprocessing, the pixel values need to be restored to the original range [0, 255] by inverse normalization, using the following formula: $I_{fused} = \frac{I_{fused} + 1}{2} \times (\max(I) - \min(I)) + \min(I)$; where $I_{fused}$ on the left side denotes the fused image pixel values restored to the original range [0, 255] after denormalization, $I_{fused}$ on the right side denotes the pixel values output by the last convolutional layer, in the range [-1, 1], $\max(I)$ denotes the maximum pixel value of the original image, and $\min(I)$ denotes the minimum pixel value of the original image. The final output is a fused image that combines infrared target saliency with rich visible light texture.
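A minimal sketch of the inverse normalization in claim 3, assuming the linear map from [-1, 1] back to [min(I), max(I)] and a final rounding to 8-bit values. The function name, the default range 0 to 255, and the clipping step are assumptions.

```python
import numpy as np

def denormalize(fused, i_min=0.0, i_max=255.0):
    """Map values in [-1, 1] linearly onto [i_min, i_max], then round to uint8."""
    fused = np.asarray(fused, dtype=np.float64)
    restored = (fused + 1.0) / 2.0 * (i_max - i_min) + i_min
    # Clip guards against generator outputs slightly outside [-1, 1].
    return np.clip(np.rint(restored), 0, 255).astype(np.uint8)
```

For example, `denormalize(np.array([-1.0, 1.0]))` maps the two endpoints to 0 and 255.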

4. The infrared-visible image fusion method driven by a triple attention generative adversarial network according to claim 1, characterized in that, in Step 4, when the least squares method is used, the discriminator's loss function is determined by the mean square error loss between the discriminator output and the preset target label. This ensures that the discriminator can effectively distinguish between real and generated images. The discriminator's loss function is as follows: [formula]; where $\mathrm{MSE}(\cdot)$ denotes the mean square error loss, $D_{real}$ denotes the discriminator's output prediction for the real image, $D_{fake}$ denotes the discriminator's output prediction for the generated image, $N$ denotes the dimension of the vector, and $L_D$ denotes the total loss of the discriminator.
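The discriminator loss formula is not reproduced in this text. The sketch below shows a common least-squares form under the assumption that the preset target labels are 1 for real images and 0 for generated images; the labels, the unweighted sum, and the function names are assumptions rather than the claimed formula.

```python
import numpy as np

def mse(pred, target):
    """Mean square error over the N elements of a discriminator output vector."""
    pred = np.asarray(pred, dtype=np.float64)
    return float(np.mean((pred - target) ** 2))

def discriminator_loss(d_real, d_fake, real_label=1.0, fake_label=0.0):
    """Least-squares loss for one discriminator branch (labels are assumed)."""
    return mse(d_real, real_label) + mse(d_fake, fake_label)
```

With these labels the loss is zero only when the branch outputs 1 for every real image and 0 for every fused image; in a dual discriminator setup the same function would be applied once per branch.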