Image fusion method based on shifted window attention and semantic-driven dual adversarial learning

By adopting an image fusion method based on shift window attention and semantic-driven dual adversarial approaches, the problem of limited feature extraction capability and modal information imbalance in infrared and visible light image fusion is solved. The generated images perform well in visual and machine vision tasks, improving detection accuracy and robustness.

CN122243766A · Pending · Publication Date: 2026-06-19 · NORTHWESTERN POLYTECHNICAL UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
NORTHWESTERN POLYTECHNICAL UNIV
Filing Date
2026-03-25
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing infrared and visible light image fusion methods suffer from limitations in feature extraction capabilities, multimodal information conflicts and imbalances, and a disconnect between visual quality and machine vision tasks. These issues result in fused images exhibiting insufficient detail representation and low detection accuracy in complex scenes.

Method used

We employ an image fusion method based on shifted window attention and semantic-driven dual adversarial learning. The method constructs a dual-stream shifted window Transformer generator and combines a cross-window attention mechanism, a dual discriminator that decouples target and texture, and a semantic-driven feedback mechanism to achieve deep interaction and fusion of infrared and visible light images. This preserves salient targets and background textures and improves downstream detection accuracy.

Benefits of technology

The generated images are richer and more natural in complex scenes, and the multimodal information is balanced, which significantly improves the detection accuracy and robustness of downstream machine vision tasks.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122243766A_ABST
Patent Text Reader

Abstract

This invention discloses an image fusion method based on shifted window attention and semantic-driven dual adversarial learning. The method includes the following steps: Step 1, constructing an adaptive feature extraction network based on a dual-stream shifted window Transformer as the generator, and using a cross-window attention mechanism to achieve deep interaction and fusion of infrared image features and visible light image features; Step 2, constructing a multi-style dual discriminator that decouples target and texture, introducing a target mask, and establishing a brightness discrimination path for salient infrared targets and a texture gradient discrimination path for visible light backgrounds; Step 3, introducing a semantic-driven meta-feature embedding feedback mechanism, using a pre-trained object detection network to extract high-level semantic features, and constructing a semantic consistency loss to guide generator parameter updates; Step 4, constructing a joint objective function, based on a mask weighting strategy, that includes content loss, dual adversarial loss, and semantic loss, and training the network end-to-end; Step 5, inputting the images to be fused into the trained generator and outputting the fused image. This invention addresses the limited feature extraction and modal information conflict of traditional visible light and infrared fusion methods, improving the detection accuracy of the fused image in machine vision tasks while balancing high-brightness infrared targets and clear background textures.

Description

Technical Field

[0001] This invention belongs to the field of computer vision and image processing technology, specifically relating to infrared and visible light image fusion technology, and particularly to an image fusion method based on shifted window attention and semantic-driven dual adversarial learning.

Background Technology

[0002] With the development of sensor technology, infrared and visible light image fusion technology has been widely used in video surveillance, autonomous driving, and other fields. Infrared images, through thermal radiation imaging, can highlight prominent targets in all weather conditions and have strong anti-interference capabilities, but lack texture details; visible light images, based on reflected light imaging, have clear textures and rich colors, but their imaging effect is poor in low light or in bad weather. The purpose of image fusion is to combine the complementary information of the two to generate a high-quality image that has both bright target features and clear background texture, thereby improving scene understanding capabilities.

[0003] In recent years, deep learning-based image fusion methods, especially those based on Generative Adversarial Networks (GANs), have gradually become mainstream due to their powerful feature fitting capabilities. However, existing image fusion methods still face the following major problems in practical applications: First, feature extraction capabilities are limited. Traditional convolutional neural network (CNN)-based methods are limited by their local receptive fields, making it difficult to capture long-range dependencies and global contextual information in images, resulting in insufficient detail representation of fused images in complex scenes. Although the Transformer architecture has advantages in global modeling, its efficient application to pixel-level fusion tasks still faces the problem of high computational complexity.
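To make the complexity concern concrete, the sketch below compares the approximate cost of global self-attention with window-restricted attention, using the operation counts commonly quoted for the Swin Transformer. The feature-map size, channel count, and window size are illustrative assumptions, not values taken from this patent.

```python
def global_attention_cost(h, w, c):
    """Approximate multiply-adds of global self-attention on an h x w x c map."""
    n = h * w
    return 4 * n * c * c + 2 * n * n * c  # quadratic in the token count n


def window_attention_cost(h, w, c, window):
    """Approximate multiply-adds when attention is limited to window x window tiles."""
    n = h * w
    return 4 * n * c * c + 2 * window * window * n * c  # linear in n


# Illustrative (assumed) sizes: 56 x 56 tokens, 96 channels, 7 x 7 windows.
h, w, c, window = 56, 56, 96, 7
g = global_attention_cost(h, w, c)
l = window_attention_cost(h, w, c, window)
print(f"global: {g:,}  windowed: {l:,}  ratio: {g / l:.1f}x")
```

The gap widens with resolution because the global term grows with the square of the token count, which is why pixel-level fusion at full image size is the hard case.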

[0004] Second, there is a conflict and imbalance in modal information. Existing GAN-based fusion methods typically use only a single discriminator, making it difficult to simultaneously constrain the generator to retain the thermal radiation intensity of infrared images and the gradient texture of visible light images. This often leads to a "partiality" phenomenon in the fusion results: either the texture is rich but the infrared target is obscured, or the target stands out but the background is blurred, failing to take into account the unique properties of different modalities.

[0005] Third, there is a disconnect between visual quality and machine vision tasks. Most existing fusion algorithms only aim to improve the visual effect of the human eye (such as contrast and sharpness), ignoring the loss of semantic information during the fusion process. This results in fused images that, while visually appealing, may have key semantic features destroyed in subsequent machine vision tasks (such as object detection and recognition), thus reducing detection accuracy and creating a "semantic gap."

[0006] Therefore, designing a fusion method that can efficiently extract global features, balance multimodal information, and meet the needs of downstream high-level vision tasks is a key technical problem that urgently needs to be solved in this field. The image fusion method based on shifted window attention and semantic-driven dual adversarial learning proposed in this invention is aimed at solving the above problems.

Summary of the Invention

[0007] To address the aforementioned technical problems, this invention provides an image fusion method based on a dual adversarial approach of shifted window attention and semantic-driven processing. This method overcomes the shortcomings of traditional convolutional neural networks in long-range dependency capture by constructing a generator based on a dual-stream shifted window Transformer. By combining a dual adversarial mechanism that decouples targets and textures with a semantic-driven feedback strategy, it effectively preserves both infrared salient targets and visible light background textures while significantly improving the detection accuracy of the fused image in downstream machine vision tasks.

[0008] The technical solution adopted in this invention is an image fusion method based on shifted window attention and semantic-driven dual adversarial learning, characterized by comprising the following steps:

Step 1: Construct an adaptive feature extraction network based on a dual-stream shifted window Transformer as the generator, and use a cross-window attention mechanism to achieve deep interaction and fusion of infrared image features and visible light image features.
Step 101: Construct a dual-stream encoder that maps the infrared image $I_{ir}$ and the visible light image $I_{vi}$ to feature embeddings.
Step 102: Introduce the cross-window attention module. In the $l$-th layer of feature extraction, define the infrared features as the query vector $Q$ and the visible light features as the key vector $K$ and the value vector $V$.
Step 103: Calculate the cross-modal attention map according to the formula $\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(QK^{T}/\sqrt{d} + B\right)V$, where $d$ is the feature dimension and $B$ is the relative position encoding bias, thereby injecting visible light texture information into the infrared feature stream. Similarly, symmetric injection is performed with the visible light features as the query vector to obtain the fused features.

Step 2: Construct a multi-style dual discriminator that decouples target and texture. Introduce a target mask and establish a brightness discrimination path for salient infrared targets and a texture gradient discrimination path for visible light backgrounds, respectively.
Step 201: Introduce the target mask $M$, in which $M = 1$ indicates the salient target region and $M = 0$ indicates the background region.
Step 202: Construct the target discriminator. Its input is the product of the mask and the pixel intensities, i.e. $M \odot I$. Its aim is to maximally distinguish the infrared image $I_{ir}$ from the fused image $I_f$ in the target region.
Step 203: Construct the detail discriminator. The gradient operator $\nabla$ is used to calculate the spatial gradient map $\nabla I$ of the image, and the discriminator input is the gradient product over the inverse mask region, i.e. $(1 - M) \odot \nabla I$. Its aim is to constrain the fused image to retain the texture details of the visible light image $I_{vi}$ in the background region.

Step 3: Introduce a semantic-driven meta-feature embedding feedback mechanism, use a pre-trained object detection network to extract high-level semantic features, and construct a semantic consistency loss that feeds back to guide the generator parameter update.
Step 301: Introduce a pre-trained object detection network, input the fused image $I_f$ output by the generator into this network, and extract the intermediate-layer feature maps.
Step 302: Construct the meta-feature generator and the feature transformation network. The detection features are processed sequentially by the meta-feature generator and the feature transformation network to obtain meta-features whose dimensions are consistent with the fused features.
Step 303: Construct the semantic consistency loss. By minimizing this loss, the features extracted by the generator are forced to contain semantic information that is beneficial to downstream detection tasks.

Step 4: Construct a joint objective function, based on a mask weighting strategy, that includes the content loss, the dual adversarial loss, and the semantic loss, and train the network end-to-end.
Step 401: Construct an adaptive content loss based on mask weighting, in which $1 - M$ denotes the background mask, $\odot$ denotes element-wise multiplication of matrices, $\|\cdot\|_1$ denotes the L1 norm, $\nabla$ denotes the gradient operator, and the hyperparameters balance the weights of the terms.
Step 402: Construct the dual adversarial loss, in which $\mathbb{E}$ denotes the mathematical expectation operator, used to calculate the mean over a data distribution, and the binary cross-entropy loss function measures the difference between the discriminator's output probability and the true label.
Step 403: Construct the joint objective function as a weighted combination of the above losses, and optimize the network parameters end-to-end using the gradient descent algorithm.

Step 5: Input the images to be fused into the trained generator and output the fused image.
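The mask-weighted content loss of Step 401 can be illustrated with a minimal NumPy sketch. The exact formula is not legible in this text, so the form below is an assumption: an L1 intensity term against the infrared image inside the target mask plus an L1 gradient term against the visible image in the background. The names `content_loss`, `alpha`, and `beta` are hypothetical.

```python
import numpy as np


def grad_map(img):
    """Gradient magnitude via finite differences (a stand-in gradient operator)."""
    gy, gx = np.gradient(img)
    return np.abs(gx) + np.abs(gy)


def content_loss(fused, ir, vis, mask, alpha=1.0, beta=1.0):
    """Assumed form: alpha * |M*(F - IR)|_1 + beta * |(1-M)*(grad F - grad VIS)|_1."""
    background = 1.0 - mask                      # inverse of the target mask
    intensity = np.abs(mask * (fused - ir)).mean()
    texture = np.abs(background * (grad_map(fused) - grad_map(vis))).mean()
    return alpha * intensity + beta * texture


rng = np.random.default_rng(0)
img = rng.random((8, 8))
mask = np.zeros((8, 8))
mask[2:5, 2:5] = 1.0                             # hypothetical salient target region

print(content_loss(img, img, img, mask))         # identical inputs give zero loss
print(content_loss(img + mask, img, img, mask))  # brightening the target is penalised
```

Splitting the loss by region lets the two weights trade target brightness against background texture without one modality dominating the whole image.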

[0009] Compared with the prior art, the present invention has the following advantages: 1. This invention constructs an adaptive generator based on a dual-stream shifted window Transformer, overcoming the limitations of traditional CNN feature extraction. By introducing a cross-window attention mechanism, this invention enables deep interaction between infrared thermal radiation features and visible light texture features during the feature encoding stage. Compared to traditional convolutional operations limited by local receptive fields, this structure can capture long-range dependencies in images, thereby more fully extracting and fusing global contextual information, resulting in richer and more natural image details.
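As a rough illustration of the cross-window attention described above, the NumPy sketch below lets infrared tokens query visible-light tokens inside each window, using the scaled dot-product form with an additive relative position bias. The projection matrices, window size, and the cyclic shift by `np.roll` are assumptions made for the example; a full shifted-window implementation would also mask attention between regions that wrap around after the shift, which is omitted here.

```python
import numpy as np


def window_partition(x, ws):
    """Split an (H, W, C) map into (num_windows, ws*ws, C) token groups."""
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, C)


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)        # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def cross_window_attention(f_ir, f_vi, Wq, Wk, Wv, bias, ws, shift=0):
    """Infrared features form Q; visible features form K and V, per window."""
    if shift:                                    # cyclic shift of both streams
        f_ir = np.roll(f_ir, (-shift, -shift), axis=(0, 1))
        f_vi = np.roll(f_vi, (-shift, -shift), axis=(0, 1))
    q = window_partition(f_ir, ws) @ Wq
    k = window_partition(f_vi, ws) @ Wk
    v = window_partition(f_vi, ws) @ Wv
    d = q.shape[-1]
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d) + bias)
    return attn @ v                              # visible content routed to IR tokens


rng = np.random.default_rng(0)
H, W, C, ws = 8, 8, 4, 4
f_ir = rng.normal(size=(H, W, C))
f_vi = rng.normal(size=(H, W, C))
Wq, Wk, Wv = (rng.normal(size=(C, C)) for _ in range(3))
bias = np.zeros((ws * ws, ws * ws))              # placeholder relative position bias
out = cross_window_attention(f_ir, f_vi, Wq, Wk, Wv, bias, ws, shift=ws // 2)
print(out.shape)  # (4, 16, 4)
```

Swapping the roles of `f_ir` and `f_vi` in the call gives the symmetric injection in which visible light acts as the query.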

[0010] 2. This invention designs a multi-style dual discriminator that decouples the target and texture, effectively solving the problem of multimodal information conflict and imbalance. Existing methods often weaken certain modal information (such as infrared targets or visible light textures) due to a single discriminator. This invention introduces a target mask to establish a brightness discrimination path for salient infrared targets and a texture gradient discrimination path for visible light backgrounds. This decoupling strategy forces the generator to retain high-brightness radiation information in the target area and clear texture details in the background area, achieving complementary coexistence of the advantages of the two modalities.
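The decoupling idea can be sketched by preparing the two discriminator inputs from one image and a binary target mask: pixel intensities inside the mask for the brightness path, and gradients outside the mask for the texture path. This is a minimal illustration only; the finite-difference gradient is an assumed stand-in for whatever gradient operator the patent uses, and the function names are hypothetical.

```python
import numpy as np


def gradient_map(img):
    """Spatial gradient magnitude from finite differences (assumed operator)."""
    gy, gx = np.gradient(img.astype(float))
    return np.abs(gx) + np.abs(gy)


def target_path_input(img, mask):
    """Brightness path: keep pixel intensities only inside the target mask."""
    return mask * img


def detail_path_input(img, mask):
    """Texture path: keep gradients only in the background (inverse mask)."""
    return (1.0 - mask) * gradient_map(img)


img = np.arange(36, dtype=float).reshape(6, 6)   # toy image: a linear ramp
mask = np.zeros((6, 6))
mask[2:4, 2:4] = 1.0                             # hypothetical salient target

t_in = target_path_input(img, mask)
d_in = detail_path_input(img, mask)
print(t_in[2, 2], t_in[0, 0])  # 14.0 0.0
print(d_in[2, 2], d_in[0, 0])  # 0.0 7.0
```

Because each discriminator only ever sees its own region and its own cue, neither can push the generator to sacrifice the other modality's information.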

[0011] 3. This invention introduces a semantically driven meta-feature embedding feedback mechanism, bridging the gap between visual quality and machine vision tasks. Unlike traditional methods that only focus on pixel-level consistency, this invention utilizes a pre-trained object detection network to extract high-level semantic features and uses semantic consistency loss to guide the generator update. This mechanism forces the generator network to retain key semantic information beneficial to downstream detection during the fusion process, thereby significantly improving the accuracy and robustness of the fused image in high-level visual tasks such as object detection.
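A minimal sketch of the feedback path might look as follows: detector features pass through a meta-feature generator and a feature transformation network, and the result is compared with the generator's fused features. The source does not give the network structures or the distance used, so the per-pixel linear maps, the nearest-neighbour resize, and the mean squared error below are all assumptions, and every function name is hypothetical.

```python
import numpy as np


def meta_feature_generator(f_det, W_meta):
    """Hypothetical stand-in: per-pixel linear map followed by ReLU."""
    return np.maximum(f_det @ W_meta, 0.0)


def feature_transform(f_meta, W_trans, out_hw):
    """Hypothetical stand-in: per-pixel linear map, then nearest-neighbour resize."""
    h, w = f_meta.shape[:2]
    rows = np.arange(out_hw[0]) * h // out_hw[0]
    cols = np.arange(out_hw[1]) * w // out_hw[1]
    return (f_meta @ W_trans)[rows][:, cols]


def semantic_consistency_loss(f_fused, f_det, W_meta, W_trans):
    """Assumed distance: mean squared error between fused and meta-features."""
    meta = meta_feature_generator(f_det, W_meta)
    meta = feature_transform(meta, W_trans, f_fused.shape[:2])
    return float(np.mean((f_fused - meta) ** 2))


rng = np.random.default_rng(0)
f_det = rng.normal(size=(4, 4, 8))     # detector intermediate features (assumed shape)
f_fused = rng.normal(size=(8, 8, 6))   # generator fused features (assumed shape)
W_meta = rng.normal(size=(8, 5))
W_trans = rng.normal(size=(5, 6))
print(semantic_consistency_loss(f_fused, f_det, W_meta, W_trans))
```

In training, the detector weights would stay frozen and only the gradient of this loss with respect to the generator (and the two small networks) would be used.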

[0012] The technical solution of the present invention will be further described in detail below with reference to the accompanying drawings and embodiments.

Description of the Drawings

Figure 1 is a general block diagram of the present invention.

Detailed Implementation

The method of the present invention will be further described in detail below with reference to the accompanying drawings and embodiments.

[0013] It should be noted that, unless otherwise specified, the embodiments and attributes described in this application can be combined with each other. The present invention will now be described in detail with reference to the accompanying drawings and embodiments.

[0014] It should be noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments according to this application. As used herein, the singular form is intended to include the plural form as well, unless the context clearly indicates otherwise. Furthermore, it should be understood that when the terms "comprising" and / or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components, and / or combinations thereof.

[0015] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented, for example, in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0016] For ease of description, spatial relative terms such as "above," "on top of," "on the upper surface of," "above," etc., are used herein to describe the spatial positional relationship of a device or feature as shown in the figures to other devices or features. It should be understood that spatial relative terms are intended to encompass different orientations in use or operation beyond the orientation of the device as described in the figures. For example, if the device in the figures were inverted, a device described as "above" or "on top of" other devices or structures would subsequently be positioned as "below" or "under" other devices or structures. Thus, the exemplary term "above" can include both "above" and "below." The device may also be positioned in other different ways (rotated 90 degrees or in other orientations), and the spatial relative descriptions used herein will be interpreted accordingly.

[0017] As shown in Figure 1, the present invention includes the following steps:

Step 1: Construct an adaptive feature extraction network based on a dual-stream shifted window Transformer as the generator, and use a cross-window attention mechanism to achieve deep interaction and fusion of infrared image features and visible light image features.
Step 101: Construct a dual-stream encoder that maps the infrared image $I_{ir}$ and the visible light image $I_{vi}$ to feature embeddings.
Step 102: Introduce the cross-window attention module. In the $l$-th layer of feature extraction, define the infrared features as the query vector $Q$ and the visible light features as the key vector $K$ and the value vector $V$.
Step 103: Calculate the cross-modal attention map according to the formula $\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(QK^{T}/\sqrt{d} + B\right)V$, where $d$ is the feature dimension and $B$ is the relative position encoding bias, thereby injecting visible light texture information into the infrared feature stream. Similarly, symmetric injection is performed with the visible light features as the query vector to obtain the fused features.

Step 2: Construct a multi-style dual discriminator that decouples target and texture. Introduce a target mask and establish a brightness discrimination path for salient infrared targets and a texture gradient discrimination path for visible light backgrounds, respectively.
Step 201: Introduce the target mask $M$, in which $M = 1$ indicates the salient target region and $M = 0$ indicates the background region.
Step 202: Construct the target discriminator. Its input is the product of the mask and the pixel intensities, i.e. $M \odot I$. Its aim is to maximally distinguish the infrared image $I_{ir}$ from the fused image $I_f$ in the target region.
Step 203: Construct the detail discriminator. The gradient operator $\nabla$ is used to calculate the spatial gradient map $\nabla I$ of the image, and the discriminator input is the gradient product over the inverse mask region, i.e. $(1 - M) \odot \nabla I$. Its aim is to constrain the fused image to retain the texture details of the visible light image $I_{vi}$ in the background region.

Step 3: Introduce a semantic-driven meta-feature embedding feedback mechanism, use a pre-trained object detection network to extract high-level semantic features, and construct a semantic consistency loss that feeds back to guide the generator parameter update.
Step 301: Introduce a pre-trained object detection network, input the fused image $I_f$ output by the generator into this network, and extract the intermediate-layer feature maps.
Step 302: Construct the meta-feature generator and the feature transformation network. The detection features are processed sequentially by the meta-feature generator and the feature transformation network to obtain meta-features whose dimensions are consistent with the fused features.
Step 303: Construct the semantic consistency loss. By minimizing this loss, the features extracted by the generator are forced to contain semantic information that is beneficial to downstream detection tasks.

Step 4: Construct a joint objective function, based on a mask weighting strategy, that includes the content loss, the dual adversarial loss, and the semantic loss, and train the network end-to-end.
Step 401: Construct an adaptive content loss based on mask weighting, in which $1 - M$ denotes the background mask, $\odot$ denotes element-wise multiplication of matrices, $\|\cdot\|_1$ denotes the L1 norm, $\nabla$ denotes the gradient operator, and the hyperparameters balance the weights of the terms.
Step 402: Construct the dual adversarial loss, in which $\mathbb{E}$ denotes the mathematical expectation operator, used to calculate the mean over a data distribution, and the binary cross-entropy loss function measures the difference between the discriminator's output probability and the true label.
Step 403: Construct the joint objective function as a weighted combination of the above losses, and optimize the network parameters end-to-end using the gradient descent algorithm.

Step 5: Input the images to be fused into the trained generator and output the fused image.
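Steps 402 and 403 can be sketched with plain NumPy as below. The text states only that the dual adversarial loss uses an expectation and binary cross-entropy and that the joint objective weights the three losses; the real/fake labelling, the way the two discriminator terms are summed, and the weight values are assumptions for illustration, and all function names are hypothetical.

```python
import numpy as np


def bce(p, label, eps=1e-7):
    """Binary cross-entropy averaged over a batch of probabilities."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(label * np.log(p) + (1.0 - label) * np.log(1.0 - p)))


def discriminator_loss(p_real, p_fake):
    """One discriminator: source image scored as real (1), fused image as fake (0)."""
    return bce(p_real, 1.0) + bce(p_fake, 0.0)


def generator_adversarial_loss(p_fake_target, p_fake_detail):
    """Generator tries to make both discriminators score the fused image as real."""
    return bce(p_fake_target, 1.0) + bce(p_fake_detail, 1.0)


def joint_objective(l_content, l_adv, l_sem, weights=(1.0, 0.1, 0.1)):
    """Weighted sum of content, dual adversarial, and semantic losses (assumed weights)."""
    return weights[0] * l_content + weights[1] * l_adv + weights[2] * l_sem


half = np.full(4, 0.5)                        # an undecided discriminator
l_adv = generator_adversarial_loss(half, half)
total = joint_objective(0.2, l_adv, 0.05)
print(l_adv, total)
```

In an end-to-end loop, the two discriminators and the generator would be updated alternately by gradient descent on their respective losses.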

[0018] The above description is merely an embodiment of the present invention and is not intended to limit the present invention in any way. Any simple modifications, alterations, or equivalent structural changes made to the above embodiments based on the technical essence of the present invention shall still fall within the protection scope of the present invention.

Claims

1. An image fusion method based on shifted window attention and semantic-driven dual adversarial learning, characterized in that it comprises the following steps:

Step 1: Construct an adaptive feature extraction network based on a dual-stream shifted window Transformer as the generator, and use a cross-window attention mechanism to achieve deep interaction and fusion of infrared image features and visible light image features.
Step 101: Construct a dual-stream encoder that maps the infrared image $I_{ir}$ and the visible light image $I_{vi}$ to feature embeddings.
Step 102: Introduce the cross-window attention module. In the $l$-th layer of feature extraction, define the infrared features as the query vector $Q$ and the visible light features as the key vector $K$ and the value vector $V$.
Step 103: Calculate the cross-modal attention map according to the formula $\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(QK^{T}/\sqrt{d} + B\right)V$, where $d$ is the feature dimension and $B$ is the relative position encoding bias, thereby injecting visible light texture information into the infrared feature stream. Similarly, symmetric injection is performed with the visible light features as the query vector to obtain the fused features.

Step 2: Construct a multi-style dual discriminator that decouples target and texture. Introduce a target mask and establish a brightness discrimination path for salient infrared targets and a texture gradient discrimination path for visible light backgrounds, respectively.
Step 201: Introduce the target mask $M$, in which $M = 1$ indicates the salient target region and $M = 0$ indicates the background region.
Step 202: Construct the target discriminator. Its input is the product of the mask and the pixel intensities, i.e. $M \odot I$. Its aim is to maximally distinguish the infrared image $I_{ir}$ from the fused image $I_f$ in the target region.
Step 203: Construct the detail discriminator. The gradient operator $\nabla$ is used to calculate the spatial gradient map $\nabla I$ of the image, and the discriminator input is the gradient product over the inverse mask region, i.e. $(1 - M) \odot \nabla I$. Its aim is to constrain the fused image to retain the texture details of the visible light image $I_{vi}$ in the background region.

Step 3: Introduce a semantic-driven meta-feature embedding feedback mechanism, use a pre-trained object detection network to extract high-level semantic features, and construct a semantic consistency loss that feeds back to guide the generator parameter update.
Step 301: Introduce a pre-trained object detection network, input the fused image $I_f$ output by the generator into this network, and extract the intermediate-layer feature maps.
Step 302: Construct the meta-feature generator and the feature transformation network. The detection features are processed sequentially by the meta-feature generator and the feature transformation network to obtain meta-features whose dimensions are consistent with the fused features.
Step 303: Construct the semantic consistency loss. By minimizing this loss, the features extracted by the generator are forced to contain semantic information that is beneficial to downstream detection tasks.

Step 4: Construct a joint objective function, based on a mask weighting strategy, that includes the content loss, the dual adversarial loss, and the semantic loss, and train the network end-to-end.
Step 401: Construct an adaptive content loss based on mask weighting, in which $1 - M$ denotes the background mask, $\odot$ denotes element-wise multiplication of matrices, $\|\cdot\|_1$ denotes the L1 norm, $\nabla$ denotes the gradient operator, and the hyperparameters balance the weights of the terms.
Step 402: Construct the dual adversarial loss, in which $\mathbb{E}$ denotes the mathematical expectation operator, used to calculate the mean over a data distribution, and the binary cross-entropy loss function measures the difference between the discriminator's output probability and the true label.
Step 403: Construct the joint objective function as a weighted combination of the above losses, and optimize the network parameters end-to-end using the gradient descent algorithm.

Step 5: Input the images to be fused into the trained generator and output the fused image.