An infrared and visible light image fusion method based on cross-modal enhancement and multi-attention fusion strategy

By employing a cross-modal enhancement and multi-attention fusion strategy, the problems of high computational cost and insufficient feature extraction in infrared and visible light image fusion are solved, achieving high-quality image fusion results, especially improving image details and contrast in low-light environments.

CN118096554B · Active · Publication Date: 2026-06-30 · CHANGCHUN INST OF OPTICS FINE MECHANICS & PHYSICS CHINESE ACAD OF SCI

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHANGCHUN INST OF OPTICS FINE MECHANICS & PHYSICS CHINESE ACAD OF SCI
Filing Date
2024-03-11
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

In existing infrared and visible light image fusion technologies, traditional methods require manual adjustment of fusion rules and have high computational costs. Deep learning algorithms fail to fully consider the inherent correlation features of images and have poor removal effects on redundant features, especially in low-light environments.

Method used

A method based on cross-modal enhancement and multi-attention fusion strategy is adopted. The cross-modal enhancement module and detail injection module enhance feature extraction, and the texture and brightness attention modules of the multi-attention mechanism are combined to process features. Finally, the fusion weights are calculated through the spatial-channel attention module to reconstruct the fused image.

Benefits of technology

It improves the image fusion effect, enhances the ability to represent detailed information, captures the inherent correlation between infrared and visible light images, and improves image quality in low-light environments.

✦ Generated by Eureka AI based on patent content.

Figure CN118096554B_ABST

Abstract

This invention relates to a method for fusing infrared and visible light images based on a cross-modal enhancement and multi-attention fusion strategy, comprising the steps of: acquiring original infrared and visible light images; enhancing features extracted from convolutional layers based on a cross-modal enhancement module and a detail injection module; fusing visible light and infrared features guided by a multi-attention mechanism; and decoding the fused features to reconstruct the fused image. This invention incorporates a cross-modal enhancement network in the encoding stage to complementarily enhance infrared and visible light features, capturing their inherent correlation. Simultaneously, a detail injection module is added to enhance the representation of detail information. A multi-attention fusion strategy is used instead of a conventional fusion method, introducing texture attention and luminance attention modules to process visible light and infrared features respectively, and a spatial-channel attention module to calculate fusion weights, guiding the fusion of each component to obtain the final fusion result.

Description

Technical Field

[0001] This invention relates to the field of image fusion technology, and in particular to an infrared and visible light image fusion method based on cross-modal enhancement and multi-attention fusion strategies.

Background Technology

[0002] The purpose of infrared and visible light image fusion is to integrate heterogeneous images captured by different detectors of the same scene to obtain a single, information-rich image that better conforms to human visual perception. Infrared detectors primarily generate images by capturing the energy radiation of objects themselves, making them less susceptible to interference from harsh environments. Visible light detectors, on the other hand, generate images based on the reflected information of objects, offering higher resolution and better meeting the needs of human production and daily life. Infrared and visible light image fusion technology utilizes effective strategies to fuse the complementary information between the two and eliminate spectral differences.

[0003] Currently, infrared and visible light image fusion technology can be divided into two main directions: fusion technology based on traditional algorithms and fusion technology based on deep learning algorithms. Fusion technology based on traditional algorithms mainly includes spatial domain and transform domain image fusion techniques. The fusion results obtained by traditional algorithms can basically meet the imaging needs of some special scenarios. However, the following bottlenecks of traditional methods also limit the development of fusion technology: 1) To obtain the best fusion performance, traditional methods require manual adjustment of fusion rules, which increases computational cost and implementation difficulty; 2) Traditional methods use the same transformation method to extract features from heterogeneous images, failing to fully consider the inherent features of heterogeneous images.

[0004] With the continuous advancement of deep learning algorithms in image processing technology, deep learning-based fusion techniques have also developed rapidly. Based on implementation baselines, these can be categorized into convolutional neural network (CNN) methods, autoencoder methods, and generative adversarial network (GAN) methods. CNN methods primarily consist of three stages: encoding, fusion, and decoding. The fusion result largely depends on a well-designed network structure and a suitable loss function. Most infrared and visible light image fusion techniques based on deep learning algorithms extract features from the two source images separately, failing to fully consider the inherent correlation between them. This results in insufficient protection of key information in the source images and ineffective removal of redundant features. Furthermore, most fusion methods are designed for normal lighting conditions, neglecting the enhancement of texture details and target brightness under insufficient lighting conditions.

Summary of the Invention

[0005] The present invention aims to solve the technical problems in the prior art by providing an infrared and visible light image fusion method based on cross-modal enhancement and multi-attention fusion strategies.

[0006] To solve the above-mentioned technical problems, the technical solution of the present invention is as follows:

[0007] An infrared and visible light image fusion method based on cross-modal enhancement and multi-attention fusion strategy includes the following steps:

[0008] Step 1: Acquire the original infrared image and the original visible image from the same scene that have been registered;

[0009] Step 2: Encoding, which enhances the features extracted from the convolutional layers based on the cross-modal enhancement module and the detail injection module;

[0010] Step 3: Fusion, which guides the fusion of visible light and infrared features based on a multi-attention mechanism;

[0011] Step 4: Decoding. Decode the fused features to reconstruct the fused image.

[0012] In the above technical solution, step two specifically includes:

[0013] A 1×1 convolutional layer is applied to remove modal differences between infrared and visible light images; four 3×3 convolutional layers are applied to further extract deeper features.

[0014] The features are input into the cross-modal enhancement module and processed by the polarization self-attention module to obtain valuable features; the local features processed by the polarization self-attention module are then subjected to two processes: refined feature extraction and cross-modal complementation.

[0015] The joint features output by the cross-modal enhancement module and the second convolutional layer are injected into the third layer, and the resulting feature is injected into the fourth layer.

[0016] Finally, the output from the detail injection module is processed by the fifth convolutional layer.

[0017] In the above technical solution, step three specifically includes:

[0018] In the first step, infrared and visible light features are processed by the luminance attention module and the texture attention module, respectively. In the texture attention module, the visible light texture information is preserved by a high-pass filter layer HP(), and the maximum and average values are then obtained by the max pooling layer MP() and the average pooling layer AP(), respectively. After being combined by the concatenation operator C1, the dimensionality is reduced by two consecutive 1×1 convolutional layers (Conv_1, Conv_2). The sigmoid function δ is then used for normalization to generate the weight parameter w1, as shown in the following formula, where $F_{vi}$ denotes the visible light features:

[0019] $w_1 = \delta\big(Conv_2(Conv_1(C_1(MP(HP(F_{vi})),\ AP(HP(F_{vi})))))\big)$

[0020] In the luminance attention module, the maximum and average brightness are obtained by the max pooling and average pooling layers and then combined by the concatenation operator C2. The result is reduced in dimensionality by two consecutive 1×1 convolutional layers and normalized by the sigmoid function δ to generate the weight parameter w2, as shown in the following formula, where $F_{ir}$ denotes the infrared features:

[0021] $w_2 = \delta\big(Conv_2(Conv_1(C_2(MP(F_{ir}),\ AP(F_{ir}))))\big)$

[0022] The infrared features and the visible light features are each weighted by their respective attention weight (w2 from the luminance attention module, w1 from the texture attention module) using the element-wise multiplication and addition operators to obtain the intermediate infrared and visible light features. The specific implementation formula is as follows:

[0023]

[0024] In the second step, the intermediate infrared and visible light features are concatenated along the channel dimension to obtain the joint feature. The weight values w_ch and w_sp are then obtained through parallel channel attention and spatial attention. The product of the two is normalized using the sigmoid function δ to generate the final weight value w.

[0025]

[0026]

[0027]

[0028] The fusion feature is obtained by the following formula:

[0029]

[0030] In the above technical solution, step four specifically involves using the fusion result as input to the decoding module, which then reconstructs the image.

[0031] In the above technical solution, the decoding module includes five convolutional layers: the kernel size of the first four convolutional layers is 3×3, and the activation function is Leaky-Relu(); the kernel size of the last convolutional layer is 1×1, and the activation function is Tanh().

[0032] The present invention has the following beneficial effects:

[0033] This invention presents an infrared and visible light image fusion method based on a cross-modal enhancement and multi-attention fusion strategy. During the encoding stage, a cross-modal enhancement network is added to complementarily enhance infrared and visible light features, capturing their inherent correlation. Simultaneously, a detail injection module is added to enhance the representation of detailed information. Furthermore, a multi-attention fusion strategy replaces the ordinary fusion method, introducing texture attention and brightness attention modules to process visible light and infrared features respectively. A spatial-channel attention module calculates the fusion weights to guide the fusion of each component, obtaining the final fusion result.

Attached Figure Description

[0034] The present invention will now be described in further detail with reference to the accompanying drawings and specific embodiments.

[0035] Figure 1 is a flowchart of the infrared and visible light image fusion method based on cross-modal enhancement and multi-attention fusion strategy of the present invention.

[0036] Figure 2 is a schematic diagram of the overall architecture of the infrared and visible light image fusion method based on cross-modal enhancement and multi-attention fusion strategy of the present invention.

[0037] Figure 3 is a schematic diagram of the cross-modal enhancement network.

[0038] Figure 4 is a schematic diagram of the multi-attention fusion network.

[0039] Figure 5 is a schematic diagram of the subjective fusion results.

[0040] Figure 6 is a schematic diagram illustrating the cumulative distribution of the objective fusion evaluation parameters.

Detailed Implementation

[0041] The inventive concept of this invention is as follows: The infrared and visible light image fusion method based on cross-modal enhancement and multi-attention fusion strategy aims to provide a novel end-to-end convolutional neural network architecture to address the shortcomings of existing deep learning-based infrared and visible light image fusion technologies, such as weak ability to extract important features, poor performance in removing redundant features, and poor adaptability to low-light environments.

[0042] Current infrared and visible light image fusion techniques based on deep learning algorithms mostly extract features from the two source images separately and then obtain the fused result through certain fusion rules. Since there are many complementary and intrinsically related features between infrared and visible light images, complex network components are often needed to improve the extraction capability of important features. This scheme incorporates a cross-modal enhancement network in the encoding stage to complementarily enhance infrared and visible light features, capturing their intrinsic correlation. Simultaneously, a detail injection module is added to enhance the representation of detailed information.

[0043] This invention adopts a fusion strategy based on a multi-attention mechanism instead of the ordinary fusion method. It introduces texture attention and brightness attention modules to process visible light features and infrared features respectively, and uses a spatial-channel attention module to calculate fusion weights to guide the fusion of each component and obtain the final fusion result.

[0044] The present invention will now be described in detail with reference to the accompanying drawings.

[0045] First, the overall architecture of the novel end-to-end convolutional neural network is explained. As shown in Figure 2, the network architecture consists of three parts: encoding, fusion, and decoding.

[0046] The infrared and visible light image fusion method based on cross-modal enhancement and multi-attention fusion strategy of the present invention includes the following steps:

[0047] Step 1: Acquire the original infrared image and the original visible image from the same scene that have been registered.

[0048] Step 2: Encoding, which enhances the features extracted from the convolutional layers based on the cross-modal enhancement module and the detail injection module.

[0049] Five convolutional layers are used to extract image features during the encoding process. First, a 1×1 convolutional layer is applied to remove modal differences between infrared and visible light images; then, four 3×3 convolutional layers are used to further extract deeper features. The extraction process is shown in the following equation:

[0050] $F_{vi} = Enc(I_{vi}^{Y}), \quad F_{ir} = Enc(I_{ir})$

[0051] Here $F_{vi}$ and $F_{ir}$ are the visible light and infrared features, respectively; Enc(·) represents the feature encoding operator; $I_{vi}^{Y}$ is the Y channel data of the visible light image; and $I_{ir}$ is the infrared image.
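The encoder takes only the Y channel of the visible light image as input. The following is a minimal sketch of extracting that channel from an RGB image. It assumes ITU-R BT.601 luma weights and pixel values in [0, 1]; the source does not state which conversion is used, and the function name `rgb_to_y` is hypothetical.

```python
import numpy as np

def rgb_to_y(rgb: np.ndarray) -> np.ndarray:
    """Return the luma (Y) plane of an RGB image of shape (H, W, 3).

    Assumption: ITU-R BT.601 weights. The patent text does not say which
    RGB-to-YCbCr conversion is applied.
    """
    weights = np.array([0.299, 0.587, 0.114])
    return rgb @ weights  # (H, W, 3) @ (3,) -> (H, W)

# Hypothetical visible light image with random content
visible_rgb = np.random.default_rng(0).random((4, 6, 3))
y_channel = rgb_to_y(visible_rgb)
```

The infrared image is single-channel already, so under this reading both encoder inputs are (H, W) arrays.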

[0052] Second, the cross-modal enhancement module shown in Figure 3 is designed to integrate the infrared and visible light features from the second, third, and fourth convolutional layers, respectively, to achieve spectral complementarity between the two types of heterogeneous images. The cross-modal enhancement module mainly consists of two parts: refined feature extraction and cross-modal complementation. The features input to the cross-modal enhancement module are first processed by the polarization self-attention module (PSA) to obtain more valuable features, as shown in the following equation:

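[0053] $\hat{F}_{vi}^{i} = PSA(F_{vi}^{i})$

[0054] $\hat{F}_{ir}^{i} = PSA(F_{ir}^{i})$

[0055] Here $F_{vi}^{i}$ and $F_{ir}^{i}$ represent the visible light and infrared features of each convolutional layer, respectively, and $\hat{F}_{vi}^{i}$ and $\hat{F}_{ir}^{i}$ are the visible light and infrared features after PSA processing.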

[0056] The features processed by PSA, i.e., the local features, undergo two further processing steps: refined feature extraction and cross-modal complementation. Refined feature extraction is achieved by suppressing useless features and compensating for useful features: first, the local features of each modality are processed by a global average pooling layer to obtain global features; then the global features are added element-wise to the local features and normalized using the sigmoid function to obtain weight values; finally, the local features are multiplied element-wise by the weight values to obtain the final refined features. The specific implementation formula is as follows:

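[0057] $F_{ir}^{r} = \hat{F}_{ir} \otimes \delta\big(\hat{F}_{ir} \oplus GAP(\hat{F}_{ir})\big)$

[0058] $F_{vi}^{r} = \hat{F}_{vi} \otimes \delta\big(\hat{F}_{vi} \oplus GAP(\hat{F}_{vi})\big)$

[0059] Here $F_{ir}^{r}$ and $F_{vi}^{r}$ represent the refined features for infrared and visible light, respectively; $\hat{F}_{ir}$ and $\hat{F}_{vi}$ represent the infrared and visible light features after PSA processing, respectively; δ is the sigmoid function; GAP is global average pooling; and $\otimes$ and $\oplus$ are the element-wise multiplication and addition operators, respectively.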
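The refined feature extraction described above (global average pooling, element-wise addition, sigmoid, element-wise multiplication) can be sketched in NumPy as follows. This is an illustrative reading of the prose, not the patented implementation; the function name `refine` and the (C, H, W) layout are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def refine(local: np.ndarray) -> np.ndarray:
    """Refine PSA-processed local features of shape (C, H, W).

    1. GAP gives one global value per channel.
    2. The global values are added to the local features (broadcast).
    3. A sigmoid turns the sum into weights in (0, 1).
    4. The local features are scaled by those weights.
    """
    global_feat = local.mean(axis=(1, 2), keepdims=True)  # (C, 1, 1)
    weights = sigmoid(local + global_feat)                # (C, H, W)
    return local * weights

# Hypothetical feature map
features = np.random.default_rng(1).normal(size=(3, 5, 5))
refined = refine(features)
```

Because the weights lie strictly between 0 and 1, each refined value keeps the sign of its input and has a smaller magnitude.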

[0060] Cross-modal complementation is used to enhance the model's ability to represent infrared and visible light features. First, complementary features are obtained through difference operations between the infrared and visible light features. Then, these complementary features are compressed into a scalar using a global average pooling layer and normalized using the sigmoid function. Finally, this normalized scalar is multiplied by the complementary features, and the product is added to the features of the other modality, achieving cross-compensation of information. The specific implementation formula is as follows:

[0061]

[0062]

[0063] The outputs of these formulas are the infrared and visible light features after cross-complementary processing; the inputs are the refined infrared and visible light features and the infrared and visible light features after PSA processing. δ is the sigmoid function, GAP is global average pooling, and ⊙ represents the channel-wise multiplication operator.
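One possible reading of the cross-modal complementation step is sketched below. Several details are assumptions made only for illustration: the difference is taken between the two refined feature maps, each modality receives "what the other has and it lacks", and the gate is one scalar per channel. The function name `cross_complement` is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_complement(ref_ir: np.ndarray, ref_vi: np.ndarray):
    """Cross-compensate two (C, H, W) feature maps.

    Assumed form: out_ir = ref_ir + sigmoid(GAP(d_ir)) * d_ir with
    d_ir = ref_vi - ref_ir, and symmetrically for the visible branch.
    """
    comp_for_ir = ref_vi - ref_ir  # information the infrared branch lacks
    comp_for_vi = ref_ir - ref_vi  # information the visible branch lacks
    gate_ir = sigmoid(comp_for_ir.mean(axis=(1, 2), keepdims=True))  # (C, 1, 1)
    gate_vi = sigmoid(comp_for_vi.mean(axis=(1, 2), keepdims=True))
    out_ir = ref_ir + gate_ir * comp_for_ir
    out_vi = ref_vi + gate_vi * comp_for_vi
    return out_ir, out_vi

ir = np.zeros((2, 4, 4))
vi = np.ones((2, 4, 4))
out_ir, out_vi = cross_complement(ir, vi)
```

With identical inputs the complementary features vanish and both outputs equal their inputs, which is the expected behavior when the modalities carry no differing information.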

[0064] Furthermore, the joint features from the second, third, and fourth layers after the above processing are used as input to the detail injection module to further enhance the main feature information: that is, the joint features output from the cross-modal enhancement module and the second convolutional layer are injected into the third layer, and the resulting feature is then injected into the fourth layer. The specific process is as follows:

[0065]

[0066]

[0067]

[0068] Here α and β are intermediate variables, δ is the sigmoid function, and Conv is the convolution operator. The two input features are the features of two adjacent layers for visible light or infrared. ReLU is the activation function, BN is batch normalization, and IN is instance normalization. The element-wise multiplication and addition operators are also used.

[0069] Finally, the output from the detail injection module is processed by a fifth convolutional layer. At this point, the main features of the infrared and visible light images have been completely extracted. The main properties of each convolutional layer are shown in Table 1.

[0070] Table 1. Important attributes of each convolutional layer in the network

[0071]

[0072] Step 3: Fusion, which guides the fusion of visible light and infrared features based on a multi-attention mechanism.

[0073] As shown in Figure 4, the fusion phase is mainly implemented by a multi-attention fusion module and consists of two steps.

[0074] In the first step, infrared and visible light features are processed by the brightness attention module and the texture attention module, respectively. The texture attention module mainly consists of a high-pass filter layer HP(), a max pooling layer MP(), an average pooling layer AP(), and two 1×1 convolutional layers Conv_n() (n = 1, 2). The visible light texture information is preserved by the high-pass filter layer, and the maximum and average values are then obtained by the max pooling and average pooling layers, respectively. These values are combined by the concatenation operator C1 and passed through the two consecutive 1×1 convolutional layers to reduce the dimensionality. Finally, normalization is performed using the sigmoid function δ to generate the weight parameter w1, as shown in the following formula:

[0075] $w_1 = \delta\big(Conv_2(Conv_1(C_1(MP(HP(F_{vi})),\ AP(HP(F_{vi})))))\big)$
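The texture attention weight w1 can be illustrated with a small NumPy sketch. Several choices here are assumptions, since the source does not fix them: the high-pass filter is a 4-neighbour Laplacian, the pooling is global over the spatial dimensions, the 1×1 convolutions are written as matrices without bias, and the random weights stand in for trained parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def high_pass(feat: np.ndarray) -> np.ndarray:
    """4-neighbour Laplacian per channel, with edge replication (assumed HP)."""
    p = np.pad(feat, ((0, 0), (1, 1), (1, 1)), mode="edge")
    return (4 * p[:, 1:-1, 1:-1]
            - p[:, :-2, 1:-1] - p[:, 2:, 1:-1]
            - p[:, 1:-1, :-2] - p[:, 1:-1, 2:])

def texture_weight(f_vi: np.ndarray, conv_a: np.ndarray, conv_b: np.ndarray) -> np.ndarray:
    """w1 = sigmoid(Conv(Conv(C1(MP(HP(F)), AP(HP(F)))))) for F of shape (C, H, W)."""
    hp = high_pass(f_vi)
    # Concatenate global max and global average per channel -> (2C,)
    pooled = np.concatenate([hp.max(axis=(1, 2)), hp.mean(axis=(1, 2))])
    # On a (2C, 1, 1) tensor, a 1x1 convolution is a matrix product
    return sigmoid(conv_b @ (conv_a @ pooled))  # (C,)

rng = np.random.default_rng(2)
channels = 4
f_vi = rng.random((channels, 8, 8))
conv_a = rng.normal(size=(channels, 2 * channels))  # hypothetical weights, 2C -> C
conv_b = rng.normal(size=(channels, channels))      # hypothetical weights, C -> C
w1 = texture_weight(f_vi, conv_a, conv_b)
```

A perfectly flat feature map has no high-frequency content, so the high-pass output is zero and every weight becomes sigmoid(0) = 0.5.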

[0076] The brightness attention module ensures that the model focuses more on regions with high pixel brightness. It consists of a max pooling layer MP(), an average pooling layer AP(), and two 1×1 convolutional layers Conv_n() (n = 1, 2). The maximum and average brightness are obtained from the max pooling and average pooling layers and then combined using the concatenation operator C2. The result is reduced in dimensionality by the two consecutive 1×1 convolutional layers and normalized using the sigmoid function δ to generate the weight parameter w2, as shown in the following formula:

[0077] $w_2 = \delta\big(Conv_2(Conv_1(C_2(MP(F_{ir}),\ AP(F_{ir}))))\big)$

[0078] Finally, the infrared features and the visible light features are each weighted by their respective attention weight (w2 from the brightness attention module, w1 from the texture attention module) using the element-wise multiplication and addition operators to obtain the intermediate infrared and visible light features. The specific implementation formula is as follows:

[0079]

[0080]
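The brightness attention weight and the weighting that produces an intermediate feature can be sketched as follows. The pooling is assumed to be global over the spatial dimensions, the 1×1 convolutions are written as bias-free matrices with random placeholder weights, and the residual form `feat * w + feat` is only one plausible reading of "multiplication and addition operators".

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def brightness_weight(f_ir: np.ndarray, conv_a: np.ndarray, conv_b: np.ndarray) -> np.ndarray:
    """w2 = sigmoid(Conv(Conv(C2(MP(F), AP(F))))) for F of shape (C, H, W)."""
    pooled = np.concatenate([f_ir.max(axis=(1, 2)), f_ir.mean(axis=(1, 2))])  # (2C,)
    return sigmoid(conv_b @ (conv_a @ pooled))  # (C,)

def apply_weight(feat: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Assumed residual weighting: feat * w + feat, with w broadcast per channel."""
    return feat * weight[:, None, None] + feat

rng = np.random.default_rng(3)
channels = 4
f_ir = rng.random((channels, 8, 8))
conv_a = rng.normal(size=(channels, 2 * channels))  # hypothetical weights, 2C -> C
conv_b = rng.normal(size=(channels, channels))      # hypothetical weights, C -> C
w2 = brightness_weight(f_ir, conv_a, conv_b)
mid_ir = apply_weight(f_ir, w2)
```

Under the residual assumption, channels with a weight near 1 are roughly doubled while channels with a weight near 0 pass through unchanged, so no channel is suppressed entirely.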

[0081] In the second step, the intermediate infrared and visible light features are concatenated along the channel dimension to obtain the joint feature. The weight values w_ch and w_sp are then obtained through parallel channel attention and spatial attention. The product of the two is normalized using the sigmoid function δ to generate the final weight value w.

[0082]

[0083]

[0084]

[0085] Finally, the fusion features can be obtained by the following formula:

[0086]

[0087] Here, the left-hand side represents the fused features, the element-wise multiplication and addition operators are used, and the inputs are the infrared and visible light features after PSA processing.
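The second fusion step (channel concatenation, parallel channel and spatial attention, sigmoid of their product) can be sketched as below. The internals of the two attention branches are assumptions: channel attention is taken as global average pooling followed by a 1×1 convolution, and spatial attention as a single 1×1 convolution across channels. How the final weight is then applied to produce the fused features is not assumed here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fusion_weight(mid_ir, mid_vi, w_channel, w_spatial):
    """Return w = sigmoid(w_ch * w_sp) for two (C, H, W) intermediate features.

    w_channel: (2C, 2C) matrix acting on pooled channel statistics (assumed).
    w_spatial: (2C,) vector mixing channels at every pixel (assumed).
    """
    joint = np.concatenate([mid_ir, mid_vi], axis=0)           # (2C, H, W)
    gap = joint.mean(axis=(1, 2))                              # (2C,)
    w_ch = (w_channel @ gap)[:, None, None]                    # (2C, 1, 1)
    w_sp = np.tensordot(w_spatial, joint, axes=([0], [0]))[None]  # (1, H, W)
    return sigmoid(w_ch * w_sp)                                # (2C, H, W)

rng = np.random.default_rng(4)
channels, height, width = 4, 6, 5
mid_ir = rng.random((channels, height, width))
mid_vi = rng.random((channels, height, width))
w_channel = rng.normal(size=(2 * channels, 2 * channels))  # hypothetical weights
w_spatial = rng.normal(size=2 * channels)                  # hypothetical weights
w = fusion_weight(mid_ir, mid_vi, w_channel, w_spatial)
```

Broadcasting the per-channel and per-pixel branches against each other yields one weight for every channel and location of the joint feature.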

[0088] Step 4: Decoding. Decode the fused features to reconstruct the fused image.

[0089] The fusion result serves as input to the decoding module, which reconstructs the image. The decoding module consists of five convolutional layers, whose main properties are shown in Table 1. Except for the last layer, which has a 1×1 kernel size, all other convolutional layers are 3×3. Furthermore, the activation function for the first four convolutional layers is Leaky-Relu(), and the activation function for the last layer is Tanh().
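The decoder layout (four 3×3 convolutions with Leaky-ReLU, then one 1×1 convolution with Tanh) can be sketched in NumPy as follows. The channel widths, the Leaky-ReLU slope of 0.2, zero padding, and the random kernels are all assumptions for illustration; the actual layer attributes belong to Table 1.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Stride-1 'same' convolution (cross-correlation), zero padding, no bias.

    x: (Cin, H, W), kernel: (Cout, Cin, k, k) with odd k.
    """
    k = kernel.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    windows = sliding_window_view(xp, (k, k), axis=(1, 2))  # (Cin, H, W, k, k)
    return np.einsum("chwij,ocij->ohw", windows, kernel)

def leaky_relu(x: np.ndarray, slope: float = 0.2) -> np.ndarray:
    return np.where(x > 0, x, slope * x)

def decode(fused: np.ndarray, kernels: list) -> np.ndarray:
    """Apply five conv layers: Leaky-ReLU after the first four, Tanh after the last."""
    x = fused
    for kernel in kernels[:-1]:
        x = leaky_relu(conv2d(x, kernel))
    return np.tanh(conv2d(x, kernels[-1]))

rng = np.random.default_rng(5)
widths = [8, 8, 4, 4, 2, 1]   # hypothetical channel widths
sizes = [3, 3, 3, 3, 1]       # kernel sizes stated in the text
kernels = [rng.normal(scale=0.1, size=(widths[i + 1], widths[i], sizes[i], sizes[i]))
           for i in range(5)]
fused = rng.random((widths[0], 6, 5))
image = decode(fused, kernels)
```

The final Tanh bounds the reconstructed image to [-1, 1], so a pipeline built this way would need its training targets scaled to the same range.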

[0090] In addition, a total loss function L_Total combining the brightness loss L_Int, the gradient loss L_Grad, and the structural similarity loss L_SSIM is introduced to help the model retain more image details. With λ1 = 2.5, λ2 = 10, and λ3 = 3 as weighting factors, the loss function is implemented as follows:

[0091] $L_{Total} = \lambda_1 L_{Int} + \lambda_2 L_{Grad} + \lambda_3 L_{SSIM}$
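As a small worked example of the weighted sum, the snippet below combines three loss values with the stated weighting factors. The individual loss values are hypothetical numbers chosen only to show the arithmetic.

```python
# Weighting factors stated in the text
LAMBDA_INT = 2.5
LAMBDA_GRAD = 10.0
LAMBDA_SSIM = 3.0

def total_loss(l_int: float, l_grad: float, l_ssim: float) -> float:
    """L_Total = lambda1 * L_Int + lambda2 * L_Grad + lambda3 * L_SSIM."""
    return LAMBDA_INT * l_int + LAMBDA_GRAD * l_grad + LAMBDA_SSIM * l_ssim

# Hypothetical per-batch loss values: 2.5*0.1 + 10*0.02 + 3*0.3 = 1.35
example = total_loss(0.1, 0.02, 0.3)
```

The gradient term carries the largest factor, so texture errors are penalized most strongly for loss values of comparable size.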

[0092] Let $I_{ir}$ denote the infrared image, $I_{vi}$ the visible light image, and $I_f$ the fused image. The brightness loss $L_{Int}$ can be defined by the following formula:

[0093]

[0094] Here H and W are the image dimensions, and the element-wise multiplication operator is used. Furthermore, extracting the salient components with the saliency matrix S gives:

[0095] $\omega_{ir} = S(I_{ir}) / [S(I_{ir}) - S(I_{vi})]$

[0096] $\omega_{vi} = 1 - \omega_{ir}$
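A brightness loss of this kind is often written as the mean absolute difference between the fused image and a saliency-weighted mix of the two sources. The sketch below assumes that form, L_Int = (1/HW)·||I_f − (ω_ir ⊗ I_ir + ω_vi ⊗ I_vi)||_1, and takes the weight map ω_ir as a given input rather than computing it from a saliency matrix. The function name `intensity_loss` is hypothetical.

```python
import numpy as np

def intensity_loss(fused, ir, vi, w_ir):
    """Mean absolute error against a weighted mix of the sources (assumed form).

    fused, ir, vi, w_ir: arrays of shape (H, W); w_vi is 1 - w_ir.
    """
    w_vi = 1.0 - w_ir
    target = w_ir * ir + w_vi * vi
    height, width = fused.shape
    return float(np.abs(fused - target).sum() / (height * width))

rng = np.random.default_rng(6)
ir = rng.random((4, 5))
vi = rng.random((4, 5))
w_ir = rng.random((4, 5))          # hypothetical saliency-derived weights
perfect = w_ir * ir + (1.0 - w_ir) * vi
loss_zero = intensity_loss(perfect, ir, vi, w_ir)
loss_offset = intensity_loss(perfect + 0.1, ir, vi, w_ir)
```

A fused image that matches the weighted mix exactly gives zero loss, and a uniform offset of 0.1 gives a loss of 0.1.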

[0097] The gradient loss $L_{Grad}$ can be defined by the following formula:

[0098]

[0099] Here the gradients are obtained by a differentiation operation, max(·) is the maximum value operation, and ||·||1 is the 1-norm.
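A gradient loss built from a differentiation operation, a maximum, and a 1-norm is commonly written as L_Grad = (1/HW)·|| |∇I_f| − max(|∇I_ir|, |∇I_vi|) ||_1. The sketch below assumes that form and uses `np.gradient` finite differences as the differentiation operator; a Sobel filter would be an equally plausible choice.

```python
import numpy as np

def grad_magnitude(img: np.ndarray) -> np.ndarray:
    """Sum of absolute vertical and horizontal finite differences."""
    gy, gx = np.gradient(img)
    return np.abs(gx) + np.abs(gy)

def gradient_loss(fused, ir, vi):
    """Assumed form: mean | |grad I_f| - max(|grad I_ir|, |grad I_vi|) |."""
    target = np.maximum(grad_magnitude(ir), grad_magnitude(vi))
    height, width = fused.shape
    return float(np.abs(grad_magnitude(fused) - target).sum() / (height * width))

ramp = np.tile(np.arange(5.0), (4, 1))   # horizontal ramp, gradient 1 everywhere
flat = np.zeros((4, 5))
loss_match = gradient_loss(ramp, ramp, flat)   # fused keeps the stronger edges
loss_miss = gradient_loss(flat, ramp, flat)    # fused loses all edges
```

The maximum picks, at each pixel, whichever source has the stronger edge, so the fused image is pushed to keep the sharpest texture from either modality.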

[0100] The structural similarity loss $L_{SSIM}$ can be defined by the following formula:

[0101] L SSIM =1-SSIM (x,y) (I vi ,I f SSIM (x,y) (I ir ,I f )

[0102] SSIM(·) is the structural similarity calculation operator.
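For reference, the SSIM operator itself can be sketched with a simplified single-window version computed over the whole image. This is an assumption made for brevity: practical implementations average SSIM over local windows, and the constants follow the usual choice K1 = 0.01, K2 = 0.03. How the two SSIM terms are combined into L_SSIM is not assumed here.

```python
import numpy as np

def global_ssim(x: np.ndarray, y: np.ndarray, data_range: float = 1.0) -> float:
    """Single-window SSIM over the whole image (simplified, not windowed)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mean_x, mean_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mean_x) * (y - mean_y)).mean()
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov_xy + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)

rng = np.random.default_rng(7)
fused = rng.random((8, 8))
visible = rng.random((8, 8))
ssim_self = global_ssim(fused, fused)
ssim_pair = global_ssim(fused, visible)
```

SSIM equals 1 for identical images and drops as luminance, contrast, or structure diverge, which is why a loss term of the form 1 − SSIM rewards structural agreement.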

[0103] To verify the effectiveness of the infrared and visible light image fusion method based on cross-modal enhancement and multi-attention fusion strategy of the present invention, the following verification experiment was conducted.

[0104] As shown in Figure 5, the present invention is compared with nine existing image fusion methods, where (a) is the infrared image, (b) is the visible light image, (c) is the anisotropic diffusion filter (ADF) method, (d) is the fourth-order partial differential equation (FPDE) method, (e) is the fusion method based on dense block and convolutional layer feature extraction (DenseFuse), (f) is the fusion method based on a convolutional neural network (IFCNN), (g) is the fusion method guided by sparse learning (LRRNet), (h) is the fusion method based on nested connections and a spatial/channel attention model (NestFuse), (i) is the residual fusion network (RFN-Nest), (j) is the unified unsupervised fusion method (U2Fusion), (k) is the generative adversarial network fusion method (FusionGAN), and (l) is the present invention (Ours).

[0105] Figure 6 is a cumulative distribution diagram of the evaluation parameters. For the 100 fusion results obtained with the nine comparison methods above and with the method of this invention, the entropy (EN), mutual information (MI), standard deviation (SD), and visual information fidelity (VIF) are calculated, and their cumulative distributions are plotted.
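Two of the listed metrics and the cumulative distribution itself are simple enough to sketch. The code below computes entropy (EN) and standard deviation (SD) for an 8-bit image and builds an empirical cumulative distribution over a set of per-image scores. The function names and the random test images are hypothetical; MI and VIF are omitted because they need considerably more machinery.

```python
import numpy as np

def entropy(img_u8: np.ndarray) -> float:
    """Shannon entropy (bits) of an 8-bit image's grey-level histogram."""
    hist = np.bincount(img_u8.ravel(), minlength=256)
    p = hist[hist > 0] / img_u8.size
    return float(-(p * np.log2(p)).sum())

def standard_deviation(img_u8: np.ndarray) -> float:
    return float(img_u8.astype(float).std())

def empirical_cdf(values):
    """Return sorted values and the fraction of samples at or below each one."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

rng = np.random.default_rng(8)
# Hypothetical stand-ins for 100 fused images
fused_images = [rng.integers(0, 256, size=(16, 16), dtype=np.uint8) for _ in range(100)]
en_scores = [entropy(img) for img in fused_images]
xs, ys = empirical_cdf(en_scores)
```

Plotting `ys` against `xs` for each method gives one curve per method; a curve shifted to the right indicates higher metric values across the test set.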

[0106] The above experiments show that the image fused by the method of the present invention has higher infrared target brightness, richer scene details, higher contrast, and no spectral mixing phenomenon.

[0107] This invention presents an infrared and visible light image fusion method based on a cross-modal enhancement and multi-attention fusion strategy. During the encoding stage, a cross-modal enhancement network is added to complementarily enhance infrared and visible light features, capturing their inherent correlation. Simultaneously, a detail injection module is added to enhance the representation of detailed information. Furthermore, a multi-attention fusion strategy replaces the ordinary fusion method, introducing texture attention and brightness attention modules to process visible light and infrared features respectively. A spatial-channel attention module calculates the fusion weights to guide the fusion of each component, obtaining the final fusion result.

[0108] Obviously, the above embodiments are merely illustrative examples for clear explanation and are not intended to limit the implementation. Those skilled in the art will recognize that other variations or modifications can be made based on the above description. It is neither necessary nor possible to exhaustively list all possible implementations here. However, obvious variations or modifications derived therefrom are still within the scope of protection of this invention.

Claims

1. A method for fusing infrared and visible light images based on a cross-modal enhancement and multi-attention fusion strategy, characterized in that it includes the following steps:

Step 1: Acquire the original infrared image and the original visible image from the same scene that have been registered;

Step 2: Encoding, which enhances the features extracted from the convolutional layers based on the cross-modal enhancement module and the detail injection module;

Step 3: Fusion, which guides the fusion of visible light and infrared features based on a multi-attention mechanism;

Step 4: Decoding, which decodes the fused features to reconstruct the fused image;

Step 2 specifically includes: a 1×1 convolutional layer is applied to remove modal differences between infrared and visible light images; four 3×3 convolutional layers are applied to further extract deeper features; the features are input into the cross-modal enhancement module and processed by the polarization self-attention module to obtain valuable features; the local features processed by the polarization self-attention module are then subjected to two processes: refined feature extraction and cross-modal complementation; the joint features output by the cross-modal enhancement module and the second convolutional layer are injected into the third layer, and the resulting feature is injected into the fourth layer; finally, the output from the detail injection module is processed by the fifth convolutional layer;

Step 3 specifically includes: in the first step, infrared and visible light features are processed by the luminance attention module and the texture attention module, respectively; the visible light texture information is preserved by a high-pass filter layer, the maximum and average values are then obtained by the max pooling layer and the average pooling layer, respectively, and the two are combined by the concatenation operator $C_1$; after combination, the dimensionality is reduced by two consecutive 1×1 convolutional layers, and the sigmoid function δ is applied for normalization to generate the weight parameter $w_1$, according to the following formula:

$w_1 = \delta\big(Conv_2(Conv_1(C_1(MP(HP(F_{vi})),\ AP(HP(F_{vi})))))\big)$

where MP() denotes the max pooling operation, HP() denotes the high-pass filtering operation, AP() denotes the average pooling operation, Conv_n() denotes the convolution operation, and $F_{vi}$ denotes the visible light features;

the maximum and average brightness are obtained from the max pooling and average pooling layers and combined by the concatenation operator $C_2$; the result is reduced in dimensionality by two consecutive 1×1 convolutional layers, and the sigmoid function δ is applied for normalization to generate the weight parameter $w_2$, according to the following formula, where $F_{ir}$ denotes the infrared features:

$w_2 = \delta\big(Conv_2(Conv_1(C_2(MP(F_{ir}),\ AP(F_{ir}))))\big)$

the infrared features and the visible light features are each weighted by their respective weights using the element-wise multiplication and addition operators to obtain the intermediate features, as defined by the corresponding formula;

in the second step, the intermediate features of infrared and visible light are concatenated along the channel dimension to obtain the joint feature; the weight values w_ch and w_sp are then obtained through parallel processing of channel attention and spatial attention, and the product of the two is normalized by the sigmoid function δ to generate the final weight value w, where GAP denotes global average pooling and Conv denotes the convolution operator; the fusion features are obtained from the corresponding formula.

2. The infrared and visible light image fusion method based on cross-modal enhancement and multi-attention fusion strategy according to claim 1, characterized in that step four specifically involves using the fusion result as input to the decoding module, which then reconstructs the image.

3. The infrared and visible light image fusion method based on cross-modal enhancement and multi-attention fusion strategy according to claim 2, characterized in that, The decoding module includes five convolutional layers: the kernel size of the first four convolutional layers is 3×3, and the activation function is Leaky-Relu(); the kernel size of the last convolutional layer is 1×1, and the activation function is Tanh().