Infrared-visible image fusion method based on diffusion style feature prior
By collaboratively designing the diffusion style feature prior module (DFP) and the cross-modal fusion module (CMF), the problems of insufficient structural modeling and inadequate cross-modal alignment in infrared and visible light image fusion are solved, generating high-quality fused images suitable for high-level vision tasks in complex scenes.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHONGQING UNIV OF TECH
- Filing Date
- 2026-02-06
- Publication Date
- 2026-06-19
AI Technical Summary
Existing infrared and visible light image fusion methods struggle to maintain detail clarity and structural integrity in complex scenes, suffer from insufficient cross-modal alignment and fusion, and lack computational efficiency and stability, making it difficult to meet the requirements for high-precision fusion.
The visible light structure is explicitly modeled using the diffusion style feature prior module (DFP), and multi-scale encoding and alignment are performed using the cross-modal fusion module (CMF). Spatial adaptive feature maps are generated through the diffusion style U-Net network, and model parameters are optimized using an iterative update mechanism and an unsupervised loss function to achieve accurate alignment and complementary fusion of infrared and visible light images.
The generated fused image significantly improves the clarity of details such as edges and textures in complex scenes, enhances structural integrity, reduces error accumulation, improves computational efficiency, and is highly adaptable, making it suitable for high-level vision tasks such as military surveillance and autonomous driving.
[Figure: abstract drawing, CN122243757A]
Description
Technical Field
[0001] This invention relates to the field of multi-sensor information processing technology, and specifically to an infrared-visible image fusion method based on diffusion style feature priors.
Background Technology
[0002] Image fusion, a core technology in multi-sensor information processing, aims to generate new images with richer information and more complete structure by integrating multi-source image information from different sensors or imaging conditions. This improves image readability and robustness, providing reliable data support for high-level visual tasks such as military surveillance, nighttime security, autonomous driving, and target detection and recognition. Among these, infrared and visible light image fusion is one of the most valuable applications. Infrared images can highlight the thermal radiation characteristics of targets and have strong target saliency, but they lack texture and structural information. Visible light images contain rich edge, texture, and scene structural details, but their imaging quality is prone to degradation under complex conditions such as low illumination and strong light interference. The complementary fusion of these two technologies is of great significance for comprehensively depicting a scene.
[0003] Early infrared and visible light image fusion methods primarily relied on traditional manually designed rules. These methods processed the original images through multi-scale decomposition, sparse representation, subspace transformation, or saliency analysis, and then fused them according to preset criteria. While these methods achieved certain results in specific scenarios, they were highly sensitive to parameter settings and scene changes, lacked adaptive adjustment capabilities, and struggled to cope with complex and ever-changing real-world application environments, gradually failing to meet the demands for high-precision fusion.
[0004] With the rise of deep learning technology, data-driven fusion methods have become mainstream. These methods automatically learn feature extraction, fusion, and reconstruction processes through neural networks, significantly reducing reliance on manual rules. In recent years, some studies have introduced the concept of diffusion models to improve structure preservation by modeling the distribution of image features; other approaches have attempted to incorporate cross-modal alignment mechanisms to alleviate spatial inconsistencies between infrared and visible light images. These explorations have driven the development of fusion technology, but overall, core bottlenecks have not yet been overcome, and there is still considerable room for improvement in fusion performance in complex scenarios.
[0005] Current technologies suffer from four major problems: First, insufficient prior modeling of visible light structures. Most methods directly use raw or shallow perceptual features, which easily loses subtle structures in weak texture, low contrast, or noisy scenes, leading to blurred fusion results. Second, insufficient coupling between cross-modal alignment and fusion. The alignment module and fusion process are designed separately, lacking collaborative optimization, which easily produces artifacts or structural misalignment. Third, limited adaptability to complex scenes. Under conditions such as low illumination, strong noise, and partial occlusion, it is difficult to balance the saliency of thermal targets and structural integrity, often resulting in overemphasis on infrared information or loss of visible light details. Fourth, insufficient framework efficiency and stability. Multi-stage processing can easily introduce error accumulation, and some complex models have high inference overhead, which is not conducive to real-time or resource-constrained applications.
[0006] Therefore, there is an urgent need for an infrared-visible light image fusion method that can explicitly model visible light structural information, achieve cross-modal multi-scale collaborative alignment and fusion, and balance efficiency and stability, so as to provide high-quality input images for high-level vision tasks in complex scenes.
Summary of the Invention
[0007] To address the aforementioned technical problems, this application discloses an infrared-visible image fusion method based on diffusion style feature priors, specifically including:
[0008] Acquire infrared and visible light images;
[0009] The visible light image and the randomly sampled time step are input into the diffusion style feature prior module DFP. The spatial adaptive feature map is generated through the diffusion style U-Net network. The structure-aware visible light representation and feature prior are obtained through feature partitioning and iterative update mechanism.
[0010] The structure-aware visible light representation is converted to a luminance-chrominance space, and the luminance channel is normalized.
[0011] Multi-scale encoders in the cross-modal fusion module CMF extract multi-level features from infrared images and normalized visible light brightness channels, respectively.
[0012] A cross-modal alignment filter (CMAF) is applied at each feature scale to align and fuse the infrared and visible light features;
[0013] After the fused features from all scales are unified in scale, they are concatenated, and a fused brightness image is generated by a fusion prediction head.
[0014] By combining the chromaticity components of the fused luminance image and the visible light image, the final fused image is reconstructed.
[0015] A phased training strategy and an unsupervised fusion loss function are used to optimize the model parameters. The unsupervised fusion loss function includes brightness preservation constraints, gradient consistency constraints, and color consistency constraints.
[0016] Preferably, the diffusion-style U-Net network includes a network structure based on ResNet Blocks, consisting of a downsampling path, an intermediate module, an upsampling path, and a time step embedding module (TSE). The time step embedding module injects time step parameters into each ResNet Block to modulate the feature response within the block in an additive manner to adapt to the feature prediction requirements of different time steps. Furthermore, the diffusion-style U-Net is used for conditional feature prediction, performs forward inference only once, and does not involve noise sampling and iterative denoising processes.
[0017] Preferably, the operation flow of the iterative update mechanism is as follows: taking the original visible light image as the initial state; selecting feature sub-images after feature mapping in sequence; updating the feature sub-images obtained by feature mapping in channel order, using an explicit nonlinear mapping function, and iteratively updating the current image state according to deterministic update rules; and outputting a structure-aware visible light representation after repeating the process a preset number of times. The process is a deterministic mapping.
[0018] Preferably, the multi-scale encoder is a shared structure that extracts three layers of hierarchical feature representations sequentially through multi-layer convolution. The features of each layer maintain a consistent spatial resolution, and the structural and semantic information of different layers is represented by changes in channel dimension. The features of different scales correspond to small receptive field detail information, medium receptive field local structure, and large receptive field global semantic information, respectively, providing multi-dimensional feature support for cross-modal alignment.
[0019] Preferably, the cross-modal alignment filter (CMAF) includes feature-enhanced convolution, attention-based cross-modal feature matching, residual weighted fusion, and channel alignment operations. The feature matching process achieves spatial alignment by calculating the correlation matrix between infrared and visible light features. Channel alignment unifies the feature dimension through 1×1 convolution. The residual weighted fusion uses adaptive weight coefficients to balance the contributions of the two modal features.
[0020] Preferably, the phased training strategy is as follows: in the first phase, only the diffusion style feature prior module (DFP) is pre-trained using visible light images to construct a stable structural prior; in the second phase, the parameters of the DFP module are fixed, and only the cross-modal fusion module (CMF) is trained to achieve specific optimization of cross-modal feature alignment and fusion.
[0021] Preferably, the loss function for the brightness preservation constraint is: $L_{int} = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\left|Y_f(i,j) - Y_{vis}(i,j)\right|$, where $Y_f(i,j)$ is the pixel value of the fused brightness image at position $(i,j)$, $Y_{vis}(i,j)$ is the pixel value of the original visible light brightness channel at position $(i,j)$, and $H$ and $W$ are the height and width of the image, respectively. It constrains the brightness consistency between the fused image and the visible light image.
[0022] Preferably, the loss function for the gradient consistency constraint is defined in terms of $\nabla_x$ and $\nabla_y$, the gradient operators in the horizontal and vertical directions, respectively, and is used to maintain the edge and structural integrity of the fused image.
[0023] Preferably, the loss function for the color consistency constraint compares the chrominance components of the fused image at each position $(i,j)$ with the chrominance components of the original visible light image at the same position, and is used to avoid color distortion during fusion.
[0024] Preferably, the total loss of the unsupervised fusion loss function is:
[0025] $L_{total} = \lambda_1 L_{int} + \lambda_2 L_{grad} + \lambda_3 L_{color}$
[0026] where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are loss balancing weights used to collaboratively optimize the brightness, structure, and color quality of the fused image.
[0027] Compared with the prior art, the technical solution of this application has the following technical effects:
[0028] This invention uses the Diffusion Style Feature Prior Module (DFP) to explicitly model the structure of visible light images. By leveraging the Diffusion Style U-Net and iterative update mechanism, it progressively perceives structural information and suppresses noise interference without introducing random noise. This effectively compensates for the shortcomings of visible light structure modeling, making the edges, textures, and other details in the fusion result clearer and significantly improving structural integrity.
[0029] The cross-modal fusion module CMF of this invention employs multi-scale coding and cross-modal alignment filter CMAF to work together, achieving precise alignment and complementary fusion of infrared and visible light features at different scale levels. This solves the problems of cross-modal spatial inconsistency and insufficient coupling between alignment and fusion, avoids artifacts and structural misalignment in the fused image, and improves target saliency and overall scene consistency.
[0030] The single-stage inference process of this invention simplifies the traditional multi-stage processing framework and reduces error accumulation. At the same time, the phased training strategy optimizes the performance of each module in a targeted manner, taking into account both the inference efficiency and stability of the model. The unsupervised fusion loss function constrains the fusion result from multiple dimensions such as brightness, gradient, and color, ensuring that infrared thermal target information and visible light structural details are fully preserved, thereby enhancing the adaptability to complex scenes and low-quality images.
[0031] Meanwhile, the overall technical solution of this application achieves a balance between fusion effect and computational efficiency while maintaining lightweight characteristics. The generated fused image information is richer and the visual effect is more natural, providing more reliable input data support for subsequent high-level vision tasks such as military surveillance, autonomous driving, target detection and recognition, and has broad practical application value.
[0032] The above description is only an overview of the technical solution of this application. In order to better understand the technical means of this application and implement it in accordance with the contents of the specification, and to make the above and other objects, features and advantages of this application more obvious and understandable, the preferred embodiments of this application are described in detail below with reference to the accompanying drawings.
[0033] The above and other objects, advantages and features of this application will become more apparent to those skilled in the art from the following detailed description of specific embodiments in conjunction with the accompanying drawings.
Attached Figure Description
[0034] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort. In all drawings, similar elements or parts are generally identified by similar reference numerals. In the drawings, the elements or parts are not necessarily drawn to scale.
[0035] Based on the description of the figures and their corresponding technical content in the document, the titles of the figures are as follows:
[0036] Figure 1: Flowchart of the overall steps of the infrared-visible image fusion method based on diffusion style feature priors;
[0037] Figure 2: Schematic diagram of the collaborative working framework between the diffusion style feature prior module and the cross-modal fusion module;
[0038] Figure 3: Detailed schematic diagram of the diffusion-style U-Net network structure and the iterative feature-driven update mechanism;
[0039] Figure 4: Schematic diagram of the internal structure of the multi-scale encoder and the cross-modal alignment filter in the cross-modal fusion module;
[0040] Figure 5: Comparison of the infrared image, the visible light image, and the fusion results of the comparison algorithms and the present invention in daytime scenes;
[0041] Figure 6: Comparison of the infrared image, the visible light image, and the fusion results of the comparison algorithms and the present invention in low-light nighttime scenes.
Detailed Implementation
[0042] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. In the following description, specific details such as specific configurations and components are provided merely to help fully understand the embodiments of this application. Therefore, those skilled in the art should understand that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this application. In addition, for clarity and brevity, descriptions of known functions and structures are omitted in the embodiments.
[0043] It should be understood that the phrase "an embodiment" or "this embodiment" throughout the specification means that a specific feature, structure, or characteristic related to the embodiment is included in at least one embodiment of this application. Therefore, "an embodiment" or "this embodiment" appearing throughout the specification does not necessarily refer to the same embodiment. Furthermore, these specific features, structures, or characteristics can be combined in any suitable manner in one or more embodiments.
[0044] Furthermore, reference numerals and / or letters may be repeated in different examples within this application. Such repetition is for the purpose of simplification and clarity and does not in itself indicate a relationship between the various embodiments and / or settings discussed.
[0045] In this document, the term "and / or" merely describes a relationship between related objects and indicates that three relationships can exist. For example, A and / or B can mean: A exists alone, B exists alone, or A and B exist simultaneously. The term " / and" describes another type of relationship between related objects and indicates that two relationships can exist. For example, A / and B can mean: A exists alone, or A and B exist simultaneously. In addition, the character " / " generally indicates that the related objects before and after it are in an "or" relationship.
[0046] In this article, the term "at least one" is merely a description of the relationship between related objects, indicating that there can be three relationships. For example, "at least one of A and B" can mean: A exists alone, A and B exist simultaneously, or B exists alone.
[0047] It should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion.
[0048] Example 1
[0049] This embodiment describes an infrared-visible image fusion method based on diffusion style feature priors. As shown in Figures 1-2, it specifically includes:
[0050] Acquire infrared and visible light images;
[0051] The visible light image and the randomly sampled time step are input into the diffusion style feature prior module DFP. The spatial adaptive feature map is generated through the diffusion style U-Net network. The structure-aware visible light representation and feature prior are obtained through feature partitioning and iterative update mechanism.
[0052] The structure-aware visible light representation is converted to a luminance-chrominance space, and the luminance channel is normalized.
[0053] Multi-scale encoders in the cross-modal fusion module CMF extract multi-level features from infrared images and normalized visible light brightness channels, respectively.
[0054] A cross-modal alignment filter (CMAF) is applied at each feature scale to align and fuse the infrared and visible light features;
[0055] After the fused features from all scales are unified in scale, they are concatenated, and a fused brightness image is generated by a fusion prediction head.
[0056] By combining the chromaticity components of the fused luminance image and the visible light image, the final fused image is reconstructed.
[0057] A phased training strategy and an unsupervised fusion loss function are used to optimize the model parameters. The unsupervised fusion loss function includes brightness preservation constraints, gradient consistency constraints, and color consistency constraints.
[0058] Furthermore, the acquired infrared image is a single-channel thermal radiation image with dimensions $B \times 1 \times H \times W$, where $B$ is the batch size during training or inference, $1$ denotes the single channel, and $H$ and $W$ are the image height and width, respectively. The pixel values of this image correspond to the thermal radiation intensity of targets in the scene and highlight their thermal saliency. The acquired visible light image is a three-channel RGB image with dimensions $B \times 3 \times H \times W$, where $3$ denotes the red, green, and blue color channels; its pixel values contain the scene's structural outlines, texture details, and color information.
[0059] Uniformly resize the images of both modalities: use bilinear interpolation to scale the images to a preset resolution (e.g., ...). This ensures that the spatial dimensions of the infrared image and the visible light image are completely consistent during subsequent feature extraction and fusion processes; at the same time, contrast-limited adaptive histogram equalization is performed on the brightness channel of the visible light image to obtain a stable brightness input for subsequent fusion.
[0060] Furthermore, as shown in Figure 3, the Diffusion Style Feature Prior Module (DFP) generates 24-channel feature priors using a diffusion-style U-Net (DiffusionUNet) and, based on these, performs explicit iterative updates on the visible light image to obtain a structural prior representation. Specifically:
[0061] The diffusion-style U-Net consists of a downsampling path (down), intermediate modules (mid), and an upsampling path (up), and introduces time-step embedding (TSE / temb) to conditionally modulate each ResNet block; the network has a U-Net structure with four resolution levels:
[0062] The downsampling path has four resolution levels, each containing two ResNetBlocks (num_res_blocks=2). The base number of channels is set to ch=8, and channel scaling sets the number of feature channels at the successive resolution levels to 8, 16, 24, etc. Each ResNetBlock combines normalization layers, nonlinear activation functions, and convolutional mappings, preferably using GroupNorm and the Swish activation function. A linear projection of the time-step embedding is introduced into each ResNetBlock as conditional information and added to the current feature representation, thereby achieving time-step conditional modulation. As the resolution levels advance, downsampling operations reduce the spatial resolution of the feature maps while the number of channels gradually expands, so that the network completes multi-scale feature abstraction from local details to high-level structural information.
[0063] In the intermediate module (mid), further feature integration and global context modeling are performed on the lowest resolution features. This module consists of multiple ResNetBlocks and is used to perform global context modeling on features at the smallest spatial scale, thereby enhancing the network's ability to characterize the overall structural relationships and providing a stable structural prior representation for subsequent feature reconstruction.
[0064] In the upsampling path, the spatial resolution of the feature map is restored through progressive upsampling operations. At each corresponding resolution level, a skip connection is introduced to concatenate and fuse the feature map retained at the same level in the downsampling path with the current level's features along the channel dimension. This skip connection mechanism supplements the detailed information of high-resolution levels during the gradual restoration of spatial resolution, effectively preventing the loss of structural information during the reconstruction phase while the number of feature channels gradually decreases during the upsampling process.
[0065] Time step embedding (TSE) module: as shown in the TSE block of Figure 3(a), this module converts the randomly sampled time step t into an embedding vector that can be injected into the network. Specifically, the time step t is first sinusoidally position-encoded to obtain a base embedding, and two linear mapping layers then generate the time step embedding vector. This vector is additively injected into the features of each ResNetBlock for conditionally guided feature modeling. Note that the diffusion-style U-Net performs forward inference only once and involves neither noise sampling nor repeated denoising.
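The time step embedding described above can be illustrated with a minimal standard-library sketch. The frequency ladder with base 10000, the embedding width, the Swish activation between the two linear maps, and all function names are assumptions made for illustration; the text only specifies a sinusoidal encoding followed by two linear mapping layers.

```python
import math


def sinusoidal_embedding(t, dim):
    """Encode a scalar time step t as `dim` sin/cos features."""
    if dim % 2 != 0 or dim < 4:
        raise ValueError("dim must be an even number >= 4")
    half = dim // 2
    # Assumed convention: geometric frequencies from 1 down to 1/10000.
    freqs = [math.exp(-math.log(10000.0) * k / (half - 1)) for k in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def linear(x, weight, bias):
    """y = W x + b for one vector x; `weight` is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def swish(x):
    """Element-wise x * sigmoid(x)."""
    return [v / (1.0 + math.exp(-v)) for v in x]


def time_step_embedding(t, dim, w1, b1, w2, b2):
    """Sinusoidal encoding followed by two linear maps.

    The Swish between the two maps is an assumption; the source only
    states that two linear mapping layers are used.
    """
    base = sinusoidal_embedding(t, dim)
    return linear(swish(linear(base, w1, b1)), w2, b2)
```

Inside a ResNetBlock, a further linear projection of this vector would be added to every spatial position of the block's feature map, which is what the additive injection refers to.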
[0066] Feature map generation and partitioning: as shown in Figure 3(a), the diffusion-style U-Net takes the preprocessed three-channel visible light image and the time-step embedding vector output by the TSE module as inputs, and outputs a spatially adaptive feature map with 24 channels (corresponding to the output that is split in Figure 3(a)). The feature map is evenly divided along the channel dimension into 8 feature sub-maps of 3 channels each. Each sub-map serves as the guiding feature for one iterative update, so that each update can focus on structural information in a different dimension.
[0067] Iterative feature-driven update: as shown in Figure 3(b), the image state is updated as follows:
[0068]
[0069] where $\odot$ denotes element-wise multiplication, and the explicit nonlinear mapping function is defined as:
[0070]
[0071] The intermediate representation is obtained after the 4th iteration, and the structural prior representation is obtained after the 8th iteration.
[0072] After the 8 iterations are complete, the structural prior representation is converted to the YCbCr color space and separated into the luminance component $Y$ and the chrominance components $C_b$ and $C_r$. Contrast-limited adaptive histogram equalization (CLAHE) is then applied to the luminance component to obtain the visible light brightness input for subsequent fusion.
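The partitioning and iterative update can be sketched as follows for a single image without the batch dimension. The update rule and the mapping function are not reproduced in this text, so both the additive form `state + sub_map * phi(state)` and the quadratic `phi` below are hypothetical stand-ins chosen only to show the control flow (8 sub-maps of 3 channels, one deterministic update per sub-map, an intermediate result after the 4th step).

```python
def split_channels(feature_map, groups=8):
    """Split a list of channel planes into `groups` equal sub-maps."""
    if len(feature_map) % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    size = len(feature_map) // groups
    return [feature_map[k * size:(k + 1) * size] for k in range(groups)]


def phi(x):
    """HYPOTHETICAL nonlinear mapping; the source's definition is not shown."""
    return x * (1.0 - x)


def iterative_update(image, feature_map, groups=8):
    """Apply one deterministic update per feature sub-map.

    image: 3 channel planes (H x W nested lists) with values in [0, 1].
    feature_map: 24 channel planes of the same H x W.
    Returns (state after groups // 2 steps, final state).
    The update form state + a * phi(state) is an assumption.
    """
    state = image
    intermediate = None
    for step, sub_map in enumerate(split_channels(feature_map, groups), start=1):
        state = [
            [[x + a * phi(x) for x, a in zip(row_x, row_a)]
             for row_x, row_a in zip(plane_x, plane_a)]
            for plane_x, plane_a in zip(state, sub_map)
        ]
        if step == groups // 2:
            intermediate = state
    return intermediate, state
```

No random noise enters the loop, so the same image and feature map always give the same result, which matches the description of the process as a deterministic mapping.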
[0073] Furthermore, the cross-modal fusion module (CMF), shown in Figure 4, is structured as follows:
[0074] Multi-scale encoder: as shown in Figure 4(a), a shared structure is adopted (i.e., the infrared image and the visible light brightness channel use the same encoder parameters). Features at three scales (Level 1, Level 2, and Level 3 in Figure 4(a)) are extracted separately from the single-channel infrared image and from the normalized single-channel visible light brightness channel:
[0075] Level 1 (low-scale features): two 3×3 convolutional layers extract low-scale features for the visible light brightness channel and for the infrared image. Features at this scale correspond to image details such as edges and textures.
[0076] Level 2 (mid-scale features): the low-scale features are downsampled by a 3×3 convolutional layer with a stride of 2, and two further 3×3 convolutional layers then extract the mid-scale features of both modalities. Features at this scale correspond to the local structural information of the image.
[0077] Level 3 (high-scale features): the mid-scale features are downsampled by a 3×3 convolutional layer with a stride of 2, and two further 3×3 convolutional layers then extract the high-scale features of both modalities. Features at this scale correspond to the global semantic information of the image. In the aggregation stage, the fused features of the three levels are first unified in channel dimension by the channel alignment layer (Align) and then concatenated along the channel dimension to aggregate the multi-scale features.
[0078] Cross-modal alignment filter (CMAF): as shown in Figure 4(b), a CMAF module is deployed at each scale level of Figure 4(a) to achieve spatial alignment and complementary fusion of the infrared and visible light features. The visible light and infrared input features at each level are first normalized (Norm) and then locally enhanced by depthwise separable convolution (DWConv). The enhanced features of the two modalities are concatenated along the channel dimension and mixed by a mixing convolution (MixConv) to generate a fused representation, which serves as the query Q. Cross-attention (Cross-Attn) is then performed with the visible light features as the key K and the infrared features as the value V to achieve spatial alignment and complementary fusion. The attention output, after being mapped by Proj_out, is fused with the features of the two modalities through a gated residual to obtain the final fused feature at that level, which can be expressed as:
[0079]
[0080] where the gating coefficient is a learnable factor.
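The attention step of the CMAF can be illustrated with a small standard-library sketch that treats each spatial position as one token vector. Scaled dot-product attention is a standard technique; the gate form `F_vis + F_ir + gamma * Attn(...)` is a hypothetical choice, because the gated residual expression is not reproduced in this text. The Norm, DWConv, MixConv, and Proj_out stages are omitted, and `f_mix` stands in for the mixed query representation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(q, k, v):
    """Scaled dot-product attention; q, k, v are lists of token vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    dim = len(v[0])
    out = []
    for q_i in q:
        weights = softmax(
            [scale * sum(a * b for a, b in zip(q_i, k_j)) for k_j in k]
        )
        out.append(
            [sum(w * v_j[c] for w, v_j in zip(weights, v)) for c in range(dim)]
        )
    return out


def gated_residual_fusion(f_vis, f_ir, f_mix, gamma):
    """HYPOTHETICAL gate: F_vis + F_ir + gamma * Attn(Q=f_mix, K=f_vis, V=f_ir).

    gamma plays the role of the learnable gating coefficient.
    """
    attn = cross_attention(f_mix, f_vis, f_ir)
    return [
        [a + b + gamma * c for a, b, c in zip(vis_t, ir_t, attn_t)]
        for vis_t, ir_t, attn_t in zip(f_vis, f_ir, attn)
    ]
```

With `gamma` set to 0 the sketch reduces to a plain sum of the two modalities, which shows how a learnable gate lets training decide how much aligned information to mix in.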
[0081] Feature aggregation and fused brightness generation: as shown in Figure 4(a), after the features at each scale have been fused, the fused features of the different scales are aggregated to generate the final fused brightness image:
[0082] Feature scale unification: the fused features at all levels have the same spatial resolution, so only channel dimension alignment via Align is required.
[0083] Feature concatenation: the low-scale, mid-scale, and high-scale fused features are concatenated along the channel dimension to obtain the aggregated features (the output of Concat in Figure 4(a)).
[0084] Fused brightness generation: the aggregated features are input to the fusion prediction head (ResNet Block×2 + Norm-SILU + Conv_out in Figure 4(a)), which consists of two ResNet Blocks, layer normalization, a SiLU activation function, and a 1×1 convolutional layer. The two ResNet Blocks integrate the aggregated features, the layer normalization and SiLU activation enhance nonlinear expression, and the 1×1 convolutional layer compresses the number of channels to 1, generating the single-channel fused brightness image.
[0085] Furthermore, the fused brightness image is combined with the visible light chrominance components and converted back to the RGB color space to obtain the final fused image, whose color information remains consistent with the visible light branch.
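The color reconstruction step can be sketched per pixel as below. The text does not state which YCbCr variant is used, so the full-range BT.601 coefficients (as used in JPEG), the [0, 1] value range, and the function names are assumptions.

```python
def _clamp(v):
    """Clip a value to the [0, 1] range."""
    return min(1.0, max(0.0, v))


def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 conversion for values in [0, 1] (assumed variant)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 0.5 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 0.5 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def ycbcr_to_rgb(y, cb, cr):
    """Inverse of rgb_to_ycbcr, clipped to [0, 1]."""
    r = y + 1.402 * (cr - 0.5)
    g = y - 0.344136 * (cb - 0.5) - 0.714136 * (cr - 0.5)
    b = y + 1.772 * (cb - 0.5)
    return _clamp(r), _clamp(g), _clamp(b)


def recombine(fused_y, visible_rgb):
    """Pair each fused luminance value with the visible-branch Cb/Cr.

    fused_y: H x W nested list of luminance values.
    visible_rgb: H x W nested list of (r, g, b) tuples from the visible
    branch whose chrominance is kept.
    """
    out = []
    for y_row, rgb_row in zip(fused_y, visible_rgb):
        row = []
        for y_f, (r, g, b) in zip(y_row, rgb_row):
            _, cb, cr = rgb_to_ycbcr(r, g, b)
            row.append(ycbcr_to_rgb(y_f, cb, cr))
        out.append(row)
    return out
```

Because only the luminance is replaced, the chrominance of every output pixel equals that of the visible branch, up to clipping at the range limits.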
[0086] Furthermore, a phased training strategy is adopted. Since the DFP and CMF modules have different functions, a phased training strategy is used to optimize the parameters of the two modules separately. The specific process is as follows:
[0087] Phase 1 (DFP module pre-training): Training is conducted using only visible light images, with the input being... With a random sampling time step t, the DFP is optimized using an unsupervised loss consisting of multiple constraints to output a stable structural prior representation. Training employs a phased strategy: the first phase pre-trains the DFP, and the second phase freezes the DFP and trains the CMF; both phases use adaptive gradient optimization for parameter updates, and a learning rate decay strategy can be set to promote convergence.
[0088] Phase 2 (CMF module training): all parameters of the DFP module are fixed and only the CMF module is trained. The input is the infrared image and the feature prior output by the DFP module, and the output is the final fused image. The training objective is to minimize the total unsupervised fusion loss. The optimizer is AdamW, the initial learning rate is ..., training runs for 150 rounds, and the learning rate is multiplied by 0.5 every 30 rounds.
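As a non-limiting illustration, the step decay described for Phase 2 (150 rounds, learning rate multiplied by 0.5 every 30 rounds) can be sketched as a small function. The initial learning rate is not stated in the text, so `BASE_LR` below is a purely hypothetical placeholder, and the function name `step_decay_lr` is likewise an assumption.

```python
def step_decay_lr(epoch: int, base_lr: float, step_size: int = 30, gamma: float = 0.5) -> float:
    """Learning rate at a 0-based round: multiplied by gamma every step_size rounds."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return base_lr * gamma ** (epoch // step_size)


BASE_LR = 1e-4  # hypothetical value; the source does not state the initial learning rate

# Learning rate for each of the 150 training rounds.
schedule = [step_decay_lr(epoch, BASE_LR) for epoch in range(150)]
```

With these settings the rate is halved four times over the run, so the final rounds use one sixteenth of the initial value. In a PyTorch setup, the same behavior is commonly obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.5)`.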
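[0089] The total unsupervised fusion loss is a weighted sum of the brightness preservation, gradient consistency, and color consistency constraints:
[0090] $L_{total} = \lambda_1 L_{bri} + \lambda_2 L_{grad} + \lambda_3 L_{col}$
[0091] where $L_{bri}$, $L_{grad}$, and $L_{col}$ denote the brightness preservation, gradient consistency, and color consistency losses, and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are loss balancing weights that adjust the relative contribution of the constraint terms.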
[0092] The detailed formulas and physical meanings of each constraint term are as follows:
[0093] The brightness preservation constraint loss $L_{bri}$ enforces pixel-value consistency between the fused luminance image and the visible light luminance channel, so that the fused image retains the structural information of the visible light image:
[0094] $L_{bri} = \frac{1}{BHW} \sum_{b=1}^{B} \sum_{i=1}^{H} \sum_{j=1}^{W} \left| Y_f^{(b)}(i,j) - Y_v^{(b)}(i,j) \right|$,
[0095] where $Y_f^{(b)}(i,j)$ is the pixel value of the fused luminance image at row $i$ and column $j$ of the $b$-th sample, $Y_v^{(b)}(i,j)$ is the normalized pixel value of the visible light luminance channel at the corresponding position, $|\cdot|$ denotes the absolute value, and $\frac{1}{BHW}$ (batch size $B$, image height $H$, image width $W$) is a normalization factor that removes the influence of batch size and image size on the loss value.
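As a non-limiting illustration, the brightness preservation term can be sketched as a mean absolute difference over all samples and pixels, which is what the description of the absolute value and the batch-and-size normalization factor suggests. The function name and the nested-list data layout are assumptions made for this sketch.

```python
from typing import List

Batch = List[List[List[float]]]  # indexed as [sample][row][col]


def brightness_loss(fused: Batch, visible: Batch) -> float:
    """Mean absolute difference between fused and visible luminance over a batch."""
    total = 0.0
    count = 0
    for fused_img, visible_img in zip(fused, visible):
        for fused_row, visible_row in zip(fused_img, visible_img):
            for fused_px, visible_px in zip(fused_row, visible_row):
                total += abs(fused_px - visible_px)
                count += 1
    if count == 0:
        raise ValueError("inputs must contain at least one pixel")
    # Dividing by the number of pixels in the batch plays the role of the
    # normalization factor, so the value does not scale with batch or image size.
    return total / count


fused_batch = [[[0.5, 0.5], [1.0, 0.0]]]
visible_batch = [[[0.25, 0.75], [1.0, 0.5]]]
example_loss = brightness_loss(fused_batch, visible_batch)
```

Duplicating the sample in both batches leaves the result unchanged, which is the batch-size independence the normalization factor is meant to provide.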
[0096] The gradient consistency constraint loss $L_{grad}$ constrains the gradient consistency between the fused luminance image and the visible light luminance channel, so that the fused image retains the edge and texture details of the visible light image:
[0097] $L_{grad} = \frac{1}{2} \left( L_x + L_y \right)$,
[0098] $L_x = \frac{1}{BHW} \sum_{b,i,j} \left| \nabla_x Y_f^{(b)}(i,j) - \nabla_x Y_v^{(b)}(i,j) \right|$,
[0099] $L_y = \frac{1}{BHW} \sum_{b,i,j} \left| \nabla_y Y_f^{(b)}(i,j) - \nabla_y Y_v^{(b)}(i,j) \right|$
[0100] where $\nabla_x$ and $\nabla_y$ are the image gradients in the horizontal and vertical directions (corresponding to edge information), respectively, and $\frac{1}{2}$ is a normalization factor that balances the contributions of the horizontal and vertical gradients.
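As a non-limiting illustration, the gradient consistency term can be sketched for a single image as follows. The text does not state which gradient operator is used; this sketch assumes forward differences and an equal 0.5 weighting of the horizontal and vertical terms, and all function names are hypothetical.

```python
from typing import List

Image = List[List[float]]  # indexed as [row][col]


def grad_x(img: Image) -> Image:
    """Horizontal forward difference (assumed operator): img[i][j+1] - img[i][j]."""
    return [[row[j + 1] - row[j] for j in range(len(row) - 1)] for row in img]


def grad_y(img: Image) -> Image:
    """Vertical forward difference (assumed operator): img[i+1][j] - img[i][j]."""
    return [[img[i + 1][j] - img[i][j] for j in range(len(img[0]))] for i in range(len(img) - 1)]


def mean_abs_diff(a: Image, b: Image) -> float:
    """Mean absolute element-wise difference of two equally sized maps."""
    values = [abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    if not values:
        raise ValueError("images must be at least 2x2")
    return sum(values) / len(values)


def gradient_loss(fused: Image, visible: Image) -> float:
    """Average of the horizontal and vertical gradient mismatches."""
    loss_x = mean_abs_diff(grad_x(fused), grad_x(visible))
    loss_y = mean_abs_diff(grad_y(fused), grad_y(visible))
    return 0.5 * (loss_x + loss_y)


vertical_edge = [[0.0, 1.0], [0.0, 1.0]]
flat = [[0.0, 0.0], [0.0, 0.0]]
example_loss = gradient_loss(vertical_edge, flat)
```

In the example, the vertical edge differs from the flat image only in its horizontal gradient, so only the horizontal term contributes to the loss.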
[0101] The color consistency constraint loss $L_{col}$ constrains the chromaticity consistency between the fused image and the original visible light image to avoid color distortion during fusion. It can be expressed as the sum of the mean absolute differences of the chromaticity components.
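As a non-limiting illustration, the color consistency term and the weighted total loss can be sketched as follows. The sketch assumes two chromaticity planes per image, takes the brightness and gradient terms as already computed scalars, and uses hypothetical weights; the function names are likewise assumptions.

```python
from typing import List, Sequence, Tuple

Plane = List[List[float]]  # indexed as [row][col]


def mean_abs_diff(a: Plane, b: Plane) -> float:
    """Mean absolute element-wise difference of two equally sized planes."""
    values = [abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    if not values:
        raise ValueError("planes must not be empty")
    return sum(values) / len(values)


def color_loss(fused_chroma: Sequence[Plane], visible_chroma: Sequence[Plane]) -> float:
    """Sum over chromaticity planes of the mean absolute difference."""
    return sum(mean_abs_diff(f, v) for f, v in zip(fused_chroma, visible_chroma))


def total_loss(l_bri: float, l_grad: float, l_col: float,
               weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the three constraint terms (weights are hypothetical)."""
    w_bri, w_grad, w_col = weights
    return w_bri * l_bri + w_grad * l_grad + w_col * l_col


fused_chroma = ([[0.5, 0.5]], [[0.4, 0.6]])
visible_chroma = ([[0.5, 0.25]], [[0.4, 0.6]])
example_color = color_loss(fused_chroma, visible_chroma)
example_total = total_loss(0.25, 0.5, example_color, weights=(1.0, 2.0, 4.0))
```

The balancing weights only rescale the individual terms, so they can be tuned without changing how each constraint is computed.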
[0102] This embodiment details how the present invention perceives visible light structural details through a diffusion style feature prior module and combines a multi-scale cross-modal alignment fusion mechanism to solve the problems of insufficient structural modeling and cross-modal misalignment in traditional methods. The generated fused image takes into account both the saliency of thermal targets and the integrity of texture. At the same time, the single-stage process improves efficiency and provides high-quality input for high-level vision tasks.
[0103] Based on Example 1, to verify the effectiveness of the proposed multi-scale cross-modal infrared-visible image fusion method, systematic quantitative and qualitative verification was conducted on the publicly available LLVIP dataset. Seven mainstream fusion algorithms (U2Fusion, SeAFusion, ITFuse, SFINet, MaeFuse, MLFuse, and MSFAFusion) were selected as baselines for comparison. All comparison algorithms used publicly available implementation code and the optimal parameters given in the original papers to ensure the fairness and objectivity of the verification results.
[0104] The quantitative evaluation covers four core dimensions of the fused images: structure preservation, information richness, visual consistency, and cross-modal alignment. Five metrics were selected: structural difference (SD), information entropy (EN), structural fidelity (SF), visual information fidelity (VIF), and alignment gain (AG). The fusion results of 50 infrared-visible image pairs from the dataset were statistically analyzed. The quantitative comparison results are shown in Table 1.
[0105] Table 1. Quantitative comparison on 50 image pairs from the LLVIP dataset across five metrics.
[0106]
Method      SD      EN     SF      VIF    AG
U2Fusion    28.052  6.261  9.385   0.615  2.822
SeAFusion   45.125  7.235  13.113  0.306  3.832
ITFuse      33.892  6.814  5.544   0.617  1.901
SFINet      45.233  7.135  15.319  0.933  4.477
MaeFuse     42.130  7.088  7.306   0.678  2.741
MLFuse      39.558  6.899  11.634  0.748  3.234
MSFAFusion  47.804  7.319  14.499  1.021  4.146
Ours        47.673  7.412  17.289  1.093  5.339
[0107] As shown in Table 1, the present invention ranks first in EN, SF, VIF, and AG and second in SD, which validates the overall performance of the method. Information entropy (EN) measures the amount of information carried by the fused image; a higher value indicates richer integration of effective information. The present invention achieves an EN of 7.412, the highest among all compared algorithms and 0.093 higher than the second-ranked MSFAFusion (7.319). This shows that it can mine and fuse the thermal target information of the infrared image together with the structural details of the visible light image, so that the two modalities complement each other. Structural fidelity (SF) reflects the ability of the fused image to retain the structural features of the source images. The present invention achieves an SF of 17.289, which is 1.970 higher than the second-ranked SFINet (15.319). This indicates that the method retains the edge, texture, and other structural details of the visible light image during fusion and avoids structural blurring or breakage.
[0108] Visual information fidelity (VIF) measures information consistency at the level of human visual perception. The VIF of the present invention is 1.093, higher than MSFAFusion's 1.021 and SFINet's 0.933, indicating that the fusion result is visually closer to the source images, with more natural detail presentation that better matches human visual characteristics. Alignment gain (AG) evaluates the spatial alignment of cross-modal features. The present invention achieves 5.339, exceeding the second-ranked SFINet (4.477) by 0.862, which verifies the effectiveness of the multi-scale cross-modal alignment filter (CMAF) in mitigating the spatial inconsistency between infrared and visible light images caused by their different imaging mechanisms. For structural difference (SD), the present invention achieves 47.673, only 0.131 lower than MSFAFusion's 47.804. While maintaining high structural fidelity, it balances structural consistency and information complementarity, avoiding the information loss caused by excessive pursuit of structural consistency.
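As a non-limiting illustration, the rankings and margins discussed above can be recomputed directly from the values in Table 1. The sketch assumes that a higher value is better for every metric, as the discussion treats them; the helper names are hypothetical.

```python
# Values copied from Table 1.
TABLE_1 = {
    "U2Fusion":   {"SD": 28.052, "EN": 6.261, "SF": 9.385,  "VIF": 0.615, "AG": 2.822},
    "SeAFusion":  {"SD": 45.125, "EN": 7.235, "SF": 13.113, "VIF": 0.306, "AG": 3.832},
    "ITFuse":     {"SD": 33.892, "EN": 6.814, "SF": 5.544,  "VIF": 0.617, "AG": 1.901},
    "SFINet":     {"SD": 45.233, "EN": 7.135, "SF": 15.319, "VIF": 0.933, "AG": 4.477},
    "MaeFuse":    {"SD": 42.130, "EN": 7.088, "SF": 7.306,  "VIF": 0.678, "AG": 2.741},
    "MLFuse":     {"SD": 39.558, "EN": 6.899, "SF": 11.634, "VIF": 0.748, "AG": 3.234},
    "MSFAFusion": {"SD": 47.804, "EN": 7.319, "SF": 14.499, "VIF": 1.021, "AG": 4.146},
    "Ours":       {"SD": 47.673, "EN": 7.412, "SF": 17.289, "VIF": 1.093, "AG": 5.339},
}
METRICS = ("SD", "EN", "SF", "VIF", "AG")


def ranking(metric: str) -> list:
    """Method names sorted from best to worst, assuming higher is better."""
    return sorted(TABLE_1, key=lambda name: TABLE_1[name][metric], reverse=True)


def gap_to_best_other(metric: str, method: str = "Ours") -> float:
    """Difference between `method` and the best competing method (negative if behind)."""
    best_other = max(row[metric] for name, row in TABLE_1.items() if name != method)
    return round(TABLE_1[method][metric] - best_other, 3)


# For each metric: (rank of "Ours", margin over the best competing method).
summary = {m: (ranking(m).index("Ours") + 1, gap_to_best_other(m)) for m in METRICS}

for metric, (rank, gap) in summary.items():
    print(f"{metric}: rank {rank}, gap {gap:+.3f}")
```

Keeping the table in one dictionary makes it straightforward to check each stated margin against the tabulated values.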
[0109] For qualitative verification, image pairs were selected from two typical scenarios: normal daytime lighting and low nighttime lighting. The fusion results are shown in Figure 5 and Figure 6, where (a) is the infrared image, (b) is the visible light image, (c)-(i) are the results of the comparison algorithms, and (j) is the result of the present invention. In daytime scenes, infrared images clearly highlight the thermal radiation features of the target but lack background texture; visible light images contain rich scene structure, but the contrast between the target and the background is low. Among the comparison algorithms, the fusion results of U2Fusion and ITFuse suffer from insufficient target saliency and blurred background texture; SeAFusion and MaeFuse overemphasize infrared information, losing visible light details and producing an overall dark image; SFINet, MLFuse, and MSFAFusion balance target and structure but still show local artifacts or unclear edges. In the fusion results of the present invention, Figure 5(j) and Figure 6(j), the target outline is clear and sharp, the thermal saliency is prominent, and the background texture, edge details, and scene structure of the visible light image are fully preserved. The overall contrast is balanced and the visual effect is natural, with no artifacts or structural misalignment, so the advantages of both modalities are presented.
[0110] In low-light nighttime scenes, visible light images suffer from insufficient illumination, resulting in significant noise and blurred details, while infrared images clearly capture target information. The fusion results of the comparison algorithms generally suffer from noise amplification, target-background confusion, or loss of structural details. Some algorithms even allow background textures to contaminate the target area, which severely affects subsequent visual tasks. The present invention, relying on the structure-aware capability of the diffusion style feature prior (DFP) module and the alignment and fusion mechanism of the CMF module, suppresses noise interference in low-light environments while enhancing the detail representation of weakly textured areas. The fusion result exhibits clear target-background boundaries, structural integrity, low noise, and strong visual consistency, demonstrating good adaptability to complex, low-quality scenes.
[0111] The combined quantitative and qualitative verification results show that the present invention, through the collaborative design of diffusion style feature prior construction and multi-scale cross-modal alignment fusion, exhibits significant advantages in information integration, structure preservation, visual consistency, and cross-modal alignment. It can generate high-quality fused images, providing more reliable input data support for subsequent high-level vision tasks such as target detection and recognition.
[0112] The above are merely preferred embodiments of the present invention and are not intended to limit its scope of protection. For those skilled in the art, the present invention can have various modifications and variations. Any changes, modifications, substitutions, integrations, or parameter changes made to these embodiments within the spirit and principles of the present invention, whether through conventional substitution or to achieve the same function, fall within the scope of protection of the present invention.
Claims
1. An infrared-visible image fusion method based on diffusion style feature prior, characterized in that it comprises:
acquiring an infrared image and a visible light image;
inputting the visible light image and a randomly sampled time step into the diffusion style feature prior module (DFP), generating spatially adaptive feature maps through the diffusion-style U-Net network, and obtaining a structure-aware visible light representation and a feature prior through feature partitioning and an iterative update mechanism;
converting the structure-aware visible light representation to a luminance-chrominance space and applying conventional contrast adjustment to the luminance channel;
extracting multi-level features from the infrared image and the normalized visible light luminance channel, respectively, using the multi-scale encoders in the cross-modal fusion module (CMF), and introducing a cross-modal alignment filter (CMAF) at each feature scale to align and fuse the infrared and visible light features;
unifying the channel dimension of the fused features at each scale through the channel alignment layer Align, concatenating them, and generating a fused luminance image through the fusion prediction head;
combining the fused luminance image with the chromaticity components of the visible light image to reconstruct the final fused image; and
optimizing the model parameters using a phased training strategy and an unsupervised fusion loss function, the unsupervised fusion loss function including a brightness preservation constraint, a gradient consistency constraint, and a color consistency constraint.
2. The infrared-visible image fusion method based on diffusion style feature prior as described in claim 1, characterized in that the diffusion-style U-Net network has a network structure based on ResNet Blocks, consisting of a downsampling path, an intermediate module, an upsampling path, and a time step embedding module (TSE). The time step embedding module injects the time step parameter into each ResNet Block and modulates the feature response within the block additively, to adapt to the feature prediction requirements of different time steps. Furthermore, the diffusion-style U-Net is used for conditional feature prediction, performs forward inference only once, and involves no noise sampling or iterative denoising.
3. The infrared-visible image fusion method based on diffusion style feature prior as described in claim 2, characterized in that the iterative update mechanism operates as follows: the original visible light image is taken as the initial state; the feature sub-maps obtained by feature mapping are selected in sequence, and the update is carried out sequentially in channel order using an explicit nonlinear mapping function with deterministic rules; after this process is repeated a preset number of times, the structure-aware visible light representation is output.
4. The infrared-visible image fusion method based on diffusion style feature prior as described in claim 1, characterized in that the multi-scale encoder is a shared structure that extracts three levels of hierarchical feature representations through multiple convolutions. The spatial resolution of the features remains the same at every level, while the number of channels increases progressively. These levels represent hierarchical information from fine details to high-level semantics, providing multi-dimensional feature support for cross-modal alignment.
5. The infrared-visible image fusion method based on diffusion style feature prior as described in claim 4, characterized in that the cross-modal alignment filter (CMAF) includes normalization, feature enhancement through depthwise separable convolution, attention-based cross-modal feature matching, and a residual fusion output. The cross-modal feature matching achieves spatial alignment by calculating a correlation matrix, and the residual fusion uses a learnable scalar coefficient to adjust the contribution of the attention output.
6. The infrared-visible image fusion method based on diffusion style feature prior as described in claim 1, characterized in that the phased training strategy is as follows: in the first phase, only the diffusion style feature prior module (DFP) is pre-trained using visible light images, to learn a stable structural prior under regularization constraints and brightness/color consistency constraints; in the second phase, the parameters of the DFP module are fixed and only the cross-modal fusion module (CMF) is trained, to specifically optimize cross-modal feature alignment and fusion.
7. The infrared-visible image fusion method based on diffusion style feature prior as described in claim 1, characterized in that the loss function for the brightness preservation constraint is: $L_{bri} = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \left| Y_f(i,j) - Y_v(i,j) \right|$, where $Y_f(i,j)$ is the pixel value of the fused luminance image at position $(i,j)$, $Y_v(i,j)$ is the pixel value of the original visible light luminance channel at position $(i,j)$, and $H$ and $W$ are the height and width of the image, respectively; the loss constrains the brightness consistency between the fused image and the visible light image.
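8. The infrared-visible image fusion method based on diffusion style feature prior as described in claim 1, characterized in that the loss function for the gradient consistency constraint is: $L_{grad} = \frac{1}{2HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \left( \left| \nabla_x Y_f(i,j) - \nabla_x Y_v(i,j) \right| + \left| \nabla_y Y_f(i,j) - \nabla_y Y_v(i,j) \right| \right)$, where $\nabla_x$ and $\nabla_y$ are the gradient operators in the horizontal and vertical directions, respectively; the loss maintains the edge and structural integrity of the fused image.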
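9. The infrared-visible image fusion method based on diffusion style feature prior as described in claim 1, characterized in that the loss function for the color consistency constraint is: $L_{col} = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \left( \left| C_{1,f}(i,j) - C_{1,v}(i,j) \right| + \left| C_{2,f}(i,j) - C_{2,v}(i,j) \right| \right)$, where $C_{1,f}(i,j)$ and $C_{2,f}(i,j)$ are the chromaticity components of the fused image at position $(i,j)$, and $C_{1,v}(i,j)$ and $C_{2,v}(i,j)$ are the chromaticity components of the original visible light image at position $(i,j)$; the loss avoids color distortion during fusion.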
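10. The infrared-visible image fusion method based on diffusion style feature prior according to any one of claims 1, 7, 8, and 9, characterized in that the total loss of the unsupervised fusion loss function is: $L_{total} = \lambda_1 L_{bri} + \lambda_2 L_{grad} + \lambda_3 L_{col}$, where $L_{bri}$, $L_{grad}$, and $L_{col}$ are the brightness preservation, gradient consistency, and color consistency losses, and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are loss balancing weights used to jointly optimize the brightness, structure, and color quality of the fused image.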