An infrared and visible light image fusion method and system based on model training optimization
By constructing simulated infrared and visible light image views together with high-dimensional semantic features derived from text descriptions, and combining them with the Stable Diffusion UNet architecture, the method addresses the problem of insufficient training supervision in infrared and visible light image fusion and achieves high-quality, semantically controllable fusion results.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GUANGDONG UNIV OF TECH
- Filing Date
- 2026-03-23
- Publication Date
- 2026-06-19
Figure CN122243765A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of image processing technology, and in particular to an infrared and visible light image fusion method and system based on model training optimization.
Background Technology
[0002] Infrared and visible light image fusion (IVIF) is a key technology in computer vision, aiming to combine the thermal target perception capabilities of infrared images with the rich texture details of visible light images to generate fused images that are more comprehensive and suitable for complex scenarios such as security monitoring and autonomous driving. However, this technology has long faced several core challenges.
[0003] First, image fusion tasks lack a definable, unique "ideal ground truth," making it difficult to perform accurate end-to-end supervised training of the model. Existing solutions often address this by constructing pseudo-ground truths, such as generating pseudo-modal pairs from natural images or relying on the output of low-light enhancement networks to guide fusion. However, these pseudo-ground truths differ significantly in modal domain from real imaging scenes, which easily leads to error accumulation and impairs the model's generalization ability. Another approach relies on self-supervised strategies such as masked image modeling, but existing masking methods generally use overly simple masking strategies and have weak cross-modal feature alignment capabilities.
[0004] Secondly, existing methods struggle to achieve precise and semantically controllable fusion. Traditional methods often rely on low-level visual losses such as gradients and structural similarity for optimization, failing to explicitly express high-level semantic intentions such as "highlighting hot targets, preserving details, and suppressing degradation." In recent years, although some works have attempted to introduce textual descriptions for semantic guidance, they mostly employ a decoupled paradigm of "fusion first, editing later," or only perform post-modulation of textual features. This fails to deeply and end-to-end integrate high-level semantic constraints into the entire fusion generation process, resulting in limited or even ineffective semantic guidance.
[0005] Therefore, in the absence of a true fusion ground truth, designing a stable and effective training supervision mechanism and deeply integrating semantics into the generation process is a problem that current infrared and visible light image fusion technology urgently needs to solve.
Summary of the Invention
[0006] To address at least one of the aforementioned technical problems, this invention proposes an infrared and visible light image fusion method and system based on model training optimization.
[0007] The first aspect of this invention provides an infrared and visible light image fusion method based on model training optimization, comprising:
Acquiring at least one high-definition natural image, performing standardized preprocessing on the high-definition natural image, and generating a base view for multimodal imaging simulation;
Generating a simulated infrared image view and a simulated visible light image view based on the base view, performing a degradation operation on both views to construct a degraded input image pair that simulates the differences found in real imaging, and combining this pair with the high-definition natural image to form a "dual degraded view input - original high-definition ground truth image" training sample pair;
Inputting the original high-definition ground truth image from the training sample pair into the Qwen-VL visual language large model to generate a text description covering scene type, thermal target distribution, visible light texture details, and image degradation, and mapping the text description into a high-dimensional semantic feature vector;
Constructing an infrared and visible light image fusion model with the Stable Diffusion UNet architecture as the fusion backbone network, mapping the simulated infrared image view and the simulated visible light image view to the latent space to obtain infrared latent features and visible light latent features, and performing dual-modal feature alignment and fusion on these features to output dual-modal latent features;
Inputting the dual-modal latent features and the high-dimensional semantic feature vector into the infrared and visible light image fusion model for training; then inputting the infrared image and the visible light image to be fused into the trained model, extracting dual-modal latent features through dual-branch encoding, applying the high-level constraints of the text semantic features, and outputting the final fused infrared and visible light image through DDIM reverse diffusion sampling.
[0008] In this scheme, acquiring at least one high-definition natural image, performing standardized preprocessing on the high-definition natural image, and generating a base view for multimodal imaging simulation specifically involves: acquiring at least one high-definition natural image, cropping it to a preset standard pixel size, and normalizing the pixel values of the cropped image; removing noise from the normalized image with a small-scale Gaussian kernel; calculating the mean and standard deviation of the pixel values of the denoised image; and applying a linear transformation to each pixel value so that the processed image has a pixel mean of 0 and a standard deviation of 1, thereby obtaining the standardized base view for multimodal imaging simulation.
[0009] In this scheme, the step of generating a simulated infrared image view and a simulated visible light image view based on the base view, performing a degradation operation on both views to construct a degraded input image pair simulating the differences in real imaging, and forming a "dual degraded view input - original high-definition ground truth image" training sample pair with the high-definition natural image specifically involves: Two identical copies of the base view for multimodal imaging simulation are generated, serving as the simulated infrared image view and the simulated visible light image view, respectively. A mask block library containing various predetermined geometric shapes and sizes is defined, and a master mask image is generated based on a preset mask ratio: the shape, position, and size of the non-zero mask regions in the master mask image are determined by random sampling from the mask block library, so that the proportion of the total mask area to the entire image area reaches a preset threshold. A spatially aligned copy of the master mask is then inverted: pixel positions that are 1 in the master mask image are assigned 0, and pixel positions that are 0 are assigned 1. This yields a slave mask image that is pixel-complementary to the master mask image, so that the non-zero regions of the two masks are complementary. The master mask is applied to the simulated infrared image view and the slave mask is applied to the simulated visible light image view, giving spatially aligned and pixel-complementary mask coverage areas. A degradation operation is performed on the masked areas. The degradation operation is randomly selected from a predefined degradation operation library that includes Gaussian blur, random noise, brightness shift, and contrast shift. Complementary degradation is performed on a preset proportion of the masked area, and joint degradation is performed on the remaining proportion. The degraded simulated infrared image view and simulated visible light image view are combined with the high-definition ground truth image to form the "dual degraded view input - original high-definition ground truth image" training sample pair.
[0010] In this scheme, inputting the original high-definition ground truth image from the training sample pair into the Qwen-VL visual language large model to generate a text description covering scene type, thermal target distribution, visible light texture details, and image degradation status, and mapping the text description into a high-dimensional semantic feature vector, specifically involves: The original high-definition ground truth image from the training sample pair is input into the Qwen-VL visual language large model. Following a scene text description generation instruction, the model generates a fine-grained scene text description of the image, covering scene type, number and distribution of significant thermal targets, key visible light textures and environmental details, and image degradation. The fine-grained scene text description is treated as a text string composed of a sequence of words and is input into the text encoder module of the Stable Diffusion UNet architecture, which is a text encoder based on the CLIP model architecture. The encoder tokenizes the input text string, converts it into a sequence of token indexes, and feeds this sequence into the embedding layer of the text encoder. Each token index is mapped to a fixed-dimensional word vector, giving the initial embedding vector sequence of the text. This sequence is input into a multi-layer Transformer encoder, where it is context-encoded through a multi-head self-attention mechanism and a feedforward neural network so that semantic associations between words are aggregated and extracted. The vector sequence output by the encoder is pooled to produce a fixed-dimension, high-dimensional semantic feature vector that carries global semantic information.
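The text-to-vector path described above (tokenize, embed, contextualize with self-attention and a feedforward layer, pool) can be illustrated with a toy NumPy sketch. It is not the CLIP text encoder: the vocabulary size, embedding width, hash-based tokenizer, single attention head, random weights, and mean pooling are all assumptions chosen to keep the example self-contained, and the names `tokenize`, `encode_text`, and `init_params` are hypothetical.

```python
import zlib
import numpy as np

VOCAB, DIM = 1000, 16  # hypothetical sizes, far smaller than a real text encoder

def tokenize(text):
    """Map each whitespace-separated word to a stable integer index (toy tokenizer)."""
    return [zlib.crc32(w.encode("utf-8")) % VOCAB for w in text.lower().split()]

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def init_params(seed=0):
    """Random stand-ins for pretrained weights."""
    rng = np.random.default_rng(seed)
    shapes = {"embed": (VOCAB, DIM), "wq": (DIM, DIM), "wk": (DIM, DIM),
              "wv": (DIM, DIM), "w1": (DIM, 4 * DIM), "w2": (4 * DIM, DIM)}
    return {name: rng.normal(0.0, 0.1, shape) for name, shape in shapes.items()}

def encode_text(text, params):
    ids = tokenize(text)
    x = params["embed"][ids]                                   # (L, DIM) initial embeddings
    q, k, v = x @ params["wq"], x @ params["wk"], x @ params["wv"]
    x = x + softmax(q @ k.T / np.sqrt(DIM)) @ v                # self-attention + residual
    x = x + np.maximum(x @ params["w1"], 0.0) @ params["w2"]   # feedforward + residual
    return x.mean(axis=0)                                      # mean pooling -> (DIM,)

params = init_params()
vec = encode_text("night street, two pedestrians as thermal targets, mild blur", params)
print(vec.shape)
```

Mean pooling is used here only for simplicity; a production encoder would use its own pooling rule and pretrained weights.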
[0011] In this scheme, an infrared and visible light image fusion model is constructed based on the Stable Diffusion UNet architecture and a fusion backbone network. The simulated infrared image view and the simulated visible light image view are mapped to the latent space to obtain infrared latent features and visible light latent features. The infrared latent features and visible light latent features are then aligned and fused in a dual-modal manner to output the dual-modal latent features. Specifically: An infrared and visible light image fusion model is constructed using the Stable Diffusion UNet model as the backbone network for image fusion. LoRA low-rank adaptive fine-tuning is applied to all attention layers in the UNet model. During training, only the weight parameters of the LoRA are optimized, and the original pre-trained weights of the UNet model are frozen. A dual-branch parallel encoder is constructed. Based on the Stable Diffusion variational autoencoder, downsampling features are extracted from the simulated infrared image view and the simulated visible light image view, respectively. The image view is encoded into a latent space feature representation with fixed dimension to obtain infrared latent features and visible light latent features. Both infrared latent features and visible light latent features are three-dimensional tensors. A cross-modal interactive attention module is constructed, taking the infrared latent features and the visible light latent features as inputs. The cross-modal interactive attention module performs cross-attention calculation on the two, using the latent features of one modality as the query vector and the latent features of the other modality as both the key vector and the value vector for feature interaction, and calculates a complementary attention map. The attention map is weighted and fused with the corresponding input latent features to output the fused features. 
The output features of the fusion process are then concatenated along the channel dimension to obtain aligned and deeply fused bimodal latent features.
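A minimal NumPy sketch of the cross-modal interaction just described: each modality acts as the query against the other modality's keys and values, the attended result is combined with the query-side input, and the two outputs are concatenated along the channel dimension. The learned projection matrices are omitted (identity projections), the residual sum stands in for the weighted fusion, and the tensor sizes are assumptions made for illustration; `cross_attend` and `fuse` are hypothetical names.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query_feat, context_feat):
    """query_feat, context_feat: (C, H, W) latents. Returns fused (C, H, W) and attention map."""
    c, h, w = query_feat.shape
    q = query_feat.reshape(c, h * w).T        # (HW, C) tokens of the query modality
    kv = context_feat.reshape(c, h * w).T     # (HW, C) tokens used as both key and value
    attn = softmax(q @ kv.T / np.sqrt(c))     # (HW, HW) attention map
    out = q + attn @ kv                       # combine attended context with the input
    return out.T.reshape(c, h, w), attn

def fuse(ir_latent, vis_latent):
    ir_out, _ = cross_attend(ir_latent, vis_latent)    # infrared queries visible
    vis_out, _ = cross_attend(vis_latent, ir_latent)   # visible queries infrared
    return np.concatenate([ir_out, vis_out], axis=0)   # concatenate along channels

rng = np.random.default_rng(0)
ir, vis = rng.standard_normal((4, 8, 8)), rng.standard_normal((4, 8, 8))
print(fuse(ir, vis).shape)  # channel count doubles after concatenation
```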
[0012] In this scheme, the step of inputting the dual-modal latent features and the high-dimensional semantic feature vector into the infrared and visible light image fusion model for training specifically involves: The number of iterations, initial learning rate, batch size, and loss function weight coefficients of the Stable Diffusion model are initialized. During forward propagation, random noise with the same shape as the dual-modal latent features is sampled from the standard normal distribution and added to the dual-modal latent features according to the noise coefficient of the preset diffusion time step, giving the noisy latent features at the current time step. The noisy latent features, the high-dimensional semantic feature vector, and the embedding vector of the current diffusion time step are input into the UNet network of the infrared and visible light image fusion model. The UNet network processes them through its encoder, bottleneck layer, decoder, and cross-attention modules and outputs the predicted noise. When calculating the model loss, the predicted noise is compared with the random noise actually added in the noising step to construct the diffusion noise prediction loss. The reconstruction output of the UNet network is decoded by the variational autoencoder of Stable Diffusion into a reconstructed image, and the pixel-level difference between this reconstructed image and the original high-definition ground truth image in the training sample pair gives the mask reconstruction loss. A fusion knowledge prior loss and a semantic consistency loss are also introduced, and the diffusion noise prediction loss, mask reconstruction loss, fusion knowledge prior loss, and semantic consistency loss are weighted and summed to obtain the total training loss. Based on the total loss, the trainable parameters of the LoRA fine-tuning module in the infrared and visible light image fusion model are updated by gradient backpropagation until the model converges, which completes the training of the infrared and visible light image fusion model.
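The noising step and the weighted total loss can be sketched in NumPy as follows. The linear beta schedule, the mean squared error for the noise term, the L1 distance for the reconstruction term, and the weight values are assumptions for illustration only; the fusion knowledge prior loss and the semantic consistency loss are passed in as precomputed scalars because their exact form is not given here. `alpha_bar`, `add_noise`, and `total_loss` are hypothetical names.

```python
import numpy as np

def alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(z0, t, abar, rng):
    """Forward diffusion: z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(abar[t]) * z0 + np.sqrt(1.0 - abar[t]) * eps
    return zt, eps

def total_loss(eps_pred, eps, recon, target, prior_loss, semantic_loss,
               weights=(1.0, 1.0, 0.1, 0.1)):
    """Weighted sum of the four loss terms; the weights are placeholders."""
    diffusion = np.mean((eps_pred - eps) ** 2)      # noise prediction loss
    mask_recon = np.mean(np.abs(recon - target))    # pixel-level reconstruction loss
    terms = (diffusion, mask_recon, prior_loss, semantic_loss)
    return float(sum(w * term for w, term in zip(weights, terms)))

rng = np.random.default_rng(0)
abar = alpha_bar()
z0 = rng.standard_normal((8, 4, 4))                 # stand-in for dual-modal latents
zt, eps = add_noise(z0, t=500, abar=abar, rng=rng)
print(total_loss(eps, eps, z0, z0, prior_loss=0.0, semantic_loss=0.0))
```

In an actual training loop the predicted noise would come from the UNet and only the LoRA parameters would receive gradient updates; this sketch covers the arithmetic only.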
[0013] In this scheme, the step of inputting the infrared image and the visible light image to be fused into the trained infrared and visible light image fusion model, extracting dual-modal latent features through dual-branch encoding, applying the high-level constraints of the text semantic features, and outputting the final fused infrared and visible light image through DDIM reverse diffusion sampling specifically involves: A pixel-registered infrared image and visible light image to be fused are obtained and input into the trained infrared and visible light image fusion model. They are encoded separately by the dual-branch parallel encoder to obtain the infrared image latent features and the visible light image latent features to be fused. These latent features are input into the cross-modal interactive attention module for dual-modal feature alignment and fusion, giving the fused dual-modal latent features. The infrared and visible light image pair to be fused is input into the Qwen-VL visual language large model to generate a fine-grained text description of the scene, and the text description is encoded by the CLIP text encoder to obtain the high-dimensional semantic feature vector for the fusion stage. Following the DDIM reverse diffusion sampling algorithm, an initial noise latent variable is sampled from the standard normal distribution. Conditioned on the fused dual-modal latent features and the high-dimensional semantic feature vector of the fusion stage, the UNet network of the infrared and visible light image fusion model guides step-by-step iterative denoising. After a predetermined number of sampling steps, the denoised fused latent variable is obtained. The denoised fused latent variable is input into the decoder of the Stable Diffusion variational autoencoder to obtain the final fused infrared and visible light image.
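The iterative denoising can be illustrated with a deterministic DDIM loop (eta = 0) in NumPy. The noise predictor is passed in as a plain function `eps_model(z, t)`, which in the method above would be the conditioned UNet; here it is a hypothetical placeholder, and the linear beta schedule and step count are likewise assumptions.

```python
import numpy as np

def alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def ddim_sample(eps_model, shape, abar, steps=50, rng=None):
    """Deterministic DDIM sampling from pure noise down to a clean latent."""
    if rng is None:
        rng = np.random.default_rng()
    ts = np.linspace(len(abar) - 1, 0, steps).round().astype(int)
    z = rng.standard_normal(shape)                  # initial noise latent
    for i, t in enumerate(ts):
        eps = eps_model(z, t)                       # predicted noise at step t
        z0 = (z - np.sqrt(1.0 - abar[t]) * eps) / np.sqrt(abar[t])  # clean estimate
        a_prev = abar[ts[i + 1]] if i + 1 < len(ts) else 1.0
        z = np.sqrt(a_prev) * z0 + np.sqrt(1.0 - a_prev) * eps      # move to previous step
    return z

# Usage with an oracle predictor that knows the target latent (for illustration only).
abar = alpha_bar()
target = np.full((4, 8, 8), 0.5)
oracle = lambda z, t: (z - np.sqrt(abar[t]) * target) / np.sqrt(1.0 - abar[t])
latent = ddim_sample(oracle, target.shape, abar, rng=np.random.default_rng(0))
print(np.allclose(latent, target))
```

The returned latent would then be passed to the variational autoencoder decoder to produce the fused image.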
[0014] A second aspect of the present invention also provides an infrared-visible image fusion system based on model training optimization. The system includes a memory and a processor. The memory includes a program for an infrared-visible image fusion method based on model training optimization. When the program for the infrared-visible image fusion method based on model training optimization is executed by the processor, it implements the steps of the infrared-visible image fusion method based on model training optimization as described in any of the preceding claims.
[0015] This invention discloses an infrared and visible light image fusion method and system based on model training optimization. First, simulated infrared and visible light views are generated from high-definition natural images and degraded, constructing "dual degraded view input - ground truth" training sample pairs. Second, the ground truth image is input into a visual language model to generate text describing the scene, thermal targets, textures, and degradation, and the text is mapped to high-dimensional semantic features. A fusion model is constructed with the Stable Diffusion UNet as the backbone, mapping and fusing the simulated views into dual-modal latent features. During the training phase, these latent features are combined with the semantic vectors for optimization. In application, the images to be fused are input into the trained model, and dual-modal latent features are extracted through dual-branch encoding. Under high-level textual semantic constraints, the final fused infrared and visible light image is output through DDIM reverse diffusion sampling. This invention uses semantically guided generative training to improve the quality of the fused image.
Attached Figure Description
[0016] Figure 1 shows a flowchart of an infrared and visible light image fusion method based on model training optimization according to the present invention; Figure 2 shows a flowchart of mapping text descriptions to high-dimensional semantic feature vectors according to the present invention; Figure 3 shows a flowchart of training the infrared and visible light image fusion model of the present invention; Figure 4 shows a block diagram of an infrared and visible light image fusion system based on model training optimization according to the present invention.
Detailed Implementation
[0017] To better understand the above-mentioned objectives, features, and advantages of the present invention, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, unless otherwise specified, the embodiments and features described in these embodiments can be combined with each other.
[0018] Many specific details are set forth in the following description in order to provide a full understanding of the invention. However, the invention may also be practiced in other ways different from those described herein, and therefore the scope of protection of the invention is not limited to the specific embodiments disclosed below.
[0019] Figure 1 shows the flowchart of an infrared and visible light image fusion method based on model training optimization according to the present invention.
[0020] As shown in Figure 1, the first aspect of the present invention provides an infrared and visible light image fusion method based on model training optimization, comprising:
S102: acquire at least one high-definition natural image, perform standardized preprocessing on the high-definition natural image, and generate a base view for multimodal imaging simulation;
S104: generate a simulated infrared image view and a simulated visible light image view based on the base view, perform a degradation operation on both views to construct a degraded input image pair that simulates the differences in real imaging, and form a "dual degraded view input - original high-definition ground truth image" training sample pair with the high-definition natural image;
S106: input the original high-definition ground truth image from the training sample pair into the Qwen-VL visual language large model to generate a text description covering scene type, thermal target distribution, visible light texture details, and image degradation, and map the text description into a high-dimensional semantic feature vector;
S108: construct an infrared and visible light image fusion model with the Stable Diffusion UNet architecture as the fusion backbone network, map the simulated infrared image view and the simulated visible light image view to the latent space to obtain infrared latent features and visible light latent features, and perform dual-modal feature alignment and fusion on these features to output dual-modal latent features;
S110: input the dual-modal latent features and the high-dimensional semantic feature vector into the infrared and visible light image fusion model for training; then input the infrared image and the visible light image to be fused into the trained model, extract dual-modal latent features through dual-branch encoding, apply the high-level constraints of the text semantic features, and output the final fused infrared and visible light image through DDIM reverse diffusion sampling.
[0021] It should be noted that, starting from a single high-definition natural RGB image, two simulated views (representing infrared and visible light, respectively) are generated through cropping, normalization, and preprocessing. Irregular complementary masks are then generated for these views, and various degradation operations, including Gaussian blur, random noise, and brightness shift, are applied to selected areas. This constructs training sample pairs containing degraded inputs and the original ground truth, providing a stable pixel-level supervision signal in the absence of a true fusion ground truth. Simultaneously, the Qwen-VL visual language model is used to generate fine-grained text descriptions for the infrared and visible light images of the same scene, and these descriptions are transformed into semantic feature vectors by the CLIP text encoder, serving as high-level semantic constraints. Finally, the UNet of the pre-trained Stable Diffusion XL model serves as the backbone network, retaining its generative capabilities while a lightweight LoRA fine-tuning module keeps training costs low. A dual-branch parallel encoder extracts latent features from the degraded infrared and visible light images respectively, and a cross-modal interactive attention module aligns and fuses the dual-modal features. The fused latent features are concatenated with the noisy latent variables of the diffusion process and input into the UNet. Textual semantic feature vectors are injected end to end into each step of the diffusion denoising process through a cross-attention mechanism, so that dual-modal visual conditions and textual semantic conditions jointly constrain generation. During training, noise prediction is the optimization objective, and the mask reconstruction loss, fusion knowledge prior loss, semantic consistency loss, and diffusion loss are jointly optimized, enabling the model to converge stably end to end without a ground truth. In the inference stage, the registered infrared and visible light images are input and undergo the same preprocessing, and Qwen-VL generates the corresponding scene text description, which is encoded into semantic feature vectors. The preprocessed images and the text semantic feature vectors are then input into the trained fusion model. The model extracts and fuses dual-modal latent features through dual-branch encoding and cross-modal interaction and, under the high-level constraints of the text semantics, outputs the fused image directly after DDIM reverse diffusion sampling. The result is robust, semantically controllable, detail-preserving infrared and visible light image fusion in complex scenes, with excellent cross-scene generalization ability.
[0022] According to an embodiment of the present invention, the step of acquiring at least one high-definition natural image, performing standardized preprocessing on the high-definition natural image, and generating a base view for multimodal imaging simulation specifically includes: acquiring at least one high-definition natural image, cropping it to a preset standard pixel size, and normalizing the pixel values of the cropped image; removing noise from the normalized image with a small-scale Gaussian kernel; calculating the mean and standard deviation of the pixel values of the denoised image; and applying a linear transformation to each pixel value so that the processed image has a pixel mean of 0 and a standard deviation of 1, thereby obtaining the standardized base view for multimodal imaging simulation.
[0023] It should be noted that, based on a single high-definition natural RGB image as input, the input image is first uniformly cropped to a resolution of 512×512, and the pixel values are linearly normalized from 0-255 to the range of [-1,1]. Simultaneously, lightweight Gaussian denoising and global contrast normalization preprocessing are performed to eliminate the interference of differences in image quality and dynamic range between different images on training stability.
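The preprocessing chain above (crop, linear normalization to [-1, 1], light Gaussian denoising, zero-mean and unit-variance standardization) can be sketched in NumPy. The center crop, the kernel radius and sigma, and the function names `gaussian_blur` and `make_base_view` are assumptions for illustration; the input image must be at least as large as the crop size.

```python
import numpy as np

def gaussian_blur(img, sigma=0.8, radius=2):
    """Separable Gaussian blur with reflect padding; img has shape (H, W, C)."""
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    out = img.astype(np.float64)
    for axis in (0, 1):                              # blur rows, then columns
        pad = [(0, 0)] * out.ndim
        pad[axis] = (radius, radius)
        padded = np.pad(out, pad, mode="reflect")
        n = out.shape[axis]
        acc = np.zeros_like(out)
        for i, w in enumerate(kernel):
            acc += w * np.take(padded, np.arange(i, i + n), axis=axis)
        out = acc
    return out

def make_base_view(rgb_uint8, size=512):
    """Center-crop to size x size, map [0, 255] to [-1, 1], denoise, standardize."""
    h, w = rgb_uint8.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    crop = rgb_uint8[top:top + size, left:left + size]
    x = crop.astype(np.float64) / 127.5 - 1.0        # linear normalization
    x = gaussian_blur(x)                             # lightweight denoising
    return (x - x.mean()) / (x.std() + 1e-8)         # mean 0, standard deviation 1

image = np.random.default_rng(0).integers(0, 256, (600, 700, 3), dtype=np.uint8)
view = make_base_view(image)
print(view.shape)
```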
[0024] According to an embodiment of the present invention, the step of generating a simulated infrared image view and a simulated visible light image view based on the base view, performing a degradation operation on both views to construct a degraded input image pair simulating the differences in real imaging, and forming a "dual degraded view input - original high-definition ground truth image" training sample pair with the high-definition natural image specifically involves: Two identical copies of the base view for multimodal imaging simulation are generated, serving as the simulated infrared image view and the simulated visible light image view, respectively. A mask block library containing various predetermined geometric shapes and sizes is defined, and a master mask image is generated based on a preset mask ratio: the shape, position, and size of the non-zero mask regions in the master mask image are determined by random sampling from the mask block library, so that the proportion of the total mask area to the entire image area reaches a preset threshold. A spatially aligned copy of the master mask is then inverted: pixel positions that are 1 in the master mask image are assigned 0, and pixel positions that are 0 are assigned 1. This yields a slave mask image that is pixel-complementary to the master mask image, so that the non-zero regions of the two masks are complementary. The master mask is applied to the simulated infrared image view and the slave mask is applied to the simulated visible light image view, giving spatially aligned and pixel-complementary mask coverage areas. A degradation operation is performed on the masked areas. The degradation operation is randomly selected from a predefined degradation operation library that includes Gaussian blur, random noise, brightness shift, and contrast shift. Complementary degradation is performed on a preset proportion of the masked area, and joint degradation is performed on the remaining proportion. The degraded simulated infrared image view and simulated visible light image view are combined with the high-definition ground truth image to form the "dual degraded view input - original high-definition ground truth image" training sample pair.
[0025] It should be noted that two identical image views are generated from the preprocessed high-definition source RGB image, serving as a simulated infrared view and a simulated visible light view, respectively. Irregular complementary masks of 512×512 are generated for the two views. The mask blocks are randomly selected from three sizes, 16×16, 32×32, and 64×64, and the two masks satisfy a pixel-level complementary relationship. A complementary degradation operation is performed on 75% of the mask area, and a joint degradation operation is performed on the remaining 25%. The degradation operation is randomly selected from Gaussian blur, random noise mask, Gaussian noise, and brightness shift. The modal differences between infrared and visible light images are simulated through this differentiated degradation processing, producing the degraded simulated infrared view and the degraded simulated visible light view. The original high-definition source RGB image is used as the global reconstruction supervision benchmark to construct a training sample pair of "dual degraded view input - original high-definition image ground truth", which provides a stable pixel-level supervision signal when no real fusion ground truth is available. The degradation operations simulate various degradations found in real imaging scenarios. Complementary degradation applies a degradation operation to the area covered by the master mask in the simulated infrared image view while the original information is preserved at the corresponding position of the slave mask in the simulated visible light image view, and vice versa. Joint degradation applies the same degradation operation to the simulated infrared image view and the simulated visible light image view at the same spatial location (specified by the non-zero intersection or union of the master mask and slave mask).
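One possible reading of the mask and degradation scheme is sketched below in NumPy. The master mask is assembled from square blocks of 16, 32, and 64 pixels until a mask ratio is reached, and the slave mask is its pixel-wise inverse. How the 25% joint region is chosen is not fully specified above, so this sketch marks a random quarter of 16×16 grid cells as joint, which is an assumption, as are the mask ratio of 0.5, the two toy degradations, and the names `make_master_mask`, `degrade`, and `build_pair`.

```python
import numpy as np

def make_master_mask(size=512, ratio=0.5, blocks=(16, 32, 64), rng=None):
    """Paste random square blocks until the masked fraction reaches `ratio`."""
    if rng is None:
        rng = np.random.default_rng()
    mask = np.zeros((size, size), dtype=np.uint8)
    while mask.mean() < ratio:
        b = int(rng.choice(blocks))
        y = int(rng.integers(0, size - b + 1))
        x = int(rng.integers(0, size - b + 1))
        mask[y:y + b, x:x + b] = 1
    return mask

def degrade(img, rng):
    """Pick one toy degradation: additive Gaussian noise or a brightness shift."""
    if rng.integers(0, 2) == 0:
        return img + rng.normal(0.0, 0.2, img.shape)
    return img + rng.uniform(-0.5, 0.5)

def build_pair(base, ratio=0.5, joint_frac=0.25, cell=16, seed=0):
    """base: (H, W, C) with H == W divisible by `cell`. Returns both degraded views and masks."""
    rng = np.random.default_rng(seed)
    size = base.shape[0]
    master = make_master_mask(size, ratio, rng=rng)
    slave = 1 - master                                   # pixel-complementary mask
    cells = rng.random((size // cell, size // cell)) < joint_frac
    joint = np.kron(cells, np.ones((cell, cell))).astype(bool)
    shared, ir_deg, vis_deg = (degrade(base, rng) for _ in range(3))
    j = joint[..., None]
    # Joint cells: same degradation in both views. Elsewhere: degrade the infrared
    # view under the master mask and the visible view under the slave mask.
    ir = np.where(j, shared, np.where(master[..., None] == 1, ir_deg, base))
    vis = np.where(j, shared, np.where(slave[..., None] == 1, vis_deg, base))
    return ir, vis, master, slave, joint

base = np.random.default_rng(1).uniform(-1.0, 1.0, (128, 128, 3))
ir, vis, master, slave, joint = build_pair(base)
print(master.mean(), joint.mean())
```

The triple (ir, vis, base) then plays the role of one "dual degraded view input - original high-definition image ground truth" sample.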
[0026] Figure 2 shows the flowchart of mapping text descriptions to high-dimensional semantic feature vectors according to the present invention.
[0027] According to an embodiment of the present invention, the step of inputting the original high-definition ground truth image from the training sample pair into the Qwen-VL visual language large model to generate a text description covering scene type, thermal target distribution, visible light texture details, and image degradation status, and mapping the text description into a high-dimensional semantic feature vector, specifically involves: The original high-definition ground truth image from the training sample pair is input into the Qwen-VL visual language large model. Following a scene text description generation instruction, the model generates a fine-grained scene text description of the image, covering scene type, number and distribution of significant thermal targets, key visible light textures and environmental details, and image degradation. The fine-grained scene text description is treated as a text string composed of a sequence of words and is input into the text encoder module of the Stable Diffusion UNet architecture, which is a text encoder based on the CLIP model architecture. The encoder tokenizes the input text string, converts it into a sequence of token indexes, and feeds this sequence into the embedding layer of the text encoder. Each token index is mapped to a fixed-dimensional word vector, giving the initial embedding vector sequence of the text. This sequence is input into a multi-layer Transformer encoder, where it is context-encoded through a multi-head self-attention mechanism and a feedforward neural network so that semantic associations between words are aggregated and extracted. The vector sequence output by the encoder is pooled to produce a fixed-dimension, high-dimensional semantic feature vector that carries global semantic information.
[0028] It should be noted that, for semantic guidance, the original high-resolution RGB source image is input into the Qwen-VL-7B visual language large model, and a predefined scene-specific text guidance prompt is used to generate a fine-grained text description of the scene. The generated text description is then input into the SDXL native CLIP text encoder to obtain a 77×768-dimensional text semantic feature vector, which serves as a high-level semantic constraint in the fusion generation process. This feature vector acts as a high-level semantic prior and is injected end to end into each iteration of the diffusion denoising process through the Stable Diffusion cross-attention mechanism, addressing the semantic uncontrollability that arises when traditional methods rely solely on low-level visual losses. Compared with using a general text encoder alone, the scene-specific text description generated by the Qwen-VL visual language large model captures the features of infrared and visible light cross-modal scenes more accurately. Through the SDXL native cross-attention mechanism, the text semantic constraints and the diffusion denoising process are deeply fused end to end, overcoming the limitations of the decoupled paradigm. Users can adjust the target prominence, detail preservation, and degradation suppression of the fusion result by modifying the text description, which improves semantic controllability and scene adaptability. The scene text description generation instruction is a predefined instruction that guides the model to generate a text description specific to infrared-visible light fusion scenes. It explicitly requires the generated description to cover four core dimensions: scene type, number and location of salient thermal targets, key visible light textures and environmental details, and image degradation issues.
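The encoding path described above (word segmentation, index lookup, embedding, self-attention, pooling) can be illustrated with a toy NumPy sketch. It is not the CLIP encoder: the vocabulary, the random weights, the single attention head, and the choice of mean pooling are all assumptions made for illustration, and the names `encode_text`, `VOCAB`, and `DIM` are hypothetical.

```python
import numpy as np

# Hypothetical toy vocabulary and sizes; a real CLIP tokenizer uses subword units.
VOCAB = {"<unk>": 0, "night": 1, "road": 2, "two": 3, "pedestrians": 4, "blur": 5}
DIM = 8

rng = np.random.default_rng(0)
EMBED = rng.normal(size=(len(VOCAB), DIM))  # embedding layer (random stand-in)
W_Q, W_K, W_V = (rng.normal(size=(DIM, DIM)) for _ in range(3))


def encode_text(text):
    """Map a description string to one fixed-size vector (toy, one attention layer)."""
    tokens = text.lower().split()                             # naive word segmentation
    ids = [VOCAB.get(tok, VOCAB["<unk>"]) for tok in tokens]  # word -> index
    x = EMBED[ids]                                            # (seq_len, DIM) embeddings
    q, k, v = x @ W_Q, x @ W_K, x @ W_V
    scores = q @ k.T / np.sqrt(DIM)                           # scaled dot-product attention
    scores -= scores.max(axis=1, keepdims=True)               # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    context = weights @ v                                     # context-encoded word vectors
    return context.mean(axis=0)                               # assumed mean pooling -> (DIM,)


short_vec = encode_text("night road")
long_vec = encode_text("night road two pedestrians mild blur")
```

Descriptions of different lengths map to vectors of the same size, which is the property the pooling step is meant to provide.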
[0029] According to an embodiment of the present invention, the infrared and visible light image fusion model constructed based on the Stable Diffusion UNet architecture and the fusion backbone network maps the simulated infrared image view and the simulated visible light image view to the latent space respectively, obtains infrared latent features and visible light latent features, performs bimodal feature alignment and fusion on the infrared latent features and visible light latent features, and outputs bimodal latent features, specifically: An infrared and visible light image fusion model is constructed using the Stable Diffusion UNet model as the backbone network for image fusion. LoRA low-rank adaptive fine-tuning is applied to all attention layers in the UNet model. During training, only the weight parameters of the LoRA are optimized, and the original pre-trained weights of the UNet model are frozen. A dual-branch parallel encoder is constructed. Based on the Stable Diffusion variational autoencoder, downsampling features are extracted from the simulated infrared image view and the simulated visible light image view, respectively. The image view is encoded into a latent space feature representation with fixed dimension to obtain infrared latent features and visible light latent features. Both infrared latent features and visible light latent features are three-dimensional tensors. A cross-modal interactive attention module is constructed, taking the infrared latent features and the visible light latent features as inputs. The cross-modal interactive attention module performs cross-attention calculation on the two, using the latent features of one modality as the query vector and the latent features of the other modality as both the key vector and the value vector for feature interaction, and calculates a complementary attention map. 
It should be noted that downsampling feature extraction means that the model processes the input simulated infrared or visible light image through its encoder (such as the variational autoencoder in Stable Diffusion), progressively reducing the spatial resolution (height and width) while increasing the number of feature channels, thereby extracting high-level, abstract semantic feature representations from the raw pixel data. The infrared latent features and visible light latent features are the feature tensors obtained from the infrared and visible light image views, respectively, through this downsampling encoding process. They reside in the same latent space, and each encapsulates the key information of its modality (thermal radiation information or texture details). The complementary attention map is the key intermediate result computed in the cross-modal interactive attention module. Specifically, the latent features of one modality (e.g., infrared features) serve as the "query" and are used to compute the correlation (attention weights) with each position of the latent features of the other modality (e.g., visible light features). The resulting weight map is called the attention map.
[0030] The attention map is weighted and fused with the corresponding input latent features to output the fused features. The output features of the fusion process are then concatenated along the channel dimension to obtain aligned and deeply fused bimodal latent features.
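The cross-modal interaction and channel concatenation described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: there are no learned query, key, or value projections, the attended features are combined with the query features by a simple residual sum, and the function names `cross_attend` and `fuse_bimodal` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(query_feat, context_feat):
    """Attend from one modality to the other. Inputs are (C, H, W) latents."""
    c, h, w = query_feat.shape
    q = query_feat.reshape(c, h * w).T       # (HW, C): queries from one modality
    kv = context_feat.reshape(c, h * w).T    # (HW, C): keys and values from the other
    attn = softmax(q @ kv.T / np.sqrt(c))    # (HW, HW) complementary attention map
    attended = attn @ kv                     # information gathered from the other modality
    fused = q + attended                     # assumed residual combination
    return fused.T.reshape(c, h, w), attn


def fuse_bimodal(ir_latent, vis_latent):
    """Run attention in both directions and concatenate along the channel axis."""
    ir_fused, _ = cross_attend(ir_latent, vis_latent)
    vis_fused, _ = cross_attend(vis_latent, ir_latent)
    return np.concatenate([ir_fused, vis_fused], axis=0)  # (2C, H, W)


rng = np.random.default_rng(0)
ir = rng.standard_normal((4, 8, 8))    # small stand-in for an infrared latent
vis = rng.standard_normal((4, 8, 8))   # small stand-in for a visible light latent
bimodal = fuse_bimodal(ir, vis)
```

A small 8×8 spatial size is used here only to keep the attention matrix tiny; the latent size in the description would produce a much larger map.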
[0031] It should be noted that in the construction and training optimization of the fusion network, a pre-trained SDXL 1.0 UNet is used as the fusion backbone network. A LoRA fine-tuning module with rank r=64 is added to all attention layers, the original SDXL weights are frozen, and a dual-branch VAE encoder is designed to map the degraded image to a 64×64×4 latent space. Cross-modal interactive attention modules complete the bimodal feature alignment and fusion, outputting fused latent features. A network architecture is constructed based on the powerful pre-trained Stable Diffusion UNet, with targeted enhancements to cross-modal information interaction capabilities. The core lies in using LoRA technology to perform lightweight fine-tuning of the UNet attention layers, retaining the model's rich prior knowledge while requiring only a few parameters to adapt it to specific fusion tasks. A dual-branch parallel encoder efficiently maps infrared and visible light views to feature representations in the same high-dimensional latent space. The cross-modal interactive attention module performs cross-attention calculations using features from one modality as queries and features from the other modality as keys and values. This mechanism forces the model to actively seek the most relevant and complementary information from the features of another modality during information interaction. Essentially, it establishes a soft correspondence and attention matching relationship between features at the latent space level. In this way, infrared features dominated by thermal radiation and visible light features rich in texture details can be deeply and adaptively aligned and fused at the information level, ultimately stitching together a comprehensive and complementary bimodal fusion representation at the channel dimension.
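To make the LoRA idea concrete, the following NumPy sketch shows a frozen weight matrix with a trainable low-rank update of rank 64. Only the rank comes from the description; the scaling factor `alpha`, the zero initialization of one factor, the 640×640 layer size, and the class name `LoRALinear` are assumptions that follow common LoRA practice rather than anything stated in the source.

```python
import numpy as np


class LoRALinear:
    """Linear layer y = x W^T + scale * x A^T B^T with W frozen (illustrative only)."""

    def __init__(self, weight, rank=64, alpha=64.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                 # frozen pre-trained weight
        self.a = rng.normal(0.0, 0.01, size=(rank, in_dim))  # trainable factor A
        self.b = np.zeros((out_dim, rank))                   # trainable factor B (zero init)
        self.scale = alpha / rank

    def __call__(self, x):
        frozen_out = x @ self.weight.T
        lora_out = (x @ self.a.T) @ self.b.T
        return frozen_out + self.scale * lora_out

    def trainable_params(self):
        return self.a.size + self.b.size


frozen = np.random.default_rng(1).normal(size=(640, 640))  # hypothetical attention projection
layer = LoRALinear(frozen, rank=64)
x = np.ones((2, 640))
```

With these assumed sizes the low-rank factors hold 81,920 values against 409,600 in the frozen matrix, and because one factor starts at zero the layer initially reproduces the pre-trained output exactly.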
[0032] Figure 3 shows a flowchart of the training process for the infrared and visible light image fusion model of the present invention.
[0033] According to an embodiment of the present invention, the step of inputting the dual-modal latent features and high-dimensional semantic feature vectors into the infrared-visible light image fusion model for training specifically involves: Initialize the number of iterations, initial learning rate, batch size, and loss function weight coefficients of the Stable Diffusion model. During forward propagation of the model, sample random noise with the same shape as the bimodal latent features from the standard normal distribution. Add noise to the bimodal latent features according to the noise coefficient corresponding to the preset diffusion time step to obtain the noisy latent features at the current time step. The noisy latent features, the high-dimensional semantic feature vector, and the embedding vector corresponding to the current diffusion time step are input into the UNet network of the infrared-visible light image fusion model. The UNet network is processed by the encoder, bottleneck layer, decoder, and cross-attention module to output the predicted noise. When calculating the model loss, the predicted noise is compared with the real random noise of the degradation operation to construct the diffusion noise prediction loss, and the pixel-level mask reconstruction loss between the reconstructed image obtained by decoding the UNet network reconstruction output through the variational autoencoder of the Stable Diffusion and the original high-definition ground truth image in the training sample pair is calculated to obtain the mask reconstruction loss. By introducing fusion knowledge prior loss and semantic consistency loss, the diffusion noise prediction loss, mask reconstruction loss, fusion knowledge prior loss and semantic consistency loss are weighted and summed to obtain the total loss of model training. 
Based on the total loss, the trainable parameters of the LoRA fine-tuning module in the infrared-visible image fusion model are updated using the gradient backpropagation algorithm until the model converges, thus completing the training of the infrared-visible image fusion model.
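The noising step and the noise prediction loss in the training procedure above can be sketched as follows. The linear beta schedule, the 1000 training steps, the mean squared error form of the loss, and the 8×64×64 latent shape (two concatenated 4-channel latents) are assumptions; the source only states that noise is added according to the coefficient of the current time step and compared with the predicted noise.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(latent, t, alpha_bar, rng):
    """Return the noisy latent at step t together with the noise that was added."""
    noise = rng.standard_normal(latent.shape)  # same shape as the bimodal latent
    noisy = np.sqrt(alpha_bar[t]) * latent + np.sqrt(1.0 - alpha_bar[t]) * noise
    return noisy, noise


def noise_prediction_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and actually added noise."""
    return float(np.mean((predicted_noise - true_noise) ** 2))


rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
latent = rng.standard_normal((8, 64, 64))  # stand-in for the bimodal latent features
noisy, noise = add_noise(latent, 500, alpha_bar, rng)
```

In training, the UNet would receive `noisy`, the time step embedding, and the semantic feature vector, and its output would take the place of `predicted_noise`.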
[0034] It should be noted that the optimization objective of the diffusion model is to learn to predict the random Gaussian noise added during the diffusion process. The fused latent features and the noisy latent variables from the diffusion process are concatenated and input into the UNet encoder. At the same time, the text semantic feature vector T is input into the cross-attention modules of each layer of the UNet to achieve dual-condition joint injection. The mask reconstruction loss, fusion prior knowledge loss, semantic consistency loss, and diffusion loss are jointly optimized, with weight coefficients of 1.0, 0.8, 0.5, and 1.0, respectively. The global batch size for training is set to 64 and the learning rate to 1×10^-5, and the model is trained end to end until convergence. The mask reconstruction loss is obtained by calculating the difference between corresponding regions of the reconstructed image and the original high-definition ground truth image. The fusion prior knowledge loss compares the reconstructed image with the simulated infrared image view and the simulated visible light image view in terms of gradient, structure, and edge features, so that the fusion result retains the thermal target saliency of the infrared image and the texture details of the visible light image. The semantic consistency loss computes the cosine similarity between the high-dimensional semantic feature vector and the image feature vector extracted by the image encoder branch of the CLIP model, so that the semantic content of the fused image stays consistent with the high-level constraints of the text description.
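The weighted combination of the four losses can be written out directly. The weights below are the ones given in the paragraph above; the dictionary keys, the function names, and the use of one minus cosine similarity as the semantic consistency term are assumptions for illustration.

```python
import math

# Weights from the description: mask reconstruction, fusion prior, semantic, diffusion.
LOSS_WEIGHTS = {
    "mask_reconstruction": 1.0,
    "fusion_prior": 0.8,
    "semantic_consistency": 0.5,
    "diffusion": 1.0,
}


def total_loss(losses, weights=LOSS_WEIGHTS):
    """Weighted sum of the individual loss values."""
    missing = set(weights) - set(losses)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * losses[name] for name in weights)


def semantic_consistency_loss(text_vec, image_vec):
    """Assumed form: 1 - cosine similarity between text and image feature vectors."""
    dot = sum(a * b for a, b in zip(text_vec, image_vec))
    norm_t = math.sqrt(sum(a * a for a in text_vec))
    norm_i = math.sqrt(sum(b * b for b in image_vec))
    return 1.0 - dot / (norm_t * norm_i)
```

For example, if every individual loss equals 1.0, the total is 1.0 + 0.8 + 0.5 + 1.0 = 3.3.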
[0035] According to an embodiment of the present invention, in the infrared and visible light image fusion model that inputs the infrared image to be fused and the visible light image into the trained model, dual-modal latent features are extracted through dual-branch coding, combined with high-level constraints of text semantic features, and the final infrared and visible light fused image is output through DDIM inverse diffusion sampling, specifically as follows: The pixel-registered infrared image and visible light image to be fused are obtained. The infrared image and visible light image to be fused are input into the trained infrared and visible light image fusion model. They are encoded separately by a dual-branch parallel encoder to obtain the latent features of the infrared image and the latent features of the visible light image to be fused. The infrared image latent features and visible light image latent features to be fused are input into the cross-modal interactive attention module for dual-modal feature alignment and fusion to obtain the fused dual-modal latent features. The infrared image and visible light image pair to be fused are input into the Qwen-VL visual language large model to generate a fine-grained text description corresponding to the scene of the image pair. The text description is encoded by the CLIP text encoder to obtain a high-dimensional semantic feature vector for the fusion stage. Based on the DDIM inverse diffusion sampling algorithm, initial noise latent variables are sampled from the standard normal distribution. Combined with the fused dual-modal latent features and the high-dimensional semantic feature vector of the fusion stage, under the guidance of the UNet network of the infrared and visible light image fusion model, iterative denoising is performed step by step. After a predetermined number of sampling steps, the denoised fused latent variables are obtained. 
The denoised fusion latent variables are input into the decoder of the variational autoencoder of the Stable Diffusion to obtain the final infrared-visible fusion image.
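The iterative denoising described above can be sketched with a deterministic DDIM update (no added noise per step). This is a sketch under assumptions: the noise schedule is a linear beta schedule, the conditioning on fused latent features and the text vector is hidden inside the hypothetical `eps_model` callable, and an oracle predictor that already knows the target stands in for the trained UNet so that the loop can be run and checked.

```python
import numpy as np


def make_alpha_bar(num_train_steps=1000, beta_start=1e-4, beta_end=0.02):
    betas = np.linspace(beta_start, beta_end, num_train_steps)
    return np.cumprod(1.0 - betas)


def ddim_sample(eps_model, shape, alpha_bar, num_steps, rng):
    """Deterministic DDIM sampling from pure noise over num_steps time steps."""
    timesteps = np.linspace(len(alpha_bar) - 1, 0, num_steps).round().astype(int)
    x = rng.standard_normal(shape)  # initial noise latent
    for i, t in enumerate(timesteps):
        eps = eps_model(x, t)       # predicted noise at step t
        x0_pred = (x - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
        # After the last step the latent is taken to be fully denoised.
        ab_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        x = np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps
    return x


alpha_bar = make_alpha_bar()
target = np.full((4, 8, 8), 0.5)  # hypothetical clean latent


def oracle(x, t):
    """Stand-in for the UNet: returns the exact noise relative to the known target."""
    return (x - np.sqrt(alpha_bar[t]) * target) / np.sqrt(1.0 - alpha_bar[t])


result = ddim_sample(oracle, target.shape, alpha_bar, num_steps=20,
                     rng=np.random.default_rng(0))
```

With a perfect noise predictor the loop recovers the target latent; in the described method the result would then be passed to the VAE decoder.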
[0036] It should be noted that, during the inference phase, this invention takes registered infrared and visible light images as input. First, both images are cropped to a uniform resolution of 512×512, consistent with the training phase, and pixel values are linearly normalized from the 0-255 range to the [-1,1] range to complete standardization preprocessing. No mask generation or degradation injection is performed at any point in the process. For semantic guidance, the preprocessed infrared and visible light images are input as a pair into the Qwen-VL-7B model, and a dedicated infrared-visible light fusion text guidance prompt is used to generate a scene text description, which is then input into the SDXL native CLIP text encoder to obtain the corresponding 77×768-dimensional text semantic feature vector. In the fusion generation stage, the preprocessed infrared and visible light images, along with the generated text semantic feature vector, are input into the trained fusion model. The model extracts latent features from the two modal images using the dual-branch VAE encoder, aligns and fuses them with the cross-modal interactive attention module, and then, under the high-level constraints of the text semantic features, performs DDIM inverse diffusion sampling to directly output the final infrared-visible light fused image.
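The inference preprocessing (crop to 512×512 and linear normalization from 0-255 to [-1,1]) can be sketched as below. The source does not say where the crop window is placed, so the center crop is an assumption, and the function name `preprocess` is hypothetical.

```python
import numpy as np


def preprocess(image, size=512):
    """Center-crop a uint8 image to size x size and scale pixel values to [-1, 1]."""
    h, w = image.shape[:2]
    if h < size or w < size:
        raise ValueError("image is smaller than the crop size")
    top, left = (h - size) // 2, (w - size) // 2
    crop = image[top:top + size, left:left + size]
    return crop.astype(np.float32) / 127.5 - 1.0  # 0 -> -1.0, 255 -> 1.0


dark = preprocess(np.zeros((600, 800), dtype=np.uint8))
bright = preprocess(np.full((600, 800, 3), 255, dtype=np.uint8))
```

The same function would be applied to both the infrared and the visible light image so that the registered pair stays pixel-aligned after cropping.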
[0037] According to an embodiment of the present invention, it further includes: After the model training is completed, the fusion quality evaluation results of the infrared and visible light image fusion model under various preset typical scenarios and the running time records of the corresponding DDIM reverse diffusion sampling steps are obtained. Based on the correlation between the fusion quality evaluation results and the running time records, an adaptive decision table for the number of sampling steps is constructed. The adaptive decision table for sampling steps is embedded into the inference interface of the infrared and visible light image fusion model. When the infrared and visible light images to be fused are received, the decision table is queried according to the scene complexity and real-time requirements of the images to determine the appropriate sampling step range. The infrared image and the visible light image to be fused are input into the trained fusion model. The dual-modal latent features are extracted through dual-branch encoding. Combined with the high-level constraints of text semantic features, the dual-modal latent features, the high-dimensional semantic feature vector, and the determined sampling step interval parameters are input into the UNet network. The number of iterations is dynamically adjusted during the DDIM reverse diffusion sampling process according to the sampling step interval. Stepwise denoising is performed within the sampling step range to generate denoised fusion latent variables, which are then input into the variational autoencoder decoder of the Stable Diffusion to output the final fused image that significantly reduces computational latency while ensuring fusion quality meets scene requirements.
[0038] It should be noted that during the fusion inference stage, since DDIM inverse diffusion sampling requires multiple iterations of denoising to generate high-quality fusion latent variables, the more sampling steps there are, the better the generated image performs in terms of detail preservation and thermal radiation information restoration. However, the computation time increases linearly, which will cause serious delays in real-time scenarios that require millisecond-level response, such as drone inspection and vehicle night vision. This forces the system to make a trade-off between image quality and speed, affecting the efficiency of target detection and decision-making. To this end, after the model training is completed, this invention first runs the fusion model under various preset typical scenarios and records the fusion quality evaluation results and running time corresponding to different sampling steps. By analyzing the correlation between the two, an adaptive decision table that can dynamically adjust the number of sampling steps according to the scene complexity and real-time requirements is constructed, and this table is embedded in the inference interface. In practical applications, after the system receives the infrared and visible light images to be fused, it queries the decision table according to the scene complexity and real-time requirements of the image to determine a sampling step range that balances quality and speed. Then, the dual-modal latent features obtained by dual-branch encoding of the image and the text semantic features are fed into the UNet network together. Within this range, DDIM inverse diffusion sampling is performed to reduce the number of iterations, thereby significantly reducing the computational latency while maintaining the quality of the fused image to meet the scene requirements, achieving synergistic optimization of real-time performance and fusion effect.
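One simple way to build and query such a decision table is sketched below. The selection rule (keep every step count that meets a quality threshold within a latency budget, then report the smallest and largest), the scenario labels, and all numbers in `records` are hypothetical placeholders, not measurements from the invention.

```python
def build_decision_table(records, min_quality, max_latency_ms):
    """records: iterable of (scenario, steps, quality_score, latency_ms) tuples."""
    table = {}
    for scenario in {r[0] for r in records}:
        acceptable = [
            steps
            for name, steps, quality, latency in records
            if name == scenario and quality >= min_quality and latency <= max_latency_ms
        ]
        if acceptable:
            table[scenario] = (min(acceptable), max(acceptable))  # step range
    return table


def lookup_steps(table, scenario, default=(20, 50)):
    """Return the sampling step range for a scenario, or a fallback range."""
    return table.get(scenario, default)


# Placeholder evaluation records (scenario, steps, quality, latency in ms).
records = [
    ("simple_daytime", 5, 0.70, 40),
    ("simple_daytime", 10, 0.82, 80),
    ("simple_daytime", 20, 0.85, 160),
    ("dense_night", 10, 0.60, 85),
    ("dense_night", 20, 0.81, 170),
    ("dense_night", 50, 0.88, 420),
]
table = build_decision_table(records, min_quality=0.8, max_latency_ms=200)
```

At inference time the selected range would bound the number of DDIM iterations, with the fallback range used for scenarios that were not profiled.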
[0039] Figure 4 shows a block diagram of an infrared-visible image fusion system based on model training optimization according to the present invention.
[0040] A second aspect of the present invention provides an infrared-visible image fusion system based on model training optimization. The system includes a memory 401, a processor 402, and a communication interface 403. The memory includes a program for an infrared-visible image fusion method based on model training optimization. The communication interface is used for data connection and communication between the memory and the processor. When the program for the infrared-visible image fusion method based on model training optimization is executed by the processor, it implements the steps of the infrared-visible image fusion method based on model training optimization as described in any of the above claims.
[0041] This invention discloses an infrared and visible light image fusion method and system based on model training optimization. First, simulated infrared and visible light views are generated from high-resolution natural images and degraded, constructing dual-degraded view input-ground truth training sample pairs. Second, the ground truth image is input into a visual language model to generate text describing the scene, thermal targets, textures, and degradation, and the text is mapped to high-dimensional semantic features. Third, a fusion model is constructed with the Stable Diffusion UNet as the backbone, and the simulated views are mapped and fused into dual-modal latent features. During the training phase, these latent features are combined with the semantic vectors for optimization. In application, the images to be fused are input into the trained model, and dual-modal latent features are extracted through dual-branch encoding. Under high-level textual semantic constraints, the final infrared and visible light fused image is output through DDIM inverse diffusion sampling. By using semantically guided generative training, this invention improves the quality of the fused image.
[0042] Those skilled in the art will understand that all or part of the steps of the above method embodiments can be implemented by hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium. When the program is executed, it performs the steps of the above method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as mobile storage devices, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0043] Alternatively, if the integrated units of this invention are implemented as software functional modules and sold or used as independent products, they can also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of this invention, or the parts that contribute to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the methods described in the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as mobile storage devices, ROM, RAM, magnetic disks, or optical disks.
[0044] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.
Claims
1. An infrared-visible light image fusion method based on model training optimization, characterized in that the method includes the following steps: Acquire at least one high-definition natural image, perform standardized preprocessing on the high-definition natural image, and generate a base view for multimodal imaging simulation; Based on the base view, a simulated infrared image view and a simulated visible light image view are generated. A degradation operation is performed on the simulated infrared image view and the simulated visible light image view to construct a degradation input image pair that simulates the differences of real imaging. This pair, together with the high-definition natural image, constitutes a dual-degraded view input-original high-definition ground truth image training sample pair. The original high-definition ground truth images from the training sample pairs are input into the Qwen-VL visual language large model to generate text descriptions containing scene type, thermal target distribution, visible light texture details and image degradation. The text descriptions are then mapped into high-dimensional semantic feature vectors. An infrared and visible light image fusion model is constructed based on the Stable Diffusion UNet architecture and a fusion backbone network. The simulated infrared image view and the simulated visible light image view are mapped to the latent space respectively to obtain infrared latent features and visible light latent features. The infrared latent features and visible light latent features are aligned and fused in a dual-modal manner to output the dual-modal latent features. The dual-modal latent features and high-dimensional semantic feature vectors are input into the infrared-visible image fusion model for training. The infrared image and the visible light image to be fused are input into the trained infrared-visible image fusion model. 
The dual-modal latent features are extracted through dual-branch encoding, combined with high-level constraints of textual semantic features, and the final infrared-visible fused image is output through DDIM inverse diffusion sampling.
2. The infrared-visible image fusion method based on model training optimization according to claim 1, characterized in that, The process of acquiring at least one high-resolution natural image, performing standardized preprocessing on the high-resolution natural image, and generating a base view for multimodal imaging simulation specifically involves: Acquire at least one high-definition natural image, crop the high-definition natural image to a preset standard pixel value, and perform pixel value normalization processing on the cropped image; Noise removal is performed on the normalized image using a small-scale Gaussian kernel. The mean and standard deviation of the pixel values of the noise-removed image are calculated. A linear transformation is performed on each pixel value of the image so that the mean of the pixel values of the processed image is 0 and the standard deviation is 1, thus obtaining the basic view of the standardized preprocessed multimodal imaging simulation.
3. The infrared-visible image fusion method based on model training optimization according to claim 1, characterized in that, The process involves generating simulated infrared and simulated visible light image views based on the base view, performing degradation operations on these views to construct a degradation input image pair simulating real imaging differences, and combining this pair with the high-definition natural image to form a dual-degraded view input-original high-definition ground truth image training sample pair. Specifically: Two identical copies of the base view of the multimodal imaging simulation are generated, serving as the simulated infrared image view and the simulated visible light image view, respectively. Define a mask block library containing various predetermined geometric shapes and sizes, generate a master mask image based on a preset mask ratio, and determine the shape, position and size of the non-zero mask regions in the master mask image by randomly sampling from the mask block library, so that the proportion of the total mask area to the entire image area reaches a preset threshold. Generate a slave mask image that is perfectly aligned with the master mask image in the spatial dimension by assigning a value of 0 to the pixel positions that are 1 in the master mask image and a value of 1 to the pixel positions that are 0 in the master mask image. The resulting slave mask image is pixel-complementary to the master mask image, that is, the non-zero regions of the master mask and the slave mask are complementary. The master mask is applied to the simulated infrared image view, and the slave mask is applied to the simulated visible light image view, to obtain spatially aligned and pixel-complementary mask coverage areas. A degradation operation is performed on the masked area. 
The degradation operation is randomly selected from a predefined degradation operation library, including Gaussian blur, random noise, brightness shift, and contrast shift. Complementary degradation is performed on a preset proportion of the masked area, and joint degradation is performed on the remaining proportion of the masked area. The simulated infrared image view and the simulated visible light image view, which have undergone degradation processing, are combined with the high-definition ground truth image to form a dual-degraded view input-original high-definition ground truth image training sample pair.
4. The infrared-visible image fusion method based on model training optimization according to claim 1, characterized in that, The original high-definition ground truth images from the training sample pairs are input into the Qwen-VL visual language large model to generate text descriptions containing scene type, thermal target distribution, visible light texture details, and image degradation. These text descriptions are then mapped into high-dimensional semantic feature vectors, specifically: The original high-definition ground truth images from the training sample pairs are input into the Qwen-VL visual language large model. The visual language large model generates fine-grained scene text descriptions of the original high-definition ground truth images based on scene text description generation instructions. These descriptions include scene type, number and distribution of salient thermal targets, key visible light textures and environmental details, and descriptions of image degradation. The fine-grained scene text description is treated as a text string composed of a sequence of words and input into the text encoder module of the Stable Diffusion UNet architecture. The text encoder module is a text encoder based on the CLIP model architecture. The encoder performs word segmentation on the input text string, converts the text string into a sequence of word indexes, and inputs the sequence of word indexes into the embedding layer of the text encoder. Each word index is mapped to a fixed-dimensional word vector to obtain the initial embedding vector sequence of the text. The initial embedding vector sequence is input into a multi-layer Transformer encoder. The word vector sequence is context-encoded through a multi-head self-attention mechanism and a feedforward neural network. Semantic association information between words is aggregated and extracted. 
The vector sequence output by the encoder is pooled to generate a fixed-dimensional high-dimensional semantic feature vector containing global semantic information.
5. The infrared-visible image fusion method based on model training optimization according to claim 1, characterized in that, The infrared and visible light image fusion model, based on the Stable Diffusion UNet architecture and fused with a backbone network, maps the simulated infrared image view and the simulated visible light image view to the latent space, respectively, to obtain infrared latent features and visible light latent features. The infrared latent features and visible light latent features are then aligned and fused in a dual-modal manner to output the dual-modal latent features. Specifically: An infrared and visible light image fusion model is constructed using the Stable Diffusion UNet model as the backbone network for image fusion. LoRA low-rank adaptive fine-tuning is applied to all attention layers in the UNet model. During training, only the weight parameters of the LoRA are optimized, and the original pre-trained weights of the UNet model are frozen. A dual-branch parallel encoder is constructed. Based on the Stable Diffusion variational autoencoder, downsampling features are extracted from the simulated infrared image view and the simulated visible light image view, respectively. The image view is encoded into a latent space feature representation with fixed dimension to obtain infrared latent features and visible light latent features. Both infrared latent features and visible light latent features are three-dimensional tensors. A cross-modal interactive attention module is constructed, taking the infrared latent features and the visible light latent features as inputs. The cross-modal interactive attention module performs cross-attention calculation on the two, using the latent features of one modality as the query vector and the latent features of the other modality as both the key vector and the value vector for feature interaction, and calculates a complementary attention map. 
The attention map is weighted and fused with the corresponding input latent features to output the fused features. The output features of the fusion process are then concatenated along the channel dimension to obtain aligned and deeply fused bimodal latent features.
6. The infrared and visible light image fusion method based on model training optimization according to claim 1, characterized in that the step of inputting the dual-modal latent features and the high-dimensional semantic feature vectors into the infrared and visible light image fusion model for training is specifically as follows:
The number of iterations, initial learning rate, batch size, and loss function weight coefficients of the Stable Diffusion model are initialized.
During forward propagation of the model, random noise with the same shape as the dual-modal latent features is sampled from the standard normal distribution. Noise is added to the dual-modal latent features according to the noise coefficient corresponding to the preset diffusion time step, yielding the noisy latent features at the current time step.
The noisy latent features, the high-dimensional semantic feature vector, and the embedding vector corresponding to the current diffusion time step are input into the UNet network of the infrared and visible light image fusion model. The UNet network processes these inputs through its encoder, bottleneck layer, decoder, and cross-attention modules to output the predicted noise.
When the model loss is calculated, the predicted noise is compared with the random noise actually added in the degradation operation to construct the diffusion noise prediction loss. The reconstruction output of the UNet network is decoded by the variational autoencoder of Stable Diffusion into a reconstructed image, and the pixel-level difference between this reconstructed image and the original high-definition ground truth image in the training sample pair is calculated to obtain the mask reconstruction loss.
A fusion knowledge prior loss and a semantic consistency loss are introduced, and the diffusion noise prediction loss, mask reconstruction loss, fusion knowledge prior loss, and semantic consistency loss are weighted and summed to obtain the total loss of model training.
Based on the total loss, the trainable parameters of the LoRA fine-tuning module in the infrared and visible light image fusion model are updated using the gradient backpropagation algorithm until the model converges, thus completing the training of the infrared and visible light image fusion model.
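The noising step and the weighted total loss of claim 6 can be illustrated with a short NumPy sketch. It assumes the standard diffusion forward process, in which the noisy latent is sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * noise, and a simple linear beta schedule. The schedule, the loss weights, the stand-in for the UNet prediction, and the scalar placeholders for the three losses not computed here are all hypothetical and are not taken from the claims.

```python
import numpy as np

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative noise coefficients for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(z0, noise, alpha_bar_t):
    """Forward diffusion: noisy latent at the time step with coefficient alpha_bar_t."""
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * noise

def mse(a, b):
    """Mean squared error, used here as the noise prediction loss."""
    return float(np.mean((a - b) ** 2))

def total_loss(losses, weights):
    """Weighted sum of the individual loss terms."""
    return sum(weights[name] * losses[name] for name in weights)

rng = np.random.default_rng(0)
alpha_bars = make_alpha_bars()

z0 = rng.standard_normal((8, 8, 8))    # stand-in dual-modal latent features
noise = rng.standard_normal(z0.shape)  # same shape, standard normal
t = 500                                # preset diffusion time step (hypothetical)
z_t = add_noise(z0, noise, alpha_bars[t])

# Stand-in for the UNet output; a real model would predict this from
# z_t, the text feature vector, and the time step embedding.
predicted_noise = noise + 0.1 * rng.standard_normal(z0.shape)

losses = {
    "noise": mse(predicted_noise, noise),
    "mask_recon": 0.05,  # placeholder value, not computed in this sketch
    "prior": 0.02,       # placeholder value, not computed in this sketch
    "semantic": 0.01,    # placeholder value, not computed in this sketch
}
# Hypothetical weight coefficients for the four loss terms.
weights = {"noise": 1.0, "mask_recon": 0.5, "prior": 0.1, "semantic": 0.1}
total = total_loss(losses, weights)
print(total)
```

In an actual training loop, only the LoRA parameters would receive gradients from `total`, while the pre-trained UNet weights stay frozen.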
7. The infrared and visible light image fusion method based on model training optimization according to claim 1, characterized in that the step of inputting the infrared image and the visible light image to be fused into the trained infrared and visible light image fusion model, extracting dual-modal latent features through dual-branch encoding, combining them with the high-level constraints of text semantic features, and outputting the final infrared and visible light fused image through DDIM reverse diffusion sampling is specifically as follows:
The pixel-registered infrared image and visible light image to be fused are obtained.
The infrared image and visible light image to be fused are input into the trained infrared and visible light image fusion model and encoded separately by the dual-branch parallel encoder to obtain the latent features of the infrared image and the latent features of the visible light image to be fused.
The infrared image latent features and visible light image latent features to be fused are input into the cross-modal interactive attention module for dual-modal feature alignment and fusion to obtain the fused dual-modal latent features.
The infrared and visible light image pair to be fused is input into the Qwen-VL large vision-language model to generate a fine-grained text description of the scene shown in the image pair. The text description is encoded by the CLIP text encoder to obtain the high-dimensional semantic feature vector for the fusion stage.
Based on the DDIM reverse diffusion sampling algorithm, initial noise latent variables are sampled from the standard normal distribution. Conditioned on the fused dual-modal latent features and the high-dimensional semantic feature vector of the fusion stage, and guided by the UNet network of the infrared and visible light image fusion model, iterative denoising is performed step by step. After a predetermined number of sampling steps, the denoised fused latent variables are obtained.
The denoised fused latent variables are input into the decoder of the variational autoencoder of Stable Diffusion to obtain the final infrared and visible light fused image.
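The step-by-step denoising in claim 7 can be sketched with the deterministic DDIM update (eta = 0). This is a minimal illustration under stated assumptions: `predict_noise` is a hypothetical stand-in for the conditioned UNet (in practice it would also receive the fused dual-modal latent features and the CLIP text features), the linear beta schedule and the step count are chosen only for the example, and an oracle predictor that knows the target latent is used so the loop can be checked end to end.

```python
import numpy as np

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative noise coefficients for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def ddim_sample(predict_noise, shape, alpha_bars, num_steps, rng):
    """Deterministic DDIM sampling from pure noise down to a clean latent."""
    # Evenly spaced subset of training time steps, from noisiest to cleanest.
    timesteps = np.linspace(len(alpha_bars) - 1, 0, num_steps).round().astype(int)
    z = rng.standard_normal(shape)  # initial noise latent
    for i, t in enumerate(timesteps):
        a_t = alpha_bars[t]
        # After the last step the latent is treated as fully denoised.
        a_prev = alpha_bars[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        eps = predict_noise(z, t)
        # Estimate of the clean latent implied by the predicted noise.
        z0_pred = (z - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        # Move to the previous (less noisy) time step without adding new noise.
        z = np.sqrt(a_prev) * z0_pred + np.sqrt(1.0 - a_prev) * eps
    return z

rng = np.random.default_rng(0)
alpha_bars = make_alpha_bars()
target = rng.standard_normal((8, 8, 8))  # hypothetical "true" fused latent

def oracle_noise(z_t, t):
    """Hypothetical perfect predictor: returns the exact noise relative to target."""
    a_t = alpha_bars[t]
    return (z_t - np.sqrt(a_t) * target) / np.sqrt(1.0 - a_t)

fused_latent = ddim_sample(oracle_noise, target.shape, alpha_bars, num_steps=20, rng=rng)
print(np.allclose(fused_latent, target))
```

With a trained network in place of `oracle_noise`, the returned latent would then be passed to the VAE decoder to produce the fused image. Using far fewer sampling steps than training steps is the main practical appeal of DDIM.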
8. An infrared and visible light image fusion system based on model training optimization, characterized in that the system includes a storage device and a processor. The storage device stores a program for the infrared and visible light image fusion method based on model training optimization. When executed by the processor, the program implements the steps of the infrared and visible light image fusion method based on model training optimization as described in any one of claims 1 to 7.