Infrared and visible light image fusion system based on semantic driven space-frequency routing
By introducing a fusion architecture with DSPE, CMBA, FASRP and DFMB modules, the problems of redundant information processing and noise suppression in infrared and visible light image fusion systems under complex environments are solved, achieving efficient and adaptive image fusion effects and outputting high-quality fused images.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents (China)
- Current Assignee / Owner: NANJING UNIV OF POSTS & TELECOMM
- Filing Date: 2026-03-30
- Publication Date: 2026-06-26
AI Technical Summary
Existing infrared and visible light image fusion systems suffer from redundant information processing, high computational complexity, poor generalization ability, inadequate noise suppression, and lack of adaptability when facing complex and degraded environments, making it difficult to maintain high-quality fusion results under extreme weather conditions.
A semantically driven spatial-frequency routing fusion architecture is adopted. By introducing a dual-branch degradation-aware semantic prior extraction module (DSPE), a cross-modal cross-attention module (CMBA), a multi-layer frequency domain aware spatial routing module (FASRP), and a dynamic frequency modulation module (DFMB), deep interaction and adaptive modulation of infrared and visible light images are achieved, thereby improving the model's adaptive generalization ability and computational efficiency in complex environments.
In extreme environments, it achieves high-quality fusion of infrared target salient localization and visible light texture, improving the system's robustness and computational efficiency in complex and degraded environments, and outputting fused images with high precision.
Figure CN121961895B_ABST
Description
Technical Field
[0001] This invention belongs to the field of computer vision and image processing technology, and specifically relates to an infrared and visible light image fusion system based on semantic-driven space-frequency routing.
Background Technology
[0002] With the continuous iteration and upgrading of imaging sensor technology in terms of performance and its widespread adoption in various industries, the acquisition of multimodal image data has become increasingly convenient, greatly promoting the rapid development of multimodal image fusion technology. Among them, infrared and visible light image fusion, as a core branch of this field, has significant research and application value. Images of different modalities can reflect the radiation characteristics, texture structure, and key environmental details of the same observation area from different physical dimensions. In practical outdoor monitoring, assisted driving, and intelligent security applications, due to the complex factors of drastic changes in ambient light and target occlusion, single-modal images often cannot provide the system with comprehensive and accurate decision-making information. Infrared images can capture the thermal radiation characteristics of objects and have significant target detection advantages in low-light or smoky environments. Visible light images, on the other hand, contain rich spatial layout and detailed textures. The effective complementarity of these two modal information has become a key way to improve the reliability of all-weather visual perception.
[0003] Currently, Convolutional Neural Networks (CNNs) are widely used in the field of infrared and visible light image processing due to their excellent local feature extraction capabilities. Existing CNN-based infrared and visible light fusion systems typically follow the standard paradigm of data preprocessing, feature extraction, feature fusion, and image reconstruction, transforming the pixel space into a feature space with stronger representational capabilities through convolution operations. However, existing infrared and visible light image fusion systems still face significant technical bottlenecks when dealing with complex degradation environments.
[0004] Existing methods often employ only simple channel concatenation or element-wise addition operations in the infrared-visible feature fusion stage. While this fusion mechanism utilizes the complementarity between modes to some extent, the equivalent processing of different modal features introduces a large amount of redundant information. As the network depth increases, this redundancy not only leads to an exponential increase in computational complexity but also lacks in-depth mining of cross-modal feature interactions at the underlying level, thereby weakening the performance of downstream tasks in terms of detail preservation and salient target prominence.
[0005] When faced with extreme weather conditions such as smog, rain, and snow, or complex degraded environments, traditional fusion networks fall short in feature alignment and noise suppression. The fusion quality often fluctuates drastically with changes in environmental degradation patterns, resulting in poor generalization ability. In recent years, state-space models with global modeling capabilities and linear computational complexity have begun to be introduced into visual tasks to achieve efficient processing of high-dimensional features. However, in practical applications, the state transition parameters of traditional 2D-SSM are essentially static or have fixed content, lacking the ability to adaptively perceive complex environmental degradation factors. Early fusion methods aimed at introducing prior constraints often rely on manually designed complex rules for iterative optimization, making it difficult to effectively pass gradients to downstream tasks, thus preventing the model from achieving truly efficient end-to-end training.
[0006] Furthermore, architectures that rely solely on spatial domain modeling have inherent limitations when dealing with complex frequency domain information. When faced with environmental degradation, if the high-frequency detail components and low-frequency structural components of the image cannot be effectively decoupled and processed in a targeted manner, the system will struggle to enhance realistic textures while suppressing degradation noise. Simple spatial domain feature enhancement cannot fully utilize the distribution characteristics of the image in the frequency dimension, which greatly limits the robustness of the model in complex and unknown environments.
[0007] Based on the above, achieving a fine balance between handling modal redundancy, suppressing environmental degradation interference, and realizing adaptive frequency modulation remains a core challenge that urgently needs to be addressed in the field of infrared and visible light image fusion. A fully automatic, high-precision fusion architecture that deeply integrates high-level semantic priors with frequency-domain adaptive modulation is needed to overcome the limitations of traditional single-domain models.
Summary of the Invention
[0008] To address the aforementioned issues, this invention discloses an infrared and visible light image fusion system based on semantic-driven spatial frequency routing. By innovatively introducing a dual-branch degradation-aware semantic prior extraction module (DSPE), a cross-modal cross-attention module (CMBA), a multi-layer frequency domain-aware spatial routing module (FASRP), and a dynamic frequency modulation module (DFMB), the system significantly improves the model's adaptive generalization ability and overall computational efficiency in the face of complex degradation environments while maintaining high-precision image fusion.
[0009] The original design intention of the dual-branch degradation-aware semantic prior extraction module (DSPE) is to break the limitation of traditional single network architectures lacking high-level semantic guidance. This module uses the powerful visual language model BLIP-2 OPT to extract a two-dimensional semantic prior covering "weather context" and "physical degradation mode" from infrared and visible light images. These highly refined semantic tensors serve as global context constraints and are directly injected into the underlying feature extraction operators and frequency domain gating, achieving cross-dimensional alignment from macro-text understanding to micro-parameter modulation, which greatly improves the model's adaptive perception capability for unknown and complex degradation scenarios.
[0010] In the spatial feature extraction stage, this invention constructs a dual-branch feature extraction architecture consisting of alternating stacks of CMBA and FASRP. The CMBA module achieves deep interaction and redundancy elimination of infrared and visible light low-level features at both shallow and deep layers through feature splitting and cross-weighting. The FASRP module focuses on the refined extraction and complementary routing of multi-scale features. Its internal Local-Global Feature Extractor (LGE) combines a two-dimensional state-space model (2D-SSM) driven by text priors, maintaining linear complexity while efficiently extracting long-range spatial dependencies. FASRP guides the network to focus on shallow features in abrupt edge regions and rely on deep features in flat background regions by dynamically evaluating the spatial high-frequency energy generation routing mask, effectively solving the high-frequency detail smoothing problem that is easily caused by traditional multi-scale fusion.
[0011] The Dynamic Frequency Modulation (DFMB) module addresses the challenges of feature decoupling and noise suppression in the frequency dimension to some extent. To overcome the shortcomings of traditional spatial domain methods that tend to ignore the inherent distribution patterns in the frequency domain, DFMB utilizes Fast Fourier Transform (FFT) to transform joint spatial features to the frequency domain and rigorously decouples them into phase and amplitude spectra. Based on semantic prior guidance from the DSPE module, dynamic gating and reconstruction are performed on the segmented high-frequency, mid-frequency, and low-frequency amplitude components. This mechanism effectively isolates degradation noise, ensuring that the neural network retains the global phase information that determines the image's geometric structure while suppressing environmental interference.
[0012] The progressive decoder constructs the underlying image using visible light and infrared deep features, following a hierarchical recovery logic from coarse to fine, injecting low, medium, and high frequency features modulated in the frequency domain step by step; it uses a residual fusion mechanism based on high-fidelity visible light features, supplemented by a multi-stage deep supervised loss (Stage Loss) to determine the optimal parameter combination for the fusion process; through the above operations, it outputs a high-quality fused image that combines salient infrared target localization with visible light detail texture.
[0013] The technical solution adopted by this invention to solve its technical problem is as follows:
[0014] An infrared-visible image fusion system based on semantic-driven spatial frequency routing is proposed. The method relies on an image fusion network SFMNet, which integrates infrared-visible image feature extraction, dynamic frequency modulation, and progressive decoding. This network includes: an infrared-visible image input module, a dual-branch degradation-aware semantic prior extraction module (DSPE), a cross-modal cross-attention module (CMBA), a multi-layer frequency-domain aware spatial routing module (FASRP), a dynamic frequency modulation module (DFMB), and a progressive fusion decoding module. The specific implementation process is as follows:
[0015] The infrared and visible light image input module receives infrared and visible light images. Before entering the core feature extraction network, it performs strict data preprocessing operations on the input infrared and visible light images, including spatial size cropping and alignment and numerical normalization, so as to ensure that the dual-modal input feature tensors are highly consistent in spatial resolution and physical scale.
[0016] The dual-branch degradation-aware semantic prior extraction module (DSPE) performs cross-dimensional semantic information mining and global context feature modeling. This module introduces the parameter-sharing vision-language model BLIP-2 OPT, which receives the infrared and visible light images together with the corresponding orthogonal two-dimensional text prompts. With the help of a multilayer perceptron and a self-attention mechanism, the extracted infrared and visible light features are encoded into high-dimensional semantic tensors that play a core guiding role. These prior tensors are reduced in dimension and mapped into adaptive frequency modulation weights through a weight mapping layer and a dynamic frequency gating weight branch, thereby providing accurate macroscopic environmental constraints for the underlying infrared and visible light feature processing and dynamic frequency gating.
[0017] The cross-modal cross-attention module (CMBA) achieves cross-modal deep interaction and redundancy elimination of infrared and visible light features in the spatial domain. This module splits the infrared and visible light features into attention-guided and backbone-preserving branches through a channel splitting mechanism, uses cross element-wise addition to perform the initial fusion of low-level features, and combines global average pooling and global max pooling to aggregate global background statistics and the most salient local target responses, respectively. The cross-modal cross-guided weights refined by the multilayer perceptron then provide precise spatial feature guidance for the deep fusion and dimensional reorganization of the infrared and visible light modalities at the channel level through adaptive weighting and channel shuffling operations.
[0018] The multi-layer frequency domain-aware spatial routing module FASRP, combined with a local-global feature extractor modulated by semantic priors, achieves adaptive spatial-frequency decoupling and complementary fusion of deep low-frequency features and shallow high-frequency details. This module constructs an intermediate mask generation branch, which uses depthwise separable convolution to evaluate the high-frequency energy distribution of each pixel in the image space, generating a spatial routing mask with pixel-level accuracy. With the two-dimensional state space model 2D-SSM integrated in the local-global feature extractor, the system guides the network to focus on shallow high-resolution features in abrupt edge regions and rely on deep low-frequency features in flat background regions, thereby effectively solving the detail smearing and over-smoothing defects that are prone to occur in traditional fusion architectures.
[0019] The Dynamic Frequency Modulation (DFMB) module performs joint feature modeling and dynamic noise suppression in the spatial and frequency domain dimensions. This module receives joint infrared and visible light features as input, and uses a two-dimensional fast Fourier transform to strictly decouple the spatial features into a phase spectrum representing the structure and an amplitude spectrum representing the texture. It then uses frequency segmentation to decompose the amplitude spectrum into high-frequency, mid-frequency, and low-frequency components. Based on the semantic modulation weights output by the DSPE module, it performs channel-level adaptive gated multiplication on the amplitude components of different frequency bands. Finally, it reassembles the modulated amplitude and enhanced phase features, accurately mapping them back to the spatial domain using a two-dimensional inverse fast Fourier transform, thereby outputting fused feature components with sharp edges, rich details, and visual consistency.
[0020] The progressive fusion decoding module receives high-level features from the spatial domain after cross-routing and multi-scale frequency features modulated in the frequency domain, integrating them into a unified, highly discriminative fused image representation. This module injects modulated low-frequency, mid-frequency, and high-frequency features step-by-step into an upsampling decoding link composed of multiple basic residual blocks. Furthermore, a late-stage residual fusion mechanism for deep visible light features is introduced at the top layer of the decoding layer to preserve original edge details to the greatest extent possible. Simultaneously, a multi-stage deep supervised loss function, Stage Loss, is used to determine the optimal parameter combination for the fusion network at different scales. Finally, image reconstruction is completed via the output convolutional layer, outputting a high-quality fused image that combines infrared salient targets with visible light textures.
[0021] In the above technical solution, an infrared and visible light image fusion architecture based on semantic-driven space-frequency routing is constructed. This architecture adopts a dual-parallel mechanism across the spatial and frequency domains. In the spatial feature extraction stage, two parallel feature extraction backbones are used, one for each modality. Each backbone is composed of alternating stacks of cross-modal cross-attention modules (CMBA) and multi-layer frequency-domain-aware spatial routing modules (FASRP). Between the two modal backbones, a dual-branch degradation-aware semantic prior extraction module (DSPE) performs cross-dimensional environmental semantic mining. Subsequently, the spatial joint features of the infrared and visible light images are transformed to the frequency domain by a dynamic frequency modulation module (DFMB), where dynamic gating modulation and space-frequency integration are performed based on the semantic priors. Finally, the multi-scale frequency features and high-level spatial features are input into the fusion decoding module, and a high-quality image is reconstructed through a progressive fusion strategy. The specific implementation steps are as follows:
[0022] The system first receives infrared and visible light images of the same physical scene through the infrared and visible light image input module; it then performs rigorous data preprocessing operations on the input infrared and visible light images, including linear normalization of numerical ranges and spatial dimension alignment and cropping; the specific process includes:
[0023] After preprocessing, the input infrared image is $I_{ir} \in \mathbb{R}^{B \times C \times H \times W}$ and the visible light image is $I_{vi} \in \mathbb{R}^{B \times C \times H \times W}$, where $B$ is the number of samples, $C$ is the number of feature channels, and $H$ and $W$ are the height and width of the image, respectively. The two images are then fed into their respective initial convolutional layers for shallow feature mapping, transforming them from a pixel space lacking high-dimensional semantics into a tensor space with preliminary feature representation capability. The specific process is as follows:
[0024] $F_{vi} = \mathrm{Conv}_{3\times3}(I_{vi})$;
[0025] $F_{ir} = \mathrm{Conv}_{3\times3}(I_{ir})$;
[0026] where $\mathrm{Conv}_{3\times3}$ is the 3×3 initial feature-mapping convolution, $F_{vi}$ is the visible light feature tensor, and $F_{ir}$ is the infrared feature tensor;
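The preprocessing described above (spatial alignment by cropping, followed by linear normalization of the numerical range) can be illustrated with a minimal sketch. It is not the patented implementation: the function names are hypothetical, images are plain nested lists, and center cropping to the common size is an assumption, since the text specifies only "cropping and alignment".

```python
def center_crop(img, height, width):
    """Crop a 2D list-of-lists image to (height, width) around its center."""
    top = (len(img) - height) // 2
    left = (len(img[0]) - width) // 2
    return [row[left:left + width] for row in img[top:top + height]]


def min_max_normalize(img):
    """Linearly rescale pixel values to [0, 1]; a constant image maps to zeros."""
    flat = [p for row in img for p in row]
    lo, hi = min(flat), max(flat)
    scale = (hi - lo) or 1.0  # avoid division by zero for constant images
    return [[(p - lo) / scale for p in row] for row in img]


def preprocess_pair(ir, vis):
    """Align both modalities to their common spatial size, then normalize each.

    Assumption: the common size is the element-wise minimum of the two shapes.
    """
    h = min(len(ir), len(vis))
    w = min(len(ir[0]), len(vis[0]))
    return (min_max_normalize(center_crop(ir, h, w)),
            min_max_normalize(center_crop(vis, h, w)))
```

After this step both inputs share the same height and width and the same [0, 1] value range, which is the consistency in spatial resolution and scale that the input module is described as enforcing.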
[0027] A further improvement of this invention lies in introducing a dual-branch degradation-aware semantic prior extraction module (DSPE) to address the bottleneck of the model's lack of generalization ability when facing unknown and complex degradation scenarios, and incorporating natural language instructions into the weight calculation of the underlying image fusion filtering; the specific process includes:
[0028] First, the parameter-sharing vision-language model BLIP-2 OPT is used under the open-ended visual question answering (VQA) paradigm. Two pre-defined, mutually orthogonal text prompts are input into the model: a description of the weather and salient objects, and a description of the underlying physical degradation type. The infrared and visible light images, together with the corresponding text, are then used for inference to generate tokens containing high-dimensional semantics; the specific process is as follows:
[0029] ;
[0030] ;
[0031] where the terms denote, in order: the semantic prompt for perceiving environmental context and salient targets; the degradation prompt for diagnosing underlying physical noise; the visible light image; the infrared image; the pre-trained vision-language model with shared parameters; the initial semantic feature encoding; and the initial degradation feature encoding;
[0032] Next, self-attention purification and high-dimensional feature mapping are performed on the extracted infrared and visible light feature codes. In the two parallel branches of the DSPE module, condensed semantic tokens are extracted through a multilayer perceptron (MLP) and a self-attention mechanism. These tokens are then processed again by MLP dimensionality reduction and activation functions to generate independent high-dimensional prior feature vectors. The specific process is as follows:
[0033] ;
[0034] ;
[0035] ;
[0036] ;
[0037] where the terms denote, in order: the core semantic token; the core degradation token; the linear mapping transformation matrix; the Sigmoid normalized activation function; the high-dimensional semantic prior weights; and the high-dimensional degradation prior weights;
[0038] Finally, the dimensions are precisely compressed to the target frequency band number through a mapping network, ultimately generating dynamic frequency modulation weights to guide the fusion of underlying features; the specific process is as follows:
[0039] ;
[0040] ;
[0041] where the terms denote, in order: the joint prior tensor; the bottleneck layer for dimensionality reduction; the low-frequency weights; the mid-frequency weights; and the high-frequency weights;
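The final mapping step of the DSPE module, which compresses the prior vectors to one weight per frequency band, can be sketched as follows. This is only an illustration under stated assumptions: the joint prior is formed by concatenating the semantic and degradation priors, the bottleneck is a single dense layer, and a Sigmoid bounds each band weight to (0, 1). None of these choices is confirmed by the text, and `band_weights`, `weight`, and `bias` are hypothetical names.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(vec, weight, bias):
    """Dense layer: `weight` holds one row of coefficients per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def band_weights(sem_prior, deg_prior, weight, bias):
    """Map two prior vectors to (low, mid, high) frequency gating weights."""
    joint = list(sem_prior) + list(deg_prior)  # assumed: concatenation
    logits = linear(joint, weight, bias)       # bottleneck: len(joint) -> 3
    low, mid, high = (sigmoid(z) for z in logits)
    return low, mid, high
```

In a trained network the `weight` and `bias` values would be learned, so that a prior describing heavy fog, for example, could lower the gate of whichever band carries most of the degradation.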
[0042] A further improvement of this invention lies in embedding a cross-modal cross-attention module (CMBA) into the feature extraction architecture, aiming to achieve early fusion and information interaction of the infrared and visible light low-level features; the specific process includes:
[0043] First, the visible light features and the infrared features are each reduced in dimension by a 3×3 convolution and then split in half along the channel dimension into an attention-guided branch and a main branch. The attention-guided branches of the infrared and visible light modalities are then cross-added element-wise. The specific process is as follows:
[0044] ;
[0045] ;
[0046] ;
[0047] ;
[0048] where the terms denote, in order: the 3×3 convolution operation; the channel splitting operation; the joint spatial saliency guidance feature; the joint spatial correlation guidance feature; the shallow spatial feature map of the infrared image; and the shallow spatial feature map of the visible light image;
[0049] Next, the joint features are fed into global average pooling (GAP) to aggregate global background and contextual statistics, and into global max pooling (GMP) to focus on locally salient targets; subsequently, a multilayer perceptron (MLP) is used for feature dimensionality reduction and nonlinear mapping, and the final attention weights are output by a Sigmoid activation function; the specific process is as follows:
[0050] ;
[0051] ;
[0052] where the terms denote, in order: the global average pooling operation; the global max pooling operation; the multilayer perceptron; the Sigmoid activation function; the spatial modulation weights; and the amplitude modulation weights;
[0053] Then, cross-modal adaptive weighting and deep feature reorganization are performed: the two weights are applied crosswise to the two modal features, and the weighted features are reconstructed by channel concatenation and channel shuffling. The specific process is as follows:
[0054] ;
[0055] ;
[0056] ;
[0057] where the terms denote, in order: the element-wise multiplication operation; the channel concatenation operation; and the composite recombination operation of channel shuffling and dimensional aggregation;
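The CMBA data flow described above (channel split, cross addition of the guide branches, GAP and GMP statistics, Sigmoid weights, cross weighting, concatenation, channel shuffle) can be sketched in simplified form. Several simplifications are assumptions made for illustration: the 3×3 convolution is omitted, the MLP is replaced by a direct Sigmoid on the pooled statistics, and the pairing of each weight with a modality is a guess at what "crosswise" means. A feature map is a list of channels, each a 2D nested list; all function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def split_channels(feat):
    """Split channels in half: (attention-guided branch, main branch)."""
    half = len(feat) // 2
    return feat[:half], feat[half:]


def add(a, b):
    """Element-wise sum of two feature maps with identical shapes."""
    return [[[x + y for x, y in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]


def global_avg_pool(feat):
    return [sum(sum(r) for r in c) / (len(c) * len(c[0])) for c in feat]


def global_max_pool(feat):
    return [max(max(r) for r in c) for c in feat]


def scale(feat, weights):
    """Multiply each channel by its scalar attention weight."""
    return [[[w * x for x in r] for r in c] for c, w in zip(feat, weights)]


def channel_shuffle(feat, groups):
    """Interleave channels across groups (reshape, transpose, flatten)."""
    per_group = len(feat) // groups
    return [feat[g * per_group + i]
            for i in range(per_group) for g in range(groups)]


def cmba_sketch(f_ir, f_vis):
    guide_ir, main_ir = split_channels(f_ir)
    guide_vis, main_vis = split_channels(f_vis)
    joint = add(guide_ir, guide_vis)                      # cross addition
    w_avg = [sigmoid(v) for v in global_avg_pool(joint)]  # stand-in for MLP
    w_max = [sigmoid(v) for v in global_max_pool(joint)]  # stand-in for MLP
    # Assumed pairing: max-pooled weights gate infrared, averaged gate visible.
    merged = scale(main_ir, w_max) + scale(main_vis, w_avg)  # channel concat
    return channel_shuffle(merged, groups=2)
```

The channel shuffle at the end interleaves infrared-derived and visible-derived channels, so that later layers operating on channel groups see both modalities.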
[0058] Finally, the cross-modal enhanced features are fed into the cross-modal cross-attention (CMBA) module, where they undergo deep cross-modal feature interaction and saliency extraction through four cascaded CMBA modules. Using the spatial saliency perception branch and the channel correlation measurement branch within each module, iterative weight reshaping and information permeation are performed on the infrared and visible light features. This progressively enhances the spatial localization accuracy and semantic relevance of heterogeneous targets, ultimately outputting a highly discriminative joint guided feature tensor. The specific process is as follows:
[0059] ;
[0060] where the terms denote, in order: the output intermediate-layer features; and the cross-modal cross-attention module mapping operation;
[0061] A further improvement of this invention lies in embedding a multi-layer frequency domain-aware spatial routing module (FASRP) into the feature extraction process, which solves the problem of detail obfuscation in traditional feature fusion to some extent through dynamic routing masks; the specific process includes:
[0062] First, the input features pass through the upper high-frequency preservation branch: after mapping by a linear layer, they are fed into the Local-Global Feature Extractor (LGE) and processed at the original resolution, preserving high-resolution high-frequency textures to the greatest extent possible; simultaneously, the input features pass through the lower low-frequency context branch, where they sequentially undergo downsampling, the LGE unit, and upsampling; the specific process is as follows:
[0063] ;
[0064] ;
[0065] where the terms denote, in order: downsampling; upsampling; the local-global feature extraction unit; and the low-resolution local details;
[0066] Next, the middle branch of this module uses depthwise separable convolution (DWConv) to capture local high-frequency energy and generate a spatial routing mask. Its function is to accurately assess the spatial distribution probability of image edges; the specific process is as follows:
[0067] ;
[0068] where the term denotes the high-resolution local details;
[0069] The specific process of the local-global feature extractor unit is as follows:
[0070] The input first passes through a mapping followed by the SiLU activation function to extract local spatial patterns, and is then fed into a 2D-SSM state-space model for global dependency modeling with linear complexity. The semantic prior generated by the DSPE module, after being processed by a linear layer, is injected as a conditional parameter into the state transition process of the 2D-SSM, thereby endowing the model with scene perception capabilities. The specific process is as follows:
[0071] ;
[0072] where the terms denote, in order: the layer normalization operation; the input tensor; the semantic prior vector; and the local-global feature extraction unit;
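The idea of injecting a prior into the state transition of a state-space model can be illustrated with a deliberately reduced example: a one-dimensional linear recurrence over a flattened sequence, whose decay coefficient is computed from a prior vector. This is not the 2D-SSM of the invention; the scalar state, the single scan direction, and the names `conditioned_scan` and `w_decay` are all assumptions chosen for brevity.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def conditioned_scan(xs, prior, w_decay, b=1.0, c=1.0):
    """Run h_t = a * h_{t-1} + b * x_t and emit y_t = c * h_t.

    The decay a = sigmoid(<w_decay, prior>) is the only place the prior
    enters, which mimics conditioning the state transition on semantics.
    """
    a = sigmoid(sum(w * p for w, p in zip(w_decay, prior)))
    h = 0.0
    ys = []
    for x in xs:  # one pass over the sequence: cost grows linearly with length
        h = a * h + b * x
        ys.append(c * h)
    return ys
```

A prior that pushes the decay toward 1 makes the state remember distant positions, while a decay near 0 makes the output follow only the current input; the single loop also shows why the cost is linear in the number of positions.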
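[0073] The generated mask $M$ and its inverse mask $1 - M$ are then used to perform complementary routing and reorganization of the multi-scale features, ensuring that the network focuses on shallow, high-resolution features in edge regions; the specific process is as follows:
[0074] $F_{agg} = M \odot F_{h} + (1 - M) \odot F_{l}$;
[0075] where $F_{agg}$ is the aggregated feature, $\odot$ is element-wise multiplication, $F_{h}$ is the output of the high-frequency preservation branch, and $F_{l}$ is the output of the low-frequency context branch;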
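The complementary routing step can be sketched on a single-channel map. In this illustration the learned depthwise separable convolution is replaced by a fixed 4-neighbour Laplacian, and the energy is squashed to [0, 1) with e / (1 + e); both substitutions are assumptions made so the example runs without trained weights, and the function names are hypothetical.

```python
def laplacian_energy(img):
    """Absolute 4-neighbour Laplacian response with replicated borders."""
    h, w = len(img), len(img[0])

    def px(r, c):
        return img[min(max(r, 0), h - 1)][min(max(c, 0), w - 1)]

    return [[abs(4 * px(r, c) - px(r - 1, c) - px(r + 1, c)
                 - px(r, c - 1) - px(r, c + 1))
             for c in range(w)] for r in range(h)]


def routing_mask(img):
    """Map high-frequency energy to [0, 1): 0 in flat areas, near 1 at edges."""
    return [[e / (1.0 + e) for e in row] for row in laplacian_energy(img)]


def route(shallow, deep, mask):
    """Blend per pixel: mask selects shallow detail, (1 - mask) selects deep."""
    return [[m * s + (1.0 - m) * d for s, d, m in zip(rs, rd, rm)]
            for rs, rd, rm in zip(shallow, deep, mask)]
```

On a perfectly flat input the mask is zero everywhere and the output equals the deep branch, while pixels next to a strong step edge receive a mask close to 1 and are taken almost entirely from the shallow branch.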
[0076] The aggregated features are then input into the multi-layer frequency-domain-aware spatial routing module FASRP. Four cascaded FASRP modules perform space-frequency sensing and feature routing, using their internal spatial routing and frequency sensing branches to decouple the features at multiple scales and dynamically assign weights. Finally, the deep representation features, refined and reconstructed in the space-frequency domain, are output. The specific process is as follows:
[0077] ;
[0078] where the terms denote, in order: the output intermediate feature tensor; and the frequency-domain-aware spatial routing nonlinear mapping operation;
[0079] Finally, the feature tensors output by the four FASRP modules are added element-wise to the feature tensors output by the four CMBA modules to achieve deep fusion of the space-frequency sensing features and the cross-modal attention features; the specific process is as follows:
[0080] ;
[0081] where the term denotes the infrared joint aggregated feature tensor;
[0082] Similarly, in the visible light branch, the final output is the visible light joint aggregated feature tensor. The specific process includes:
[0083] The system first acquires visible light spatial routing features after four layers of parallel frequency-domain aware spatial routing processing, and visible light guiding features after four layers of cross-modal cross-attention enhancement. Then, it uses residual addition to deeply fuse the two feature tensors. This step aims to organically combine the rich high-fidelity texture details and spatial distribution features in the visible light image with the saliency-aware information after cross-modal correction, thereby generating the final visible light joint aggregated feature tensor. The specific process is as follows:
[0084] ;
[0085] where the term denotes the visible light joint aggregated feature tensor;
[0086] A further improvement of this invention lies in the construction of a Dynamic Frequency Modulation (DFMB) module; the input infrared and visible light shallow spatial features are converted to the frequency domain using a two-dimensional Fast Fourier Transform (2D-FFT), performing strict decoupling of amplitude and phase; the specific process includes:
[0087] First, the visible light shallow spatial features and the infrared shallow spatial features are stitched together along the channel dimension to construct a cross-modal joint feature. Then, a two-dimensional fast Fourier transform is used to convert this joint feature from the spatial domain to the frequency domain, and it is rigorously decoupled into an amplitude spectrum representing texture and energy, and a phase spectrum representing structure and contour. The specific process is as follows:
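[0088] $F_{j} = \mathrm{Concat}(F_{vi}, F_{ir})$;
[0089] $A = \mathcal{A}(\mathcal{F}(F_{j}))$;
[0090] $P = \mathcal{P}(\mathcal{F}(F_{j}))$;
[0091] where $F_{vi}$ is the visible light feature, $F_{ir}$ is the infrared feature, $\mathrm{Concat}$ is the channel concatenation operation, $F_{j}$ is the joint feature tensor, $\mathcal{F}$ is the two-dimensional fast Fourier transform, $\mathcal{A}$ is the amplitude extraction mapping, and $\mathcal{P}$ is the phase extraction mapping, so that $A$ and $P$ are the amplitude and phase spectra;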
[0092] Next, frequency segmentation is applied to the amplitude spectrum $A$, which characterizes the energy distribution, and fidelity processing is applied to the phase spectrum $P$. Based on frequency band-pass characteristics, the amplitude spectrum is divided into three independent components (high frequency, mid frequency, and low frequency) to isolate specific frequency characteristics; simultaneously, the phase spectrum $P$ is enhanced by deep convolution operations to improve structural consistency; the specific process is as follows:
[0093] $[A_{h}, A_{m}, A_{l}] = \mathrm{Split}_{f}(A)$;
[0094] $P_{d} = \mathrm{ReLU}(\mathrm{Conv}(P))$;
[0095] where $\mathrm{Split}_{f}$ is the frequency division operation, $A_{h}$ is the high-frequency amplitude, $A_{m}$ is the mid-frequency amplitude, $A_{l}$ is the low-frequency amplitude, $\mathrm{Conv}$ is the feature mapping convolution, $\mathrm{ReLU}$ is the rectified linear activation function, and $P_{d}$ is the deep phase feature;
[0096] Then, the semantic modulation weights generated by the DSPE module are introduced to perform channel-level adaptive gated multiplication on the amplitude features of a specific frequency band, thereby achieving adaptive noise suppression and salient feature enhancement guided by text instructions; at the same time, deep convolution enhancement is performed on the phase spectrum to ensure structural consistency; the specific process is as follows:
[0097] ;
[0098] ;
[0099] where the terms denote, in order: the dynamic modulation weights; and the amplitude features;
[0100] Finally, an inverse Fourier transform and spatial feature reconstruction are performed to reconstruct the amplitude features of each frequency band after modulation. Each with enhanced global phase features By combining these methods, the frequency domain is mapped back to the spatial domain, generating multi-scale frequency-domain enhanced spatial features. The specific process is as follows:
[0101] ;
[0102] where the terms denote, in order: the two-dimensional inverse fast Fourier transform, the imaginary part of the Fourier transform, and the spatial feature map;
[0103] A further improvement of this invention lies in the adoption of a progressive fusion decoding module based on residual topology. This module follows a coarse-to-fine hierarchical recovery criterion, and its function is to output high-fidelity image features. The specific process includes:
[0104] First, the visible light features and the infrared deep spatial features serve as a base; they are combined with the reconstructed low-frequency components, and the result is fed into the basic residual block Block 0. In the subsequent upsampling stages, the mid-frequency and high-frequency components are injected into the decoding network level by level, following the order from low to high frequency. The basic residual block at each level uses Layer Norm and skip connections to ensure stable propagation of nonlinear features; the specific process is as follows:
[0105] ;
[0106] ;
[0107] ;
[0108] where the terms denote, in order: the infrared target structure; the visible light texture details; the global background energy; the component that guides the rough outline of the image; the component that guides the fine edges and textures of the image; the base-level, mid-level, and high-level reconstruction features; and the deepest, sub-deep, and shallow-to-medium fusion features;
[0109] The basic residual block uses a skip connection mechanism to prevent gradient vanishing, and its internal transformation is expressed as follows:
[0110] ;
[0111] where the terms denote, in order: the feature tensor input to the block, the 3×3 convolution operation, and layer normalization;
[0112] Finally, the high-fidelity spatial features output by the visible light branch are injected directly, as a residual, into the top-level decoding block, avoiding the smoothing loss of visible light edge details caused by deep convolution; the specific process is as follows:
[0113] ;
[0114] ;
[0115] where the terms denote, in order: the texture detail features of the visible light image, the final fusion features, the reconstruction features of the final decoded output, and the final fused output image;
[0116] Meanwhile, multi-stage loss functions are applied at multiple layers of the network for deep supervision, measuring the difference between the model-generated predictions and the true distribution; the network automatically optimizes the hyperparameter space best suited for downstream tasks by relying on the gradient calculations of feedback; finally, it outputs a fused image through a mapping convolutional layer, completing the process from infrared and visible light input to fused image output.
[0117] The beneficial effects of this invention are as follows:
[0118] This invention proposes an infrared-visible image fusion system based on semantic-driven spatial-frequency routing. At the spatial feature extraction level, a cross-modal cross-attention module (CMBA) achieves deep interaction and channel redundancy elimination of low-level infrared-visible features. In parallel, a multi-layer frequency-domain aware spatial routing module (FASRP) is designed. Its internal local-global feature extractor (LGE) uses a two-dimensional state-space model (2D-SSM) for long-range modeling with linear complexity and combines it with a dynamic spatial routing mask to achieve complementary extraction of local high-frequency and global low-frequency features, overcoming the detail smearing caused by traditional fusion methods. To address complex degraded weather conditions such as haze and low illumination, a dual-branch degradation-aware semantic prior extraction module (DSPE) is introduced. This module relies on the large vision-language model BLIP-2 OPT to extract a two-dimensional semantic prior covering weather context and physical degradation patterns. In the core dynamic frequency modulation module (DFMB), the system uses the Fast Fourier Transform (FFT) to decouple joint features into phase and amplitude spectra, and dynamically generates modulation weights for the high-, mid-, and low-frequency amplitude components based on these semantic priors. This mechanism suppresses noise and adaptively enhances features in specific degraded frequency bands while preserving the image's geometric structure.
[0119] Meanwhile, the system adopts a progressive decoding link composed of multiple residual blocks, injects multi-dimensional modulation information sequentially according to the coarse-to-fine recovery criterion, and introduces a late fusion mechanism for high-fidelity visible light features at the top layer. Through the text-driven spatial-frequency collaborative mechanism and the closed-loop optimization of the multi-stage loss function, the generalization robustness of the model under extreme degradation scenarios is improved, and the final output is a high-quality fused image that combines infrared salient target localization and visible light texture.
Attached Figure Description
[0120] Figure 1 is a flowchart of the present invention.
[0121] Figure 2 is the overall network block diagram of the present invention.
[0122] Figure 3 is a framework diagram of the various modules of the present invention.
[0123] Figure 4 is a visual comparison of the results of this invention and other models on the M3FD dataset.
[0124] Figure 5 is a visual comparison of the results of this invention and other models on the TNO dataset.
Detailed Implementation
[0125] The present invention will be further illustrated below with reference to the accompanying drawings and specific embodiments. It should be understood that the following specific embodiments are for illustrative purposes only and are not intended to limit the scope of the invention.
[0126] As shown in Figure 1, the overall process of this invention is as follows: First, spatial features of infrared and visible light images are extracted. Then, the DSPE module is used to obtain two-dimensional textual semantic priors covering environmental context and physical degradation, and the features are simultaneously transformed from the spatial domain to the frequency domain through the DFMB module. Next, the decoupled high-, mid-, and low-frequency components are dynamically modulated according to the semantic prior weights to achieve adaptive denoising and feature enhancement. Finally, the modulated information and high-level spatial features are deeply integrated through the progressive fusion decoding module to reconstruct a high-quality fused image that combines infrared target localization and detailed texture.
[0127] As shown in Figure 2 and Figure 3, this invention discloses an infrared and visible light image fusion system based on semantic-driven spatial-frequency routing. The system constructs a deep neural network called SFMNet, which comprises: an infrared and visible light image input module, a dual-branch degradation-aware semantic prior extraction module (DSPE), a cross-modal cross-attention module (CMBA), a multi-layer frequency-domain aware spatial routing module (FASRP), a dynamic frequency modulation module (DFMB), and a progressive fusion decoding module. These modules cooperate in the spatial domain, frequency domain, and high-dimensional semantic domain to jointly constitute a system that outputs a high-quality, high-fidelity fused image from an infrared and visible light image input.
[0128] The infrared and visible light image input module receives infrared and visible light dual-modal images registered for the same scene. First, the two input images undergo preprocessing to map the infrared and visible light data to a unified dimensional distribution and spatial scale, meeting the computational requirements of the tensor dimension in subsequent deep networks. Then, each modality image passes through an initial convolutional layer (Conv), mapping and expanding the low-dimensional pixel information into a high-dimensional feature space with preliminary semantic representation capabilities.
[0129] The cross-modal cross-attention module (CMBA) receives the initially extracted features. In this module, infrared and visible light features first undergo local feature extraction through 3×3 convolutional layers, followed by a Channel Split operation that divides them in half along the channel dimension into an attention-guided branch and a backbone-preserving branch. The infrared and visible light guidance branches are then combined by cross-modal element-wise addition, which achieves preliminary information fusion and interaction between infrared saliency and visible light texture at a shallow feature extraction layer. The combined features are fed into Global Average Pooling (GAP) and Global Max Pooling (GMP) branches. GAP aggregates local details over the spatial dimension into a global context vector describing the background illumination, while GMP extracts the most salient local responses representing heat sources and high-frequency edges. The two pooling outputs pass through a Multilayer Perceptron (MLP) for feature dimensionality reduction and enhancement, and then through a Sigmoid activation function to generate cross-modal cross-guided weights. These attention weights are multiplied element-wise with the original, unsplit input features, forming a cross-modal enhanced representation. The two enhanced features are then deeply fused and reconstructed through channel concatenation (Concat) and Channel Shuffle, eliminating redundant information between modalities while preserving the global fidelity of the original modality.
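The gating path of this module can be illustrated with a small NumPy sketch. It is a simplified illustration under stated assumptions, not the implementation of the invention: the names `cmba_gate` and `mlp`, the single shared two-layer MLP, the choice to let the GAP-derived weights gate the infrared features and the GMP-derived weights gate the visible light features, and the omission of the 3×3 convolutions are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp(v, w1, w2):
    # Two-layer perceptron: reduce the dimension, apply ReLU, expand again.
    return np.maximum(v @ w1, 0.0) @ w2

def cmba_gate(f_ir, f_vis, w1, w2):
    """Gate two (C, H, W) feature maps with cross-modal channel weights."""
    half = f_ir.shape[0] // 2
    # Channel split: the first half of each modality acts as the guidance branch.
    joint = f_ir[:half] + f_vis[:half]      # cross-modal element-wise addition
    gap = joint.mean(axis=(1, 2))           # global average pooling -> (C/2,)
    gmp = joint.max(axis=(1, 2))            # global max pooling     -> (C/2,)
    w_avg = sigmoid(mlp(gap, w1, w2))       # (C,) weights in (0, 1)
    w_max = sigmoid(mlp(gmp, w1, w2))       # (C,) weights in (0, 1)
    # Assumption: which weight gates which modality is not fixed by the text.
    e_ir = f_ir * w_avg[:, None, None]
    e_vis = f_vis * w_max[:, None, None]
    return e_ir, e_vis

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
f_ir, f_vis = rng.normal(size=(C, H, W)), rng.normal(size=(C, H, W))
w1, w2 = rng.normal(size=(C // 2, 2)), rng.normal(size=(2, C))
e_ir, e_vis = cmba_gate(f_ir, f_vis, w1, w2)
print(e_ir.shape, e_vis.shape)
```

Because the weights come out of a sigmoid, each channel of the unsplit input is only attenuated, never amplified, in this sketch.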
[0130] The multi-layer frequency-domain aware spatial routing module (FASRP) receives the extracted features. First, the input features pass through a center mask generation branch, which uses a 3×3 depthwise separable convolution (DWConv) to capture local high-frequency textures and abrupt edge energy. A spatial routing mask is then generated using a 1×1 convolution and a sigmoid activation function; it estimates the probability that each spatial pixel belongs to a high-frequency region. Simultaneously, the input features are split into upper and lower multi-scale feature extraction branches. In the upper branch, the features pass through a linear mapping layer and are then fed into a local-global feature extractor (LGE), where high-frequency local detail features are extracted at the original resolution. In the lower branch, the features are first downsampled to compress the spatial resolution, then fed into the LGE to extract low-frequency global contextual features with a larger receptive field, and finally upsampled to restore the original spatial dimensions. The LGE unit, as the core operator for feature extraction, first uses DWConv and the SiLU activation function to extract local spatial texture, which is then fed into a 2D state-space model (2D-SSM). This unit models the global long-range dependencies of features in 2D space while maintaining linear computational complexity. Finally, stable features are output through layer normalization (LN) and skip connections. After multi-scale feature extraction, the system multiplies the generated mask W element-wise with the high-frequency features of the upper branch and its inverse mask with the low-frequency features of the lower branch, and then fuses the two complementarily through addition. This routing mechanism forces the network to focus on shallow features in abrupt edge regions and rely on deep features in flat background regions, avoiding the smearing of high-frequency details by deep networks.
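The complementary routing step can be written in a few lines. The sketch below assumes that the inverse mask is `1 - W` and takes the mask logits as given; the DWConv and 1×1 convolution that would produce them, and the function name `route`, are placeholders introduced here for illustration.

```python
import numpy as np

def route(high, low, mask_logits):
    """Blend upper-branch (high-frequency) and lower-branch (low-frequency) features."""
    w = 1.0 / (1.0 + np.exp(-mask_logits))   # routing mask W in (0, 1)
    # Assumption: the inverse mask is taken to be (1 - W).
    return w * high + (1.0 - w) * low

rng = np.random.default_rng(0)
high = rng.normal(size=(4, 8, 8))            # (C, H, W) upper-branch features
low = rng.normal(size=(4, 8, 8))             # (C, H, W) lower-branch features
logits = rng.normal(size=(1, 8, 8))          # one mask value per pixel, broadcast over channels
fused = route(high, low, logits)
print(fused.shape)
```

Each output pixel is a convex combination of the two branches, so a mask value near 1 (an edge region) selects the high-resolution branch and a value near 0 (a flat region) selects the context branch.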
[0131] While the spatial dual-branch features are being extracted, the dual-branch degradation-aware semantic prior extraction module (DSPE) simultaneously mines macroscopic environmental features. This module receives the concatenated infrared and visible light images and preset text prompts in two orthogonal dimensions. First, the image and text are input into the parameter-shared vision-language model BLIP-2 OPT, where infrared and visible light feature encodings are extracted using the open-ended visual question answering (VQA) paradigm. The extracted encodings are then filtered and refined by a multilayer perceptron (MLP) and a self-attention mechanism to extract the most representative core contextual information for each prompt, generating condensed semantic tokens and mapping them to high-dimensional semantic tensors (Semantic Priors). On the one hand, these prior tensors are injected directly as conditional parameters into the 2D-SSM state transition matrix of the aforementioned LGE unit. On the other hand, through concatenation, MLP dimensionality reduction mapping, and sigmoid activation, dynamic modulation weights for the specific frequency bands are generated. This transforms macroscopic environmental semantics into low-level filtering control parameters, giving the model adaptive perception of degraded scenes.
[0132] The dynamic frequency modulation module (DFMB) receives joint features from infrared and visible light images. First, it uses a 2D Fast Fourier Transform (2D-FFT) to transform the spatial domain joint features to the frequency domain, decoupling them into a "phase spectrum" representing the global contour structure and an "amplitude spectrum" representing detailed textures and specific noise. Next, it performs a Frequency Split operation, dividing the amplitude spectrum into three independent tensor components (high-frequency, mid-frequency, and low-frequency) based on frequency distribution characteristics. These amplitude components are then mapped through convolutional layers and multiplied element-wise with the corresponding dynamic modulation weights generated by the DSPE module, achieving targeted noise suppression and specific frequency band enhancement guided by text priors. Simultaneously, the phase spectrum undergoes structural consistency enhancement through a continuous deep convolutional network. Finally, the modulated high-, mid-, and low-frequency amplitude components are recombined with the enhanced phase components and reconstructed back into the spatial domain via an inverse Fast Fourier Transform, forming enhanced multi-scale frequency features.
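A minimal NumPy sketch of this amplitude/phase pipeline follows. Several details are assumptions made only for illustration: the frequency split is implemented with radial masks at hypothetical cut-offs (`r_low`, `r_high`), the convolutional mappings on the amplitude are omitted, the phase enhancement network is replaced by a pluggable `phase_fn` that defaults to the identity, and the names `radial_band_masks` and `dfmb` do not come from the source.

```python
import numpy as np

def radial_band_masks(h, w, r_low=0.10, r_high=0.25):
    # Normalised frequency radius of every FFT bin (cut-offs are hypothetical).
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    r = np.sqrt(fy ** 2 + fx ** 2)
    low = r <= r_low
    high = r > r_high
    mid = ~(low | high)
    return low, mid, high

def dfmb(joint, weights, phase_fn=lambda p: p):
    """joint: (C, H, W) real features; weights: (w_low, w_mid, w_high)."""
    spec = np.fft.fft2(joint, axes=(-2, -1))
    amp, phase = np.abs(spec), np.angle(spec)   # decouple amplitude and phase
    phase = phase_fn(phase)                     # stand-in for phase enhancement
    outs = []
    for mask, wt in zip(radial_band_masks(*joint.shape[-2:]), weights):
        band = amp * mask * wt                  # gated amplitude of one band
        spatial = np.fft.ifft2(band * np.exp(1j * phase), axes=(-2, -1)).real
        outs.append(spatial)
    return outs                                 # [low, mid, high] spatial features

rng = np.random.default_rng(0)
joint = rng.normal(size=(2, 8, 8))
f_low, f_mid, f_high = dfmb(joint, (1.0, 1.0, 1.0))
print(np.allclose(f_low + f_mid + f_high, joint))
```

With unit weights and an unchanged phase the three bands sum back to the input, which is a convenient sanity check that the split loses no information before modulation is applied.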
[0133] The progressive fusion decoding module receives high-level semantic features from the spatial domain and the reconstructed frequency features, and performs image domain restoration and reconstruction. Following a coarse-to-fine restoration criterion, the system first uses the deep spatial features output from the infrared and visible light branches as the underlying heat source target base, fuses them by addition with the low-frequency components output from the DFMB, and feeds the result into the basic residual block Block 0 for initial decoding. Each block sequentially applies a 3×3 convolution, layer normalization (LayerNorm), and ReLU activation, and uses a skip connection to add the input features directly to the output, achieving memory enhancement and stable transfer of nonlinear features. Next, in the progressively upsampled decoding link, mid-frequency and high-frequency components are sequentially injected into the corresponding deep blocks through feature addition, achieving hierarchical completion of the frequency gradient. To preserve as much of the high-frequency texture of the visible light modality as possible, the system introduces a late fusion layer at the top of the decoder. This late fusion strategy injects high-fidelity spatial features from the visible light branch directly as a residual, compensating for the loss of pixel-level details caused by downsampling in deep networks. During this process, the output of each stage of the network is supervised with a multi-stage loss function (Stageloss) for deep supervision and parameter optimization. Finally, the fused features are compressed through the output mapping convolutional layer to output a high-quality fused image that combines accurate localization of infrared saliency with the dynamic details of visible light.
[0134] Specifically, the steps of the method used in the above system are as follows:
[0135] Infrared and visible light images are input into the infrared and visible light image input module. Numerical normalization and size cropping are then performed on the original images. This eliminates dimensional differences between different sensors, ensuring that the bimodal input tensor matches the processing requirements of the subsequent deep convolutional network in terms of spatial scale and data distribution. The input visible light and infrared images are defined as tensors whose dimensions are, in order, the number of samples in the batch, the number of feature channels, and the height and width of the image.
[0136] Infrared and visible light images are first passed through initial 3×3 convolutional layers to map them from a low-dimensional pixel space to a high-dimensional feature space with preliminary semantic representation capabilities; the specific process is as follows:
[0137] ;
[0138] ;
[0139] where the terms denote, in order: the 3×3 initial feature mapping convolution operation, the visible light feature tensor, and the infrared feature tensor;
[0140] The infrared and visible light images and the semantic text are processed jointly by the parameter-shared vision-language model BLIP-2 OPT. Using the open-ended visual question answering (VQA) paradigm, preset text prompts in two orthogonal dimensions (a description of the weather and salient objects, and a description of the underlying physical degradation type) are input into the model together with the concatenated source images for joint reasoning, generating tokens that contain high-dimensional semantics; the specific process is as follows:
[0141] ;
[0142] ;
[0143] where the terms denote, in order: the semantic prompt for perceiving environmental context and salient targets, the degradation prompt for diagnosing underlying physical noise, the visible light image, the infrared image, the pre-trained vision-language model with shared parameters, the initial semantic feature encoding, and the initial degradation feature encoding;
[0144] The extracted infrared and visible light features are encoded and then subjected to self-attention purification and high-dimensional feature mapping. In the two parallel branches of the DSPE module, a condensed semantic token is extracted through a multilayer perceptron (MLP) and a self-attention mechanism. This token is then further processed by MLP dimensionality reduction and activation functions to generate independent high-dimensional prior feature vectors. The specific process is as follows:
[0145] ;
[0146] ;
[0147] ;
[0148] ;
[0149] where the terms denote, in order: the core semantic token, the core degradation token, the linear mapping transformation matrix, the Sigmoid normalized activation function, the high-dimensional semantic prior weights, and the high-dimensional degradation prior weights;
[0150] A mapping network then compresses the dimension to the number of target frequency bands, generating the dynamic frequency modulation weights that guide the fusion of underlying features; the specific process is as follows:
[0151] ;
[0152] ;
[0153] where the terms denote, in order: the joint prior tensor, the bottleneck layer for dimensionality reduction, the low-frequency weights, the mid-frequency weights, and the high-frequency weights;
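The mapping from the two prior vectors to three band weights can be sketched as follows. The layer sizes, the ReLU inside the bottleneck, the use of one scalar weight per band, and the names `band_weights`, `w_down`, and `w_out` are assumptions for illustration; the source only states that the priors are concatenated, reduced by a bottleneck mapping, and passed through a sigmoid.

```python
import numpy as np

def band_weights(p_sem, p_deg, w_down, w_out):
    """Map semantic and degradation priors to (low, mid, high) modulation weights."""
    joint = np.concatenate([p_sem, p_deg])    # joint prior vector
    hidden = np.maximum(joint @ w_down, 0.0)  # bottleneck layer (ReLU is an assumption)
    logits = hidden @ w_out                   # one logit per frequency band
    return 1.0 / (1.0 + np.exp(-logits))      # sigmoid keeps each weight in (0, 1)

rng = np.random.default_rng(0)
p_sem, p_deg = rng.normal(size=16), rng.normal(size=16)
w_down = rng.normal(size=(32, 8)) * 0.1       # hypothetical 32 -> 8 reduction
w_out = rng.normal(size=(8, 3)) * 0.1         # hypothetical 8 -> 3 bands
w_low, w_mid, w_high = band_weights(p_sem, p_deg, w_down, w_out)
print(w_low, w_mid, w_high)
```

A weight close to 0 suppresses the corresponding band and a weight close to 1 leaves it nearly unchanged, which is how a text-derived prior can act as a per-band gate.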
[0154] The initial features are fed into the cross-modal cross-attention module (CMBA); after convolution, the infrared features and the visible light features are each split along the channel dimension into a guidance branch and a backbone branch; the specific process is as follows:
[0155] ;
[0156] ;
[0157] ;
[0158] ;
[0159] where the terms denote, in order: the 3×3 convolution operation, the channel splitting operation, the joint features guiding spatial saliency, the joint features guiding spatial correlation, the shallow spatial feature map of the infrared image, and the shallow spatial feature map of the visible light image;
[0160] The joint features are fed into Global Average Pooling (GAP) and Global Max Pooling (GMP) to extract the global context and the most salient local responses, respectively. The pooling results are then refined using a Multilayer Perceptron (MLP) and processed with a Sigmoid activation function to generate cross-modal cross-guided weights. The specific process is as follows:
[0161] ;
[0162] ;
[0163] where the terms denote, in order: the global average pooling operation, the global max pooling operation, the multilayer perceptron, the Sigmoid activation function, the spatial modulation weights, and the amplitude modulation weights;
[0164] The weights are then applied back to the original, unsplit input features, which are restored to their original dimensions via channel concatenation and channel shuffle, resulting in cross-modal enhanced features. The specific process is as follows:
[0165] ;
[0166] ;
[0167] ;
[0168] where the term denotes the channel shuffle and recombination function;
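The concatenation and channel shuffle step can be sketched with the usual reshape-transpose-reshape pattern. The group count of 2 (one group per modality) and the function names are assumptions for illustration.

```python
import numpy as np

def channel_shuffle(x, groups):
    """Interleave the channels of a (C, H, W) tensor across `groups` groups."""
    c, h, w = x.shape
    assert c % groups == 0, "channel count must be divisible by the group count"
    x = x.reshape(groups, c // groups, h, w)   # split the channels into groups
    x = x.transpose(1, 0, 2, 3)                # swap the group and within-group axes
    return x.reshape(c, h, w)                  # flatten back to (C, H, W)

def concat_and_shuffle(e_ir, e_vis):
    # Assumption: two groups, so infrared and visible channels alternate.
    return channel_shuffle(np.concatenate([e_ir, e_vis], axis=0), groups=2)

# Label every channel with a constant so the new order is easy to read.
e_ir = np.stack([np.full((2, 2), 0.0), np.full((2, 2), 1.0)])
e_vis = np.stack([np.full((2, 2), 2.0), np.full((2, 2), 3.0)])
mixed = concat_and_shuffle(e_ir, e_vis)
print(mixed[:, 0, 0])   # channel order after the shuffle: [0. 2. 1. 3.]
```

After the shuffle, neighbouring channels come from different modalities, so any subsequent grouped or local channel operation sees both infrared and visible information.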
[0169] The cross-modal enhanced features are fed into four cascaded cross-modal cross-attention (CMBA) modules for deep cross-modal feature interaction and saliency extraction. Using the spatial saliency perception branch and the channel correlation measurement branch within each module, the infrared and visible light features undergo iterative weight reshaping and information exchange. This progressively improves the spatial localization accuracy and semantic relevance of heterogeneous targets, and finally outputs a highly discriminative joint guided feature tensor. The specific process is as follows:
[0170] ;
[0171] where the terms denote the intermediate layer features of the output and the cross-modal attention module mapping operation;
[0172] In the upper high-frequency retention branch, the input features are mapped by a linear layer and then fed into the Local-Global Feature Extractor (LGE) for processing at the original resolution, preserving high-resolution high-frequency textures as far as possible; simultaneously, in the lower low-frequency context branch, the input features sequentially undergo downsampling, the LGE unit, and upsampling; the specific process is as follows:
[0173] ;
[0174] ;
[0175] where the terms denote, in order: downsampling, upsampling, the local-global feature extraction unit, and the low-resolution local details;
[0176] Following the upsampled output, the intermediate branch uses depthwise separable convolution (DWConv) to capture local high-frequency energy and generate a spatial routing mask, whose function is to assess the spatial distribution probability of image edges; the specific process is as follows:
[0177] ;
[0178] where the term denotes the high-resolution local details;
[0179] The specific process of the local-global feature extractor unit is as follows:
[0180] The input first passes through DWConv and the SiLU activation function to extract local spatial patterns, and is then fed into a 2D-SSM state-space model for global dependency modeling with linear complexity. The semantic prior generated by the DSPE module is processed by a linear layer and then injected as a conditional parameter into the state transition process of the 2D-SSM, giving the model scene awareness. The specific process is as follows:
[0181] ;
[0182] where the terms denote, in order: the layer normalization operation, the input tensor, the semantic prior vector, and the local-global feature extraction unit;
[0183] The generated mask and its inverse mask perform complementary routing and recombination of the multi-scale features, ensuring that the network focuses on shallow, high-resolution features in edge regions; the specific process is as follows:
[0184] ;
[0185] where the result denotes the aggregated feature;
[0186] The aggregated features are then input into the multi-layer frequency-domain aware spatial routing module (FASRP). Four cascaded FASRP modules perform cascaded spatial-frequency sensing and feature routing, using their internal spatial routing and frequency sensing branches to decouple the features at multiple scales and dynamically assign weights. Finally, the deep representation features, refined and reconstructed in the spatial-frequency domain, are output. The specific process is as follows:
[0187] ;
[0188] where the terms denote the intermediate feature tensor of the output and the nonlinear mapping operation of frequency-domain aware spatial routing;
[0189] The feature tensors output by the four FASRP modules are added element-wise to the feature tensors output by the four CMBA modules to achieve deep fusion of spatial-frequency aware features and cross-modal attention features; the specific processing flow is as follows:
[0190] ;
[0191] where the term denotes the infrared joint aggregated feature tensor;
[0192] Similarly, in the visible light branch, the final output is the visible light joint aggregated feature tensor. The specific process includes:
[0193] The system first acquires visible light spatial routing features after four layers of parallel frequency-domain aware spatial routing processing, and visible light guiding features after four layers of cross-modal cross-attention enhancement. Then, it uses residual addition to deeply fuse the two feature tensors. This step aims to organically combine the rich high-fidelity texture details and spatial distribution features in the visible light image with the saliency-aware information after cross-modal correction, thereby generating the final visible light joint aggregated feature tensor. The specific process is as follows:
[0194] ;
[0195] where the term denotes the visible light joint aggregated feature tensor;
[0196] Infrared and visible light shallow spatial features are stitched together along the channel dimension to construct cross-modal joint features. Then, a two-dimensional fast Fourier transform is used to convert this joint feature from the spatial domain to the frequency domain, and it is rigorously decoupled into an amplitude spectrum representing texture and energy, and a phase spectrum representing structure and contour. The specific process is as follows:
[0197] ;
[0198] ;
[0199] ;
[0200] where the terms denote, in order: the visible light features and the infrared features, the channel concatenation operation, the joint feature tensor, the two-dimensional fast Fourier transform operation, the amplitude extraction mapping, and the phase extraction mapping;
[0201] The amplitude spectrum is divided, according to its frequency band-pass characteristics, into three independent components (high frequency, mid frequency, and low frequency) to isolate specific frequency characteristics; simultaneously, the phase spectrum undergoes deep convolution operations to enhance structural consistency; the specific process is as follows:
[0202] ;
[0203] ;
[0204] where the terms denote, in order: the frequency division operation, the high-frequency amplitude, the mid-frequency amplitude, the low-frequency amplitude, the feature mapping convolution, the rectified linear activation function (ReLU), and the deep phase features;
[0205] Adaptive gated modulation is applied to the amplitude components of each frequency band. After convolution and mapping with an activation function, each amplitude component is multiplied by its corresponding frequency modulation weight. This is to achieve micro-signal filtering guided by macro-text prompts; the specific process is as follows:
[0206] ;
[0207] where the terms denote the dynamic modulation weights and the amplitude features;
[0208] The modulated amplitude features of each frequency band are each combined with the enhanced global phase features and mapped from the frequency domain back to the spatial domain, generating multi-scale frequency-domain enhanced spatial features. The specific process is as follows:
[0209] ;
[0210] where the terms denote, in order: the two-dimensional inverse fast Fourier transform, the imaginary part of the Fourier transform, and the spatial feature map;
[0211] The converged spatial and frequency features are fed into a progressive decoding link that follows a coarse-to-fine hierarchical recovery principle. First, the visible light and infrared features serve as a saliency base, are superimposed with the low-frequency components, and are then fed into the basic decoding block Block 0. The decoding block internally uses layer normalization (LN) and skip connections to refine and stabilize the nonlinear features. The specific process is as follows:
[0212] ;
[0213] ;
[0214] ;
[0215] where the terms denote, in order: the infrared target structure; the visible light texture details; the global background energy; the component that guides the rough outline of the image; the component that guides the fine edges and textures of the image; the base-level, mid-level, and high-level reconstruction features; and the deepest, sub-deep, and shallow-to-medium fusion features;
[0216] The basic residual block uses a skip connection mechanism to prevent gradient vanishing, and its internal transformation is expressed as follows:
[0217] ;
[0218] where the terms denote, in order: the feature tensor input to the block, the 3×3 convolution operation, and layer normalization;
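The residual block can be sketched in NumPy as output = input + ReLU(LN(Conv3×3(input))), following the convolution, layer normalization, ReLU, and skip connection order given for each block. The sketch is illustrative only: the zero padding, the layer normalization taken over the channel axis without learnable scale or shift, and the function names are assumptions.

```python
import numpy as np

def conv3x3(x, k):
    """3x3 cross-correlation with zero padding. x: (C_in, H, W), k: (C_out, C_in, 3, 3)."""
    _, h, w = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((k.shape[0], h, w))
    for dy in range(3):
        for dx in range(3):
            patch = xp[:, dy:dy + h, dx:dx + w]            # shifted view of the input
            out += np.einsum("oc,chw->ohw", k[:, :, dy, dx], patch)
    return out

def layer_norm(x, eps=1e-5):
    # Assumption: normalise over the channel axis at each pixel, no affine parameters.
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def residual_block(x, k):
    # The skip connection adds the block input back onto the transformed features.
    return x + np.maximum(layer_norm(conv3x3(x, k)), 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6, 6))
k = rng.normal(size=(4, 4, 3, 3)) * 0.1
y = residual_block(x, k)
print(y.shape)
```

Because the transformed path is added to the input rather than replacing it, the block reduces to the identity when the convolution contributes nothing, which is the property that keeps gradients flowing through deep decoders.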
[0219] Finally, the high-fidelity spatial features output by the visible light branch are injected directly, as a residual, into the top-level decoding block, avoiding the smoothing loss of visible light edge details caused by deep convolution. The specific process is as follows:
[0220] ;
[0221] ;
[0222] where the terms denote, in order: the texture detail features of the visible light image, the final fusion features, the reconstruction features of the final decoded output, and the final fused output image;
[0223] Meanwhile, multi-stage loss functions are applied at multiple layers for deep supervision, measuring the difference between the model-generated predictions and the true distribution; the network automatically optimizes the hyperparameter space best suited for downstream tasks by relying on the gradient calculation of feedback; the fused image is output through the mapping convolutional layer, completing the process from infrared and visible light input to fused image output.
[0224] As shown in Figure 4 and Figure 5, to verify the effectiveness of the proposed method, detailed comparative experiments were conducted on the M3FD and TNO datasets against nine advanced infrared and visible light image fusion methods: FusionGAN, SDNet, U2Fusion, TarDAL, NestFuse, SwinFusion, Dif-Fusion, CDDFuse, and TUFusion. All network architectures were implemented using the PyTorch and TensorFlow frameworks, and an NVIDIA GeForce RTX 4090 GPU was used for model training and testing.
[0225] In terms of experimental setup, this invention uses the Adam optimizer for training, with a learning rate set to 0.0001. During training, a weighted combination of stage loss and perceptual loss is employed to ensure pixel-level intensity consistency, structural similarity, and boundary integrity of changing regions, thereby measuring the difference between the model-generated predictions and the real modal images. To ensure the comparability and fairness of the experimental results, all comparison methods are performed under the same training strategy, data preprocessing procedures, and the same training / validation / testing partitioning.
[0226] This experiment uses two typical datasets, M3FD and TNO, to comprehensively verify the performance of the proposed method. M3FD is a large-scale, high-resolution image fusion dataset encompassing thousands of precisely aligned infrared and visible light images; the TNO dataset extensively covers diverse and complex geographical scenes, including urban streets, forest paths, and open fields. Both datasets primarily focus on salient targets such as pedestrians and vehicles, emphasizing robustness and practicality in real-world perception tasks. This places extremely high demands on the algorithm's cross-modal feature extraction capabilities, pseudo-change suppression capabilities, and accurate localization capabilities for subtle ground features such as building clusters and road edges under complex imaging conditions.
[0227] Table 1 shows the quantitative evaluation metrics for each fusion method on the M3FD test dataset, including information entropy (EN), mutual information (MI), spatial frequency (SF), average gradient (AG), correlation difference (SCD), visual information fidelity (VIF), peak signal-to-noise ratio (PSNR), fusion performance index (Qabf), fusion noise (Nabf), and structural similarity index (SSIM). The best performance is indicated in bold.
[0228]
[0229] Table 1
[0230] Table 1 shows the quantitative experimental results, which indicate that in the comparative test on the M3FD dataset, the SFMNet proposed in this study outperforms existing mainstream methods in spatial frequency (SF), correlation difference (SCD), and peak signal-to-noise ratio (PSNR). It also demonstrates competitive performance in information entropy (EN), mutual information (MI), average gradient (AG), visual information fidelity (VIF), fusion performance index (Qabf), fusion noise (Nabf), and structural similarity (SSIM).
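For reference, three of the no-reference metrics listed for Table 1 (EN, SF, and AG) can be computed as in the following sketch. It uses commonly cited definitions and is not necessarily the exact evaluation code behind the reported tables. It assumes an 8-bit grayscale image, and it normalizes by the number of finite differences, which is one of several conventions in use.

```python
import numpy as np

def entropy(img):
    """Information entropy (EN) in bits over the 256-level gray histogram."""
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    p = p[p > 0]                       # skip empty bins to avoid log2(0)
    return float(-(p * np.log2(p)).sum())

def spatial_frequency(img):
    """Spatial frequency (SF): root of mean squared row and column differences."""
    f = img.astype(float)
    rf2 = np.mean(np.diff(f, axis=1) ** 2)   # horizontal differences
    cf2 = np.mean(np.diff(f, axis=0) ** 2)   # vertical differences
    return float(np.sqrt(rf2 + cf2))

def average_gradient(img):
    """Average gradient (AG) from forward differences in x and y."""
    f = img.astype(float)
    gx = f[:-1, 1:] - f[:-1, :-1]
    gy = f[1:, :-1] - f[:-1, :-1]
    return float(np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2)))

# Usage: a 4x4 image whose left half is black and right half is white.
img = np.zeros((4, 4), dtype=np.uint8)
img[:, 2:] = 255
en, sf, ag = entropy(img), spatial_frequency(img), average_gradient(img)
```

Higher EN, SF, and AG values generally indicate richer information content and sharper detail in the fused image, which is why they are reported alongside reference-based metrics such as MI, PSNR, and SSIM.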
[0231] Table 2 shows the quantitative evaluation metrics for each fusion method on the TNO test dataset, including information entropy (EN), mutual information (MI), spatial frequency (SF), average gradient (AG), correlation difference (SCD), visual information fidelity (VIF), peak signal-to-noise ratio (PSNR), fusion performance index (Qabf), fusion noise (Nabf), and structural similarity index (SSIM). The best performance is indicated in bold.
[0232]
[0233] Table 2
[0234] Table 2 shows the quantitative experimental results, which indicate that in the comparative test on the TNO dataset, the SFMNet proposed in this study outperforms existing mainstream methods in terms of information entropy (EN), average gradient (AG), correlation difference (SCD), and peak signal-to-noise ratio (PSNR). It also demonstrates competitive performance in terms of mutual information (MI), spatial frequency (SF), visual information fidelity (VIF), fusion performance index (Qabf), fusion noise (Nabf), and structural similarity (SSIM).
[0235] Both quantitative and qualitative experimental results indicate that the system is suitable for all-weather sensing tasks and target detection tasks in complex environments.
[0236] It should be noted that the above content merely illustrates the technical concept of the present invention and should not be construed as limiting the scope of protection of the present invention. For those skilled in the art, various improvements and modifications can be made without departing from the principle of the present invention, and all such improvements and modifications fall within the scope of protection of the claims of the present invention.
Claims
1. An infrared-visible image fusion system based on semantic-driven space-frequency routing, characterized in that the system comprises an infrared and visible light image input module, a dual-branch degradation-aware semantic prior extraction module (DSPE), a cross-modal cross-attention module (CMBA), a multi-layer frequency-domain-aware spatial routing module (FASRP), a dynamic frequency modulation module (DFMB), and a progressive fusion decoding module; the infrared and visible light image input module receives the source infrared image and the visible light image and performs initial shallow spatial feature extraction; the DSPE module receives the infrared and visible light images and the corresponding two-dimensional text prompts, extracts high-dimensional semantic priors through a large pre-trained vision-language model with shared parameters, and generates adaptive frequency modulation weights through nonlinear mapping and gated weight extraction; the CMBA module receives the infrared and visible light features in the spatial domain, performs cross-modal interaction processing, and performs cross-correlation mapping based on the saliency components of the infrared and visible light features in the spatial and channel dimensions; the FASRP module combines a local-global feature extractor modulated by the semantic priors to extract multi-scale spatial features, computes a routing mask from local high-frequency energy components, and performs linear spatial-frequency mapping between deep low-frequency feature components and shallow high-frequency detail components; the DFMB module transforms the combined infrared and visible light features to the frequency domain, performs frequency segmentation, and, based on the modulation weights output by the DSPE module, multiplies the segmented high-frequency, mid-frequency, and low-frequency components by the corresponding weight coefficients and reconstructs them by inverse transformation; the progressive fusion decoding module takes the deep spatial features of the infrared and visible light images as its base, injects the modulated low-, mid-, and high-frequency components step by step during the network decoding stage, and weights the visible light spatial feature components and the Fourier phase prior components through a global residual branch to output the target fused image.
2. The infrared-visible image fusion system based on semantic-driven space-frequency routing according to claim 1, characterized in that, in the infrared and visible light image input module, the specific process for extracting features includes: each input image is fed into an initial convolutional layer for shallow feature mapping and then into a two-branch feature extraction architecture consisting of alternately stacked FASRP and CMBA modules, which maps the image from pixel space to a deep feature space with preliminary semantic representation; the specific process is as follows: ; ; where the symbols denote, in order, the 3×3 initial feature mapping convolution operation, the visible light feature tensor, the infrared feature tensor, the visible light image, and the infrared image.
3. The infrared-visible image fusion system based on semantic-driven space-frequency routing according to claim 2, characterized in that the dual-branch degradation-aware semantic prior extraction module (DSPE) transforms the macroscopic image degradation environment and the text prompts into microscopic frequency-domain adaptive modulation weights; the specific process includes: Step 1: construct orthogonal two-dimensional text prompts, and input the infrared and visible light images together with the corresponding text prompts into a parameter-shared vision-language model; leveraging the cross-modal understanding capability of the large model, extract the initial infrared and visible light feature codes corresponding to the different physical intentions; the specific process is as follows: ; ; where the symbols denote, in order, the semantic prompt for perceiving environmental context and salient targets, the degradation prompt for diagnosing underlying physical noise, the visible light image, the infrared image, the pre-trained vision-language model with shared parameters, the initial semantic feature code, and the initial degradation feature code; Step 2: perform self-attention purification and high-dimensional feature mapping on the extracted infrared and visible light feature codes; in the two parallel branches of the DSPE module, semantic tokens are extracted through a multilayer perceptron (MLP) and a self-attention mechanism, respectively, and independent high-dimensional prior feature vectors are then generated through MLP dimensionality reduction and activation function processing; the specific process is as follows: ; ; ; ; where the symbols denote, in order, the core semantic token, the core degradation token, the linear mapping transformation matrix, the Sigmoid normalized activation function, the high-dimensional semantic prior weights, and the high-dimensional degradation prior weights; Step 3: perform feature-level interaction and cross-dimensional mapping on the high-dimensional prior weights of the two branches; specifically, the two prior weights are concatenated along the channel dimension to achieve lossless information fusion, and a mapping network then compresses the result to the target number of frequency bands, generating the dynamic frequency modulation weights that guide the fusion of the underlying features; the specific process is as follows: ; ; where the symbols denote, in order, the joint prior tensor, the bottleneck layer for dimensionality reduction, the low-frequency weights, the mid-frequency weights, and the high-frequency weights.
4. The infrared-visible image fusion system based on semantic-driven space-frequency routing according to claim 3, characterized in that the cross-modal cross-attention module (CMBA) uses channel splitting and cross-modal feature superposition to achieve preliminary fusion and information interaction of the underlying infrared and visible light features; the specific process includes: Step 1: after applying a 3×3 convolution to the infrared and visible light features respectively, split each in half along the channel dimension to form an attention-guidance branch and a retained trunk branch; the corresponding branches of the infrared and visible light features are then cross-added element by element; the specific process is as follows: ; ; ; ; where the symbols denote, in order, the 3×3 convolution operation, the channel splitting operation, the joint spatial saliency guidance feature, the joint spatial correlation guidance feature, the shallow spatial feature map of the infrared image, and the shallow spatial feature map of the visible light image; Step 2: perform spatial dimension compression and channel dimension correlation modeling on the above joint guidance features to generate cross-modal cross-guidance weights; one joint guidance feature is sent to the global average pooling branch while the other is sent to the global max pooling branch; dimensionality reduction and reconstruction are then performed by a multilayer perceptron, and the final channel attention weights are output by the Sigmoid activation function; the specific process is as follows: ; ; where the symbols denote, in order, the global average pooling operation, the global max pooling operation, the multilayer perceptron, the Sigmoid activation function, the spatial modulation weights, and the amplitude modulation weights; Step 3: apply the weights back to the unprocessed original input features, so as to preserve the global fidelity of the original modalities to the greatest extent possible; finally, the two weighted features are concatenated, and the channel order is shuffled by a channel shuffle operation to output the reconstructed features; the specific process is as follows: ; ; ; where the symbols denote, in order, element-wise multiplication, the channel concatenation operation, and the composite recombination operation comprising channel shuffling and dimension aggregation; Step 4: the cross-modal enhanced features are fed into the cross-modal cross-attention (CMBA) module; after passing through four cascaded CMBA modules, deep cross-modal feature interaction and saliency extraction are performed, and a highly discriminative joint guidance feature tensor is output; the specific process is as follows: ; where the symbols denote, in order, the output intermediate-layer features and the cross-modal cross-attention module mapping operation.
5. The infrared-visible image fusion system based on semantic-driven space-frequency routing according to claim 4, characterized in that, in the multi-layer frequency-domain-aware spatial routing module (FASRP), an intermediate frequency-aware mask generation branch is constructed to evaluate the spatial local high-frequency energy of the input features; the specific process includes: Step 1: the input features pass through the upper high-frequency retention branch, where they are mapped by a linear layer and fed into the local-global feature extractor (LGE) for processing at equal resolution; simultaneously, the input features pass through the lower low-frequency context branch, where they sequentially undergo downsampling, the LGE unit, and upsampling; the specific process is as follows: ; ; where the symbols denote, in order, the downsampling operation, the upsampling operation, the local-global feature extraction unit, and the low-resolution local details; Step 2: the input features simultaneously pass through an intermediate branch, in which local high-frequency energy is captured using depthwise separable convolution (DWConv) to generate a spatial routing mask; the specific process is as follows: ; where the symbol denotes the high-resolution local details; the specific process of the local-global feature extractor unit includes: the input is first processed, with the SiLU activation function, to extract local spatial patterns, and is then fed into a 2D-SSM state-space model for global dependency modeling with linear complexity; the semantic prior generated by the DSPE module is processed by a linear layer and then injected as a conditional parameter into the state transition process of the 2D-SSM; the specific process is as follows: ; where the symbols denote, in order, the layer normalization operation, the input tensor, the semantic prior vector, and the local-global feature extraction unit; Step 3: use the generated mask and its inverse mask to perform complementary routing and recombination of the multi-scale features, ensuring that the network focuses on shallow, high-resolution features in edge regions; the specific process is as follows: ; where the symbol denotes the aggregated feature; Step 4: the aggregated features are input into the FASRP module; the four cascaded FASRP modules perform cascaded space-frequency sensing and feature routing, in which the internal spatial routing and frequency sensing branches decouple the features at multiple scales and dynamically assign weights, and the deep representation features, refined and reconstructed in the space-frequency domain, are finally output; the specific process is as follows: ; where the symbols denote, in order, the output intermediate feature tensor and the frequency-domain-aware spatial routing nonlinear mapping operation; Step 5: add the feature tensor output by the four cascaded FASRP modules to the feature tensor output by the four cascaded CMBA modules element-wise; the specific process is as follows: ; where the symbol denotes the infrared joint aggregated feature tensor; similarly, the visible light branch finally outputs the visible light joint aggregated feature tensor; the specific process includes: first obtain the visible light spatial routing features after four layers of parallel frequency-domain-aware spatial routing processing and the visible light guidance features after four layers of cross-modal cross-attention enhancement, then fuse the two feature tensors by residual addition; the specific process is as follows: ; where the symbol denotes the visible light joint aggregated feature tensor.
6. The infrared-visible image fusion system based on semantic-driven space-frequency routing according to claim 5, characterized in that, in the dynamic frequency modulation module (DFMB), the multi-dimensional space-frequency feature components are weighted using the fast Fourier transform and the weights generated by the dual-branch degradation-aware semantic prior extraction module (DSPE); the specific process includes: Step 1: concatenate the infrared and visible light features along the channel dimension to construct cross-modal joint features; then use a two-dimensional fast Fourier transform to convert the joint features from the spatial domain to the frequency domain, and strictly decouple them into an amplitude spectrum representing texture and energy and a phase spectrum representing structure and contour; the specific process is as follows: ; ; ; where the symbols denote, in order, the visible light features, the infrared features, the channel concatenation operation, the joint feature tensor, the two-dimensional fast Fourier transform operation, the amplitude extraction mapping, and the phase extraction mapping; Step 2: perform frequency segmentation and phase fidelity processing; the amplitude spectrum, which characterizes the energy distribution, is divided according to frequency band-pass characteristics into three independent components: high frequency, mid frequency, and low frequency; simultaneously, the phase spectrum is processed by deep convolution; the specific process is as follows: ; ; where the symbols denote, in order, the frequency division operation, the high-frequency amplitude, the mid-frequency amplitude, the low-frequency amplitude, the feature mapping convolution, the rectified linear activation function, and the deep phase features; Step 3: introduce the dynamic semantic prior weights generated by the DSPE module to perform adaptive gated modulation on the amplitude components of each segmented frequency band; after convolution and activation function mapping, each amplitude component is multiplied by the corresponding frequency modulation weight; the specific process is as follows: ; where the symbols denote, in order, the dynamic modulation weights and the amplitude features; Step 4: perform the inverse Fourier transform and spatial feature reconstruction; the amplitude features of each frequency band modulated in Step 3 are combined with the enhanced global phase features from Step 2, and the result is mapped from the frequency domain back to the spatial domain, generating multi-scale frequency-domain-enhanced spatial features; the specific process is as follows: ; where the symbols denote, in order, the two-dimensional inverse fast Fourier transform, the imaginary part of the Fourier transform, and the spatial feature map.
7. The infrared-visible image fusion system based on semantic-driven space-frequency routing according to claim 6, characterized in that, in the progressive fusion decoding module, basic residual blocks are used as the smallest processing unit for feature aggregation and nonlinear mapping; the specific process includes: Step 1: first, the visible light features and the infrared deep spatial features are used as the base and added to the reconstructed low-frequency components, and the result is fed into a basic residual block; in the subsequent upsampling stages, the mid-frequency components and the high-frequency components are injected into the decoding network step by step, following the order from low to high frequency; the specific process is as follows: ; ; ; where the symbols denote, in order, the infrared target structure, the visible light texture details, the global background energy, the component guiding the coarse outline of the image, the component guiding the fine edges and textures of the image, the base reconstruction feature, the mid-level reconstruction feature, the high-level reconstruction feature, the deepest fusion feature, the second-deepest fusion feature, and the shallow-to-mid-level fusion feature; the basic residual block adopts a skip connection mechanism, and its internal transformation process is as follows: ; where the symbols denote, in order, the feature tensor input to the block, the 3×3 convolution operation, and layer normalization; Step 2: the high-fidelity spatial features output by the visible light branch are injected directly, as a residual, into the top-level decoding block; the specific process is as follows: ; ; where the symbols denote, in order, the texture detail features of the visible light image, the final fusion feature, the reconstruction feature of the final decoded output, and the final fused output image.