A polarization degree and light intensity image fusion method based on bidirectional cross attention
The method uses bidirectional cross-attention to address the difficulty existing techniques have in dynamically balancing intensity images and polarization images under complex backgrounds or low signal-to-noise ratio conditions. It preserves structural details and polarization saliency simultaneously in polarization imaging fusion, thereby improving the stability and robustness of the fused image.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents (China)
- Current Assignee / Owner: CHANGCHUN UNIV OF SCI & TECH
- Filing Date: 2026-04-02
- Publication Date: 2026-06-19
Figure CN121961874B_ABST
Description
Technical Field
[0001] This invention belongs to the fields of polarization imaging, computational imaging and image processing technology, and specifically relates to a method for image fusion of polarization degree and light intensity based on bidirectional cross-attention.
Background Technology
[0002] Polarization imaging technology, by measuring the polarization state of light, can provide physical clues related to surface geometry, roughness, material, and scattering, in addition to traditional light intensity information. A typical polarization imaging system can recover, by inversion, polarization feature images such as the light intensity image S0 and the degree of linear polarization image DoLP. Of these, S0 is better suited to expressing structural and texture details, while DoLP is better suited to expressing target saliency and material differences.
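As an illustration of how S0 and DoLP can be recovered, the sketch below uses the standard linear Stokes relations. It assumes a four-angle acquisition (0°, 45°, 90°, 135°), which this text does not specify; the function name `stokes_to_s0_dolp` and the `eps` guard are hypothetical.

```python
import numpy as np

def stokes_to_s0_dolp(i0, i45, i90, i135, eps=1e-8):
    """Compute S0 and DoLP from four linear-polarizer intensity images.

    Assumption: intensities were captured behind polarizers at 0, 45, 90 and 135 degrees.
    """
    i0, i45, i90, i135 = (np.asarray(a, dtype=np.float64) for a in (i0, i45, i90, i135))
    s0 = 0.5 * (i0 + i45 + i90 + i135)   # total intensity, averaged over both orthogonal pairs
    s1 = i0 - i90                        # horizontal vs. vertical component
    s2 = i45 - i135                      # +45 vs. -45 degree component
    dolp = np.sqrt(s1 ** 2 + s2 ** 2) / (s0 + eps)  # eps avoids division by zero in dark pixels
    return s0, np.clip(dolp, 0.0, 1.0)
```

Fully horizontally polarized light gives a DoLP close to 1, and unpolarized light gives a DoLP of 0.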
[0003] In engineering applications, polarization images and light intensity images often need to be presented in a unified manner or used as input for downstream visual tasks. Therefore, it is usually desirable to fuse two complementary images into a single fused image, which simultaneously possesses: clear structure, rich detail, prominent target, and a certain degree of physical interpretability.
[0004] Existing polarization imaging fusion techniques commonly use linear weighting, feature convolution, or unidirectional attention to handle the complementary information in the intensity image S0 and the polarization degree image DoLP. Linear fusion methods based on fixed or empirical weights are simple to implement, but the weights typically remain constant globally and cannot adapt to the imaging characteristics of different scenes or spatial regions, which easily weakens polarization information or obscures intensity structural details in complex backgrounds. Convolutional fusion methods typically concatenate intensity and polarization features along the channel dimension and fuse them with a convolutional neural network. Although these methods can learn certain nonlinear mappings, the interaction between modalities relies mainly on local convolution operations and lacks an explicit cross-modal correlation modeling mechanism, so one modality tends to dominate the fusion process and the discriminative information of the other is underused. In recent years, some methods have introduced attention mechanisms to weight modal features, but most employ unidirectional attention structures, that is, they use only intensity features to guide polarization features, or only polarization saliency to enhance intensity features. This one-way information injection implicitly assumes a "master modality / slave modality" relationship. When the master modality itself is in a low-contrast state or is subject to noise interference, one-way injection not only fails to provide effective compensation but may also amplify errors, reducing the stability and robustness of the fusion result.
[0005] Therefore, existing technologies generally struggle to achieve a dynamic balance between preserving structural details and enhancing polarization saliency, especially under complex backgrounds or low signal-to-noise ratio conditions, where the fusion results are prone to texture loss or noise artifacts. It is therefore necessary to propose a fusion mechanism that establishes bidirectional information exchange between the intensity image and the polarization degree image and includes further processing to address noise, thereby achieving more stable and reliable polarization fusion imaging.
Summary of the Invention
[0006] To address the aforementioned technical problems, this invention provides a polarization degree and light intensity image fusion method based on bidirectional cross-attention, comprising the following steps:
[0007] S1. Obtain the pixel-level registered light intensity image S0 and polarization degree image DoLP, and normalize the light intensity image S0 and polarization degree image DoLP.
[0008] S2. Construct a fusion network to fuse the light intensity image S0 and the polarization degree image DoLP, specifically as follows:
[0009] A dual-branch feature extraction module is constructed, in which the light intensity image S0 is input into the intensity branch and the polarization degree image DoLP is input into the polarization degree branch, respectively. The light intensity features of S0 and the DoLP features are obtained through shallow convolution and deep feature extraction, respectively.
[0010] A bidirectional cross-attention module is constructed to perform bidirectional cross-attention calculation on the S0 light intensity feature and the DoLP feature to obtain two enhanced features, which are concatenated along the channel dimension to obtain the fused feature;
[0011] A feature selection module is constructed to generate adaptive weights for its three input features, perform weighted feature fusion, and output the selected feature;
[0012] A multi-scale feature pyramid module is constructed, in which the selected feature is downsampled to multiple scales, and the scales are aligned and fused to obtain pyramid features containing multi-scale contextual information;
[0013] A reconstruction module is constructed to map the pyramid features to the fused image;
[0014] S3. Optimize the fusion network using a self-supervised training strategy and construct a composite loss function that includes at least a gradient preservation loss and a polarization saliency enhancement loss.
[0015] Furthermore, in the dual-branch feature extraction module, the intensity branch and the polarization branch use independent weight parameters. Both branches include a first convolutional layer for extracting shallow features and multiple residual blocks or dense residual blocks for extracting deep features; the feature scales of the two branches are aligned through a downsampling layer to obtain the S0 light intensity features and the DoLP features. A channel attention module is added after the first convolutional layer of the intensity branch, and a spatial attention module is added after the first convolutional layer of the polarization branch.
[0016] Furthermore, the bidirectional cross-attention module uses a convolutional layer to divide the input feature map into image patches and flattens each patch to form a feature vector $X$; layer normalization is applied to $X$, and the query vector Q, key vector K, and value vector V are obtained by projecting it through learnable weight matrices: $Q = XW_Q$, $K = XW_K$, $V = XW_V$, where $W_Q$, $W_K$ and $W_V$ are the learnable weight matrices corresponding to Q, K and V;
[0017] The attention calculation from the intensity branch to the polarization degree branch is as follows:
[0018] $\mathrm{Attn}_{S0 \to DoLP} = \mathrm{softmax}\left(Q_{S0} K_{DoLP}^{T} / \sqrt{d}\right) V_{DoLP}$;
[0019] The attention calculation from the polarization degree branch to the intensity branch is as follows:
[0020] $\mathrm{Attn}_{DoLP \to S0} = \mathrm{softmax}\left(Q_{DoLP} K_{S0}^{T} / \sqrt{d}\right) V_{S0}$;
[0021] In $Q_{S0}$, $K_{S0}$, $V_{S0}$, $Q_{DoLP}$, $K_{DoLP}$ and $V_{DoLP}$, the subscripts indicate the query vector Q, key vector K, and value vector V corresponding to the S0 light intensity feature and the DoLP feature, respectively, and $d$ is the feature vector length. The two outputs of bidirectional cross-attention undergo nonlinear transformation by a multilayer perceptron, are restored to feature maps, and are then concatenated along the channel dimension to form the fused feature. The multilayer perceptron consists of two fully connected layers and a ReLU activation function.
[0022] Furthermore, in the feature selection module, a feature weight sub-network generates adaptive weights for the three input feature maps. The sub-network consists of a global average pooling layer, two fully connected layers, and a softmax activation function. First, the three input feature maps are concatenated along the channel dimension and compressed into a feature vector by pooling. The fully connected layers then learn three weight vectors, one for each input feature map, and softmax normalization makes the weights sum to 1. Finally, the three feature maps are fused by a weighted sum using these weights.
[0023] Furthermore, in the multi-scale feature pyramid module, the selected feature is downsampled to multiple scales so that local details and global context are encoded separately; the features from each scale are upsampled and aligned to the same resolution, concatenated along the channel dimension, and fused using a 1×1 convolution to obtain the pyramid features.
[0024] Furthermore, the reconstruction module consists of a 3×3 convolutional denoising and fusion layer, a residual block, a 3×3 convolutional reconstruction layer, and a 1×1 convolutional output layer, and maps the pyramid features to the fused image.
[0025] Furthermore, the gradient preservation loss $L_{grad}$ is computed with the Sobel operator as the gradient operator and the L1 norm as the distance measure, and includes a balance coefficient.
[0026] Furthermore, the polarization saliency enhancement loss uses a saliency weight map constructed from DoLP. The weight map is obtained from the polarization degree image DoLP through a nonlinear mapping with a parameter that controls the steepness of the response in salient regions.
[0027] The beneficial effects of the method described in this invention are as follows:
[0028] Complementary features are fully utilized, and bidirectional information exchange is carried out between the S0 and DoLP branches through bidirectional cross-attention, so that the fusion result can simultaneously retain clear structural details and polarization saliency.
[0029] The adaptive contribution allocation and feature selection module can dynamically balance the intensity details and polarization saliency in different scenes and regions, reducing information loss caused by overemphasizing a certain mode.
[0030] Multi-scale context enhancement, where feature pyramids are fused to aggregate contextual information at different scales, helps suppress noise and improve the overall discernibility of the target.
[0031] No fusion ground truth is required: with a self-supervised loss function designed for the fusion objective, the network can be stably trained and deployed in real polarization scenarios where objective ground truth is lacking.
Attached Figure Description
[0032] Figure 1 is a flowchart of the polarization degree and light intensity image fusion method described in an embodiment of the present invention;
[0033] Figure 2 is a schematic diagram of the bidirectional cross-attention module in an embodiment of the present invention;
[0034] Figure 3 is a schematic diagram of the feature selection module in an embodiment of the present invention;
[0035] Figure 4 is the S0 image input in an embodiment of the present invention;
[0036] Figure 5 is the DoLP image input in an embodiment of the present invention;
[0037] Figure 6 is a schematic diagram of the output fused image.
Detailed Implementation
[0038] The technical solution of the present invention will now be clearly and completely described with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the protection scope of the present invention.
[0039] Example 1
[0040] This embodiment provides a method for fusing polarization degree and light intensity images based on bidirectional cross-attention. As shown in Figure 1, it includes the following steps:
[0041] S1. Obtain the pixel-level registered intensity image S0 and polarization degree image DoLP, each with spatial dimensions H×W. Preferably, if there is a small disparity between the two inputs, geometric registration can be performed using feature matching and homography estimation before fusion. S0 and DoLP are then linearly normalized, preferably by scaling S0 to [0,1] and clipping DoLP to [0,1], and the normalized images are used as inputs to the subsequent network.
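A minimal sketch of the S1 normalization follows. The text only says that S0 is scaled to [0,1] and DoLP is limited to [0,1]; min-max scaling for S0 and the function name `normalize_inputs` are assumptions made here for illustration.

```python
import numpy as np

def normalize_inputs(s0, dolp, eps=1e-8):
    """Scale S0 to [0, 1] and clip DoLP to [0, 1] (inputs must already be registered)."""
    s0 = np.asarray(s0, dtype=np.float64)
    dolp = np.asarray(dolp, dtype=np.float64)
    if s0.shape != dolp.shape:
        raise ValueError("S0 and DoLP must share the same H x W after registration")
    # Assumption: min-max scaling; a fixed divisor such as the sensor bit depth would also work.
    s0_norm = (s0 - s0.min()) / (s0.max() - s0.min() + eps)
    dolp_norm = np.clip(dolp, 0.0, 1.0)  # DoLP is physically bounded by [0, 1]
    return s0_norm, dolp_norm
```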
[0042] S2. Construct a fusion network to fuse the light intensity image S0 and the polarization degree image DoLP, specifically as follows:
[0043] A dual-branch feature extraction module is constructed, with S0 input to the intensity branch and DoLP input to the polarization branch. The S0 light intensity feature and DoLP feature are obtained through shallow convolution and deep feature extraction, respectively. The features are then reweighted through a channel attention module and a spatial attention module to enhance the effective texture and salient region response.
[0044] A bidirectional cross-attention module is constructed to perform bidirectional cross-attention calculation on the S0 light intensity feature and the DoLP feature to obtain two enhanced features, which are concatenated along the channel dimension to obtain the fused feature.
[0045] A feature selection module is constructed to generate adaptive weights for its three input features, dynamically allocating the contribution ratios of intensity details and polarization saliency, and to output the selected feature.
[0046] A multi-scale feature pyramid module is constructed, in which the selected feature is downsampled to multiple scales, and the scales are aligned and fused to obtain pyramid features containing multi-scale contextual information.
[0047] A reconstruction module is constructed to map the pyramid features to the fused image; single-channel grayscale images are supported.
[0048] S3. Optimize the fusion network using a self-supervised training strategy and construct a composite loss function that includes at least a gradient preservation loss and a polarization saliency enhancement loss. Optionally, a structural similarity loss can be added to improve structural consistency and suppress noise artifacts.
[0049] Example 2
[0050] This embodiment further defines Embodiment 1 and provides a further explanation of the dual-branch feature extraction module.
[0051] The intensity branch and polarization branch employ independent weight parameters. Both branches can include: a first convolutional layer for extracting shallow features; multiple residual blocks or dense residual blocks for extracting deep features; and a downsampling layer to align the feature scales of the two branches, yielding S0 intensity features and DoLP features. Considering that DoLP images typically contain more high-frequency noise, the polarization branch can preferably incorporate a spatial attention module after shallow convolution to suppress background noise response, while the intensity branch focuses on channel attention to filter key texture features.
[0052] Example 3
[0053] This embodiment further defines Embodiment 1, providing a more detailed explanation of the bidirectional cross-attention module. The bidirectional cross-attention module is the core module of this method. By constructing a bidirectional attention calculation path between the intensity branch and the polarization degree branch, it achieves mutual guidance and synergistic enhancement of the two modal features. Specifically, the attention calculation from the intensity branch to the polarization degree branch utilizes the high signal-to-noise ratio structural and textural features in the intensity image as query vectors, guiding the polarization features to align and denoise at their corresponding spatial positions, thereby suppressing invalid responses caused by noise amplification in the polarization degree image. Conversely, the attention calculation from the polarization degree branch to the intensity branch utilizes polarization saliency information to guide the intensity features to be enhanced in low-contrast or target regions, ensuring that the fusion result maintains structural continuity while highlighting target discernibility. Through this bidirectional information interaction method, the problems of insufficient or biased information transmission in unidirectional attention structures are avoided, enabling the fused features to simultaneously inherit the stable structural information of the intensity image and the physical saliency features of the polarization degree image.
[0054] As shown in Figure 2, preferably, a convolutional layer with a stride of 16 is used to divide the input feature map into image patches, and each patch is flattened to form a feature vector $X$; layer normalization is applied to $X$, and the query vector Q, key vector K, and value vector V are obtained by projecting it through learnable weight matrices:
[0055] $Q = XW_Q$, $K = XW_K$, $V = XW_V$, where $W_Q$, $W_K$ and $W_V$ are the learnable weight matrices corresponding to Q, K and V;
[0056] The attention calculation from the intensity branch to the polarization degree branch is as follows:
[0057] $\mathrm{Attn}_{S0 \to DoLP} = \mathrm{softmax}\left(Q_{S0} K_{DoLP}^{T} / \sqrt{d}\right) V_{DoLP}$;
[0058] The attention calculation from the polarization degree branch to the intensity branch is as follows:
[0059] $\mathrm{Attn}_{DoLP \to S0} = \mathrm{softmax}\left(Q_{DoLP} K_{S0}^{T} / \sqrt{d}\right) V_{S0}$;
[0060] In $Q_{S0}$, $K_{S0}$, $V_{S0}$, $Q_{DoLP}$, $K_{DoLP}$ and $V_{DoLP}$, the subscripts indicate the query vector Q, key vector K, and value vector V corresponding to the S0 light intensity feature and the DoLP feature, respectively. $d$ is the length of these vectors, which is the same for all six.
[0061] Using the high signal-to-noise ratio texture features of the intensity image as the query (Q), the polarization degree features are guided to undergo denoising and feature alignment at their corresponding spatial locations; conversely, the material saliency information of the polarization degree image guides the light intensity features toward detail enhancement in low-contrast regions. This bidirectional interaction resolves the information bias caused by unidirectional injection. The two outputs of bidirectional cross-attention undergo nonlinear transformation by a multilayer perceptron, are restored to feature maps, and are then concatenated along the channel dimension to form the fused feature. The multilayer perceptron consists of two fully connected layers and a ReLU activation function.
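The two attention paths described in this embodiment can be sketched in NumPy as follows. This is an illustrative sketch only: it assumes the standard scaled dot-product form of attention, takes already-flattened patch tokens of shape (N, C) as input, and omits the strided-convolution patch embedding and the multilayer perceptron. The names `cross_attention`, `bidirectional_cross_attention`, and the `params` layout are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def cross_attention(x_query, x_context, w_q, w_k, w_v):
    """Queries come from one modality; keys and values come from the other."""
    q = layer_norm(x_query) @ w_q
    k = layer_norm(x_context) @ w_k
    v = layer_norm(x_context) @ w_v
    d = q.shape[-1]                                  # shared vector length
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)    # (N, N) attention map, rows sum to 1
    return attn @ v, attn

def bidirectional_cross_attention(tok_s0, tok_dolp, params):
    # Intensity -> polarization: S0 tokens query the DoLP tokens.
    out_s0_to_dolp, _ = cross_attention(tok_s0, tok_dolp, *params["s0_to_dolp"])
    # Polarization -> intensity: DoLP tokens query the S0 tokens.
    out_dolp_to_s0, _ = cross_attention(tok_dolp, tok_s0, *params["dolp_to_s0"])
    # Concatenate along the channel dimension to form the fused tokens.
    return np.concatenate([out_s0_to_dolp, out_dolp_to_s0], axis=-1)
```

Each direction has its own projection matrices here, so the two paths can learn different alignments; whether the patented network shares or separates them is not stated in this text.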
[0062] Example 4
[0063] This embodiment further defines Embodiment 1 and provides a more detailed explanation of the feature selection module. The feature selection module is used to dynamically reweight the fused features. Preferably, the feature weight sub-network generates weights for different channels or spatial locations and uses softmax normalization to make the weights sum to 1, thereby adaptively allocating the contribution ratio between intensity detail features and polarization saliency features, and outputs the selected features.
[0064] As shown in Figure 3, the feature weight sub-network consists of a global average pooling layer, two fully connected layers, and a softmax activation function. First, the three input feature maps are concatenated along the channel dimension and compressed into a feature vector through pooling; the fully connected layers then learn three weight vectors, one for each input feature map; finally, the three feature maps are fused by a weighted sum using these weights.
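The weight sub-network can be sketched as below. Several details are assumptions rather than statements from the text: the three feature maps share one shape (C, H, W), a ReLU sits between the two fully connected layers, and the softmax is taken across the three maps separately for each channel. The name `feature_select` and the weight shapes are hypothetical.

```python
import numpy as np

def feature_select(feats, w_fc1, w_fc2):
    """feats: three arrays of shape (C, H, W). Returns the fused map and the weights."""
    f = np.stack(feats, axis=0)                    # (3, C, H, W)
    n, c = f.shape[0], f.shape[1]
    pooled = f.reshape(n * c, -1).mean(axis=1)     # channel concat + global average pooling -> (3C,)
    hidden = np.maximum(pooled @ w_fc1, 0.0)       # first fully connected layer (ReLU is an assumption)
    logits = (hidden @ w_fc2).reshape(n, c)        # second fully connected layer -> one vector per map
    logits = logits - logits.max(axis=0, keepdims=True)
    w = np.exp(logits)
    w = w / w.sum(axis=0, keepdims=True)           # softmax across the three maps: weights sum to 1
    fused = (w[:, :, None, None] * f).sum(axis=0)  # weighted fusion -> (C, H, W)
    return fused, w
```

Because the weights sum to 1 per channel, feeding three identical maps returns that same map unchanged.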
[0065] Example 5
[0066] This embodiment further defines Embodiment 1 and provides a further explanation of the multi-scale feature pyramid module. The module downsamples the selected feature to multiple scales so that local details and global context are encoded separately; the features from each scale are upsampled and aligned to the same resolution, concatenated along the channel dimension, and fused using a 1×1 convolution to obtain pyramid features containing multi-scale contextual information, which improves cross-scale representation capability and robustness.
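The downsample, align, concatenate, and 1×1-convolve sequence can be sketched as follows. Average pooling, nearest-neighbour upsampling, and the scale set (1, 2, 4) are assumptions chosen for brevity; the text does not name the resampling operators or the number of scales. The function names are hypothetical.

```python
import numpy as np

def avg_pool(x, s):
    """Downsample a (C, H, W) map by an integer factor s using average pooling."""
    c, h, w = x.shape
    return x.reshape(c, h // s, s, w // s, s).mean(axis=(2, 4))

def upsample_nearest(x, s):
    """Upsample a (C, H, W) map by an integer factor s using nearest-neighbour repetition."""
    return x.repeat(s, axis=1).repeat(s, axis=2)

def pyramid_fuse(x, w_1x1, scales=(1, 2, 4)):
    """w_1x1 has shape (C_out, len(scales) * C) and acts as a 1x1 convolution."""
    _, h, w = x.shape
    if any(h % s or w % s for s in scales):
        raise ValueError("H and W must be divisible by every scale in this sketch")
    # Downsample to each scale, then bring every level back to the input resolution.
    levels = [upsample_nearest(avg_pool(x, s), s) for s in scales]
    stacked = np.concatenate(levels, axis=0)             # concat along channels
    return np.einsum("oc,chw->ohw", w_1x1, stacked)      # 1x1 convolution = per-pixel linear map
```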
[0067] Example 6
[0068] This embodiment further defines Embodiment 1 and provides a further explanation of the reconstruction module, which maps the pyramid features to the fused image.
[0069] The reconstruction module consists of a 3×3 convolutional denoising and fusion layer, four residual blocks, a 3×3 convolutional reconstruction layer, and a 1×1 convolutional output layer. The output can be a single-channel grayscale image; when polarization information needs to be visualized, a pseudo-color image can be output instead, in which the brightness channel expresses structural details and the saturation channel expresses DoLP saliency.
[0070] Example 7
[0071] This embodiment further defines Embodiment 1 and provides a further explanation of the self-supervised training strategy.
[0072] In existing fusion methods, loss function design typically focuses on intensity consistency constraints or overall polarization information enhancement. Some methods suppress polarization noise by constraining the similarity between the fused result and the intensity image at the pixel or structural level. However, these methods often over-rely on intensity information, making it difficult to reflect polarization saliency in complex backgrounds or low-contrast scenes. Another type of method enhances the overall response of the polarization degree image to improve target saliency. However, since DoLP images are easily affected by noise amplification under low signal-to-noise ratio conditions, directly performing global polarization enhancement can easily misjudge background noise as valid features, thus introducing obvious noise artifacts or texture distortion into the fusion result. Furthermore, loss designs that rely solely on gradient or contrast maximization struggle to distinguish between real structural edges and noise gradients, easily causing texture loss or over-smoothing in complex texture scenes.
[0073] To address the aforementioned problems, this invention does not globally amplify polarization information in its loss function design. Instead, it constructs a saliency weight map based on DoLP and applies differentiated constraints to salient and non-salient polarization regions. Specifically, the polarization saliency enhancement loss guides the fusion result in high-DoLP regions to enhance target discernibility, while constraining the fusion result in low-DoLP regions to remain consistent with the intensity image, which prevents background noise from being incorrectly amplified. This design ensures that polarization information plays an enhancing role only in physically significant regions, improving the stability of the fusion result under complex backgrounds and low signal-to-noise ratio conditions. Combined with the gradient preservation loss, which constrains structural edges, and the structural similarity loss, which constrains overall statistical properties, the composite self-supervised loss function can balance structural detail preservation, polarization saliency enhancement, and noise suppression without fusion ground truth, reducing texture loss and noise artifacts compared with existing technologies.
[0074] Gradient preservation loss:
[0075] The Sobel operator is used to compute the gradient magnitudes of the fused image and the input image, and the gradient magnitude of the fused image $I_f$ is constrained to be consistent or nearly consistent with the input gradient magnitude, so as to preserve prominent edges and textures. The loss uses a balance coefficient, the Sobel operator as the gradient operator, and the L1 norm.
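The exact formula of the gradient preservation loss is not reproduced in this text. The sketch below shows one plausible form under stated assumptions: the fused gradient magnitude is compared with the S0 and DoLP gradient magnitudes separately, and a balance coefficient `lam` weights the DoLP term. The function names and the default `lam=0.5` are hypothetical.

```python
import numpy as np

SOBEL_X = np.array([[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]])
SOBEL_Y = SOBEL_X.T

def filter3x3(img, kernel):
    """Apply a 3x3 kernel to a 2-D image with edge padding (output keeps the input size)."""
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros((h, w), dtype=np.float64)
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * padded[i:i + h, j:j + w]
    return out

def sobel_magnitude(img):
    img = np.asarray(img, dtype=np.float64)
    gx = filter3x3(img, SOBEL_X)
    gy = filter3x3(img, SOBEL_Y)
    return np.sqrt(gx ** 2 + gy ** 2)

def gradient_preservation_loss(fused, s0, dolp, lam=0.5):
    """Assumed form: mean absolute (L1) gradient-magnitude difference to each input."""
    g_f = sobel_magnitude(fused)
    term_s0 = np.mean(np.abs(g_f - sobel_magnitude(s0)))
    term_dolp = np.mean(np.abs(g_f - sobel_magnitude(dolp)))
    return term_s0 + lam * term_dolp   # lam plays the role of the balance coefficient
```

The mean absolute difference equals the L1 norm divided by the pixel count, which keeps the loss scale independent of image size.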
[0076] Polarization saliency enhancement loss: a saliency weight map is constructed based on DoLP. The weight map is obtained from the DoLP image through a nonlinear mapping whose parameter controls the steepness of the response in salient regions; the loss is measured with the L2 norm.
[0077] In non-salient regions with a low DoLP response (weight map value approaching 0), larger weights are assigned to force the fused image to remain consistent with the high signal-to-noise ratio S0; in salient regions with pronounced polarization characteristics (weight map value approaching 1), the fused image is allowed to deviate from S0 to absorb polarization enhancement information.
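The behaviour described here can be illustrated with a sigmoid weight map, although the text does not give the actual nonlinear mapping or the full loss formula. In this sketch the sigmoid, the steepness `k`, the threshold `tau`, and the function names are all assumptions, and only the consistency term (stay close to S0 where DoLP is low) is modelled.

```python
import numpy as np

def saliency_weight(dolp, k=10.0, tau=0.3):
    """Assumed sigmoid mapping: near 0 for low DoLP, near 1 for high DoLP; k sets the steepness."""
    dolp = np.asarray(dolp, dtype=np.float64)
    return 1.0 / (1.0 + np.exp(-k * (dolp - tau)))

def polarization_saliency_loss(fused, s0, dolp, k=10.0, tau=0.3):
    """Penalize deviation from S0 strongly where DoLP is low and weakly where it is high."""
    w = saliency_weight(dolp, k, tau)
    residual = np.asarray(fused, dtype=np.float64) - np.asarray(s0, dtype=np.float64)
    return np.mean(((1.0 - w) * residual) ** 2)   # squared L2 penalty, averaged over pixels
```

With these defaults, the same deviation from S0 costs far more in a region where DoLP is 0 than in a region where DoLP is 1.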
[0078] Optional structural similarity loss: constrains the structural similarity between the fused image and S0 to reduce detail loss caused by over-smoothing.
[0080] This loss is computed from the mean and standard deviation of each image together with stability constants, where the subscripts denote the images being compared.
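For illustration, the sketch below uses the widely known SSIM index with global image statistics, and takes the loss as one minus that index. Both choices are assumptions: the formula in this text is not reproduced, and practical SSIM implementations usually use local sliding windows. The constants assume images scaled to [0, 1], and the function names are hypothetical.

```python
import numpy as np

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM index from whole-image means, variances and covariance (no sliding window)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    # c1 and c2 are stability constants that keep the ratio defined for flat images.
    num = (2.0 * mu_x * mu_y + c1) * (2.0 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

def ssim_loss(fused, s0):
    """Assumed form: the loss falls to 0 when the fused image matches S0 structurally."""
    return 1.0 - ssim_global(fused, s0)
```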
[0081] The total loss can be expressed as a weighted sum of the above loss terms, where the weights are preset weighting coefficients. The loss function is computed using only the inputs S0 and DoLP and the output fused image, without any ground truth image, thus achieving self-supervised training of the network.
[0082] During the training phase, the input images can be cropped into fixed-size patches to form mini-batches, and geometric augmentations such as random flipping, rotation, and scaling can be applied. These augmentations must be applied identically to S0 and DoLP to maintain pixel-level correspondence. The optimizer can be Adam, and the learning rate and the weight of each loss term can be set using a validation set. During the inference phase, only the registered S0 and DoLP images need to be input, as shown in Figures 4 and 5, and the fused image is output end-to-end, as shown in Figure 6.
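The requirement that augmentations be applied identically to S0 and DoLP can be sketched as follows: the random parameters are drawn once and then reused for both images. The function names are hypothetical, rotations are limited to multiples of 90 degrees, and scaling is omitted for brevity.

```python
import numpy as np

def paired_random_crop(s0, dolp, size, rng):
    """Crop the same size x size window from both registered images."""
    h, w = s0.shape
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    window = (slice(top, top + size), slice(left, left + size))
    return s0[window], dolp[window]

def paired_augment(s0, dolp, rng):
    """Draw one rotation and one flip decision, then apply them to both images."""
    k = int(rng.integers(0, 4))          # number of 90-degree rotations
    flip = bool(rng.integers(0, 2))      # horizontal flip or not
    out = []
    for img in (s0, dolp):
        aug = np.rot90(img, k)
        if flip:
            aug = aug[:, ::-1]
        out.append(np.ascontiguousarray(aug))
    return out[0], out[1]
```

A typical call would be `paired_augment(*paired_random_crop(s0, dolp, 128, rng), rng)` with `rng = np.random.default_rng()`, where the patch size 128 is only an example value.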
Claims
1. A polarization and light intensity image fusion method based on bidirectional cross attention, characterized in that the method includes the following steps:
S1. Obtain the pixel-level registered light intensity image S0 and polarization degree image DoLP, and normalize the light intensity image S0 and polarization degree image DoLP.
S2. Construct a fusion network to fuse the light intensity image S0 and the polarization degree image DoLP, specifically as follows:
A dual-branch feature extraction module is constructed, in which the light intensity image S0 is input into the intensity branch and the polarization degree image DoLP is input into the polarization degree branch. The S0 light intensity features and the DoLP features are obtained through shallow convolution and deep feature extraction, respectively.
A bidirectional cross-attention module is constructed to perform bidirectional cross-attention calculation on the S0 light intensity feature and the DoLP feature to obtain two enhanced features, which are concatenated along the channel dimension to obtain the fused feature;
A feature selection module is constructed to generate adaptive weights for its three input features, perform weighted feature fusion, and output the selected feature. The adaptive weights are generated by a feature weight sub-network consisting of a global average pooling layer, two fully connected layers, and a softmax activation function: first, the three input feature maps are concatenated along the channel dimension and compressed into a feature vector by pooling; the fully connected layers then learn three weight vectors, one for each input feature map, and softmax normalization makes the weights sum to 1; finally, the three feature maps are fused by a weighted sum using these weights;
A multi-scale feature pyramid module is constructed, in which the selected feature is downsampled to multiple scales, and the scales are aligned and fused to obtain pyramid features containing multi-scale contextual information;
A reconstruction module is constructed to map the pyramid features to the fused image;
S3. Optimize the fusion network using a self-supervised training strategy and construct a composite loss function that includes at least a gradient preservation loss and a polarization saliency enhancement loss.
2. The polarization and light intensity image fusion method based on bidirectional cross attention according to claim 1, characterized in that, in the dual-branch feature extraction module, the intensity branch and the polarization branch use independent weight parameters. Both branches include a first convolutional layer for extracting shallow features and multiple residual blocks or dense residual blocks for extracting deep features; the feature scales of the two branches are aligned by a downsampling layer to obtain the S0 light intensity features and the DoLP features; a channel attention module is added after the first convolutional layer of the intensity branch, and a spatial attention module is added after the first convolutional layer of the polarization branch.
3. The polarization and light intensity image fusion method based on bidirectional cross attention according to claim 2, characterized in that a convolutional layer is used in the bidirectional cross attention module to divide the input feature map into image patches, and each patch is flattened to form a feature vector $X$; layer normalization is applied to $X$, and the query vector Q, key vector K, and value vector V are obtained by projecting it through learnable weight matrices: $Q = XW_Q$, $K = XW_K$, $V = XW_V$, where $W_Q$, $W_K$ and $W_V$ are the learnable weight matrices corresponding to Q, K and V; The attention calculation from the intensity branch to the polarization degree branch is as follows: $\mathrm{Attn}_{S0 \to DoLP} = \mathrm{softmax}\left(Q_{S0} K_{DoLP}^{T} / \sqrt{d}\right) V_{DoLP}$; The attention calculation from the polarization degree branch to the intensity branch is as follows: $\mathrm{Attn}_{DoLP \to S0} = \mathrm{softmax}\left(Q_{DoLP} K_{S0}^{T} / \sqrt{d}\right) V_{S0}$; In $Q_{S0}$, $K_{S0}$, $V_{S0}$, $Q_{DoLP}$, $K_{DoLP}$ and $V_{DoLP}$, the subscripts indicate the query vector Q, key vector K, and value vector V corresponding to the S0 light intensity feature and the DoLP feature, respectively, and $d$ is the feature vector length. The two outputs of bidirectional cross-attention undergo nonlinear transformation by a multilayer perceptron, are restored to feature maps, and are then concatenated along the channel dimension to form the fused feature. The multilayer perceptron consists of two fully connected layers and a ReLU activation function.
4. The polarization and light intensity image fusion method based on bidirectional cross attention according to claim 3, characterized in that, in the multi-scale feature pyramid module, the selected feature is downsampled to multiple scales so that local details and global context are encoded separately; the features from each scale are upsampled and aligned to the same resolution, concatenated along the channel dimension, and fused using a 1×1 convolution to obtain the pyramid features.
5. The polarization and light intensity image fusion method based on bidirectional cross attention according to claim 4, characterized in that the reconstruction module consists of a 3×3 convolutional denoising and fusion layer, a residual block, a 3×3 convolutional reconstruction layer, and a 1×1 convolutional output layer, and maps the pyramid features to the fused image.