Dual-modal fusion method and system based on difference perception and cross attention
By detecting and completing occluded areas, generating and correcting difference-aware weights, and combining cross-attention calculation, the problems of unreliable fusion weights and the influence of occluded areas in existing technologies are solved, thereby improving the accuracy and completeness of bimodal image fusion.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NANYANG NORMAL UNIV
- Filing Date
- 2026-03-16
- Publication Date
- 2026-06-19
AI Technical Summary
In existing dual-modal image fusion methods, the allocation of fusion weights lacks reliable self-evaluation, leading to inaccuracies in registration deviation scenarios. The lack of information in occluded areas affects the completeness of feature extraction, and there is a lack of detection and completion processing for occluded areas.
By detecting and labeling occluded areas, completing visible light and infrared images, extracting texture edge intensity maps and temperature change response maps, generating difference-aware weight pairs, correcting weights through confidence maps, and combining cross-attention calculations, the fused image is reconstructed pixel by pixel with weights.
It improves the preservation of texture details and the accuracy of infrared target region representation in fused images under complex environments, eliminates the negative impact of registration deviation on the fusion results, and enhances the integrity and reliability of feature extraction.
Figure CN122243763A_ABST
Description
Technical Field
[0001] This application relates to the field of data processing technology, and in particular to a bimodal fusion method and system based on difference perception and cross-attention.
Background Technology
[0002] Visible light images and infrared images possess inherent information complementarity due to their different imaging principles. Visible light images rely on ambient lighting to reflect the texture edges and color details of a scene, while infrared images rely on target thermal radiation to reflect the temperature distribution information of a scene. Fusing these two modalities into a single image that simultaneously contains texture structure and thermal target information has significant application value in fields such as night vision detection and thermal imaging surveillance. Existing dual-modal image fusion methods are mainly divided into two categories: traditional methods and deep learning methods. Traditional methods include strategies based on pixel-weighted averaging, sparse representation, and multi-scale transformation, while deep learning methods extract and fuse features from the two images through convolutional neural networks or attention mechanisms.
[0003] However, existing fusion methods generally suffer from the following shortcomings: First, the allocation of fusion weights lacks explicit quantification of the differences in information content between the two modalities at various spatial locations. They either use fixed coefficients or rely on implicit learning by the network, failing to dynamically adjust the weight assignment based on the actual strength of the texture and temperature responses at each pixel location. Second, existing methods lack detection and completion processing of occluded regions before fusion. The loss of pixel information caused by occlusion directly affects the completeness of subsequent feature extraction. Third, existing methods lack an evaluation mechanism for the reliability of the generated fusion weights themselves. When there is a spatial registration deviation between the two modal images, the basis for weight calculation is inaccurate, and artifacts appear in the fusion result in conflict areas.
[0004] Further analysis of these shortcomings reveals a fundamental logical disconnect in existing methods between the allocation of fusion weights and the reliability of the weights themselves. Even with a difference-aware weight calculation mechanism, if the cosine similarity between the two feature maps is systematically low because of registration deviation, the difference-aware weights lose their credibility. Continuing to apply them propagates the incorrect weight allocation into the subsequent cross-attention stage, so that both the enhancement of infrared details by visible light texture and the suppression of visible light noise by temperature features are misdirected. The end result is superimposed modal noise in complex scenes. It is therefore necessary to introduce a self-evaluation and correction mechanism for weight reliability after the difference-aware weights are generated, cutting off this error propagation chain.
Summary of the Invention
[0005] This application provides a dual-modal fusion method and system based on difference perception and cross-attention, which solves the problems in existing dual-modal image fusion methods, such as the lack of reliable self-evaluation of difference perception weights leading to inaccuracy of fusion weights in registration deviation scenarios, and the loss of information in occluded areas affecting the integrity of feature extraction. It improves the preservation of texture details and the accuracy of infrared target region representation in fused images under complex environments.
[0006] Firstly, this application provides a bimodal fusion method based on difference perception and cross-attention, comprising:
[0007] Step S1: Detect and mark the occlusion areas in the visible light image and the infrared image as occlusion masks, and complete the visible light image and the infrared image according to the occlusion masks to obtain the completed visible light image and the completed infrared image.
[0008] Step S2: Extract a texture edge intensity map from the completed visible light image, extract a temperature change response map from the completed infrared image, and generate a difference-aware weight pair based on the pixel-by-pixel difference between the texture edge intensity map and the temperature change response map;
[0009] Step S3: Input the completed visible light image and the completed infrared image into the dual-branch encoder to obtain a visible light feature map and an infrared feature map. Construct a confidence map based on the cosine similarity between the visible light feature map and the infrared feature map. Correct the difference perception weight pair according to the confidence map to obtain a corrected weight pair.
[0010] Step S4: Using the visible light feature map as a guide, apply cross-attention calculation to the infrared feature map to obtain infrared enhancement features; using the infrared feature map as a guide, apply cross-attention calculation to the visible light feature map to obtain visible light noise reduction features; weight the infrared enhancement features and the visible light noise reduction features pixel by pixel according to the correction weight, and obtain the fused image through decoding and reconstruction.
[0011] Secondly, this application provides a bimodal fusion system based on difference perception and cross-attention, comprising:
[0012] The detection module is used to detect and mark the occlusion areas in the visible light image and infrared image as occlusion masks, and to complete the visible light image and infrared image respectively according to the occlusion masks to obtain the completed visible light image and the completed infrared image.
[0013] The extraction module is used to extract a texture edge intensity map from the completed visible light image, extract a temperature change response map from the completed infrared image, and generate a difference-aware weight pair based on the pixel-by-pixel difference between the texture edge intensity map and the temperature change response map.
[0014] The input module is used to input the completed visible light image and the completed infrared image into the dual-branch encoder respectively to obtain a visible light feature map and an infrared feature map. A confidence map is constructed based on the cosine similarity between the visible light feature map and the infrared feature map. The difference perception weight pair is corrected according to the confidence map to obtain a corrected weight pair.
[0015] The weighting module is used to apply cross-attention calculation to the infrared feature map guided by the visible light feature map to obtain infrared enhancement features; apply cross-attention calculation to the visible light feature map guided by the infrared feature map to obtain visible light noise reduction features; and perform pixel-by-pixel weighting on the infrared enhancement features and the visible light noise reduction features according to the correction weight, and obtain a fused image after decoding and reconstruction.
[0016] The technical solution provided in this application introduces a pixel completion mechanism based on occlusion masks before fusion processing. This repairs the information-deficient areas caused by occlusion in the visible light and infrared images before they enter the feature extraction stage, so that the subsequent extraction of the texture edge intensity map and the temperature change response map is based on spatially complete pixels and occlusion holes are not carried into the difference-aware weight calculation. Furthermore, the difference-aware weight pair is generated directly from the difference between the response values of the texture edge intensity map and the temperature change response map at each pixel location. The information contribution of the two modalities at each spatial location is explicitly quantified and converted into pixel-wise adaptive weights. Compared with the fixed-coefficient or implicitly learned weight allocation of existing technologies, the difference-aware weight pair has a clear numerical correspondence with the physical imaging information at each pixel location, which makes the weight allocation interpretable. The confidence map then evaluates the reliability of the difference-aware weight pair itself. By calculating the cosine similarity between the visible light feature map and the infrared feature map at each spatial location, the degree of agreement between the two modal features is quantified as a scalar value and used to correct the weights. At locations with low confidence, the weights are contracted towards equality, cutting off the link between registration deviation and the fusion result. The corrected weight pair is thus constrained by both difference-aware information and feature reliability information.
Attached Figure Description
[0017] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0018] Figure 1 is a schematic diagram of an embodiment of the dual-modal fusion method based on difference perception and cross-attention in this application.
Figure 2 is a schematic diagram of the convergence curves of each loss function during training in an embodiment of this application.
Figure 3 is a schematic diagram of how the mean of the confidence map changes with training rounds in an embodiment of this application.
Figure 4 is a grouped bar chart comparing the objective evaluation indicators of the above methods.
Figure 5 is a simulation diagram of the spatial distribution of the difference-aware weights.
Detailed Implementation
[0019] This application provides a bimodal fusion method and system based on difference perception and cross-attention. The terms "first," "second," "third," "fourth," etc. (if present) in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments described herein can be implemented in a sequence other than that illustrated or described herein. Furthermore, the terms "comprising" or "having" and any variations thereof are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or devices.
[0020] For ease of understanding, the specific process of the embodiments of this application is described below. Referring to Figure 1, one embodiment of the bimodal fusion method based on difference perception and cross-attention in this application includes:
[0021] Step S1: Detect and mark the occlusion areas in the visible light image and infrared image as occlusion masks, and complete the visible light image and infrared image respectively according to the occlusion masks to obtain the completed visible light image and the completed infrared image.
[0022] Specifically, both the visible light image and the infrared image are timestamped images acquired from the same scene. Before subsequent processing, the two images must have the same spatial resolution and be cropped to the same size so that pixel-by-pixel occlusion detection can be performed. The occlusion mask is generated as follows. For the visible light image, edge detection is performed, and closed regions whose gradient magnitude is abnormally low and whose connected component area exceeds the area threshold are identified. For the infrared image, connected components with abnormally low temperature are identified from the grayscale statistics of the entire image. The occlusion mask is the union of the two results. Pixels inside the occlusion mask are then filled from the outside to the inside along the boundary gradient direction, with a filling weight that decreases exponentially with the pixel distance from the boundary, resulting in the completed visible light image and the completed infrared image.
[0023] Step S2: Extract the texture edge intensity map from the completed visible light image and the temperature change response map from the completed infrared image. Generate a difference-aware weight pair based on the pixel-by-pixel difference between the texture edge intensity map and the temperature change response map.
[0024] Specifically, the texture edge intensity map is obtained by convolving the completed visible light image with horizontal and vertical convolution kernels and then taking, at each pixel, the square root of the sum of the squared gradient values. The temperature change response map is obtained by applying Gaussian smoothing to the completed infrared image, applying the Laplacian operator, and then taking the absolute value and normalizing it. Both maps are normalized to the range of zero to one before the pixel-by-pixel difference is computed. The difference determines which of the two weights in the difference-aware weight pair is larger: the modality with the higher response value receives the higher weight, and the two weights sum to one at every pixel position.
[0025] Step S3: Input the completed visible light image and the completed infrared image into the dual-branch encoder to obtain the visible light feature map and the infrared feature map respectively. Construct a confidence map based on the cosine similarity between the visible light feature map and the infrared feature map. Correct the difference perception weight pair according to the confidence map to obtain the corrected weight pair.
[0026] Specifically, the dual-branch encoder has a symmetrical two-branch structure. Each branch is composed of dense residual blocks connected in series. The completed visible light image and the completed infrared image are extracted through the corresponding branches to obtain visible light feature maps and infrared feature maps, respectively. After extracting feature vectors from the two feature maps at each spatial location, the dot product is calculated and divided by the product of their respective L2 norms to obtain the cosine similarity value at that location. The cosine similarity value of the entire image is linearly mapped to the range of zero to one to obtain a confidence map. The confidence map then performs pixel-by-pixel correction on the difference-aware weight pairs. Locations with high confidence retain the original weight allocation, while locations with low confidence have their weights shrunk towards equal distribution. When the mean of the entire confidence map is lower than the preset confidence threshold, spatial offset correction is triggered, and the correction calculation is re-executed to obtain the corrected weight pairs.
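The pixel-by-pixel correction described above can be sketched as follows. The text only states that high-confidence locations keep their weights and low-confidence locations shrink towards equal distribution; the linear interpolation used here, the function names, and the threshold value are assumptions for illustration, not the formula of this application.

```python
import numpy as np

def correct_weights(w_vis, w_ir, confidence):
    """Shrink a weight pair toward (0.5, 0.5) where confidence is low.

    Assumption: linear interpolation between the original weight and 0.5,
    controlled by the confidence value in [0, 1].
    """
    c = np.clip(confidence, 0.0, 1.0)
    w_vis_corr = c * w_vis + (1.0 - c) * 0.5
    w_ir_corr = c * w_ir + (1.0 - c) * 0.5
    return w_vis_corr, w_ir_corr

def needs_offset_correction(confidence, threshold=0.5):
    """True when the mean confidence falls below a preset threshold.

    The threshold value 0.5 is hypothetical; the text does not give one.
    """
    return float(confidence.mean()) < threshold

w_vis = np.array([[0.9, 0.9], [0.2, 0.2]])
w_ir = 1.0 - w_vis
conf = np.array([[1.0, 0.0], [1.0, 0.5]])
w_vis_c, w_ir_c = correct_weights(w_vis, w_ir, conf)
```

With this interpolation the corrected weights still sum to one at every pixel, because both members of the pair are pulled towards 0.5 by the same factor.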
[0027] Step S4: Apply cross-attention calculation to the infrared feature map guided by the visible light feature map to obtain the infrared enhancement feature; apply cross-attention calculation to the visible light feature map guided by the infrared feature map to obtain the visible light noise reduction feature; perform pixel-by-pixel weighting on the infrared enhancement feature and the visible light noise reduction feature according to the correction weight, and obtain the fused image through decoding and reconstruction.
[0028] Specifically, both cross-attention calculations employ a scaled dot-product attention structure. In the first calculation, the query matrix is obtained by linear projection of the visible light feature map, and the key and value matrices are obtained by linear projection of the infrared feature map. The query matrix is multiplied by the transpose of the key matrix, the result is divided by the square root of the projection dimension and normalized, and the normalized attention weights are multiplied by the value matrix. The channel dimension is then restored by linear projection to obtain the infrared enhancement feature. In the second calculation, the query matrix is obtained from the infrared feature map, and the key and value matrices are obtained from the visible light feature map. The visible light noise reduction feature is calculated using the same structure. The two feature paths are weighted and summed pixel by pixel according to the corrected weights and then input into a decoder consisting of three deconvolutional layers for layer-by-layer reconstruction to obtain the fused image.
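A minimal NumPy sketch of the two cross-attention passes and the weighted sum may help make the data flow concrete. It assumes feature maps flattened to (positions, channels), softmax as the normalization, random projection matrices standing in for learned ones, and that the infrared weight multiplies the infrared enhancement feature while the visible weight multiplies the visible light noise reduction feature. All names and sizes are hypothetical, and the decoder is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feat, context_feat, w_q, w_k, w_v, w_o):
    """Scaled dot-product cross-attention on flattened (N, C) feature maps."""
    q = query_feat @ w_q                 # (N, d) queries from the guiding modality
    k = context_feat @ w_k               # (N, d) keys from the other modality
    v = context_feat @ w_v               # (N, d) values from the other modality
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)   # (N, N), rows sum to 1
    return (attn @ v) @ w_o              # (N, C), channel dimension restored

rng = np.random.default_rng(0)
n, c, d = 16, 8, 4                       # 4x4 positions, 8 channels, projection dim 4

def random_projections():
    # Stand-ins for learned projection matrices (assumption).
    return (rng.standard_normal((c, d)), rng.standard_normal((c, d)),
            rng.standard_normal((c, d)), rng.standard_normal((d, c)))

feat_vis = rng.standard_normal((n, c))
feat_ir = rng.standard_normal((n, c))

# Pass 1: visible light queries, infrared keys/values -> infrared enhancement feature.
ir_enhanced = cross_attention(feat_vis, feat_ir, *random_projections())
# Pass 2: infrared queries, visible light keys/values -> visible noise reduction feature.
vis_denoised = cross_attention(feat_ir, feat_vis, *random_projections())

w_ir = rng.uniform(size=(n, 1))          # placeholder corrected weights, one per position
w_vis = 1.0 - w_ir
fused_features = w_ir * ir_enhanced + w_vis * vis_denoised
```

Each pass uses its own projection matrices, so the two directions do not share parameters in this sketch.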
[0029] Figure 2 is a schematic diagram of the convergence curves of the loss functions during training in this embodiment of the application. The thick solid line represents the convergence curve of the total loss, and the thin solid line, long dashed line, short dashed line, and dotted line represent the convergence curves of the alignment loss, gradient loss, structure loss, and confidence loss, respectively. The horizontal axis is the training round and the vertical axis is the loss value.
[0030] In this application, the cross-attention mechanism is implemented in a bidirectional asymmetric manner. First, the visible light feature map is used as a query guide to apply attention calculation to the infrared feature map, allowing the spatial distribution structure of the visible light texture to directly affect the enhancement direction of the infrared features. This introduces structural priors from visible light into regions where uniform thermal radiation causes blurred details in the infrared image. Second, the infrared feature map is used as a query guide to apply attention calculation to the visible light feature map, allowing temperature distribution features to directly participate in the weighted calculation of visible light features, suppressing background noise locations unrelated to thermal radiation. The two cross-attention calculations are designed separately for modal information defects in different directions, rather than using a uniform symmetric structure. This asymmetric design allows the contribution of visible light texture to the enhancement of infrared details and the contribution of infrared temperature to the suppression of visible light noise to be independently modeled at the algorithm level. Finally, the infrared enhancement features and visible light noise reduction features are weighted and fused pixel-by-pixel based on correction weights. The calculation results of the three stages—difference perception, reliability assessment, and cross-enhancement—are integrated under the same weighted framework. The output fused image is specifically enhanced in both the texture structure and thermal target representation dimensions.
[0031] In one specific embodiment, step S1 includes:
[0032] Edge detection is performed on visible light images, and closed regions with gradient magnitudes lower than the gradient magnitude threshold and connected region areas greater than the area threshold are marked as visible light occlusion candidate regions.
[0033] Calculate the mean gray level and standard deviation of all pixels in the infrared image, and mark the connected components with gray level values lower than the mean gray level minus twice the standard deviation as candidate regions for infrared occlusion.
[0034] The occlusion mask is obtained by taking the union of the visible light occlusion candidate region and the infrared occlusion candidate region.
[0035] The occlusion boundary pixels are determined from the occlusion mask. Filling propagates from the outside to the inside along the gradient direction of the occlusion boundary pixels: the texture response value and temperature response value at the boundary are filled into the occlusion mask layer by layer, with a filling weight that decreases exponentially with the pixel distance from the occlusion boundary, thus obtaining the completed visible light image and the completed infrared image.
[0036] Specifically, visible light occlusion candidate regions are identified by abnormally low gradient magnitudes after edge detection. A gradient magnitude below the threshold indicates that the pixel grayscale changes gradually and the region lacks normal texture structure. Adding the constraint that the connected component area must exceed an area threshold prevents flat background areas in the image from being mistakenly labeled as occlusion. Both conditions must be met for a region to be labeled as a visible light occlusion candidate region. Infrared occlusion candidate regions are identified from the grayscale statistics of the entire image. The grayscale mean reflects the average level of thermal radiation across the image, while the grayscale standard deviation reflects its dispersion. A connected component whose grayscale values are lower than the grayscale mean minus twice the grayscale standard deviation has a thermal radiation intensity significantly below the normal level of the image and is classified as an abnormal region where thermal radiation is occluded. The union of the two candidate regions yields the occlusion mask, which covers every pixel location with missing information in either modality.
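The infrared criterion and the union step can be illustrated with a short sketch. It applies the mean-minus-two-standard-deviations test per pixel and leaves out connected component grouping; the visible light candidate mask is a hand-written placeholder, since the gradient and area thresholds are not given in the text.

```python
import numpy as np

def infrared_low_mask(ir):
    """Mark pixels whose gray level is below mean - 2 * std of the whole image."""
    mean = ir.mean()
    std = ir.std()
    return ir < (mean - 2.0 * std)

# Toy infrared image: uniform background with a cold 2x2 patch.
ir = np.full((6, 6), 100.0)
ir[2:4, 2:4] = 10.0
mask_ir = infrared_low_mask(ir)

# Placeholder visible light candidate mask (hypothetical values).
mask_vis = np.zeros_like(mask_ir)
mask_vis[0, 0:2] = True

# The occlusion mask is the union of the two candidate masks.
occlusion_mask = mask_vis | mask_ir
```

In this toy image the mean is 90 and the standard deviation is about 28.3, so the cut-off is about 33.4 and only the cold patch is marked.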
[0037] The occlusion boundary pixels are the pixels at the edge of the occlusion mask that are adjacent to the non-occluded area. Propagating from the outside to the inside along the gradient direction of the boundary pixels means taking the known texture response value and temperature response value at the boundary as the starting point and advancing into the occlusion mask layer by layer, following the gradient direction at the boundary and filling one layer at a time. The filling weight decreases exponentially with the pixel distance from the occlusion boundary, so pixel positions farther from the boundary are assigned smaller filling weights and the filled result transitions smoothly from the boundary to the interior. The filling process is executed independently for the texture response value and the temperature response value, finally yielding the completed visible light image and the completed infrared image.
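The layer-by-layer ordering and the exponentially decaying filling weight can be sketched as below. The sketch only computes, for each masked pixel, its layer number from the boundary and the corresponding weight; how the weight is combined with the propagated boundary values is not specified in the text and is left out. The 4-neighbour layer definition and the decay rate are assumptions.

```python
import numpy as np

def fill_layers(mask):
    """Layer number of each masked pixel (1 = touches a known pixel), 0 outside the mask."""
    mask = mask.astype(bool)
    layers = np.zeros(mask.shape, dtype=int)
    remaining = mask.copy()
    known = ~mask
    k = 0
    while remaining.any():
        k += 1
        padded = np.pad(known, 1, constant_values=False)
        # A pixel is on the current frontier if any 4-neighbour is already known.
        has_known_neighbor = (padded[:-2, 1:-1] | padded[2:, 1:-1] |
                              padded[1:-1, :-2] | padded[1:-1, 2:])
        frontier = remaining & has_known_neighbor
        if not frontier.any():
            break  # no known pixels to start from (mask covers the whole image)
        layers[frontier] = k
        known |= frontier
        remaining &= ~frontier
    return layers

def fill_weights(layers, decay=0.5):
    """Exponentially decaying weight per layer; decay=0.5 is a hypothetical rate."""
    return np.where(layers > 0, np.exp(-decay * layers), 0.0)

mask = np.zeros((7, 7), dtype=bool)
mask[1:6, 1:6] = True            # 5x5 occluded block
layers = fill_layers(mask)
weights = fill_weights(layers)
```

For the 5x5 block the outer ring is layer 1, the next ring is layer 2, and the centre pixel is layer 3, so the weight is smallest at the centre.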
[0038] In one specific embodiment, step S2, extracting the texture edge intensity map from the completed visible light image, includes:
[0039] Apply horizontal and vertical convolution kernels to the completed visible light image to obtain the horizontal gradient response map and the vertical gradient response map;
[0040] Based on the horizontal gradient response map and the vertical gradient response map, the square root of the sum of the squares of the horizontal gradient value and the vertical gradient value at each pixel location is calculated to obtain the texture edge intensity map.
[0041] Extracting the temperature change response map from the completed infrared image includes: applying Gaussian smoothing to the completed infrared image to obtain a smoothed infrared image; applying a Laplacian operator to the smoothed infrared image to obtain a second derivative response map; and taking the absolute value of the second derivative response map and normalizing it to obtain the temperature change response map.
[0042] The difference between the response values of the texture edge intensity map and the temperature change response map at each pixel position is calculated. When the response value of the texture edge intensity map at the corresponding position is higher than that of the temperature change response map, a higher weight is assigned to the visible light direction. When the response value of the temperature change response map at the corresponding position is higher than that of the texture edge intensity map, a higher weight is assigned to the infrared direction. The sum of the weights in the two directions remains one at each pixel position, thus obtaining the difference perception weight pair.
[0043] Specifically, the horizontal convolution kernel and the vertical convolution kernel are the horizontal kernel and vertical kernel of the Sobel operator, respectively. The horizontal kernel is sensitive to the pixel grayscale changes in the horizontal direction of the image, and the vertical kernel is sensitive to the pixel grayscale changes in the vertical direction of the image. After performing convolution operations with the completed visible light image, the gradient response values of each pixel position in the horizontal and vertical directions are obtained, namely the horizontal gradient response map and the vertical gradient response map. The value of each pixel position in the two maps represents the intensity of grayscale change in the corresponding direction at that position.
[0044] For each pixel in the horizontal and vertical gradient response maps, the horizontal and vertical gradient values are squared and summed, and the square root is taken to obtain the combined gradient magnitude at that location. The combined gradient magnitudes of the entire image constitute the texture edge intensity map. Locations with higher values in the texture edge intensity map correspond to areas with rich texture or clear edges in the completed visible light image, while locations with lower values correspond to flat background areas. For the infrared branch, Gaussian smoothing is first applied to the completed infrared image to suppress high-frequency interference introduced by sensor noise, giving the smoothed infrared image. The Laplacian operator is then applied to the smoothed infrared image to calculate the second derivative at each pixel location. Locations with abrupt changes in grayscale value respond strongly, while locations with gradual changes have a response approaching 0, resulting in the second derivative response map. The absolute value of the second derivative response map is taken and normalized to the 0-1 range to obtain the temperature change response map. Locations with higher values in the temperature change response map correspond to the boundaries of hot targets or areas with abrupt temperature gradient changes in the infrared image.
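The two response maps can be sketched in plain NumPy as follows. The Sobel kernels follow the text; the 3x3 Gaussian kernel, the 4-neighbour Laplacian kernel, edge-replicating padding, and min-max normalization are assumptions chosen for brevity. The helper computes a correlation rather than a flipped convolution, which changes only the sign of the Sobel responses and therefore not the gradient magnitude.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T
GAUSS = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=float) / 16.0   # assumed kernel
LAPLACE = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)      # assumed kernel

def filter3x3(img, kernel):
    """Correlate a 2-D image with a 3x3 kernel, replicating edge pixels."""
    p = np.pad(img.astype(float), 1, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * p[dy:dy + h, dx:dx + w]
    return out

def normalize01(x):
    """Min-max normalization to [0, 1]; a constant map becomes all zeros."""
    lo, hi = x.min(), x.max()
    return np.zeros_like(x) if hi == lo else (x - lo) / (hi - lo)

def texture_edge_intensity(vis):
    gx = filter3x3(vis, SOBEL_X)          # horizontal gradient response map
    gy = filter3x3(vis, SOBEL_Y)          # vertical gradient response map
    return normalize01(np.sqrt(gx ** 2 + gy ** 2))

def temperature_change_response(ir):
    smoothed = filter3x3(ir, GAUSS)       # suppress sensor noise
    second = filter3x3(smoothed, LAPLACE) # second derivative response map
    return normalize01(np.abs(second))

vis = np.zeros((8, 8)); vis[:, 4:] = 1.0  # vertical step edge
ir = np.zeros((8, 8)); ir[3:5, 3:5] = 1.0 # small hot target
texture_map = texture_edge_intensity(vis)
thermal_map = temperature_change_response(ir)
```

In the toy inputs the texture map peaks along the step edge and the thermal map peaks around the hot square, while flat regions stay at zero.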
[0045] Both the texture edge intensity map and the temperature change response map have been normalized to the 0-1 range. The response values of the two maps at the same pixel position are directly comparable. The absolute value of the difference after subtracting each pixel reflects the relative strength of the information of the two modes at that position. The larger the difference, the more unbalanced the information contribution of the two modes at that position. In the difference perception weight pair, the higher weight shifts towards the mode with the larger response value. Specifically, when the response value of the texture edge intensity map at that position is higher than that of the temperature change response map, the visible light direction weight is set to 0.5 plus half of the absolute value of the difference, and the infrared direction weight is 1 minus the visible light direction weight. When the response value of the temperature change response map at that position is higher than that of the texture edge intensity map, the infrared direction weight is set to 0.5 plus half of the absolute value of the difference, and the visible light direction weight is 1 minus the infrared direction weight. The sum of the weights in the two directions is always 1 at each pixel position, forming a difference perception weight pair.
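The piecewise rule above can be written directly in NumPy. The function and variable names are illustrative; the tie case (equal responses) is not stated in the text, and this sketch lets it fall out as 0.5 for both directions.

```python
import numpy as np

def difference_aware_weights(texture, thermal):
    """Weight pair from two response maps already normalized to [0, 1]."""
    diff = texture - thermal
    half = 0.5 * np.abs(diff)
    # Texture stronger: visible weight = 0.5 + |diff| / 2.
    # Thermal stronger: infrared weight = 0.5 + |diff| / 2, so visible = 0.5 - |diff| / 2.
    w_vis = np.where(diff >= 0, 0.5 + half, 0.5 - half)
    w_ir = 1.0 - w_vis
    return w_vis, w_ir

texture = np.array([[0.8, 0.1], [0.5, 1.0]])
thermal = np.array([[0.2, 0.7], [0.5, 0.0]])
w_vis, w_ir = difference_aware_weights(texture, thermal)
# w_vis is [[0.8, 0.2], [0.5, 1.0]] and w_ir is [[0.2, 0.8], [0.5, 0.0]]
```

The two branches collapse to the single expression `w_vis = 0.5 + (texture - thermal) / 2`, which makes it easy to see that the weights stay within [0, 1] and sum to 1.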
[0046] In one specific embodiment, step S3 involves inputting the completed visible light image and the completed infrared image into a dual-branch encoder to obtain a visible light feature map and an infrared feature map, including:
[0047] The dual-branch encoder includes a structurally symmetrical visible light coding branch and an infrared coding branch. Each branch consists of three dense residual blocks connected in series. Each dense residual block contains three convolutional layers with a kernel size of three by three. The input of each layer is the result of concatenating the channel dimensions of the feature maps output by all previous layers.
[0048] The completed visible light image is input into the visible light coding branch, and the visible light feature map is obtained by extracting it layer by layer through three dense residual blocks; the completed infrared image is input into the infrared coding branch, and the infrared feature map is obtained by extracting it layer by layer through three dense residual blocks.
[0049] Specifically, the input to each layer in a dense residual block is the channel-wise concatenation of the output feature maps of all preceding layers. The input to layer 1 is the input of the block (for the first block, the input image of the encoding branch); the input to layer 2 is the concatenation of that input and the output feature map of layer 1 along the channel dimension; and the input to layer 3 is the concatenation of that input, the output of layer 1, and the output of layer 2 along the channel dimension. Channel-wise concatenation means stacking multiple feature maps along the channel axis to form a feature map with more channels, rather than summing pixel values. The concatenated feature map retains both shallow and deep feature information. The 3×3 kernel size means that each convolution covers a local receptive field of 9 pixels (3×3) centered on the current pixel. With a stride of 1 and padding of 1, the spatial dimensions of the feature maps remain unchanged before and after convolution.
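The concatenation pattern can be made concrete with a small NumPy sketch of one block. The growth of 4 channels per layer, the ReLU activation, the random weights, and the omission of the residual connection are all assumptions; only the 3×3 kernels, stride 1, padding 1, and the concatenation of all preceding outputs come from the text.

```python
import numpy as np

def conv3x3(x, weight):
    """x: (C_in, H, W); weight: (C_out, C_in, 3, 3); stride 1, zero padding 1."""
    _, h, w = x.shape
    p = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((weight.shape[0], h, w))
    for dy in range(3):
        for dx in range(3):
            # (C_out, C_in) contracted with (C_in, H, W) gives (C_out, H, W)
            out += np.tensordot(weight[:, :, dy, dx], p[:, dy:dy + h, dx:dx + w], axes=1)
    return out

def dense_block(x, weights):
    """Run three densely connected conv layers; return layer outputs and input channel counts."""
    features = [x]
    in_channels = []
    for wgt in weights:
        inp = np.concatenate(features, axis=0)   # stack along the channel axis
        in_channels.append(inp.shape[0])
        features.append(np.maximum(conv3x3(inp, wgt), 0.0))  # ReLU is an assumption
    return features[1:], in_channels

rng = np.random.default_rng(0)
c0, growth, height, width = 1, 4, 5, 5           # hypothetical sizes
weights = [rng.standard_normal((growth, c0 + i * growth, 3, 3)) for i in range(3)]
outs, in_channels = dense_block(rng.standard_normal((c0, height, width)), weights)
# in_channels is [1, 5, 9]; every output keeps the 5x5 spatial size
```

The growing input channel counts show why concatenation differs from summation: each layer sees every earlier feature map unchanged.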
[0050] The visible light coding branch and the infrared coding branch are structurally symmetrical but have independent weights; the two branches do not share convolution parameters. After the completed visible light image passes through the three serially connected dense residual blocks of the visible light coding branch, the output visible light feature map retains the semantic information of the texture structure of the visible light image. After the completed infrared image passes through the three serially connected dense residual blocks of the infrared coding branch, the output infrared feature map retains the semantic information of the temperature distribution of the infrared image. The spatial size of both feature maps is consistent with the input image, and each has 128 channels. Each spatial location corresponds to a 128-dimensional feature vector, which is the input unit for the subsequent cosine similarity calculation.
[0051] In one specific embodiment, step S3, constructing a confidence map based on the cosine similarity between the visible light feature map and the infrared feature map, includes:
[0052] At each spatial location, extract the feature vectors of the visible light feature map and the infrared feature map at the corresponding locations, calculate the dot product of the two feature vectors and divide it by the product of the two feature vectors' respective L2 norms to obtain the cosine similarity value at the corresponding location.
[0053] The cosine similarity value of each spatial location in the entire map is linearly mapped from the range of negative one to positive one to the range of zero to one, thus obtaining the confidence map.
[0054] Specifically, the cosine similarity is calculated for two 128-dimensional feature vectors in the visible light feature map and the infrared feature map at the same spatial location. The dot product of the two feature vectors is the sum of the product of the corresponding dimension values. The L2 norm of each feature vector is the square root of the sum of the squares of all their dimension values. The dot product divided by the product of the two L2 norms gives the cosine similarity value at that location. The cosine similarity value ranges from -1 to 1. A value of 1 indicates that the two feature vectors are in the same direction, a value of -1 indicates that the two feature vectors are in opposite directions, and a value of 0 indicates that the two feature vectors are orthogonal and have no correlation in direction. Cosine similarity measures the consistency of the directions of the two feature vectors rather than their magnitude, so it is not affected by the difference in the numerical scale of the two feature maps. After repeating the above calculation for each spatial location in the entire image, a cosine similarity distribution map with the same spatial size as the feature map is obtained.
[0055] The value of each pixel in the cosine similarity distribution map is linearly mapped from the range of -1 to 1 to the range of 0 to 1. The mapping method is to add 1 to the original cosine similarity value and then divide by 2. After mapping, the position with a value of 1 corresponds to the spatial position where the two modal feature vectors are completely aligned, and the position with a value of 0 corresponds to the spatial position where the two modal feature vectors are completely opposite. The resulting full-map distribution is the confidence map. The confidence map has the same spatial size as the visible light feature map and the infrared feature map. Each pixel position stores a scalar value between 0 and 1, which directly participates in the subsequent pixel-by-pixel correction calculation of the difference perception weight pair.
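The per-location cosine similarity and its linear mapping to the range 0 to 1 can be sketched as follows. This is a minimal pure-Python illustration, not the applicant's implementation: the function names are hypothetical, feature maps are represented as nested lists of shape height × width × channels, and the small `eps` guard against zero-norm vectors is an assumption that the source does not mention.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float], eps: float = 1e-12) -> float:
    """Dot product divided by the product of the two L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # eps avoids division by zero for all-zero vectors (assumption, not in the source)
    return dot / max(norm_a * norm_b, eps)


def confidence(a: Sequence[float], b: Sequence[float]) -> float:
    """Map cosine similarity from [-1, 1] to [0, 1] via (s + 1) / 2."""
    return (cosine_similarity(a, b) + 1.0) / 2.0


def confidence_map(vis: List[List[Sequence[float]]],
                   ir: List[List[Sequence[float]]]) -> List[List[float]]:
    """Apply `confidence` at every spatial location of two H x W x C feature maps."""
    return [[confidence(v, r) for v, r in zip(v_row, r_row)]
            for v_row, r_row in zip(vis, ir)]
```

Because only the direction of the vectors matters, scaling either feature vector by a positive constant leaves the confidence value unchanged.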
[0056] In one specific embodiment, step S3 involves correcting the difference perception weight pairs based on the confidence map to obtain corrected weight pairs, including:
[0057] At each pixel location, the value of the confidence map at the corresponding location is multiplied by the visible light direction weight in the difference perception weight pair, and then added to the result of multiplying the difference between one and the value of the confidence map at the corresponding location by half to obtain the corrected visible light weight at the corresponding location.
[0058] The infrared direction weights corresponding to the difference perception weight pairs are calculated in the same way to obtain the corrected infrared weights at the corresponding positions; the sum of the corrected visible light weights and the corrected infrared weights at each pixel position is kept to be one, forming a correction weight pair;
[0059] When the mean of the entire confidence map is lower than the preset confidence threshold, a learnable spatial offset correction is applied to the infrared feature map. The mean of the entire confidence map is used as the supervision signal, and the offset is iteratively adjusted until the mean of the entire map is not lower than the preset confidence threshold. Then, the calculation of the correction weight pair is performed.
[0060] Specifically, the calculation process for the corrected visible light weight is as follows: multiply the value of the confidence map at the pixel position by the visible light direction weight in the difference perception weight pair to obtain the first term; then multiply the difference between 1 and the confidence map value at the position by 0.5 to obtain the second term; add the first term and the second term to obtain the corrected visible light weight at that position. When the confidence map value is 1 at this position, the second term is 0, and the corrected visible light weight is equal to the original visible light direction weight, which completely preserves the original allocation result of the difference perception weight pair; when the confidence map value is 0 at this position, the first term is 0, and the corrected visible light weight is equal to 0.5, which degenerates into equal weights for both modes; when the confidence map value is between 0 and 1, the corrected visible light weight is linearly interpolated between the original difference perception weight and 0.5. The corrected infrared weight is calculated in the same way. Since the sum of the visible light direction weight and the infrared direction weight in the difference perception weight pair is 1, the two first terms sum to the confidence value; since the second term of each corrected weight is 1 minus the confidence value, multiplied by 0.5, the two second terms sum to 1 minus the confidence value. The sum of the two corrected weights is therefore always 1.
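The correction rule stated in this embodiment (the confidence value times the original weight, plus one minus the confidence value times one half) can be sketched for a single pixel as below. The function and variable names are hypothetical and chosen only for illustration.

```python
from typing import Tuple


def correct_weights(w_vis: float, w_ir: float, c: float) -> Tuple[float, float]:
    """Blend each difference-aware weight toward 0.5 according to confidence c.

    corrected = c * w + (1 - c) * 0.5, applied identically to both weights.
    If w_vis + w_ir == 1, the corrected pair also sums to 1.
    """
    w_vis_corrected = c * w_vis + (1.0 - c) * 0.5
    w_ir_corrected = c * w_ir + (1.0 - c) * 0.5
    return w_vis_corrected, w_ir_corrected
```

At `c = 1` the original pair is returned unchanged, and at `c = 0` both weights collapse to 0.5, which are the two limiting cases described in the text.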
[0061] When the mean of the entire confidence map is lower than the preset confidence threshold, it indicates that there is a conflict in the feature vector direction between the visible light feature map and the infrared feature map in a large number of spatial locations. This situation is caused by sub-pixel-level spatial registration deviation between the two images. At this time, a learnable spatial offset correction is applied to the infrared feature map. The spatial offset correction is achieved by performing bilinear interpolation spatial offset on the infrared feature map. The offset is composed of two learnable parameters, horizontal and vertical. The mean of the entire confidence map is used as the supervision signal. The horizontal and vertical offsets are updated iteratively through backpropagation gradient. The mean of the entire confidence map is recalculated after each iteration until the mean of the entire map is not lower than the preset confidence threshold. After that, the corrected infrared feature map is used again to participate in the correction calculation of the difference perception weight pair to obtain the corrected weight pair.
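The sampling step behind a bilinear-interpolation spatial offset can be sketched as follows for a single-channel map. This is a simplified illustration under assumptions: the function name is hypothetical, coordinates that fall outside the map are clamped to the border (the source does not specify border handling), and the gradient-based update of the two offset parameters is omitted.

```python
import math
from typing import List


def bilinear_shift(fmap: List[List[float]], dx: float, dy: float) -> List[List[float]]:
    """Resample a 2D map at (x + dx, y + dy) using bilinear interpolation.

    dx and dy play the role of the horizontal and vertical offsets.
    Out-of-range sample coordinates are clamped to the border (assumption).
    """
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sx = min(max(x + dx, 0.0), w - 1.0)
            sy = min(max(y + dy, 0.0), h - 1.0)
            x0, y0 = int(math.floor(sx)), int(math.floor(sy))
            x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
            fx, fy = sx - x0, sy - y0
            # Interpolate along x on the two neighbouring rows, then along y
            top = (1 - fx) * fmap[y0][x0] + fx * fmap[y0][x1]
            bottom = (1 - fx) * fmap[y1][x0] + fx * fmap[y1][x1]
            out[y][x] = (1 - fy) * top + fy * bottom
    return out
```

Because the output is a smooth function of `dx` and `dy` between integer positions, sub-pixel offsets can in principle be adjusted by gradient-based optimization, as the embodiment describes.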
[0062] Figure 3 is a schematic diagram illustrating how the full-map mean of the confidence map changes with training rounds in an embodiment of this application. The solid line represents the full-map mean of the confidence map with the spatial offset correction module, the dashed line represents the full-map mean of the confidence map without the spatial offset correction module, the dotted line represents the preset confidence threshold of 0.4, the filled area is the numerical difference interval between the two curves, the horizontal axis is the training rounds, and the vertical axis is the full-map mean of the confidence map.
[0063] In one specific embodiment, in step S4, cross-attention calculation is applied to the infrared feature map guided by the visible light feature map to obtain infrared enhancement features; cross-attention calculation is applied to the visible light feature map guided by the infrared feature map to obtain visible light noise reduction features, including:
[0064] The query matrix is obtained by linearly projecting the visible light feature map, and the key matrix and value matrix are obtained by linearly projecting the infrared feature map respectively. The dot product of the query matrix and the transpose of the key matrix is divided by the square root of the projection dimension, normalized, and then multiplied by the value matrix. The result is restored to the original channel dimension by linear projection to obtain the infrared enhancement feature.
[0065] The query matrix is obtained by linearly projecting the infrared feature map, and the key matrix and value matrix are obtained by linearly projecting the visible light feature map respectively. The dot product of the query matrix and the transpose of the key matrix is divided by the square root of the projection dimension, normalized, and then multiplied by the value matrix. The result is restored to the original channel dimension by linear projection to obtain the visible light noise reduction feature.
[0066] Based on the corrected visible light weight and corrected infrared weight at each pixel position in the correction weight pair, the visible light noise reduction features and infrared enhancement features are weighted and summed pixel by pixel. The weighted summation result is input into the decoder and reconstructed layer by layer through three deconvolution layers to obtain the fused image.
[0067] Specifically, linear projection means performing matrix multiplication on the feature vector at each pixel position in the input feature map, mapping the original 128-dimensional feature vector to a 64-dimensional projection space, and obtaining the corresponding query matrix, key matrix, or value matrix. The spatial dimensions of the query matrix, key matrix, and value matrix are consistent with those of the input feature map, and the number of channels is 64. The dot product of the query matrix and the transpose of the key matrix is calculated as follows: a dot product operation is performed on the 64-dimensional vector of each query position and the 64-dimensional vector of each key position to obtain the correlation value between the query position and the key position. The correlation values of all query positions and key positions in the entire image constitute the attention correlation matrix. Each value in the attention correlation matrix is divided by the square root of the projection dimension 64 to scale the range of the dot product result and avoid the gradient vanishing after normalization due to excessively large values. After normalization, the attention weight matrix is obtained, in which the values in each row sum to 1. The attention weight matrix is then multiplied by the value matrix, i.e., the feature vectors of all value positions in the entire image are weighted and summed using the attention weight matrix to obtain the enhanced feature vector of each query position. Finally, each 64-dimensional vector is restored to a 128-dimensional vector through linear projection to obtain the infrared enhancement feature or the visible light noise reduction feature.
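The cross-attention computation described above can be sketched in pure Python on flattened feature maps (one row per pixel position). This is an illustration under assumptions, not the applicant's implementation: all names are hypothetical, the projection matrices are passed in rather than learned, and the row-wise normalization is assumed to be softmax, which is consistent with each row summing to 1 but is not named in the source.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def softmax_rows(a: Matrix) -> Matrix:
    """Row-wise softmax (assumed normalization); each row sums to 1."""
    out = []
    for row in a:
        m = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cross_attention(query_feats: Matrix, context_feats: Matrix,
                    w_q: Matrix, w_k: Matrix, w_v: Matrix,
                    w_o: Matrix) -> Tuple[Matrix, Matrix]:
    """Queries come from one modality; keys and values from the other.

    query_feats, context_feats: N x C (N pixel positions, C channels)
    w_q, w_k, w_v: C x d projections; w_o: d x C output projection.
    Returns (N x C output features, N x N attention weights).
    """
    q = matmul(query_feats, w_q)
    k = matmul(context_feats, w_k)
    v = matmul(context_feats, w_v)
    d = len(w_q[0])
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, transpose(k))]
    attn = softmax_rows(scores)
    return matmul(matmul(attn, v), w_o), attn
```

In the terms of the embodiment, passing visible light features as `query_feats` and infrared features as `context_feats` corresponds to the infrared enhancement feature, and swapping the two corresponds to the visible light noise reduction feature; there, C would be 128 and d would be 64.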
[0068] The pixel-wise weighted summation is calculated as follows: At each pixel position, the infrared correction weight in the correction weight pair is multiplied by the 128-dimensional feature vector of the infrared enhancement feature at that position, and the visible light correction weight is multiplied by the 128-dimensional feature vector of the visible light noise reduction feature at that position. The corresponding dimensions of the two product vectors are added to obtain the fusion feature vector at that position. The fusion feature vectors at all positions in the entire image constitute the fusion feature map. The fusion feature map is input to a decoder consisting of three deconvolutional layers. The kernel size of each layer is 3×3, the stride is 1, and the padding is 1. The number of output channels is 64, 32, and 1 respectively. Finally, the pixel values are constrained to the range of 0 to 1 by the Sigmoid activation function to obtain a single-channel fusion image.
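In symbols, the pixel-wise weighted summation and the final activation can be summarized as below. The notation is introduced here only for illustration and does not appear in the source: p is a pixel position, E_ir and D_vis are the infrared enhancement feature and the visible light noise reduction feature, the tilde-w terms are the corrected weights, Dec is the three-layer decoder, and sigma is the Sigmoid function.

```latex
\[
F(p) = \tilde{w}_{\mathrm{ir}}(p)\, E_{\mathrm{ir}}(p)
     + \tilde{w}_{\mathrm{vis}}(p)\, D_{\mathrm{vis}}(p),
\qquad
\tilde{w}_{\mathrm{ir}}(p) + \tilde{w}_{\mathrm{vis}}(p) = 1,
\]
\[
I_{\mathrm{fused}} = \sigma\bigl(\mathrm{Dec}(F)\bigr),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}} \in (0, 1).
\]
```

Since the two corrected weights sum to 1 at every pixel, F(p) is a convex combination of the two 128-dimensional feature vectors at that position.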
[0069] In one specific embodiment, to verify the fusion effect of the method of this application, tests were conducted on the publicly available TNO infrared and visible light image fusion dataset. DenseFuse, RFN-Nest, and U2Fusion were selected as comparison methods, and the evaluation metrics were information entropy (EN), average gradient (AG), spatial frequency (SF), and structural similarity (SSIM). In the experimental results, the EN of the method of this application was 7.31, which is higher than DenseFuse's 6.82, RFN-Nest's 6.94, and U2Fusion's 7.05; the AG was 5.12, which is higher than the comparison methods' 4.21, 4.53, and 4.67; the SF was 11.84, which is higher than the comparison methods' 9.43, 10.17, and 10.58; and the SSIM was 0.81, which is higher than the comparison methods' 0.71, 0.74, and 0.76. All four metrics are superior to the comparison methods.
[0070] Figure 4 is a grouped bar chart comparing the objective evaluation indicators of the above methods, which verifies the advantages of the method in this application in terms of information richness, texture clarity and structural fidelity.
[0071] Figure 5 is a simulation diagram of the spatial distribution of difference-aware weights. The grayscale depth in the diagram reflects the weight in the infrared direction. The white dashed ellipse marks the location of the thermal target, and the infrared weight in this area is approximately 0.72. The black dotted ellipse marks the location of the edge with rich texture, and the visible light weight in this area is approximately 0.72. The weights in both directions in the remaining flat background areas are close to 0.50. The above distribution pattern is consistent with the algorithm logic of the pixel-by-pixel adaptive generation of difference-aware weights in step S2.
[0072] The bimodal fusion method based on difference perception and cross-attention in the embodiments of this application has been described above. The bimodal fusion system based on difference perception and cross-attention in the embodiments of this application is described below. One embodiment of the bimodal fusion system based on difference perception and cross-attention in the embodiments of this application includes:
[0073] The detection module is used to detect and mark the occlusion areas in the visible light image and infrared image as occlusion masks, and to complete the visible light image and infrared image respectively according to the occlusion masks to obtain the completed visible light image and the completed infrared image.
[0074] The extraction module is used to extract a texture edge intensity map from the completed visible light image, extract a temperature change response map from the completed infrared image, and generate a difference-aware weight pair based on the pixel-by-pixel difference between the texture edge intensity map and the temperature change response map.
[0075] The input module is used to input the completed visible light image and the completed infrared image into the dual-branch encoder respectively to obtain a visible light feature map and an infrared feature map. A confidence map is constructed based on the cosine similarity between the visible light feature map and the infrared feature map. The difference perception weight pair is corrected according to the confidence map to obtain a corrected weight pair.
[0076] The weighting module is used to apply cross-attention calculation to the infrared feature map guided by the visible light feature map to obtain infrared enhancement features; apply cross-attention calculation to the visible light feature map guided by the infrared feature map to obtain visible light noise reduction features; and perform pixel-by-pixel weighting on the infrared enhancement features and the visible light noise reduction features according to the correction weight, and obtain a fused image after decoding and reconstruction.
[0077] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A dual-modal fusion method based on difference perception and cross attention, characterized in that, The method includes: Step S1: Detect and mark the occlusion areas in the visible light image and the infrared image as occlusion masks, and complete the visible light image and the infrared image according to the occlusion masks to obtain the completed visible light image and the completed infrared image. Step S2: Extract a texture edge intensity map from the completed visible light image, extract a temperature change response map from the completed infrared image, and generate a difference-aware weight pair based on the pixel-by-pixel difference between the texture edge intensity map and the temperature change response map; Step S3: Input the completed visible light image and the completed infrared image into the dual-branch encoder to obtain a visible light feature map and an infrared feature map. Construct a confidence map based on the cosine similarity between the visible light feature map and the infrared feature map. Correct the difference perception weight pair according to the confidence map to obtain a corrected weight pair. Step S4: Using the visible light feature map as a guide, apply cross-attention calculation to the infrared feature map to obtain infrared enhancement features; using the infrared feature map as a guide, apply cross-attention calculation to the visible light feature map to obtain visible light noise reduction features; weight the infrared enhancement features and the visible light noise reduction features pixel by pixel according to the correction weight, and obtain the fused image through decoding and reconstruction.
2. The dual-modal fusion method based on difference perception and cross attention according to claim 1, characterized in that, Step S1 includes: Edge detection is performed on the visible light image, and closed regions with gradient magnitudes lower than the gradient magnitude threshold and connected region areas greater than the area threshold are marked as visible light occlusion candidate regions. Calculate the mean gray level and standard deviation of all pixels in the infrared image, and mark the connected components with gray level values lower than the mean gray level minus twice the standard deviation of the gray level as candidate regions for infrared occlusion. The occlusion mask is obtained by taking the union of the visible light occlusion candidate region and the infrared occlusion candidate region; The occlusion boundary pixels are determined by the occlusion mask, and the gradient direction of the occlusion boundary pixels is propagated from the outside to the inside. The texture response value and temperature response value at the boundary are filled into the occlusion mask layer by layer. The filling weight decreases exponentially with the increase of the number of pixels from the occlusion boundary, so as to obtain the completed visible light image and the completed infrared image.
3. The dual-modal fusion method based on difference perception and cross attention according to claim 1, characterized in that, In step S2, extracting the texture edge intensity map from the completed visible light image includes: Apply horizontal and vertical convolution kernels to the completed visible light image to obtain a horizontal gradient response map and a vertical gradient response map. Based on the horizontal gradient response map and the vertical gradient response map, the square root of the sum of the squares of the horizontal gradient value and the vertical gradient value at each pixel position is calculated to obtain the texture edge intensity map. Extracting the temperature change response map from the completed infrared image includes: applying Gaussian smoothing to the completed infrared image to obtain a smoothed infrared image; applying a Laplacian operator to the smoothed infrared image to obtain a second derivative response map; and normalizing the second derivative response map after taking its absolute value to obtain the temperature change response map. The difference between the response value of the texture edge intensity map and the temperature change response map at each pixel position is calculated. When the response value of the texture edge intensity map at the corresponding position is higher than that of the temperature change response map, a higher weight is assigned to the visible light direction; when the response value of the temperature change response map at the corresponding position is higher than that of the texture edge intensity map, a higher weight is assigned to the infrared direction, and the sum of the weights in the two directions remains one at each pixel position, thus obtaining the difference perception weight pair.
4. The dual-modal fusion method based on difference perception and cross attention of claim 1, wherein, In step S3, the completed visible light image and the completed infrared image are respectively input into a dual-branch encoder to obtain the visible light feature map and the infrared feature map, including: The dual-branch encoder includes a symmetrical visible light coding branch and an infrared coding branch. Each branch consists of three dense residual blocks connected in series. Each dense residual block contains three convolutional layers with a kernel size of three by three. The input of each layer is the result of concatenating the channel dimensions of the feature maps output by all previous layers. The completed visible light image is input into the visible light coding branch, and extracted layer by layer by the three dense residual blocks to obtain the visible light feature map; the completed infrared image is input into the infrared coding branch, and extracted layer by layer by the three dense residual blocks to obtain the infrared feature map.
5. The bimodal fusion method based on difference perception and cross-attention according to claim 4, characterized in that, In step S3, constructing the confidence map based on the cosine similarity between the visible light feature map and the infrared feature map includes: At each spatial location, the feature vectors of the visible light feature map and the infrared feature map at the corresponding locations are extracted. The dot product of the two feature vectors is calculated and divided by the product of the two feature vectors' respective L2 norms to obtain the cosine similarity value at the corresponding location. The cosine similarity value of each spatial location in the entire map is linearly mapped from the range of negative one to positive one to the range of zero to one, thus obtaining the confidence map.
6. The bimodal fusion method based on difference perception and cross-attention according to claim 5, characterized in that, In step S3, the difference perception weight pair is corrected according to the confidence map to obtain the corrected weight pair, including: At each pixel location, the value of the confidence map at the corresponding location is multiplied by the visible light direction weight in the difference perception weight pair, and then added to the result of multiplying the difference between one and the value of the confidence map at the corresponding location by half to obtain the corrected visible light weight at the corresponding location. The infrared direction weights corresponding to the difference perception weight pairs are calculated in the same way to obtain the corrected infrared weights at the corresponding positions; the sum of the corrected visible light weights and the corrected infrared weights at each pixel position is kept to be one, forming the corrected weight pairs; When the mean of the entire confidence map is lower than the preset confidence threshold, a learnable spatial offset correction is applied to the infrared feature map. The mean of the entire confidence map is used as the supervision signal, and the offset is iteratively adjusted until the mean of the entire map is not lower than the preset confidence threshold. Then, the calculation of the correction weight pair is performed.
7. The bimodal fusion method based on difference perception and cross-attention according to claim 1, characterized in that, In step S4, cross-attention calculation is applied to the infrared feature map guided by the visible light feature map to obtain the infrared enhancement feature; The visible light feature map is subjected to cross-attention calculation guided by the infrared feature map to obtain the visible light noise reduction feature, including: The visible light feature map is linearly projected to obtain a query matrix, and the infrared feature map is linearly projected to obtain a key matrix and a value matrix, respectively. The dot product of the query matrix and the transpose of the key matrix is divided by the square root of the projection dimension, normalized, and then multiplied by the value matrix. The result is then linearly projected back to the original channel dimension to obtain the infrared enhancement feature. The query matrix is obtained by linearly projecting the infrared feature map, and the key matrix and value matrix are obtained by linearly projecting the visible light feature map. The dot product of the query matrix and the transpose of the key matrix is divided by the square root of the projection dimension, normalized, and then multiplied by the value matrix. The result is then restored to the original channel dimension by linear projection to obtain the visible light noise reduction feature. Based on the corrected visible light weight and corrected infrared weight at each pixel position in the correction weight pair, the visible light noise reduction feature and the infrared enhancement feature are weighted and summed pixel by pixel. The weighted summation result is input into the decoder and reconstructed layer by layer through three deconvolution layers to obtain the fused image.
8. A bimodal fusion system based on difference perception and cross-attention, characterized in that, For implementing the bimodal fusion method based on difference perception and cross-attention as described in any one of claims 1-7, the bimodal fusion system based on difference perception and cross-attention comprises: The detection module is used to detect and mark the occlusion areas in the visible light image and infrared image as occlusion masks, and to complete the visible light image and infrared image respectively according to the occlusion masks to obtain the completed visible light image and the completed infrared image. The extraction module is used to extract a texture edge intensity map from the completed visible light image, extract a temperature change response map from the completed infrared image, and generate a difference-aware weight pair based on the pixel-by-pixel difference between the texture edge intensity map and the temperature change response map. The input module is used to input the completed visible light image and the completed infrared image into the dual-branch encoder respectively to obtain a visible light feature map and an infrared feature map. A confidence map is constructed based on the cosine similarity between the visible light feature map and the infrared feature map. The difference perception weight pair is corrected according to the confidence map to obtain a corrected weight pair. The weighting module is used to apply cross-attention calculation to the infrared feature map guided by the visible light feature map to obtain infrared enhancement features; apply cross-attention calculation to the visible light feature map guided by the infrared feature map to obtain visible light noise reduction features; and perform pixel-by-pixel weighting on the infrared enhancement features and the visible light noise reduction features according to the correction weight, and obtain a fused image after decoding and reconstruction.
9. The system according to claim 8, characterized in that, The process involves detecting and marking occlusion regions in visible light and infrared images as occlusion masks, and then completing the visible light and infrared images based on these masks to obtain completed visible light and infrared images. Edge detection is performed on the visible light image, and closed regions with gradient magnitudes lower than the gradient magnitude threshold and connected region areas greater than the area threshold are marked as visible light occlusion candidate regions. Calculate the mean gray level and standard deviation of all pixels in the infrared image, and mark the connected components with gray level values lower than the mean gray level minus twice the standard deviation of the gray level as candidate regions for infrared occlusion. The occlusion mask is obtained by taking the union of the visible light occlusion candidate region and the infrared occlusion candidate region; The occlusion boundary pixels are determined by the occlusion mask, and the gradient direction of the occlusion boundary pixels is propagated from the outside to the inside. The texture response value and temperature response value at the boundary are filled into the occlusion mask layer by layer. The filling weight decreases exponentially with the increase of the number of pixels from the occlusion boundary, so as to obtain the completed visible light image and the completed infrared image.
10. The system according to claim 9, characterized in that, Extracting the texture edge intensity map from the completed visible light image includes: Apply horizontal and vertical convolution kernels to the completed visible light image to obtain a horizontal gradient response map and a vertical gradient response map. Based on the horizontal gradient response map and the vertical gradient response map, the square root of the sum of the squares of the horizontal gradient value and the vertical gradient value at each pixel position is calculated to obtain the texture edge intensity map. Extracting the temperature change response map from the completed infrared image includes: applying Gaussian smoothing to the completed infrared image to obtain a smoothed infrared image; applying a Laplacian operator to the smoothed infrared image to obtain a second derivative response map; and normalizing the second derivative response map after taking its absolute value to obtain the temperature change response map. The difference between the response value of the texture edge intensity map and the temperature change response map at each pixel position is calculated. When the response value of the texture edge intensity map at the corresponding position is higher than that of the temperature change response map, a higher weight is assigned to the visible light direction; when the response value of the temperature change response map at the corresponding position is higher than that of the texture edge intensity map, a higher weight is assigned to the infrared direction, and the sum of the weights in the two directions remains one at each pixel position, thus obtaining the difference perception weight pair.