Image processing method based on infrared and visible light fusion for dark scenes

By constructing a low-light enhancement and infrared restoration network based on the Restormer model, and combining multimodal feature fusion and loss function optimization, the problem of poor fusion effect of infrared and visible light images in dark scenes is solved, and high-quality fused images are generated.

CN118887101B · Active · Publication Date: 2026-06-23 · CENT SOUTH UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CENT SOUTH UNIV
Filing Date
2024-07-11
Publication Date
2026-06-23

Abstract

The application discloses an image processing method based on infrared and visible light fusion for dark scenes, comprising the following steps: obtaining existing image data to obtain a low-light dataset and a dual-light fusion dataset; performing noise reduction processing on the low-light dataset to obtain a low-light training dataset; constructing an initial dark scene image processing model based on infrared and visible light fusion; training the initial model using the dual-light fusion dataset and the low-light training dataset to obtain a dark scene image processing model; and performing actual image processing using the dark scene image processing model. The method can generate fused images with high contrast and color fidelity under various degradation conditions. The loss function used in model training combines a maximum selection strategy with a visual fidelity strategy, so that the fusion result has good visual quality.
Description

Technical Field

[0001] This invention belongs to the field of image processing, and specifically relates to an image processing method for dark scenes based on the fusion of infrared and visible light.

Background Technology

[0002] Due to inherent technological limitations of different devices and the complexity of real-world scenarios, images captured by a single sensor often lack the ability to comprehensively represent a scene. Therefore, in order to obtain images with high-quality visual effects even in extreme environments, researchers fuse original images from different modalities to obtain fused images containing more scene information. Infrared and visible light fusion is the most widely used image fusion method.

[0003] Infrared images can preserve the thermal radiation information of a scene in extreme environments, but they lack texture details and have low spatial resolution. Conversely, visible light images contain rich texture information and conform to human visual perception, but their image quality is greatly affected by the environment. Combining the high-contrast pixel information of infrared images with the texture information of visible light can improve the visual effect of the image and enhance the scene representation capability.

[0004] Infrared and visible light fusion techniques can be divided into two categories: traditional methods and deep learning-based methods. Traditional methods map source images to the transform domain using manually designed mathematical models for fusion, and finally use inverse transform to obtain the final fused result. Representative traditional methods include multi-scale transform-based methods, sparse representation-based methods, subspace-based methods, saliency-based methods, and hybrid methods. Although the development of traditional methods has become quite comprehensive, the following problems still exist: manually designed models cannot meet the needs of complex scenes; and the high time complexity of fusion algorithms cannot achieve real-time fusion.

[0005] In recent years, deep learning-based infrared and visible light fusion methods have become mainstream. By learning the latent connections between images, they achieve efficient feature extraction and yield better visual results than traditional methods. Deep learning-based methods in this field include convolutional neural network-based methods, autoencoder-based methods, and generative adversarial network-based methods. Although existing deep learning algorithms have surpassed traditional methods in image fusion, their fusion quality is often poor when the scene is dark. Insufficient illumination and significant sensor noise in low-light conditions damage the texture of visible light images, so algorithms can extract information only from the degraded scene, resulting in texture blurring and color distortion.

Summary of the Invention

[0006] The purpose of this invention is to provide an image processing method based on the fusion of infrared and visible light for dark scenes.

[0007] This invention provides an image processing method for dark scenes based on infrared and visible light fusion, comprising the following steps:

[0008] S1. Obtain existing image data to obtain a low-light dataset and a dual-light fusion dataset;

[0009] S2. Perform noise reduction processing on the low-light dataset obtained in step S1 to obtain the low-light training dataset;

[0010] S3. Based on the fusion of infrared and visible light, construct an initial model for image processing in dark scenes;

[0011] S4. Using the dual-light fusion dataset obtained in step S1 and the low-light training dataset obtained in step S2, train the initial dark scene image processing model obtained in step S3 to obtain the dark scene image processing model.

[0012] S5. Use the dark scene image processing model obtained in step S4 to perform actual image processing.

[0013] The noise reduction process described in step S2 is expressed using the following formula:

[0014] $\tilde{V} = V + n$

[0015] Where $V$ is the low-light dataset; $n$ is additive white Gaussian noise; and $\tilde{V}$ is the low-light training dataset.

[0016] The initial model for dark scene image processing described in step S3 includes a first-stage model and a second-stage model;

[0017] The initial model for dark scene image processing enhances the low-light performance and infrared contrast of the input image data through a first-stage model. The results are then input into a second-stage model for multimodal feature fusion and reconstruction, ultimately yielding the output of the initial model for dark scene image processing.

[0018] The first-stage model includes a low-light enhancement network submodule and an infrared restoration network submodule;

[0019] The low-light enhancement network submodule and the infrared restoration network submodule are constructed based on the Restormer model and have the same structure. Each submodule includes an encoder part and a decoder part. The encoder part includes a first convolutional layer module, a first Restormer module, a first downsampling module, a second Restormer module, a second downsampling module, a third Restormer module, and a third downsampling module, all connected in series. The output of the third downsampling module serves as the input to the decoder part. The decoder part includes a fourth Restormer module, a first upsampling module, a fifth Restormer module, a second upsampling module, a sixth Restormer module, a third upsampling module, a seventh Restormer module, and a second convolutional layer module, all connected in series. The fifth Restormer module takes the outputs of the first upsampling module and the third Restormer module as inputs; the sixth Restormer module takes the outputs of the second upsampling module and the second Restormer module as inputs; and the seventh Restormer module takes the outputs of the third upsampling module and the first Restormer module as inputs. The outputs of the first, second, and third Restormer modules and of the third downsampling module in the encoder part of each submodule are extracted to obtain the infrared multi-scale features and the low-light multi-scale features. The output of the decoder part of the low-light enhancement network submodule is used as the output of that submodule; likewise, the output of the decoder part of the infrared restoration network submodule is used as the output of that submodule.

[0020] The first downsampling module to the third downsampling module have the same structure, including convolutional layers, instance normalization and GRLU activation function connected in series; the first upsampling module to the third upsampling module have the same structure, including convolutional layers, instance normalization and GRLU activation function connected in series.

[0021] The first convolutional layer module has the same structure as the second convolutional layer module, including convolutional blocks, instance normalization and GRLU activation function in sequence; the convolutional block is composed of convolutional kernels, which move on the input image by sliding window to calculate the dot product at each position, thereby generating a feature map;

[0022] The first through seventh Restormer modules have the same structure;

[0023] The input to the low-light enhancement network submodule is low-light image data; the input to the infrared restoration network submodule is infrared image data.

[0024] The second-stage model includes an image fusion module and a fused image decoding module;

[0025] The image fusion module includes frozen infrared and visible light encoders and a channel attention spatial embedding module; the image fusion module takes the low-light multi-scale features and the infrared multi-scale features as input;

[0026] The frozen infrared and visible light encoders are identical to the encoder parts of the trained low-light enhancement network submodule and infrared restoration network submodule in the first-stage model; the frozen encoders perform feature extraction on the input low-light multi-scale features and infrared multi-scale features to obtain deep features. The frozen infrared and visible light encoders only participate in the model training process;

[0027] The channel attention spatial embedding module includes an average pooling branch, a max pooling branch, and a spatial embedding submodule. The input low-light multi-scale features and infrared multi-scale features are concatenated along the channel dimension to obtain extended features. The extended features are processed by the max pooling branch and the average pooling branch respectively, the two results are summed pixel by pixel, and the sum is multiplied pixel by pixel with the extended features to obtain the channel attention features.

[0028] The spatial embedding submodule processes the input low-light multi-scale features and infrared multi-scale features to obtain the spatial embedding features. Spatial embedding is achieved by calculating the difference matrix between the features, expressed by the following formula:

[0029]

[0030] Where $i$ indexes the scales of the features, $i = 1, 2, 3, 4$; $\odot$ denotes pixel-by-pixel multiplication; $A_i$ is the difference matrix, expressed by the following formula:

[0031] $A_i = \tanh(M_i) \odot \mathrm{sigmoid}(C_i)$

[0032] Where $\tanh(\cdot)$ is the tanh activation function; $\mathrm{sigmoid}(\cdot)$ is the sigmoid activation function; $\odot$ denotes pixel-by-pixel multiplication; $M_i$ is the Manhattan distance matrix and $C_i$ is the cosine similarity matrix, calculated using the following formulas:

[0033]

[0034] Where the two feature vectors are the infrared feature vector and the low-light feature vector at position $(x, y)$, whose $c$-th components are the values at position $(x, y)$ in the $c$-th channel of the infrared multi-scale features and of the low-light multi-scale features, respectively; $M_i(x, y)$ is the Manhattan distance between the two feature vectors; $C_i(x, y)$ is the cosine similarity between them; $N$ is the number of feature channels; $\|\cdot\|_1$ is the L1 norm (sum of absolute values);

[0035] Finally, the channel attention spatial embedding module combines the channel attention features with the spatial embedding features. By combining infrared and visible light features at different spatial distances, the network can better aggregate complementary information from different modalities to obtain the fused features, expressed by the following formula:

[0036]

[0037] Where i represents different scales of the feature vector and i = 1, 2, 3, 4;

[0038] The fused image decoding module has the same structure as the decoder part of the low-light enhancement network submodule or the infrared restoration network submodule in the first-stage model; its input is the fused features, from which the decoder reconstructs a fused image with normal illumination that is used as the output of the model.

[0039] The training described in step S4 includes pre-training the first-stage model, fine-tuning the pre-trained low-light enhancement network sub-module, and training the second-stage model. The initial model for dark scene image processing first completes the pre-training of the first-stage model, then fine-tunes the pre-trained low-light enhancement network sub-module, and finally completes the training of the second-stage model to obtain the dark scene image processing model.

[0040] The pre-training of the first-stage model includes pre-training of the low-light enhancement network submodule and pre-training of the infrared restoration network submodule.

[0041] The low-light enhancement network submodule was pre-trained in a supervised manner, using a normal illumination image as the ground truth. The loss function was expressed by the following formula:

[0042]

[0043] Where λ1, λ2 and λ3 are weighting coefficients. The intensity loss guides the network to learn the intensity distribution of the reference image and is expressed by the following formula:

[0044]

[0045] Where the two quantities compared are the output of the low-light enhancement network submodule and $V_{normal}$, the corresponding normal-illumination image;

[0046] The pre-training of the low-light enhancement network submodule is also constrained at the structural level, expressed by the following formula:

[0047]

[0048] Wherein, SSIM(·) is the structural similarity loss, calculated using the following formula:

[0049]

[0050] Where $(x, y)$ is the image pair being compared; $\mu_k$, $k \in \{x, y\}$, is the mean of image $k$; $\sigma_k$, $k \in \{x, y\}$, is the variance of image $k$; $\sigma_{xy}$ is the covariance between the two images; $c_1$ and $c_2$ are constants.

[0051] The gradient smoothing of the output image is constrained to suppress noise, and is expressed by the following formula:

[0052]

[0053] Where $Y$ is the output of the low-light enhancement network submodule; $\nabla_x$ and $\nabla_y$ are the first-order gradient operators; $H$ is the feature height; $W$ is the feature width; $c$ is the color channel, $c \in \{R, G, B\}$, where R is the red channel, G is the green channel, and B is the blue channel;

[0054] The infrared restoration network submodule was pre-trained in a supervised manner, using normal illumination images as the ground truth. The loss function was expressed by the following formula:

[0055]

[0056] Where λ1, λ2 and λ3 are weighting coefficients. The intensity loss guides the network to learn the intensity distribution of the reference image and is expressed by the following formula:

[0057]

[0058] Where the two quantities compared are the output of the infrared restoration network submodule and $I_{stretch}$, the contrast-stretched infrared image, obtained by the following formula:

[0059] $I_{stretch} = \alpha I + (1 - \alpha) I_{mean}$

[0060] Where $I_{mean}$ is the mean value of the infrared image; $\alpha$ is the intensity adjustment parameter; and $I$ is the input to the infrared restoration network submodule.

[0061] The pre-training of the infrared restoration network submodule is also constrained at the structural level, expressed by the following formula:

[0062]

[0063] The gradient smoothing of the output image is constrained to suppress noise, and is expressed by the following formula:

[0064]

[0065] Where $Y$ is the output of the infrared restoration network submodule; $\nabla_x$ and $\nabla_y$ are the first-order gradient operators;

[0066] The pre-trained low-light enhancement network submodule is fine-tuned, and the loss function is expressed by the following formula:

[0067]

[0068] Where λ4 and λ5 are weighting coefficients, and the two terms are the fine-tuning intensity loss and the structural loss, expressed by the following formula:

[0069]

[0070] Where the two quantities compared are the output of the low-light enhancement network submodule and the adjusted pseudo ground truth;

[0071] The second-stage model is trained, and its total loss function $l_{fuse}$ is as follows:

[0072]

[0073] Where λ6, λ7, λ8 and λ9 are weighting coefficients, and the four terms are the second-stage intensity loss, the visual fidelity loss, the gradient loss, and the color loss, expressed by the following formulas:

[0074]

[0075] Where the quantities involved are the output of the low-light enhancement network submodule; $P$, the fused image output by the model; $Cb_p$ and $Cr_p$, the reference Cb and Cr channels in the YCbCr color space; the Cb and Cr channels of the network output in the YCbCr color space; $\max(\cdot)$, the maximization operation; $\nabla P$, the gradient of the output image; $\nabla I$, the gradient of the infrared image; $\nabla V$, the gradient of the low-light image; and the gradient of the output of the low-light enhancement network submodule;

[0076] The intensity loss and the visual fidelity loss ensure that the fusion result contains information from both the visible and infrared modalities; the gradient loss ensures that the network retains the details and textures of the source images; the color loss constrains the model to learn color fidelity. By constraining the second-stage model with these losses, complementary information from the infrared and visible light modalities is effectively fused, making the fusion result visually consistent with human perception.

[0077] After training, a dark scene image processing model is obtained.

[0078] This invention discloses an image processing method based on infrared and visible light fusion for dark scenes, capable of producing fused images with high contrast and color fidelity under various degradation conditions. The method incorporates maximum selection and visual fidelity strategies into the loss function during model training, resulting in fusion results with excellent visual quality.

Description of the Drawings

[0079] Figure 1 is a schematic flowchart of the method of the present invention;

[0080] Figure 2 compares the results of the method of the present invention with those of existing direct fusion methods on LLVIP scenes.

[0081] Figure 3 compares the results of the method of the present invention with those of existing direct fusion methods on MSRS scenes.

[0082] Figure 4 compares the results of the method of the present invention with those of existing enhanced fusion methods on LLVIP scenes.

[0083] Figure 5 compares the results of the method of the present invention with those of existing enhanced fusion methods on MSRS scenes.

Detailed Implementation

[0084] This invention provides an image processing method for dark scenes based on the fusion of infrared and visible light, the flowchart of which is shown in Figure 1. The method includes the following steps:

[0085] S1. Obtain existing image data to obtain a low-light dataset and a dual-light fusion dataset;

[0086] S2. Perform noise reduction processing on the low-light dataset obtained in step S1 to obtain the low-light training dataset;

[0087] The noise reduction process described above is expressed using the following formula:

[0088] $\tilde{V} = V + n$

[0089] Where $V$ is the low-light dataset; $n$ is additive white Gaussian noise; and $\tilde{V}$ is the low-light training dataset.
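Step S2 can be illustrated with a minimal sketch that adds zero-mean Gaussian noise to a low-light image. The function name `add_awgn`, the noise standard deviation `sigma`, the `[0, 1]` value range, and the clipping are assumptions made for illustration; the text only states that n is additive white Gaussian noise.

```python
import random


def add_awgn(image, sigma, seed=None):
    """Return a noisy copy of `image` (rows of floats in [0, 1]).

    Each pixel receives an independent zero-mean Gaussian sample with
    standard deviation `sigma`; results are clipped back to [0, 1].
    Both `sigma` and the clipping are assumptions of this sketch.
    """
    rng = random.Random(seed)
    noisy = []
    for row in image:
        noisy.append([min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row])
    return noisy


# A tiny 2x3 "low-light image" with small intensities.
low_light = [[0.05, 0.10, 0.08], [0.02, 0.04, 0.12]]
training_sample = add_awgn(low_light, sigma=0.02, seed=0)
```

Training on noise-augmented images is a common way to make an enhancement network less sensitive to sensor noise; the noise level actually used is not given in the text.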

[0090] S3. Based on the fusion of infrared and visible light, construct an initial model for image processing in dark scenes;

[0091] The initial model for dark scene image processing includes a first-stage model and a second-stage model;

[0092] The initial model for dark scene image processing enhances the low-light performance and infrared contrast of the input image data through a first-stage model. The results are then input into a second-stage model for multimodal feature fusion and reconstruction, ultimately yielding the output of the initial model for dark scene image processing.

[0093] The first-stage model includes a low-light enhancement network submodule and an infrared restoration network submodule;

[0094] The low-light enhancement network submodule and the infrared restoration network submodule are constructed based on the Restormer model and have the same structure. Each submodule includes an encoder part and a decoder part. The encoder part includes a first convolutional layer module, a first Restormer module, a first downsampling module, a second Restormer module, a second downsampling module, a third Restormer module, and a third downsampling module, all connected in series. The output of the third downsampling module serves as the input to the decoder part. The decoder part includes a fourth Restormer module, a first upsampling module, a fifth Restormer module, a second upsampling module, a sixth Restormer module, a third upsampling module, a seventh Restormer module, and a second convolutional layer module, all connected in series. The fifth Restormer module takes the outputs of the first upsampling module and the third Restormer module as inputs; the sixth Restormer module takes the outputs of the second upsampling module and the second Restormer module as inputs; and the seventh Restormer module takes the outputs of the third upsampling module and the first Restormer module as inputs. The outputs of the first, second, and third Restormer modules and of the third downsampling module in the encoder part of each submodule are extracted to obtain the infrared multi-scale features and the low-light multi-scale features. The output of the decoder part of the low-light enhancement network submodule is used as the output of that submodule; likewise, the output of the decoder part of the infrared restoration network submodule is used as the output of that submodule.
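The module wiring of one first-stage submodule can be written down as a small graph, which makes the series connections and skip connections easier to check. This is a sketch only: the module names are hypothetical, the skip sources of the fifth and sixth Restormer modules are assumed by symmetry with the seventh Restormer module (which is stated to receive the first Restormer module's output), and the four feature taps are assumed to be the three encoder Restormer outputs plus the third downsampling output.

```python
# Wiring of one submodule: module name -> modules whose outputs it consumes.
# "input" is the image fed to the submodule.
WIRING = {
    "conv1": ["input"],
    "restormer1": ["conv1"],
    "down1": ["restormer1"],
    "restormer2": ["down1"],
    "down2": ["restormer2"],
    "restormer3": ["down2"],
    "down3": ["restormer3"],
    "restormer4": ["down3"],
    "up1": ["restormer4"],
    "restormer5": ["up1", "restormer3"],  # skip source assumed by symmetry
    "up2": ["restormer5"],
    "restormer6": ["up2", "restormer2"],  # skip source assumed by symmetry
    "up3": ["restormer6"],
    "restormer7": ["up3", "restormer1"],  # skip source stated in the text
    "conv2": ["restormer7"],
}

# Encoder outputs assumed to form the four-scale feature set (i = 1..4).
FEATURE_TAPS = ["restormer1", "restormer2", "restormer3", "down3"]


def is_feed_forward(wiring):
    """Check that every module only consumes outputs produced before it."""
    seen = {"input"}
    for name, sources in wiring.items():
        if any(src not in seen for src in sources):
            return False
        seen.add(name)
    return True
```

Under this reading each skip connection joins an encoder output and a decoder output that sit at the same depth of the U-shaped structure.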

[0095] The first downsampling module to the third downsampling module have the same structure, including convolutional layers, instance normalization and GRLU activation function connected in series; the first upsampling module to the third upsampling module have the same structure, including convolutional layers, instance normalization and GRLU activation function connected in series.

[0096] The first convolutional layer module has the same structure as the second convolutional layer module, including convolutional blocks, instance normalization and GRLU activation function in sequence; the convolutional block is composed of convolutional kernels, which move on the input image by sliding window to calculate the dot product at each position, thereby generating a feature map;

[0097] The first through seventh Restormer modules have the same structure; the Restormer module is existing technology.

[0098] The input to the low-light enhancement network submodule is low-light image data; the input to the infrared restoration network submodule is infrared image data.

[0099] The second-stage model includes an image fusion module and a fused image decoding module;

[0100] The image fusion module includes frozen infrared and visible light encoders and a channel attention spatial embedding module; the image fusion module takes the low-light multi-scale features and the infrared multi-scale features as input;

[0101] The frozen infrared and visible light encoders are identical to the encoder parts of the trained low-light enhancement network submodule and infrared restoration network submodule in the first-stage model; the frozen encoders perform feature extraction on the input low-light multi-scale features and infrared multi-scale features to obtain deep features.

[0102] The channel attention spatial embedding module includes an average pooling branch, a max pooling branch, and a spatial embedding submodule. The input low-light multi-scale features and infrared multi-scale features are concatenated along the channel dimension to obtain extended features. The extended features are processed by the max pooling branch and the average pooling branch respectively, the two results are summed pixel by pixel, and the sum is multiplied pixel by pixel with the extended features to obtain the channel attention features.
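The channel attention branch described above can be sketched as follows. The sketch assumes that both pooling branches are global spatial poolings that produce one value per channel, and that no learned layers or extra activation sit between the pooling and the multiplication; neither detail is stated in the text. Feature maps are plain nested lists indexed as `[channel][row][column]`.

```python
def channel_attention(low_light_feat, infrared_feat):
    """Concatenate two feature maps along channels and reweight each channel.

    Per channel: weight = global max + global mean (assumed global pooling),
    then every value in that channel is multiplied by the weight.
    """
    extended = low_light_feat + infrared_feat  # channel concatenation
    attended = []
    for channel in extended:
        values = [v for row in channel for v in row]
        weight = max(values) + sum(values) / len(values)
        attended.append([[weight * v for v in row] for row in channel])
    return attended


low = [[[1.0, 2.0], [3.0, 4.0]]]   # one low-light channel, 2x2
ir = [[[0.0, 0.0], [0.0, 2.0]]]    # one infrared channel, 2x2
attn = channel_attention(low, ir)
```

In this toy case the low-light channel has max 4 and mean 2.5, so it is scaled by 6.5, while the infrared channel has max 2 and mean 0.5 and is scaled by 2.5.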

[0103] The spatial embedding submodule processes the input low-light multi-scale features and infrared multi-scale features to obtain the spatial embedding features. Spatial embedding is achieved by calculating the difference matrix between the features, expressed by the following formula:

[0104]

[0105] Where $i$ indexes the scales of the features, $i = 1, 2, 3, 4$; $\odot$ denotes pixel-by-pixel multiplication; $A_i$ is the difference matrix, expressed by the following formula:

[0106] $A_i = \tanh(M_i) \odot \mathrm{sigmoid}(C_i)$

[0107] Where $\tanh(\cdot)$ is the tanh activation function; $\mathrm{sigmoid}(\cdot)$ is the sigmoid activation function; $\odot$ denotes pixel-by-pixel multiplication; $M_i$ is the Manhattan distance matrix and $C_i$ is the cosine similarity matrix, calculated using the following formulas:

[0108]

[0109] Where the two feature vectors are the infrared feature vector and the low-light feature vector at position $(x, y)$, whose $c$-th components are the values at position $(x, y)$ in the $c$-th channel of the infrared multi-scale features and of the low-light multi-scale features, respectively; $M_i(x, y)$ is the Manhattan distance between the two feature vectors; $C_i(x, y)$ is the cosine similarity between them; $N$ is the number of feature channels; $\|\cdot\|_1$ is the L1 norm (sum of absolute values);
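The difference matrix A = tanh(M) ⊙ sigmoid(C) can be computed per pixel from the two channel vectors at that position. The sketch below uses the textbook definitions of Manhattan distance (sum of absolute differences) and cosine similarity; whether the original formula additionally normalizes by the channel count N is not recoverable from the text, so no normalization is applied here. The small `eps` guarding against division by zero is also an assumption.

```python
import math


def difference_matrix(infrared_feat, low_light_feat, eps=1e-8):
    """Return A with A[y][x] = tanh(M[y][x]) * sigmoid(C[y][x]).

    Inputs are feature maps indexed as [channel][row][column].
    M is the per-pixel Manhattan distance across channels and
    C is the per-pixel cosine similarity across channels.
    """
    channels = len(infrared_feat)
    height, width = len(infrared_feat[0]), len(infrared_feat[0][0])
    result = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ir = [infrared_feat[c][y][x] for c in range(channels)]
            ll = [low_light_feat[c][y][x] for c in range(channels)]
            manhattan = sum(abs(a - b) for a, b in zip(ir, ll))
            dot = sum(a * b for a, b in zip(ir, ll))
            norms = math.sqrt(sum(a * a for a in ir)) * math.sqrt(sum(b * b for b in ll))
            cosine = dot / (norms + eps)
            result[y][x] = math.tanh(manhattan) * (1.0 / (1.0 + math.exp(-cosine)))
    return result


# Two channels, one pixel: orthogonal vectors (1, 0) and (0, 1).
a_orth = difference_matrix([[[1.0]], [[0.0]]], [[[0.0]], [[1.0]]])
```

Where the two modalities agree exactly, M is zero and so is A; where they differ strongly, tanh(M) approaches 1 and the sigmoid of the cosine similarity sets the final weight.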

[0110] Finally, the channel attention spatial embedding module combines the channel attention features with the spatial embedding features. By combining infrared and visible light features at different spatial distances, the network can better aggregate complementary information from different modalities to obtain the fused features, expressed by the following formula:

[0111]

[0112] Where i represents different scales of the feature vector and i = 1, 2, 3, 4;

[0113] The fused image decoding module has the same structure as the decoder part of the low-light enhancement network submodule or the infrared restoration network submodule in the first-stage model; its input is the fused features, from which the decoder reconstructs a fused image with normal illumination that is used as the output of the model.

[0114] S4. Using the dual-light fusion dataset obtained in step S1 and the low-light training dataset obtained in step S2, train the initial model for dark scene image processing obtained in step S3 to obtain the dark scene image processing model. The initial model for dark scene image processing first completes the pre-training of the first-stage model, then fine-tunes the pre-trained low-light enhancement network sub-module, and finally completes the training of the second-stage model to obtain the dark scene image processing model.

[0115] The training includes pre-training the first-stage model, fine-tuning the pre-trained low-light enhancement network sub-module, and training the second-stage model.

[0116] The pre-training of the first-stage model includes pre-training of the low-light enhancement network submodule and pre-training of the infrared restoration network submodule.
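The three-part training order can be summarized as a small schedule. All identifiers below are hypothetical labels for the modules named in the text, and the assumption that the infrared restoration submodule stays fixed during fine-tuning is this sketch's reading of "fine-tunes the pre-trained low-light enhancement network sub-module", not something the text states.

```python
# Ordered training schedule; names are illustrative labels only.
TRAINING_STAGES = [
    {
        "stage": "pretrain_first_stage",
        "trainable": ["low_light_enhancement", "infrared_restoration"],
        "frozen": [],
    },
    {
        "stage": "finetune_low_light_enhancement",
        "trainable": ["low_light_enhancement"],
        "frozen": ["infrared_restoration"],  # assumed, not stated
    },
    {
        "stage": "train_second_stage",
        "trainable": ["channel_attention_spatial_embedding", "fused_image_decoder"],
        "frozen": ["infrared_encoder", "visible_encoder"],
    },
]


def run_schedule(stages, train_fn):
    """Call train_fn(stage) for each stage in order and collect the results."""
    return [train_fn(stage) for stage in stages]


# Example: a stand-in training function that just reports the stage name.
order = run_schedule(TRAINING_STAGES, lambda stage: stage["stage"])
```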

[0117] The low-light enhancement network submodule was pre-trained in a supervised manner, using a normal illumination image as the ground truth. The loss function was expressed by the following formula:

[0118]

[0119] Where λ1, λ2 and λ3 are weighting coefficients. The intensity loss guides the network to learn the intensity distribution of the reference image and is expressed by the following formula:

[0120]

[0121] Where the two quantities compared are the output of the low-light enhancement network submodule and $V_{normal}$, the corresponding normal-illumination image;

[0122] The pre-training of the low-light enhancement network submodule is also constrained at the structural level, expressed by the following formula:

[0123]

[0124] Wherein, SSIM(·) is the structural similarity loss, calculated using the following formula:

[0125]

[0126] Where $(x, y)$ is the image pair being compared; $\mu_k$, $k \in \{x, y\}$, is the mean of image $k$; $\sigma_k$, $k \in \{x, y\}$, is the variance of image $k$; $\sigma_{xy}$ is the covariance between the two images; $c_1$ and $c_2$ are constants.
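For orientation, the widely used form of the structural similarity index can be sketched from exactly these quantities (means, variances, covariance, and two constants). The sketch computes a single global SSIM value over the whole image rather than over sliding windows, and uses the conventional constants c1 = 0.01² and c2 = 0.03² for images scaled to [0, 1]; both choices are assumptions, since the text does not give the window scheme or the constant values.

```python
def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Global SSIM between two equally sized images (rows of floats in [0, 1])."""
    xs = [v for row in x for v in row]
    ys = [v for row in y for v in row]
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    var_x = sum((v - mu_x) ** 2 for v in xs) / n
    var_y = sum((v - mu_y) ** 2 for v in ys) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(xs, ys)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


img = [[0.1, 0.2], [0.3, 0.4]]
flipped = [[0.4, 0.3], [0.2, 0.1]]
```

An SSIM of 1 means the two images are structurally identical, so a structural loss is commonly taken as 1 − SSIM; that convention is likewise an assumption here.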

[0127] The gradient smoothing of the output image is constrained to suppress noise, and is expressed by the following formula:

[0128]

[0129] Where $Y$ is the output of the low-light enhancement network submodule; $\nabla_x$ and $\nabla_y$ are the first-order gradient operators; $H$ is the feature height; $W$ is the feature width; $c$ is the color channel, $c \in \{R, G, B\}$, where R is the red channel, G is the green channel, and B is the blue channel;
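A gradient smoothness term of this kind can be sketched with forward differences along x and y, averaged over all channels and positions. The use of absolute values and of a plain average as the normalization are assumptions; the text only names the first-order gradient operators and the H, W, and channel indices.

```python
def smoothness_loss(image):
    """Mean absolute first-order difference of an image indexed [channel][row][column]."""
    total = 0.0
    count = 0
    for channel in image:
        height, width = len(channel), len(channel[0])
        for y in range(height):
            for x in range(width):
                if x + 1 < width:   # horizontal forward difference
                    total += abs(channel[y][x + 1] - channel[y][x])
                    count += 1
                if y + 1 < height:  # vertical forward difference
                    total += abs(channel[y + 1][x] - channel[y][x])
                    count += 1
    return total / count if count else 0.0


flat = [[[0.5, 0.5], [0.5, 0.5]]]
edge = [[[0.0, 1.0], [0.0, 1.0]]]
```

A constant image gives zero loss, while isolated noisy pixels raise it, which is why penalizing this term suppresses noise in the enhanced output.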

[0130] The infrared restoration network submodule was pre-trained in a supervised manner, using normal illumination images as the ground truth. The loss function was expressed by the following formula:

[0131]

[0132] Where λ1, λ2 and λ3 are weighting coefficients. The intensity loss guides the network to learn the intensity distribution of the reference image and is expressed by the following formula:

[0133]

[0134] Where the two quantities compared are the output of the infrared restoration network submodule and $I_{stretch}$, the contrast-stretched infrared image, obtained by the following formula:

[0135] $I_{stretch} = \alpha I + (1 - \alpha) I_{mean}$

[0136] Where $I_{mean}$ is the mean value of the infrared image; $\alpha$ is the intensity adjustment parameter; and $I$ is the input to the infrared restoration network submodule.
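The contrast stretch is a linear blend of each pixel with the image mean, and can be rewritten as I_stretch − I_mean = α (I − I_mean): every pixel's deviation from the mean is scaled by α. A minimal sketch follows; the function name and the sample value of α are illustrative, and no clipping is applied because the text does not mention any.

```python
def stretch_contrast(infrared, alpha):
    """Apply I_stretch = alpha * I + (1 - alpha) * I_mean to rows of floats."""
    pixels = [v for row in infrared for v in row]
    mean = sum(pixels) / len(pixels)
    return [[alpha * v + (1 - alpha) * mean for v in row] for row in infrared]


ir_image = [[0.25, 0.75]]
stretched = stretch_contrast(ir_image, alpha=2.0)
```

With α greater than 1 the deviations grow and contrast increases; with α equal to 1 the image is unchanged; the mean intensity is preserved in every case.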

[0137] The pre-training of the infrared restoration network submodule is also constrained at the structural level, expressed by the following formula:

[0138]

[0139] The gradient smoothing of the output image is constrained to suppress noise, and is expressed by the following formula:

[0140]

[0141] Where $Y$ is the output of the infrared restoration network submodule; $\nabla_x$ and $\nabla_y$ are the first-order gradient operators;

[0142] The pre-trained low-light enhancement network submodule is fine-tuned, and the loss function is expressed by the following formula:

[0143]

[0144] Where λ4 and λ5 are weighting coefficients, and the two terms are the fine-tuning intensity loss and the structural loss, expressed by the following formula:

[0145]

[0146] Where the two quantities compared are the output of the low-light enhancement network submodule and the adjusted pseudo ground truth;

[0147] The second-stage model is trained, and its total loss function $l_{fuse}$ is as follows:

[0148]

[0149] Where λ6, λ7, λ8 and λ9 are weighting coefficients, and the four terms are the second-stage intensity loss, the visual fidelity loss, the gradient loss, and the color loss, expressed by the following formulas:

[0150]

[0151]

[0152] Where the quantities involved are the output of the low-light enhancement network submodule; $P$, the fused image output by the model; $Cb_p$ and $Cr_p$, the reference Cb and Cr channels in the YCbCr color space; the Cb and Cr channels of the network output in the YCbCr color space; $\max(\cdot)$, the maximization operation; $\nabla P$, the gradient of the output image; $\nabla I$, the gradient of the infrared image; $\nabla V$, the gradient of the low-light image; and the gradient of the output of the low-light enhancement network submodule;
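The role of max(·) can be illustrated with a maximum selection target, a pattern common in infrared and visible fusion losses: at each pixel the fused image is pulled toward the larger of the two source values. The sketch below is one plausible reading, not the patent's formula, which is not reproduced in this text. It assumes an L1 distance and assumes the two sources are the infrared image and the enhanced low-light output; the same function could be applied to gradient magnitudes instead of intensities.

```python
def max_selection_l1(fused, infrared, enhanced):
    """Mean absolute difference between `fused` and the pixel-wise max of two sources."""
    total = 0.0
    count = 0
    for f_row, i_row, e_row in zip(fused, infrared, enhanced):
        for f, i, e in zip(f_row, i_row, e_row):
            total += abs(f - max(i, e))  # target is the stronger source response
            count += 1
    return total / count


infrared = [[1.0, 0.0]]
enhanced = [[0.0, 0.5]]
ideal = [[1.0, 0.5]]  # equals the pixel-wise maximum of the two sources
```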

[0153] The intensity loss and the visual fidelity loss ensure that the fusion result contains information from both the visible and infrared modalities; the gradient term loss ensures that the network retains the details and textures of the source images; the color loss constrains the model to learn color fidelity. By constraining the second-stage model with these losses, complementary information from the infrared and visible light modalities is effectively fused, making the fusion result visually consistent with human perception.
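As a non-limiting illustration of a maximum-selection strategy for the intensity and gradient terms, the following NumPy sketch builds each target by taking the element-wise maximum over the source images. The loss formulas of the patent are not reproduced in this text, so the L1 form, the choice of sources entering each maximum, the gradient magnitude computed with `numpy.gradient`, and all function names are assumptions.

```python
import numpy as np

def grad_mag(img: np.ndarray) -> np.ndarray:
    """|d/dx| + |d/dy| of a 2-D image, using numpy.gradient."""
    gy, gx = np.gradient(img.astype(np.float64))
    return np.abs(gx) + np.abs(gy)

def max_intensity_loss(fused, infrared, enhanced) -> float:
    """Assumed form: mean|P - max(infrared, enhanced visible)|."""
    target = np.maximum(infrared, enhanced)
    return float(np.abs(fused - target).mean())

def max_gradient_loss(fused, infrared, low_light, enhanced) -> float:
    """Assumed form: mean| |grad P| - max(|grad I|, |grad V|, |grad V_enh|) |."""
    target = np.maximum.reduce(
        [grad_mag(infrared), grad_mag(low_light), grad_mag(enhanced)]
    )
    return float(np.abs(grad_mag(fused) - target).mean())

# Hypothetical 3x3 single-channel inputs.
ramp = np.tile(np.arange(3.0), (3, 1))  # horizontal ramp: all the texture
zeros = np.zeros((3, 3))                # textureless image
```

In this sketch a fused image that keeps the brightest pixel and the strongest edge among its sources incurs zero loss, which is one way to read the requirement that the result contain information from both modalities.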

[0154] After training, a dark scene image processing model is obtained.

[0155] S5. Use the dark scene image processing model obtained in step S4 to perform actual image processing.

[0156] The method of the present invention will be further described below with reference to an embodiment:

[0157] To verify the fusion performance of the method of the present invention in low-light environments, images from the LLVIP and MSRS datasets were processed using existing direct fusion and enhancement fusion methods, and the resulting images were compared. Direct fusion methods do not consider the degradation of the source images and only reflect an algorithm's ability to fuse complementary information. In this embodiment, CDDFuse, U2Fusion, PIAFusion, MetaFusion, DDFM, Dif-Fusion, and SwinFusion were used as the direct fusion baselines; the comparison results are shown in Figure 2 and Figure 3. Enhancement fusion methods obtain a fused image with high-quality visual effects through image enhancement operations. In this embodiment, MHLP, DIVFusion, DDBF, Text-IF, and three recent low-light enhancement networks (URetinex-Net, PairLIE, and MIRNet-v2) were used as the enhancement fusion baselines; the comparison results are shown in Figure 4 and Figure 5.

Claims

1. An image processing method for dark scenes based on infrared and visible light fusion, characterized in that it includes the following steps:
S1. Acquire existing image data to obtain a low-light dataset and a dual-light fusion dataset;
S2. Perform anti-noise processing on the low-light dataset obtained in step S1 to obtain a low-light training dataset;
S3. Construct an initial dark scene image processing model based on the fusion of infrared and visible light;
S4. Using the dual-light fusion dataset obtained in step S1 and the low-light training dataset obtained in step S2, train the initial dark scene image processing model constructed in step S3 to obtain the dark scene image processing model;
S5. Use the dark scene image processing model obtained in step S4 to perform actual image processing;
The anti-noise processing described in step S2 is expressed by a formula whose terms denote the low-light dataset, additive white Gaussian noise, and the low-light training dataset;
The initial dark scene image processing model described in step S3 includes a first-stage model and a second-stage model; the first-stage model performs low-light enhancement and infrared contrast enhancement on the input image data, and its results are input into the second-stage model for multimodal feature fusion and reconstruction, which yields the output of the initial dark scene image processing model;
The first-stage model includes a low-light enhancement network submodule and an infrared restoration network submodule, both constructed based on the Restormer model; the two submodules have the same structure, and each submodule includes an encoder part and a decoder part.
The encoder part includes a first convolutional layer module, a first Restormer module, a first downsampling module, a second Restormer module, a second downsampling module, a third Restormer module, and a third downsampling module, all connected in series; the output of the third downsampling module serves as the input to the decoder part;
The decoder part includes a fourth Restormer module, a first upsampling module, a fifth Restormer module, a second upsampling module, a sixth Restormer module, a third upsampling module, a seventh Restormer module, and a second convolutional layer module, all connected in series; the fifth Restormer module uses the outputs of the first and third upsampling modules as inputs; the sixth Restormer module uses the outputs of the second and third upsampling modules as inputs; and the seventh Restormer module uses the outputs of the third upsampling module and the first Restormer module as inputs;
The outputs of the first, second, third, and third downsampling modules in the encoder part of each submodule are extracted to obtain the infrared multi-scale features and the low-light multi-scale features;
The output of the decoder part in the low-light enhancement network submodule is used as the output of the low-light enhancement network submodule; the output of the decoder part in the infrared restoration network submodule is used as the output of the infrared restoration network submodule;
The first to third downsampling modules have the same structure, each including a convolutional layer, instance normalization and a GRLU activation function connected in series; the first to third upsampling modules have the same structure, each including a convolutional layer, instance normalization and a GRLU activation function connected in series.
The first convolutional layer module has the same structure as the second convolutional layer module, each including a convolutional block, instance normalization and a GRLU activation function in sequence; the convolutional block is composed of convolutional kernels, which slide over the input image as a window and compute the dot product at each position, thereby generating a feature map;
The first through seventh Restormer modules have the same structure;
The input to the low-light enhancement network submodule is low-light image data; the input to the infrared restoration network submodule is infrared image data;
The second-stage model includes an image fusion module and a fused image decoding module;
The image fusion module includes frozen infrared and visible light encoders and a channel attention spatial embedding module; the image fusion module takes the low-light multi-scale features and the infrared multi-scale features as input;
The frozen infrared and visible light encoders are identical to the encoder parts of the low-light enhancement network submodule and the infrared restoration network submodule in the trained first-stage model; the frozen infrared and visible light encoders perform feature extraction on the input low-light multi-scale features and infrared multi-scale features to obtain deep features; the frozen infrared and visible light encoders only participate in the model training process;
The channel attention spatial embedding module includes an average pooling branch, a max pooling branch, and a spatial embedding submodule; the input low-light multi-scale features and infrared multi-scale features are concatenated along the channel dimension to obtain extended features; the extended features are processed by the max pooling branch and the average pooling branch respectively, the two results are summed pixel by pixel, and the sum is multiplied pixel by pixel with the extended features to obtain the channel attention features;
The spatial embedding submodule processes the input low-light multi-scale features and infrared multi-scale features to obtain the spatial embedding features; spatial embedding is achieved by calculating the difference matrix between features, expressed by a formula whose terms denote the feature vectors at different scales and pixel-by-pixel multiplication; the difference matrix is expressed by a formula whose terms denote the tanh activation function, the sigmoid activation function, pixel-by-pixel multiplication, the Manhattan distance matrix, and the cosine similarity matrix; the Manhattan distance matrix and the cosine similarity matrix are calculated by formulas whose terms denote the infrared feature vector at a given position, the low-light feature vector at that position, the pixel value in each dimension of the infrared multi-scale features at that position, the pixel value in each dimension of the low-light multi-scale features at that position, the Manhattan distance between the two feature vectors, the cosine similarity between the two feature vectors, the number of feature channels, and the absolute value operation;
Finally, the channel attention spatial embedding module integrates the channel attention features and the spatial embedding features; by combining infrared and visible light features at different spatial distances, the network can better aggregate complementary information from different modalities to obtain the fused features, expressed by a formula whose terms denote the feature vectors at different scales;
The fused image decoding module has the same structure as the decoder part of the low-light enhancement network submodule or the infrared restoration network submodule in the first-stage model; the input of the fused image decoding module is the fused features, from which the normal-illumination fused image is reconstructed by the decoder and used as the output of the model.
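As a non-limiting illustration of the Manhattan distance matrix and the cosine similarity matrix named in claim 1, the following NumPy sketch computes both quantities per spatial position across the feature channels. The `(C, H, W)` layout, the small constant `eps` that guards against division by zero, and the function names are assumptions; the way the two matrices are combined through the tanh and sigmoid activations is not reproduced in this text and is therefore not sketched.

```python
import numpy as np

def manhattan_map(f_ir: np.ndarray, f_vis: np.ndarray) -> np.ndarray:
    """Per-position Manhattan distance: sum over channels of |f_ir - f_vis|."""
    return np.abs(f_ir - f_vis).sum(axis=0)

def cosine_map(f_ir: np.ndarray, f_vis: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-position cosine similarity between the two channel vectors."""
    dot = (f_ir * f_vis).sum(axis=0)
    norms = np.linalg.norm(f_ir, axis=0) * np.linalg.norm(f_vis, axis=0)
    return dot / (norms + eps)  # eps is an assumption, not from the claim

# Hypothetical features with 4 channels on a 2x2 grid.
f_ir = np.ones((4, 2, 2))
f_vis = 2.0 * np.ones((4, 2, 2))
d = manhattan_map(f_ir, f_vis)  # shape (2, 2)
s = cosine_map(f_ir, f_vis)     # shape (2, 2)
```

In this example the two feature vectors point in the same direction but differ in magnitude, so the cosine similarity is close to 1 while the Manhattan distance is nonzero, which shows that the two matrices capture different kinds of difference.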

2. The image processing method for dark scenes based on infrared and visible light fusion according to claim 1, characterized in that the training includes pre-training the first-stage model, fine-tuning the pre-trained low-light enhancement network submodule, and training the second-stage model; the initial dark scene image processing model first completes the pre-training of the first-stage model, then the fine-tuning of the pre-trained low-light enhancement network submodule, and finally the training of the second-stage model, to obtain the dark scene image processing model.

3. The image processing method for dark scenes based on infrared and visible light fusion according to claim 2, characterized in that the pre-training of the first-stage model includes pre-training of the low-light enhancement network submodule and pre-training of the infrared restoration network submodule;
The low-light enhancement network submodule is pre-trained in a supervised manner, using a normal-illumination image as the ground truth; the loss function is expressed by a formula containing three weighting coefficients; an intensity term guides the learning of the intensity distribution of the reference image and is expressed by a formula whose terms denote the output of the low-light enhancement network submodule and the corresponding normal-illumination image; the pre-training of the low-light enhancement network submodule is constrained at the structural level by a formula based on the structural similarity loss; the structural similarity loss is calculated by a formula whose terms denote the image pair being compared, the means of the image pair, the variances of the image pair, the covariance between the images, and two constants; the gradient smoothness of the output image is constrained to suppress noise and is expressed by a formula whose terms denote the output of the low-light enhancement network submodule, the first-order gradient operators, the feature height, the feature width, and the color channels, where R is the red channel, G is the green channel, and B is the blue channel;
The infrared restoration network submodule is pre-trained in a supervised manner, using normal-illumination images as the ground truth.
The loss function is expressed by a formula containing three weighting coefficients; an intensity term guides the learning of the intensity distribution of the reference image and is expressed by a formula whose terms include the output of the infrared restoration network submodule; the contrast of the infrared image is stretched according to $I_{stretch} = \alpha I + (1-\alpha) I_{mean}$, where $I_{mean}$ is the average value of the infrared image, $\alpha$ is the intensity adjustment parameter, and $I$ is the input to the infrared restoration network submodule; the pre-training of the low-light enhancement network submodule is constrained at the structural level by a formula; the gradient smoothness of the output image is constrained to suppress noise and is expressed by a formula whose terms denote the output of the infrared restoration network submodule, the first-order gradient operators, the feature height, the feature width, and the color channels, where R is the red channel, G is the green channel, and B is the blue channel.
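As a non-limiting illustration of the structural similarity term in claim 3, which is defined from the means, variances, covariance and two constants of an image pair, the following NumPy sketch evaluates the standard structural similarity index over a whole image as a single window. The single-window simplification, the constants `c1 = 0.01**2` and `c2 = 0.03**2` (common defaults for intensities in [0, 1]), and the use of `1 - SSIM` as the loss are assumptions of this sketch.

```python
import numpy as np

def global_ssim(x, y, c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> float:
    """SSIM over the whole image treated as one window."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    mu_x, mu_y = x.mean(), y.mean()            # means of the image pair
    var_x, var_y = x.var(), y.var()            # variances of the image pair
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()  # covariance between the images
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(num / den)

def ssim_loss(x, y) -> float:
    """Assumed loss form: 1 - SSIM, so identical images give zero loss."""
    return 1.0 - global_ssim(x, y)

# Hypothetical 4x4 reference and its intensity-inverted copy.
reference = np.linspace(0.0, 1.0, 16).reshape(4, 4)
inverted = 1.0 - reference
```

Practical implementations usually compute the index over local sliding windows and average the results; the single-window form is used here only to keep the correspondence with the listed statistics visible.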

4. The image processing method for dark scenes based on infrared and visible light fusion according to claim 3, characterized in that the pre-trained low-light enhancement network submodule is fine-tuned with a loss function that combines a fine-tuning intensity loss and a structural loss through the weighting coefficients $\lambda_4$ and $\lambda_5$; the loss terms are expressed by a formula whose terms denote the output of the low-light enhancement network submodule and the adjusted pseudo ground truth.

5. The image processing method for dark scenes based on infrared and visible light fusion according to claim 4, characterized in that the second-stage model is trained with the total loss function $l_{fuse}$, in which $\lambda_6$, $\lambda_7$, $\lambda_8$ and $\lambda_9$ are weighting coefficients and the four loss terms are the second-stage intensity loss, the visual fidelity loss, the gradient term loss, and the color loss; the loss terms are expressed by formulas whose terms denote the output of the low-light enhancement network submodule, the fused image output by the model, the Cb and Cr channels of the reference in the YCbCr color space, the Cb and Cr channels of the network output in the YCbCr color space, the maximum operation, the gradient of the output image, the gradient of the infrared image, the gradient of the low-light image, and the gradient of the output of the low-light enhancement network submodule;
The intensity loss and the visual fidelity loss ensure that the fusion result contains information from both the visible and infrared modalities; the gradient term loss ensures that the network retains the details and textures of the source images; the color loss constrains the model to learn color fidelity; by constraining the second-stage model with these losses, complementary information from the infrared and visible light modalities is effectively fused, making the fusion result visually consistent with human perception; after training, the dark scene image processing model is obtained.