A multispectral image fusion method, device and equipment based on multi-target segmentation
By combining a multi-target segmentation network with a multispectral image fusion network, the method addresses the failure of existing techniques to highlight salient targets effectively, and generates high-quality fused multispectral images that suit human visual perception.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ZHEJIANG UNIV
- Filing Date
- 2023-03-24
- Publication Date
- 2026-06-23
AI Technical Summary
Existing multispectral image fusion methods cannot effectively highlight salient targets and cannot select different fusion strategies according to the target category, resulting in poor quality of the generated fused images.
A multi-target segmentation network is used to perform multi-target semantic segmentation on visible light and infrared images to generate multi-target segmented images. Deep features are extracted and fused through a multispectral image fusion network. An adaptive fusion is performed using a multi-target enhanced feature fusion module to generate high-quality target fusion images.
The generated fused image effectively highlights prominent targets in the infrared image while preserving the natural modality of the visible light image, thus improving image quality.
Figure CN116385326B_ABST
Description
Technical Field
[0001] This application relates to the field of image processing technology, and in particular to a multispectral image fusion method, apparatus and device based on multi-target segmentation.
Background Technology
[0002] Multispectral image fusion technology plays a crucial role in remote sensing, autonomous driving, and medical diagnosis. Invention patent CN113033630A discloses a deep learning fusion method for infrared and visible light images based on a dual non-local attention model. This method extracts depth features from both types of images by constructing a multi-scale deep network. The fusion layer enhances and merges the extracted depth features using a spatial and channel dual non-local attention model, and obtains the fused image through feature reconstruction. While this method considers the saliency of infrared and visible light image features, the resulting fused image still fails to highlight salient targets and cannot select different fusion strategies based on target categories. Therefore, there is an urgent need in the industry for an image fusion method to generate high-quality fused images.
Summary of the Invention
[0003] In view of this, this application provides a multispectral image fusion method, apparatus and device based on multi-target segmentation, the main purpose of which is to solve the problems that current fusion methods cannot highlight salient targets in the fused image and cannot select different fusion strategies according to target categories.
[0004] According to a first aspect of this application, a multispectral image fusion method based on multi-target segmentation is provided, the method comprising:
[0005] Acquire visible light and infrared images, and perform image registration processing on the visible light and infrared images to obtain registered target visible light and target infrared images;
[0006] A multi-target segmentation network is used to perform multi-target semantic segmentation on the target visible light image and the target infrared image to generate a multi-target segmentation image. The multi-target segmentation image contains at least one target subset, and the at least one target subset is used to indicate the pixel region corresponding to at least one salient target class.
[0007] A first depth feature corresponding to the target visible light image and a second depth feature corresponding to the target infrared image are extracted based on a multispectral image fusion network. A target fused image is generated by fusing the first depth feature, the second depth feature and the multi-target segmentation image.
[0008] Optionally, the step of using a multi-target segmentation network to perform multi-target semantic segmentation on the visible light image and the infrared image of the target to generate a multi-target segmented image includes:
[0009] The visible light image and the infrared image of the target are input into the multi-target segmentation network, which includes an encoding sub-network and a decoding sub-network, wherein the encoding sub-network includes a visible light image encoding stream and an infrared image encoding stream;
[0010] The visible light image encoding stream is used to extract features from the target visible light image to obtain visible light image features at multiple scales. The infrared image encoding stream is used to extract features from the target infrared image to obtain infrared image features at multiple scales.
[0011] The visible light image features at each scale are added and fused with the infrared image features at the corresponding scale according to the scale identifier to obtain fused features at multiple scales. The fused features at each scale are then added to the decoding convolutional block at the corresponding scale through skip connections and channel merging. The decoding convolutional block is located in the decoding sub-network.
[0012] The decoding subnetwork includes multiple scales of decoding convolutional blocks. Each scale of decoding convolutional block reconstructs features based on the received fused features and the reconstruction features passed from the previous scale of decoding convolutional block to obtain features to be constrained. The cross-entropy loss function is used to constrain the features to be constrained to obtain target reconstruction features. The target reconstruction features are then passed to the next scale of decoding convolutional block until the last scale of decoding convolutional block outputs the feature map.
[0013] The feature map is activated using a preset activation function, the predicted values in the feature map are converted into probability values, and the predicted categories are generated to obtain the multi-target segmentation image.
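The activation step in [0013] and the cross-entropy constraint in [0012] can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the preset activation function is a softmax over per-pixel class scores, and the names `softmax`, `predict_label_map` and `cross_entropy`, as well as the example values, are hypothetical.

```python
import math

def softmax(logits):
    """Convert a list of raw class scores into probabilities."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label_map(feature_map):
    """feature_map[row][col] is a list of per-class scores for one pixel.
    Returns (probabilities, labels) with the same row/column layout."""
    probs = [[softmax(pixel) for pixel in row] for row in feature_map]
    # The predicted category is the class with the highest probability.
    labels = [[max(range(len(p)), key=p.__getitem__) for p in row]
              for row in probs]
    return probs, labels

def cross_entropy(probs, target_labels):
    """Mean negative log-probability of the true class over all pixels."""
    losses = [-math.log(p[t])
              for prob_row, target_row in zip(probs, target_labels)
              for p, t in zip(prob_row, target_row)]
    return sum(losses) / len(losses)

# Hypothetical 1x2 feature map with three classes per pixel.
feature_map = [[[2.0, 0.5, 0.1], [0.2, 0.3, 3.0]]]
probs, labels = predict_label_map(feature_map)
loss = cross_entropy(probs, [[0, 2]])
```

In a trained network the loss would drive the decoder parameters during training, while only the softmax and the per-pixel argmax are needed at inference time.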
[0014] Optionally, the step of extracting features from the target visible light image using the visible light image encoding stream to obtain visible light image features at multiple scales includes:
[0016] The visible light image encoding stream is composed of multiple scale-based encoding convolutional modules. Each scale-based encoding convolutional module is connected to an attention enhancement module, which is one of a spatial attention enhancement module, a spatial and channel attention enhancement module, or a channel attention enhancement module. The encoding convolutional modules are used for feature extraction, and the attention enhancement modules are used for feature enhancement and suppression of redundant features.
[0017] The target visible light image is feature extracted using a first-scale encoding convolution module, and weighted feature enhancement is performed using a first-scale attention enhancement module to obtain first-scale visible light image features. The first-scale visible light image features are then input into a second-scale encoding convolution module and a second-scale attention enhancement module to generate second-scale visible light image features, until the last-scale attention enhancement module outputs the last-scale visible light image features.
[0018] The visible light image features output by the attention enhancement module at each scale are collected to obtain the visible light image features at the multiple scales.
[0019] Optionally, the step of extracting features from the target infrared image using the infrared image encoding stream to obtain infrared image features at multiple scales includes:
[0020] The infrared image encoding stream is composed of multiple scale-based encoding convolutional modules. Each scale-based encoding convolutional module is connected to an attention enhancement module, which is one of a spatial attention enhancement module, a spatial and channel attention enhancement module, or a channel attention enhancement module. The encoding convolutional modules are used for feature extraction, and the attention enhancement modules are used for feature enhancement and suppression of redundant features.
[0021] The target infrared image is feature extracted using a first-scale encoding convolution module, and weighted feature enhancement is performed using a first-scale attention enhancement module to obtain first-scale infrared image features. The first-scale infrared image features are then input into a second-scale encoding convolution module and a second-scale attention enhancement module to generate second-scale infrared image features, until the last-scale attention enhancement module outputs the last-scale infrared image features.
[0022] The infrared image features output by the attention enhancement module at each scale are collected to obtain the infrared image features at the multiple scales.
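As an illustration of how an attention enhancement module can enhance useful features and suppress redundant ones, the sketch below applies a simple channel attention gate. The application does not specify the internal form of the attention modules, so everything here is an assumption: the gate is a sigmoid of the globally averaged channel response, and `weights` and `biases` stand in for hypothetical learned parameters.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(features, weights, biases):
    """features: list of channels, each a 2-D list of floats.
    weights, biases: hypothetical learned per-channel gate parameters.
    Returns the re-weighted channels and the gate used for each one."""
    enhanced, gates = [], []
    for channel, w, b in zip(features, weights, biases):
        values = [v for row in channel for v in row]
        mean = sum(values) / len(values)   # global average pooling
        gate = sigmoid(w * mean + b)       # weight in (0, 1)
        gates.append(gate)
        enhanced.append([[gate * v for v in row] for row in channel])
    return enhanced, gates

features = [
    [[4.0, 4.0], [4.0, 4.0]],     # strongly activated channel
    [[0.1, -0.1], [0.1, -0.1]],   # near-zero, likely redundant channel
]
enhanced, gates = channel_attention(features, weights=[1.0, 1.0],
                                    biases=[-1.0, -1.0])
```

With these example parameters the strongly activated channel keeps most of its magnitude, while the near-zero channel is scaled down further. A spatial attention module would apply the same idea per pixel position rather than per channel.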
[0023] Optionally, the step of extracting a first depth feature corresponding to the visible light image of the target and a second depth feature corresponding to the infrared image of the target based on a multispectral image fusion network, and generating a target fused image by fusing the first depth feature, the second depth feature, and the multi-target segmentation image, includes:
[0024] The visible light image and the infrared image of the target are input into the coding sub-network of the multispectral image fusion network. The coding sub-network extracts features from the visible light image and the infrared image of the target to obtain the first depth feature and the second depth feature.
[0025] The first depth feature, the second depth feature, and the multi-target segmentation image are passed to the fusion layer of the multispectral image fusion network. The first depth feature, the second depth feature, and the multi-target segmentation image are fused by the multi-target enhancement feature fusion module in the fusion layer to obtain the fused feature.
[0026] The decoding subnetwork of the multispectral image fusion network performs feature reconstruction on the fused features to obtain the target fused image.
[0027] Optionally, the step of fusing the first depth feature, the second depth feature, and the multi-target segmentation image through the multi-target enhancement feature fusion module in the fusion layer to obtain fused features includes:
[0028] The multi-target enhancement feature fusion module is used to fuse the first depth feature, the second depth feature, and the multi-target segmentation image according to the salient target class to obtain background features, secondary salient target features, and primary salient target features.
[0029] The background features, the secondary salient target features, and the primary salient target features are added together to obtain the fused features.
[0030] Optionally, the step of fusing features of the first depth feature, the second depth feature, and the multi-target segmentation image according to the salient target class to obtain background features, secondary salient target features, and primary salient target features includes:
[0031] When the salient target class is the background class, the pixel region of the background class is determined based on the target subset corresponding to the multi-target segmentation image. The background binary mask corresponding to the pixel region and the first depth feature are used to perform feature fusion to obtain the background feature.
[0032] When the salient target class is a secondary salient target class, the pixel region of the secondary salient target class is determined based on the target subset, and the secondary salient target binary mask corresponding to the pixel region and the first depth feature are used for feature fusion to obtain the secondary salient target feature;
[0033] When the salient target class is a primary salient target class, the pixel region of the primary salient target class is determined based on the target subset. The first pixel region of the primary salient target class is determined using the primary salient target binary mask corresponding to the pixel region and the first depth feature. The second pixel region of the primary salient target class is determined using the primary salient target binary mask and the second depth feature. Feature fusion is performed based on the first pixel region and the second pixel region to obtain the primary salient target feature.
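The class-dependent fusion described in [0028] to [0033] can be sketched on single-channel feature maps as follows. Several details are assumptions made for illustration: the class identifiers 0, 1 and 2, the function names, and in particular the rule used to combine the visible and infrared responses inside the primary salient region (an element-wise maximum here), which the text above does not specify.

```python
def binary_mask(label_map, class_id):
    """1.0 where the segmentation label equals class_id, else 0.0."""
    return [[1.0 if label == class_id else 0.0 for label in row]
            for row in label_map]

def apply_mask(mask, feature):
    """Element-wise product of a binary mask and a feature map."""
    return [[m * f for m, f in zip(mask_row, feat_row)]
            for mask_row, feat_row in zip(mask, feature)]

def fuse(label_map, vis, ir, background=0, secondary=1, primary=2):
    """vis / ir: first (visible) and second (infrared) depth features."""
    # Background and secondary salient targets use the visible feature only.
    bg = apply_mask(binary_mask(label_map, background), vis)
    sec = apply_mask(binary_mask(label_map, secondary), vis)
    # Primary salient targets draw on both modalities.
    pri_mask = binary_mask(label_map, primary)
    pri_vis = apply_mask(pri_mask, vis)
    pri_ir = apply_mask(pri_mask, ir)
    # Assumption: combine the two primary regions by element-wise maximum.
    pri = [[max(a, b) for a, b in zip(row_a, row_b)]
           for row_a, row_b in zip(pri_vis, pri_ir)]
    # The three class-specific features are added to give the fused feature.
    return [[b + s + p for b, s, p in zip(rb, rs, rp)]
            for rb, rs, rp in zip(bg, sec, pri)]

label_map = [[0, 1], [2, 2]]          # hypothetical segmentation result
vis = [[0.2, 0.6], [0.3, 0.9]]
ir = [[0.9, 0.1], [0.8, 0.4]]
fused = fuse(label_map, vis, ir)
```

Because the masks are disjoint, each pixel of the fused feature comes from exactly one of the three branches, so the final addition simply stitches the regions together.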
[0034] According to a second aspect of this application, a multispectral image fusion apparatus based on multi-target segmentation is provided, the apparatus comprising:
[0035] The acquisition module is used to acquire visible light images and infrared images, and to perform image registration processing on the visible light images and infrared images to obtain registered target visible light images and target infrared images;
[0036] The segmentation module is used to perform multi-target semantic segmentation on the target visible light image and the target infrared image using a multi-target segmentation network to generate a multi-target segmentation image. The multi-target segmentation image contains at least one target subset, and the at least one target subset is used to indicate the pixel region corresponding to at least one salient target class.
[0037] The fusion module is used to extract the first depth feature corresponding to the visible light image of the target and the second depth feature corresponding to the infrared image of the target based on the multispectral image fusion network, and generate a target fused image by fusing the first depth feature, the second depth feature and the multi-target segmentation image.
[0038] Optionally, the segmentation module is used to input the target visible light image and the target infrared image into the multi-target segmentation network. The multi-target segmentation network includes an encoding sub-network and a decoding sub-network, wherein the encoding sub-network includes a visible light image encoding stream and an infrared image encoding stream; the visible light image encoding stream is used to extract features from the target visible light image to obtain visible light image features at multiple scales; the infrared image encoding stream is used to extract features from the target infrared image to obtain infrared image features at multiple scales; the visible light image features at each scale are added and fused with the infrared image features at the corresponding scale according to the scale identifier to obtain fused features at multiple scales; and the fused features at each scale are then added to the decoding convolutional block at the corresponding scale through skip connections and channel merging. The decoding convolutional blocks are located in the decoding sub-network. The decoding sub-network includes decoding convolutional blocks at multiple scales. Each scale's decoding convolutional block reconstructs features based on the received fused features and the reconstructed features passed from the previous scale's decoding convolutional block, obtaining features to be constrained. The cross-entropy loss function is then used to constrain these features, obtaining target reconstructed features. These target reconstructed features are passed to the next scale's decoding convolutional block until the last scale's decoding convolutional block outputs a feature map. A preset activation function is used to activate the feature map, converting the predicted values in the feature map into probability values and generating predicted categories, thus obtaining the multi-target segmentation image.
[0039] Optionally, the segmentation module is configured so that the visible light image encoding stream is composed of multiple scale-based encoding convolutional modules. Each scale-based encoding convolutional module is connected to an attention enhancement module, which is one of a spatial attention enhancement module, a spatial and channel attention enhancement module, or a channel attention enhancement module. The encoding convolutional modules are used for feature extraction, and the attention enhancement modules are used for feature enhancement and suppression of redundant features. Features are extracted from the target visible light image using the first-scale encoding convolutional module, and weighted feature enhancement is performed using the first-scale attention enhancement module to obtain first-scale visible light image features. The first-scale visible light image features are then input into the second-scale encoding convolutional module and the second-scale attention enhancement module to generate second-scale visible light image features, and so on until the last-scale attention enhancement module outputs the last-scale visible light image features. The visible light image features output by the attention enhancement module at each scale are collected to obtain the visible light image features at the multiple scales.
[0040] Optionally, the segmentation module is configured so that the infrared image encoding stream is composed of multiple scale-based encoding convolutional modules. Each scale-based encoding convolutional module is connected to an attention enhancement module, which is one of a spatial attention enhancement module, a spatial and channel attention enhancement module, or a channel attention enhancement module. The encoding convolutional modules are used for feature extraction, and the attention enhancement modules are used for feature enhancement and suppression of redundant features. Features are extracted from the target infrared image using the first-scale encoding convolutional module, and weighted feature enhancement is performed using the first-scale attention enhancement module to obtain first-scale infrared image features. The first-scale infrared image features are then input into the second-scale encoding convolutional module and the second-scale attention enhancement module to generate second-scale infrared image features, and so on until the last-scale attention enhancement module outputs the last-scale infrared image features. The infrared image features output by the attention enhancement module at each scale are collected to obtain the infrared image features at the multiple scales.
[0041] Optionally, the fusion module is configured to input the target visible light image and the target infrared image into the encoding subnetwork of the multispectral image fusion network, extract features from the target visible light image and the target infrared image through the encoding subnetwork to obtain the first depth feature and the second depth feature; pass the first depth feature, the second depth feature and the multi-target segmented image to the fusion layer of the multispectral image fusion network, fuse the first depth feature, the second depth feature and the multi-target segmented image through the multi-target enhancement feature fusion module in the fusion layer to obtain the fused feature; and use the decoding subnetwork of the multispectral image fusion network to perform feature reconstruction on the fused feature to obtain the target image.
[0042] Optionally, the fusion module is used to perform feature fusion on the first depth feature, the second depth feature, and the multi-target segmentation image according to the salient target class, using the multi-target enhanced feature fusion module, to obtain background features, secondary salient target features, and primary salient target features; and to add the background features, the secondary salient target features, and the primary salient target features to obtain the fused features.
[0043] Optionally, the fusion module is configured to: when the salient target class is a background class, determine the pixel region of the background class based on the target subset corresponding to the multi-target segmentation image, and perform feature fusion using the background binary mask corresponding to the pixel region and the first depth feature to obtain the background feature; when the salient target class is a secondary salient target class, determine the pixel region of the secondary salient target class based on the target subset, and perform feature fusion using the secondary salient target binary mask corresponding to the pixel region and the first depth feature to obtain the secondary salient target feature; when the salient target class is a primary salient target class, determine the pixel region of the primary salient target class based on the target subset, determine the first pixel region of the primary salient target class using the primary salient target binary mask corresponding to the pixel region and the first depth feature, and determine the second pixel region of the primary salient target class using the primary salient target binary mask and the second depth feature, and perform feature fusion based on the first pixel region and the second pixel region to obtain the primary salient target feature.
[0044] According to a third aspect of this application, a computer device is provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the method described in any of the first aspects above.
[0045] According to a fourth aspect of this application, a computer-readable storage medium is provided, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method described in any one of the first aspects above.
[0046] Using the above technical solution, this application provides a multispectral image fusion method, apparatus, and device based on multi-target segmentation. First, visible light and infrared images are acquired, and image registration is performed on the visible light and infrared images to obtain registered target visible light and target infrared images. Then, a multi-target segmentation network is used to perform multi-target semantic segmentation on the target visible light and target infrared images, generating a multi-target segmentation image. The multi-target segmentation image contains at least one target subset, which indicates the pixel region corresponding to at least one salient target class. Finally, a first depth feature corresponding to the target visible light image and a second depth feature corresponding to the target infrared image are extracted based on the multispectral image fusion network. By fusing the first depth feature, the second depth feature, and the multi-target segmentation image, a target fusion image is generated. The multi-target segmentation network in this application extracts and fuses features from the target visible light and target infrared images at multiple scales, generating a high-quality multi-target segmentation image with sharp edges. The multispectral image fusion network adaptively fuses multiple target categories using a proposed multi-target enhanced feature fusion module, and reconstructs the final fusion image based on the fused features. Based on the category of multi-target segmentation, different fusion methods are used for different targets in the feature domain, so that the generated target fusion image can not only have the natural modal appearance of visible light image and produce an image that conforms to human visual perception, but also effectively highlight the salient targets in infrared image.
[0047] The above description is only an overview of the technical solution of this application. In order to better understand the technical means of this application and to implement it in accordance with the contents of the specification, and to make the above and other objects, features and advantages of this application more obvious and understandable, the following are specific embodiments of this application.
Attached Figure Description
[0048] Various other advantages and benefits will become apparent to those skilled in the art upon reading the following detailed description of preferred embodiments. The accompanying drawings are for illustrative purposes only and are not intended to limit the scope of this application. Furthermore, the same reference numerals denote the same parts throughout the drawings. In the drawings:
[0049] Figure 1 shows a flowchart of a multispectral image fusion method based on multi-target segmentation provided in an embodiment of this application;
[0050] Figure 2A shows a flowchart of a multispectral image fusion method based on multi-target segmentation provided in an embodiment of this application;
[0051] Figure 2B shows the segmentation process of a multispectral image fusion method based on multi-target segmentation provided in an embodiment of this application;
[0052] Figure 2C shows a feature enhancement diagram of a multispectral image fusion method based on multi-target segmentation provided in an embodiment of this application;
[0053] Figure 2D shows a schematic diagram of the fusion process of a multispectral image fusion method based on multi-target segmentation provided in an embodiment of this application;
[0054] Figure 2E shows a schematic diagram of the fusion result of a multispectral image fusion method based on multi-target segmentation provided in an embodiment of this application;
[0055] Figure 3 shows a schematic diagram of a multispectral image fusion device based on multi-target segmentation provided in an embodiment of this application;
[0056] Figure 4 shows a schematic diagram of the device structure of a computer device provided in an embodiment of this application.
Detailed Implementation
[0057] Exemplary embodiments of the present application will now be described in more detail with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be implemented in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this application will be thorough and complete, and will fully convey the scope of the present application to those skilled in the art.
[0058] Multispectral image fusion technology plays a crucial role in remote sensing, autonomous driving, and medical diagnosis. Visible light sensors describe rich texture details but are susceptible to light variations, exhibiting high noise and difficulty capturing useful feature information in dark, low-light, and foggy environments. Infrared sensors, while less adept at describing detailed textures, possess strong penetrating power, enabling them to capture feature information of potential targets in dark and foggy conditions. Therefore, fusion technology based on visible light and infrared sensors can compensate for the shortcomings of individual sensors, providing high-quality input images for subsequent visual perception and generating more accurate, reliable, and comprehensive decisions. Invention patent CN113033630A discloses a deep learning fusion method for infrared and visible light images based on a dual non-local attention model. This method extracts depth features from both types of images by constructing a multi-scale deep network. The fusion layer enhances and merges the extracted depth features using a spatial and channel dual non-local attention model, and obtains the fused image through feature reconstruction. Although this method considers the saliency of infrared and visible light image features, the resulting fused image still fails to highlight salient targets and cannot select different fusion strategies based on target category. Chinese patent CN111539902A discloses an image processing method, system, device, and computer-readable storage medium. The method first divides the visible light image by its low-pass filtered image and subtracts 1 to obtain visible light image detail information. 
It then calculates the visible light detail intensity using the standard deviation of a window, takes the reciprocal of the detail intensity to obtain the detail gain, and uses this gain to weight the visible light detail intensity onto the infrared image to generate a fused image. This method has a complex computational process and produces a poor-quality fused image that fails to highlight salient targets. There is an urgent need in the industry for an image fusion method to generate high-quality fused images. Therefore, this application provides a multispectral image fusion method based on multi-target segmentation. This application first acquires visible light and infrared images, performs image registration processing on the visible light and infrared images to obtain registered target visible light and target infrared images. Subsequently, a multi-target segmentation network is used to perform multi-target semantic segmentation on the target visible light and target infrared images to generate a multi-target segmented image. The multi-target segmented image contains at least one target subset, which indicates the pixel region corresponding to at least one salient target class. Finally, a first depth feature corresponding to the visible light image of the target and a second depth feature corresponding to the infrared image of the target are extracted based on the multispectral image fusion network. The first depth feature, the second depth feature, and the multi-target segmentation image are then fused to generate a target fused image. The multi-target segmentation network in this application extracts and fuses features from the visible light and infrared images of the target at multiple scales, generating a high-quality multi-target segmentation image with sharp edges. 
The multispectral image fusion network adaptively fuses multiple target categories using the proposed multi-target enhanced feature fusion module, and reconstructs the final fused image based on the fused features. Because different fusion methods are applied in the feature domain to different targets according to their segmentation category, the generated target fused image retains the natural appearance of the visible light image, conforms to human visual perception, and effectively highlights the salient targets in the infrared image.
[0059] This application provides a multispectral image fusion method based on multi-target segmentation. As shown in Figure 1, the method includes:
[0060] 101. Acquire visible light and infrared images, perform image registration processing on the visible light and infrared images to obtain the registered target visible light and infrared images.
[0061] Visible light refers to the wavelength range visible to the human eye, and a visible light image is a photograph taken within this range. Infrared images, also known as thermal images, are formed by a thermal infrared scanner receiving and recording the thermal radiation emitted by a target object. Visible light images describe rich texture details but are easily affected by lighting conditions, exhibiting high noise and difficulty in capturing useful feature information in dark, low-light, or foggy environments. Infrared sensors, while less adept at describing detailed textures, possess strong penetrating power and can capture feature information of potential targets in dark or foggy environments.
[0062] Specifically, the visible light image and the infrared image are acquired by a binocular camera. Because the two cameras differ in position and viewing angle, the directly acquired visible light image and infrared image exhibit a large parallax. Therefore, preliminary image registration must be performed on the directly acquired visible light and infrared images to reduce the positional error of pixels belonging to salient target classes in the two images, thereby improving the accuracy of subsequent feature extraction and the success rate of feature fusion.
[0063] 102. A multi-target segmentation network is used to perform multi-target semantic segmentation on the visible light image and the infrared image of the target, generating a multi-target segmentation image. The multi-target segmentation image contains at least one target subset, which is used to indicate the pixel region corresponding to at least one salient target class.
[0064] The multi-target segmentation network, a deep network for multi-target segmentation, includes an encoding subnetwork and a decoding subnetwork. The encoding subnetwork consists of two identical, weight-shared encoding streams (a visible light image encoding stream and an infrared image encoding stream) used to extract image features from the visible light and infrared images of the target at different resolutions, i.e., multi-scale visible light image features and multi-scale infrared image features. The decoding subnetwork generates the multi-target segmented image through multiple cascaded decoding convolutional blocks.
[0065] 103. Based on a multispectral image fusion network, extract the first depth feature corresponding to the visible light image of the target and the second depth feature corresponding to the infrared image of the target. By fusing the first depth feature, the second depth feature and the multi-target segmentation image, generate a target fused image.
[0066] The multispectral image fusion network includes an encoding subnetwork, a fusion layer subnetwork, and a decoding subnetwork.
[0067] Specifically, the encoding subnetwork extracts the first depth features corresponding to the visible light image of the target and the second depth features corresponding to the infrared image of the target. Then, the fusion layer subnetwork fuses the extracted first and second depth features with features from the multi-target segmentation image to obtain fused features. Finally, the decoding subnetwork performs feature reconstruction on the fused features to generate the target fused image.
[0068] The method provided in this application first acquires visible light and infrared images, and performs image registration processing on the visible light and infrared images to obtain registered target visible light and target infrared images. Then, a multi-target segmentation network is used to perform multi-target semantic segmentation on the target visible light and target infrared images to generate a multi-target segmentation image. The multi-target segmentation image contains at least one target subset, which indicates the pixel region corresponding to at least one salient target class. Finally, a first depth feature corresponding to the target visible light image and a second depth feature corresponding to the target infrared image are extracted based on a multispectral image fusion network. By fusing the first depth feature, the second depth feature, and the multi-target segmentation image, a target fused image is generated. The multi-target segmentation network in this application extracts and fuses features from the target visible light and target infrared images at multiple scales to generate a high-quality multi-target segmentation image with sharp edges. The multispectral image fusion network adaptively fuses multiple target categories using a proposed multi-target enhanced feature fusion module, and reconstructs the final fused image based on the fused features. Based on the category of multi-target segmentation, different fusion methods are used for different targets in the feature domain, so that the generated target fusion image can not only have the natural modal appearance of visible light image and produce an image that conforms to human visual perception, but also effectively highlight the salient targets in infrared image.
[0069] This application provides a multispectral image fusion method based on multi-target segmentation. As shown in Figure 2A, the method includes:
[0070] 201. Acquire visible light and infrared images, perform image registration processing on the visible light and infrared images to obtain the registered target visible light and infrared images.
[0071] Specifically, the visible light image and the infrared image are acquired by a binocular camera. Because the two cameras differ in position and viewing angle, the directly acquired visible light image and infrared image exhibit a large parallax. Therefore, preliminary image registration must be performed on the directly acquired visible light and infrared images to reduce the positional error of pixels belonging to salient target classes in the two images, thereby improving the accuracy of subsequent feature extraction and the success rate of feature fusion.
[0072] Specifically, image registration can be performed by calibrating the camera. In this embodiment, a black-and-white checkerboard calibration method can be used to calibrate the binocular camera and obtain a perspective transformation matrix. This matrix is then used to apply a perspective transformation to the acquired infrared image, achieving preliminary registration with the visible light image.
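The registration step can be illustrated with a minimal sketch, assuming the calibration has already produced a 3x3 matrix. The function names `apply_perspective` and `warp_image`, the inverse-mapping convention, and the nearest-neighbour sampling are illustrative assumptions, not details given in this application.

```python
def apply_perspective(matrix, x, y):
    """Map pixel (x, y) through a 3x3 perspective transformation matrix."""
    xh = matrix[0][0] * x + matrix[0][1] * y + matrix[0][2]
    yh = matrix[1][0] * x + matrix[1][1] * y + matrix[1][2]
    w = matrix[2][0] * x + matrix[2][1] * y + matrix[2][2]
    if w == 0:
        raise ValueError("point maps to infinity")
    return xh / w, yh / w


def warp_image(image, matrix_inv, fill=0):
    """Warp a grayscale image (list of rows) with nearest-neighbour sampling.

    matrix_inv maps each output pixel back into the source image
    (inverse mapping), so every output pixel receives exactly one value.
    """
    height, width = len(image), len(image[0])
    out = [[fill] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            sx, sy = apply_perspective(matrix_inv, u, v)
            ix, iy = int(round(sx)), int(round(sy))
            if 0 <= ix < width and 0 <= iy < height:
                out[v][u] = image[iy][ix]
    return out


# Hypothetical matrix: each output pixel reads the source pixel one column to its right.
shift_left = [[1, 0, 1], [0, 1, 0], [0, 0, 1]]
infrared = [[1, 2, 3], [4, 5, 6]]
print(warp_image(infrared, shift_left))  # [[2, 3, 0], [5, 6, 0]]
```

In practice an image library would normally perform the warp with interpolation; the loop above only makes the coordinate mapping explicit.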
[0073] 202. A multi-target segmentation network is used to perform multi-target semantic segmentation on the visible light image and the infrared image of the target, generating a multi-target segmented image.
[0074] The multi-target segmentation network, a deep network for multi-target segmentation, includes an encoding subnetwork and a decoding subnetwork. The encoding subnetwork consists of two identical, weight-shared encoding streams (a visible light image encoding stream and an infrared image encoding stream) used to extract image features from the visible light and infrared images of the target at different resolutions. Due to the resolution variation, multiple scales exist; therefore, the image features at different resolutions are essentially multi-scale visible light image features and multi-scale infrared image features. The decoding subnetwork generates the multi-target segmented image through multiple cascaded decoding convolutional blocks.
[0075] I. Encoding Subnetwork
[0076] As shown in Figure 2B, the encoding subnetwork includes a visible light image encoding stream and an infrared image encoding stream, both of which use a ResNet (Residual Network) with multiple encoding convolutional blocks as their backbone. This application uses five encoding convolutional blocks as an example. Each block is identified by a stream index i and a block index j, where i = vis denotes the visible light image encoding stream and i = ir denotes the infrared image encoding stream, and j = 1, 2, 3, 4, 5 denotes the j-th convolutional block in the encoding stream. Each convolutional block is followed by an attention enhancement module. ResNet is typically designed to extract features from three-channel RGB images, but the visible light and infrared images used in this invention are both single-channel grayscale images. Therefore, the first convolutional layer of the encoding network is modified to change the number of input channels from three to one, so that the target visible light and target infrared images can be input into the network.
[0077] This application takes into account that, under the influence of environmental factors such as lighting and weather, the saliency of the feature maps of the target visible light image $I_{VIS}$ and the target infrared image $I_{IR}$ should differ. Therefore, the encoding convolutional block at each of the five scales is followed by an attention enhancement module. As shown in Figure 2C, the attention enhancement module is one of the following: a spatial attention enhancement module, a spatial and channel attention enhancement module, or a channel attention enhancement module. The features extracted by the shallow encoding convolutional blocks have high resolution and therefore rich spatial information. As the network deepens, the resolution of the feature maps gradually decreases and their spatial extent is continuously compressed, while the number of channels keeps increasing, so channel attention becomes more appropriate for feature enhancement. Specifically, the Spatial Attention Enhancement Module (SAT) is used after the first and second encoding convolutional blocks, the Spatial and Channel Attention Enhancement Module (SCAT) is used after the third block, and the channel attention enhancement module (CAT) is used after the fourth and fifth blocks, as shown in Formula 1 below:
[0078] Formula 1:
[0079] In this embodiment, both the visible light image encoding stream and the infrared image encoding stream are concatenated with multiple scale-based encoding convolutional modules. Each scale-based encoding convolutional module is connected to an attention enhancement module, which can be one of a spatial attention enhancement module, a spatial and channel attention enhancement module, or a channel attention enhancement module. The encoding convolutional modules are used for feature extraction, while the attention enhancement modules are used for feature enhancement and suppressing redundant features. After the target visible light image and the target infrared image are input into the multi-target segmentation network, feature extraction is first performed through the encoding sub-network. Specifically, the visible light image encoding stream is used to extract features from the target visible light image to obtain visible light image features at multiple scales, and the infrared image encoding stream is used to extract features from the target infrared image to obtain infrared image features at multiple scales.
[0080] In other words, in the visible light image encoding stream, the first-scale encoding convolution module performs feature extraction on the target visible light image, and the first-scale attention enhancement module (SAT) applies weighted feature enhancement to obtain the first-scale visible light image features. These features are input into the second-scale encoding convolution module, and the second-scale attention enhancement module (SAT) generates the second-scale visible light image features. These are in turn input into the third-scale encoding convolution module, and the third-scale attention enhancement module (SCAT) generates the third-scale visible light image features. These are input into the fourth-scale encoding convolution module, and the fourth-scale attention enhancement module (CAT) generates the fourth-scale visible light image features. Finally, these are input into the fifth-scale encoding convolution module, and the fifth-scale attention enhancement module (CAT) generates and outputs the visible light image features at the final scale. Collecting the visible light image features output by the attention enhancement modules at each scale yields the visible light image features at multiple scales.
[0081] In the infrared image encoding stream, the first-scale encoding convolution module performs feature extraction on the target infrared image, and the first-scale attention enhancement module (SAT) applies weighted feature enhancement to obtain the first-scale infrared image features. These features are input into the second-scale encoding convolution module, and the second-scale attention enhancement module (SAT) generates the second-scale infrared image features. These are in turn input into the third-scale encoding convolution module, and the third-scale attention enhancement module (SCAT) generates the third-scale infrared image features. These are input into the fourth-scale encoding convolution module, and the fourth-scale attention enhancement module (CAT) generates the fourth-scale infrared image features. Finally, these are input into the fifth-scale encoding convolution module, and the fifth-scale attention enhancement module (CAT) generates and outputs the infrared image features at the final scale. Collecting the infrared image features output by the attention enhancement modules at each scale yields the infrared image features at multiple scales. It should be noted that in both the visible light image encoding stream and the infrared image encoding stream, features are enhanced by weighting at each attention enhancement module, and the enhanced features are then fed into the next scale of the respective encoding stream. The weights here are the parameters learned by the attention enhancement modules: the attention modules are trained as part of the neural network until convergence, and the features of each encoding stream are enhanced by being weighted with these learned weights when passing through the attention enhancement module.
[0082] In actual operation, as shown in Figure 2B, the visible light image is processed in sequence by V1-SAT (first scale), V2-SAT (second scale), V3-SCAT (third scale), V4-CAT (fourth scale), and V5-CAT (fifth scale) to obtain visible light image features at multiple scales. Similarly, the infrared image is processed in sequence by T1-SAT (first scale), T2-SAT (second scale), T3-SCAT (third scale), T4-CAT (fourth scale), and T5-CAT (fifth scale) to obtain infrared image features at multiple scales.
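The stage ordering of the two encoding streams can be written down as a small sketch. The names `ATTENTION_BY_SCALE`, `stage_names`, and `run_stream` are hypothetical, and the convolution and attention callables are placeholders standing in for the trained modules.

```python
# Attention module used after each encoding convolutional block, by scale.
ATTENTION_BY_SCALE = {1: "SAT", 2: "SAT", 3: "SCAT", 4: "CAT", 5: "CAT"}


def stage_names(prefix):
    """Return stage labels for one stream ('V' = visible light, 'T' = infrared)."""
    return [f"{prefix}{scale}-{ATTENTION_BY_SCALE[scale]}"
            for scale in sorted(ATTENTION_BY_SCALE)]


def run_stream(image, conv_blocks, attention_modules):
    """Run one encoding stream and keep the enhanced features of every scale.

    conv_blocks and attention_modules are hypothetical callables, one per scale.
    """
    features, outputs = image, []
    for conv, attend in zip(conv_blocks, attention_modules):
        features = attend(conv(features))  # enhanced features feed the next scale
        outputs.append(features)
    return outputs


print(stage_names("V"))  # ['V1-SAT', 'V2-SAT', 'V3-SCAT', 'V4-CAT', 'V5-CAT']
print(stage_names("T"))  # ['T1-SAT', 'T2-SAT', 'T3-SCAT', 'T4-CAT', 'T5-CAT']
```

Keeping the per-scale outputs in a list mirrors the text: each scale's enhanced features are needed later, both as the input of the next scale and for the skip connections.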
[0083] Furthermore, the visible light image features at each scale are fused by addition with the infrared image features at the same scale, yielding fused features at multiple scales. To compensate for the spatial information lost in the decoding network, the fused features at each scale are passed to the decoding convolutional block of the corresponding scale through skip connections and channel merging. A skip connection is a deep learning term for a connection in which features at one scale bypass certain intermediate modules and connect directly to features at another scale. The connection method here is concat, i.e., channel merging (feature concatenation along the channel dimension). For example, if feature A is 256*256*3 (width*height*number of channels) and feature B is 256*256*5, concatenating A with B gives feature C of size 256*256*8, where 8 = 3 + 5 is the number of channels after concatenation. Specifically, as shown in Figure 2B, the features of the visible light encoding stream at V2-SAT are fused by addition with the features of the infrared encoding stream at T2-SAT and then connected to the D4 decoding module by channel concatenation. The remaining three skip connections work in the same way.
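The two operations in this paragraph, additive fusion and channel merging, can be sketched on nested lists. The channel-first layout `feature[c][y][x]`, the helper names, and the reduced 4x4 spatial size (standing in for 256*256) are assumptions made for the illustration.

```python
def shape(feature):
    """Return (channels, height, width) of a channel-first nested-list feature map."""
    return len(feature), len(feature[0]), len(feature[0][0])


def add_fuse(vis, ir):
    """Element-wise addition of two feature maps with identical shapes."""
    if shape(vis) != shape(ir):
        raise ValueError("additive fusion needs matching shapes")
    return [[[a + b for a, b in zip(row_v, row_i)]
             for row_v, row_i in zip(ch_v, ch_i)]
            for ch_v, ch_i in zip(vis, ir)]


def concat_channels(a, b):
    """Channel merging (concat): place the channels of b after those of a."""
    if shape(a)[1:] != shape(b)[1:]:
        raise ValueError("concat needs matching height and width")
    return a + b


def constant(channels, height, width, value):
    """Build a feature map filled with one value (for the demo only)."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


feature_a = constant(3, 4, 4, 1.0)   # stands in for the 256*256*3 feature A
feature_b = constant(5, 4, 4, 2.0)   # stands in for the 256*256*5 feature B
feature_c = concat_channels(feature_a, feature_b)
print(shape(feature_c))  # (8, 4, 4): 3 + 5 channels, spatial size unchanged
```

Addition keeps the channel count and therefore requires identical shapes, whereas concatenation only requires matching height and width and grows the channel dimension.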
[0084] II. Decoding Subnetwork
[0085] As shown in Figure 2B, the decoding subnetwork consists of decoding convolutional blocks $D_i$ at multiple scales, where i identifies the i-th decoding convolutional block. This application uses five cascaded, high-performance, and concise decoding convolutional blocks D1 to D5 as an example. The structure of $D_i$ is shown in Figure 2B: each $D_i$ includes a transposed convolutional layer and a convolutional layer. The transposed convolutional layer further extracts features, avoids excessive loss of spatial information during upsampling, and doubles the feature map resolution while keeping the number of channels unchanged. An ordinary convolution then refines the features further and avoids checkerboard artifacts in the prediction results; it maintains the feature resolution and halves the number of feature channels. Note that the number of output channels of the last decoding layer is set to the number of semantic categories. In this invention, the dataset used is the open-source dataset provided by MFNet, so the number of categories n is set to 9. In practice, if a scenario places more emphasis on the two target categories "people" and "vehicles", the number of target categories can be set to 3 (the two categories plus a background category). The number of categories here refers to the categories that need to be highlighted and can be modified according to whether the scene contains them or according to subjective preference. For example, the 9 categories here are the 9 common categories in city streets: background, people, cars, bicycles, parking signs, railings, lane lines, cliffs on both sides of the road, and roadblocks. The visible light image features and infrared image features at the end of the encoding process are fused by addition to obtain the fused encoded features, which are fed into the first-scale decoding convolutional block D1. The fused features are decoded through D1-D2-D3-D4-D5 and, after activation by the sigmoid function, the predicted image $I_{pred}$ is output.
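The shape bookkeeping of the decoding blocks can be traced with a short sketch. The starting shape (512 channels on a 15x20 grid, i.e. a 480x640 input reduced by a factor of 32) is an assumed value, and the extra channels contributed by skip connections are ignored for simplicity.

```python
def decoder_block_shape(channels, height, width):
    """Shape after one decoding block.

    Transposed convolution: resolution doubled, channels unchanged.
    Ordinary convolution: resolution unchanged, channels halved.
    """
    return channels // 2, height * 2, width * 2


def trace_decoder(channels, height, width, num_blocks=5, num_classes=9):
    """List the output shapes of D1..D5; the last block outputs num_classes channels."""
    shapes = []
    for index in range(num_blocks):
        channels, height, width = decoder_block_shape(channels, height, width)
        if index == num_blocks - 1:
            channels = num_classes  # one output channel per semantic category
        shapes.append((channels, height, width))
    return shapes


# 512 input channels and a 15x20 grid are assumed values, not taken from the source.
for name, block_shape in zip(["D1", "D2", "D3", "D4", "D5"], trace_decoder(512, 15, 20)):
    print(name, block_shape)
```

Under these assumptions the trace ends at (9, 480, 640), which matches the 9×H×W feature map that is passed to the sigmoid activation.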
[0086] Furthermore, each scale's decoding convolutional block reconstructs features based on the received fused features and the reconstructed features passed from the previous scale's decoding convolutional block, obtaining the features to be constrained. A cross-entropy loss function is then used to constrain these features, yielding the target reconstructed features. These target reconstructed features are then passed to the next scale's decoding convolutional block, until the final scale's decoding convolutional block outputs the feature map. The specific loss function is shown in Formula 2 below:
[0087] Formula 2:
[0088] Where α, β, γ, and δ are the weight coefficients of the loss function, $L_{CE}$ represents the cross-entropy loss, $L_{Dice}$ represents the Dice loss, and $O_k$ represents the segmentation result predicted from the features at the k-th layer in the feature reconstruction process (in the decoding network, the value of k gradually increases from right to left). $G_k$ is the result of downsampling the ground-truth label by a factor of k.
[0089] After the decoding network, the 9×H×W feature map is activated with the sigmoid function to convert the predicted values into probability values, and the predicted class of each pixel is then obtained with argmax, resulting in the multi-target segmentation image. The multi-target segmentation image $I_{pred}$ can be expressed as Formula 3 below:
[0090] Formula 3: $I_{pred} = \operatorname{argmax}(\operatorname{sigmoid}(O_5))$.
[0091] A multi-target segmentation image contains at least one target subset, which indicates the pixel region corresponding to at least one salient target class. For example, if the number of classes is set to 9 (people, vehicles, obstacles, etc.) but an image contains only one vehicle and no people, then the prediction result $I_{pred}$ has only one salient target class, namely "car", so it contains only one target subset.
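Formula 3 and the notion of a target subset can be illustrated with a minimal sketch. The layout `scores[c][y][x]` and the function names are assumptions; the tiny 1x2 example with three classes is made up for the demonstration.

```python
import math


def sigmoid(value):
    """Logistic function mapping a real score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def predict_classes(scores):
    """Formula 3: per-pixel argmax over sigmoid-activated class scores.

    scores[c][y][x] is the decoder output for class c at pixel (y, x).
    """
    num_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    pred = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            probs = [sigmoid(scores[c][y][x]) for c in range(num_classes)]
            pred[y][x] = max(range(num_classes), key=probs.__getitem__)
    return pred


def target_subsets(pred):
    """Group pixel coordinates by predicted class; absent classes yield no subset."""
    subsets = {}
    for y, row in enumerate(pred):
        for x, label in enumerate(row):
            subsets.setdefault(label, set()).add((y, x))
    return subsets


scores = [[[2.0, -1.0]], [[0.5, 3.0]], [[-2.0, 0.0]]]  # 3 classes, 1x2 image
pred = predict_classes(scores)
print(pred)                          # [[0, 1]]
print(sorted(target_subsets(pred)))  # [0, 1]: class 2 does not occur
```

Because the sigmoid is monotonic, the argmax over probabilities selects the same class as the argmax over raw scores; the activation matters only when the probability values themselves are needed.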
[0092] 203. Based on a multispectral image fusion network, extract the first depth feature corresponding to the visible light image of the target and the second depth feature corresponding to the infrared image of the target. By fusing the first depth feature, the second depth feature and the multi-target segmentation image, generate a target fused image.
[0093] In the embodiments of this application, as shown in Figure 2D, the multispectral image fusion network includes an encoding subnetwork, a fusion layer subnetwork, and a decoding subnetwork. The encoding subnetwork extracts depth features from the target visible light image and the target infrared image. The fusion layer subnetwork fuses the extracted depth features with features from the multi-target segmentation image. The decoding subnetwork reconstructs features from the fused features to generate the target fused image.
[0094] I. Encoding Subnetwork
[0095] The encoding subnetwork also has two identical branches with shared weights. Each branch consists of a single convolutional network C1 and three densely connected feature extraction networks DC1 to DC3. The target visible light image and the target infrared image are input into the encoding subnetwork of the multispectral image fusion network. The two branches of the encoding subnetwork extract features from the target visible light image and the target infrared image, respectively, to obtain the first depth feature corresponding to the target visible light image and the second depth feature corresponding to the target infrared image.
[0096] II. Fusion Subnetwork
[0097] The fusion layer of the multispectral image fusion network introduces the multi-target segmentation image, on the grounds that human observation is biased towards specific target categories. For example, an observer always focuses on pedestrians first, then on moving vehicles on the road, and only then on the surrounding environment, such as advertisements on nearby high-rise buildings. This is because, during observation, the observer subconsciously treats different categories of targets differently, both in attention and in saliency. This idea transfers readily to image fusion. Most existing fusion approaches ignore the role of target categories: they treat all spatial pixels as a single equivalent "class" during feature fusion and determine the saliency of pixels within this "class" through various modules or mechanisms. Lacking the concept of target categories, such fusion methods yield features that are hard to interpret in many scenarios. However, if the category of a feature point, such as person or vehicle, is known in advance, fusion can move beyond a single feature domain and generate better, more interpretable weight information based on category priors from a "God's-eye view." Therefore, this invention proposes a fusion method based on multi-target classification, characterized by adaptive fusion strategies for different categories, and proposes a multi-target enhanced feature fusion module, MTFM.
[0098] It should be noted that, since the multi-target segmentation task has already shifted the network's attention from global features to the features of the target categories of interest, the multi-target enhanced feature fusion module in this invention does not require a spatial attention mechanism. The multi-target enhanced feature fusion module can be described as follows:
[0099] The first depth feature, the second depth feature, and the multi-target segmentation image are passed to the fusion layer of the multispectral image fusion network, where the multi-target enhanced feature fusion module fuses them to obtain the fused features. Specifically, the module fuses the first depth feature, the second depth feature, and the multi-target segmentation image according to the salient target class to obtain background features, secondary salient target features, and primary salient target features. Finally, the background features, secondary salient target features, and primary salient target features are added together to obtain the fused features.
[0100] Specifically, the MTFM module is used for class-wise fusion. The first depth feature output by the encoding network is denoted $\Phi_{VIS}$ and the second depth feature is denoted $\Phi_{IR}$, where $\Phi_{VIS} \in \mathbb{R}^{C \times H \times W}$ and $\Phi_{IR} \in \mathbb{R}^{C \times H \times W}$. The multi-target segmentation image $I_{pred} \in \mathbb{R}^{H \times W}$ can contain any subset of the targets $\{T0, T1, T2, \ldots, T_n\}$, where n is the number of salient target classes and $T_i$ is the pixel region in $I_{pred}$ corresponding to the i-th segmented salient target class. The relationship between $I_{pred}$ and $T_i$ is shown in Formula 4 below:
[0101] Formula 4:
[0102] Where n is the number of salient target classes contained in $I_{pred}$, and n ≤ 9.
[0103] Furthermore, each $T_i$ is fused with a class-specific strategy.
[0104] Specifically, when the salient target class is the background class T0, the pixel region of the background class is determined based on the target subset corresponding to the multi-target segmentation image. The background binary mask corresponding to the pixel region and the first depth feature are then used for feature fusion to obtain the background features. For the background class T0, this application considers that the fused image should conform as closely as possible to human visual perception; therefore, the overall modality should be as similar as possible to the visible light image. The T0 fusion strategy is designed as shown in Formula 5 below:
[0105] Formula 5: $F0 = M0 \times \Phi_{VIS}$
[0106] Where $F0 \in \mathbb{R}^{C \times H \times W}$ represents the fused background features, and M0 represents the binary mask of $I_{pred}$ corresponding to the pixel region of the background T0. More fully, M0 can be expressed as Formula 6 below:
[0107] Formula 6:
[0108] Besides background targets, low-heat targets such as lane lines and railings among the remaining eight target categories are referred to as secondary salient targets. These targets can be captured well in visible light images but have very little information in infrared images. Therefore, for these segmented targets, this invention still uses the features of the visible light image as the fusion features, without adding feature information from the corresponding region of the infrared image.
[0109] Specifically, when the salient target class is a secondary salient target class, the pixel regions of the secondary salient target class are determined based on the target subset. Feature fusion is then performed using the secondary salient target binary mask corresponding to the pixel region and the first depth feature to obtain the secondary salient target features. The fusion strategy for the secondary salient target class $T_k$ is designed as shown in Formula 7 below:
[0110] Formula 7:
[0111] Where $F1 \in \mathbb{R}^{C \times H \times W}$ represents the fused secondary salient target features, k represents the category corresponding to a secondary salient target, and $M_k$ is the binary mask of $I_{pred}$ corresponding to the pixel region of $T_k$. The fusion method for secondary salient targets proposed in this application preserves the target texture information of the visible light image to the greatest extent and reduces information loss in the fused image.
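The background and secondary salient target strategies can be sketched as masked copies of the visible light feature. The class ids, the helper names, and the use of a single union mask for all secondary classes are assumptions; the text only states that each mask is binary and is combined with the first depth feature.

```python
def binary_mask(pred, class_ids):
    """Return 1 where the predicted class is in class_ids, else 0."""
    wanted = set(class_ids)
    return [[1 if label in wanted else 0 for label in row] for row in pred]


def apply_mask(mask, feature):
    """Multiply every channel of feature[c][y][x] by the H x W mask."""
    return [[[m * v for m, v in zip(mask_row, feat_row)]
             for mask_row, feat_row in zip(mask, channel)]
            for channel in feature]


BACKGROUND = 0                 # T0
SECONDARY_CLASSES = (5, 6)     # hypothetical ids, e.g. railings and lane lines

pred = [[0, 5],
        [6, 1]]                # tiny 2x2 segmentation map; class 1 is a primary target
phi_vis = [[[1.0, 2.0],
            [3.0, 4.0]]]       # one-channel visible light feature

f0 = apply_mask(binary_mask(pred, [BACKGROUND]), phi_vis)       # Formula 5
f1 = apply_mask(binary_mask(pred, SECONDARY_CLASSES), phi_vis)  # assumed form of Formula 7
print(f0)  # [[[1.0, 0.0], [0.0, 0.0]]]
print(f1)  # [[[0.0, 2.0], [3.0, 0.0]]]
```

Since the class regions do not overlap, masking with the union of the secondary classes gives the same result as summing the per-class masked features.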
[0112] The remaining targets, such as people and vehicles, are also important targets of this invention and are referred to as primary salient targets. These targets are well represented in both visible light and infrared images. Therefore, for these segmented salient targets, this invention employs a weighted fusion strategy based on channel attention enhancement.
[0113] Specifically, when the salient target class is the primary salient target class, the pixel regions of the primary salient target class are determined based on the target subset. The first pixel region of the primary salient target class is determined using the primary salient target binary mask and the first depth feature corresponding to the pixel region, and the second pixel region of the primary salient target class is determined using the primary salient target binary mask and the second depth feature. Feature fusion is performed based on the first and second pixel regions to obtain the primary salient target features. The fusion strategy for the primary salient target class $T_s$ is designed as shown in Formula 8 below:
[0114] Formula 8:
[0115] Where $\Phi'_{VIS} \in \mathbb{R}^{C \times H \times W}$ and $\Phi'_{IR} \in \mathbb{R}^{C \times H \times W}$ are the pixel regions of the primary salient targets in the visible light and infrared features, respectively, s represents the category corresponding to a primary salient target, and $M_s$ is the binary mask of $I_{pred}$ corresponding to the pixel region of $T_s$. This application then applies CAT module feature enhancement to the extracted $\Phi'_{VIS}$ and $\Phi'_{IR}$, specifically based on Formula 9 below:
[0116] Formula 9:
[0117] The CAT module is specifically shown in Formula 10 below:
[0118] Formula 10:
[0119] Here, Avgpool2d represents the global average pooling operation for a two-dimensional image, ReLU is used as the activation function, and Linear1 and Linear2 represent two fully connected layers. Linear1 has 12 neurons, and the number of neurons in Linear2 equals the number of channels of $\Phi'_{VIS}$. ⊙ represents the Hadamard product, i.e., element-wise multiplication of the corresponding feature matrices. Further, weighted fusion is performed on the feature-enhanced $\Phi''_{VIS}$ and $\Phi''_{IR}$, as shown in Formula 11 below:
[0120] Formula 11:
[0121] The fused primary salient target features are $F2 = W(\Phi''_{VIS}) \times \Phi''_{VIS} + W(\Phi''_{IR}) \times \Phi''_{IR}$.
[0122] The final fused feature $F_{out}$ is the sum of the background feature F0, the secondary salient target feature F1, and the primary salient target feature F2, i.e., $F_{out} = F0 + F1 + F2$.
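The primary salient target branch and the final summation might be sketched as follows. Several parts are assumptions rather than details stated in this application: the sigmoid gate at the end of the channel attention, the omission of biases, and the treatment of the fusion weights as two given scalars `w_vis` and `w_ir`, because the definition of W is not recoverable from the text.

```python
import math


def channel_attention(feature, w1, w2):
    """Channel attention sketch: pool -> Linear1 -> ReLU -> Linear2 -> gate.

    feature[c][y][x]; w1 is hidden x C, w2 is C x hidden (biases omitted).
    The sigmoid gate is an assumption; the source only names the layers.
    """
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature]
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Hadamard product: every element of channel c is scaled by gate c.
    return [[[g * v for v in row] for row in ch] for g, ch in zip(gates, feature)]


def weighted_sum(w_vis, feat_vis, w_ir, feat_ir):
    """F2 = w_vis * feat_vis + w_ir * feat_ir, element by element."""
    return [[[w_vis * a + w_ir * b for a, b in zip(ra, rb)]
             for ra, rb in zip(ca, cb)]
            for ca, cb in zip(feat_vis, feat_ir)]


def add_features(*features):
    """F_out = F0 + F1 + F2 for feature maps of identical shape."""
    first = features[0]
    return [[[sum(f[c][y][x] for f in features)
              for x in range(len(first[0][0]))]
             for y in range(len(first[0]))]
            for c in range(len(first))]


# With all-zero weights every gate is sigmoid(0) = 0.5, so the feature is halved.
enhanced = channel_attention([[[2.0, 4.0]]], [[0.0]], [[0.0]])
print(enhanced)                                         # [[[1.0, 2.0]]]
print(weighted_sum(0.5, enhanced, 0.5, [[[3.0, 6.0]]]))  # [[[2.0, 4.0]]]
```

In a trained network the two weight matrices would be learned, and the masked features $\Phi'_{VIS}$ and $\Phi'_{IR}$ would be the inputs to `channel_attention`.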
[0123] III. Decoding Subnetwork
[0124] The decoding subnetwork of the multispectral image fusion network reconstructs the fused features to obtain the target fused image. Specifically, the fused features $F_{out}$ output by the MTFM module are fed into the decoding network for feature reconstruction to obtain the final fusion result $I_{Fuse}$. An example of the fusion result is shown in Figure 2E, which compares four sets of results; the first to fourth rows show the visible light images, the infrared images, the results of the fusion network of invention patent CN113033630A, and the fusion results of this invention, respectively.
[0125] Furthermore, regarding the selection of datasets: for the multi-target segmentation network, this invention uses the MFdataset dataset for urban street scene segmentation. This dataset contains infrared and visible light image pairs covering nine common categories, including people and vehicles on urban roads, and also provides manually segmented ground truth. The image pairs in the dataset exhibit minor misalignment. For the multispectral image fusion network, this invention uses the COCO dataset to pre-train the encoding, fusion, and decoding networks. The fusion layer uses ground truth labels instead of predicted segmentation images when learning the weights of the fully connected layers in the CAT module. Regarding the selection of loss functions: for the multi-target segmentation network, this invention uses a weighted sum of cross-entropy loss and Dice coefficient loss, expressed as $L = L_{CE} + \alpha L_{Dice}$; through experiments, α was determined to be 1.5. For the fusion network, the total loss function consists of structural similarity and mean squared error, expressed as $L = L_{MSE} + \beta L_{SSIM}$, where the structural similarity loss can be further expressed as $L_{SSIM} = 1 - SSIM(I_{Fuse}, I_{VIS})$. The initial learning rate for the segmentation network was set to 0.03, optimized using stochastic gradient descent (SGD) with a momentum of 0.09 and a weight decay of 0.0005. The initial learning rate for the fusion network was set to 0.01, using Adam as the optimizer with the remaining parameters left at their defaults. The batch size for training both networks was set to 5, and the input image resolution was 480x640.
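The segmentation loss, cross-entropy plus α times Dice with α = 1.5, can be sketched on flattened pixels. The soft Dice definition averaged over classes and the small constant `eps` are common choices assumed here; this application does not spell out the exact Dice formulation.

```python
import math

ALPHA = 1.5  # weight of the Dice term reported in the text


def cross_entropy(probs, labels):
    """Mean negative log-probability of the true class.

    probs[i][c] is the predicted probability of class c at pixel i.
    """
    eps = 1e-12  # avoids log(0) for a zero probability
    return -sum(math.log(max(p[g], eps)) for p, g in zip(probs, labels)) / len(labels)


def dice_loss(probs, labels, num_classes):
    """1 - mean Dice coefficient over classes, using soft predictions."""
    eps = 1e-12  # keeps the ratio defined when a class is absent
    total = 0.0
    for c in range(num_classes):
        intersection = sum(p[c] for p, g in zip(probs, labels) if g == c)
        pred_sum = sum(p[c] for p in probs)
        true_sum = sum(1 for g in labels if g == c)
        total += (2.0 * intersection + eps) / (pred_sum + true_sum + eps)
    return 1.0 - total / num_classes


def segmentation_loss(probs, labels, num_classes, alpha=ALPHA):
    """L = L_CE + alpha * L_Dice."""
    return cross_entropy(probs, labels) + alpha * dice_loss(probs, labels, num_classes)


perfect = [[1.0, 0.0], [0.0, 1.0]]
unsure = [[0.5, 0.5], [0.5, 0.5]]
print(segmentation_loss(perfect, [0, 1], 2))  # close to 0
print(segmentation_loss(unsure, [0, 1], 2))   # about 1.443 = ln 2 + 1.5 * 0.5
```

The cross-entropy term penalises each pixel independently, while the Dice term measures region overlap per class, which helps when some classes cover few pixels.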
[0126] The method provided in this application first acquires a visible light image and an infrared image, and performs image registration processing on them to obtain a registered target visible light image and target infrared image. Then, a multi-target segmentation network performs multi-target semantic segmentation on the target visible light image and the target infrared image to generate a multi-target segmentation image. The multi-target segmentation image contains at least one target subset, which indicates the pixel region corresponding to at least one salient target class. Finally, a first depth feature corresponding to the target visible light image and a second depth feature corresponding to the target infrared image are extracted by a multispectral image fusion network, and a target fused image is generated by fusing the first depth feature, the second depth feature, and the multi-target segmentation image. The multi-target segmentation network extracts and fuses features from the target visible light image and the target infrared image at multiple scales to generate a high-quality multi-target segmentation image with sharp edges. The multispectral image fusion network adaptively fuses the different target categories using the proposed multi-target enhanced feature fusion module, and reconstructs the final fused image from the fused features. Because different fusion strategies are applied in the feature domain to different targets according to their segmentation category, the generated target fused image retains the natural appearance of the visible light image, conforms to human visual perception, and effectively highlights the salient targets in the infrared image.
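As an illustration of how target subsets can indicate pixel regions, the following sketch converts a class-index segmentation map into one binary mask per salience group. The grouping of class ids into background, secondary, and primary groups (`SALIENCE_GROUPS`) is purely hypothetical; this application does not state which classes belong to which group.

```python
import numpy as np

# Hypothetical assignment of class ids to salience groups (assumption).
SALIENCE_GROUPS = {"background": [0], "secondary": [1], "primary": [2]}

def masks_from_segmentation(label_map, groups=SALIENCE_GROUPS):
    """Return one (H, W) binary mask per salience group.
    label_map: (H, W) array of predicted class ids."""
    return {
        name: np.isin(label_map, class_ids).astype(np.float32)
        for name, class_ids in groups.items()
    }
```

When every class id belongs to exactly one group, the masks partition the image, so each pixel is handled by exactly one fusion strategy.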
[0127] Furthermore, as a specific implementation of the method shown in Figure 1, this application provides a multispectral image fusion device based on multi-target segmentation. As shown in Figure 3, the device includes: an acquisition module 301, a segmentation module 302, and a fusion module 303.
[0128] The acquisition module 301 is used to acquire visible light images and infrared images, and to perform image registration processing on the visible light images and infrared images to obtain registered target visible light images and target infrared images.
[0129] The segmentation module 302 is used to perform multi-target semantic segmentation on the target visible light image and the target infrared image using a multi-target segmentation network to generate a multi-target segmentation image. The multi-target segmentation image contains at least one target subset, and the at least one target subset is used to indicate the pixel region corresponding to at least one salient target class.
[0130] The fusion module 303 is used to extract the first depth feature corresponding to the visible light image of the target and the second depth feature corresponding to the infrared image of the target based on the multispectral image fusion network, and generate a target fused image by fusing the first depth feature, the second depth feature and the multi-target segmentation image.
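The division of the device into three modules can be sketched as a simple composition of callables. The class name `FusionDevice`, its field names, and the `run` method are illustrative assumptions; the sketch only shows the order in which modules 301, 302, and 303 exchange data, not how each module is implemented.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple

@dataclass
class FusionDevice:
    # Module 301: returns the registered (visible, infrared) image pair.
    acquire: Callable[[], Tuple[Any, Any]]
    # Module 302: maps the registered pair to a multi-target segmentation image.
    segment: Callable[[Any, Any], Any]
    # Module 303: fuses both images under guidance of the segmentation image.
    fuse: Callable[[Any, Any, Any], Any]

    def run(self) -> Any:
        visible, infrared = self.acquire()
        segmentation = self.segment(visible, infrared)
        return self.fuse(visible, infrared, segmentation)
```

Keeping the modules as interchangeable callables reflects the later statement that modules may be combined or further divided without changing the overall flow.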
[0131] In a specific application scenario, the segmentation module 302 is used to input the target visible light image and the target infrared image into the multi-target segmentation network. The multi-target segmentation network includes an encoding sub-network and a decoding sub-network, wherein the encoding sub-network includes a visible light image encoding stream and an infrared image encoding stream. The visible light image encoding stream extracts features from the target visible light image to obtain visible light image features at multiple scales, and the infrared image encoding stream extracts features from the target infrared image to obtain infrared image features at multiple scales. The visible light image features at each scale are added to the infrared image features at the corresponding scale, matched by scale identifier, to obtain fused features at multiple scales. The fused features at each scale are then passed to the decoding convolutional block at the corresponding scale through skip connections and channel merging; the decoding convolutional blocks are located in the decoding sub-network, which includes decoding convolutional blocks at multiple scales. Each decoding convolutional block reconstructs features from the fused features it receives and the reconstructed features passed from the decoding convolutional block at the previous scale, obtaining features to be constrained. A cross-entropy loss function is then used to constrain these features, yielding target reconstructed features, which are passed to the decoding convolutional block at the next scale until the decoding convolutional block at the last scale outputs a feature map. A preset activation function is applied to the feature map, converting its predicted values into probability values and generating predicted categories, thus obtaining the multi-target segmentation image.
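The data flow described above can be sketched with three small NumPy helpers: additive fusion per scale, channel merging at a skip connection, and conversion of the final feature map into class predictions. Treating the preset activation function as a softmax over the class axis is an assumption, as are all function names; the convolutional blocks themselves are omitted.

```python
import numpy as np

def fuse_scales(vis_feats, ir_feats):
    """Add visible and infrared features scale by scale.
    Both arguments are lists of (C, H, W) arrays ordered by scale."""
    return [v + r for v, r in zip(vis_feats, ir_feats)]

def skip_merge(decoder_feat, fused_feat):
    """Channel merging: concatenate the skip-connected fused feature
    with the decoder feature along the channel axis."""
    return np.concatenate([decoder_feat, fused_feat], axis=0)

def predict_classes(logits):
    """Softmax over the class axis (assumed activation), then argmax.
    logits: (num_classes, H, W). Returns (probabilities, class map)."""
    shifted = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=0, keepdims=True)
    return probs, probs.argmax(axis=0)
```

Addition keeps the channel count unchanged when the two streams are fused, whereas channel merging at the skip connection increases it, so the decoding block at each scale must accept the concatenated width.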
[0132] In a specific application scenario, in the segmentation module 302, the visible light image encoding stream consists of encoding convolutional modules at multiple scales connected in series. The encoding convolutional module at each scale is followed by an attention enhancement module, which is one of a spatial attention enhancement module, a spatial and channel attention enhancement module, or a channel attention enhancement module. The encoding convolutional modules perform feature extraction, and the attention enhancement modules enhance features and suppress redundant features. The encoding convolutional module at the first scale extracts features from the target visible light image, and the attention enhancement module at the first scale performs weighted feature enhancement to obtain the visible light image features at the first scale. These features are then input into the encoding convolutional module and the attention enhancement module at the second scale to generate the visible light image features at the second scale, and so on until the attention enhancement module at the last scale outputs the visible light image features at the last scale. The visible light image features output by the attention enhancement modules at all scales together constitute the visible light image features at multiple scales.
[0133] In a specific application scenario, in the segmentation module 302, the infrared image encoding stream consists of encoding convolutional modules at multiple scales connected in series. The encoding convolutional module at each scale is followed by an attention enhancement module, which is one of a spatial attention enhancement module, a spatial and channel attention enhancement module, or a channel attention enhancement module. The encoding convolutional modules perform feature extraction, and the attention enhancement modules enhance features and suppress redundant features. The encoding convolutional module at the first scale extracts features from the target infrared image, and the attention enhancement module at the first scale performs weighted feature enhancement to obtain the infrared image features at the first scale. These features are then input into the encoding convolutional module and the attention enhancement module at the second scale to generate the infrared image features at the second scale, and so on until the attention enhancement module at the last scale outputs the infrared image features at the last scale. The infrared image features output by the attention enhancement modules at all scales together constitute the infrared image features at multiple scales.
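The weighted feature enhancement performed by a channel attention enhancement module can be illustrated as follows. This application does not describe the internals of the attention modules, so the sketch substitutes a simplified squeeze-and-excitation style gate with a single linear layer; the function name and the `weight` and `bias` parameters are hypothetical stand-ins for learned values.

```python
import numpy as np

def channel_attention(feat, weight, bias):
    """Reweight channels of a (C, H, W) feature map.
    weight: (C, C) and bias: (C,) are hypothetical learned parameters."""
    squeezed = feat.mean(axis=(1, 2))             # global average pool, shape (C,)
    scores = weight @ squeezed + bias             # one score per channel
    gate = 1.0 / (1.0 + np.exp(-scores))          # sigmoid, values in (0, 1)
    return feat * gate[:, None, None]             # scale each channel
```

Channels whose gate is close to zero are suppressed, which is one way to read the statement that the attention enhancement modules suppress redundant features.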
[0134] In a specific application scenario, the fusion module 303 is used to input the target visible light image and the target infrared image into the encoding subnetwork of the multispectral image fusion network, extract features from the target visible light image and the target infrared image through the encoding subnetwork to obtain the first depth feature and the second depth feature; pass the first depth feature, the second depth feature and the multi-target segmented image to the fusion layer of the multispectral image fusion network, fuse the first depth feature, the second depth feature and the multi-target segmented image through the multi-target enhancement feature fusion module in the fusion layer to obtain the fused feature; and use the decoding subnetwork of the multispectral image fusion network to perform feature reconstruction on the fused feature to obtain the target image.
[0135] In a specific application scenario, the fusion module 303 is used to perform feature fusion on the first depth feature, the second depth feature and the multi-target segmentation image according to the salient target class to obtain background features, secondary salient target features and primary salient target features; and add the background features, the secondary salient target features and the primary salient target features to obtain the fused features.
[0136] In specific application scenarios, the fusion module 303 is used to: when the salient target class is a background class, determine the pixel region of the background class based on the target subset corresponding to the multi-target segmentation image, and perform feature fusion using the background binary mask corresponding to the pixel region and the first depth feature to obtain the background feature; when the salient target class is a secondary salient target class, determine the pixel region of the secondary salient target class based on the target subset, and perform feature fusion using the secondary salient target binary mask corresponding to the pixel region and the first depth feature to obtain the secondary salient target feature; when the salient target class is a primary salient target class, determine the pixel region of the primary salient target class based on the target subset, determine the first pixel region of the primary salient target class using the primary salient target binary mask corresponding to the pixel region and the first depth feature, and determine the second pixel region of the primary salient target class using the primary salient target binary mask and the second depth feature, and perform feature fusion based on the first pixel region and the second pixel region to obtain the primary salient target feature.
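The class-dependent fusion described above can be sketched with binary masks applied to the two depth features. Background and secondary salient regions take the visible light feature only, while the primary salient region combines both features. How the two primary regions are combined is not detailed in this passage, so the fixed scalar weight `w_ir` is an assumption standing in for whatever learned weighting the fusion layer uses; the function and key names are also illustrative.

```python
import numpy as np

def multi_target_fuse(f_vis, f_ir, masks, w_ir=0.5):
    """Fuse (C, H, W) features under (H, W) binary masks.
    masks: dict with keys 'background', 'secondary', 'primary'.
    w_ir: assumed scalar weight of the infrared feature in primary regions."""
    background = masks["background"] * f_vis      # visible feature only
    secondary = masks["secondary"] * f_vis        # visible feature only
    primary_vis = masks["primary"] * f_vis        # first pixel region
    primary_ir = masks["primary"] * f_ir          # second pixel region
    primary = (1.0 - w_ir) * primary_vis + w_ir * primary_ir
    # The three partial features are added to form the fused feature.
    return background + secondary + primary
```

Because the masks broadcast over the channel axis, the same spatial region selection applies to every feature channel.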
[0137] The apparatus provided in this application first acquires a visible light image and an infrared image, and performs image registration processing on them to obtain a registered target visible light image and target infrared image. Then, a multi-target segmentation network performs multi-target semantic segmentation on the target visible light image and the target infrared image to generate a multi-target segmentation image. The multi-target segmentation image contains at least one target subset, which indicates the pixel region corresponding to at least one salient target class. Finally, a first depth feature corresponding to the target visible light image and a second depth feature corresponding to the target infrared image are extracted by a multispectral image fusion network, and a target fused image is generated by fusing the first depth feature, the second depth feature, and the multi-target segmentation image. The multi-target segmentation network extracts and fuses features from the target visible light image and the target infrared image at multiple scales to generate a high-quality multi-target segmentation image with sharp edges. The multispectral image fusion network adaptively fuses the different target categories using the proposed multi-target enhanced feature fusion module, and reconstructs the final fused image from the fused features. Because different fusion strategies are applied in the feature domain to different targets according to their segmentation category, the generated target fused image retains the natural appearance of the visible light image, conforms to human visual perception, and effectively highlights the salient targets in the infrared image.
[0138] It should be noted that other corresponding descriptions of the functional units involved in the multispectral image fusion device based on multi-target segmentation provided in this embodiment can be found in the corresponding descriptions of Figure 1 and Figure 2, and will not be repeated here.
[0139] In an exemplary embodiment, referring to Figure 4, the invention also provides a device comprising a communication bus, a processor, a memory, and a communication interface, and may further include an input/output interface and a display device, wherein the various functional units can communicate with each other via the bus. The memory stores a computer program, and the processor executes the program stored in the memory to perform the multispectral image fusion method based on multi-target segmentation described in the above embodiments.
[0140] A computer-readable storage medium is also provided, having a computer program stored thereon which, when executed by a processor, implements the steps of the multispectral image fusion method based on multi-target segmentation.
[0141] Through the above description of the embodiments, those skilled in the art can clearly understand that this application can be implemented in hardware or by using software plus necessary general-purpose hardware platforms. Based on this understanding, the technical solution of this application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, external hard drive, etc.) and includes several instructions to cause a computer device (such as a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments of this application.
[0142] Those skilled in the art will understand that the accompanying drawings are merely schematic diagrams of a preferred embodiment, and the modules or processes shown in the drawings are not necessarily essential for implementing this application.
[0143] Those skilled in the art will understand that the modules in the apparatus of the implementation scenario can be distributed within the apparatus of the implementation scenario as described, or they can be located in one or more apparatuses different from this implementation scenario, with corresponding changes. The modules of the above-described implementation scenario can be combined into one module, or they can be further divided into multiple sub-modules.
[0144] The serial numbers in this application are for descriptive purposes only and do not represent the superiority or inferiority of the implementation scenario.
[0145] The above disclosures are only a few specific implementation scenarios of this application. However, this application is not limited to these. Any variations that can be conceived by those skilled in the art should fall within the protection scope of this application.
Claims
1. A multispectral image fusion method based on multi-target segmentation, characterized in that the method comprises: acquiring a visible light image and an infrared image, and performing image registration processing on the visible light image and the infrared image to obtain a registered target visible light image and a registered target infrared image; performing multi-target semantic segmentation on the target visible light image and the target infrared image using a multi-target segmentation network to generate a multi-target segmentation image, wherein the multi-target segmentation image contains at least one target subset, and the at least one target subset is used to indicate the pixel region corresponding to at least one salient target class; inputting the target visible light image and the target infrared image into the encoding subnetwork of a multispectral image fusion network, and extracting features from the target visible light image and the target infrared image through the encoding subnetwork to obtain a first depth feature and a second depth feature; passing the first depth feature, the second depth feature, and the multi-target segmentation image to the fusion layer of the multispectral image fusion network, and fusing the first depth feature, the second depth feature, and the multi-target segmentation image through a multi-target enhancement feature fusion module in the fusion layer to obtain fused features, including: when the salient target class is a background class, determining the pixel region of the background class based on the target subset corresponding to the multi-target segmentation image, and performing feature fusion using the background binary mask corresponding to the pixel region and the first depth feature to obtain a background feature; when the salient target class is a secondary salient target class, determining the pixel region of the secondary salient target class based on the target subset, and performing feature fusion using the secondary salient target binary mask corresponding to the pixel region and the first depth feature to obtain a secondary salient target feature; when the salient target class is a primary salient target class, determining the pixel region of the primary salient target class based on the target subset, determining a first pixel region of the primary salient target class using the primary salient target binary mask corresponding to the pixel region and the first depth feature, determining a second pixel region of the primary salient target class using the primary salient target binary mask and the second depth feature, and performing feature fusion based on the first pixel region and the second pixel region to obtain a primary salient target feature; and adding the background feature, the secondary salient target feature, and the primary salient target feature to obtain the fused features; and performing feature reconstruction on the fused features using the decoding subnetwork of the multispectral image fusion network to obtain the target image.
2. The method according to claim 1, characterized in that the step of performing multi-target semantic segmentation on the target visible light image and the target infrared image using a multi-target segmentation network to generate a multi-target segmentation image includes: inputting the target visible light image and the target infrared image into the multi-target segmentation network, which includes an encoding sub-network and a decoding sub-network, wherein the encoding sub-network includes a visible light image encoding stream and an infrared image encoding stream; extracting features from the target visible light image using the visible light image encoding stream to obtain visible light image features at multiple scales, and extracting features from the target infrared image using the infrared image encoding stream to obtain infrared image features at multiple scales; adding the visible light image features at each scale to the infrared image features at the corresponding scale, matched by scale identifier, to obtain fused features at multiple scales, and passing the fused features at each scale to the decoding convolutional block at the corresponding scale through skip connections and channel merging, wherein the decoding convolutional blocks are located in the decoding sub-network; wherein the decoding sub-network includes decoding convolutional blocks at multiple scales, each decoding convolutional block reconstructs features based on the received fused features and the reconstructed features passed from the decoding convolutional block at the previous scale to obtain features to be constrained, a cross-entropy loss function is used to constrain the features to be constrained to obtain target reconstructed features, and the target reconstructed features are passed to the decoding convolutional block at the next scale until the decoding convolutional block at the last scale outputs a feature map; and activating the feature map using a preset activation function, converting the predicted values in the feature map into probability values, and generating predicted categories to obtain the multi-target segmentation image.
3. The method according to claim 2, characterized in that extracting features from the target visible light image using the visible light image encoding stream to obtain visible light image features at multiple scales includes: the visible light image encoding stream is composed of encoding convolutional modules at multiple scales, the encoding convolutional module at each scale is connected to an attention enhancement module, which is one of a spatial attention enhancement module, a spatial and channel attention enhancement module, or a channel attention enhancement module, the encoding convolutional modules are used for feature extraction, and the attention enhancement modules are used for feature enhancement and suppression of redundant features; extracting features from the target visible light image using the encoding convolutional module at the first scale, and performing weighted feature enhancement using the attention enhancement module at the first scale to obtain visible light image features at the first scale; inputting the visible light image features at the first scale into the encoding convolutional module at the second scale and the attention enhancement module at the second scale to generate visible light image features at the second scale, until the attention enhancement module at the last scale outputs visible light image features at the last scale; and taking the visible light image features output by the attention enhancement modules at all scales as the visible light image features at the multiple scales.
4. The method according to claim 2, characterized in that extracting features from the target infrared image using the infrared image encoding stream to obtain infrared image features at multiple scales includes: the infrared image encoding stream is composed of encoding convolutional modules at multiple scales, the encoding convolutional module at each scale is connected to an attention enhancement module, which is one of a spatial attention enhancement module, a spatial and channel attention enhancement module, or a channel attention enhancement module, the encoding convolutional modules are used for feature extraction, and the attention enhancement modules are used for feature enhancement and suppression of redundant features; extracting features from the target infrared image using the encoding convolutional module at the first scale, and performing weighted feature enhancement using the attention enhancement module at the first scale to obtain infrared image features at the first scale; inputting the infrared image features at the first scale into the encoding convolutional module at the second scale and the attention enhancement module at the second scale to generate infrared image features at the second scale, until the attention enhancement module at the last scale outputs infrared image features at the last scale; and taking the infrared image features output by the attention enhancement modules at all scales as the infrared image features at the multiple scales.
5. A multispectral image fusion device based on multi-target segmentation, characterized in that the device comprises: an acquisition module, used to acquire a visible light image and an infrared image, and to perform image registration processing on the visible light image and the infrared image to obtain a registered target visible light image and a registered target infrared image; a segmentation module, used to perform multi-target semantic segmentation on the target visible light image and the target infrared image using a multi-target segmentation network to generate a multi-target segmentation image, wherein the multi-target segmentation image contains at least one target subset, and the at least one target subset is used to indicate the pixel region corresponding to at least one salient target class; and a fusion module, used to input the target visible light image and the target infrared image into the encoding subnetwork of a multispectral image fusion network, and extract features from the target visible light image and the target infrared image through the encoding subnetwork to obtain a first depth feature and a second depth feature; pass the first depth feature, the second depth feature, and the multi-target segmentation image to the fusion layer of the multispectral image fusion network, and fuse the first depth feature, the second depth feature, and the multi-target segmentation image through a multi-target enhancement feature fusion module in the fusion layer to obtain fused features, including: when the salient target class is a background class, determining the pixel region of the background class based on the target subset corresponding to the multi-target segmentation image, and performing feature fusion using the background binary mask corresponding to the pixel region and the first depth feature to obtain a background feature; when the salient target class is a secondary salient target class, determining the pixel region of the secondary salient target class based on the target subset, and performing feature fusion using the secondary salient target binary mask corresponding to the pixel region and the first depth feature to obtain a secondary salient target feature; when the salient target class is a primary salient target class, determining the pixel region of the primary salient target class based on the target subset, determining a first pixel region of the primary salient target class using the primary salient target binary mask corresponding to the pixel region and the first depth feature, determining a second pixel region of the primary salient target class using the primary salient target binary mask and the second depth feature, and performing feature fusion based on the first pixel region and the second pixel region to obtain a primary salient target feature; and adding the background feature, the secondary salient target feature, and the primary salient target feature to obtain the fused features; and perform feature reconstruction on the fused features using the decoding subnetwork of the multispectral image fusion network to obtain the target image.
6. A device comprising a memory and a processor, the memory storing a computer program, characterized in that, When the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 4.
7. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 4.