An image processing method, a storage medium and an electronic device
By performing cross-modal semantic analysis and differentiated fusion on spatiotemporally aligned visible light and infrared thermal imaging images of the target scene, the method addresses the noise and artifact problems of existing technology and improves image quality and target recognition accuracy. It is suitable for monitoring systems in complex environments such as coal mines, tunnels and ports.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHENZHEN STREAMING VIDEO TECH
- Filing Date
- 2026-03-25
- Publication Date
- 2026-06-23
AI Technical Summary
Existing infrared and visible light image fusion technologies are prone to noise and artifacts in complex environments, cannot effectively suppress strong light halos, and target detection methods are prone to missing detections in irregularly shaped areas, resulting in reduced accuracy and reliability of monitoring systems.
Spatiotemporally aligned visible light and infrared thermal imaging images of the target scene are acquired, and cross-modal semantic analysis is performed on them to generate semantic understanding information. A differentiated image fusion strategy is then determined for each semantic category, the two images are fused using these strategies, and boundary smoothing is applied to generate a high-quality fused image.
It effectively suppresses noise and artifacts in complex environments, improves image detail and visual coherence, and significantly enhances the target recognition accuracy and image quality of industrial monitoring systems.
Figure CN122265053A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of image processing technology, and in particular to an image processing method, storage medium, and electronic device.
Background Technology
[0002] In industrial settings such as coal mines, tunnels, and ports, video surveillance is often affected by heavy dust, insufficient lighting, and strong light interference. Visible light cameras experience a significant performance degradation in low-light and smoky environments, while infrared thermal imaging, although capable of penetrating smoke and fog, lacks detail and color information. Infrared and visible light fusion technology has become a key means to improve surveillance effectiveness.
[0003] However, existing fusion methods fall mainly into two categories. The first is based on pixel-level weighting, which relies on low-level grayscale or gradient features, lacks semantic understanding, tends to mistakenly retain dust noise, and cannot suppress strong light halos. The second is based on target detection, which divides regions using rectangular boxes; it is poorly suited to irregular shapes such as smoke, is prone to splicing artifacts, and can miss detections, leaving key targets unenhanced.
[0004] Therefore, how to improve image fusion capabilities in complex environments has become a technical problem that urgently needs to be solved by those skilled in the art.
Summary of the Invention
[0005] In view of the above problems, the present invention provides an image processing method, storage medium, and electronic device that overcomes or at least partially solves the above problems, the technical solution of which is as follows:
[0006] An image processing method, comprising:
[0007] Obtain spatiotemporally aligned visible light and infrared thermal images of the target scene;
[0008] Cross-modal semantic analysis is performed on the visible light image and the infrared thermal imaging image to generate semantic understanding information, wherein the semantic understanding information includes the semantic category of each pixel in the visible light image and the infrared thermal imaging image;
[0009] Based on the semantic understanding information, a differentiated image fusion strategy is determined for pixel regions of different semantic categories in the visible light image and the infrared thermal imaging image;
[0010] The visible light image and the infrared thermal imaging image are fused using the differentiated image fusion strategy to generate a target fused image.
[0011] A computer-readable storage medium having a program stored thereon, which, when executed by a processor, implements the image processing method described above.
[0012] An electronic device includes at least one processor, at least one memory connected to the processor, and a bus; wherein the processor and the memory communicate with each other via the bus; the processor is used to call program instructions in the memory to execute the image processing method.
[0013] By employing the above technical solutions, this invention provides an image processing method, storage medium, and electronic device that obtains spatiotemporally aligned visible light images and infrared thermal imaging images of a target scene; performs cross-modal semantic analysis on the visible light and infrared thermal imaging images to generate semantic understanding information, wherein the semantic understanding information includes the semantic category of each pixel in the visible light and infrared thermal imaging images; determines differentiated image fusion strategies corresponding to pixel regions of different semantic categories in the visible light and infrared thermal imaging images based on the semantic understanding information; and uses the differentiated image fusion strategies to fuse the visible light and infrared thermal imaging images to generate a target fused image. This invention, by performing cross-modal pixel-level semantic analysis on spatiotemporally aligned visible light and infrared thermal imaging images of a target scene and adaptively matching differentiated fusion strategies according to different semantic categories, effectively suppresses noise and artifacts in complex environments, improves the detail representation and visual coherence of the fused image, and significantly improves the target recognition accuracy and image quality of industrial monitoring systems.
[0014] The above description is merely an overview of the technical solution of the present invention. In order to better understand the technical means of the present invention and to implement it in accordance with the contents of the specification, and in order to make the above and other objects, features and advantages of the present invention more apparent and understandable, specific embodiments of the present invention are described below.
Attached Figure Description
[0015] Various other advantages and benefits will become apparent to those skilled in the art upon reading the following detailed description of preferred embodiments. The accompanying drawings are for illustrative purposes only and are not intended to limit the invention. Furthermore, the same reference numerals denote the same parts throughout the drawings. In the drawings:
[0016] Figure 1 is a flowchart of one embodiment of the image processing method provided by this invention.
[0017] Figure 2 shows a specific implementation of step S110 in the image processing method provided by the present invention.
[0018] Figure 3 shows a specific implementation of step S120 in the image processing method provided by the present invention.
Detailed Implementation
[0019] Exemplary embodiments of the invention will now be described in more detail with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention may be implemented in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this invention will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
[0020] With the rapid development of industrial settings such as coal mines, tunnels, and ports, video surveillance systems have become a crucial means of ensuring production safety and equipment operation. However, these industrial environments typically present complex visual conditions, including heavy dust, insufficient lighting, and strong light interference, which severely impact the performance of traditional monitoring equipment. Visible light cameras, in particular, experience a significant drop in image quality under harsh conditions such as low light and dense smoke, sometimes even failing to acquire effective visual information, thus limiting the accuracy and reliability of monitoring.
[0021] Infrared thermal imaging cameras serve as an important supplementary tool, possessing strong smoke and fog penetration capabilities, enabling them to detect temperature distribution and thermal radiation characteristics in environments with limited visible light, thereby improving target identification capabilities. However, the inherent limitations of infrared images lie in their lack of rich texture details and color information, making it difficult to provide a complete scene understanding and failing to meet the comprehensive needs of industrial monitoring for target details and environmental conditions.
[0022] To address the aforementioned issues, the fusion of infrared and visible light images has gradually become a key technological approach to improve industrial monitoring. By fusing image information from two different wavelengths, the textural and color advantages of visible light can be comprehensively utilized with the environmental adaptability of infrared, enhancing the visual performance and target recognition capabilities of the image. However, existing fusion technologies still have significant limitations.
[0023] One mainstream fusion method is based on pixel-level weighting strategies, typically relying on low-level features such as grayscale values and gradients to calculate fusion weights. These methods lack semantic understanding of the targets in the image, easily misclassifying scattering noise such as dust as valid texture and preserving it, resulting in a high noise content in the final fused image. Furthermore, traditional weighted fusion often directly incorporates halo effects produced in bright light areas into the result, exacerbating visual interference and reducing the discernibility of key targets.
[0024] Another type of method relies on object detection technology, using rectangular bounding boxes to divide strategy regions for fusion. This method is effective in handling regularly shaped targets, but it struggles with irregularly shaped regions such as smoke and roads, easily producing stitching artifacts that affect the coherence and naturalness of the image. Furthermore, the possibility of missed detections in object detection may result in key targets not being effectively identified and enhanced, further reducing the reliability and security of the monitoring system.
[0025] Based on this, this embodiment of the invention provides an image processing method that acquires spatiotemporally aligned visible light images and infrared thermal imaging images of a target scene, performs cross-modal pixel-level semantic analysis, and generates understanding information containing the semantic category of each pixel. Based on this semantic information, a differentiated image fusion strategy corresponding to different semantic categories is determined, and the two types of images are fused using this strategy to generate a preliminary fused image. Subsequently, the preliminary fused image undergoes boundary smoothing processing to obtain a high-quality fused image, thereby effectively suppressing noise and artifacts in complex environments, improving image detail and visual coherence, and significantly enhancing the target recognition accuracy and image quality of industrial monitoring systems.
[0026] Figure 1 shows a flowchart of one embodiment of the image processing method provided by this invention. The method may include:
[0027] S100: Obtain spatiotemporally aligned visible light and infrared thermal images of the target scene.
[0028] The target scene refers to the actual application scene in the industrial monitoring environment where image fusion is required, such as complex and harsh environments like mining areas, tunnels, and ports, which may contain interference factors such as smoke, dust, and strong light.
[0029] Visible light images refer to images captured by visible light camera equipment, which contain texture and color information.
[0030] Infrared thermal imaging images refer to images acquired using infrared thermal imaging equipment, reflecting the distribution of thermal radiation in a scene.
[0031] Specifically, embodiments of the present invention can receive visible light video streams and infrared thermal imaging video streams from a target scene, use timestamps to achieve frame synchronization matching, and then use a pre-calibrated homography matrix to perform spatial geometric correction to eliminate parallax, thereby obtaining pixel-level aligned visible light images and infrared thermal imaging images.
[0032] As examples, embodiments of the present invention can pre-construct a spatiotemporal joint alignment module. This module receives raw video streams from visible light and infrared sensors and unifies the spatiotemporal reference: precise frame synchronization matching is performed based on hardware timestamps to ensure complete temporal correspondence between the two video streams. A pre-calibrated homography matrix is used to geometrically correct the visible light image to eliminate parallax caused by differences in sensor physical location and viewing angle. The corrected and aligned visible light image is converted to the YCbCr color space, separating the luminance (Y) and chromaticity (CbCr) components, while the infrared thermal imaging image is output as an independent intensity channel.
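The alignment steps described above can be illustrated with a minimal, dependency-free sketch. The function names, the 20 ms skew tolerance, and the use of the full-range BT.601 coefficients for the YCbCr conversion are assumptions made for illustration; the text does not specify a matching tolerance or a particular YCbCr variant, and a practical system would warp whole frames with an image library rather than single points.

```python
from bisect import bisect_left


def pair_frames(vis_ts, ir_ts, max_skew_ms=20):
    """Pair each visible-light timestamp with the nearest infrared timestamp.

    Both lists hold hardware timestamps in milliseconds, sorted ascending.
    Returns (vis_index, ir_index) pairs whose skew is within max_skew_ms
    (a hypothetical tolerance, not taken from the source text).
    """
    pairs = []
    for i, t in enumerate(vis_ts):
        j = bisect_left(ir_ts, t)
        # The nearest timestamp is one of the two neighbours of the insertion point.
        candidates = [k for k in (j - 1, j) if 0 <= k < len(ir_ts)]
        if not candidates:
            continue
        best = min(candidates, key=lambda k: abs(ir_ts[k] - t))
        if abs(ir_ts[best] - t) <= max_skew_ms:
            pairs.append((i, best))
    return pairs


def apply_homography(H, x, y):
    """Map pixel (x, y) through a pre-calibrated 3x3 homography H (nested lists)."""
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return xh / w, yh / w


def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to (Y, Cb, Cr) using full-range BT.601 (assumed)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr
```

In this sketch, a visible frame with no infrared frame inside the tolerance is simply dropped, which is one possible policy among several.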
[0033] S110. Perform cross-modal semantic analysis on visible light images and infrared thermal imaging images to generate semantic understanding information, wherein the semantic understanding information includes the semantic category of each pixel in the visible light images and infrared thermal imaging images.
[0034] Semantic understanding information refers to the results obtained by performing pixel-level analysis on spatiotemporally aligned visible light and infrared images using a cross-modal gated semantic segmentation network. Semantic understanding information may also include confidence information corresponding to semantic categories.
[0035] Semantic category refers to the specific target or background category to which a pixel in an image belongs, such as: people, vehicles, equipment, roads, background, strong light source, dust / smoke, ore pile, etc., reflecting the semantic attributes of different objects in the scene.
[0036] Specifically, embodiments of the present invention can extract modal features from visible light images and infrared thermal imaging images, dynamically generate control information based on the modal features, then use the control information to adaptively fuse the modal features, and finally decode the fused unified features to generate semantic understanding information.
[0037] Furthermore, embodiments of the present invention can utilize the infrared thermal radiation structure of infrared thermal imaging images to guide the dynamic adjustment of visible light feature weights, overcome visual blind spots in low-light or dense smoke environments, and output a label map containing the semantic category of each pixel and its confidence map, thereby achieving high-precision pixel-level semantic understanding.
[0038] As examples, embodiments of the present invention can pre-build a cross-modal gated semantic segmentation module for execution. This module employs a dual-encoder-single-decoder architecture, specifically designed to address segmentation challenges in harsh environments. The cross-modal gated semantic segmentation module includes a gate fusion unit (GFU), which dynamically evaluates and fuses visible light and infrared features, guided by structural information from infrared images. When visible light degrades due to dense smoke or darkness, the gate fusion unit automatically reduces its weight, relying more on infrared features for inference, thereby ensuring the robustness of semantic understanding. The cross-modal gated semantic segmentation module ultimately outputs two key information maps: a semantic label map (Class Map), where each pixel is precisely classified into eight categories of industrial scene objects, including people, vehicles, equipment, and strong light sources; and a confidence map, which quantifies the reliability of the classification result for each pixel, providing a basis for subsequent boundary smoothing.
[0039] S120. Based on semantic understanding information, determine the differentiated image fusion strategy corresponding to pixel regions of different semantic categories in visible light images and infrared thermal imaging images.
[0040] The differentiated image fusion strategy refers to dynamically selecting and adjusting the fusion method from a set of fusion operators predefined for the different semantic categories, so as to adapt to changes in target features and the environment.
[0041] Specifically, embodiments of the present invention can parse semantic understanding information to identify the semantic categories to which different pixel regions in an image belong. Based on the semantic category of a pixel, a corresponding image fusion operator is matched and assigned. The image fusion operators assigned to all pixels are aggregated to form a differentiated image fusion strategy.
[0042] Furthermore, embodiments of the present invention can, based on the semantic understanding information, use a predefined "semantic category-fusion operator" mapping mechanism to group the semantic categories into discrete fusion strategies that conform to physical laws, and can adjust the fusion operator parameters in real time according to the environment. This allows the differentiated image fusion strategies for pixel regions of different semantic categories to be determined accurately and keeps the fusion effect controllable and adaptable.
[0043] As examples, embodiments of the present invention can pre-construct a strategy mapping and differential fusion module, which internally includes a deterministic "semantic category-fusion operator" mapping table. This table categorizes eight semantic categories into five discrete fusion strategies that conform to physical laws (such as edge enhancement, infrared dominance, and specular suppression). During processing, the corresponding fusion strategy is assigned to each pixel region in the image by directly "looking up" the table based on the semantic label map. Simultaneously, the internal parameters (such as enhancement coefficients) of each fusion operator can be dynamically fine-tuned by incorporating real-time perceived environmental parameters (such as fog index and light intensity), enabling the strategy to adapt to environmental changes.
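The table lookup described above might be sketched as follows. The text names only three of the five strategies and does not state which category maps to which strategy, so every assignment in `CATEGORY_TO_STRATEGY` is a hypothetical placeholder, as are the identifier names and the `tuned_gain` rule for adapting an enhancement coefficient to a fog index.

```python
from enum import Enum


class Strategy(Enum):
    # Only the three strategies named in the text are listed here;
    # the remaining two are not enumerated in the source.
    EDGE_ENHANCEMENT = "edge_enhancement"
    INFRARED_DOMINANCE = "infrared_dominance"
    SPECULAR_SUPPRESSION = "specular_suppression"


# Hypothetical assignments for illustration only; the actual table is not
# given in the text, so categories without a known strategy are left out.
CATEGORY_TO_STRATEGY = {
    "person": Strategy.INFRARED_DOMINANCE,
    "vehicle": Strategy.EDGE_ENHANCEMENT,
    "equipment": Strategy.EDGE_ENHANCEMENT,
    "strong_light_source": Strategy.SPECULAR_SUPPRESSION,
    "dust_smoke": Strategy.INFRARED_DOMINANCE,
}


def assign_strategies(label_map):
    """Look up the strategy for every pixel of a 2-D semantic label map.

    Raises KeyError for a category that has no entry in the table.
    """
    return [[CATEGORY_TO_STRATEGY[label] for label in row] for row in label_map]


def tuned_gain(base_gain, fog_index):
    """Hypothetical rule: raise an enhancement coefficient as fog thickens.

    fog_index is clamped to [0, 1] before it scales the base gain.
    """
    fog = min(max(fog_index, 0.0), 1.0)
    return base_gain * (1.0 + fog)
```

Because the lookup is deterministic, the same label map always yields the same strategy map; only the operator parameters vary with the environment.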
[0044] S130. Using a differentiated image fusion strategy, the visible light image and the infrared thermal imaging image are fused to generate a target fused image.
[0045] The target fused image refers to the fusion result generated by weighting the corresponding pixels of the visible light and infrared images according to the differentiated fusion strategy.
[0046] Specifically, embodiments of the present invention can divide the image spaces of visible light images and infrared thermal imaging images into regions corresponding to different fusion operators based on a differentiated image fusion strategy; apply the specified fusion operator to each region to calculate the corresponding pixels of the visible light image and infrared thermal imaging image; and spatially combine the fusion calculation results of all regions to generate the target fused image.
[0047] Furthermore, embodiments of the present invention can generate corresponding binary masks based on the strategy regions divided in the differentiated image fusion strategy, process the corresponding regions in parallel, realize full-image parallel computation and mask weighted synthesis, and output the target fused image.
[0048] As examples, embodiments of the present invention can utilize a pre-built strategy mapping and differential fusion module to issue five independent fusion calculation kernel functions in parallel on a GPU (Graphics Processing Unit) based on a binary mask image. These kernel functions correspond to five predefined fusion operators. Each kernel function performs its specific pixel calculations (such as weighted blending, edge injection, and logarithmic compression) across the entire image. The calculation results are then weighted and synthesized using the corresponding mask. This embodiment of the present invention maximizes the parallel computing capabilities of the GPU through a mechanism of "parallel computation across the entire image and mask selection output," efficiently generating the target fused image.
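The "compute across the entire image, then select by mask" mechanism can be sketched on the CPU as below. This is not the GPU implementation described in the text: serial loops stand in for parallel kernels, only three operators are shown, and the operator formulas and the blending weights 0.5 and 0.2 are assumptions chosen for illustration. Pixel values are assumed to lie in [0, 255].

```python
import math


def weighted_blend(v, ir, alpha=0.5):
    """Blend a visible pixel v with an infrared pixel ir."""
    return alpha * v + (1.0 - alpha) * ir


def infrared_dominant(v, ir):
    """Assumed form of infrared dominance: a blend weighted toward infrared."""
    return weighted_blend(v, ir, alpha=0.2)


def log_compress(v, ir):
    """Logarithmically compress the visible pixel; ir is unused by this operator."""
    return 255.0 * math.log1p(v) / math.log1p(255.0)


OPERATORS = [weighted_blend, infrared_dominant, log_compress]


def fuse(vis, ir, strategy_ids):
    """Fuse two 2-D luminance images given a per-pixel index into OPERATORS."""
    h, w = len(vis), len(vis[0])
    n = len(OPERATORS)
    # Step 1: every operator processes the whole image.
    full = [[[op(vis[y][x], ir[y][x]) for x in range(w)] for y in range(h)]
            for op in OPERATORS]
    # Step 2: one binary mask per operator, derived from the strategy map.
    masks = [[[1 if strategy_ids[y][x] == k else 0 for x in range(w)]
              for y in range(h)] for k in range(n)]
    # Step 3: mask-weighted synthesis; exactly one mask is 1 at each pixel.
    return [[sum(masks[k][y][x] * full[k][y][x] for k in range(n))
             for x in range(w)] for y in range(h)]
```

Computing every operator everywhere does redundant work, but it avoids per-pixel branching, which is the property that makes the approach attractive on a GPU.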
[0049] This invention provides an image processing method comprising: obtaining spatiotemporally aligned visible light images and infrared thermal imaging images of a target scene; performing cross-modal semantic analysis on the visible light images and infrared thermal imaging images to generate semantic understanding information, wherein the semantic understanding information includes the semantic category of each pixel in the visible light images and infrared thermal imaging images; determining differentiated image fusion strategies corresponding to pixel regions of different semantic categories in the visible light images and infrared thermal imaging images based on the semantic understanding information; performing fusion processing on the visible light images and infrared thermal imaging images using the differentiated image fusion strategies to generate a preliminary fused image; and performing boundary smoothing processing on the preliminary fused image to generate a target fused image. This invention, by performing cross-modal pixel-level semantic analysis on spatiotemporally aligned visible light images and infrared thermal imaging images of a target scene, adaptively matching differentiated fusion strategies according to different semantic categories, and combining boundary smoothing processing, effectively suppresses noise and artifacts in complex environments, improves the detail representation and visual coherence of the fused image, and significantly improves the target recognition accuracy and image quality of industrial monitoring systems.
[0050] Optionally, based on the method shown in Figure 1, Figure 2 shows a specific implementation of step S110 in the image processing method provided by this invention. Step S110 may specifically include:
[0051] S200: Extract the first image features of the visible light image and the second image features of the infrared thermal imaging image, respectively.
[0052] The first image feature refers to the feature representation of multi-layered texture and color information extracted from the visible light image, which can be extracted by the visible light image encoder through a convolutional neural network.
[0053] The second image feature refers to the thermal radiation structure features extracted from the infrared thermal imaging image, which reflects the temperature distribution and thermal edge information of the object and can be extracted by the infrared image encoder.
[0054] Specifically, in this embodiment of the invention, a visible light image can be input into a visible light image encoder branch, where a multi-layer convolutional network extracts multi-scale feature representations containing texture and color information. Simultaneously, the corresponding infrared thermal imaging image is input into an infrared image encoder branch to extract feature maps reflecting temperature distribution and thermal structure. The two encoders are structurally symmetrical but have independent parameters, ensuring effective separation and expression of the two modal features, preparing for subsequent gating fusion.
[0055] As examples, embodiments of the present invention can receive visible light brightness images through the visible light encoder branch of the cross-modal gated semantic segmentation module, and extract first image features containing texture, color, and local structural information through a series of convolution and downsampling operations. Simultaneously, the infrared encoder branch of the cross-modal gated semantic segmentation module receives infrared intensity images and extracts second image features containing thermal radiation distribution, temperature structure, and thermal target contour information in the same manner.
[0056] S210. Using the first image features, dynamically generate gating information for evaluating the effectiveness of features in the visible light image.
[0057] The gating information refers to pixel-level weight coefficients dynamically calculated based on the first image features, used to evaluate the effectiveness and reliability of each pixel feature in the visible light image. The value ranges from 0 to 1; a value closer to 1 indicates high reliability of the visible light features in that region, while a value closer to 0 indicates severe obstruction or degradation of visible light, requiring reliance on infrared features. The gating information can be calculated using gated convolutional layers and the sigmoid activation function, dynamically adjusting the feature fusion ratio.
[0058] Specifically, in this embodiment of the invention, visible light features and infrared features are concatenated along the channel dimension and input into a gated convolutional layer. After convolution operations and mapping using the sigmoid activation function, a gate coefficient map of the same size as the feature map is generated. This gate coefficient map reflects the reliability of the visible light features at each pixel location, and its value is dynamically adjusted between 0 and 1. It automatically identifies occluded or degraded areas and guides the weight allocation for subsequent feature fusion.
[0059] As examples, embodiments of the present invention can utilize the stable structural information carried by infrared features in the gated fusion units at each level of the encoder to dynamically evaluate the effectiveness of visible light features in the current region: at each level, the gated fusion unit concatenates the first image feature and the second image feature of the current level along the channel dimension. The concatenated features are passed through a lightweight nonlinear convolutional layer and a sigmoid activation function to generate a gate coefficient map with the same spatial size as the feature map. Each value in this gate coefficient map, within the range of (0, 1), intuitively represents the reliability of the visible light feature at the corresponding pixel location: values close to 1 indicate that the visible light information is clear and effective (e.g., in normally lit areas), while values close to 0 indicate that the visible light information is severely degraded or unreliable (e.g., completely obscured by dense smoke or in completely dark areas).
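A minimal sketch of how such a gate coefficient map could be computed is given below. It assumes the "lightweight convolutional layer" is a single 1x1 convolution with one output channel, and it takes the weights as a plain list; in a trained network these weights would be learned, and the layer may well be deeper than shown here.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gate_map(vis_feat, ir_feat, weights, bias=0.0):
    """Compute an [H][W] gate coefficient map from two [C][H][W] feature maps.

    The features are concatenated along the channel dimension, reduced to a
    single channel by a 1x1 convolution (one weight per input channel), and
    squashed with a sigmoid so that every value lies in (0, 1).
    """
    stacked = vis_feat + ir_feat  # list concatenation along the channel axis
    if len(weights) != len(stacked):
        raise ValueError("need one weight per concatenated channel")
    h, w = len(stacked[0]), len(stacked[0][0])
    return [[sigmoid(bias + sum(wt * ch[y][x] for wt, ch in zip(weights, stacked)))
             for x in range(w)] for y in range(h)]
```

With all weights at zero the gate is 0.5 everywhere, meaning both modalities are trusted equally; training moves the gate toward 1 or 0 where visible light is reliable or degraded.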
[0060] S220. Using the gating information, adaptively fuse the first image features and the second image features by weighting to obtain fused features.
[0061] In this context, fusion features refer to the feature representation obtained by weighting and combining the first image features and the second image features according to a proportional ratio, controlled by gating information. In fusion features, reliable visible light information is preserved and enhanced, while the features of obstructed regions rely more on infrared information, thereby achieving effective cross-modal fusion.
[0062] Specifically, in this embodiment of the invention, the first image features and the second image features can be interactively combined to generate a gating coefficient map that reflects the credibility of the features. Then, based on the gating coefficient map, the first image features and the second image features are modulated respectively. The modulated first image features and the second image features are combined to obtain fused features.
[0063] Furthermore, in embodiments of the present invention, visible light features can be multiplied by a gating weight and infrared features can be multiplied by (1 - gating weight) according to the gating coefficient map to achieve pixel-level weighted fusion and generate a fused feature map. This fully utilizes texture and color information in the effective visible light region and preferentially relies on infrared thermal radiation features in the visible light degradation region, ensuring that the fused features are robust and rich in semantic information.
[0064] As examples, embodiments of the present invention can utilize a gated fusion unit to perform pixel-by-pixel weighted fusion of the first image features and the second image features using the generated gated map as adaptive weights. The fusion formula is: Fusion Feature = Gated Map × Visible Light Feature + (1 - Gated Map) × Infrared Feature. When the gate value is high, the fusion result is dominated by visible light features, making full use of their rich texture details; when the gate value is low, the fusion result relies more on infrared features, taking advantage of their immunity to adverse optical conditions to compensate for information loss. In this way, a more robust fusion feature is generated at each feature level, which can preserve visible light details under good conditions and maintain basic structural perception by relying on infrared information in extreme environments.
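The fusion formula above can be written out directly as a short sketch. It assumes that the visible light and infrared feature maps have the same number of channels, so that they can be summed element by element, and that one gate value per spatial location is shared across all channels; the function name is illustrative.

```python
def gated_fuse(gate, vis_feat, ir_feat):
    """Return gate * vis + (1 - gate) * ir, applied per pixel and per channel.

    gate is an [H][W] map with values in [0, 1]; vis_feat and ir_feat are
    [C][H][W] feature maps with identical shapes (an assumption of this sketch).
    """
    return [[[g * v + (1.0 - g) * i
              for g, v, i in zip(g_row, v_row, i_row)]
             for g_row, v_row, i_row in zip(gate, v_ch, i_ch)]
            for v_ch, i_ch in zip(vis_feat, ir_feat)]
```

A gate value of 1 passes the visible light feature through unchanged, a value of 0 passes the infrared feature, and intermediate values interpolate linearly between the two.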
[0065] S230. Based on the fusion features, decode and generate semantic understanding information.
[0066] Specifically, in this embodiment of the invention, the fused features can be input into a single decoder, and the spatial resolution can be gradually restored through multi-layer upsampling and convolution operations to output a semantic label map and a corresponding confidence map. The decoder utilizes the cross-modal information contained in the fused features to achieve accurate segmentation of multiple categories of targets such as personnel, equipment, and smoke in complex mining environments.
[0067] As examples, embodiments of the present invention can receive multi-scale fused features from various levels of the encoder through a single decoder in a cross-modal gated semantic segmentation module. It gradually restores the spatial resolution of the feature map through a series of upsampling and skip connection operations, integrating contextual and detail information from different levels. The decoder output maps the high-resolution features to a preset number of semantic categories through a convolutional layer, assigning a category label to each pixel in the image and generating a semantic label map. Simultaneously, the decoding process also outputs a confidence map, which quantifies the network's confidence in the classification result of each pixel. These two maps together constitute complete semantic understanding information, providing accurate input for subsequent differentiated fusion strategies.
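One common way to obtain a label map and a confidence map from per-pixel class scores is to apply a softmax and take the winning class together with its probability. The sketch below assumes that approach; the text does not state how the confidence map is computed, so this is an illustrative choice rather than the described method.

```python
import math


def decode_pixel(logits):
    """Return (class_index, confidence) for one pixel's class scores."""
    m = max(logits)
    # Subtracting the maximum keeps the exponentials numerically stable.
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


def decode_maps(logit_map):
    """Split an [H][W][num_classes] score map into (label_map, confidence_map)."""
    decoded = [[decode_pixel(px) for px in row] for row in logit_map]
    labels = [[c for c, _ in row] for row in decoded]
    confidence = [[p for _, p in row] for row in decoded]
    return labels, confidence
```

Low confidence values tend to occur near class boundaries, which is consistent with the text's use of the confidence map as input to boundary smoothing.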
[0068] This invention extracts features from visible light images and infrared thermal imaging images respectively and uses dynamic gating information to achieve adaptive weighted fusion. This can effectively evaluate and enhance the effectiveness of visible light features, improve the accuracy and robustness of cross-modal semantic analysis, thereby ensuring that the generated semantic understanding information is more accurate and reliable, and thus improve the image fusion effect in complex environments.
[0069] Optionally, based on one or more embodiments corresponding to Figure 2 above, in another optional embodiment provided by the present invention, step S220 may specifically include:
[0070] The first image features are concatenated with the second image features to obtain the concatenated features.
[0071] Here, the concatenated features refer to the joint feature representation obtained by concatenating the first image features and the second image features along the channel dimension.
[0072] Specifically, in this embodiment of the invention, the first image feature and the second image feature can be concatenated along the channel dimension to form a concatenated feature containing information from both modalities. This concatenation operation, while maintaining consistent spatial resolution, merges the two features into a higher-dimensional representation, facilitating subsequent convolutional layers to capture cross-modal correlation and complementary information.
[0073] As examples, embodiments of the present invention can receive two inputs from a dual-stream encoder via a gated fusion unit: a first image feature and a second image feature. These two sets of feature tensors are then concatenated along the channel dimension. For example, if the visible light feature map dimension is [C1, H, W] and the infrared feature map dimension is [C2, H, W], then the newly generated feature map after concatenation will have a dimension of [C1 + C2, H, W].
[0074] A nonlinear transformation is performed on the concatenated features to generate a gating coefficient map. Each value in the gating coefficient map corresponds to a spatial location and represents the credibility of the image features at that location.
[0075] The gating coefficient map refers to the spatial weight map obtained by applying a gated convolutional layer and a nonlinear activation function to the concatenated features. Each pixel value in the gating coefficient map is between 0 and 1, reflecting the credibility or effectiveness of the visible light features at the corresponding pixel location, and is used to dynamically adjust the weight ratio of visible light information in the fusion process.
[0076] Specifically, in this embodiment of the invention, the concatenated feature tensor can be input into a specially designed gated convolutional layer. This layer extracts local spatial context information by computing a weighted sum of its inputs and adding a bias. The output is then mapped to the range 0 to 1 by a Sigmoid activation function, forming a spatially distributed gating coefficient map.
[0077] As examples, embodiments of the present invention can input the concatenated features into a lightweight nonlinear transformation module. This module may consist of one or more convolutional layers (such as 1x1 convolutions) and a nonlinear activation function (such as ReLU), and its function is to learn a complex decision function to evaluate the validity or reliability of the first image features at each spatial location. The output of this module is passed through a sigmoid activation function, mapping the values to the (0,1) interval to generate a gating coefficient map. Each pixel value in this map represents the reliability of the visible light features at the corresponding spatial location: the closer the value is to 1, the clearer and more reliable the visible light information at that location; the closer the value is to 0, the more severely the visible light information is degraded by occlusion (such as smoke or dust) or insufficient lighting, and the less reliable it is.
[0078] Visible light features are obtained using the gating coefficient map and the first image features.
[0079] The visible light feature refers to the visible light image feature tensor after weighting by the gating coefficient map, representing the texture and color information that is considered valid and reliable under the current environmental conditions. It is obtained by multiplying the first image features element-wise by the gating coefficient map, which suppresses invalid information in regions where visible light is degraded and improves the robustness of the fusion.
[0080] Specifically, in this embodiment of the invention, the gating coefficient map and the first image features are multiplied element-wise at pixel positions to achieve spatial weighting of visible light features. The closer the gating coefficient is to 1, the more visible light features are retained; the closer the gating coefficient is to 0, the more visible light information at that position is suppressed, thereby reducing the negative impact of degraded regions on fusion and improving the robustness of fusion.
[0081] As examples, embodiments of the present invention can perform element-wise multiplication of the gating coefficient map with the first image features. This is equivalent to reweighting the original visible light feature map with a spatially adaptive weight matrix. In regions with high gating coefficients, the original features are largely preserved; in regions with low gating coefficients, the original features are significantly suppressed. The features modulated by this step can be regarded as visible light features that have undergone "validity filtering," which weakens the interference of noisy features in untrusted regions.
[0082] The infrared feature weight map is obtained by subtracting the gating coefficient map element-wise from a scalar.
[0083] The infrared feature weight map refers to the weight map obtained by subtracting the gating coefficient map element-wise from the scalar 1, representing the distribution of confidence in the infrared image features. The closer a value is to 1, the weaker or less usable the visible light features at the corresponding location are, and the more the fusion must rely on infrared features for compensation.
[0084] Specifically, in this embodiment of the invention, an infrared feature weight map of the same size as the gating map can be generated by subtracting each value in the gating coefficient map from a scalar of 1. The infrared feature weight map reflects the importance of infrared features at each pixel location; a larger value indicates that the visible light features at that location are unreliable and require more reliance on infrared features for compensation.
[0085] As examples, embodiments of the present invention may first define a scalar with a value of 1 (or a tensor of all 1s), and then subtract the gating coefficient map from it element-wise. The mathematical expression is: Infrared Feature Weight Map = 1 - Gating Coefficient Map. This operation is based on the reasonable assumption that the information reliability of visible light features and infrared features is spatially complementary. Therefore, in regions where the visible light features have low reliability (gating coefficients close to 0), the value of the infrared feature weight map is close to 1, meaning that more reliance should be placed on infrared information in these regions.
[0086] Infrared features are obtained by utilizing the infrared feature weight map and the second image features.
[0087] Here, the infrared features refer to the weighted feature tensor obtained by element-wise multiplication of the infrared feature weight map and the second image features. This tensor represents the infrared thermal radiation structure information that is emphasized in the fusion process. It plays a key role in areas where visible light is weak or blocked, ensuring the integrity and accuracy of the fused features.
[0088] Specifically, in this embodiment of the invention, the infrared feature weight map and the second image features can be multiplied element by element to obtain a weighted infrared feature tensor, thereby enhancing the expressive ability of the infrared image in the visible light damaged area and ensuring that the fused features remain intact and accurate in complex environments.
[0089] As examples, embodiments of the present invention can perform element-wise multiplication of the infrared feature weight map with the second image features. After this operation, the infrared features are also spatially recalibrated according to their complementary weights. In regions where infrared dependence is required (high infrared feature weights), the features are enhanced; in regions where visible light is reliable (low infrared feature weights), their feature contributions are appropriately reduced. The resulting modulated infrared features are those with "complementary emphasis."
[0090] By utilizing visible light and infrared features, fused features are obtained.
[0091] Specifically, in this embodiment of the invention, visible light features and weighted infrared features can be added element-wise to achieve adaptive fusion of cross-modal features. This results in fused features that contain both the rich texture and color information of visible light and the thermal radiation structure of infrared light, thereby improving semantic understanding and environmental adaptability.
[0092] As examples, embodiments of the present invention can add the visible light features and infrared features, which have been gated and modulated respectively, element-wise. That is: fused feature = visible light feature + infrared feature. Through this weighted summation, adaptive fusion of the two modalities is achieved at each spatial location: in areas with good environmental conditions, the fusion result is dominated by reliable visible light features; in areas with harsh environmental conditions, the fusion result smoothly transitions to being dominated by robust infrared features. The final output fused feature integrates the advantages of both modalities, has stronger robustness and information integrity, and is fed into the decoder to generate the final semantic understanding information.
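The gated fusion steps above (concatenation, gating coefficient map, complementary infrared weight, element-wise sum) can be sketched as follows. This is a minimal illustration under stated assumptions, not the network of the invention: the gating convolution is reduced to a single 1x1 convolution with one output channel, the two feature tensors are assumed to have the same number of channels so that they can be summed, and `gated_fusion`, `weight` and `bias` are hypothetical names. In the described method the gating weights are learned; here they are set by hand to show the two extremes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(f_vis, f_ir, weight, bias):
    """f_vis, f_ir: (C, H, W) feature tensors; weight: (2C,); bias: scalar."""
    concat = np.concatenate([f_vis, f_ir], axis=0)            # (2C, H, W)
    # A 1x1 convolution with one output channel is a weighted sum over channels.
    pre_activation = np.tensordot(weight, concat, axes=([0], [0])) + bias
    gate = sigmoid(pre_activation)                            # (H, W), values in (0, 1)
    vis_feat = gate * f_vis                                   # weighted visible light features
    ir_weight = 1.0 - gate                                    # infrared feature weight map
    ir_feat = ir_weight * f_ir                                # weighted infrared features
    return vis_feat + ir_feat, gate

rng = np.random.default_rng(0)
f_vis = rng.normal(size=(4, 3, 3))
f_ir = rng.normal(size=(4, 3, 3))
weight = np.zeros(8)
# A large positive bias mimics a clear scene, a large negative bias a smoke-filled one.
clear, gate_clear = gated_fusion(f_vis, f_ir, weight, bias=20.0)
smoky, gate_smoky = gated_fusion(f_vis, f_ir, weight, bias=-20.0)
```

When the gate is close to 1 the fused features are dominated by the visible light features, and when it is close to 0 they are dominated by the infrared features, which matches the behavior described in the text.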
[0093] To overcome the problem that traditional semantic segmentation networks lose most of the effective information in RGB images captured under heavy smoke or complete darkness, this invention uses a gated fusion network with a dual-encoder, single-decoder architecture to calculate a gating coefficient $G$. By learning a nonlinear decision surface, the network determines how effective the visible light features are at the current pixel:
[0094] $$G = \sigma\left(W_g * [F_{vis}, F_{ir}] + b_g\right);$$
[0095] The fused features $F_{fused}$ are a weighted combination based on the gating coefficient:
[0096] $$F_{fused} = G \odot F_{vis} + (1 - G) \odot F_{ir};$$
[0097] where "*" denotes the convolution operation; $\sigma$ is the Sigmoid activation function; $W_g$ is the weight of the gated convolutional layer; $F_{vis}$ is the visible light feature tensor; $F_{ir}$ is the infrared feature tensor; $[F_{vis}, F_{ir}]$ denotes the concatenation of the visible light feature tensor and the infrared feature tensor along the channel dimension; $b_g$ is the bias term; and $\odot$ denotes element-wise multiplication. The output $G$ is mapped to the (0,1) interval: a value close to 1 indicates that the visible light features are effective, while a value close to 0 indicates that the visible light is degraded and infrared features are required.
[0098] When dense smoke obscures visible light in a scene, the gated fusion network automatically learns that the visible light features of that area are unreliable, thus outputting a value close to 0. This value forces the decoder to primarily utilize infrared features for inference, ensuring segmentation robustness in extreme environments.
[0099] To address the characteristics of "extremely small personnel targets (occupying only tens of pixels)" and "fuzzy semantic boundaries" in underground mines, this embodiment of the invention employs a composite loss function optimization during the training phase of the gated fusion network: to address the challenge of extreme imbalance between positive and negative samples (background pixels far outnumber target pixels), an Online Hard Example Mining (OHEM) loss is introduced. This method does not calculate the loss for all pixels, but focuses on the hard-to-classify pixels that the model mispredicts, thereby improving the targeting and effectiveness of training. The specific formula is as follows:
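[0100] $$\mathcal{L}_{OHEM} = \frac{1}{|H|} \sum_{p \in H} \ell_{CE}(p);$$
[0101] $$H = \left\{ p \;\middle|\; \ell_{CE}(p) \text{ is among the largest fraction } k \text{ of all pixel losses} \right\};$$
[0102] where $\ell_{CE}(p)$ is the standard cross-entropy loss for pixel p; $k$ is the proportion of difficult pixels to retain (e.g., selecting the top 10% of difficult pixels with the highest prediction error); and $H$ is the set of difficult pixels selected.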
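The hard-pixel selection described above can be sketched in Python as follows. This is a simplified illustration, assuming softmax probabilities of shape (K, H, W) and an integer ground-truth label map; `ohem_loss` and `keep_ratio` are hypothetical names, and a real training pipeline would compute this inside an automatic differentiation framework rather than in NumPy.

```python
import numpy as np

def ohem_loss(probs, labels, keep_ratio=0.1):
    """probs: (K, H, W) softmax probabilities; labels: (H, W) integer ground truth."""
    h, w = labels.shape
    rows, cols = np.indices((h, w))
    p_true = probs[labels, rows, cols]                  # probability of the true class
    ce = -np.log(np.clip(p_true, 1e-12, 1.0)).ravel()   # per-pixel cross-entropy
    n_keep = max(1, int(np.ceil(keep_ratio * ce.size)))
    hardest = np.sort(ce)[-n_keep:]                     # pixels with the largest loss
    return float(hardest.mean())

# Ten pixels: nine are predicted well, one is badly mispredicted.
p_true = np.full((2, 5), 0.9)
p_true[0, 0] = 0.1
probs = np.stack([p_true, 1.0 - p_true])
labels = np.zeros((2, 5), dtype=int)
loss = ohem_loss(probs, labels, keep_ratio=0.1)
```

With a retention ratio of 10% and ten pixels, only the single worst pixel contributes, so the loss is driven entirely by the hard example rather than being diluted by the easy background pixels.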
[0103] In addition, to enhance the network's ability to segment at the boundaries of different semantic categories and prevent small targets from being "swallowed" by the background at ambiguous boundaries, a boundary-aware loss is introduced. The formula is:
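[0104] $$\mathcal{L}_{bd} = \frac{1}{N} \sum_{p} \left(1 + \lambda\, B(p)\right) \ell_{CE}(p);$$
[0105] $$B(p) = \begin{cases} 1, & \exists\, q \in \mathcal{N}(p) : y_q \neq y_p \\ 0, & \text{otherwise} \end{cases};$$
[0106] where $N$ is the total number of pixels in the image; $\ell_{CE}(p)$ is the standard cross-entropy loss for pixel p; $B(p)$ is the boundary indicator function, which is set to 1 when the neighborhood $\mathcal{N}(p)$ of pixel p contains pixels of a different category and to 0 otherwise; $y_p$ is the category label of pixel p; and $\lambda$ is the boundary weighting coefficient (dimensionless, ranging from 3.0 to 5.0, with a typical value of 4.0), which significantly increases the training weight of boundary pixels, thereby improving the accuracy and detail preservation of segmentation edges.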
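A boundary-weighted loss of this kind can be sketched as follows. The sketch assumes a 4-connected neighborhood and a per-pixel weight of the form 1 + lambda * B(p); both choices, as well as the function names, are assumptions made for illustration, since the source only states that boundary pixels receive an increased training weight.

```python
import numpy as np

def boundary_indicator(labels):
    """Return 1.0 where any 4-neighbour has a different label, else 0.0."""
    b = np.zeros(labels.shape, dtype=bool)
    diff_v = labels[1:, :] != labels[:-1, :]   # differences between vertical neighbours
    diff_h = labels[:, 1:] != labels[:, :-1]   # differences between horizontal neighbours
    b[1:, :] |= diff_v
    b[:-1, :] |= diff_v
    b[:, 1:] |= diff_h
    b[:, :-1] |= diff_h
    return b.astype(np.float64)

def boundary_aware_loss(ce, labels, lam=4.0):
    """ce: (H, W) per-pixel cross-entropy; lam: boundary weighting coefficient."""
    weights = 1.0 + lam * boundary_indicator(labels)   # assumed weighting form
    return float((weights * ce).sum() / ce.size)

labels = np.zeros((4, 4), dtype=int)
labels[:, 2:] = 1                       # two categories split down the middle
ce = np.ones((4, 4))                    # uniform per-pixel loss for clarity
boundary = boundary_indicator(labels)
loss = boundary_aware_loss(ce, labels, lam=4.0)
```

In this toy case the two columns adjacent to the category split are marked as boundary pixels and each counts five times as much as an interior pixel.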
[0107] The embodiments of the present invention generate a gating coefficient map that accurately reflects the credibility of spatial location features by concatenating the first image features and the second image features and performing nonlinear transformation. This allows for adaptive weighted fusion of visible light and infrared features, dynamically suppressing invalid information in degraded regions and enhancing effective information. This significantly improves the accuracy and robustness of the fused features, ensuring high-quality cross-modal semantic understanding and image fusion effects even in harsh environments.
[0108] Optionally, based on the method shown in Figure 1, Figure 3 shows a specific implementation of step S120 in the image processing method provided by this invention. Step S120 may specifically include:
[0109] S300. Analyze semantic understanding information to determine the semantic category to which each pixel in the visible light image and infrared thermal imaging image belongs.
[0110] Specifically, embodiments of the present invention can map category information in semantic understanding information to corresponding pixel positions in visible light images and infrared thermal imaging images, ensuring that the semantic category of each pixel is accurately identified and labeled.
[0111] As examples, embodiments of the present invention can receive a semantic label map output from the semantic segmentation module during the initialization phase of the policy mapping and differential fusion module. The semantic label map is a single-channel data matrix with the same spatial resolution as the input image, where each element (pixel value) is an integer index directly corresponding to one of eight predefined semantic categories (e.g., 0-background, 1-person, 2-vehicle, 3-equipment, etc.). By reading this index value pixel by pixel and mapping it to the corresponding semantic category label with a clear physical meaning, the conversion from numerical encoding to semantic understanding is completed, assigning a definite semantic identity to each pixel location in the image.
[0112] S310. For each pixel, query the preset mapping relationship library according to the semantic category of the pixel, and assign the matching image fusion operator to the pixel.
[0113] The preset mapping relationship library refers to a deterministic mapping table that maps the semantic category of each pixel to a specific image fusion strategy. Based on physical imaging characteristics and application requirements, this library predefines the types and triggering conditions of the fusion operators. This ensures that the selection of a fusion strategy has a clear physical interpretation and suits the scene, and avoids the uninterpretability and instability caused by continuous weight regression.
[0114] Image fusion operators refer to specific fusion calculation rules or algorithm units implemented for the pixel regions corresponding to different semantic categories, used to fuse the feature and pixel information of visible light images and infrared thermal imaging images. The fusion operators perform differentiated processing according to a preset strategy and may include an edge enhancement operator, an infrared dominant operator, a visible light dominant operator, a highlight suppression operator, and a channel discarding operator.
[0115] The edge enhancement operator uses the edge information of infrared thermal radiation to enhance, through high-pass filtering, the contour details of targets such as people and vehicles, thereby improving the contrast and clarity of target recognition.
[0116] The infrared dominant operator assigns a high weight to the infrared image to preserve temperature distribution characteristics to the greatest extent, and is suitable for key monitoring objects such as equipment and ore piles.
[0117] The visible light dominant operator uses visible light as the main source, supplemented by infrared information, to enhance the texture and color representation of areas such as roads and backgrounds, thereby ensuring natural environmental perception.
[0118] The highlight suppression operator compresses the brightness of strong light source areas and fuses the result with the infrared image by taking the maximum value, suppressing overexposed white spots and ensuring that heat source information is not hidden by the halo.
[0119] The channel discarding operator, based on the principle of physical scattering, completely discards the visible light channel information and retains the infrared channel information. It is used in severely degraded areas such as dust and smoke to improve penetration and image quality.
[0120] It is important to note that each operator can incorporate adaptive environmental parameters for adjustment, enabling dynamic responses to different environmental conditions and ensuring high robustness and visual consistency of the fusion results.
[0121] Specifically, in this embodiment of the invention, semantic categories can be used as indexes to query the preset mapping relationship library, assigning each pixel in the image its corresponding fusion operator label. This forms a pixel-level fusion operator allocation map determined by the semantic categories, ensuring that the fusion strategy is physically interpretable and targeted.
[0122] As examples, embodiments of the present invention can use the "semantic category-fusion operator" mapping table to group the eight parsed semantic categories into five predefined, physically consistent discrete fusion operators (e.g., "personnel, vehicles" - "edge enhancement operator"; "equipment, ore pile" - "infrared dominant operator"; "road, background" - "visible light dominant operator"; "strong light source" - "highlight suppression operator"; "dust / smoke" - "channel discarding operator"). For each pixel, its determined semantic category is used as the "key", and the corresponding fusion operator is directly looked up in the table as the "value".
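The key-value lookup described above can be sketched as a small table applied to the whole label map at once. Category indices 0 to 3 follow the example given in the text; indices 4 to 7 and all identifier names are assumptions made for illustration.

```python
import numpy as np

# Indices 0-3 follow the example in the text; indices 4-7 are assumed.
CATEGORIES = ["background", "person", "vehicle", "equipment",
              "ore_pile", "road", "strong_light", "dust_smoke"]
OPERATORS = ["edge_enhancement", "infrared_dominant", "visible_dominant",
             "highlight_suppression", "channel_discard"]

CATEGORY_TO_OPERATOR = {
    "person": "edge_enhancement",
    "vehicle": "edge_enhancement",
    "equipment": "infrared_dominant",
    "ore_pile": "infrared_dominant",
    "road": "visible_dominant",
    "background": "visible_dominant",
    "strong_light": "highlight_suppression",
    "dust_smoke": "channel_discard",
}

# Lookup table: semantic category index -> fusion operator index.
LUT = np.array([OPERATORS.index(CATEGORY_TO_OPERATOR[c]) for c in CATEGORIES])

def assign_operators(label_map):
    """Map an (H, W) semantic label map to an (H, W) operator allocation map."""
    return LUT[label_map]

label_map = np.array([[0, 1], [7, 6]])
operator_map = assign_operators(label_map)
```

Because the table is deterministic, the same semantic category always yields the same operator, which is what gives the allocation its interpretability.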
[0123] S320. Generate a differentiated image fusion strategy by using the image fusion operators assigned to each pixel.
[0124] Specifically, embodiments of the present invention can summarize the image fusion operators assigned to all pixels to obtain a differentiated image fusion strategy.
[0125] The embodiments of the present invention accurately identify the semantic category of each pixel based on semantic understanding information, and assign the most suitable fusion operator to the pixel by combining a preset mapping relationship library, thereby realizing a differentiated image fusion strategy, effectively improving the pertinence and interpretability of the fusion process, and enhancing the detail preservation and visual quality of the fused image in complex environments.
[0126] Optionally, based on one or more embodiments corresponding to Figure 3 above, in another optional embodiment provided by the present invention, step S130 may specifically include:
[0127] Generate corresponding binary masks for each image fusion operator;
[0128] Specifically, in this embodiment of the invention, the category of the image fusion operator assigned to each pixel can be used to generate a corresponding binary mask image, so that the fusion kernel function for each type of operator can be called on the GPU to calculate the fusion result in parallel, and the internal weights and enhancement coefficients of the operator can be dynamically adjusted in combination with environmental adaptive parameters.
[0129] As examples, embodiments of the present invention can generate five binary mask images of the same size as the image based on the operator allocation results of all pixels. Each mask image corresponds to a fusion operator, where a pixel with a value of 1 indicates that the operator should be executed at that location, and a value of 0 indicates that it should not be executed. These five mask images collectively define the strategy execution map for different semantic regions in the image. Simultaneously, the environment adaptive adjustment stage calculates environmental parameters such as the current scene's fog index and illumination intensity in real time, and dynamically fine-tunes the parameters within each fusion operator (such as edge enhancement coefficients, infrared mixing weights, etc.) according to preset rules. Therefore, the final generated differentiated image fusion strategy is a complete and executable fusion scheme composed of a "mask image (determining where to execute)" and a "dynamically parameterized set of operators (determining how to execute)," preparing for parallel fusion computation on the GPU.
[0130] For any image fusion operator: using the fusion calculation rules defined by the image fusion operator, perform full-area fusion operation on the visible light image and the infrared thermal imaging image to obtain the full-image fusion result corresponding to the image fusion operator.
[0131] Specifically, in this embodiment of the invention, for each image fusion operator, the corresponding fusion kernel function (such as edge enhancement, infrared dominance, visible light dominance, highlight suppression, and channel discarding) can be called on the GPU. Using visible light and infrared thermal images as input, the fusion calculation formula defined by the operator itself is executed in parallel on all pixels within the entire image range. For example, the edge enhancement operator superimposes infrared edge gradient information, the highlight suppression operator performs logarithmic compression and a maximum operation, and the channel discarding operator retains only the infrared channel. Each operator outputs a fusion result image containing calculated fusion values for all pixels in the entire image, in preparation for subsequent mask integration.
[0132] Based on multiple binary masks, the values of the pixel regions identified by the corresponding masks in each full-image fusion result are integrated into the corresponding pixel regions of the target fused image.
[0133] Specifically, in this embodiment of the invention, based on multiple previously generated binary mask images, the full-image fusion result of each fusion operator is mapped one-to-one with its corresponding mask. The fusion result at the pixel location with a mask value of 1 is written into the corresponding pixel region of the target fused image. The entire process is efficiently implemented on the GPU through weighted summation or conditional copying, ensuring that each pixel only uses the result of its assigned fusion strategy. The final output target fused image achieves the effect of spatially differentiated fusion based on semantic category and physical characteristics.
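The mask generation, full-image fusion and mask-based integration described above can be sketched on the CPU with NumPy as follows. The operator formulas, weights and gains below are placeholders chosen for illustration, since the source does not give them, and the images are assumed to be single-channel luminance arrays with values in [0, 1]. The compositing step uses conditional copying.

```python
import numpy as np

def edge_enhancement(vis, ir, gain=0.5):
    gy, gx = np.gradient(ir)                        # infrared edge gradient
    return np.clip(0.5 * (vis + ir) + gain * np.hypot(gx, gy), 0.0, 1.0)

def infrared_dominant(vis, ir, w=0.8):
    return w * ir + (1.0 - w) * vis

def visible_dominant(vis, ir, w=0.8):
    return w * vis + (1.0 - w) * ir

def highlight_suppression(vis, ir):
    # Placeholder: logarithmic compression of brightness, then maximum with infrared.
    return np.maximum(0.5 * np.log1p(vis), ir)

def channel_discard(vis, ir):
    return ir.copy()                                # keep only the infrared channel

OPERATORS = [edge_enhancement, infrared_dominant, visible_dominant,
             highlight_suppression, channel_discard]

def fuse_with_masks(vis, ir, operator_map):
    """operator_map: (H, W) operator index per pixel."""
    masks = [(operator_map == i) for i in range(len(OPERATORS))]
    target = np.zeros_like(vis, dtype=np.float64)
    for mask, op in zip(masks, OPERATORS):
        full_result = op(vis, ir)                   # full-image fusion result
        target = np.where(mask, full_result, target)  # copy only where mask is 1
    return target, masks

vis = np.full((2, 3), 0.2)
ir = np.full((2, 3), 0.6)
operator_map = np.array([[1, 2, 4], [1, 2, 4]])
fused, masks = fuse_with_masks(vis, ir, operator_map)
```

Since every pixel is assigned exactly one operator, the masks are disjoint and together cover the image, so each output pixel comes from exactly one full-image result.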
[0134] This invention achieves efficient execution of pixel-level differentiated fusion strategies by accurately allocating fusion operators based on semantic categories and generating corresponding masks, and then combining the full-image fusion calculation defined by the operators with mask weighted integration. This significantly improves the detail preservation, semantic consistency and visual quality of the fused image, while ensuring the parallel acceleration and interpretability of the fusion process.
[0135] Optionally, based on one or more embodiments corresponding to Figure 1 above, in another optional embodiment provided by the present invention, after step S130, the method may further include:
[0136] Guiding information is introduced to maintain spatial structure, and the intensity of boundary smoothing is adaptively controlled based on semantic understanding information. Under the joint constraints of guiding information and intensity control, the smoothing operation is completed on the target fusion image.
[0137] Specifically, embodiments of the present invention can use the texture continuity of infrared thermal imaging images and the semantic confidence map as guidance, and employ a confidence-guided bilateral filtering algorithm to adaptively smooth the strategy boundaries and eliminate fusion artifacts and block effects. The smoothed luminance component is then fused with the visible light chromaticity channels to generate a color target fusion image with rich details and visual coherence, thereby improving the fusion quality.
[0138] As examples, embodiments of the present invention can pre-construct a boundary smoothing and color reconstruction module, utilizing a confidence-guided smoothing algorithm to address boundary artifacts that may arise from discrete strategy stitching: continuous texture from the infrared image serves as spatial guidance, while a confidence map from semantic segmentation controls the smoothing intensity. Sharp details are preserved in high-confidence defined regions (such as inside the target), while adaptive blending is performed only in low-confidence semantic boundary regions. This guided filtering effectively eliminates jumps and blocky effects at strategy boundaries. Finally, the smoothed luminance component and visible light chromaticity channel are recombined to reconstruct a final color-fused image with natural colors and smooth transitions.
[0139] Optionally, embodiments of the present invention may use infrared thermal imaging images as spatial continuity guidance.
[0140] Specifically, embodiments of the present invention can utilize infrared thermal imaging images as guide images. These images exhibit a smooth and continuous texture structure in space due to their thermal radiation characteristics, without obvious artificial jumps. The guide image is used during joint bilateral filtering to help maintain the edge information and structural continuity of the image, avoiding unnatural discontinuities or artifacts at the boundaries of different fusion strategies in the fused image.
[0141] Understandably, in a target fusion image, different semantic regions use different fusion operators, potentially leading to numerical jumps between adjacent pixels (i.e., "blocking effect"). Directly filtering the target fusion image may blur its internal structural edges. Infrared thermal imaging images, based on thermal radiation imaging, reflect temperature distribution only in their grayscale changes, unaffected by artificial strategy switching, and possess natural physical continuity in space. Therefore, infrared thermal imaging images can be used as guide images. In subsequent smoothing filtering calculations, the filtering weights depend not only on the spatial distance between pixels but, more importantly, on their grayscale similarity on the guide image. This means that the smoothing process will tend to follow the natural thermal radiation isotherms in the infrared image, effectively preserving the true structural edges of objects while eliminating strategy stitching gaps and preventing over-smoothing.
[0142] Optionally, embodiments of the present invention can determine the smoothing intensity based on the confidence information in the semantic understanding information.
[0143] Specifically, in this embodiment of the invention, a confidence map corresponding to each pixel can be obtained, and the confidence level reflects the reliability of the semantic category determination. This confidence level is used as an adjustment factor for the filtering weights: high-confidence regions retain the clear details of the original fusion result and the smoothing intensity is reduced; while low-confidence, blurred boundary regions are filtered more strongly. By appropriately smoothing out the jagged edges and stitching gaps caused by strategy switching, visual coherence is ensured.
[0144] Understandably, within an object (such as the center of a vehicle), the confidence level is typically high (close to 1), indicating that the semantic label is certain and excessive smoothing is unnecessary. However, at the boundaries between different semantic categories (such as the human-background edge), due to feature ambiguity, the confidence level is often low (close to 0), and these areas are precisely where strategy stitching artifacts are most prevalent. Therefore, this embodiment of the invention can transform confidence information into a control signal for smoothing intensity: the confidence level is directly or indirectly used to modulate the weight of the "pixel similarity" term in the filtering kernel function. Low-confidence regions receive stronger smoothing weights, while the smoothing effect in high-confidence regions is significantly suppressed. In this way, the smoothing intensity is dynamically correlated with the uncertainty of the segmentation result, achieving precise control of "smooth where it should be smooth, and sharp where it should be sharp."
[0145] Optionally, embodiments of the present invention can filter the target fusion image under the joint constraints of spatial continuity guidance and smoothing intensity to obtain a filtered target fusion image.
[0146] Understandably, the filtered target fusion image is the final fusion image generated after confidence-guided boundary smoothing and color channel backfilling. It has high detail and visual coherence and is suitable for industrial monitoring applications.
[0147] Specifically, embodiments of the present invention can combine an infrared guide image and a confidence-adjusted filtering weight to process the target fusion image using a joint bilateral filtering algorithm. Within the local neighborhood, the filtering result is calculated by simultaneously considering the spatial distance of pixels, gray-level similarity, and semantic confidence. This effectively smooths abrupt changes at the boundaries of the fusion strategy while highlighting important details and edges. The filtered image is the final target fusion image, balancing detail preservation and visual naturalness.
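One way to combine the spatial distance, guide-image similarity and semantic confidence terms is sketched below. It is an illustrative joint bilateral filter, not the filter of the invention: the neighbor weights are scaled by (1 - confidence) so that high-confidence pixels stay sharp, which is only one possible way to let confidence modulate the kernel, and the parameter values and the name `confidence_guided_jbf` are assumptions.

```python
import numpy as np

def confidence_guided_jbf(fused, guide, confidence,
                          radius=2, sigma_s=1.5, sigma_r=0.1):
    """fused, guide, confidence: (H, W) arrays; guide is the infrared image."""
    fused = np.asarray(fused, dtype=np.float64)
    guide = np.asarray(guide, dtype=np.float64)
    strength = 1.0 - np.asarray(confidence, dtype=np.float64)
    h, w = fused.shape
    f_pad = np.pad(fused, radius, mode="edge")
    g_pad = np.pad(guide, radius, mode="edge")
    num = fused.copy()                  # the centre pixel always has weight 1
    den = np.ones((h, w))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            ys, xs = radius + dy, radius + dx
            f_q = f_pad[ys:ys + h, xs:xs + w]       # neighbour values
            g_q = g_pad[ys:ys + h, xs:xs + w]       # neighbour guide values
            w_s = np.exp(-(dy * dy + dx * dx) / (2.0 * sigma_s ** 2))
            w_r = np.exp(-((guide - g_q) ** 2) / (2.0 * sigma_r ** 2))
            wgt = strength * w_s * w_r              # confidence scales smoothing
            num += wgt * f_q
            den += wgt
    return num / den

fused = np.zeros((6, 6))
fused[:, 3:] = 1.0                                  # a hard strategy boundary
flat_guide = np.zeros((6, 6))
sharp = confidence_guided_jbf(fused, flat_guide, np.ones((6, 6)))
smooth = confidence_guided_jbf(fused, flat_guide, np.zeros((6, 6)))
kept = confidence_guided_jbf(fused, fused, np.zeros((6, 6)))
```

In this toy case a seam with no counterpart in the guide image is blended when confidence is low, remains untouched when confidence is high, and is preserved when the guide image contains a real edge at the same position.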
[0148] This invention uses infrared thermal imaging images as spatial continuity guidance and adaptively adjusts the smoothing intensity according to semantic confidence when performing boundary smoothing on the target fused image. This effectively eliminates the artifacts and jagged edges caused by switching fusion strategies, while accurately preserving key structures and details, significantly improving the spatial consistency and visual naturalness of the final target fused image.
[0149] Optionally, based on one or more embodiments corresponding to Figure 1 above, in another optional embodiment provided by the present invention, after step S100, the method may further include:
[0150] Analyze visible light images and infrared thermal images to obtain real-time environmental state perception parameters of the target scene; based on the perception parameters, dynamically adjust one or more operational parameters involved in the differentiated image fusion strategy to make the fusion process adapt to environmental changes.
[0151] Specifically, embodiments of the present invention can calculate multiple environmental indicators in real time to characterize the current scene state based on spatiotemporally aligned visible light images and infrared thermal imaging images: extracting the fog and dust index from the visible light image using a dark channel prior algorithm to reflect the concentration of dust and smoke in the air; determining the light intensity and whether it is nighttime by statistically analyzing the average brightness of the visible light image; and calculating the dynamic range of the infrared image to evaluate the thermal contrast and temperature difference.
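The three environmental indicators described above can be illustrated with a minimal sketch. It assumes images are plain nested lists with values normalized to [0, 1], takes the mean of the dark channel as the fog and dust index, and uses the average of the three color channels as brightness; the function names and these specific choices are illustrative assumptions, not details stated in this document.

```python
def dark_channel_index(rgb, radius=1):
    """Mean of the dark channel: min over RGB, then min over a local window.

    Assumption: the mean dark-channel value is used as the fog and dust index.
    Haze-free regions tend toward 0, hazy regions toward the haze brightness.
    """
    h, w = len(rgb), len(rgb[0])
    # Per-pixel minimum across the three color channels.
    min_rgb = [[min(rgb[y][x]) for x in range(w)] for y in range(h)]
    total = 0.0
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            total += min(min_rgb[j][i] for j in ys for i in xs)
    return total / (h * w)


def mean_brightness(rgb):
    """Average brightness, taking brightness as the mean of R, G and B."""
    h, w = len(rgb), len(rgb[0])
    return sum(sum(px) / 3.0 for row in rgb for px in row) / (h * w)


def infrared_dynamic_range(ir):
    """Difference between the largest and smallest infrared values."""
    values = [v for row in ir for v in row]
    return max(values) - min(values)
```

A production system would more likely compute these statistics on the GPU or on downsampled frames; the nested loops here only make the definitions explicit.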
[0152] Furthermore, embodiments of the present invention can automatically lower the probability threshold of the smoke and dust category when the fog and dust index rises, making it easier to trigger the channel dropping strategy and thereby enhancing fog penetration; when the average visible light brightness is lower than the nighttime threshold, the weight of the infrared-dominated strategy is increased to compensate for the lack of visible light information; when the infrared dynamic range decreases, the edge enhancement coefficient is increased to prevent the target from being submerged by the background. This environmental feedback loop keeps the fusion strategy robust and visually effective in complex and ever-changing industrial scenarios.
[0153] This invention, by obtaining spatiotemporally aligned visible light and infrared thermal images, senses the environmental state in real time and dynamically adjusts the computational parameters of the differentiated image fusion strategy. This enables the fusion processing to adapt to environmental changes such as fog, illumination, and thermal contrast, continuously improving the clarity and reliability of the target fused image and ensuring the stability and applicability of image processing in complex scenarios.
[0154] Optionally, based on one or more embodiments corresponding to Figure 1 above, in another optional embodiment provided by the present invention, the differentiated image fusion strategy includes edge enhancement, infrared dominance, visible light dominance, highlight suppression, and channel dropping.
[0155] In dimly lit and complex background environments, people and vehicles are often difficult to identify quickly and accurately due to low contrast, posing a significant security risk. To address this issue, the edge enhancement strategy provided in this invention, while fully preserving infrared thermal imaging features, incorporates edge gradient information from the infrared image through high-pass filtering technology to actively enhance the representation of the target contour, thereby improving the target's identifiability.
[0156] $F_{\mathrm{edge}}(x,y) = w_{ir}\, I_{ir}(x,y) + w_{vis}\, I_{vis}(x,y) + \lambda\, G_{ir}(x,y)$;
[0157] The infrared edge gradient $G_{ir}(x,y)$ is the magnitude of the horizontal and vertical gradients computed with the Sobel operator, expressed by the following formula:
[0158] $G_{ir}(x,y) = \sqrt{\big((S_x * I_{ir})(x,y)\big)^2 + \big((S_y * I_{ir})(x,y)\big)^2}$;
[0159] Where "*" represents the convolution operation; $I_{ir}(x,y)$ is the grayscale value (or intensity value) of the infrared image at pixel $(x,y)$; $I_{vis}(x,y)$ is the grayscale value (or intensity value) of the visible light image at pixel $(x,y)$; $w_{ir}$ is the infrared mixing weight (dimensionless, typical value 0.7); $w_{vis}$ is the visible light mixing weight (dimensionless, typical value 0.3); $\lambda$ is the edge enhancement coefficient (dimensionless, typical value 0.3-0.5), which is dynamically adjusted by the environment adaptive module; and $S_x$ and $S_y$ are the 3×3 Sobel operator kernels in the horizontal and vertical directions, respectively.
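As an illustration of the edge enhancement strategy, the following sketch assumes that the fused value is the weighted sum of the infrared and visible light values plus the infrared Sobel gradient magnitude scaled by the edge enhancement coefficient. The border handling (replication), the clipping of the result to [0, 1], and all function names are assumptions made for this example.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def filter3x3(img, kernel, y, x):
    """3x3 kernel response at (y, x), replicating border pixels.

    This is a correlation; for Sobel kernels a true convolution only flips
    the sign, which does not change the gradient magnitude.
    """
    h, w = len(img), len(img[0])
    acc = 0.0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            yy = min(max(y + dy, 0), h - 1)
            xx = min(max(x + dx, 0), w - 1)
            acc += kernel[dy + 1][dx + 1] * img[yy][xx]
    return acc


def edge_enhanced_fusion(ir, vis, w_ir=0.7, w_vis=0.3, lam=0.3):
    """Weighted blend of both images plus the scaled infrared edge gradient."""
    h, w = len(ir), len(ir[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = filter3x3(ir, SOBEL_X, y, x)
            gy = filter3x3(ir, SOBEL_Y, y, x)
            grad = math.hypot(gx, gy)  # gradient magnitude
            value = w_ir * ir[y][x] + w_vis * vis[y][x] + lam * grad
            out[y][x] = min(1.0, max(0.0, value))  # assumed clipping to [0, 1]
    return out
```

In flat infrared regions the gradient term vanishes and the result reduces to the plain 0.7/0.3 blend; along thermal contours the added gradient brightens the outline.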
[0160] For scenarios involving abnormal temperature distribution, such as overheating failures in mechanical equipment and the risk of spontaneous combustion in ore piles, an infrared-dominated strategy is adopted. This strategy uses linear weighting with high infrared image weights to preserve thermal radiation characteristics and grayscale distribution to the greatest extent possible, effectively highlighting areas of abnormal temperature. Simultaneously, it reduces interference from visible light textures, allowing operators to more intuitively observe and judge abnormal temperature conditions, thus improving safety monitoring efficiency. The formula for the infrared-dominated strategy is:
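[0161] $F_{\mathrm{IR}}(x,y) = \alpha_{ir}\, I_{ir}(x,y) + \alpha_{vis}\, I_{vis}(x,y)$, where $I_{ir}(x,y)$ and $I_{vis}(x,y)$ are the infrared and visible light grayscale values at pixel $(x,y)$, and the infrared weight $\alpha_{ir}$ is set much higher than the visible light weight $\alpha_{vis}$.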
[0162] In road and background areas, the image mainly serves as a location reference and therefore needs rich texture detail and clear road markings. Infrared images typically lack fine texture, and when the surface temperature is relatively uniform they appear dark gray with little contrast, so they provide little useful information. For these areas a visible light-dominated strategy is adopted, introducing only a small infrared component during fusion to avoid completely dark areas in the image, ensuring clear presentation of background details and maintaining overall visual naturalness and continuity. The formula for the visible light-dominated strategy is:
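[0163] $F_{\mathrm{VIS}}(x,y) = \beta_{vis}\, I_{vis}(x,y) + \beta_{ir}\, I_{ir}(x,y)$, where $I_{vis}(x,y)$ and $I_{ir}(x,y)$ are the visible light and infrared grayscale values at pixel $(x,y)$, and the infrared weight $\beta_{ir}$ is kept small relative to the visible light weight $\beta_{vis}$.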
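The infrared-dominated and visible light-dominated strategies are both linear blends that differ only in their weights. A minimal sketch follows; the weight values 0.8 and 0.9 are placeholders chosen for illustration (this document does not state them), and the weights are assumed to sum to one.

```python
def linear_blend(ir, vis, w_ir, w_vis):
    """Pixel-wise weighted sum of an infrared and a visible light image."""
    return [
        [w_ir * a + w_vis * b for a, b in zip(row_ir, row_vis)]
        for row_ir, row_vis in zip(ir, vis)
    ]


def infrared_dominant(ir, vis, w_ir=0.8):
    """High infrared weight; 0.8 is a hypothetical value."""
    return linear_blend(ir, vis, w_ir, 1.0 - w_ir)


def visible_dominant(ir, vis, w_vis=0.9):
    """High visible light weight with a small infrared share; 0.9 is hypothetical."""
    return linear_blend(ir, vis, 1.0 - w_vis, w_vis)
```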
[0164] When a searchlight or vehicle headlight shines directly into the lens, visible light images often exhibit large overexposed white spots or halos that completely obscure the infrared heat source information within these areas. To address this issue, the highlight suppression strategy applies a logarithmic compression function to the visible light brightness. The suppression intensity is controlled by the compression strength coefficient; a larger coefficient results in stronger suppression. The compressed visible light brightness is then fused with the infrared image by taking, at each pixel, the maximum of the two values. This ensures that even in the central region of strong light, if a target with a higher temperature (such as a fire source) exists, the infrared information remains clearly visible, preventing the heat source from being obscured by the halo and thus improving the accuracy and effectiveness of target detection. The highlight suppression strategy formula is:
[0165] ;
[0166] ;
[0167] where the coefficient in the logarithmic compression function is the compression strength coefficient.
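The highlight suppression strategy can be sketched as follows. The exact compression function is not recoverable from this text, so the example assumes the form log(1 + k·v) / k, chosen only because it is logarithmic and compresses more strongly as the coefficient k grows; the function names and the default k are likewise hypothetical.

```python
import math


def compress_highlights(v, k):
    """Assumed logarithmic compression: log(1 + k*v) / k.

    Close to v when k*v is small, increasingly compressed as k grows.
    """
    return math.log1p(k * v) / k


def highlight_suppressed_fusion(ir, vis, k=4.0):
    """Pixel-wise maximum of the compressed visible value and the infrared value."""
    return [
        [max(compress_highlights(b, k), a) for a, b in zip(row_ir, row_vis)]
        for row_ir, row_vis in zip(ir, vis)
    ]
```

With a saturated visible pixel (value 1.0), a hot infrared pixel (0.9) survives the maximum, while a cold one leaves only the dimmed visible value.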
[0168] Dust and water mist particles produce significant Mie scattering in the visible light band (approximately 380-780 nm), resulting in a grayish-white "fog" effect that severely degrades image clarity and recognition accuracy. In contrast, wavelengths in the long-wave infrared band (8-14 micrometers) are much larger than the diameter of these particles, so long-wave infrared radiation diffracts around and penetrates them well, which greatly reduces the impact of scattering. Based on this physical characteristic, when areas of smoke or water mist are detected, a channel dropping strategy can be adopted that forcibly discards the visible light channel information (setting its weight to zero) and retains only the infrared channel data for image presentation and analysis. This approach removes the noise caused by Mie scattering at its source, significantly improving imaging quality and target recognition accuracy in harsh environments. The formula for the channel dropping strategy is:
[0169] $F_{\mathrm{drop}}(x,y) = I_{ir}(x,y)$, where $I_{ir}(x,y)$ is the infrared grayscale value at pixel $(x,y)$.
[0170] In practical applications, to meet the real-time processing requirement of 30 to 60 frames per second in industrial monitoring, differentiated fusion is processed in parallel rather than serially. Five binary mask images are maintained in GPU memory, corresponding to the pixel regions of the five fusion strategies described above. During computation, the GPU simultaneously launches five parallel CUDA kernel functions, each of which calculates the fusion result of one strategy for the entire image. Finally, the system combines these five results into the final output image using a mask-weighted approach:
[0171] $F(x,y) = \sum_{k=1}^{5} M_k(x,y)\, F_k(x,y)$, where $M_k$ is the binary mask of the $k$-th fusion strategy and $F_k$ is the full-image fusion result computed with that strategy.
[0172] This "full-image computation combined with mask selection" processing logic avoids thread divergence on the GPU, greatly improving the efficiency and throughput of parallel computing and thereby meeting the performance requirements of high frame rate real-time fusion.
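The "full-image computation combined with mask selection" logic can be sketched on the CPU as below. It assumes each strategy has already produced a full-image result and that the binary masks partition the image, so exactly one mask is 1 at every pixel; the function name is illustrative, and the real system is described as running CUDA kernels rather than Python loops.

```python
def combine_with_masks(results, masks):
    """Sum of mask-weighted full-image results.

    results: list of full-image fusion results, one per strategy.
    masks:   list of binary masks (0 or 1), same order and shape as results.
    """
    h, w = len(results[0]), len(results[0][0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Exactly one strategy must own each pixel.
            if sum(m[y][x] for m in masks) != 1:
                raise ValueError(f"pixel ({y}, {x}) is not covered by exactly one mask")
            out[y][x] = sum(m[y][x] * r[y][x] for m, r in zip(masks, results))
    return out
```

Because every strategy is evaluated for every pixel and the mask only selects afterwards, no per-pixel branch is needed inside the fusion kernels, which is what avoids divergence in a GPU implementation.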
[0173] However, in directly stitched fused images, abrupt pixel value changes often occur at the boundaries of different strategy regions (e.g., the edge between a person and the background), leading to visual discontinuities. While traditional global Gaussian filtering can smooth these abrupt changes, it blurs image details and affects edge sharpness. Therefore, this invention utilizes semantic confidence to dynamically adjust the filtering intensity and employs an infrared image as a guide map to achieve joint bilateral filtering, thereby effectively eliminating artifacts while preserving edge details.
[0174] The filtered output value $\tilde{F}(p)$ is defined as:
[0175] $\tilde{F}(p) = \frac{1}{W_p} \sum_{q \in \Omega_p} G_s(\lVert p - q \rVert)\, G_r(\lvert I_{ir}(p) - I_{ir}(q) \rvert)\, C(q)\, F(q)$;
[0176] Where the normalization factor $W_p$ is defined as:
[0177] $W_p = \sum_{q \in \Omega_p} G_s(\lVert p - q \rVert)\, G_r(\lvert I_{ir}(p) - I_{ir}(q) \rvert)\, C(q)$;
[0178] The final fusion result is obtained through weighted averaging:
[0179] $F_{\mathrm{final}}(p) = C(p)\, F(p) + \big(1 - C(p)\big)\, \tilde{F}(p)$;
[0180] where $F$ is the target fused image before filtering; $\Omega_p$ is a local neighborhood window centered on pixel $p$ (typically with a window radius of 5-7 pixels); $G_s$ is the spatial distance Gaussian kernel, whose standard deviation $\sigma_s$ controls the smoothing range (typical value 5.0 pixels); $G_r$ is the Gaussian kernel for pixel similarity, evaluated on the infrared guide image $I_{ir}$, whose standard deviation $\sigma_r$ controls the edge preservation capability (typical value 0.1, normalized grayscale value); and $C(q)$ is the semantic confidence of the neighboring pixel $q$ (within the range [0,1]), which makes high-confidence regions contribute more to the filtering. As the guide map, the infrared image is spatially continuous and free of the artificial jumps introduced by strategy switching, so it effectively preserves structural edges. When the semantic segmentation confidence $C(p)$ is high (e.g., when the pixel is certainly inside a vehicle), the system retains the sharp, original fusion result; when the confidence is low (typically at the edges of objects), the smoothing term $\tilde{F}(p)$ is introduced to eliminate jagged edges and artifacts caused by strategy switching.
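The confidence-guided joint bilateral filtering described above can be sketched as follows. The sketch assumes that the range kernel is evaluated on the infrared guide image, that each neighbor's weight is multiplied by its semantic confidence, and that the final value blends the unfiltered and smoothed values according to the confidence of the center pixel. Function and parameter names are illustrative.

```python
import math


def joint_bilateral(fused, guide, conf, radius=5, sigma_s=5.0, sigma_r=0.1):
    """Confidence-weighted joint bilateral filter with an infrared guide image."""
    h, w = len(fused), len(fused[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            num = den = 0.0
            for j in range(max(0, y - radius), min(h, y + radius + 1)):
                for i in range(max(0, x - radius), min(w, x + radius + 1)):
                    d2 = (j - y) ** 2 + (i - x) ** 2
                    g_s = math.exp(-d2 / (2.0 * sigma_s ** 2))  # spatial kernel
                    diff = guide[j][i] - guide[y][x]
                    g_r = math.exp(-diff * diff / (2.0 * sigma_r ** 2))  # range kernel on guide
                    weight = g_s * g_r * conf[j][i]
                    num += weight * fused[j][i]
                    den += weight
            # Fall back to the unfiltered value if every neighbor has zero confidence.
            smooth = num / den if den > 0.0 else fused[y][x]
            c = conf[y][x]
            out[y][x] = c * fused[y][x] + (1.0 - c) * smooth
    return out
```

Pixels with confidence 1 pass through unchanged, while low-confidence pixels near strategy boundaries are pulled toward the guided local average.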
[0181] It is understandable that industrial environments are highly dynamic and changeable. This embodiment of the invention monitors multiple statistical indicators of the current frame in real time and dynamically adjusts the fusion parameters accordingly to adapt to complex and ever-changing field conditions.
[0182] First, embodiments of the present invention can assess fog and dust concentration by computing a dark channel prior statistic, which serves as the fog and dust index. A higher index indicates a greater concentration of smoke and dust, and the system automatically lowers the probability threshold for classifying a pixel as "dust," making it easier to trigger the corresponding channel dropping strategy and thereby enhancing fog penetration capability. The threshold adjustment formula is as follows:
[0183] $\tau_{\mathrm{dust}} = \tau_0 - k\,(D - D_{\mathrm{ref}})$;
[0184] where $\tau_0$ is the baseline probability threshold (dimensionless, typical value 0.5); $k$ is the adjustment sensitivity coefficient (dimensionless, typical value 0.3); $D$ is the current frame's fog and dust index (dimensionless, value range [0,1]); and $D_{\mathrm{ref}}$ is the reference fog and dust index (dimensionless, typical value 0.3).
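A minimal sketch of this threshold adjustment, assuming a linear dependence on the deviation of the fog and dust index from its reference value and using the typical values listed above as defaults. The clamping to [0, 1] and the function name are assumptions added for the example.

```python
def dust_threshold(fog_index, base=0.5, sensitivity=0.3, reference=0.3):
    """Lower the dust-class probability threshold as the fog and dust index rises.

    Assumed linear form: base - sensitivity * (fog_index - reference),
    clamped to [0, 1] so that it remains a valid probability threshold.
    """
    tau = base - sensitivity * (fog_index - reference)
    return min(1.0, max(0.0, tau))
```

With these defaults, an index of 0.3 leaves the threshold at 0.5, and an index of 0.8 lowers it to about 0.35.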
[0185] Secondly, embodiments of the present invention can also calculate the average brightness of visible light images. When the average brightness is lower than the nighttime threshold (typically 0.2), the weighting coefficient of the infrared-dominated strategy will be globally increased to effectively compensate for the lack of visible light information and improve the imaging quality in nighttime or low-light environments.
[0186] Furthermore, embodiments of the present invention can also evaluate the dynamic range of infrared images. When the infrared contrast is low, i.e., the scene temperature difference is small, the edge enhancement coefficient is automatically increased to prevent the target from being submerged by the background and to ensure that the edge details of key targets are clearly distinguishable. The adjustment formula for the edge enhancement coefficient is:
[0187] $\lambda = \lambda_0 \cdot \dfrac{R_{\mathrm{ref}}}{R + \varepsilon}$;
[0188] where $\lambda_0$ is the baseline edge enhancement coefficient (dimensionless, typical value 0.3); $R_{\mathrm{ref}}$ is the reference dynamic range (dimensionless, typical value 0.5); $R$ is the current frame's infrared dynamic range (dimensionless, value range [0,1]); and $\varepsilon$ is a small constant that prevents division by zero (typical value 0.01).
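The edge enhancement coefficient adjustment can be sketched as below, assuming the coefficient scales with the ratio of the reference dynamic range to the current infrared dynamic range, with the small constant guarding the denominator. The defaults are the typical values listed above; the function name is illustrative.

```python
def edge_coefficient(ir_range, base=0.3, reference=0.5, eps=0.01):
    """Raise the edge enhancement coefficient when infrared contrast is low.

    Assumed form: base * reference / (ir_range + eps).
    """
    return base * reference / (ir_range + eps)
```

For a low-contrast frame with a dynamic range of 0.1, the coefficient rises to roughly 1.36, so an implementation may also want to cap it at an upper bound; no such cap is stated here.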
[0189] Through the above-mentioned multi-dimensional environmental perception and parameter adaptive adjustment, the embodiments of the present invention can respond in real time to changes in light, fog and dust and temperature difference in industrial sites, significantly improving the robustness and visual effect of image fusion.
[0190] Although the operations are described in a specific order, this should not be construed as requiring these operations to be performed in the specific order shown or in a sequential order. In certain environments, multitasking and parallel processing may be advantageous.
[0191] It should be understood that the various steps described in the method embodiments of the present invention may be performed in different orders and / or in parallel. Furthermore, the method embodiments may include additional steps and / or omit the steps shown. The scope of the present invention is not limited in this respect.
[0192] This invention provides a computer-readable storage medium having a program stored thereon, which, when executed by a processor, implements the image processing method.
[0193] This invention provides a processor for running a program, wherein the program executes the image processing method during runtime.
[0194] This invention provides an electronic device, which includes at least one processor, at least one memory connected to the processor, and a bus; wherein the processor and the memory communicate with each other via the bus; the processor is used to call program instructions in the memory to execute the aforementioned image processing method. The electronic device described herein may be a camera device, server, PC, PAD, mobile phone, etc.
[0195] The present invention also provides a computer program product that, when executed on an electronic device, is adapted to execute a program initialized with the steps of the image processing method.
[0196] This invention is described with reference to flowchart illustrations and / or block diagrams of methods, apparatuses, electronic devices (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable device, produce a device for implementing the functions specified in one or more processes of the flowcharts and / or one or more blocks of the block diagrams.
[0197] In a typical configuration, an electronic device includes one or more processors (CPUs), memory, and a bus. The electronic device may also include input / output interfaces, network interfaces, etc.
[0198] Memory may include non-persistent memory, random access memory (RAM), and / or non-volatile memory in computer-readable media, such as read-only memory (ROM) or flash RAM, and memory includes at least one memory chip. Memory is an example of computer-readable media.
[0199] Computer-readable media include both permanent and non-permanent, removable and non-removable media that can store information using any method or technology. Information can be computer-readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
[0200] In the description of this invention, it should be understood that if the terms "upper", "lower", "front", "rear", "left" and "right" are used to indicate the orientation or positional relationship based on the orientation or positional relationship shown in the drawings, they are only for the convenience of describing this invention and simplifying the description, and do not indicate or imply that the position or element referred to must have a specific orientation, or be constructed and operated in a specific orientation, and therefore should not be construed as a limitation of this invention.
[0201] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes the element.
[0202] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0203] The above are merely embodiments of the present invention and are not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principle of the present invention should be included within the scope of the present invention.
Claims
1. An image processing method, characterized in that it comprises: obtaining spatiotemporally aligned visible light and infrared thermal imaging images of a target scene; performing cross-modal semantic analysis on the visible light image and the infrared thermal imaging image to generate semantic understanding information, wherein the semantic understanding information includes the semantic category of each pixel in the visible light image and the infrared thermal imaging image; determining, based on the semantic understanding information, a differentiated image fusion strategy for pixel regions of different semantic categories in the visible light image and the infrared thermal imaging image; and fusing the visible light image and the infrared thermal imaging image using the differentiated image fusion strategy to generate a target fused image.
2. The method according to claim 1, characterized in that, The step of performing cross-modal semantic analysis on the visible light image and the infrared thermal imaging image to generate semantic understanding information includes: The first image features of the visible light image and the second image features of the infrared thermal image are extracted respectively; Using the first image features, gate information is dynamically generated to evaluate the feature effectiveness of the visible light image; Using the gating information, the first image features and the second image features are adaptively weighted and fused to obtain fused features; Based on the fusion features, semantic understanding information is generated through decoding.
3. The method according to claim 2, characterized in that the step of using the gating information to adaptively weight and fuse the first image features and the second image features to obtain fused features includes: concatenating the first image features and the second image features to obtain concatenated features; subjecting the concatenated features to a nonlinear transformation to generate a gating coefficient map, wherein each value in the gating coefficient map corresponds to a spatial location and is used to represent the credibility of the image features at the spatial location; obtaining visible light features using the gating coefficient map and the first image features; the scalar is subtracted element by element from the gating coefficient map to obtain the infrared feature weight map; obtaining infrared features using the infrared feature weight map and the second image features; and obtaining the fused features by utilizing the visible light features and the infrared features.
4. The method according to claim 1, characterized in that, The step of determining the differential image fusion strategy corresponding to pixel regions of different semantic categories in the visible light image and the infrared thermal imaging image based on the semantic understanding information includes: The semantic understanding information is analyzed to determine the semantic category to which each pixel in the visible light image and the infrared thermal imaging image belongs; For each pixel, a preset mapping relationship library is queried according to the semantic category of the pixel, and a corresponding image fusion operator is matched and assigned to the pixel; A differentiated image fusion strategy is generated by using the image fusion operator assigned to each pixel.
5. The method according to claim 4, characterized in that, The step of fusing the visible light image and the infrared thermal imaging image using the differentiated image fusion strategy to generate a target fused image includes: Generate corresponding binary masks for each of the image fusion operators; For any of the image fusion operators: using the fusion calculation rules defined by the image fusion operator, perform a full-area fusion operation on the visible light image and the infrared thermal imaging image to obtain a full-image fusion result image corresponding to the image fusion operator; Based on multiple binary masks, the values of the pixel regions identified by the corresponding masks in each full-image fusion result image are integrated into the corresponding pixel regions of the target fused image.
6. The method according to claim 1, characterized in that, The semantic understanding information also includes confidence information corresponding to the semantic category. After fusing the visible light image and the infrared thermal imaging image using the differential image fusion strategy to generate the target fused image, the method further includes: The infrared thermal imaging image is used as a guide for spatial continuity; The smoothing intensity is determined based on the confidence information in the semantic understanding information; Under the combined constraints of the spatial continuity guidance and the smoothing intensity, the target fusion image is filtered to obtain the filtered target fusion image.
7. The method according to claim 1, characterized in that, after obtaining the spatiotemporally aligned visible light image and infrared thermal imaging image of the target scene, the method further includes: analyzing the visible light image and the infrared thermal imaging image to obtain real-time environmental state perception parameters of the target scene; and, based on the perception parameters, dynamically adjusting one or more operational parameters involved in the differentiated image fusion strategy to adapt the fusion process to environmental changes.
8. The method according to any one of claims 1 to 7, characterized in that, The differentiated image fusion strategy includes edge enhancement, infrared dominance, visible light dominance, highlight suppression, and channel dropping.
9. A computer-readable storage medium having a program stored thereon, characterized in that, When the program is executed by the processor, it implements the image processing method as described in any one of claims 1 to 8.
10. An electronic device, characterized in that, The electronic device includes at least one processor, at least one memory connected to the processor, and a bus; wherein the processor and the memory communicate with each other through the bus; the processor is used to call program instructions in the memory to execute the image processing method as described in any one of claims 1 to 8.