A lightweight target detection method based on infrared and visible light multi-modal fusion
This lightweight target detection method fuses infrared and visible light multimodal data. It combines dual-branch feature extraction and cross-modal feature fusion with a lightweight network and a joint loss function to address the high model complexity and insufficient detection robustness of existing technologies, achieving efficient and accurate target detection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GUILIN UNIV OF ELECTRONIC TECH
- Filing Date
- 2026-03-31
- Publication Date
- 2026-06-19
AI Technical Summary
Existing infrared and visible light multimodal target detection methods suffer from problems such as complex model structure, large computational load, high deployment cost, and low robustness in complex environments, making it difficult to meet the lightweight and real-time requirements of mobile terminals and edge devices.
A lightweight target detection method based on infrared and visible light multimodal fusion is adopted. Infrared and visible light modal features are extracted separately through a dual-branch feature extraction network. Information interaction and weighted fusion are performed using a cross-modal feature fusion module. A lightweight target detection network based on the YOLOv8 detection framework, together with the GSConv module, is used to reduce computational complexity. The network is trained with a joint loss function that combines Varifocal Loss and CIoU Loss.
It improves the accuracy and robustness of target detection in complex environments, reduces the number of model parameters and computational complexity, facilitates deployment on mobile terminals and edge devices, and enhances detection performance.
Figure CN122244422A_ABST (abstract figure)
Description
Technical Field
[0001] This invention relates to the fields of computer vision and intelligent sensing technology, specifically to a lightweight target detection method based on infrared and visible light multimodal fusion, which is particularly suitable for target detection tasks in nighttime, low-light and complex environments.
Background Technology
[0002] Object detection is an important research area in computer vision, widely used in scenarios such as video surveillance, autonomous driving, intelligent robots, and security early warning. With the development of deep learning technology, object detection methods based on convolutional neural networks have made significant progress. Among them, the YOLO series of algorithms are widely used in real-time object detection tasks due to their advantages of fast detection speed and high accuracy.
[0003] However, in complex environments such as nighttime, low light, and smoke obscuring the target, relying solely on visible light images for target detection is easily affected by insufficient illumination, blurred target edges, and background interference, leading to decreased detection accuracy. In contrast, infrared images, by acquiring target thermal radiation information, have the advantage of being insensitive to changes in illumination and can provide stable thermal target information in low-light environments, although their texture details and structural information are relatively weaker. Therefore, infrared and visible light images are highly complementary, and fusing them can improve the robustness and accuracy of target detection in complex scenes.
[0004] While existing infrared and visible light multimodal target detection methods can improve detection performance to some extent, most rely on complex network structures or large-scale models, resulting in problems such as large number of parameters, high computational cost, and high deployment cost. This makes it difficult to meet the requirements of mobile terminals and edge devices for lightweight models and real-time performance. At the same time, existing methods still have shortcomings in cross-modal feature interaction capabilities and multi-task loss optimization, making it difficult to balance detection accuracy and inference efficiency.
[0005] Therefore, there is an urgent need for a lightweight target detection method that fully utilizes the complementary information of infrared and visible light images, achieves high detection accuracy with low model complexity, and is easy to deploy.
Summary of the Invention
[0006] The purpose of this invention is to provide a lightweight target detection method based on infrared and visible light multimodal fusion, so as to solve the problems of complex structure, high computational cost, insufficient real-time performance, and low robustness in complex environments of existing multimodal target detection models.
[0007] To achieve the above objectives, the present invention adopts the following technical solution:
[0008] A lightweight target detection method based on infrared and visible light multimodal fusion includes the following steps:
[0009] Step 1: Acquire the infrared image of the scene to be detected and the visible light image corresponding to the infrared image, and preprocess the infrared image and the visible light image to obtain the input image pair;
[0010] Step 2: Input the input image pairs into a dual-branch feature extraction network to extract infrared modal features and visible light modal features;
[0011] Step 3: Input the infrared modal features and the visible light modal features into the cross-modal feature fusion module to perform feature alignment, cross-modal interaction and weighted fusion to obtain the fused features;
[0012] Step 4: Input the fused features into the lightweight target detection network to perform multi-scale feature fusion and target prediction, and output target category information and target location information;
[0013] Step 5: Train the lightweight object detection network using a joint loss function that combines the classification loss function and the bounding box regression loss function to obtain the object detection model, and then use the object detection model to detect the target.
[0014] Furthermore, the preprocessing in step 1 includes: performing size unification and normalization processing on the infrared image and the visible light image, and ensuring that the two modal images correspond to each other in the time dimension and the spatial dimension.
[0015] Furthermore, in step 2, the dual-branch feature extraction network performs independent feature extraction on the infrared image and the visible light image, respectively. The infrared branch is used to extract semantic information related to thermal radiation, while the visible light branch is used to extract texture information and structural information.
[0016] Furthermore, the cross-modal feature fusion module in step 3 includes: performing channel mapping and spatial alignment on infrared modal features and visible light modal features respectively using 1×1 convolution; performing cross-modal interaction based on a cross-attention mechanism, wherein a query vector is generated from one modal feature and a key vector and value vector are generated from another modal feature to obtain interaction features; concatenating the interaction features with the original modal features and inputting them into the channel attention module for weighting to obtain fused features.
[0017] Furthermore, the channel attention module is a squeeze-and-excitation (SE) module, which generates channel weights through global average pooling and two fully connected layers. The first fully connected layer is used to compress the channel dimension from 2C to 2C / r, and the second fully connected layer is used to restore the channel dimension to C. The ReLU activation function is used after the first fully connected layer, and the Sigmoid activation function is used after the second fully connected layer. Preferably, the compression ratio r is 16.
[0018] Furthermore, the lightweight object detection network in step 4 is based on the YOLOv8 detection framework, and its backbone network is replaced with MobileOneNet to reduce the number of network parameters and inference latency. At the same time, the GSConv module is used in the neck network to replace the traditional convolutional module to reduce the amount of computation and improve the feature fusion efficiency.
[0019] Furthermore, the joint loss function in step 5 includes a Varifocal Loss for the classification subtask and a CIoU Loss for the bounding box regression subtask, wherein the classification loss weights and regression loss weights are set according to a multi-task learning strategy, and the regression loss weights are gradually reduced in the later stages of training.
[0020] Compared with the prior art, the present invention has the following beneficial effects:
[0021] (1) By constructing a dual-branch feature extraction structure for infrared and visible light, this invention can simultaneously utilize the thermal target information of the infrared mode and the texture structure information of the visible light mode, thereby improving the accuracy and robustness of target detection in complex environments.
[0022] (2) The present invention sets up a cross-modal feature fusion module, which enhances the information interaction capability between different modalities and improves the expressive capability of fused features through feature alignment, cross attention interaction and channel attention weighted fusion.
[0023] (3) The present invention uses MobileOneNet to replace the backbone network of YOLOv8 and introduces the GSConv module in the neck network, which effectively reduces the number of model parameters and computational complexity, improves the real-time performance of the model, and facilitates deployment on mobile terminals and edge devices.
[0024] (4) The present invention adopts a joint loss function that combines Varifocal Loss and CIoU Loss, which takes into account both classification confidence learning and accurate bounding box regression, thereby further improving the target detection performance.
Description of the Drawings
[0025] Figure 1 is a schematic diagram of the overall process of the lightweight target detection method based on infrared and visible light multimodal fusion according to the present invention.
[0026] Figure 2 is a schematic diagram of the overall structure of the dual-branch feature extraction network and the lightweight target detection network in this invention.
[0027] Figure 3 is a schematic diagram of the cross-modal feature fusion module in this invention.
[0028] Figure 4 is a schematic diagram of the GSConv module in this invention.
[0029] Figure 5 is a schematic diagram of the target detection results of the method of the present invention in a nighttime scene.
[0030] Figure 6 illustrates the target detection results of the method of the present invention under different complex environments.
Detailed Implementation
[0031] The present invention will now be described in further detail with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are for illustrative purposes only and are not intended to limit the scope of the invention.
[0032] In this embodiment, the present invention proposes a lightweight target detection method based on infrared and visible light multimodal fusion, which includes five stages: data acquisition and preprocessing, dual-branch feature extraction, cross-modal feature fusion, lightweight detection network prediction, and joint loss function training. The overall process is shown in Figure 1. The method can be applied to scenarios such as nighttime surveillance, intelligent security, autonomous driving assisted perception, and edge vision terminals, and is especially suitable for target detection tasks under low light, target occlusion, complex backgrounds, or drastically changing lighting conditions.
[0033] First, an infrared image and its corresponding visible light image of the scene to be detected are acquired, forming an input image pair. Preferably, the infrared and visible light images are paired images acquired at the same time and in the same scene to ensure consistency between the two modal images in both time and space dimensions. For image pairs with slight field-of-view deviations, preprocessing methods such as image cropping, registration, or resizing can be used to improve the subsequent feature fusion effect.
[0034] The preprocessing includes: unifying the size of infrared and visible light images to meet the network input size requirements; normalizing image pixel values; and performing data format conversion according to the training framework requirements. During the training phase, data augmentation operations such as random flipping, random scaling, brightness perturbation, and noise injection can be performed on the input image pairs to improve the model's generalization ability. Preferably, the input image size can be set to 640×640.
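The preprocessing in this paragraph can be sketched in a few lines. This is a minimal illustration, not the embodiment's implementation: it assumes single-channel images stored as nested lists, uses nearest-neighbour resizing, and implements only one augmentation (horizontal flip). The function names are hypothetical. The point it illustrates is that any geometric augmentation has to be applied identically to both modalities so the pair stays spatially aligned.

```python
import random

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of a 2-D list of pixel values."""
    in_h, in_w = len(img), len(img[0])
    return [[img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]

def normalize(img, max_value=255.0):
    """Scale pixel values into [0, 1]."""
    return [[p / max_value for p in row] for row in img]

def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]

def preprocess_pair(ir, vis, size=(640, 640), train=False, rng=None):
    """Resize and normalize both images of a pair.

    When train=True, one random decision drives the flip for BOTH
    images, so the infrared and visible images remain aligned.
    """
    h, w = size
    ir = normalize(resize_nearest(ir, h, w))
    vis = normalize(resize_nearest(vis, h, w))
    if train:
        rng = rng or random.Random()
        if rng.random() < 0.5:
            ir, vis = hflip(ir), hflip(vis)
    return ir, vis
```

A real pipeline would use an image library and three-channel visible images; the shared random decision is the part that carries over.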
[0035] After preprocessing, the infrared and visible light images are input into a dual-branch feature extraction network. This network comprises an infrared branch and a visible light branch. The infrared branch extracts thermal radiation-related semantic features, while the visible light branch extracts texture, edge, and structural features. Since infrared images exhibit strong target saliency in low-light environments, and visible light images offer richer detail, independent feature extraction via a dual-branch structure is beneficial for improving the sufficiency of multimodal information representation.
[0036] In a preferred embodiment, the dual-branch feature extraction network can employ feature extraction sub-networks with the same structure but independent parameters to adapt to the statistical characteristics of infrared and visible light images respectively. In another embodiment, the dual-branch network can also be constructed using a method of partially sharing parameters and partially having independent parameters to take into account both parameter scale and modal differences. Regardless of the form used, it should be ensured that the output features of the two branches can be mapped to the same spatial resolution and channel dimension before subsequent fusion.
[0037] After completing the dual-branch feature extraction, the infrared modal features and visible light modal features are input into the cross-modal feature fusion module to achieve information interaction and complementary enhancement between different modalities, as shown in Figure 3. The cross-modal feature fusion module includes three sub-processes: feature alignment, cross-modal interaction, and channel-weighted fusion.
[0038] During feature alignment, 1×1 convolutions are used to map the infrared and visible light modal features to ensure consistent feature dimensions. Simultaneously, convolutional mapping adjusts the number of channels in the output features of different branches, providing a unified input for subsequent cross-modal interactions. If necessary, upsampling, downsampling, or interpolation can be used to further align the spatial resolution.
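A 1×1 convolution is a linear map over channels applied independently at every spatial position, which is why it can bring two branches to a common channel count. The sketch below shows this for a feature stored as `[C_in][H][W]` nested lists. The function name and the layout are assumptions made for illustration.

```python
def conv1x1(feature, weight, bias=None):
    """Apply a 1x1 convolution.

    feature: [C_in][H][W] nested lists
    weight:  [C_out][C_in] channel-mixing matrix
    returns: [C_out][H][W]
    """
    c_in = len(feature)
    h, w = len(feature[0]), len(feature[0][0])
    c_out = len(weight)
    bias = bias or [0.0] * c_out
    out = []
    for o in range(c_out):
        # Each output pixel mixes the C_in values at the same position.
        plane = [[bias[o] + sum(weight[o][i] * feature[i][y][x]
                                for i in range(c_in))
                  for x in range(w)]
                 for y in range(h)]
        out.append(plane)
    return out
```

With a separate `weight` per branch, features with different channel counts can be projected to the same `C_out` before cross-modal interaction. Spatial size is unchanged by this operation, so any resolution mismatch still needs the resampling mentioned in the paragraph.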
[0039] In the cross-modal interaction process, a cross-attention mechanism is used to model the correlation between the two modal features. Specifically, a query vector Q is constructed using one modal feature, and a key vector K and a value vector V are constructed using the other modal feature. The response relationship between the modalities is calculated through attention weights to obtain interaction features containing complementary information. This approach enables the network to establish a closer connection between infrared thermal target information and visible light texture structure information, improving the target representation capability in complex environments. Given the infrared modal feature and the visible light modal feature, the query vector Q, key vector K, and value vector V can be represented as:
[0040]
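As an illustration of the cross-attention interaction described in paragraph [0039], the sketch below computes Q from one modality's tokens and K, V from the other's, then applies softmax attention. Several things here are assumptions and are not stated in the text: features are flattened to token lists of shape `[N][C]`, the projections are plain matrices `w_q`, `w_k`, `w_v`, the scores are scaled by 1/sqrt(d) as in standard scaled dot-product attention, and which modality supplies the query is left to the caller.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(f_query, f_context, w_q, w_k, w_v):
    """Cross-attention between two modalities.

    f_query:   [N][C] tokens of the modality that asks (produces Q)
    f_context: [M][C] tokens of the other modality (produces K and V)
    returns:   (interaction features [N][d], attention weights [N][M])
    """
    q = matmul(f_query, w_q)
    k = matmul(f_context, w_k)
    v = matmul(f_context, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))  # assumed standard scaling
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
              for qi in q]
    attn = [softmax(r) for r in scores]
    return matmul(attn, v), attn
```

Calling it once with infrared tokens as `f_query` and once with visible tokens as `f_query` would give a symmetric, two-way interaction; the text does not say whether one or both directions are used.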
[0041] In the channel-weighted fusion sub-process, the interaction features are concatenated with the original modal features and then input into the channel attention module. The fused features are recalibrated using channel weights to enhance the response of key channels and suppress redundant features. Preferably, the channel attention module is implemented using a squeeze-and-excitation (SE) module.
[0042] The squeeze-and-excitation (SE) module includes a global average pooling layer, a first fully connected layer, a second fully connected layer, and an activation function layer. Specifically, global average pooling is first performed on the input concatenated features to obtain a global channel description vector. The number of channels is then compressed through the first fully connected layer, followed by a ReLU activation function, and restored through the second fully connected layer, followed by a Sigmoid activation function, which generates the weight coefficient for each channel. Preferably, the compression ratio r is 16. In other words, the interaction features are concatenated with the original modal features to obtain the fused input features, the SE module computes channel weights from them, and the fused input features are weighted channel by channel, as shown in the following expression:
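The channel weighting described above follows the usual squeeze-and-excitation pattern: pool each channel to a scalar, pass the vector through two fully connected layers (ReLU, then Sigmoid), and scale each channel by its weight. The sketch below is a bias-free illustration with hypothetical function names. It assumes the second layer outputs one weight per input channel so that channel-by-channel weighting is well defined; the specific channel counts given in the text are not reproduced here.

```python
import math

def se_weights(feature, w1, w2):
    """Compute channel weights.

    feature: [C][H][W] concatenated features
    w1:      [C // r][C] first (compressing) fully connected layer
    w2:      [C][C // r] second (restoring) fully connected layer
    returns: list of C weights, each in (0, 1)
    """
    # Squeeze: global average pooling per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature]
    # Excitation: FC + ReLU, then FC + Sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w1]
    return [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
            for row in w2]

def se_apply(feature, weights):
    """Scale every channel of the feature by its weight."""
    return [[[weights[c] * p for p in row] for row in ch]
            for c, ch in enumerate(feature)]
```

In compact form this is w = Sigmoid(W2 · ReLU(W1 · GAP(F))) followed by F' = w ⊙ F, with the symbols chosen here for illustration only.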
[0043] Through the aforementioned cross-modal feature fusion module, spatial alignment, semantic interaction, and channel recalibration of modal features can be achieved simultaneously, making full use of the complementary information between infrared and visible light modes, thereby improving the ability of fused features to express weak targets, low-contrast targets, and partially occluded targets.
[0044] As shown in Figure 2, the fused features are input into a lightweight object detection network for subsequent detection. This lightweight object detection network is based on the YOLOv8 detection framework and consists of three parts: a backbone network, a neck network, and a detection head. The backbone network extracts deep semantic features, the neck network performs multi-scale feature fusion, and the detection head outputs the object category and location prediction results.
[0045] In this invention, to reduce the number of model parameters and inference latency, the original YOLOv8 backbone network is replaced with MobileOneNet. MobileOneNet is composed of multiple stacked MobileOne Blocks. During training, it employs a multi-branch structure, and during inference, structural reparameterization merges the multiple branches into a single convolutional structure, thereby reducing inference complexity while maintaining feature extraction capabilities.
[0046] In a preferred embodiment, during the training phase the MobileOne Block includes a 3×3 convolution main branch, a 1×1 convolution auxiliary branch, a BatchNorm branch, and an optional skip connection branch; during the model deployment phase, the multiple branches are equivalently merged into a single-path convolution operation through parameter folding to reduce runtime memory access overhead and improve inference speed.
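The parameter folding mentioned above relies on two facts: inference-time BatchNorm is an affine map that can be absorbed into the preceding convolution, and a 1×1 kernel (or an identity) can be zero-padded into a 3×3 kernel so that parallel branches sum into one kernel. The sketch below demonstrates this for a single channel with bias-free convolutions. It is a simplified illustration with hypothetical function names, not the MobileOne reference code.

```python
def conv2d_same(img, kernel, bias=0.0):
    """Single-channel 3x3 convolution with zero padding ('same' size)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = bias
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[dy + 1][dx + 1] * img[yy][xx]
            out[y][x] = acc
    return out

def batchnorm(img, gamma, beta, mean, var, eps=1e-5):
    """Inference-time BatchNorm using fixed running statistics."""
    s = gamma / (var + eps) ** 0.5
    return [[(p - mean) * s + beta for p in row] for row in img]

def fold_bn(kernel, gamma, beta, mean, var, eps=1e-5):
    """Absorb BatchNorm into a bias-free conv; returns (kernel, bias)."""
    s = gamma / (var + eps) ** 0.5
    return [[k * s for k in row] for row in kernel], beta - mean * s

def pad_1x1(k):
    """Embed a 1x1 kernel value at the centre of a 3x3 kernel."""
    return [[0.0, 0.0, 0.0], [0.0, k, 0.0], [0.0, 0.0, 0.0]]

def merge_branches(branches):
    """Sum folded (kernel, bias) pairs into one 3x3 kernel and one bias."""
    kernel = [[sum(b[0][i][j] for b in branches) for j in range(3)]
              for i in range(3)]
    return kernel, sum(b[1] for b in branches)
```

The BatchNorm-only skip branch is handled as `fold_bn(pad_1x1(1.0), ...)`, since an identity is a 1×1 convolution with weight 1. After merging, one call to `conv2d_same` reproduces the sum of all training-time branches.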
[0047] To further reduce computational overhead, the GSConv module replaces the traditional convolutional module in the neck network. The GSConv module combines the advantages of standard convolution and lightweight convolution, reducing the number of parameters and floating-point operations while maintaining feature representation capabilities, making it suitable for multi-scale feature fusion processes. Introducing the GSConv module into the neck network further enhances the overall network's lightweight nature and real-time detection capabilities.
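To make the parameter saving concrete, the sketch below counts weights for a standard convolution and for a GSConv-style layer. It assumes the layout commonly described for GSConv, which the text does not spell out: a standard convolution producing half the output channels, a 5×5 depthwise convolution on that half, concatenation of the two halves, and a channel shuffle. Biases and normalization parameters are ignored, and the function names are hypothetical.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a bias-free k x k convolution."""
    return (c_in // groups) * c_out * k * k

def gsconv_params(c_in, c_out, k=3, dw_k=5):
    """Weight count of a GSConv-style layer (assumed layout, see lead-in)."""
    half = c_out // 2
    dense = conv_params(c_in, half, k)                       # standard conv to C_out / 2
    depthwise = conv_params(half, half, dw_k, groups=half)   # one filter per channel
    return dense + depthwise

def channel_shuffle(channels, groups=2):
    """Interleave channels from equal chunks: [a0, a1, b0, b1] -> [a0, b0, a1, b1]."""
    n = len(channels) // groups
    return [channels[g * n + i] for i in range(n) for g in range(groups)]
```

For 128 input and 128 output channels with a 3×3 kernel, `conv_params(128, 128, 3)` gives 147456 weights and `gsconv_params(128, 128)` gives 75328, roughly half. The shuffle mixes the densely computed and depthwise-computed halves so later layers see both.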
[0048] Multi-scale fusion features processed by the backbone network and neck network are input into the detection head, which outputs target category information and bounding box location information. Preferably, the detection head can adopt a decoupled head structure to perform the prediction tasks of the classification branch and the regression branch separately, thereby improving the expressive power and optimization efficiency of the classification and localization tasks.
[0049] During the model training phase, this invention employs a joint loss function to optimize the object detection network. The joint loss function includes Varifocal Loss for the classification subtask and CIoU Loss for the bounding box regression subtask. Varifocal Loss enhances the consistency between classification confidence and object quality, making the network focus more on high-quality positive samples; CIoU Loss comprehensively considers the overlap between predicted and ground truth boxes, center distance, and aspect ratio differences, thereby improving bounding box localization accuracy. The joint loss function can be expressed as:
[0050]
[0051]
[0052]
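The two loss terms named in paragraph [0049] can be written out using their commonly published definitions. This is a sketch under assumptions, not the patent's own expressions: boxes are `(x1, y1, x2, y2)` tuples, the Varifocal Loss defaults `alpha=0.75` and `gamma=2.0` are typical values rather than values taken from the text, and the weights `w_cls` and `w_reg` are hypothetical names for the classification and regression loss weights.

```python
import math

def iou_and_ciou_loss(pred, gt, eps=1e-9):
    """Return (IoU, CIoU loss) for two boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1
    iou = inter / (pw * ph + gw * gh - inter + eps)
    # Squared centre distance, normalized by the squared diagonal
    # of the smallest box enclosing both boxes.
    rho2 = ((px1 + px2 - gx1 - gx2) ** 2 + (py1 + py2 - gy1 - gy2) ** 2) / 4.0
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (math.atan(gw / (gh + eps))
                                - math.atan(pw / (ph + eps))) ** 2
    alpha = v / (1.0 - iou + v + eps)
    return iou, 1.0 - iou + rho2 / c2 + alpha * v

def varifocal_loss(p, q, alpha=0.75, gamma=2.0, eps=1e-9):
    """p: predicted score in (0, 1); q: IoU-aware target (0 for negatives)."""
    p = min(max(p, eps), 1.0 - eps)
    if q > 0:
        # Positives: binary cross-entropy against q, weighted by q.
        return -q * (q * math.log(p) + (1.0 - q) * math.log(1.0 - p))
    # Negatives: down-weighted by p ** gamma.
    return -alpha * (p ** gamma) * math.log(1.0 - p)

def joint_loss(p, q, pred_box, gt_box, w_cls=1.0, w_reg=1.0):
    """Weighted sum of the classification and regression losses."""
    _, l_reg = iou_and_ciou_loss(pred_box, gt_box)
    return w_cls * varifocal_loss(p, q) + w_reg * l_reg
```

The joint loss here has the form L = w_cls · L_VFL + w_reg · L_CIoU for a single prediction; a training loop would average it over all assigned predictions.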
[0053] In a preferred embodiment, the classification loss weights and regression loss weights are set according to a multi-task learning strategy. In the initial stage, a higher classification loss weight and a relatively lower regression loss weight can be used to prioritize target recall and class discrimination capabilities. As training progresses, especially in the later stages, the regression loss weight can be gradually reduced to minimize the interference of the regression branch on the classification gradient, thereby improving the model's convergence stability and final detection accuracy.
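One way to realize the weighting strategy of this paragraph is a simple epoch-based schedule: keep the classification weight fixed and higher than the regression weight, hold the regression weight during the early epochs, and decay it linearly near the end. The sketch below is purely illustrative. The function name, the linear shape, and all numeric defaults are assumptions; the text does not give concrete values.

```python
def loss_weights(epoch, total_epochs, decay_start,
                 w_cls=1.0, w_reg_start=0.5, w_reg_end=0.25):
    """Return (classification weight, regression weight) for an epoch.

    The regression weight is constant before `decay_start`, then
    decreases linearly to `w_reg_end` at the final epoch.
    """
    if epoch < decay_start:
        return w_cls, w_reg_start
    span = max(1, total_epochs - 1 - decay_start)
    t = min(1.0, (epoch - decay_start) / span)
    return w_cls, w_reg_start + t * (w_reg_end - w_reg_start)
```

For example, `loss_weights(epoch, 100, 70)` keeps the regression weight at its starting value for epochs 0 to 69 and lowers it gradually over the last 30 epochs. A cosine or step decay would fit the description equally well.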
[0054] In the specific training implementation, paired infrared and visible light image data can be divided into training, validation, and test sets, and network training and parameter updates are completed based on a deep learning training framework. After training, a lightweight target detection model capable of utilizing both infrared and visible light information is obtained. During the inference phase, this model only requires the input of the infrared and visible light images corresponding to the scene to be detected to output the target category and its location. Its target detection results in nighttime scenes are shown in Figure 5.
[0055] In one exemplary embodiment, the method of the present invention can be applied to pedestrian detection tasks at night. Compared with single-modal detection methods that use only infrared images or only visible light images, the infrared and visible light multimodal fusion method proposed in this invention can make fuller use of the complementary characteristics between different modalities and achieve better detection results in nighttime, low-light, and target occlusion scenarios.
[0056] In another embodiment, the method of the present invention can also be extended to various scenarios such as vehicle detection, industrial defect detection, perimeter security monitoring, and environmental perception of unmanned platforms. For different application scenarios, only the training data and detection categories need to be adjusted according to the actual target category and data acquisition method, without changing the overall technical route of dual-branch extraction, cross-modal fusion, and lightweight detection proposed in this invention. Its target detection results under different complex environments are shown in Figure 6.
[0057] It should be noted that the dataset types, input image sizes, training parameter settings, number of detection categories, and application scenarios involved in the above embodiments are merely illustrative examples and do not constitute a limitation on the scope of protection of this invention. Those skilled in the art can make appropriate adjustments according to actual application needs.
[0058] Those skilled in the art will also understand that, without departing from the spirit and essence of this invention, any equivalent substitution or conventional transformation made to the dual-branch network structure, cross-modal interaction order, channel attention implementation, lightweight backbone network replacement method, lightweight convolution module replacement position, and joint loss function weight adjustment strategy should be considered to fall within the protection scope of this invention.
Claims
1. A lightweight target detection method based on infrared and visible light multimodal fusion, characterized in that it includes the following steps: Step 1: Acquire the infrared image and the corresponding visible light image of the scene to be detected, and preprocess the infrared image and the visible light image to obtain the input image pair; Step 2: Input the input image pairs into a dual-branch feature extraction network to extract infrared modal features and visible light modal features; Step 3: Input the infrared modal features and the visible light modal features into the cross-modal feature fusion module to perform feature alignment, cross-modal interaction and weighted fusion to obtain the fused features; Step 4: Input the fused features into a lightweight target detection network for multi-scale feature fusion and target prediction, and output target category information and target location information; Step 5: Train the lightweight object detection network using a joint loss function that combines classification loss and bounding box regression loss to obtain a trained object detection model, and then use the object detection model to detect the target.
2. The lightweight target detection method based on infrared and visible light multimodal fusion according to claim 1, characterized in that: The preprocessing in step 1 includes unifying and normalizing the size of the infrared and visible light images, and ensuring that the two modal images correspond to each other in the time and spatial dimensions.
3. The lightweight target detection method based on infrared and visible light multimodal fusion according to claim 1, characterized in that: The dual-branch feature extraction network in step 2 performs independent feature extraction on the infrared image and the visible light image, respectively. The infrared branch is used to extract semantic information related to thermal radiation, and the visible light branch is used to extract texture and structural information.
4. The lightweight target detection method based on infrared and visible light multimodal fusion according to claim 1, characterized in that: The cross-modal feature fusion module in step 3 includes: Step 3.1: Applying 1×1 convolution to the infrared modal features and visible light modal features respectively for channel mapping and spatial alignment; Step 3.2: Performing cross-modal interaction based on a cross-attention mechanism, where a query vector is generated from one modal feature, and a key vector and value vector are generated from another modal feature to obtain the interaction feature; Step 3.3: Concatenating the interaction feature with the original modal feature and inputting it into the channel attention module, calculating the weights of each channel, and then weighting the concatenated feature to obtain the fused feature.
5. The lightweight target detection method based on infrared and visible light multimodal fusion according to claim 4, characterized in that: The channel attention module is a squeeze-and-excitation (SE) module. The SE module generates channel weights through global average pooling and two fully connected layers. The first fully connected layer is used for dimensionality reduction, and the second fully connected layer is used for dimensionality increase. The channel weights are output through an activation function.
6. The lightweight target detection method based on infrared and visible light multimodal fusion according to claim 1, characterized in that: The lightweight object detection network in step 4 is based on the YOLOv8 detection framework. The original backbone network is replaced with the lightweight backbone network MobileOneNet to reduce the number of model parameters and inference latency while maintaining feature extraction capabilities.
7. The lightweight target detection method based on infrared and visible light multimodal fusion according to claim 6, characterized in that: The lightweight object detection network uses the GSConv module to replace the traditional convolutional module in its neck network, in order to reduce computational cost and model complexity, and improve the efficiency of multi-scale feature fusion.
8. The lightweight target detection method based on infrared and visible light multimodal fusion according to claim 1, characterized in that: The joint loss function in step 5 includes Varifocal Loss for the classification subtask and CIoU Loss for the bounding box regression subtask. Varifocal Loss is used to improve the consistency between classification confidence and target quality, while CIoU Loss is used to comprehensively optimize the overlap, center point distance, and aspect ratio difference between the predicted box and the ground truth box.
9. The lightweight target detection method based on infrared and visible light multimodal fusion according to claim 8, characterized in that: In the joint loss function, the classification loss weights and regression loss weights are set according to a multi-task learning strategy to keep the classification sub-tasks and regression sub-tasks balanced during training; and the regression loss weights are reduced in the later stages of training to reduce the interference of the regression task on the classification gradient and improve the model convergence accuracy.
10. The lightweight target detection method based on infrared and visible light multimodal fusion according to claim 1, characterized in that: The target detection method is applicable to pedestrian or target detection tasks in nighttime, low light, fog, or other complex environments, and can be deployed on mobile terminals or edge computing devices.