A low-light pedestrian detection method based on multi-modal fusion

By employing a multimodal fusion method, combining feature extraction from infrared and RGB images with a global attention mechanism, the accuracy and real-time performance issues of pedestrian detection in low-light environments are addressed, achieving highly efficient pedestrian detection results.

CN116311355B · Active · Publication Date: 2026-06-19 · HUBEI UNIV +2


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: HUBEI UNIV
Filing Date: 2023-02-14
Publication Date: 2026-06-19

Application Information


IPC: G06V40/10; G06V40/20; G06V10/44; G06V10/80; G06V10/82

AI Tagging

Application Domain

Biometric pattern recognition



Abstract

This invention discloses a low-light pedestrian detection method based on multimodal fusion, comprising an image input end, an image processing end, and an image output end. The image input end feeds RGB and infrared images from an external imaging device to the image processing end. The image processing end outputs the detection result based on the fused features to the image output end, which displays the final detection result. This method can improve the accuracy of pedestrian recognition in low-light conditions, while ensuring that the model's real-time performance and overall performance are superior to traditional algorithms.


Description

Technical Field

[0001] This invention relates to the field of computer vision target detection, specifically to a low-light pedestrian detection method based on multimodal fusion.

Background Technology

[0002] Object detection, as one of the fundamental tasks in computer vision, has experienced rapid development in recent years. Mature object detection algorithms are widely used in fields such as autonomous driving, remote sensing image detection, and video surveillance. Obtaining high-quality images of the target object through imaging devices such as cameras can help detection algorithms achieve better detection results. However, in practical applications, imaging devices are affected by the imaging environment, thus impacting image quality. In low-light environments, imaging instruments capture insufficient light, leading to image quality degradation, resulting in decreased contrast, color distortion, and low signal-to-noise ratio. This severely affects the accuracy of object detection algorithms and greatly increases the difficulty of applying them in practice.

[0003] Low-light object detection based on image enhancement has become a relatively mature technique in recent years. These algorithms enhance images with methods such as Retinex and Generative Adversarial Networks (GANs) to recover image detail before passing them to the object detection algorithm, which effectively improves detection accuracy. However, this is an end-to-end cascaded approach that requires excessive computation time, and the simple cascade decouples image enhancement from detection. Another approach combines low-light image annotation with detection: the network is trained on a large number of labeled low-light images to obtain a model that performs well in low-light conditions. However, this requires large manually labeled low-light datasets, and the datasets currently available are too small for model training. To trade off real-time performance against detection accuracy, a technique fusing attention with a lightweight head network has been proposed, but it offers limited improvement in accuracy and is difficult to apply in real-world scenarios.

Summary of the Invention

[0004] The purpose of this invention is to address the shortcomings of existing technologies by providing a low-light pedestrian detection method based on multimodal fusion. This method can improve the accuracy of pedestrian recognition in low-light conditions, while ensuring that the model's real-time performance and overall performance are superior to traditional algorithms.

[0005] The technical solution to achieve the objective of this invention is:

[0006] A low-light pedestrian detection method based on multimodal fusion includes an image input end, an image processing end, and an image output end. The image input end feeds an RGB image $I_R$ and an infrared image $I_T$, captured by an external imaging device, to the image processing end; the image processing end outputs detection results based on the fused features of $I_R$ and $I_T$; and the results are displayed at the image output end, wherein:

[0007] 1) The specific steps of the image input end are as follows:

[0008] 1-1) Obtain $I_R$ and $I_T$ through a multimodal sensor as input; $I_R$ and $I_T$ are image pairs that correspond exactly in space and time;

[0009] 1-2) Image preprocessing: design the affine transformation matrix Γ(x,y) and apply it to $I_R$ and $I_T$ to perform random cropping, rotation, and splicing operations;

[0010] 2) The specific steps of the image processing end are as follows:

[0011] 2-1) Image Reshape operation: after preprocessing at the input end, $I_R$ and $I_T$ are reshaped to a size of 640*640*3 under strict spatio-temporal correspondence;

[0012] 2-2) Front-end Feature Extraction (FFE) module: input the $I_R$ and $I_T$ processed in step 2-1) into the FFE module, where h1 denotes the original features of $I_R$, from which feature h3 is obtained after two feature extractions, and F1 denotes the original features of $I_T$, from which feature F3 is obtained after two feature extractions. Specifically, the input images are spatially transformed into the HSV and RGB spaces, and features are extracted from each. The 640*640*3 inputs $I_R$ and $I_T$ are sliced and reduced to 160*160, and the channels are expanded to 256 by splicing to obtain a 160*160*256 feature map. A convolution with a kernel size of (3,3), padding of 1, stride of 1, and 128 kernels then yields the output features h3 and F3, each of size 160*160*128.

[0013] 2-3) Cross-Modality Feature Fusion (CMFF) Framework: The h3 and F3 obtained in step 2-2) are input into the CMFF framework for feature fusion, as shown in formula (1):

[0014]

[0015] where $F_R$ and $F_T$ denote the feature maps of $I_R$ and $I_T$, respectively; the network feature extraction functions $f_R(\cdot)$ and $f_T(\cdot)$ generate the feature maps of the input images of the different modalities; and $F_{Fused}$ is the fused feature map. The CMFF feature fusion process uses a global attention mechanism as the fusion function and proceeds as follows:

[0016] 2-3-1) Input data concatenation module: first, the input data h3 and F3 are each reshaped to 25600*128; then they are concatenated along the second dimension to obtain the 51200*128 input feature $K_{RT}$;

[0017] 2-3-2) Global attention mechanism module: $K_{RT}$ is fed into a network module with a three-layer global attention mechanism to calculate three sets of association weights (S, O, V), as shown in formula (2):

[0018] $S = IW_S, \quad O = IW_O, \quad V = IW_V \quad (2)$,

[0019] where $W_S$, $W_O$ and $W_V$ are the weights of the three mapping spaces; in the CMFF module, the corresponding dimensions $D_S$, $D_O$ and $D_V$ are equal;

[0020] 2-3-3) Output calculation module: based on the three calculated association weights, the output result is as follows:

[0021]

[0022] where the scaling factor is used to prevent the function from getting trapped in a local optimum. With preset network hyperparameters, P is the fusion weight obtained after one CMFF module calculation. The final output result $P''$ is calculated as shown in formula (4):

[0023] $P'' = 4@(\mathrm{CMFF}(K_{RT})) \quad (4)$,

[0024] where 4@ indicates passing through the CMFF module four times, and $K_{RT}$ is the input fused feature;

[0025] 2-3-4) Input the feature $P''$ obtained in step 2-3-3) into the feature pyramid network FPN (Feature Pyramid Network) for prediction to obtain the final prediction result;

[0026] 3) The processing procedure of the image output end is as follows: based on the prediction result obtained from the image processing end, the imaging device outputs and marks the pedestrian bounding box in real time. The output result includes the pedestrian bounding box and the prediction confidence.
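The expressions for formulas (1) and (3) are not reproduced in the text above. The following is only a plausible reading, assuming that formula (1) composes the definitions given in [0015] and that the output calculation in step 2-3-3) is standard scaled dot-product attention. The symbol $\Phi$ for the fusion function and the scaling factor $1/\sqrt{D_O}$ are assumptions and do not come from the source.

```latex
% Assumed form of formula (1): fusion of the two modality feature maps
F_{Fused} = \Phi(F_R, F_T), \qquad F_R = f_R(I_R), \quad F_T = f_T(I_T) \tag{1}

% Assumed form of formula (3): scaled dot-product attention over S, O, V
P = \operatorname{softmax}\!\left(\frac{S\,O^{\top}}{\sqrt{D_O}}\right) V \tag{3}
```

Under this reading, S, O and V play the roles usually called query, key and value, and the requirement that $D_S$, $D_O$ and $D_V$ are equal lets the output P be fed back into the next CMFF pass.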

[0027] Compared with existing technologies, the advantages of this technical solution are as follows:

[0028] 1. By combining the feature extraction capabilities of infrared images under low-light conditions, the detection performance of the algorithm under low-light conditions is enhanced;

[0029] 2. Compared with traditional enhance-then-detect methods, this technical solution can effectively save algorithm running time and improve the real-time performance of the algorithm.

[0030] This method can improve the accuracy of pedestrian recognition in low-light conditions, while ensuring that the model's real-time performance and overall performance are superior to traditional algorithms.

Attached Figure Description

[0031] Figure 1 is a flowchart of the method used in the embodiment;

[0032] Figure 2 is a schematic diagram of the multimodal feature fusion framework CMFF in the embodiment;

[0033] Figure 3 is a schematic diagram of the front-end feature extraction (FFE) module in the embodiment;

[0034] Figure 4 is a detailed feature extraction diagram of the FFE module in the embodiment.

Detailed Implementation

[0035] The present invention will be further described below with reference to the accompanying drawings and embodiments, but this is not intended to limit the scope of the invention.

[0036] Example:

[0037] Referring to Figure 1, a low-light pedestrian detection method based on multimodal fusion includes an image input end, an image processing end, and an image output end. The image input end feeds an RGB image $I_R$ and an infrared image $I_T$, captured by an external imaging device, to the image processing end; the image processing end outputs detection results based on the fused features of $I_R$ and $I_T$; and the results are displayed at the image output end, wherein:

[0038] 1) The processing procedure at the image input end is as follows:

[0039] 1-1) Acquire the RGB image $I_R$ and the infrared image $I_T$ through a multimodal sensor as input; $I_R$ and $I_T$ are paired images that correspond exactly in space and time;

[0040] 1-2) Image preprocessing: design the affine transformation matrix Γ(x,y) and apply it to $I_R$ and $I_T$ to perform random cropping, rotation, and splicing operations;

[0041] 2) The processing procedure of the image processing end is as follows:

[0042] 2-1) Image Reshape operation: after preprocessing at the input end, $I_R$ and $I_T$ are reshaped to a size of 640*640*3 under strict spatio-temporal correspondence;

[0043] 2-2) Front-end feature extraction (FFE) module: input the $I_R$ and $I_T$ processed in step 2-1) into the FFE module, where h1 denotes the original features of the RGB image, from which h3 is obtained after two feature extractions, and F1 denotes the original features of the infrared image, from which F3 is obtained after two feature extractions. Specifically, the input images are spatially transformed into the HSV and RGB spaces, and features are extracted from each. The 640*640*3 inputs $I_R$ and $I_T$ are sliced and reduced to 160*160, and the channels are expanded to 256 by splicing to obtain a 160*160*256 feature map. A convolution with a kernel size of (3,3), padding of 1, stride of 1, and 128 kernels then yields the output features h3 and F3, each of size 160*160*128.

[0044] 2-3) Feature Fusion Framework CMFF: Input h3 and F3 obtained in step 2-2) into the CMFF framework for feature fusion, as shown in formula (1):

[0045]

[0046] where $F_R$ and $F_T$ denote the feature maps of $I_R$ and $I_T$, respectively; the network feature extraction functions $f_R(\cdot)$ and $f_T(\cdot)$ generate the feature maps of the input images of the different modalities; and $F_{Fused}$ is the fused feature map. The CMFF feature fusion process uses a global attention mechanism as the fusion function and proceeds as follows:

[0047] 2-3-1) Input data concatenation module: first, the input data h3 and F3 are each reshaped to 25600*128; then they are concatenated along the second dimension to obtain the 51200*128 input feature $K_{RT}$;

[0048] 2-3-2) Global attention mechanism module: $K_{RT}$ is fed into a network module with a three-layer global attention mechanism to calculate three sets of association weights (S, O, V), as shown in formula (2):

[0049] $S = IW_S, \quad O = IW_O, \quad V = IW_V \quad (2)$,

[0050] where $W_S$, $W_O$ and $W_V$ are the weights of the three mapping spaces; in the CMFF module, the corresponding dimensions $D_S$, $D_O$ and $D_V$ are equal;

[0051] 2-3-3) Output calculation module: based on the three calculated association weights, the output result is as follows:

[0052]

[0053] where the scaling factor is used to prevent the function from getting trapped in a local optimum. With preset network hyperparameters, P is the fusion weight obtained after one CMFF module calculation. The final output result $P''$ is calculated as shown in formula (4):

[0054] $P'' = 4@(\mathrm{CMFF}(K_{RT})) \quad (4)$,

[0055] where 4@ indicates passing through the CMFF module four times, and $K_{RT}$ is the input fused feature;

[0056] 2-3-4) Input the feature $P''$ obtained in step 2-3-3) into the feature pyramid network FPN for prediction to obtain the final prediction result;

[0057] 3) The processing procedure of the image output end is as follows: based on the prediction result obtained from the image processing end, the imaging device outputs and marks the pedestrian bounding box in real time. The output result includes the pedestrian bounding box and the prediction confidence.
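Step 1-2) applies one affine transformation to both modalities, which keeps the paired images aligned after augmentation. The sketch below illustrates this for rotation only, assuming NumPy and nearest-neighbour sampling. The helper names `rotation_about_center` and `warp_affine`, the 64*64 toy size and the angle range are hypothetical; the source does not specify how Γ(x,y) is built, and cropping and splicing are left out.

```python
import numpy as np

def rotation_about_center(angle_deg, height, width):
    """3x3 homogeneous affine matrix rotating by angle_deg about the image center."""
    theta = np.deg2rad(angle_deg)
    c, s = np.cos(theta), np.sin(theta)
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    to_origin = np.array([[1, 0, -cx], [0, 1, -cy], [0, 0, 1]], dtype=float)
    rotate = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]], dtype=float)
    back = np.array([[1, 0, cx], [0, 1, cy], [0, 0, 1]], dtype=float)
    return back @ rotate @ to_origin

def warp_affine(image, matrix):
    """Warp an H*W*C image by inverse mapping; pixels mapped from outside become 0."""
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    dst = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    src = np.linalg.inv(matrix) @ dst           # source coordinate of each output pixel
    sx = np.rint(src[0]).astype(int)
    sy = np.rint(src[1]).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    flat = image.reshape(h * w, -1)
    out = np.zeros_like(flat)
    out[valid] = flat[sy[valid] * w + sx[valid]]
    return out.reshape(image.shape)

rng = np.random.default_rng(0)
i_r = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)  # stand-in RGB image
i_t = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)  # stand-in infrared image

# Draw the random parameters once and reuse the same matrix for both images.
gamma = rotation_about_center(rng.uniform(-15.0, 15.0), 64, 64)
i_r_aug = warp_affine(i_r, gamma)
i_t_aug = warp_affine(i_t, gamma)
```

Sampling the random parameters once per pair, rather than once per image, is what preserves the spatial correspondence that the later fusion steps rely on.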

[0058] In this example, $I_R$ and $I_T$ obtained from the multimodal sensor are fed into the image preprocessing module to obtain the initial features h1 and F1. These are then sent to the front-end feature extraction module FFE to obtain features h3 and F3, which are in turn sent to the CMFF module designed in this invention to fuse the infrared image features and the RGB image features. After passing through the four CMFF modules, the final output feature $P''$ is obtained and sent to the FPN detection network to obtain the final detection result.
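The tensor sizes quoted for the hand-off from FFE to CMFF can be checked with a short shape trace. This is a sketch that assumes NumPy and uses zero-filled arrays as stand-ins for h3 and F3; only the shapes are meaningful. The arrays here carry no batch axis, so the concatenation runs along axis 0; with a leading batch axis the same operation would presumably act on the second dimension, as the text describes.

```python
import numpy as np

# Hypothetical stand-ins for the FFE outputs; only the shapes matter here.
h3 = np.zeros((160, 160, 128), dtype=np.float32)  # RGB branch features
f3 = np.zeros((160, 160, 128), dtype=np.float32)  # infrared branch features

# Flatten the 160*160 spatial grid into 25600 positions per modality.
h3_flat = h3.reshape(-1, 128)
f3_flat = f3.reshape(-1, 128)

# Stack the two modalities along the position axis to form K_RT.
k_rt = np.concatenate([h3_flat, f3_flat], axis=0)

print(h3_flat.shape)  # (25600, 128)
print(k_rt.shape)     # (51200, 128)
```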

[0059] As shown in Figure 2, the input to the multimodal feature fusion framework is the RGB feature h3 and the infrared feature F3 obtained after image preprocessing and FFE module processing. First, the two features are input to the Reshape module for dimensionality reduction and then fed into Concat for concatenation. The concatenated features are fed into the three weight networks $W_S$, $W_O$ and $W_V$ to obtain the three key parameters (S, O, V) used to calculate the output P. After being weighted unequally and recombined, the three key parameters are fed into a global average pooling layer and an MLP layer to obtain the final output P.
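The attention part of this framework can be sketched in NumPy as follows. The sketch assumes that the output calculation is standard scaled dot-product attention with a scaling factor of 1/sqrt(D), that each of the four passes has its own weights, and that all three mapping dimensions equal the input channel count so that the passes can be chained. The function names and the toy sizes are hypothetical, and the global average pooling and MLP layers are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def cmff_pass(k, w_s, w_o, w_v):
    """One assumed CMFF pass: project to S, O, V, then apply scaled attention."""
    s, o, v = k @ w_s, k @ w_o, k @ w_v      # formula (2) with I = k
    scale = 1.0 / np.sqrt(w_o.shape[1])      # assumed scaling factor
    weights = softmax((s @ o.T) * scale)     # (N, N) global attention map
    return weights @ v                       # fused output P, shape (N, D)

def cmff_stack(k_rt, params):
    """Apply one pass per parameter set; four sets mirror 4@ in formula (4)."""
    p = k_rt
    for w_s, w_o, w_v in params:
        p = cmff_pass(p, w_s, w_o, w_v)
    return p

rng = np.random.default_rng(0)
n_tokens, dim = 32, 16                       # toy sizes, not 51200 and 128
k_rt = rng.standard_normal((n_tokens, dim))
params = [tuple(rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))
          for _ in range(4)]
p_final = cmff_stack(k_rt, params)
print(p_final.shape)  # (32, 16)
```

Toy sizes are used because a dense attention map over all 51200 positions would hold about 2.6 billion entries.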

[0060] As shown in Figure 3, the front-end feature extraction framework FFE slices the input 640*640*3 HSV space image and RGB space image respectively, reducing them to 160*160 in size. At the same time, it expands the channels to 256 through splicing to obtain a 160*160*256 feature map. It then performs a convolution with a kernel size of (3,3), padding of 1, stride of 1, and 128 kernels to obtain the output features h3 and F3, each with a size of 160*160*128.
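The output size of this convolution follows from the usual output-size formula, floor((n + 2p - k) / s) + 1 per spatial axis. The short check below confirms that a (3,3) kernel with padding 1 and stride 1 preserves the 160*160 grid and that 128 kernels give 128 output channels. The helper names are hypothetical, and the parameter count assumes one bias term per kernel, which the source does not state.

```python
def conv_output_size(n, kernel, padding, stride):
    """Spatial output length of a convolution along one axis."""
    return (n + 2 * padding - kernel) // stride + 1

def conv_param_count(in_ch, out_ch, kernel, bias=True):
    """Number of learnable weights in a square-kernel 2D convolution."""
    return kernel * kernel * in_ch * out_ch + (out_ch if bias else 0)

side = conv_output_size(160, kernel=3, padding=1, stride=1)
out_shape = (side, side, 128)               # 128 kernels give 128 channels
params = conv_param_count(256, 128, 3)      # bias per kernel is an assumption

print(out_shape)  # (160, 160, 128)
print(params)     # 295040
```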

[0061] As shown in Figure 4, the feature extraction slicing scheme in the FFE module divides each 640*640 single-channel image through its center into four 160*160 feature maps before extracting features.

Claims

1. A low-light pedestrian detection method based on multi-modal fusion, comprising an image input end, an image processing end and an image output end, characterized in that the image input end feeds an RGB image $I_R$ and an infrared image $I_T$, captured by an external imaging device, to the image processing end; the image processing end outputs detection results based on the fused features of $I_R$ and $I_T$; and the results are displayed at the image output end, wherein:
1) The specific steps of the image input end are as follows:
1-1) Acquire $I_R$ and $I_T$ as input, $I_R$ and $I_T$ being image pairs that fully correspond in space and time;
1-2) Image preprocessing: design an affine transformation matrix Γ(x,y) and apply it to $I_R$ and $I_T$ to perform random cropping, rotation, and stitching operations;
2) The specific steps of the image processing end are as follows:
2-1) Image Reshape operation: $I_R$ and $I_T$ are reshaped to a size of 640*640*3 under strict spatio-temporal correspondence;
2-2) Front-end feature extraction (FFE) module: input the $I_R$ and $I_T$ obtained in step 2-1) into the FFE module, where h1 denotes the original features of $I_R$, from which feature h3 is obtained after two feature extractions, and F1 denotes the original features of $I_T$, from which feature F3 is obtained after two feature extractions. The specific steps are: the inputs $I_R$ and $I_T$ are spatially transformed, and features are extracted from both the HSV and RGB spaces. The 640*640*3 inputs $I_R$ and $I_T$ are sliced and reduced to 160*160. Simultaneously, the channels are expanded to 256 by splicing to obtain a 160*160*256 feature map. A convolution with a kernel size of (3,3), padding of 1, stride of 1, and 128 kernels then yields the output features h3 and F3, each of size 160*160*128;
2-3) Feature fusion framework CMFF: h3 and F3 obtained in step 2-2) are input into the CMFF framework for feature fusion. The main modeling process is shown in formula (1), where $F_R$ and $F_T$ denote the feature maps of $I_R$ and $I_T$, respectively; the network feature extraction functions $f_R(\cdot)$ and $f_T(\cdot)$ generate the feature maps of the different modal inputs; and $F_{Fused}$ is the fused feature map. The CMFF feature fusion process uses a global attention mechanism as the fusion function and proceeds as follows:
2-3-1) Input data concatenation module: first, the input data h3 and F3 are each reshaped to 25600*128; then they are concatenated along the second dimension to obtain the 51200*128 input feature $K_{RT}$;
2-3-2) Global attention mechanism module: $K_{RT}$ is fed into a network module with a three-layer global attention mechanism to calculate three sets of association weights (S, O, V), as shown in formula (2): $S = IW_S, \quad O = IW_O, \quad V = IW_V \quad (2)$, where $W_S$, $W_O$ and $W_V$ are the weights of the three mapping spaces; in the CMFF module, the corresponding dimensions $D_S$, $D_O$ and $D_V$ are equal;
2-3-3) Output calculation module: the output result is calculated from the three association weights using a scaling factor. With preset network hyperparameters, P is the fusion weight obtained after one CMFF module calculation. The final output result $P''$ is calculated as shown in formula (4): $P'' = 4@(\mathrm{CMFF}(K_{RT})) \quad (4)$, wherein 4@ denotes passing through four CMFF modules, and $K_{RT}$ is the input fused feature;
2-3-4) Input the feature $P''$ obtained in step 2-3-3) into the feature pyramid network FPN for prediction to obtain the final prediction result;
3) The processing procedure of the image output end is as follows: based on the prediction result obtained from the image processing end, the imaging device is used to output and mark the pedestrian bounding box in real time. The output result includes the pedestrian bounding box and the prediction confidence.