Night infrared image pedestrian detection method and system based on wavelet feature modulation and HiLo attention mechanism

By combining wavelet feature modulation and HiLo attention mechanism, the problem of insufficient capture of detailed information and global structure in infrared image pedestrian detection is solved, and high-precision and low-complexity pedestrian detection effect is achieved.

CN122200751A · Pending · Publication Date: 2026-06-12 · HUNAN UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
HUNAN UNIV
Filing Date
2026-05-15
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing infrared image pedestrian detection methods struggle to effectively capture subtle details and large-scale structures in low-light environments, thus affecting detection accuracy.

Method used

A pedestrian detection method for nighttime infrared images using wavelet feature modulation and HiLo attention mechanism is proposed. The method decomposes infrared image features through wavelet feature modulation module and combines HiLo attention mechanism to process high-frequency and low-frequency features in target detection network to enhance feature capture capability.

Benefits of technology

It improves the accuracy and robustness of pedestrian detection in infrared images while reducing computational complexity, making it suitable for scenarios with high real-time requirements.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a nighttime infrared image pedestrian detection method and system based on wavelet feature modulation and a HiLo attention mechanism. The method comprises the following steps: acquiring an infrared image of a target scene; constructing a nighttime infrared image pedestrian detection network that includes a wavelet feature modulation module, a target detection network, and a detection head connected in sequence; inputting the infrared image into the network and obtaining wavelet-enhanced features through the wavelet feature modulation module; inputting the wavelet-enhanced features into the target detection network to obtain a decoded feature vector; and inputting the decoded feature vector into the detection head to predict the pedestrian confidence score and the pedestrian bounding box coordinates. By incorporating the HiLo attention mechanism, the application significantly reduces computational complexity while maintaining detection accuracy, and is particularly suitable for real-time pedestrian detection in low-illumination environments.

Description

Technical Field

[0001] This invention relates to the field of infrared image processing technology, and in particular to a method and system for pedestrian detection in nighttime infrared images based on wavelet feature modulation and a HiLo attention mechanism.

Background Technology

[0002] With the development of intelligent video surveillance technology, pedestrian detection methods based on infrared images have been widely used in low-light environments. Infrared images, as an image format that does not rely on visible light, have the advantage of working effectively at night or in adverse weather conditions. However, infrared images are characterized by low resolution, high noise, and loss of image detail, posing significant challenges to pedestrian detection.

[0003] Existing pedestrian detection methods are typically based on convolutional neural networks (CNNs) or Transformer models, which extract image features for classification and localization. However, in infrared images, due to the lack of clear details, existing feature extraction methods often fail to fully capture the detailed features of pedestrians, thus affecting the accuracy of pedestrian detection.

[0004] To improve detection accuracy, some methods attempt to combine different feature extraction techniques and attention mechanisms. However, improving the ability to capture both subtle details and large-scale structures in infrared images while preserving computational efficiency remains an open problem.

Summary of the Invention

[0005] This invention provides a method and system for pedestrian detection in nighttime infrared images based on wavelet feature modulation and HiLo attention mechanism to solve the technical problems mentioned in the background art.

[0006] To achieve the above objectives, the technical solution of the present invention is implemented as follows. The invention provides a method for pedestrian detection in nighttime infrared images based on wavelet feature modulation and a HiLo attention mechanism, specifically including the following steps:
S1. Acquire an infrared image of the target scene.
S2. Construct a pedestrian detection network for nighttime infrared images, including a wavelet feature modulation module, a target detection network based on the DETR (Detection Transformer) structure, and a detection head connected in sequence; the target detection network has a built-in HiLo attention mechanism.
S3. Input the acquired infrared image into the pedestrian detection network and obtain the wavelet-enhanced features through the wavelet feature modulation module.
S4. Input the wavelet-enhanced features into the target detection network to obtain the decoded feature vector.
S5. Input the decoded feature vector into the detection head to predict the pedestrian class probability and the pedestrian bounding box coordinates.

[0007] Furthermore, the wavelet feature modulation module includes a first convolutional layer, a second convolutional layer, a two-dimensional discrete wavelet transform, a wavelet component concatenation operation, a modulation network, an upsampling layer, an element-wise multiplication operation, and a channel-dimension concatenation operation. The first convolutional layer, the two-dimensional discrete wavelet transform, the wavelet component concatenation operation, the modulation network, and the upsampling layer are connected in series. The output of the upsampling layer is combined with the output of the second convolutional layer through the element-wise multiplication operation. The output of the element-wise multiplication operation, the output of the first convolutional layer, and the output of the second convolutional layer are joined through the channel-dimension concatenation operation. The modulation network consists of a third convolutional layer, an activation function, and a fourth convolutional layer connected in series.

[0008] Furthermore, the target detection network includes a backbone network, an improved Encoder module, and a Decoder module connected in sequence. The improved Encoder module replaces the original multi-head self-attention module with the HiLo attention mechanism. The HiLo attention mechanism includes a high-frequency attention branch and a low-frequency attention branch, whose outputs are joined through a concatenation operation.

[0009] Furthermore, the backbone network is a ResNet-50 network.

[0010] Furthermore, step S3 specifically includes the following steps:
S31. Input the infrared image into the wavelet feature modulation module, where it passes through the first convolutional layer and the second convolutional layer to obtain the fine features and the contextual features, respectively. Here H and W denote the height and width of the infrared image, ℝ denotes the set of real numbers, and C denotes the number of channels of the fine features.
S32. Apply the two-dimensional discrete wavelet transform to the fine features to decompose them into four components: the approximate low-frequency component, the horizontal high-frequency detail component, the vertical high-frequency detail component, and the diagonal detail component.
S33. Join the approximate low-frequency component, the horizontal high-frequency detail component, the vertical high-frequency detail component, and the diagonal detail component through the wavelet component concatenation operation to obtain the fused features.
S34. Input the fused features into the modulation network and then the upsampling layer to obtain the modulation features.
S35. Multiply the modulation features element-wise with the contextual features to obtain the modulated contextual features.
S36. Concatenate the modulated contextual features, the fine features, and the contextual features along the channel dimension to obtain the wavelet-enhanced features.

[0011] Furthermore, the fine features and the contextual features in S31 are the outputs of the first convolutional layer and the second convolutional layer, respectively; the first convolutional layer has a kernel size of 3×3 and the second convolutional layer has a kernel size of 7×7. In S32, DWT2 denotes the two-dimensional discrete wavelet transform, ∀ means "for all", and c denotes the c-th channel, so the transform is applied to every channel of the fine features. In S33, the fused features are obtained through the wavelet component concatenation operation. In S34, the modulation features are obtained by passing the fused features through the modulation network and then the upsampling layer; within the modulation network, the third convolutional layer is a 3×3 convolution with 4C input channels and C output channels, the activation function is SiLU, and the fourth convolutional layer is a 3×3 convolution with C input channels and C output channels. In S35, the modulated contextual features are obtained through the element-wise multiplication operation. In S36, the wavelet-enhanced features are obtained through the channel-dimension concatenation operation.

[0012] Furthermore, step S4 specifically includes the following steps:
S41. Input the wavelet-enhanced features f_enh into the target detection network, where they first pass through the backbone network to produce the feature map f_res. Here C' is the number of output channels of the backbone network, and H1 and W1 denote the height and width of the feature map f_res, respectively.
S42. Project the feature map f_res onto the hidden dimension d of the Transformer model through a 1×1 convolutional layer to obtain the hidden features f_proj, then add a learnable positional encoding to f_proj to obtain the position-encoded features f_pos: f_pos = f_proj + PE, where PE is the learnable positional encoding.
S43. Input the position-encoded features f_pos into the improved Encoder module to obtain the output features F_HiLo.
S44. Input the output features F_HiLo into the Decoder module to obtain the decoded feature vector Z, where N is the number of learnable object queries.

[0013] Furthermore, step S43 specifically includes the following steps:
S431. Using a preset attention head split ratio parameter, allocate the multi-head attention heads between the high-frequency attention branch and the low-frequency attention branch; then feed the position-encoded features f_pos into both branches. The allocation determines, from the total number of multi-head attention heads, how many heads are assigned to the high-frequency attention branch and how many to the low-frequency attention branch.
S432. In the high-frequency attention branch (Hi-Fi), divide the position-encoded features f_pos into multiple local windows, compute local multi-head self-attention (MSA_local) within each local window, and concatenate the window outputs according to their original spatial positions to obtain the output of the high-frequency attention branch. The query matrix, key matrix, and value matrix within the high-frequency attention branch are obtained with the corresponding learnable weight matrices.
S433. In the low-frequency attention branch (Lo-Fi), keep the query matrix Q2 unchanged and apply average-pooling (AvgPool) downsampling to the key matrix K2 and the value matrix V2 to obtain the compressed low-frequency features. Then, in the pooled low-frequency feature space, compute global multi-head self-attention (MSA_global) from the query matrix Q2, the pooled key matrix K2, and the pooled value matrix V2 to obtain the output of the low-frequency attention branch. Q2, K2, and V2 are obtained with the corresponding learnable weight matrices.
S434. Concatenate the output of the high-frequency attention branch and the output of the low-frequency attention branch along the attention head dimension, restoring a multi-head attention structure with the same total number of heads, to obtain the output features F_HiLo. The dimensions of F_HiLo are the same as those of the position-encoded features f_pos.

[0014] Furthermore, S44 specifically includes the following steps:
S441. Input the output features F_HiLo into the Decoder module, where they first serve as the key matrix K3 and the value matrix V3.
S442. Receive a fixed number of learnable object queries as the query matrix Q3. The learnable object queries are a set of trainable parameter vectors, each corresponding to a potential target region in the infrared image.
S443. Input the key matrix K3, the value matrix V3, and the query matrix Q3 into the Decoder module. The Decoder module first performs self-attention on the query matrix Q3, allowing the learnable object queries to interact with one another; it then uses the self-attention output as a new query and performs cross-attention with the key matrix K3 and the value matrix V3 to extract target information from the image features. Through multi-layer stacking, it progressively refines the feature representation corresponding to each learnable object query and outputs the decoded feature vector Z.

[0015] Furthermore, step S5 specifically includes the following steps:
S51. Input the decoded feature vector Z into the detection head, which includes a classification branch and a regression branch. The classification branch outputs the class probability y ∈ [0,1] through a fully connected layer, where 1 represents pedestrian and 0 represents background.
S52. The regression branch outputs the bounding box coordinates through another fully connected layer.
S53. Binarize the class probability y according to a preset threshold to determine whether each detection is a pedestrian, and output the number of detected pedestrians and the class label of each pedestrian.
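
The thresholding in S53 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `select_pedestrians`, the default threshold of 0.5, and the box tuples are assumptions, since the text only says the threshold is "preset" and does not specify the box format.

```python
def select_pedestrians(probs, boxes, threshold=0.5):
    """Binarize class probabilities and keep the detections labeled as pedestrian.

    probs: one class probability y in [0, 1] per object query.
    boxes: one bounding box per object query (format left opaque here).
    threshold: assumed value; the source only calls it a preset threshold.
    """
    kept = []
    for y, box in zip(probs, boxes):
        label = 1 if y >= threshold else 0  # 1 = pedestrian, 0 = background
        if label == 1:
            kept.append((y, box))
    return len(kept), kept


# Three object queries, two of which exceed the assumed threshold.
count, detections = select_pedestrians(
    [0.9, 0.2, 0.55],
    [(0.1, 0.1, 0.2, 0.4), (0.5, 0.5, 0.1, 0.1), (0.7, 0.3, 0.2, 0.5)],
)
```

Here `count` is the number of detected pedestrians and `detections` pairs each kept probability with its box.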

[0016] A second aspect of the present invention provides a pedestrian detection system for nighttime infrared images, configured to perform the above method for pedestrian detection in nighttime infrared images based on wavelet feature modulation and the HiLo attention mechanism.

[0017] The beneficial effects of this invention are: 1. This invention discloses a pedestrian detection method for nighttime infrared images based on wavelet feature modulation and a HiLo attention mechanism. The method uses a nighttime infrared image pedestrian detection network that includes a wavelet feature modulation module, a target detection network, and a detection head connected in sequence. The wavelet feature modulation module uses the two-dimensional discrete wavelet transform to decompose the fine features of the infrared image into subbands of different frequencies (namely the approximate low-frequency component, the horizontal high-frequency detail component, the vertical high-frequency detail component, and the diagonal detail component). These subbands effectively preserve edge and detail information in the image and are particularly suitable for capturing details in infrared images.

[0018] The target detection network incorporates a HiLo attention mechanism, which includes a high-frequency attention branch and a low-frequency attention branch. During feature processing, the intermediate features output by the backbone network are processed separately by the two branches: the high-frequency branch captures image details through local window attention, and the low-frequency branch captures global structure through global attention. Because attention is restricted to local windows in one branch and computed over pooled features in the other, computational complexity is reduced; the outputs of the two branches are then concatenated, improving the performance of the pedestrian detection network in complex environments, particularly in nighttime infrared images.

[0019] 2. This invention fuses and modulates the fine features with the contextual features. The modulation process enhances the correlation between features, making the final detection results (i.e., the pedestrian class probability and the pedestrian bounding box coordinates) more accurate and robust.

[0020] 3. This invention not only improves the accuracy of pedestrian detection, but also effectively reduces computational overhead, making it particularly suitable for scenarios with high real-time requirements.

Attached Figure Description

[0021] Figure 1 is a structural diagram of the wavelet feature modulation module in this invention; Figure 2 is a structural diagram of the pedestrian detection network for nighttime infrared images in this invention.

Detailed Implementation

[0022] To facilitate understanding of the present invention, a more complete description is given below with reference to the accompanying drawings. Preferred embodiments of the invention are shown in the drawings. However, the invention can be implemented in many other forms and is not limited to the embodiments described herein. Rather, these embodiments are provided so that the disclosure of the invention will be thorough and complete.

[0023] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of this invention, "a plurality of" means two or more, unless otherwise explicitly specified.

[0024] Referring to Figure 1, this application provides a method for pedestrian detection in nighttime infrared images based on wavelet feature modulation and a HiLo attention mechanism, specifically including the following steps:
S1. Acquire an infrared image of the target scene. The infrared image is a single-channel (grayscale) image acquired by infrared imaging equipment and serves as the input image for the subsequent pedestrian detection task in nighttime or low-light environments.
S2. Construct a pedestrian detection network for nighttime infrared images, including a wavelet feature modulation module, a target detection network based on a DETR structure, and a detection head connected in sequence; the target detection network has a built-in HiLo attention mechanism.
S3. Input the acquired infrared image into the pedestrian detection network and obtain the wavelet-enhanced features through the wavelet feature modulation module.
S4. Input the wavelet-enhanced features into the target detection network to obtain the decoded feature vector.
S5. Input the decoded feature vector into the detection head to predict the pedestrian class probability and the pedestrian bounding box coordinates.

[0025] The improvement in the Encoder module of this invention lies in replacing the original multi-head self-attention module with a HiLo attention mechanism. The HiLo attention mechanism includes high-frequency and low-frequency attention branches, which effectively solve the problems of blurred pedestrian edges and texture attenuation in nighttime infrared images through local window self-attention and global attention mechanisms, respectively. Finally, the outputs of both are fused to achieve efficient and accurate pedestrian detection. Simultaneously, this invention effectively improves the quality of nighttime pedestrian infrared images through a wavelet feature modulation module. By combining the HiLo attention mechanism, it effectively enhances the detection capability of pedestrian details and global structure in nighttime infrared images, significantly reducing computational complexity while maintaining detection accuracy, making it particularly suitable for real-time pedestrian detection tasks in low-light environments.

[0026] In some embodiments, the wavelet feature modulation module includes a first convolutional layer, a second convolutional layer, a two-dimensional discrete wavelet transform (DWT2), a wavelet component concatenation operation, a modulation network, an upsampling layer, an element-wise multiplication operation, and a channel-dimension concatenation operation. The first convolutional layer, the two-dimensional discrete wavelet transform, the wavelet component concatenation operation, the modulation network, and the upsampling layer are connected in series. The output of the upsampling layer is combined with the output of the second convolutional layer through the element-wise multiplication operation. The output of the element-wise multiplication operation, the output of the first convolutional layer, and the output of the second convolutional layer are joined through the channel-dimension concatenation operation. The two-dimensional discrete wavelet transform uses the Haar wavelet basis function, which has good time-frequency localization and edge-preserving properties and improves the discriminability of weak-edge targets in infrared images. The transform yields a multi-scale frequency-domain representation of the image, namely the approximate low-frequency component, the horizontal high-frequency detail component, the vertical high-frequency detail component, and the diagonal detail component. The modulation network consists of a third convolutional layer, an activation function, and a fourth convolutional layer connected in series, where the activation function enhances nonlinear expressive power.
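
To make the four-way Haar decomposition concrete, the following is a minimal single-level, single-channel sketch in plain Python. It is an illustration under stated assumptions, not the patent's implementation: the function name `haar_dwt2` is hypothetical, the orthonormal 1/2 scaling is one common choice, and the sign and naming conventions for the detail subbands vary between libraries.

```python
def haar_dwt2(x):
    """Single-level 2-D Haar DWT of a 2-D list with even height and width.

    Returns (approx, horizontal, vertical, diagonal), each half the input size.
    Scaling and sign conventions are assumptions; libraries differ on both.
    """
    h, w = len(x), len(x[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    approx, horiz, vert, diag = [], [], [], []
    for i in range(0, h, 2):
        row_a, row_h, row_v, row_d = [], [], [], []
        for j in range(0, w, 2):
            a, b = x[i][j], x[i][j + 1]          # top-left, top-right
            c, d = x[i + 1][j], x[i + 1][j + 1]  # bottom-left, bottom-right
            row_a.append((a + b + c + d) / 2)  # approximate low-frequency component
            row_h.append((a + b - c - d) / 2)  # top minus bottom: horizontal edges
            row_v.append((a - b + c - d) / 2)  # left minus right: vertical edges
            row_d.append((a - b - c + d) / 2)  # diagonal detail
        approx.append(row_a)
        horiz.append(row_h)
        vert.append(row_v)
        diag.append(row_d)
    return approx, horiz, vert, diag


# A horizontal edge (bright row above a dark row) shows up only in the
# approximation and the horizontal detail subband.
approx, horiz, vert, diag = haar_dwt2([[10, 10, 10, 10], [0, 0, 0, 0]])
```

In the module described above, this transform would be applied to every channel of the fine features, so each of the four subbands has half the spatial resolution of its input.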

[0027] In some embodiments, the target detection network includes a backbone network, an improved Encoder module, and a Decoder module connected in sequence. The improved Encoder module replaces the original multi-head self-attention module with the HiLo attention mechanism, which includes a high-frequency attention branch and a low-frequency attention branch. An attention head split ratio parameter in (0,1) controls how the attention heads are divided between the high-frequency attention branch and the low-frequency attention branch, balancing modeling performance against computational efficiency and improving the ability of the pedestrian detection network to represent targets of different scales.

[0028] The HiLo attention mechanism is divided into two branches, which can improve computational efficiency and modeling capabilities. The outputs of the high-frequency attention branch and the low-frequency attention branch are connected through a concatenation operation. The high-frequency attention branch uses a local window self-attention mechanism to enhance image detail modeling; the low-frequency attention branch applies a global attention mechanism after pooling compression to capture long-range dependencies.

[0029] In some embodiments, the backbone network is selected as a ResNet-50 network.

[0030] In some embodiments, S3 specifically includes the following steps:
S31. Input the infrared image into the wavelet feature modulation module, where it passes through the first convolutional layer and the second convolutional layer for multi-scale feature extraction, yielding the fine features, which capture local details in the infrared image, and the contextual features, which capture a wider range of contextual structure.

[0031] Here H and W denote the height and width of the infrared image in pixels, ℝ denotes the set of real numbers, and C denotes the number of channels of the fine features (C = 64 in this embodiment). Local detail information includes edges, textures, and target contours; contextual structure includes the scene background and relationships between local regions. These two types of features describe the image at different scales, providing a foundation for subsequent enhancement and fusion.
S32. Apply the two-dimensional discrete wavelet transform to the fine features to decompose them into four components: the approximate low-frequency component, the horizontal high-frequency detail component, the vertical high-frequency detail component, and the diagonal detail component. The two-dimensional discrete wavelet transform has good edge-preserving and multi-scale frequency-domain decomposition capabilities, which enhance the visibility of detailed targets in infrared images.
S33. Join the approximate low-frequency component, the horizontal high-frequency detail component, the vertical high-frequency detail component, and the diagonal detail component through the wavelet component concatenation operation to obtain the fused features.
S34. Input the fused features into the modulation network and then the upsampling layer to obtain the modulation features.
S35. Multiply the modulation features element-wise with the contextual features to obtain the modulated contextual features.
S36. Concatenate the modulated contextual features, the fine features, and the contextual features along the channel dimension to obtain the wavelet-enhanced features. By structurally integrating detailed textures, modulation responses, and contextual information, the wavelet-enhanced features improve the target representation capability for infrared images with complex backgrounds.

[0032] In some embodiments, the fine features and the contextual features in S31 are the outputs of the first convolutional layer and the second convolutional layer, respectively; the first convolutional layer has a kernel size of 3×3 and the second convolutional layer has a kernel size of 7×7. In S32, DWT2 denotes the two-dimensional discrete wavelet transform, ∀ means "for all", and c denotes the c-th channel, so the transform is applied to every channel of the fine features.

[0033] In S33, the fused features are obtained through the wavelet component concatenation operation. In S34, the modulation features are obtained by passing the fused features through the modulation network and then the upsampling layer, which restores the features to the original spatial resolution; within the modulation network, the third convolutional layer is a 3×3 convolution with 4C input channels and C output channels, the activation function is SiLU, and the fourth convolutional layer is a 3×3 convolution with C input channels and C output channels. In S35, the modulated contextual features are obtained through the element-wise multiplication operation (Hadamard product). In S36, the wavelet-enhanced features are obtained through the channel-dimension concatenation operation.
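
The channel and resolution bookkeeping of steps S31 to S36 can be traced with a short sketch. It rests on assumptions that the text does not state outright: both convolutional layers preserve the H×W resolution and output C channels each (the contextual features need C channels for the element-wise product with the C-channel modulation output), the wavelet transform is single-level so it halves each spatial dimension, and shapes are written channel-first. The function name `wavelet_modulation_shapes` and the 512×640 image size are hypothetical.

```python
def wavelet_modulation_shapes(c, h, w):
    """Trace (channels, height, width) through the wavelet feature modulation module.

    Assumes stride-1 'same' convolutions and a single-level DWT; both are
    assumptions made for illustration, not statements from the source.
    """
    if h % 2 or w % 2:
        raise ValueError("height and width must be even for a single-level DWT")
    return {
        "fine": (c, h, w),                         # first conv (3x3)
        "context": (c, h, w),                      # second conv (7x7)
        "each_subband": (c, h // 2, w // 2),       # 4 subbands from DWT2
        "fused": (4 * c, h // 2, w // 2),          # wavelet component concatenation
        "modulation_lowres": (c, h // 2, w // 2),  # 4C -> C conv, SiLU, C -> C conv
        "modulation": (c, h, w),                   # after the upsampling layer
        "modulated_context": (c, h, w),            # element-wise product with context
        "enhanced": (3 * c, h, w),                 # concat of modulated, fine, context
    }


# C = 64 as in this embodiment; the image size is an arbitrary example.
shapes = wavelet_modulation_shapes(64, 512, 640)
```

Under these assumptions the wavelet-enhanced features carry 3C channels at the input resolution, which is the tensor handed to the backbone network.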

[0034] In some embodiments, S4 specifically includes the following steps:
S41. Input the wavelet-enhanced features f_enh into the target detection network, where they first pass through the backbone network (ResNet-50), which extracts high-level semantic features while preserving spatial information, producing the feature map f_res. H1 and W1 denote the height and width of the feature map f_res, with H1 = H/32 and W1 = W/32; C' is the number of output channels of the backbone network, with C' = 2048.
S42. Project the feature map f_res onto the hidden dimension d of the Transformer model (d = 256) through a 1×1 convolutional layer to obtain the hidden features f_proj, then add a learnable positional encoding to f_proj to obtain the position-encoded features f_pos: f_pos = f_proj + PE, where PE is the learnable positional encoding.
S43. Input the position-encoded features f_pos into the improved Encoder module to obtain the output features F_HiLo.
S44. Input the output features F_HiLo into the Decoder module to obtain the decoded feature vector Z, where N is the number of learnable object queries (N = 100 in this embodiment).
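
As a worked example of the sizes in S41 to S44, the sketch below computes the feature shapes for a given input resolution. The function name `detection_feature_shapes`, the channel-first layout, the flattened token layout, and the 512×640 example size are assumptions; the stride of 32, C' = 2048, d = 256, and N = 100 come from this embodiment. It also assumes H and W are divisible by 32.

```python
def detection_feature_shapes(h, w, stride=32, backbone_channels=2048,
                             hidden_dim=256, num_queries=100):
    """Shapes of f_res, f_proj, f_pos, the encoder token sequence, and Z."""
    if h % stride or w % stride:
        raise ValueError("this sketch assumes H and W are multiples of the stride")
    h1, w1 = h // stride, w // stride
    return {
        "f_res": (backbone_channels, h1, w1),    # backbone output
        "f_proj": (hidden_dim, h1, w1),          # after the 1x1 convolution
        "f_pos": (hidden_dim, h1, w1),           # f_proj + PE, same shape
        "encoder_tokens": (h1 * w1, hidden_dim), # assumed flattened layout
        "Z": (num_queries, hidden_dim),          # one vector per object query
    }


shapes = detection_feature_shapes(512, 640)
```

For a 512×640 input this gives a 16×20 feature map, that is, 320 spatial positions entering the Encoder module.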

[0035] In some embodiments, S43 specifically includes the following steps: S431. A preset attention head split ratio parameter, taking a value in (0,1), is used to allocate the multi-head attention heads between the high-frequency attention branch and the low-frequency attention branch; the numbers of heads assigned to the two branches are determined by this parameter and by the total number of multi-head attention heads (8 in this embodiment); the position-encoded feature $f_{pos}$ is then input into the high-frequency attention branch and the low-frequency attention branch respectively; S432. In the high-frequency attention branch (HiFi), $f_{pos}$ is divided into multiple local windows, each of size 2×2 (each window containing 4 spatial locations); local multi-head self-attention ($\mathrm{MSA}_{local}$) is computed independently within each local window from the query matrix, key matrix and value matrix of the high-frequency attention branch, which are obtained with the corresponding learnable weight matrices; the outputs of the local windows are then assembled according to their original spatial locations to obtain the output of the high-frequency attention branch; S433. In the low-frequency attention branch (LoFi), the query matrix $Q_2$ is kept unchanged, while average pooling (AvgPool) downsampling is applied to the key matrix $K_2$ and the value matrix $V_2$ to obtain compressed low-frequency features; then, in the pooled low-frequency feature space, global multi-head self-attention ($\mathrm{MSA}_{global}$) is computed from the query matrix $Q_2$, the pooled key matrix $K_2$ and the pooled value matrix $V_2$ to obtain the output of the low-frequency attention branch; $Q_2$ is the query matrix of the feature $f_{pos}$, and the matrices of this branch are obtained with the corresponding learnable weight matrices; S434. The output of the high-frequency attention branch and the output of the low-frequency attention branch are concatenated along the attention head dimension, restoring a multi-head attention structure with the same number of heads and yielding the output feature $F_{HiLo}$, whose dimensions are the same as those of the position-encoded feature $f_{pos}$.
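The data flow of S431 to S434 can be illustrated with a deliberately reduced sketch in plain Python. It is not the claimed implementation: each branch uses a single attention head, the learnable weight matrices are replaced by identity projections, the pooling window is assumed to equal the 2×2 attention window, and the channel split `hi_dim` stands in for the head allocation. The names `attend` and `hilo_sketch` are hypothetical.

```python
import math


def attend(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        top = max(scores)                      # subtract the max for stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, values)) / total
                    for j in range(len(values[0]))])
    return out


def hilo_sketch(grid, hi_dim, win=2):
    """grid[r][c] is a token vector with more than hi_dim channels (hi_dim > 0).

    The first hi_dim channels feed the high-frequency branch and the rest
    feed the low-frequency branch. Rows and columns must be multiples of win.
    """
    rows, cols = len(grid), len(grid[0])
    hi = [[None] * cols for _ in range(rows)]
    pooled = []
    for r0 in range(0, rows, win):
        for c0 in range(0, cols, win):
            cells = [(r, c) for r in range(r0, r0 + win)
                     for c in range(c0, c0 + win)]
            # High-frequency branch: attention restricted to this window.
            local = [grid[r][c][:hi_dim] for r, c in cells]
            for (r, c), vec in zip(cells, attend(local, local, local)):
                hi[r][c] = vec
            # Low-frequency branch: average-pool the window into one key/value token.
            low = [grid[r][c][hi_dim:] for r, c in cells]
            pooled.append([sum(col) / len(low) for col in zip(*low)])
    # Low-frequency branch: every location queries the pooled tokens globally.
    queries = [grid[r][c][hi_dim:] for r in range(rows) for c in range(cols)]
    lo = attend(queries, pooled, pooled)
    # Concatenate both branch outputs, restoring the input channel count.
    return [[hi[r][c] + lo[r * cols + c] for c in range(cols)]
            for r in range(rows)]


grid = [[[1.0, 2.0, 3.0, 4.0] for _ in range(4)] for _ in range(4)]
out = hilo_sketch(grid, hi_dim=2)
print(len(out), len(out[0]), len(out[0][0]))
```

In this 4×4 example the low-frequency branch attends to 4 pooled tokens instead of 16, which is where the reduction in attention cost comes from, while the output keeps the same layout and channel count as the input.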

[0036] In some embodiments, S44 specifically includes the following steps: S441. The output feature $F_{HiLo}$ is input into the Decoder module, where it first serves as the key matrix $K_3$ and the value matrix $V_3$; S442. A fixed number of learnable object queries is then received as the query matrix $Q_3$; the learnable object queries are a set of trainable parameter vectors that are randomly initialized before model training and optimized together with the entire model during training; each parameter vector corresponds to a potential target region in the infrared image, the number of vectors is fixed ($N = 100$ in this embodiment), and their dimension is consistent with the hidden dimension $d$; S443. The key matrix $K_3$, the value matrix $V_3$ and the query matrix $Q_3$ are input into the Decoder module; the Decoder module first computes self-attention over the query matrix $Q_3$, allowing the learnable object queries to interact with one another; the self-attention output is then used as a new query in a cross-attention computation with the key matrix $K_3$ and the value matrix $V_3$ to extract target information from the image features; through multi-layer stacking, the feature representation corresponding to each learnable object query is progressively optimized, and the decoded feature vector $Z$ is output.
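The order of operations in S443 (self-attention among the object queries, followed by cross-attention against the encoder output) can be sketched as follows. This is a simplified illustration under stated assumptions: it uses a single attention head, omits the learned projections, residual connections, normalization and feed-forward sub-layers of a full Transformer decoder, and uses small sizes (5 queries, 12 image tokens, dimension 8) instead of $N = 100$ and $d = 256$. The names `attend` and `decoder_layer_sketch` are hypothetical.

```python
import math
import random


def attend(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, values)) / total
                    for j in range(len(values[0]))])
    return out


def decoder_layer_sketch(object_queries, memory):
    """One simplified decoder layer: self-attention, then cross-attention."""
    # Self-attention lets the object queries exchange information.
    q = attend(object_queries, object_queries, object_queries)
    # Cross-attention reads image features; memory acts as keys and values.
    return attend(q, memory, memory)


random.seed(0)
d, n_queries, n_tokens = 8, 5, 12   # toy sizes, not the embodiment's values
object_queries = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_queries)]
memory = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_tokens)]

z = object_queries
for _ in range(3):                  # multi-layer stacking
    z = decoder_layer_sketch(z, memory)
print(len(z), len(z[0]))
```

The result holds one vector per object query, each with the hidden dimension, which matches the role of the decoded feature vector $Z$ that is passed to the detection head.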

[0037] In some embodiments, S5 specifically includes the following steps: S51. The decoded feature vector $Z$ is input into the detection head, which includes a classification branch and a regression branch; the classification branch outputs the class probability $y \in [0,1]$ through a fully connected layer, where 1 represents a pedestrian and 0 represents the background; S52. The regression branch outputs the bounding box coordinates $B$ through another fully connected layer; S53. The class probability $y$ is then binarized according to a preset threshold to determine whether each prediction is a pedestrian, and the number of detected pedestrians and the class label of each pedestrian are output.
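The binarization in S53 amounts to a threshold test on each class probability. The following sketch shows one way to express it; the function name `select_pedestrians`, the threshold value 0.5, the use of `>=` at the threshold, and the `(x1, y1, x2, y2)` box layout are all assumptions, since the embodiment only states that the threshold is preset.

```python
def select_pedestrians(probabilities, boxes, threshold=0.5):
    """Binarize class probabilities and keep the boxes labelled as pedestrian.

    threshold=0.5 is a placeholder; the embodiment only says it is preset.
    """
    if len(probabilities) != len(boxes):
        raise ValueError("one probability is expected per bounding box")
    # 1 = pedestrian, 0 = background.
    labels = [1 if p >= threshold else 0 for p in probabilities]
    kept = [box for box, label in zip(boxes, labels) if label == 1]
    return {"count": len(kept), "labels": labels, "boxes": kept}


# Hypothetical predictions for four object queries.
probs = [0.91, 0.12, 0.67, 0.49]
boxes = [(10, 20, 50, 120), (0, 0, 5, 5), (200, 40, 240, 130), (90, 90, 100, 110)]
result = select_pedestrians(probs, boxes)
print(result["count"], result["labels"])
```

With these example values two of the four predictions pass the threshold, so the reported pedestrian count is 2.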

[0038] A second aspect of the present invention also discloses a pedestrian detection system for nighttime infrared images, configured to perform the pedestrian detection method for nighttime infrared images based on wavelet feature modulation and the HiLo attention mechanism described above.

[0039] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the present invention should be included within the scope of protection of the present invention. Furthermore, the technical solutions of the various embodiments of the present invention can be combined with each other, but this must be based on the ability of those skilled in the art to implement them. When the combination of technical solutions is contradictory or cannot be implemented, it should be considered that such a combination of technical solutions does not exist and is not within the scope of protection claimed by the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.

Claims

1. A method for pedestrian detection in nighttime infrared images based on wavelet feature modulation and HiLo attention mechanism, characterized in that the method comprises the following steps: S1. Acquiring an infrared image of a target scene; S2. Constructing a nighttime infrared image pedestrian detection network comprising a wavelet feature modulation module, a target detection network and a detection head connected in sequence, the target detection network having a built-in HiLo attention mechanism; S3. Inputting the acquired infrared image into the nighttime infrared image pedestrian detection network, and obtaining wavelet-enhanced features through the wavelet feature modulation module; S4. Inputting the wavelet-enhanced features into the target detection network to obtain a decoded feature vector; S5. Inputting the decoded feature vector into the detection head to predict the class probability of a pedestrian and the bounding box coordinates of the pedestrian; wherein S4 specifically includes the following steps: S41. The wavelet-enhanced features are input into the target detection network, where they first pass through the backbone network, which outputs a feature map $f_{res}$, where $C'$ is the number of output channels of the backbone network, and $H_1$ and $W_1$ denote the height and width of the feature map $f_{res}$, respectively; S42. The feature map $f_{res}$ is projected by a 1×1 convolutional layer onto the hidden dimension $d$ of the Transformer model to obtain the hidden feature $f_{proj}$, and a learnable positional encoding is added to the hidden feature $f_{proj}$ to obtain the position-encoded feature $f_{pos}$; S43. The position-encoded feature $f_{pos}$ is input into the improved Encoder module to obtain the output feature $F_{HiLo}$; S44. The output feature $F_{HiLo}$ is input into the Decoder module, which outputs the decoded feature vector $Z$, where $N$ is the number of learnable object queries.

2. The method for pedestrian detection in nighttime infrared images based on wavelet feature modulation and HiLo attention mechanism according to claim 1, characterized in that the wavelet feature modulation module includes a first convolutional layer, a second convolutional layer, a two-dimensional discrete wavelet transform, a wavelet component concatenation operation, a modulation network, an upsampling layer, an element-wise multiplication operation, and a channel dimension concatenation operation; the first convolutional layer, the two-dimensional discrete wavelet transform, the wavelet component concatenation operation, the modulation network, and the upsampling layer are connected in series; the output of the upsampling layer is combined with the output of the second convolutional layer through the element-wise multiplication operation; the output of the element-wise multiplication operation, the output of the first convolutional layer, and the output of the second convolutional layer are combined through the channel dimension concatenation operation; and the modulation network consists of a third convolutional layer, an activation function, and a fourth convolutional layer connected in series.

3. The method for pedestrian detection in nighttime infrared images based on wavelet feature modulation and HiLo attention mechanism according to claim 2, characterized in that the target detection network includes a backbone network, an improved Encoder module, and a Decoder module connected in sequence; the improved Encoder module replaces the original multi-head self-attention module with the HiLo attention mechanism; the HiLo attention mechanism includes a high-frequency attention branch and a low-frequency attention branch, and the outputs of the high-frequency attention branch and the low-frequency attention branch are combined through a concatenation operation.

4. The method for pedestrian detection in nighttime infrared images based on wavelet feature modulation and HiLo attention mechanism according to claim 3, characterized in that the backbone network is a ResNet-50 network.

5. The method for pedestrian detection in nighttime infrared images based on wavelet feature modulation and HiLo attention mechanism according to claim 3, characterized in that S3 specifically includes the following steps: S31. The acquired infrared image is input into the wavelet feature modulation module, where it passes through the first convolutional layer and the second convolutional layer to obtain fine features and context features, respectively, where $H$ and $W$ denote the height and width of the infrared image and $C$ denotes the number of channels of the fine features; S32. A two-dimensional discrete wavelet transform is applied to the fine features to decompose them into four components, namely an approximate low-frequency component, a horizontal high-frequency detail component, a vertical high-frequency detail component, and a diagonal detail component; S33. The approximate low-frequency component, the horizontal high-frequency detail component, the vertical high-frequency detail component, and the diagonal detail component are combined by the wavelet component concatenation operation to obtain fused features; S34. The fused features are input into the modulation network and then pass through the upsampling layer to obtain modulation features; S35. The modulation features and the context features are combined by element-wise multiplication to obtain modulated context features; S36. The modulated context features, the fine features, and the context features are combined by channel dimension concatenation to obtain the wavelet-enhanced features.

6. The method for pedestrian detection in nighttime infrared images based on wavelet feature modulation and HiLo attention mechanism according to claim 5, characterized in that: in S31, the fine features are the output of the first convolutional layer, which is a convolutional layer with a kernel size of 3×3, and the context features are the output of the second convolutional layer, which is a convolutional layer with a kernel size of 7×7; in S32, the two-dimensional discrete wavelet transform, denoted DWT2, is applied to every channel $c$ of the fine features; in S33, the fused features are obtained by applying the wavelet component concatenation operation to the four wavelet components; in S34, the modulation features are obtained by passing the fused features through the modulation network and then through the upsampling layer, where the third convolutional layer of the modulation network is a 3×3 convolution with $4C$ input channels and $C$ output channels, the activation function of the modulation network is SiLU, and the fourth convolutional layer is a 3×3 convolution with $C$ input channels and $C$ output channels; in S35, the modulated context features are obtained by element-wise multiplication of the modulation features and the context features; in S36, the wavelet-enhanced features are obtained by applying the channel dimension concatenation operation to the modulated context features, the fine features, and the context features.
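The decomposition and concatenation of S32 and S33 can be illustrated with a one-level two-dimensional Haar transform in plain Python. The claims do not name a wavelet basis, so the choice of Haar, the orthonormal scaling, the sub-band naming convention, and the function names `haar_dwt2` and `concat_subbands` are all assumptions made for this sketch.

```python
def haar_dwt2(channel):
    """One-level 2-D Haar transform of a 2-D list with even height and width.

    Returns (approx, horizontal, vertical, diagonal); each sub-band has half
    the input height and width. The sub-band naming follows one common
    convention and is an assumption here.
    """
    rows, cols = len(channel), len(channel[0])
    if rows % 2 or cols % 2:
        raise ValueError("this sketch needs even height and width")
    bands = ([], [], [], [])
    for r in range(0, rows, 2):
        out_rows = ([], [], [], [])
        for c in range(0, cols, 2):
            a, b = channel[r][c], channel[r][c + 1]
            e, f = channel[r + 1][c], channel[r + 1][c + 1]
            out_rows[0].append((a + b + e + f) / 2)  # approximation (low-frequency)
            out_rows[1].append((a + b - e - f) / 2)  # horizontal detail
            out_rows[2].append((a - b + e - f) / 2)  # vertical detail
            out_rows[3].append((a - b - e + f) / 2)  # diagonal detail
        for band, row in zip(bands, out_rows):
            band.append(row)
    return bands


def concat_subbands(feature):
    """Transform every channel and stack the four sub-bands channel-wise."""
    per_channel = [haar_dwt2(ch) for ch in feature]
    # Channel order: all approximations, then horizontal, vertical, diagonal.
    return [ch_bands[k] for k in range(4) for ch_bands in per_channel]


# Toy fine features with C = 3 channels of size 4 x 4.
fine = [[[float(r * 4 + c + ch) for c in range(4)] for r in range(4)]
        for ch in range(3)]
fused = concat_subbands(fine)
print(len(fused), len(fused[0]), len(fused[0][0]))
```

Stacking four sub-bands for each of the $C$ input channels gives $4C$ channels at half the spatial resolution, which is consistent with the third convolutional layer taking $4C$ input channels and with an upsampling layer following the modulation network.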

7. The method for pedestrian detection in nighttime infrared images based on wavelet feature modulation and HiLo attention mechanism according to claim 6, characterized in that S43 specifically includes the following steps: S431. A preset attention head split ratio parameter is used to allocate the multi-head attention heads between the high-frequency attention branch and the low-frequency attention branch, the numbers of heads assigned to the two branches being determined by this parameter and by the total number of multi-head attention heads; the position-encoded feature $f_{pos}$ is then input into the high-frequency attention branch and the low-frequency attention branch respectively; S432. In the high-frequency attention branch (HiFi), the position-encoded feature $f_{pos}$ is divided into multiple local windows; local multi-head self-attention ($\mathrm{MSA}_{local}$) is computed within each local window from the query matrix, key matrix and value matrix of the high-frequency attention branch, which are obtained with the corresponding learnable weight matrices; the outputs of the local windows are assembled according to their original spatial positions to obtain the output of the high-frequency attention branch; S433. In the low-frequency attention branch (LoFi), the query matrix $Q_2$ is kept unchanged, while average pooling (AvgPool) downsampling is applied to the key matrix $K_2$ and the value matrix $V_2$ to obtain compressed low-frequency features; then, in the pooled low-frequency feature space, global multi-head self-attention ($\mathrm{MSA}_{global}$) is computed from the query matrix $Q_2$, the pooled key matrix $K_2$ and the pooled value matrix $V_2$ to obtain the output of the low-frequency attention branch, the matrices of this branch being obtained with the corresponding learnable weight matrices; S434. The output of the high-frequency attention branch and the output of the low-frequency attention branch are concatenated along the attention head dimension, restoring a multi-head attention structure with the same number of heads and yielding the output feature $F_{HiLo}$, whose dimensions are the same as those of the position-encoded feature $f_{pos}$.

8. The method for pedestrian detection in nighttime infrared images based on wavelet feature modulation and HiLo attention mechanism according to claim 7, characterized in that S44 specifically includes the following steps: S441. The output feature $F_{HiLo}$ is input into the Decoder module, where it first serves as the key matrix $K_3$ and the value matrix $V_3$; S442. A fixed number of learnable object queries is received as the query matrix $Q_3$, wherein the learnable object queries are a set of trainable parameter vectors, and each parameter vector corresponds to a potential target region in the infrared image; S443. The key matrix $K_3$, the value matrix $V_3$ and the query matrix $Q_3$ are input into the Decoder module; the Decoder module first performs self-attention calculation on the query matrix $Q_3$, allowing the learnable object queries to interact with one another; it then uses the self-attention output as a new query and performs cross-attention calculation with the key matrix $K_3$ and the value matrix $V_3$ to extract target information from the image features; through multi-layer stacking, the feature representation corresponding to each learnable object query is progressively optimized, and the decoded feature vector $Z$ is output.

9. The method for pedestrian detection in nighttime infrared images based on wavelet feature modulation and HiLo attention mechanism according to claim 8, characterized in that S5 specifically includes the following steps: S51. The decoded feature vector $Z$ is input into the detection head, which includes a classification branch and a regression branch, wherein the classification branch outputs the class probability $y \in [0,1]$ through a fully connected layer, where 1 represents a pedestrian and 0 represents the background; S52. The regression branch outputs the bounding box coordinates through another fully connected layer; S53. The class probability $y$ is then binarized according to a preset threshold to determine whether each prediction is a pedestrian, and the number of detected pedestrians and the class label of each pedestrian are output.

10. A nighttime infrared image pedestrian detection system, characterized in that it is configured to perform the nighttime infrared image pedestrian detection method based on wavelet feature modulation and HiLo attention mechanism as described in any one of claims 1 to 9.