Fog target detection method based on multi-scale time-frequency information enhancement mechanism
By decoupling and enhancing the low-frequency cloud and fog components and high-frequency target components of foggy images through a multi-scale time-frequency information enhancement mechanism, and combining multi-scale convolution and attention mechanisms, the problem of balancing cloud and fog suppression and detail preservation in foggy target detection is solved, thereby improving detection accuracy and real-time performance.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NORTHWESTERN POLYTECHNICAL UNIV
- Filing Date
- 2026-05-22
- Publication Date
- 2026-06-19
AI Technical Summary
Existing target detection methods in foggy weather, while removing cloud and fog interference, are prone to losing detailed information and fail to make full use of multi-scale and global features, affecting the detection accuracy of small-scale and blurred targets.
A multi-scale time-frequency information enhancement mechanism is adopted. By decoupling the low-frequency cloud and fog components and the high-frequency target components of foggy images, differential enhancement is performed on each component. Feature fusion is then performed by combining multi-scale convolution and attention mechanisms to construct a joint learning framework for defogging and detection.
It effectively suppresses cloud and fog interference while retaining the high-frequency discrimination features of the target, improving the detection accuracy of small-scale and blurred targets, and maintaining the real-time performance and accuracy of detection.
Figure CN122243781A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of image processing technology, and specifically to a foggy target detection method based on a multi-scale time-frequency information enhancement mechanism.
Background Technology
[0002] Deep learning-based object detection methods, especially convolutional neural networks and visual Transformer models, have made significant progress in the field of image object detection. However, in practical industrial applications, object detection is highly susceptible to adverse weather conditions. In foggy conditions, light scattering and attenuation effects reduce image visibility, decrease contrast, blur target outlines, and make it difficult to distinguish between foreground and background, severely impacting detection accuracy and reliability.
[0003] To address the problem of target detection under foggy conditions, existing technologies mainly follow the following paths.
[0004] One approach employs a multi-sensor fusion strategy, such as using infrared sensors or synthetic aperture radar for all-weather detection. This type of method has the advantages of strong anti-interference capabilities and minimal susceptibility to weather conditions, but it also has inherent limitations such as high hardware costs, significant environmental noise interference, and difficulty in finely distinguishing target categories.
[0005] Another type of method is target detection in foggy weather based on visible light images. The mainstream approach is to use dehazing as a pre-processing or embedded step in the detection process, enabling the detection model to obtain relatively clear input features. Researchers have explored different angles in terms of technical implementation.
[0006] Some studies employ mathematical and data-driven dehazing methods, designing different filters to perform dehazing, exposure, tone, contrast, and sharpening operations on images. Other studies utilize physical dehazing methods based on atmospheric scattering models or lightweight dehazing networks based on U-Net (U-shaped network) structures. Considering the real-time requirements of detection, some studies have designed the dehazing module and detection network into a joint framework that can be trained end-to-end, allowing the dehazing parameters to be optimized for the detection task, thereby achieving a balance between accuracy and speed.
[0007] However, the aforementioned methods have significant technical drawbacks in practical applications. While image operations such as contrast optimization, low-light enhancement, and dehazing can improve image quality to some extent, they can also lead to problems such as color distortion, artifacts, and loss of detail information, thereby reducing object detection performance. This negative impact is particularly pronounced in scenarios requiring fine object recognition or where the target scale is small. The root cause is that the fog in foggy images mainly manifests as low-frequency background interference, and its effective suppression usually requires operations such as pooling, smoothing, or global context aggregation. However, the key discriminative features required for object detection (such as edge contours and texture details) are precisely high-frequency information, which is extremely sensitive to the aforementioned smoothing operations. Existing methods, whether employing independent dehazing preprocessing, end-to-end joint training, or using the same processing strategy for high- and low-frequency features, inevitably attenuate or lose the high-frequency discriminative features necessary for object detection while suppressing fog interference.
[0008] Furthermore, existing methods do not fully utilize multi-scale features and global contextual information during the detection phase, which limits the model's ability to locate and identify small-scale or blurred targets under low visibility conditions.
[0009] Therefore, how to design a detection method that reduces the loss of detailed information while removing cloud and fog interference, and that enhances the perception of multi-scale and global features, is a technical problem that urgently needs to be solved in the field of target detection in foggy weather.
Summary of the Invention
[0010] To address the shortcomings of existing technologies, this invention proposes a foggy target detection method based on a multi-scale time-frequency information enhancement mechanism. This method solves the technical problem of balancing cloud and fog suppression with detail preservation during dehazing: it removes cloud and fog interference while preserving the target's high-frequency discriminative features as far as possible, and it enhances the perception of multi-scale and global features.
[0011] To achieve the above objectives, the present invention adopts the following technical solution:
[0012] This invention proposes a foggy target detection method based on a multi-scale time-frequency information enhancement mechanism, comprising:
[0013] Acquire an image of the target object to be detected in foggy weather; preprocess the foggy target image to obtain an image tensor;
[0014] The image tensor is input into a pre-trained target detection neural network, and the target detection neural network is used to perform target detection on the foggy target image.
[0015] The target detection neural network includes a dehazing model and a detection model;
[0016] The dehazing model performs the following operations:
[0017] The image tensor is encoded to obtain encoded features;
[0018] The deepest feature in the encoding features is decoupled into a low-frequency cloud and fog component and a high-frequency target component;
[0019] The low-frequency cloud and fog component is suppressed and enhanced using a self-attention mechanism with pooling operation to obtain an enhanced low-frequency cloud and fog component.
[0020] The high-frequency target component is enhanced with detail preservation to obtain the enhanced high-frequency target component;
[0021] The enhanced low-frequency cloud and fog component is fused with the enhanced high-frequency target component to obtain preliminary fusion features;
[0022] The preliminary fusion features are fused with the deepest features in the encoded features to obtain the basic fusion features;
[0023] The basic fusion features and the encoder intermediate features received by the skip connection are subjected to gating context aggregation processing to obtain gating enhancement features;
[0024] The gated enhancement features are decoded to obtain dehazing enhancement features;
[0025] The detection model performs the following operations:
[0026] Multi-scale feature extraction is performed on the dehazing enhancement features to obtain multi-level features;
[0027] The deepest feature in the multi-level feature hierarchy is sequentially subjected to multi-scale convolution and channel fusion to obtain deep fused features.
[0028] The deep fusion features are weighted by both channel attention and pixel attention to obtain weighted enhanced features;
[0029] The weighted enhancement features are added element-wise to the deepest feature in the multi-level features to obtain the residual enhancement features;
[0030] The residual enhancement features are subjected to frequency domain transformation and their amplitude components are enhanced by convolution to obtain time-frequency fusion features;
[0031] The time-frequency fusion feature is fused with features of other scales in the multi-level feature to obtain the multi-scale fusion feature;
[0032] The multi-scale fused features are input into the detection head, which outputs the category labels and bounding box coordinates of all detected targets.
[0033] Furthermore, the process of decoupling the deepest features in the encoded features into low-frequency cloud / fog components and high-frequency target components includes:
[0034] The deepest feature in the encoded features is subjected to mean pooling to obtain the low-frequency cloud and fog component.
[0035] The low-frequency cloud and fog component is upsampled to obtain the upsampled low-frequency cloud and fog component.
[0036] The high-frequency target component is obtained by subtracting the upsampled low-frequency cloud component from the deepest feature in the encoded features.
[0037] Furthermore, the process of suppressing and enhancing the low-frequency cloud and fog components includes:
[0038] Generate a first query, a first key, and a first value based on the low-frequency cloud and fog components;
[0039] After the first key and the first value are downsampled by pooling, the self-attention weights are calculated and information is aggregated to obtain the aggregated low-frequency features;
[0040] The aggregated low-frequency features are upsampled to obtain an enhanced low-frequency cloud component, which has the same spatial resolution as the deepest feature in the encoded features.
[0041] Furthermore, the process of performing detail-preserving enhancement on the high-frequency target components includes:
[0042] A second query, a second key, and a second value are generated based on the high-frequency target components;
[0043] The self-attention weights are calculated based on the second query, the second key, and the second value, and information is aggregated to obtain the enhanced high-frequency target components.
[0044] Furthermore, the process of fusing the enhanced low-frequency cloud and fog component with the enhanced high-frequency target component includes:
[0045] The enhanced high-frequency target component is added to the enhanced low-frequency cloud and fog component in residual form to obtain preliminary fusion features.
[0046] Furthermore, the process of performing multi-scale convolution on the deepest features in the multi-level feature set includes:
[0047] First, feature extraction is performed using a cascaded 1×1 convolution and 5×5 convolution to obtain cascaded convolution features;
[0048] Multi-scale spatial feature extraction is then performed on the cascaded convolution features using parallel 3×3 depthwise separable convolution kernels (DWConv3), 5×5 depthwise separable convolution kernels (DWConv5), and 7×7 depthwise separable convolution kernels (DWConv7).
[0049] Furthermore, the process of applying dual weighting of channel attention and pixel attention to the deep fusion features includes:
[0050] First, channel attention weighting is applied to the deep fusion features to extract global channel information and obtain channel-weighted features;
[0051] Then, pixel attention weighting is applied to the channel weighted features to extract local pixel location information, resulting in weighted enhanced features.
[0052] Furthermore, the process of performing frequency domain transformation on the residual enhancement features and convolutional enhancement on their amplitude components includes:
[0053] A two-dimensional fast Fourier transform (FFT) is performed on the residual enhancement features to obtain the amplitude component and the phase component;
[0054] The amplitude component is subjected to two 1×1 convolutions in sequence, and a nonlinear transformation is performed between the two convolutions using the Leaky ReLU activation function to obtain the enhanced amplitude component.
[0055] The phase component and the enhanced amplitude component are subjected to inverse Fourier transform (IFFT) to obtain the time-frequency fusion features.
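The frequency domain steps in [0053] to [0055] can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the weight layout `(C_out, C_in)`, the Leaky ReLU slope of 0.01, and taking the real part after the inverse transform are all choices made here for illustration.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # The slope value is an assumption; the text only names the activation.
    return np.where(x > 0, x, slope * x)

def conv1x1(x, weight, bias):
    # A 1x1 convolution is a per-pixel linear map over channels.
    # x: (C_in, H, W), weight: (C_out, C_in), bias: (C_out,)
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def frequency_enhance(feat, w1, b1, w2, b2):
    # 2-D FFT over the spatial axes, split into amplitude and phase.
    spectrum = np.fft.fft2(feat, axes=(-2, -1))
    amplitude = np.abs(spectrum)
    phase = np.angle(spectrum)
    # Two 1x1 convolutions on the amplitude with Leaky ReLU in between.
    hidden = leaky_relu(conv1x1(amplitude, w1, b1))
    amplitude_enh = conv1x1(hidden, w2, b2)
    # Recombine with the untouched phase and transform back.
    recombined = amplitude_enh * np.exp(1j * phase)
    # Keeping only the real part is an assumption made for this sketch.
    return np.fft.ifft2(recombined, axes=(-2, -1)).real
```

With identity weights and zero biases the amplitude passes through unchanged, so the function returns its input; this is a convenient sanity check that the phase is preserved.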
[0056] This invention also proposes a foggy target detection system based on a multi-scale time-frequency information enhancement mechanism to implement the above-mentioned foggy target detection method, comprising:
[0057] The image acquisition module is used to acquire images of the target to be detected in foggy weather.
[0058] The preprocessing module is used to preprocess the target image to obtain an image tensor;
[0059] The object detection module includes a pre-trained object detection neural network, which is used to perform object detection on the image tensor and output the category labels and bounding box coordinates of all detected objects.
[0060] Compared with the prior art, the beneficial effects of the present invention are as follows:
[0061] (1) In this invention, the High-Low Frequency Layered Processing Module (HFFLP) is placed at the deepest output of the encoder. Since the deepest features have the largest receptive field and the richest semantic information, they can effectively characterize the global distribution pattern of clouds and fog. This invention decouples the deep features output by the encoder into low-frequency cloud and fog components and high-frequency target components through the high-low frequency layered processing module, and uses an asymmetric self-attention mechanism to enhance them differently. The low-frequency branch performs pooling downsampling on the keys and values to focus on the global distribution pattern of clouds and fog, while the high-frequency branch maintains the original resolution to preserve spatial details. Since this method fully considers the essential differences between cloud and fog interference and target structural information in the frequency domain distribution, it effectively solves the technical problem that it is difficult to balance cloud and fog suppression and detail preservation in existing defogging operations. It effectively suppresses cloud and fog interference while preserving the high-frequency discrimination features of the target as much as possible.
[0062] (2) This invention achieves deep feature decoupling by using mean pooling and subtraction residual operation. Since this decoupling method is learnable and data-driven, it can adaptively separate the interference components related to foggy scenes, avoiding the limitations of traditional frequency domain transformation fixed basis decomposition, and can extract target features more accurately.
[0063] (3) This invention introduces a multi-scale time-frequency feature enhancement module (MTFFE) into the detection model. The deepest features output by the backbone network are processed sequentially by multi-scale convolution, dual weighting of channel attention and pixel attention, frequency domain transformation and amplitude component convolution enhancement. Since multi-scale convolution expands the receptive field through a series-parallel hybrid structure, the extracted texture, edge and structural information is richer and more diverse. The dual attention mechanism enables the network to extract global channel information and local pixel position information at the same time, effectively addressing the problem of uneven fog distribution in the image. Frequency domain enhancement enhances high-frequency edge information through frequency domain transformation, solving the technical problem of insufficient utilization of multi-scale features and global context information in the detection stage of existing methods, and significantly improving the detection accuracy of small-scale targets and blurred targets under low visibility conditions.
[0064] (4) This invention combines dehazing with feature refinement to enhance perception by constructing a joint learning framework for the dehazing model and the detection model. Since this framework avoids the error accumulation and noise interference introduced by the multi-module processing in the traditional dehazing preprocessing and detection serial architecture, and the dehazing model and the detection model can be jointly optimized end-to-end, it can improve detection accuracy while maintaining good real-time performance.
Brief Description of the Drawings
[0065] Figure 1 is a schematic diagram of the overall structure of the target detection neural network in an embodiment of the present invention;
[0066] Figure 2 is a schematic diagram of the internal structure of the high- and low-frequency feature layering processing module in an embodiment of the present invention;
[0067] Figure 3 is a schematic diagram of the internal structure of the low-frequency enhancement unit in an embodiment of the present invention;
[0068] Figure 4 is a schematic diagram of the internal structure of the high-frequency enhancement unit in an embodiment of the present invention;
[0069] Figure 5 is a schematic diagram of the internal structure of the multi-scale time-frequency feature fusion enhancement module in an embodiment of the present invention;
[0070] Figure 6 is a schematic diagram of the internal structure of the frequency domain enhancement unit in an embodiment of the present invention;
[0071] Figure 7 shows comparative visualization results of various methods on the Seaships7000 fog dataset, where (a) is the visualization result output by the YOLOv8 model; (b) is the visualization result output by the GCANet (Gated Context Aggregation Network)-YOLOv8 model; and (c) is the visualization result output by the target detection neural network in the embodiment of this invention.
[0072] Figure 8 shows comparative visualization results of various methods on the ShipRS ImageNet fog dataset, where (a) is the visualization result output by the YOLOv8 model; (b) is the visualization result output by the GCANet-YOLOv8 model; and (c) is the visualization result output by the target detection neural network in the embodiment of this invention.
[0073] Figure 9 shows heatmaps comparing the methods on the Seaships7000 fog dataset, where (a) is the heatmap of the GCANet-YOLOv8 model; and (b) is the heatmap of the target detection neural network in the embodiment of this invention.
Detailed Implementation
[0074] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0075] Example
[0076] This embodiment proposes a foggy target detection method based on a multi-scale time-frequency information enhancement mechanism, including the following steps:
[0077] Step 1: Obtain the image of the target in foggy weather to be detected, and preprocess the obtained image.
[0078] The foggy target image can be a picture taken by a surveillance camera or a high-resolution optical remote sensing image. Due to the influence of foggy weather conditions, the visibility of the foggy target image is low, and the edge, texture and other features of the target are severely weakened.
[0079] The preprocessing of foggy target images includes size normalization and standardization. In this embodiment, size normalization scales the input image to a uniform size of 640×640 pixels. Standardization then maps the pixel value range from [0,255] to [0,1], so that the image data distribution meets the input requirements of the target detection neural network, thereby obtaining the image tensor.
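The two preprocessing steps can be illustrated with a short NumPy sketch. It assumes an 8-bit RGB input of shape `(H, W, 3)`; the nearest-neighbour resampling and the `(1, 3, 640, 640)` output layout are assumptions made here, since the text specifies only the target size and the value range.

```python
import numpy as np

def preprocess(image, size=640):
    # image: (H, W, 3) uint8 array with values in [0, 255].
    h, w = image.shape[:2]
    # Nearest-neighbour resampling; the interpolation method is an assumption.
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = image[rows[:, None], cols[None, :]]
    # Map [0, 255] to [0, 1] as described in the text.
    scaled = resized.astype(np.float32) / 255.0
    # Channel-first layout with a batch axis; the layout is an assumption.
    return np.transpose(scaled, (2, 0, 1))[None]
```

A practical pipeline would more likely use an interpolating resize from an image library; the index-based version is used here only to keep the sketch dependency-free.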
[0080] Step 2: Input the image tensor into the pre-trained target detection neural network, perform multi-scale target detection on the foggy target image through the target detection neural network, and output the category label and bounding box coordinates of all detected targets.
[0081] Referring to Figure 1, the target detection neural network comprises two parts: a dehazing model and a detection model. The dehazing model suppresses cloud and fog interference and enhances target features, while the detection model performs target classification and localization.
[0082] The dehazing model employs a lightweight U-Net structure and introduces a high- and low-frequency feature hierarchical processing module. The model includes an encoder, a decoder, and skip connections between the encoder and the decoder. The high- and low-frequency feature hierarchical processing module is positioned between the encoder's output and the decoder's input. The encoder extracts encoded features from the image tensor, and the deepest features among the encoded features are processed by the high- and low-frequency feature hierarchical processing module to obtain basic fusion features. The basic fusion features and the encoder intermediate features received through the skip connections (the encoded features other than the deepest features) undergo gated context aggregation to obtain gated enhancement features. The gated enhancement features are processed by the decoder to obtain dehazing enhancement features.
[0083] In this embodiment, the lightweight U-Net structure is GCA-Net. A context aggregation module and a gated fusion subnetwork are set between the encoder and decoder of the GCA-Net. A high- and low-frequency feature hierarchical processing module is embedded between the output of the encoder and the input of the context aggregation module. The encoder extracts encoded features from the image tensor; the deepest features are processed sequentially by the high- and low-frequency feature hierarchical processing module and the context aggregation module before being input into the gated fusion subnetwork. Intermediate features from the encoder are input directly into the gated fusion subnetwork, and the output of the gated fusion subnetwork is processed by the decoder before entering the detection model.
[0084] In this embodiment, the encoder consists of three cascaded 3×3 convolutional layers: the first two convolutional layers have a stride of 1 to maintain spatial resolution, and the third convolutional layer has a stride of 2 to downsample the feature map to half its original size. The encoded features output by the encoder comprise features at three levels: the shallow features output by the first convolutional layer, the mid-level features output by the second convolutional layer, and the deepest features output by the third convolutional layer. The deepest features are processed by the high- and low-frequency feature hierarchical processing module to obtain the basic fusion features, which pass through the context aggregation module and are then input into the gated fusion subnetwork. The shallow features and the mid-level features are input directly into the gated fusion subnetwork.
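The spatial sizes implied by this encoder can be checked with the standard convolution output-size formula. The sketch below assumes a padding of 1 for each 3×3 convolution, which the text does not state; the function names are illustrative.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    # Standard output-size formula for a convolution or pooling window.
    return (size + 2 * padding - kernel) // stride + 1

def encoder_sizes(size=640, strides=(1, 1, 2)):
    # Track the spatial size after each of the three 3x3 convolutions.
    sizes = []
    for stride in strides:
        size = conv_out(size, stride=stride)
        sizes.append(size)
    return sizes

print(encoder_sizes())  # [640, 640, 320]
```

Under this assumption the deepest features are 320×320. Applying the same formula to a 2×2 window with stride 2 and no padding gives 160 and then 80, which matches the 160×160 and 80×80 resolutions quoted for the low-frequency branch.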
[0085] In this embodiment, the context aggregation module consists of seven cascaded smoothed dilated residual blocks, which aggregate global context information, expand the receptive field, and avoid gridding artifacts. The dilation rates of the seven smoothed dilated residual blocks are 2, 2, 2, 4, 4, 4, and 1, respectively. The gated fusion subnetwork adaptively weights and fuses features from different levels of the encoder and outputs gated enhancement features. These gated enhancement features are then processed by the decoder to obtain dehazing enhancement features.
[0086] Referring to Figure 2, the high- and low-frequency feature layering processing module includes a decoupling unit, a low-frequency enhancement unit, a high-frequency enhancement unit, and a fusion unit. This module decouples and differentially enhances the deepest features in the encoded features to obtain the basic fusion features.
[0087] In this embodiment, the high- and low-frequency feature layering processing module processes the deepest features as follows: the decoupling unit decouples the deepest features into a low-frequency cloud and fog component and a high-frequency target component; the low-frequency cloud and fog component is suppressed and enhanced using a self-attention mechanism with a pooling operation to obtain the enhanced low-frequency cloud and fog component; the high-frequency target component undergoes detail-preserving enhancement to obtain the enhanced high-frequency target component; the enhanced low-frequency cloud and fog component is fused with the enhanced high-frequency target component to obtain the preliminary fusion features; the preliminary fusion features are processed by a 1×1 convolutional layer and then added element-wise to the deepest features to obtain the basic fusion features;
[0088] The decoupling unit decouples the deepest features into the low-frequency cloud and fog component and the high-frequency target component as follows:
[0089] Batch normalization is applied to the deepest features to obtain the first normalized feature; the channel dimension of the first normalized feature is adjusted by passing it through a 3×3 convolutional layer followed by a 1×1 convolutional layer to obtain the adjusted features;
[0090] Two-dimensional mean pooling with a 2×2 pooling window and a stride of 2 is applied to the adjusted features to obtain the low-frequency cloud and fog component;
[0091] Bilinear upsampling is applied to the low-frequency cloud and fog component to obtain the upsampled low-frequency cloud and fog component, which has the same spatial resolution as the adjusted features;
[0092] The upsampled low-frequency cloud and fog component is subtracted from the adjusted features to obtain the high-frequency target component:
[0093]
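The pooling, upsampling, and subtraction steps of the decoupling unit can be sketched as follows. This is an illustration under assumptions: the names `adjusted`, `low`, and `high` are chosen here, the batch normalization and convolutions that produce the adjusted features are omitted, and the bilinear resize uses a half-pixel sampling convention that the text does not specify.

```python
import numpy as np

def avg_pool2x2(x):
    # 2x2 mean pooling with stride 2 on a (C, H, W) array (H, W even).
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def bilinear_resize(x, out_h, out_w):
    # Bilinear interpolation with half-pixel centres (an assumed convention).
    c, h, w = x.shape
    ys = np.clip((np.arange(out_h) + 0.5) * h / out_h - 0.5, 0, h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * w / out_w - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]
    wx = (xs - x0)[None, None, :]
    top = x[:, y0][:, :, x0] * (1 - wx) + x[:, y0][:, :, x1] * wx
    bottom = x[:, y1][:, :, x0] * (1 - wx) + x[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bottom * wy

def decouple(adjusted):
    # Low-frequency part: local means at half resolution.
    low = avg_pool2x2(adjusted)
    # High-frequency part: what the smoothed, upsampled copy cannot explain.
    _, h, w = adjusted.shape
    high = adjusted - bilinear_resize(low, h, w)
    return low, high
```

By construction, adding the upsampled low-frequency component back to the high-frequency component recovers the adjusted features exactly, so the split itself discards no information.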
[0094] Referring to Figure 3, the low-frequency enhancement unit performs suppression and enhancement on the low-frequency cloud and fog component as follows:
[0095] The low-frequency cloud and fog component is mapped by three independent 1×1 convolutional layers to the first query, the first key, and the first value, respectively; the number of channels remains 64, and the spatial resolution is 160×160.
[0096] Two-dimensional adaptive average pooling with a 2×2 pooling window is applied to the first key and the first value, downsampling the spatial resolution to 80×80 to obtain the pooled key and the pooled value. This pooling operation expands the receptive field, enabling the model to focus on the global distribution pattern of clouds and fog while suppressing local noise interference.
[0097] The self-attention weights are calculated from the first query and the pooled key, and information is aggregated from the pooled value to obtain the aggregated low-frequency features:
[0098]
[0099]
[0100] In the formulas, a normalization function is applied to the attention scores, which are computed from the first query and the transpose of the pooled key.
[0101] The aggregated low-frequency features are upsampled to obtain the enhanced low-frequency cloud and fog component, which has the same spatial resolution as the low-frequency cloud and fog component.
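The asymmetric attention of the low-frequency branch, in which only the key and value are pooled, can be sketched as below. Several details are assumptions made for illustration: softmax is used as the normalization function, the scores are scaled by the square root of the channel count, and the query, key, and value are passed in directly rather than produced by the three 1×1 convolutions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def avg_pool2x2(x):
    # 2x2 mean pooling with stride 2 on a (C, H, W) array.
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def pooled_attention(q, k, v):
    # q, k, v: (C, H, W). Only the key and value are pooled.
    c, h, w = q.shape
    k_p, v_p = avg_pool2x2(k), avg_pool2x2(v)
    q_tok = q.reshape(c, -1).T      # (H*W, C)
    k_tok = k_p.reshape(c, -1).T    # (H*W/4, C)
    v_tok = v_p.reshape(c, -1).T    # (H*W/4, C)
    # Softmax normalization and 1/sqrt(C) scaling are assumptions.
    weights = softmax(q_tok @ k_tok.T / np.sqrt(c), axis=-1)
    out = weights @ v_tok           # (H*W, C)
    return out.T.reshape(c, h, w), weights
```

Because the query keeps its resolution, the output has as many positions as the input, while the weight matrix has four times fewer columns than unpooled attention would need. At the resolutions given in this embodiment that is 25,600 query positions against 6,400 pooled key positions.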
[0102] Referring to Figure 4, the high-frequency enhancement unit performs detail-preserving enhancement on the high-frequency target component as follows:
[0103] The high-frequency target component is mapped by three independent 1×1 convolutional layers to the second query, the second key, and the second value, respectively; the number of channels remains 64, and the spatial resolution is 320×320.
[0104] The second key and the second value are kept at the original resolution of 320×320; the self-attention weights are calculated from the second query, the second key, and the second value, and information is aggregated to obtain the enhanced high-frequency target component:
[0105]
[0106]
[0107] In the formulas, the attention scores are computed from the second query and the transpose of the second key.
[0108] The fusion unit fuses the enhanced low-frequency cloud and fog component with the enhanced high-frequency target component as follows:
[0109] The enhanced low-frequency cloud and fog component is upsampled to obtain the upsampled enhanced low-frequency cloud and fog component;
[0110] The enhanced high-frequency target component is added to the upsampled enhanced low-frequency cloud and fog component in residual form to obtain the preliminary fusion features:
[0111]
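The fusion step, together with the 1×1 convolution and residual addition described in [0087], can be sketched as follows. The names are illustrative, nearest-neighbour upsampling is an assumption (the fusion step does not name an interpolation method), and the deepest features are assumed to have the same channel count as the output of the 1×1 convolution.

```python
import numpy as np

def upsample2x_nearest(x):
    # Nearest-neighbour 2x upsampling; the method is an assumption.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def conv1x1(x, weight, bias):
    # x: (C_in, H, W), weight: (C_out, C_in), bias: (C_out,)
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def fuse(low_enh, high_enh, deepest, weight, bias):
    # Preliminary fusion: high-frequency branch plus upsampled low-frequency branch.
    prelim = high_enh + upsample2x_nearest(low_enh)
    # Basic fusion: 1x1 convolution, then a residual connection to the deepest features.
    return conv1x1(prelim, weight, bias) + deepest
```

The residual connection means that when both enhanced branches contribute nothing, the module passes the deepest features through unchanged.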
[0112] In this embodiment, the detection model is built based on the YOLOv8 network and specifically includes: a backbone network, a neck network, and a detection head.
[0113] The backbone network extracts multi-scale features from the dehazing enhancement features to obtain multi-level features. In this embodiment, the backbone network outputs features at three different scales: shallow features, mid-level features, and deepest features.
[0114] The neck network performs feature fusion and enhancement on the multi-level feature maps and outputs multi-scale fused features; the neck network includes a multi-scale time-frequency feature fusion enhancement module. Referring to Figure 5, the multi-scale time-frequency feature fusion enhancement module includes a multi-scale convolutional unit, a dual attention unit, and a frequency domain enhancement unit.
[0115] The multi-scale convolutional unit performs multi-scale convolution and channel fusion on the deepest features in the multi-level features to obtain deep fused features. In this embodiment, the multi-scale convolutional unit processes the deepest features as follows:
[0116] Batch normalization is applied to the deepest features to obtain the second normalized feature;
[0117] Features are extracted from the second normalized feature by a cascaded 1×1 convolution and 5×5 convolution to obtain the cascaded convolution features, where the 1×1 convolution reduces the channel dimension and the computational overhead, and the 5×5 convolution expands the receptive field;
[0118] Multi-scale spatial features are extracted from the cascaded convolution features using parallel 3×3, 5×5, and 7×7 depthwise separable convolution kernels;
[0119] The outputs of the three depthwise separable convolutions are concatenated along the channel dimension to obtain the concatenated features; the concatenated features are fused across channels by a 1×1 convolution to obtain the deep fused features.
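The series-parallel structure of the multi-scale convolutional unit can be sketched in NumPy as below. This is a structural illustration under assumptions: batch normalization, biases, and activations are omitted, each parallel branch is modelled as a depthwise convolution only (the pointwise mixing is left to the final 1×1 fusion), and the `params` dictionary and its keys are hypothetical.

```python
import numpy as np

def pointwise_conv(x, weight):
    # 1x1 convolution. x: (C_in, H, W), weight: (C_out, C_in)
    return np.einsum("oc,chw->ohw", weight, x)

def conv2d_same(x, weight):
    # Dense k x k convolution, stride 1, zero padding. weight: (C_out, C_in, k, k)
    c_out, _, k, _ = weight.shape
    _, h, w = x.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((c_out, h, w))
    for i in range(k):
        for j in range(k):
            out += pointwise_conv(xp[:, i:i + h, j:j + w], weight[:, :, i, j])
    return out

def depthwise_conv(x, kernels):
    # One k x k filter per channel, stride 1, zero padding. kernels: (C, k, k)
    _, h, w = x.shape
    k = kernels.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros(x.shape)
    for i in range(k):
        for j in range(k):
            out += kernels[:, i, j][:, None, None] * xp[:, i:i + h, j:j + w]
    return out

def multi_scale_unit(x, params):
    # Series part: 1x1 channel reduction, then a 5x5 convolution.
    y = conv2d_same(pointwise_conv(x, params["reduce"]), params["conv5"])
    # Parallel part: 3x3, 5x5 and 7x7 depthwise branches on the same input.
    branches = [depthwise_conv(y, params[f"dw{k}"]) for k in (3, 5, 7)]
    # Concatenate along channels and fuse with a 1x1 convolution.
    return pointwise_conv(np.concatenate(branches, axis=0), params["fuse"])
```

The three branches see the same cascaded features at different receptive fields, and concatenation triples the channel count before the final 1×1 convolution brings it back down.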
[0120] The dual attention unit is used to apply both channel attention and pixel attention to the deep fusion features, resulting in weighted enhanced features. In this embodiment, the dual attention unit applies dual weighting to the deep fusion features. The specific processing procedure is as follows:
[0121] First, channel attention weighting is applied to the deep fusion features: the spatial information of each channel is compressed into a scalar using global average pooling, resulting in a channel description vector; this vector is then passed through two fully connected layers, the first of which compresses the number of channels and the second of which restores the number of channels to its original value, followed by Sigmoid activation to obtain channel attention weights. The channel attention weights are then multiplied channel by channel with the deep fusion features to obtain the channel-weighted features;
[0122] Then, pixel attention weighting is applied to the channel-weighted features: the number of channels in the channel-weighted features is reduced to 1 by a 1×1 convolution to obtain a single-channel spatial attention map; the single-channel spatial attention map is activated by a Sigmoid function to obtain pixel attention weights; the pixel attention weights are multiplied element-wise with the channel-weighted features to obtain weighted enhancement features.
[0123] The weighted enhanced features are added element-wise to the deepest features to obtain the residual enhancement features.
[0124] In this embodiment, by cascading channel attention and pixel attention, the network can simultaneously extract globally shared information and location-related local information, effectively addressing the problem of uneven fog distribution in images.
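The dual weighting and the residual addition can be sketched in NumPy as follows. This is an illustrative sketch only: the ReLU between the two fully connected layers, the channel reduction from 8 to 2, and the names `w_fc1`, `w_fc2`, and `w_pix` are assumptions that are not stated in this embodiment, and biases are omitted.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def dual_attention(fused, deepest, w_fc1, w_fc2, w_pix):
    """fused, deepest: (C, H, W); w_fc1: (C_r, C); w_fc2: (C, C_r); w_pix: (C,)."""
    # Channel attention: global average pooling -> FC -> FC -> Sigmoid.
    desc = fused.mean(axis=(1, 2))                 # channel description vector
    hidden = np.maximum(w_fc1 @ desc, 0.0)         # ReLU here is an assumption
    ch_weights = sigmoid(w_fc2 @ hidden)           # (C,)
    ch_feat = fused * ch_weights[:, None, None]    # channel-weighted features

    # Pixel attention: 1x1 convolution to one channel -> Sigmoid.
    att_map = np.einsum("c,chw->hw", w_pix, ch_feat)
    px_weights = sigmoid(att_map)                  # (H, W)
    weighted = ch_feat * px_weights[None, :, :]    # weighted enhanced features

    # Residual connection with the deepest features.
    return weighted + deepest


rng = np.random.default_rng(0)
C, C_r = 8, 2                                      # hypothetical channel counts
fused = rng.normal(size=(C, 16, 16))
deepest = rng.normal(size=(C, 16, 16))
residual_enhanced = dual_attention(
    fused, deepest,
    rng.normal(size=(C_r, C)), rng.normal(size=(C, C_r)), rng.normal(size=C),
)
```

The channel weights are shared by every pixel of a channel, while the pixel weights vary by position, which is how the two stages cover global and location-dependent information respectively.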
[0125] The frequency domain enhancement unit is used to perform frequency domain transformation on the residual enhancement features and convolution enhancement on their amplitude components to obtain time-frequency fusion features. Referring to Figure 6, in this embodiment, the frequency domain enhancement unit processes the residual enhancement features as follows:
[0126] First, the residual enhancement features are batch normalized to obtain the third normalized features. Then, the third normalized features are subjected to a two-dimensional fast Fourier transform to convert them from the time domain to the frequency domain, resulting in amplitude and phase components. The amplitude component contains the gray-level distribution energy information of the image, and the phase component contains the geometric structure information of the image.
[0127] Then, the amplitude component is subjected to two 1×1 convolutions in sequence to obtain the enhanced amplitude component. The first 1×1 convolution expands the number of channels to twice the original number, and the second 1×1 convolution restores the number of channels to the original number. A Leaky ReLU activation function with a negative slope coefficient of 0.1 is applied between the two convolutions as the non-linear transformation, which alleviates the vanishing-gradient problem during training.
[0128] Finally, the enhanced amplitude component and the original phase component are subjected to inverse Fourier transform to obtain the time-frequency fusion features.
[0129] In this embodiment, the frequency domain enhancement unit expands the feature dimension and enriches the information source through frequency domain transformation. It uses frequency domain features to perform targeted enhancement of high-frequency edge information, enabling the network to better distinguish foreground and background information in blurred scenes and effectively improve detection performance.
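The amplitude-enhancement path can be sketched in NumPy as shown below. The sketch omits the batch normalization step, uses random weights, and keeps only the real part of the inverse transform; these choices and the names `w_expand` and `w_restore` are assumptions for illustration rather than details of this embodiment.

```python
import numpy as np


def leaky_relu(x, slope=0.1):
    """Leaky ReLU with the negative slope of 0.1 stated in the embodiment."""
    return np.where(x > 0, x, slope * x)


def frequency_enhance(x, w_expand, w_restore):
    """x: (C, H, W); w_expand: (2C, C); w_restore: (C, 2C)."""
    spec = np.fft.fft2(x, axes=(-2, -1))           # 2-D FFT per channel
    amp, phase = np.abs(spec), np.angle(spec)      # amplitude and phase

    # Two 1x1 convolutions on the amplitude: C -> 2C -> C.
    h = np.einsum("oc,chw->ohw", w_expand, amp)
    h = leaky_relu(h)
    amp_enh = np.einsum("oc,chw->ohw", w_restore, h)

    # Recombine the enhanced amplitude with the original phase and invert.
    spec_enh = amp_enh * np.exp(1j * phase)
    return np.fft.ifft2(spec_enh, axes=(-2, -1)).real


rng = np.random.default_rng(0)
C = 4                                              # hypothetical channel count
residual_enhanced = rng.normal(size=(C, 16, 16))
tf_fused = frequency_enhance(
    residual_enhanced,
    rng.normal(size=(2 * C, C)),
    rng.normal(size=(C, 2 * C)),
)
```

Because the phase is passed through unchanged, the geometric structure of the feature map is preserved and only the energy distribution is modified by the learned convolutions.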
[0130] The time-frequency fusion features are fused with the features at the other scales in the multi-level features (the shallow features and the mid-level features) using a top-down FPN (Feature Pyramid Network) structure and a bottom-up PAN (Path Aggregation Network) structure to obtain multi-scale fused features at three scales.
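The two fusion passes can be illustrated with a deliberately simplified NumPy sketch. It assumes that all three scales share the same channel count, that adjacent scales differ by a factor of two in resolution, and that features are merged by element-wise addition with nearest-neighbour upsampling and stride-2 subsampling. Practical FPN and PAN necks typically merge by concatenation followed by convolution blocks, so the sketch shows only the direction of information flow.

```python
import numpy as np


def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def downsample2x(x):
    """Stride-2 subsampling of a (C, H, W) map."""
    return x[:, ::2, ::2]


def fpn_pan(shallow, mid, deep_tf):
    # Top-down (FPN) pass: deep semantics flow to the finer scales.
    p_deep = deep_tf
    p_mid = mid + upsample2x(p_deep)
    p_shallow = shallow + upsample2x(p_mid)

    # Bottom-up (PAN) pass: fine localization cues flow back to coarser scales.
    n_shallow = p_shallow
    n_mid = p_mid + downsample2x(n_shallow)
    n_deep = p_deep + downsample2x(n_mid)
    return n_shallow, n_mid, n_deep


# Hypothetical feature maps filled with ones so the flow is easy to trace.
shallow = np.ones((4, 32, 32))
mid = np.ones((4, 16, 16))
deep_tf = np.ones((4, 8, 8))
fused_shallow, fused_mid, fused_deep = fpn_pan(shallow, mid, deep_tf)
```

With all-ones inputs, the three outputs take the constant values 3, 5, and 6, which makes it easy to see how many paths contribute to each scale.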
[0131] The detection head receives multi-scale fusion features output by the neck network and outputs the category labels and bounding box coordinates of all detected targets.
[0132] In this embodiment, the target detection neural network can be trained and evaluated using the publicly available Seaships7000 or ShipRSImageNet datasets. Before training, the original images in the publicly available datasets are first processed with fog to generate a fogged dataset. The fogging process uses an atmospheric scattering model.
[0133] $$I(x) = J(x)\,t(x) + A\,\bigl(1 - t(x)\bigr)$$
[0134] $$t(x) = e^{-\beta d(x)}$$
[0135] In the formula, $x$ denotes the spatial coordinates of the image, $I(x)$ is the generated foggy image, $J(x)$ is the original clear image, $A$ is the global atmospheric light (this embodiment takes $A = 0.5$), $t(x)$ is the transmittance, $\beta$ is the atmospheric scattering coefficient, preferably between 0.08 and 0.12, and $d(x)$ is the scene depth.
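A minimal NumPy sketch of fog synthesis with the atmospheric scattering model is given below, using the standard form in which the foggy image equals the clear image multiplied by the transmittance plus the atmospheric light multiplied by one minus the transmittance, with the transmittance decaying exponentially in the product of the scattering coefficient and the scene depth. The function name `add_fog` and the synthetic depth map are assumptions; the embodiment does not state how scene depth is obtained for the datasets.

```python
import numpy as np


def add_fog(clear, depth, beta=0.1, airlight=0.5):
    """Synthesize a foggy image.

    clear: (H, W, 3) image with values in [0, 1].
    depth: (H, W) scene depth map (its source is an assumption here).
    beta: atmospheric scattering coefficient (0.08 to 0.12 in the embodiment).
    airlight: global atmospheric light (0.5 in the embodiment).
    """
    t = np.exp(-beta * depth)[..., None]       # transmittance, broadcast over RGB
    return clear * t + airlight * (1.0 - t)


rng = np.random.default_rng(0)
clear = rng.uniform(size=(4, 6, 3))
# Hypothetical depth ramp: farther pixels toward the top of the image.
depth = np.linspace(30.0, 1.0, 4)[:, None] * np.ones((4, 6))
foggy = add_fog(clear, depth, beta=0.1, airlight=0.5)
```

At zero depth the output equals the clear image, and as depth grows the output converges to the atmospheric light, so a larger scattering coefficient yields denser fog for the same scene.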
[0136] The fogged dataset is randomly divided into a training set, a test set, and a validation set according to a preset ratio. The target detection neural network is trained on the training set using the Adam optimizer with an initial learning rate of 0.001, a batch size of 16, and 300 training epochs. The training objective is the total loss function $L_{total}$:
[0137] $$L_{total} = \lambda_1 L_{res} + \lambda_2 L_{cls} + \lambda_3 L_{loc}$$
[0138] In the formula, $L_{res}$ is the residual loss, $L_{cls}$ is the classification loss, and $L_{loc}$ is the localization loss; $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weighting coefficients of $L_{res}$, $L_{cls}$, and $L_{loc}$, respectively, each set to a preset value in this embodiment. $L_{cls}$ uses the binary cross-entropy loss, and $L_{loc}$ uses the complete intersection-over-union (CIoU) loss. $L_{res}$ is calculated as follows:
[0139]
[0140] In the formula, the three feature terms are the dehazed features, i.e., the output of the gated fusion subnetwork in the dehazing model; the features of the fogged image; and the label image features, i.e., the clear image features. The norm used is the L2 norm.
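The weighted combination of the three loss terms can be sketched in plain Python as follows. The binary cross-entropy and complete IoU (CIoU) functions follow their commonly used definitions; the weights `(1.0, 1.0, 1.0)` are placeholders rather than the values of this embodiment, and the residual loss is passed in as a number because its exact formula is not reproduced here.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and label y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def ciou_loss(box, gt, eps=1e-9):
    """Complete IoU loss for two boxes given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = gt
    w, h = x2 - x1, y2 - y1
    gw, gh = gx2 - gx1, gy2 - gy1
    iw = max(0.0, min(x2, gx2) - max(x1, gx1))
    ih = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = iw * ih
    iou = inter / (w * h + gw * gh - inter + eps)
    # Squared distance between box centres.
    rho2 = ((x1 + x2 - gx1 - gx2) ** 2 + (y1 + y2 - gy1 - gy2) ** 2) / 4.0
    # Squared diagonal of the smallest enclosing box.
    c2 = (max(x2, gx2) - min(x1, gx1)) ** 2 + (max(y2, gy2) - min(y1, gy1)) ** 2 + eps
    # Aspect-ratio consistency term.
    v = (4.0 / math.pi ** 2) * (math.atan(gw / (gh + eps)) - math.atan(w / (h + eps))) ** 2
    alpha = v / (1.0 - iou + v + eps)
    return 1.0 - iou + rho2 / c2 + alpha * v


def total_loss(l_res, l_cls, l_loc, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of residual, classification, and localization losses."""
    w_res, w_cls, w_loc = weights  # placeholder weights, not the patent's values
    return w_res * l_res + w_cls * l_cls + w_loc * l_loc


l_cls = bce(0.9, 1.0)
l_loc = ciou_loss((0.0, 0.0, 2.0, 2.0), (0.5, 0.5, 2.5, 2.5))
loss = total_loss(l_res=0.2, l_cls=l_cls, l_loc=l_loc)
```

Unlike a plain IoU loss, the CIoU loss stays informative for non-overlapping boxes, because the centre-distance term still provides a gradient toward the ground-truth box.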
[0141] The training process aims to continuously update the network weights using the gradient descent algorithm to minimize the total loss function. This enables the model to accurately predict both the target's category and location simultaneously.
[0142] To better illustrate the beneficial effects of this invention, the performance of the YOLOv8 model, the GCANet-YOLOv8 model, and the target detection neural network (HF-TFNet) trained in the above embodiments was compared on the Seaships7000 fogged dataset and the ShipRSImageNet fogged dataset, respectively. The evaluation metrics are the mean average precision at an intersection-over-union (IoU) threshold of 0.5 (mAP50), the mean average precision at an IoU threshold of 0.75 (mAP75), and the mean average precision averaged over IoU thresholds from 0.5 to 0.95 (mAP50:95). The GCANet-YOLOv8 model refers to a cascaded network using GCANet as the defogging model and YOLOv8 as the detection model. The Seaships7000 fogged dataset is obtained by fogging the Seaships7000 dataset, and the ShipRSImageNet fogged dataset is obtained by fogging the ShipRSImageNet dataset. The ShipRSImageNet fogged dataset divides the target types into four task levels, Level 0 to Level 3: Level 0 distinguishes whether an object is a specific target; Level 1 divides the targets into three categories; Level 2 divides the targets into 24 types; and Level 3 divides the targets into 50 types.
[0143] Table 1 shows the performance comparison of each method on the Seaships7000 fog dataset.
[0144] Table 1 Comparative Experiment Results of Seaships7000 Fog Dataset
[0145]
[0146] As shown in Table 1, the method of this invention achieves optimal results in all three metrics: mAP50, mAP75, and mAP50:95. Compared with the baseline model YOLOv8, all metrics show significant improvements; compared with GCANet-YOLOv8, it also achieves a significant performance gain. The experimental results verify the excellent detection capability of the method of this invention under dense fog conditions.
[0147] Table 2 shows the performance comparison of each method on the ShipRSImageNet fog dataset.
[0148] Table 2 Comparison Experiment Results of ShipRSImageNet Fog Dataset
[0149]
[0150] As shown in Table 2, the method of this invention achieves optimal performance across all four task layers, outperforming all comparative methods. This fully demonstrates the superiority of the method of this invention in fine-grained target detection tasks.
[0151] To verify the effectiveness of each key module in the method provided by this invention, an ablation experiment was conducted on the Seaships7000 fog dataset. The results are shown in Table 3. In Table 3, "√" indicates that the module was used and "×" indicates that it was not used.
[0152] Table 3. Results of HF-TFNet ablation experiments
[0153]
[0154] As can be seen from Table 3, adding either the HFFLP module or the MTFFE module alone can improve the detection accuracy. Adding both modules at the same time further improves the accuracy, indicating that the two have a synergistic enhancement effect.
[0155] Figure 7 provides a visual comparison of the detection results of the YOLOv8 model, the GCANet-YOLOv8 model, and HF-TFNet on the Seaships7000 fogged dataset. The figure shows that the method of this invention can effectively suppress fog interference, accurately detect targets obscured by fog, and has higher localization accuracy for small-scale targets.
[0156] Figure 8 provides a visual comparison of the detection results of the YOLOv8 model, the GCANet-YOLOv8 model, and HF-TFNet on the ShipRSImageNet fogged dataset. The results demonstrate that the method of this invention has significant advantages in fine-grained target detection tasks.
[0157] Figure 9 shows heat-map comparisons on the Seaships7000 fogged dataset. As can be seen from Figure 9, the activation region of the method of the present invention is more focused on the target body, indicating that the model can effectively suppress cloud and fog interference and concentrate attention on the target area.
[0158] The specific embodiments of the present invention are provided to enable those skilled in the art to understand or implement the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention.
[0159] It should be understood that the present invention is not limited to the content already described above, and various modifications and changes can be made without departing from its scope. The scope of the present invention is limited only by the appended claims.
Claims
1. A foggy target detection method based on a multi-scale time-frequency information enhancement mechanism, characterized in that: Acquire an image of the target object to be detected in foggy weather; preprocess the foggy target image to obtain an image tensor; The image tensor is input into a pre-trained target detection neural network, and the target detection neural network is used to perform target detection on the foggy target image. The target detection neural network includes a dehazing model and a detection model; The defogging model performs the following operations: The image tensor is encoded to obtain encoded features; The deepest feature in the encoding features is decoupled into a low-frequency cloud and fog component and a high-frequency target component; The low-frequency cloud and fog component is suppressed and enhanced using a self-attention mechanism with pooling operation to obtain an enhanced low-frequency cloud and fog component. The high-frequency target component is enhanced with detail preservation to obtain the enhanced high-frequency target component; The enhanced low-frequency cloud and fog component is fused with the enhanced high-frequency target component to obtain preliminary fusion features; The preliminary fusion features are fused with the deepest features in the encoded features to obtain the basic fusion features; The basic fusion features and the encoder intermediate features received by the skip connection are subjected to gating context aggregation processing to obtain gating enhancement features; The gated enhancement features are decoded to obtain dehazing enhancement features; The detection model performs the following operations: Multi-scale feature extraction is performed on the dehazing enhancement features to obtain multi-level features; The deepest feature in the multi-level feature hierarchy is sequentially subjected to multi-scale convolution and channel fusion to obtain deep fused features. 
The deep fusion features are weighted by both channel attention and pixel attention to obtain weighted enhanced features; The weighted enhancement features are added element-wise to the deepest feature in the multi-level features to obtain the residual enhancement features; The residual enhancement features are subjected to frequency domain transformation and their amplitude components are enhanced by convolution to obtain time-frequency fusion features; The time-frequency fusion feature is fused with features of other scales in the multi-level feature to obtain the multi-scale fusion feature; The multi-scale fused features are input into the detection head, which outputs the category labels and bounding box coordinates of all detected targets.
2. The foggy target detection method based on a multi-scale time-frequency information enhancement mechanism according to claim 1, characterized in that, The process of decoupling the deepest features in the encoded features into low-frequency cloud and fog components and high-frequency target components includes: The deepest feature in the encoded features is subjected to mean pooling to obtain the low-frequency cloud and fog component. The low-frequency cloud and fog component is upsampled to obtain the upsampled low-frequency cloud and fog component. The high-frequency target component is obtained by subtracting the upsampled low-frequency cloud component from the deepest feature in the encoded features.
3. The foggy target detection method based on a multi-scale time-frequency information enhancement mechanism according to claim 1, characterized in that, The process of suppressing and enhancing the low-frequency cloud and fog components includes: Generate a first query, a first key, and a first value based on the low-frequency cloud and fog components; After pooling downsampling the first key and the first value, the self-attention weights are calculated and information is aggregated to obtain the aggregated low-frequency features; The aggregated low-frequency features are upsampled to obtain an enhanced low-frequency cloud component, which has the same spatial resolution as the deepest feature in the encoded features.
4. The foggy target detection method based on a multi-scale time-frequency information enhancement mechanism according to claim 1, characterized in that, The process of performing detail-preserving enhancement on the high-frequency target components includes: A second query, a second key, and a second value are generated based on the high-frequency target components; The self-attention weights are calculated based on the second query, the second key, and the second value, and information is aggregated to obtain the enhanced high-frequency target components.
5. The foggy target detection method based on a multi-scale time-frequency information enhancement mechanism according to claim 1, characterized in that, The process of fusing the enhanced low-frequency cloud and fog component with the enhanced high-frequency target component includes: The enhanced high-frequency target component is added to the enhanced low-frequency cloud and fog component in residual form to obtain preliminary fusion features.
6. The foggy target detection method based on a multi-scale time-frequency information enhancement mechanism according to claim 1, characterized in that, the process of performing multi-scale convolution on the deepest feature in the multi-level features includes: first, feature extraction is performed using cascaded 1×1 and 5×5 convolutions to obtain cascaded convolution features; then, multi-scale spatial feature extraction is performed on the cascaded convolution features using parallel 3×3, 5×5, and 7×7 depthwise separable convolution kernels.
7. The fog target detection method based on a multi-scale time-frequency information enhancement mechanism according to claim 1, characterized in that, The process of applying dual weighting of channel attention and pixel attention to the deep fusion features includes: First, channel attention weighting is applied to the deep fusion features to extract global channel information and obtain channel-weighted features; Then, pixel attention weighting is applied to the channel weighted features to extract local pixel location information, resulting in weighted enhanced features.
8. The fog target detection method based on a multi-scale time-frequency information enhancement mechanism according to claim 1, characterized in that, The process of performing frequency domain transformation on the residual enhancement features and convolution enhancement on their amplitude components includes: A two-dimensional fast Fourier transform is performed on the residual enhancement features to obtain the amplitude component and the phase component; The amplitude component is subjected to two 1×1 convolutions in sequence, and a nonlinear transformation is performed between the two convolutions using the Leaky ReLU activation function to obtain the enhanced amplitude component. The phase component and the enhanced amplitude component are subjected to inverse Fourier transform to obtain the time-frequency fusion characteristics.
9. A foggy target detection system based on a multi-scale time-frequency information enhancement mechanism, used to implement the foggy target detection method according to any one of claims 1 to 8, characterized in that it comprises: an image acquisition module, used to acquire a foggy target image to be detected; a preprocessing module, used to preprocess the foggy target image to obtain an image tensor; and a target detection module, which includes a pre-trained target detection neural network used to perform target detection on the image tensor and output the category labels and bounding box coordinates of all detected targets.