Multi-scale target detection method based on feature enhancement and pixel inversion dehazing
By constructing a feature enhancement module and a multi-scale target detection model in underwater target detection, the problem of low detection accuracy of small targets in complex backgrounds is solved, and higher detection accuracy and network adaptability are achieved.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- AUTOLINK INFORMATION TECHNOLOGY CO LTD
- Filing Date
- 2024-12-23
- Publication Date
- 2026-06-18
AI Technical Summary
Existing underwater target video detection methods have low detection accuracy in complex backgrounds and when the target is too small.
A feature enhancement module is constructed, which combines the SE attention mechanism module and the multi-scale object detection model. By embedding the SE attention mechanism module and the enhanced feature extraction module in the specified convolutional layer, the network's adaptability to changes in the size of the target object is enhanced. VGG16 is used as the base network, and the feature enhancement module is fused and multiple prediction modules are set to extract feature maps at different scales.
It improves the detection accuracy of small targets, enhances the network's adaptability to scale changes, reduces the loss of feature information, and improves the accuracy of underwater target detection.
Description
A multi-scale target detection method based on feature enhancement and pixel inversion dehazing
Technical Field
[0001] This application relates to the field of image recognition technology, specifically to a multi-scale target detection method based on feature enhancement and pixel inversion dehazing.
Background Technology
[0002] When developing wading-capable intelligent vehicles, integrating underwater target detection with onboard image acquisition equipment is an essential research direction. Our company currently uses the method of patent application number CN202411020787.8, which discloses a lightweight underwater target video detection method. That method uses an inverted-residual dilated convolution module to build a lightweight feature extraction backbone network. After features are learned by deep convolution and passed through classification and localization regression, an SSH detection head produces multi-channel detection results that are fused for detection, achieving a good balance between underwater target detection speed and accuracy. In practical applications, however, this existing technical solution has been found to have low detection accuracy in complex backgrounds and when the target is too small.
Technical issues
[0003] To address the low detection accuracy of existing underwater target video detection methods in complex backgrounds and when the target is too small, this application provides a multi-scale target detection method based on feature enhancement and pixel inversion dehazing. This method enhances the network's adaptability to changes in target size and effectively improves the detection accuracy of small targets.
Technical solutions
[0004] The technical solution of this application is as follows: a multi-scale object detection method based on feature enhancement and pixel inversion dehazing, characterized by the following steps:
S1: Construct a feature enhancement module. The feature enhancement module is constructed by embedding the SE attention mechanism module into a specified convolutional layer.
S2: Construct a multi-scale object detection model. The multi-scale object detection model includes a preprocessing module, a backbone network, and an output layer connected in sequence. The backbone network uses VGG16 as the base network and incorporates the feature enhancement module SFEM; it includes convolutional layers 1 to 5, a first fully connected FC layer, a second fully connected FC layer, and convolutional layers 6 to 9 connected in sequence. Prediction modules of different scales are set after convolutional layer 4, the second fully connected FC layer, and each of convolutional layers 6 to 9. A feature enhancement module is embedded in convolutional layer 4, the second fully connected FC layer, and convolutional layer 6. The output of convolutional layer 4 is processed by the feature enhancement module and then sent to the prediction module corresponding to that layer. The outputs of convolutional layer 4 and the second fully connected FC layer are each processed by feature enhancement, then convolved, and then processed by the feature enhancement module again; the result is denoted the secondary enhanced feature and is sent to the prediction module corresponding to the second fully connected FC layer. The output of convolutional layer 6 is processed by feature enhancement, convolved with the secondary enhanced feature, processed by the feature enhancement module again, and then sent to the prediction module corresponding to convolutional layer 6. All feature maps output by all prediction modules are superimposed to obtain the final output result.
S3: Construct a training dataset and a validation dataset based on historical data, and train the multi-scale object detection model on the training dataset to obtain a trained multi-scale object detection model.
S4: Recognize the image to be recognized using the trained multi-scale object detection model.
[0005] Its further features are as follows. The feature enhancement module includes an SE attention mechanism module, an enhanced feature extraction module, and an add operation connected in sequence. The SE attention mechanism module is added before the convolution operation of a specified channel of the convolutional network layer to be processed, and extracts attention weights from the output feature map of that channel. The attention weights extracted by the SE attention mechanism module are multiplied onto the output feature map of that channel through a scale operation, and the result is fed into the enhanced feature extraction module for enhanced feature extraction. Finally, the output feature map of the enhanced feature extraction module is combined with the input feature map of the next layer channel by the add method, and the channel size of the final output feature map is adjusted to the channel size of the next layer. The enhanced feature extraction module includes, in sequence, a convolution operation with a 1*1 kernel, a convolution operation with a 3*3 kernel and a stride of 2, and a convolution operation with a 1*1 kernel. The prediction module includes a detector and a classifier. The preprocessing module includes, in sequence, an image inversion operation, dark channel calculation, atmospheric light intensity estimation, transmittance estimation, transmission optimization, and a second image inversion operation. The numbers of channels in convolutional layers 1 to 5 are set to 64, 128, 256, 512, and 512, respectively. The numbers of channels in convolutional layers 6 to 9 are set to 512, 256, 256, and 256, respectively. The scales corresponding to the prediction modules are 38*38, 19*19, 10*10, 5*5, 3*3, and 1*1, respectively.
Beneficial effects
[0006] This application provides a multi-scale target detection method based on feature enhancement and pixel inversion dehazing. It introduces an SE attention mechanism combined with an enhanced feature extraction module into a designated convolutional layer to construct a feature enhancement module. This solves the channel attention problem during multi-scale feature fusion, reduces the loss of contextual information in feature maps in deep networks, and effectively improves the accuracy of underwater organism recognition at different scales. The backbone network of the multi-scale target detection model is built on VGG16 as the base network. The feature enhancement module is embedded in convolutional layer 4, the second fully connected FC layer, and convolutional layer 6, respectively. The outputs of convolutional layer 4, the FC layer, and convolutional layer 6 of the backbone network are stacked and connected after multi-scale enhancement by the feature enhancement module for feature extraction. This fuses features from different convolutional layers, allowing more shallow detail information to be transmitted to deeper layers and thus extracting richer information. This application employs a feature fusion strategy and integrates the feature enhancement module to fully extract features at different levels and scales, reducing information loss during feature propagation and improving network performance. The method sets up six prediction modules that extract feature maps at six different scales to meet the detection needs of targets of different sizes, which greatly enhances the network's ability to adapt to scale changes.
Attached Figure Description
[0007] Figure 1 is a schematic diagram of the SFEM module; Figure 2 is a schematic diagram of the multi-scale target detection model; Figure 3 is a visualization of the output feature maps of SFEM-SSD and SSD; Figure 4 is a graph of the loss function of the SSD target detection algorithm; Figure 5 is a graph of the loss function of the SFEM-SSD network algorithm; Figure 6 is a comparison of the training loss of SSD and the training loss of the multi-scale feature fusion target detection network; Figure 7 is a comparison of the validation loss of SSD and the validation loss of the multi-scale feature fusion target detection network; Figure 8 is a comparison of the underwater fish detection results of different target detection networks.
Embodiments of the present invention
[0008] This application includes a multi-scale target detection method based on feature enhancement and pixel inversion dehazing, comprising the following steps.
[0009] S1: Construct the feature enhancement module.
[0010] Small targets in the image to be identified occupy few pixels and are poorly resistant to interference, so repeated convolution and pooling discards a large amount of their feature information and lowers the accuracy of small target detection. To address this issue, this application embeds an SE attention mechanism module into the network to improve the network's resistance to interference and the accuracy of small target detection. The resulting module is named the SE-Feature-Enhancement module (hereinafter referred to as the SFEM module). The network structure diagram of the SFEM module is shown in Figure 1.
[0011] The feature enhancement module includes: the SE attention mechanism module, the enhanced feature extraction module, and the add operation, which are connected in sequence; the enhanced feature extraction module includes: a convolution operation with a 1*1 kernel, a convolution operation with a 3*3 kernel and a stride of 2, and a convolution operation with a 1*1 kernel, which are set in sequence.
[0012] The SE attention mechanism (Squeeze-and-Excitation mechanism) consists of two main steps: Squeeze and Excitation. In the Squeeze step, the input feature map is compressed into a vector using global average pooling, capturing global statistics for each channel. Next, in the Excitation step, a fully connected layer and a sigmoid function are used to generate weights for each channel, and these weights are multiplied by the original input feature map to obtain a weighted feature map. In this way, the model can adaptively learn the importance of each channel and adjust accordingly.
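The squeeze, excitation, and scale steps described above can be illustrated with a minimal pure-Python sketch. It follows the single fully connected layer plus sigmoid described in this paragraph (the commonly used SE block instead uses two fully connected layers with a ReLU in between). The function name `se_attention`, the weight matrix, and the toy feature map are hypothetical and are not taken from the application.

```python
import math


def se_attention(feature_map, fc_weights, fc_bias):
    """Apply squeeze-and-excitation to a feature map.

    feature_map: list of C channels, each an H x W list of floats.
    fc_weights:  C x C matrix of a (hypothetical) fully connected layer.
    fc_bias:     list of C biases.
    Returns (channel_weights, scaled_feature_map).
    """
    # Squeeze: global average pooling gives one statistic per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation: fully connected layer followed by a sigmoid.
    weights = []
    for w_row, b in zip(fc_weights, fc_bias):
        z = sum(w * s for w, s in zip(w_row, squeezed)) + b
        weights.append(1.0 / (1.0 + math.exp(-z)))
    # Scale: multiply every value of a channel by that channel's weight.
    scaled = [
        [[value * wt for value in row] for row in channel]
        for channel, wt in zip(feature_map, weights)
    ]
    return weights, scaled


# Toy input: 2 channels of size 2 x 2 (illustrative values only).
fm = [
    [[1.0, 1.0], [1.0, 1.0]],  # channel 0, mean 1.0
    [[4.0, 4.0], [4.0, 4.0]],  # channel 1, mean 4.0
]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, scaled = se_attention(fm, identity, [0.0, 0.0])
```

With the identity matrix as the fully connected layer, each channel weight is simply the sigmoid of that channel's mean, so the channel with the stronger response keeps more of its magnitude after scaling.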
[0013] The SFEM module is a top-down structure consisting of three parts. First, an SE attention mechanism module is added before the initial convolution operation of a specified channel to explicitly model the dependencies between channels and adaptively recalibrate the channel feature responses. The SE attention mechanism module assigns a weight to each channel, so that multiple channels contribute to the result; these weights represent the influence of each channel on feature extraction. A smaller weight scales down the feature map of that channel, which therefore has a smaller impact on the final output. In other words, during image feature extraction, the feature maps output by some convolutional layers have a greater impact on the final result, while others have a smaller impact. By applying weights derived from the channels themselves to these feature maps, the channel weights are adjusted adaptively according to the features extracted by the convolutional layers, so that feature maps with a greater impact on the final result have a larger influence.
[0014] Secondly, after the SE attention mechanism module has been applied before the initial convolution operation on the specified channels, the feature map undergoes further enhancement in the enhanced feature extraction module. First, a 1*1 convolution is performed on the output feature map of the previous layer to adjust its channel count. Then, a 3*3 convolution with a stride of 2 is performed, which adjusts the width and height of the feature map and simplifies computation. The data is then fed into the subsequent 1*1 and 3*3 convolutional layers for enhanced feature extraction. Finally, a 1*1 convolutional layer is used to adjust the channel count. Because downsampling is done by convolution, deeper features have a larger receptive field and can better capture important features in the image. Furthermore, compared to the weighted sum operation of pooling layers, the pointwise operation of a 1*1 convolution is more conducive to optimization. Therefore, the pooling layers following the first convolutional layer with a 1*1 kernel and the convolutional layer with a 3*3 kernel are removed and replaced with convolutional layers with a 1*1 kernel. This reduces the risk of losing effective information, improves information fusion, and enhances feature extraction.
[0015] Finally, the `add` method is used to combine the feature maps output by the enhanced convolutional network layers with the input feature maps of the next layer's channels, and the channels are adjusted to the corresponding size of the next layer's channels. Compared to the `concat` method, which stacks feature maps along the channel dimension, the `add` method superimposes the extracted information element by element, highlighting the proportion of correctly classified information, which is beneficial for the final target classification and achieves high activation for correct classification.
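The structural difference between the two merge methods can be shown with a small sketch: `add` requires matching shapes and keeps the channel count, whereas `concat` sums the channel counts. The helper names `add_maps` and `concat_maps` and the toy values are hypothetical illustrations, not part of the application.

```python
def add_maps(a, b):
    """Element-wise addition: shapes must match, channel count is unchanged."""
    if len(a) != len(b):
        raise ValueError("add requires the same number of channels")
    return [
        [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(ch_a, ch_b)]
        for ch_a, ch_b in zip(a, b)
    ]


def concat_maps(a, b):
    """Channel-wise concatenation: the channel counts are summed."""
    return a + b


# Two toy feature maps with 2 channels of size 1 x 2 (illustrative values).
a = [[[1.0, 2.0]], [[3.0, 4.0]]]
b = [[[10.0, 20.0]], [[30.0, 40.0]]]

added = add_maps(a, b)       # still 2 channels, values summed
stacked = concat_maps(a, b)  # 4 channels, values untouched
```

Because `add` leaves the channel count unchanged, the merged map can be passed to the next layer after only the channel adjustment mentioned above, while `concat` would widen the input of every layer that follows.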
[0016] Specifically, after a feature enhancement module is added to a specified convolutional layer, the following operations are performed on every two adjacent channels of the convolutional network layer to be processed. For example, if a convolutional layer includes 3 channels, then channels 1 and 2 are adjacent channels, as are channels 2 and 3. In Figure 1, FM1 (H*W*C) and FM2 (H'*W'*C') are two adjacent channels. The SE attention mechanism module is set before the first convolution of the FM1 channel. After the attention weights for the FM1 channel are extracted, they are multiplied onto the output feature map of the FM1 channel through the scale operation, and the result is sent to the enhanced feature extraction module for enhanced feature extraction. Finally, the output feature map of the enhanced feature extraction module is combined with the input feature map of the FM2 channel using the add method, and the channels of the combined feature map are adjusted to the corresponding size (H'*W'*C') of the next feature extraction convolutional network layer to obtain the feature map New_FM_2, which is then sent to the FM2 channel.
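To make the FM1 to New_FM_2 flow easier to follow, the sketch below tracks only tensor shapes through the stages named above. It is an illustration under stated assumptions: the 3*3 convolution is assumed to use a padding of 1, and the intermediate channel count `mid_channels` and the example shapes are hypothetical values that the application does not specify.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution (floor division form)."""
    return (size + 2 * padding - kernel) // stride + 1


def sfem_shape_trace(fm1, fm2, mid_channels):
    """Return (stage, shape) pairs for FM1 -> SFEM -> New_FM_2.

    fm1 and fm2 are (H, W, C) tuples; mid_channels is a hypothetical
    bottleneck width for the enhanced feature extraction module.
    """
    h, w, c = fm1
    trace = [("FM1 input", (h, w, c))]
    # SE scale multiplies each channel by its weight, so the shape is unchanged.
    trace.append(("SE scale", (h, w, c)))
    # 1*1 convolution only changes the channel count.
    trace.append(("1*1 conv", (h, w, mid_channels)))
    # 3*3 convolution with stride 2 (assumed padding 1) halves width and height.
    h2, w2 = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)
    trace.append(("3*3 conv, stride 2", (h2, w2, mid_channels)))
    # Final 1*1 convolution adjusts the channels to those of FM2.
    trace.append(("1*1 conv", (h2, w2, fm2[2])))
    if (h2, w2, fm2[2]) != tuple(fm2):
        raise ValueError("enhanced map does not match FM2, add is not possible")
    # add keeps the shape of FM2.
    trace.append(("add with FM2 -> New_FM_2", tuple(fm2)))
    return trace


# Illustrative shapes only: a 38*38 map feeding a 19*19 map.
trace = sfem_shape_trace((38, 38, 512), (19, 19, 512), mid_channels=256)
for stage, shape in trace:
    print(f"{stage:28s} {shape}")
```

The shape check before the final step reflects the requirement in the text that the enhanced map be adjusted to H'*W'*C' before it is added to FM2.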
[0017] S2: Construct a multi-scale target detection model.
[0018] As shown in Figure 2, the multi-scale target detection model includes a preprocessing module, a backbone network, and an output layer connected in sequence.
[0019] The network's input section receives raw image data and preprocesses it to improve input data quality and subsequent network learning efficiency. The preprocessing module is based on a double-inversion dehazing module and includes, in sequence: image inversion, dark channel calculation, atmospheric light intensity estimation, transmittance estimation, transmission optimization, and a second image inversion. First, the input image undergoes an image inversion operation. Then, the dark channel is calculated (the darkest value across the RGB channels), the atmospheric light intensity corresponding to the image is estimated, the transmittance is estimated, and transmission optimization is performed. The result is the enhanced image to be recognized. Finally, the image inversion operation is performed again to obtain the preprocessed image.
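The preprocessing sequence can be sketched in pure Python as below. This is a minimal illustration that assumes the widely used dark channel prior formulation (transmission t = 1 - omega * dark(I / A), recovery J = (I - A) / max(t, t_min) + A); the application does not give its exact formulas. The parameters `omega`, `t_min`, and `radius`, all function names, and the toy image are assumptions, and the transmission optimization step (for example a guided filter) is omitted.

```python
def invert(img):
    """Pixel inversion for an RGB image with values in [0, 1]."""
    return [[[1.0 - c for c in px] for px in row] for row in img]


def dark_channel(img, radius=1):
    """Minimum over the RGB channels and over a (2*radius+1)^2 window."""
    h, w = len(img), len(img[0])
    per_pixel_min = [[min(px) for px in row] for row in img]
    dark = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dark[y][x] = min(
                per_pixel_min[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            )
    return dark


def estimate_atmospheric_light(img, dark):
    """Use the colour of the pixel with the largest dark-channel value."""
    h, w = len(img), len(img[0])
    y, x = max(
        ((j, i) for j in range(h) for i in range(w)),
        key=lambda p: dark[p[0]][p[1]],
    )
    return [max(c, 1e-6) for c in img[y][x]]


def estimate_transmission(img, light, omega=0.95, radius=1):
    """t = 1 - omega * dark_channel(I / A)."""
    normalised = [[[c / a for c, a in zip(px, light)] for px in row] for row in img]
    return [[1.0 - omega * d for d in row] for row in dark_channel(normalised, radius)]


def dehaze(img, omega=0.95, t_min=0.1, radius=1):
    """Recover J = (I - A) / max(t, t_min) + A, clipped to [0, 1]."""
    dark = dark_channel(img, radius)
    light = estimate_atmospheric_light(img, dark)
    trans = estimate_transmission(img, light, omega, radius)
    out = []
    for row, t_row in zip(img, trans):
        out_row = []
        for px, t in zip(row, t_row):
            t = max(t, t_min)
            out_row.append(
                [min(1.0, max(0.0, (c - a) / t + a)) for c, a in zip(px, light)]
            )
        out.append(out_row)
    return out


def double_inversion_preprocess(img):
    """Invert, dehaze, then invert again."""
    return invert(dehaze(invert(img)))


# Toy 3 x 3 RGB image with values in [0, 1] (illustrative only).
hazy = [
    [[0.2, 0.4, 0.5], [0.3, 0.5, 0.6], [0.2, 0.4, 0.6]],
    [[0.1, 0.3, 0.4], [0.2, 0.5, 0.7], [0.3, 0.5, 0.6]],
    [[0.2, 0.3, 0.5], [0.1, 0.4, 0.6], [0.2, 0.4, 0.5]],
]
restored = double_inversion_preprocess(hazy)
```

Inverting first turns dark, low-contrast regions into bright, haze-like regions, so the dehazing step acts on them; the second inversion maps the result back to the original intensity polarity.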
[0020] The backbone network uses VGG16 as the base network and incorporates the feature enhancement module SFEM, which includes: convolutional layers 1 (Conv1) to 5 (Conv5), the first fully connected FC layer (fc6), the second fully connected FC layer (fc7), and convolutional layers 6 (Conv6) to 9 (Conv9) connected in sequence.
[0021] The numbers of channels in convolutional layers 1 through 5 are set to 64, 128, 256, 512, and 512, respectively. During feature extraction, increasing the number of channels in the convolution operations yields more image feature information.
[0022] The numbers of channels in convolutional layers 6 through 9 are set to 512, 256, 256, and 256, respectively. While increasing the number of channels enhances the expressive power of features, it can also significantly increase computational cost and the number of parameters. To address this issue, this application reduces the number of channels in the later layers of the network, i.e., reduces the depth of the feature maps. The main purpose of reducing the number of channels is to reduce the computational complexity and storage requirements of the network, thereby improving its efficiency.
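The effect of channel width on parameter count can be checked with a short worked example. The sketch assumes a single 3*3 convolution with a bias term and equal input and output widths; the helper `conv_params` is a hypothetical name, and the actual layer composition of the network is not specified at this level of detail in the application.

```python
def conv_params(kernel, in_channels, out_channels, bias=True):
    """Number of learnable parameters of one k*k convolution layer."""
    return kernel * kernel * in_channels * out_channels + (out_channels if bias else 0)


wide = conv_params(3, 512, 512)    # 3*3 convolution at 512 channels
narrow = conv_params(3, 256, 256)  # 3*3 convolution at 256 channels
ratio = wide / narrow

print(wide, narrow, round(ratio, 2))
```

Halving both the input and output widths cuts the weights of such a layer to roughly a quarter, which is the kind of saving the reduction from 512 to 256 channels aims at.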
[0023] A prediction module is placed after convolutional layer 4, the second fully connected (FC) layer, and convolutional layers 6 through 9. Each prediction module includes a detector and a classifier; the final output is obtained by superimposing all feature maps from all prediction modules.
[0024] The multi-scale target detection model has a total of 6 prediction modules, which extract feature maps of 6 different scales to adapt to the detection needs of targets of different sizes. The corresponding output feature map sizes are 38*38, 19*19, 10*10, 5*5, 3*3, and 1*1.
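The six listed sizes are consistent with the standard convolution output-size formula, as the sketch below shows. The kernel, stride, and padding of each reduction step are assumptions borrowed from a common SSD300-style configuration (three stride-2 steps with padding 1, then two unpadded stride-1 steps); the application itself only lists the resulting sizes.

```python
def conv_out(size, kernel, stride, padding):
    """Output size: floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


# (kernel, stride, padding) for each reduction step; assumed values.
STEPS = [(3, 2, 1), (3, 2, 1), (3, 2, 1), (3, 1, 0), (3, 1, 0)]


def prediction_scales(first=38):
    """Chain the reduction steps starting from the first prediction scale."""
    sizes = [first]
    for kernel, stride, padding in STEPS:
        sizes.append(conv_out(sizes[-1], kernel, stride, padding))
    return sizes


scales = prediction_scales()
print(scales)
```

Other combinations of stride and padding can produce the same sequence, so this should be read as one plausible configuration rather than the layer settings of the application.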
[0025] Feature enhancement modules are embedded in convolutional layer 4, the second fully connected (FC) layer, and convolutional layer 6, respectively. This method utilizes features from different levels to capture multi-scale information. Convolutional layer 4 serves as a shallow feature layer, better capturing small objects; the outputs of convolutional layer 6 and FC7 serve as deep features, helping to handle large objects and complex backgrounds. By fusing these features, the network can simultaneously focus on both small and large objects, avoiding missed or false detections.
[0026] The outputs of Convolutional Layer 4 and the second fully connected FC layer are processed by feature enhancement, followed by a convolution operation, and then further enhanced by the feature enhancement module. The output is denoted as the secondary enhanced feature. The secondary enhanced feature is then fed into the prediction module corresponding to the second fully connected FC layer. The output of Convolutional Layer 6 is processed by the feature enhancement module, followed by a convolution operation with the secondary enhanced feature, and then further enhanced by the feature enhancement module before being fed into the prediction module corresponding to Convolutional Layer 6. Convolutional Layer 4, the second fully connected FC layer, and Convolutional Layer 6 are each enhanced by the feature enhancement module at multiple scales. The enhanced features are then extracted again by SFEM. Through the multi-layer SFEM module, high-level semantic information and low-level detail information are fused, allowing more shallow detail information to be transmitted to the deeper layers.
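One possible reading of this wiring is sketched below symbolically: each operation returns a string describing what was applied, so the nesting of the fused features can be inspected without any real tensors. The helper names `sfem` and `conv` and the exact order of nesting are an interpretation of this paragraph, not a confirmed implementation.

```python
def sfem(x):
    """Symbolic stand-in for one pass through the feature enhancement module."""
    return f"SFEM({x})"


def conv(*inputs):
    """Symbolic stand-in for a convolution over one or more fused inputs."""
    return "Conv(" + ", ".join(inputs) + ")"


# Branch 1: convolutional layer 4 is enhanced and sent to its prediction module.
to_pred_conv4 = sfem("conv4")

# Branch 2: conv4 and fc7 are each enhanced, convolved together, and enhanced
# again; the result is the secondary enhanced feature.
secondary = sfem(conv(sfem("conv4"), sfem("fc7")))
to_pred_fc7 = secondary

# Branch 3: conv6 is enhanced, convolved with the secondary enhanced feature,
# and enhanced again before its prediction module.
to_pred_conv6 = sfem(conv(sfem("conv6"), secondary))

print(to_pred_conv4)
print(to_pred_fc7)
print(to_pred_conv6)
```

The printed expressions make the top-down reuse visible: the input to the convolutional layer 6 predictor contains the fc7 branch, which in turn contains the convolutional layer 4 branch, which is how shallow detail is carried into deeper predictions.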
[0027] In the backbone network, convolutional layer 4, the second fully connected (FC) layer, and convolutional layer 6 each include a 3x3 convolutional kernel. The third convolution result of convolutional layer 4, the convolution result of the FC7 layer, and the second convolution result of convolutional layer 6 are stacked and connected after multi-scale feature enhancement for feature extraction. This method captures multi-scale information using features at different levels. By passing multi-layer features from top to bottom, high-level semantic information is fused with low-level detailed information. Stacked connections preserve rich feature information, and subsequent convolutional layers further extract abstract high-level features. By fusing these features, the network can simultaneously focus on small and large objects, avoiding missed or false detections. Without needing to relearn redundant features, this feature fusion strategy can obtain a large amount of feature information with fewer convolutions, thereby maximizing the optimization of information features in the neural network and enhancing the expressive power of the network model.
[0028] The backbone network uses VGG16 as its base network to extract basic feature information from the input image. However, simply using the final output of VGG16 as the feature source cannot fully utilize the rich information in the image. Therefore, a key improvement is to stack and connect the multi-layer outputs of VGG16, with the enhanced feature extraction network receiving the output of the backbone feature extraction network as its input. This not only preserves shallow, detailed features but also integrates deep, abstract features, greatly enriching the expressive power of the features. The design of this part aims to further enhance the expressiveness and diversity of the features. By reprocessing and fusing features at different levels, richer and more powerful feature representations can be generated, providing more accurate information for subsequent target detection. In particular, this application extracts feature maps at six different scales to meet the detection needs of targets of different sizes, which greatly enhances the network's adaptability to scale changes. Finally, the features extracted by the convolutional neural network are used for localization regression and classification to achieve underwater target detection.
[0029] In this application, a feature fusion strategy is added to the backbone feature extraction network and combined with a feature enhancement module, so that the neural network can focus on multi-dimensional feature information at different scales during the feature extraction stage, thereby enhancing the detection of underwater targets.
[0030] S3: Construct training and validation datasets based on historical data, and train the multi-scale object detection model based on the training dataset to obtain a trained multi-scale object detection model; S4: Recognize the image to be recognized based on the trained multi-scale object detection model.
[0031] To verify the performance of this method, simulation experiments were carried out. The experimental platform configuration was as follows: Windows 11 operating system, CUDA 10.1, CUDNN 7.6.5, Python 3.7, processor: AMD Ryzen 7 5800H with Radeon Graphics 3.20 GHz, graphics card: NVIDIA GeForce RTX 3060, using the TensorFlow framework.
[0032] To ensure fairness in the experiment, the same initial training parameters were set for each group of experiments. In this experiment, the object detection network initialized the backbone network with pre-trained weights, thereby shortening the training time. In the freezing phase, the batch size was set to 16, the number of epochs to 50, and the learning rate to 2×10⁻³. In the unfreezing phase, the parameters of the entire network were adjusted, with the batch size set to 8, the number of epochs set to 100, and the learning rate set to 2×10⁻⁴. The SGD method was used to minimize the loss function, with the weight decay coefficient set to 5×10⁻⁴. During testing, the confidence threshold was set to 0.5, and the IOU threshold used for non-maximum suppression (NMS) was set to 0.2.
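The two test-time thresholds can be illustrated with a minimal greedy non-maximum suppression sketch. The defaults 0.5 and 0.2 come from the settings above; the function names, the `(x1, y1, x2, y2)` box format, and the toy boxes and scores are hypothetical, and the application does not state which NMS variant it uses.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(boxes, scores, conf_threshold=0.5, nms_iou=0.2):
    """Drop low-confidence boxes, then apply greedy NMS; returns kept indices."""
    candidates = sorted(
        (i for i, s in enumerate(scores) if s >= conf_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept = []
    for i in candidates:
        # Keep a box only if it overlaps no already kept box by more than nms_iou.
        if all(iou(boxes[i], boxes[k]) <= nms_iou for k in kept):
            kept.append(i)
    return kept


# Toy detections (illustrative only).
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (40, 40, 50, 50)]
scores = [0.9, 0.8, 0.7, 0.3]
kept = filter_detections(boxes, scores)
print(kept)
```

Here the second box overlaps the first with an IOU of about 0.68 and is suppressed, the fourth falls below the confidence threshold, and the remaining two are kept. A threshold as low as 0.2 suppresses aggressively, which can also merge genuinely adjacent targets such as fish in a dense school.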
[0033] The original SSD (Single Shot Multibox Detector) model is constructed based on VGG-16. The multi-scale target detection model of this method, obtained by introducing the SFEM module into SSD, is denoted SFEM-SSD.
[0034] Figure 3 shows the feature maps A, C, and E from the original SSD output on the left, and the feature maps B, D, and F from the SFEM-SSD model on the right. From the three images A, C, and E on the left, it can be seen that while the information acquired becomes more granular with increasing network depth, this fine-grained information is mainly limited to local regions, with limited ability to capture global information. This limitation is particularly evident when dealing with complex scenes or small object detection, as these situations require the network to comprehensively consider both global and local information to achieve more accurate object detection. However, from the three images B, D, and F on the right, it can be seen that by introducing feature enhancement and attention mechanisms, the object detection network combined with multi-scale feature fusion can effectively improve the quality of the feature maps. By fusing feature maps from different depths, the high semantic information of the deep network is preserved, while the high-resolution detail information of the shallow network is combined, thus achieving a finer-grained feature representation. In particular, after introducing the attention mechanism, the network can adaptively focus on more important feature regions, further enhancing the model's ability to detect objects.
[0035] Meanwhile, B, D, and F demonstrate the combination of multi-scale feature fusion and attention mechanisms, which makes the feature information of small targets more prominent, thereby improving the target detection network's ability to detect small targets. Small targets have more obvious contour and other detailed information in shallow feature maps. By fusing with deep features, not only can the expression of these detailed information be enhanced, but the overall semantic information can also be enriched, enabling the network to accurately identify and locate small targets even in complex scenes.
[0036] A loss function measures the difference or error between a model's predictions and the actual observations. The loss function is typically a non-negative real number; a smaller value indicates that the model's predictions are closer to the actual values, while a larger value indicates a greater difference between the predictions and the actual values.
[0037] Figure 4 shows the loss curves of the SSD object detection algorithm, and Figure 5 shows the loss curves of the SFEM-SSD network algorithm. The loss function used in the figures is smoothL1: train loss is the training loss of smoothL1, val loss is the validation loss of smoothL1, smooth train loss is the second-order fitted curve of train loss, and smooth val loss is the second-order fitted curve of val loss. In the figures, the horizontal axis (Epoch) represents the number of training epochs, and the vertical axis (loss) represents the loss value.
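For reference, the commonly used definition of smoothL1 is 0.5 * d^2 when the absolute error d is below 1 and d - 0.5 otherwise. The sketch below implements that standard form with a transition point of 1; the application does not state its exact variant, and the function names and sample values are illustrative only.

```python
def smooth_l1(pred, target):
    """Standard smooth L1 for one value: quadratic near zero, linear beyond 1."""
    diff = abs(pred - target)
    return 0.5 * diff * diff if diff < 1.0 else diff - 0.5


def smooth_l1_loss(preds, targets):
    """Mean smooth L1 over paired predictions and targets."""
    return sum(smooth_l1(p, t) for p, t in zip(preds, targets)) / len(preds)


small_error = smooth_l1(0.5, 0.0)  # quadratic branch
large_error = smooth_l1(3.0, 1.0)  # linear branch
mean_loss = smooth_l1_loss([0.5, 3.0], [0.0, 1.0])
print(small_error, large_error, mean_loss)
```

The linear branch keeps large localization errors from dominating the gradient, while the quadratic branch gives a smooth minimum near zero error.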
[0038] As shown in Figures 4 and 5, the addition of a feature fusion strategy enables SFEM-SSD to extract features at different scales. The model can simultaneously consider both the details and overall information of the input image, thus better capturing the diverse features of the target and causing the loss function to decrease more quickly. Furthermore, the SFEM module can enhance the original features, making the feature information more discriminative and representative. This helps the model better learn the key features of the target, resulting in a better fit.
[0039] Figure 6 compares the training loss of SSD with that of the multi-scale feature fusion network SFEM-SSD in this application, and Figure 7 compares the validation loss of SSD with that of SFEM-SSD. Both the training and validation losses of the multi-scale feature fusion object detection network are lower than those of the SSD object detection algorithm, indicating that the multi-scale feature fusion object detection network detects targets better and that the improved model is more robust.
[0040] Figure 8 shows three real underwater images: A, B, and C. Image A depicts a school of fish with small underwater targets; image B shows a school of fish of varying sizes; and image C shows a school of fish against a complex underwater background. Each image was tested with SSD (VGG), SSD (Mobilenetv2), YOLOv3, and the proposed SFEM-SSD target detection algorithm, giving results numbered 1 to 4 in that order. For small targets, A1, A2, and A3 all exhibited missed detections and false positives. When large and small targets were mixed, B1, B2, and B3 showed severe missed detections and false positives for small targets. In the complex background, the SSD (VGG) and YOLOv3 results (C1 and C3) were similar, and the SSD (Mobilenetv2) result (C2) showed the same missed detections. The SFEM-SSD detection results A4, B4, and C4 show that SFEM-SSD has advantages in detecting small targets and targets in complex backgrounds.
[0041] The evaluation indicators are compared in Table 1 below.
[0042] Table 1. Comparison of Precision, Recall, and mAP between SFEM-SSD Object Detection Network and Different Object Detection Networks
Experimental comparisons show that the multi-scale feature fusion deep learning object detection network model using the SFEM module and feature fusion strategy can more accurately identify targets in complex underwater scenes. By contrast, the original SSD object detection network exhibited multiple false detections and missed detections in complex scenes. The proposed deep learning object detection network achieves accuracy improvements of 2.27%, 3.13%, and 4.99% over the SSD (VGG), SSD (Mobilenetv2), and YOLOv3 object detection networks, respectively. The comparison of precision, recall, and mAP across the algorithms shows that SFEM-SSD basically meets the requirements for underwater target detection.
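As a reminder of how the compared indicators are defined, the sketch below computes precision and recall from detection counts and mAP as the mean of per-class average precision. All counts, class names, and AP values are hypothetical placeholders; they are not the values of Table 1.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def mean_average_precision(ap_per_class):
    """mAP is the mean of the per-class average precision values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Hypothetical counts: 8 correct detections, 2 false positives, 2 misses.
precision, recall = precision_recall(tp=8, fp=2, fn=2)

# Hypothetical per-class AP values for two made-up classes.
m_ap = mean_average_precision({"class_a": 0.5, "class_b": 1.0})
print(precision, recall, m_ap)
```

False positives lower precision and missed detections lower recall, so the missed and false detections discussed for Figure 8 affect different columns of such a comparison.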
[0043] In the technical solution of this application, the VGG16 network structure is used as the base network to achieve initial feature extraction. Recognizing that shallow-level features in neural networks are more sensitive to smaller objects, while deep-level features contain better semantic information, this method constructs an SFEM module based on the SE attention mechanism, combined with an enhanced feature extraction module, to improve feature extraction. This solves the channel attention problem in multi-scale feature fusion, reduces the loss of contextual information in feature maps in deep networks, and effectively improves the accuracy of underwater organism recognition at different scales. To fully preserve the detailed features in the original underwater images, this method employs a feature fusion strategy and integrates the SFEM module to fully extract features at different levels and scales, reducing information loss during feature propagation and improving network performance. Experimental results show that, compared with the other underwater target detection methods tested, the multi-scale target detection model based on feature enhancement and the pixel inversion dehazing algorithm proposed in this application achieves better average detection accuracy on underwater target detection datasets.
Claims
1. A multi-scale target detection method based on feature enhancement and pixel inversion dehazing, characterized in that it includes the following steps:
S1: Construct a feature enhancement module; the feature enhancement module is constructed by embedding the SE attention mechanism module into a specified convolutional layer;
S2: Construct a multi-scale object detection model; the multi-scale object detection model includes a preprocessing module, a backbone network, and an output layer connected in sequence; the backbone network uses VGG16 as the base network and incorporates the feature enhancement module SFEM; the backbone network includes convolutional layers 1-5, a first fully connected FC layer, a second fully connected FC layer, and convolutional layers 6-9 connected in sequence; prediction modules of different scales are set after convolutional layer 4, the second fully connected FC layer, and convolutional layers 6-9; a feature enhancement module is embedded in convolutional layer 4, the second fully connected FC layer, and convolutional layer 6; the output of convolutional layer 4 is processed by the feature enhancement module and then fed into the prediction module corresponding to this layer; the outputs of convolutional layer 4 and the second fully connected FC layer are each processed by feature enhancement, then convolved, and then processed by the feature enhancement module, and the result is denoted the secondary enhanced feature; the secondary enhanced feature is fed into the prediction module corresponding to the second fully connected FC layer; the output of convolutional layer 6 is processed by feature enhancement, then convolved with the secondary enhanced feature, then processed by the feature enhancement module, and finally fed into the prediction module corresponding to convolutional layer 6; the feature maps output by all prediction modules are superimposed to obtain the final output result;
S3: Construct a training dataset and a validation dataset based on historical data, and train the multi-scale object detection model on the training dataset to obtain a trained multi-scale object detection model;
S4: Recognize the image to be identified using the trained multi-scale object detection model.
2. The multi-scale target detection method based on feature enhancement and pixel inversion dehazing according to claim 1, characterized in that: The feature enhancement module includes an SE attention mechanism module, an enhanced feature extraction module, and an add operation connected in sequence. The SE attention mechanism module is added before the convolution operation of a specified channel in the convolutional network layer to be processed, and extracts attention weights from the output feature map of the specified channel. The channel attention weights extracted by the SE attention mechanism module are applied to the feature map by the scale (multiplication) operation, and the result is fed into the enhanced feature extraction module for enhanced feature extraction. Finally, the output feature map of the enhanced feature extraction module is connected to the input feature map of the next layer channel by the add operation, and the channel size of the final output feature map is adjusted to the channel size of the next layer.
3. The multi-scale target detection method based on feature enhancement and pixel inversion dehazing according to claim 2, characterized in that: The enhanced feature extraction module includes: a convolution operation with a 1*1 kernel, a convolution operation with a 3*3 kernel and a stride of 2, and a convolution operation with a 1*1 kernel, set sequentially.
4. The multi-scale target detection method based on feature enhancement and pixel inversion dehazing according to claim 1, characterized in that: The prediction module includes a detector and a classifier.
5. The multi-scale target detection method based on feature enhancement and pixel inversion dehazing according to claim 1, characterized in that: The preprocessing module includes, in sequence: an image inversion operation, dark channel calculation, atmospheric light intensity estimation, transmittance estimation, transmittance optimization, and a second image inversion operation.
6. The multi-scale target detection method based on feature enhancement and pixel inversion dehazing according to claim 1, characterized in that: The numbers of channels in convolutional layers 1 to 5 are set to 64, 128, 256, 512, and 512, respectively.
7. The multi-scale target detection method based on feature enhancement and pixel inversion dehazing according to claim 1, characterized in that: The numbers of channels in convolutional layers 6 to 9 are set to 512, 256, 256, and 256, respectively.
8. The multi-scale target detection method based on feature enhancement and pixel inversion dehazing according to claim 1, characterized in that: The prediction modules correspond to the following scales: 38*38, 19*19, 10*10, 5*5, 3*3, and 1*1.
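By way of non-limiting illustration, the six scales recited in claim 8 can be reproduced with simple output-size arithmetic. The sketch below rests on assumptions that are not stated in this application: a 300*300 input image, size-preserving 3*3 convolutions inside the VGG16 blocks, 2*2 stride-2 pooling (with upward rounding at the third pooling layer), and the kernel, stride, and padding values of a conventional SSD layout for convolutional layers 6 to 9. The helper names `conv_out` and `pool_out` are hypothetical.

```python
import math

def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a square convolution (floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2, ceil_mode=False):
    """Output side length of a square pooling layer without padding."""
    span = (size - kernel) / stride
    return (math.ceil(span) if ceil_mode else math.floor(span)) + 1

size = 300                              # assumed input resolution
size = pool_out(size)                   # after convolutional layer 1: 150
size = pool_out(size)                   # after convolutional layer 2: 75
size = pool_out(size, ceil_mode=True)   # after convolutional layer 3: 38
scales = [size]                         # convolutional layer 4 predicts here
size = pool_out(size)                   # after convolutional layer 4: 19
scales.append(size)                     # layer 5 and FC layers assumed to keep 19

# Assumed (kernel, stride, padding) for convolutional layers 6 to 9.
extra_layers = [(3, 2, 1), (3, 2, 1), (3, 1, 0), (3, 1, 0)]
for kernel, stride, padding in extra_layers:
    size = conv_out(size, kernel, stride, padding)
    scales.append(size)
```

Under these assumptions `scales` holds one side length per prediction module, ordered from the shallowest layer to the deepest. A different input resolution or padding scheme would yield different values.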
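By way of non-limiting illustration, the preprocessing sequence recited in claim 5 can be sketched in pure Python on a small image stored as nested lists with values in [0, 1]. The sketch assumes the widely used dark channel prior with the haze model I = J*t + A*(1 - t). The scene recovery step before the final inversion, the use of a simple lower bound as the transmittance optimization, and all function names and parameter values (`omega`, `t_min`, `radius`) are assumptions made for illustration and are not taken from this application.

```python
def invert(img):
    """Pixel inversion of an H x W x 3 image with values in [0, 1]."""
    return [[[1.0 - c for c in px] for px in row] for row in img]

def dark_channel(img, radius=1):
    """Minimum over colour channels, then minimum over a square window."""
    h, w = len(img), len(img[0])
    pixel_min = [[min(px) for px in row] for row in img]
    return [
        [
            min(
                pixel_min[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            )
            for x in range(w)
        ]
        for y in range(h)
    ]

def estimate_atmospheric_light(img, dark):
    """Colour of the pixel with the largest dark-channel value (simplified)."""
    h, w = len(img), len(img[0])
    y, x = max(
        ((j, i) for j in range(h) for i in range(w)),
        key=lambda p: dark[p[0]][p[1]],
    )
    return list(img[y][x])

def estimate_transmission(img, light, omega=0.95, radius=1):
    """t = 1 - omega * dark_channel(I / A)."""
    normalized = [
        [[c / max(a, 1e-6) for c, a in zip(px, light)] for px in row]
        for row in img
    ]
    return [[1.0 - omega * d for d in row] for row in dark_channel(normalized, radius)]

def refine_transmission(trans, t_min=0.1):
    """Stand-in for transmittance optimization: clamp to [t_min, 1]."""
    return [[min(1.0, max(t_min, v)) for v in row] for row in trans]

def recover(img, light, trans):
    """Invert the haze model, J = (I - A) / t + A, clipped to [0, 1]."""
    return [
        [
            [min(1.0, max(0.0, (c - a) / t + a)) for c, a in zip(px, light)]
            for px, t in zip(row, trow)
        ]
        for row, trow in zip(img, trans)
    ]

def dehaze_by_inversion(img):
    inverted = invert(img)                               # image inversion
    dark = dark_channel(inverted)                        # dark channel
    light = estimate_atmospheric_light(inverted, dark)   # atmospheric light
    trans = estimate_transmission(inverted, light)       # transmittance
    trans = refine_transmission(trans)                   # optimization
    return invert(recover(inverted, light, trans))       # second inversion

demo_image = [
    [[0.10, 0.30, 0.40], [0.20, 0.35, 0.45]],
    [[0.05, 0.25, 0.50], [0.15, 0.30, 0.55]],
]
demo_result = dehaze_by_inversion(demo_image)
```

A practical implementation would usually replace the clamp in `refine_transmission` with an edge-preserving filter and would estimate the atmospheric light from a small fraction of the brightest dark-channel pixels rather than from a single pixel.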