Mine small target detection method based on deformable convolution and residual structure
By introducing a multi-scale feature fusion and attention mechanism, a small target detection method for coal mines has been developed, which solves the problems of missed detection and false detection of small targets in underground coal mines, achieving higher detection accuracy and robustness, and is suitable for small target detection in complex environments.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHINA UNIV OF MINING & TECH
- Filing Date
- 2025-04-27
- Publication Date
- 2026-06-19
AI Technical Summary
Existing small target detection algorithms suffer from missed detections, false detections, and low detection accuracy in underground coal mine environments. They struggle in particular to detect small targets such as miners' safety helmets in complex environments and under poor lighting.
A small target detection method for mines based on deformable convolution and residual structure is adopted. It introduces multi-scale feature fusion and attention mechanisms, comprising the MLCA attention mechanism, a small target feature extraction network module, a feature fusion enhancement network module, and a target classification and bounding box prediction network, and it uses the ShapeIOU loss function to optimize detection accuracy.
It improves the accuracy and robustness of small target detection, enabling better detection of small targets in complex environments, especially miners' safety helmets, thereby enhancing detection accuracy and the model's adaptability.
Figure CN120451656B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of underground small target detection technology, and particularly relates to a method for detecting small targets in mines based on deformable convolution and residual structure.
Background Technology
[0002] With the advent of the era of smart cities and smart industries, intelligent target detection algorithms have been gradually applied in many industries such as industrial inspection, behavior recognition, fire safety, traffic management, and biomedicine. As application scenarios continue to expand, the performance requirements for target detection models are constantly increasing, especially in the detection of small targets, where significant challenges remain.
[0003] Currently, object detection algorithms are mainly divided into two categories: single-stage and two-stage object detection algorithms. Single-stage object detection algorithms typically have advantages such as high speed, ease of deployment, and high real-time performance. Classic single-stage object detection algorithms mainly include the YOLO (You Only Look Once) series and the SSD (Single Shot Detector) series. Two-stage object detection algorithms have two stages: candidate box generation, followed by classification and refinement of the candidates. They can achieve higher accuracy in object localization and classification and are better at handling multi-scale objects. However, two-stage methods often have disadvantages such as high complexity, difficult deployment, and unsuitability for applications with strict real-time requirements. Classic two-stage object detection algorithms mainly include the R-CNN (Region with CNN feature) series and the Faster R-CNN algorithm. Given the current practical needs of industrial scenarios, which place higher demands on the real-time performance and rapid deployment of object detection, single-stage object detection algorithms are more advantageous than two-stage object detection algorithms. In recent years, with the continuous development of YOLO networks, many researchers have proposed target detection models based on YOLO networks for different scenarios. However, when detecting personnel and equipment in coal mines, these algorithms still suffer from high false detection and missed detection rates. In addition, because many small targets must be detected in complex environments and under poor lighting, the models still face significant challenges in achieving both real-time performance and high detection accuracy.
[0004] In summary, although significant progress has been made in current single-stage small target detection models, further research is still needed on the following two aspects: (1) How to effectively optimize the feature extraction network and feature fusion network in the single-stage model so that they adaptively extract detailed features of smaller targets while effectively suppressing background noise interference, and thus achieve high detection accuracy; (2) How to improve the robustness of the small target detection model in complex environments, ensuring that the model can efficiently and accurately detect small targets, especially high-contrast small targets such as miners' safety helmets, in complex scenarios such as mines.
[0005] To address these issues, a small target detection structure, YOLOv8-DPMS, is proposed. By introducing multi-scale feature fusion and attention mechanisms, the detection accuracy and robustness of small targets are further improved.
Summary of the Invention
[0006] The technical problem to be solved by this invention is to provide a small target detection method for coal mines based on deformable convolution and residual structure, which addresses the problems of missed detection, false detection and low detection accuracy of current small target detection algorithms for coal mine datasets.
[0007] To solve the above-mentioned technical problems, the present invention adopts the following technical solution:
[0008] A method for detecting small targets in mines based on deformable convolution and residual structures specifically includes:
[0009] (1) Image preprocessing module: preprocesses the acquired raw image containing the target;
[0010] (2) Small target feature extraction network module based on MLCA attention mechanism: This module focuses on channel, spatial and positional information during the small target feature extraction process, suppresses background noise interference, and enhances the expressive power of the extracted features. The preprocessed image is input into this module to suppress background noise interference in the small target image and extract features with high representational power. Then, features from different layers are input into the feature enhancement and fusion module.
[0011] (3) Small target feature fusion enhancement network module based on deformable convolution and residual structure: The different scale features output by the C2f module and the spatial pyramid pooling module in the feature extraction network are input into the feature fusion enhancement network, so that the shallow and deep features in the feature extraction network are fused and enhanced. While learning high-order semantic features, the rich detailed information represented by low-order features is retained, thereby effectively improving the accuracy of small target detection.
[0012] (4) Target classification and bounding box prediction network: Based on the feature fusion enhancement network module, the output of the four MLCA modules is enhanced, enabling the prediction network to focus on important features and form four target prediction branches. Different anchor boxes are used to predict targets of different sizes. Each anchor box corresponds to a target of a specific size. During training, in order to accelerate the convergence speed of the target detection model and improve the regression accuracy of the target prediction box, the ShapeIOU loss function is used to optimize the original loss function of the network to improve the detection accuracy.
[0013] Compared with the prior art, the present invention, employing the above technical solution, has the following technical effects:
[0014] This invention presents a method for detecting small targets in mines based on deformable convolution and mixed local channel attention. By introducing multi-scale feature fusion and attention mechanisms, it further improves the detection accuracy and robustness of small targets. Specifically, it includes a backbone network based on the MLCA attention mechanism, which is constructed using two-dimensional convolution, C2f modules, spatial pyramid pooling modules, and the MLCA attention mechanism module. This integrates attention to channel, spatial, and positional information during the small target feature extraction process, suppressing background noise interference and enhancing the expressive power of the extracted features. A small target feature fusion enhancement module, DPC-Block, based on deformable convolution and residual structure, is proposed. This module fuses and enhances shallow and deep features in the feature extraction network, retaining rich details represented by low-order features while learning high-order semantic features, thus improving detection accuracy. Finally, a target classification and bounding box prediction module is included. To accelerate the convergence speed of the target detection model and improve the regression accuracy of the target prediction box, the ShapeIoU loss function is used, enabling YOLOv8-DPMS to better classify and locate targets from the fused feature map obtained by the feature fusion module.
Attached Figure Description
[0015] Figure 1 is a schematic diagram of the sampling positions of standard convolution and deformable convolution of the present invention;
[0016] Figure 2 is a schematic diagram of the 3×3 deformable convolution of the present invention;
[0017] Figure 3 is a comparison of the Bottleneck module and the DPC-Block module of the present invention;
[0018] Figure 4 is a schematic diagram of the Mixed Local Channel Attention (MLCA) algorithm of the present invention;
[0019] Figure 5 is a diagram of the ShapeIOU loss structure of the present invention;
[0020] Figure 6 is the YOLOv8-DPMS algorithm structure of the present invention;
[0021] Figure 7 is a diagram of the internal structure of the Conv module of the present invention;
[0022] Figure 8 is a diagram of the internal structure of the C2f module of the present invention;
[0023] Figure 9 is a diagram of the internal structure of the MLCA module of the present invention;
[0024] Figure 10 is a diagram of the internal structure of the SPPF module of the present invention;
[0025] Figure 11 is the small target feature extraction network based on the MLCA attention mechanism of the present invention;
[0026] Figure 12 is a structural diagram of the CDPC module of the present invention;
[0027] Figure 13 is the feature fusion enhancement network structure based on DPC-Block of the present invention;
[0028] Figure 14 is a diagram of the internal structure of the improved detection head of the present invention;
[0029] Figure 15 is the detection layer network structure of the present invention;
[0030] Figure 16 is a schematic diagram of part of the Coal-H dataset of the present invention;
[0031] Figure 17 shows the per-category mAP@0.5 results of the ablation experiments of the present invention.
Detailed Implementation
[0032] The technical solution of the present invention will be further described in detail below with reference to the accompanying drawings:
[0033] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of the present invention. The present invention will be described in detail below with reference to the accompanying drawings and preferred embodiments. The purpose and effects of the present invention will become clearer. It should be understood that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit the present invention.
[0034] This invention proposes a small target detection structure, YOLOv8-DPMS, which further improves the detection accuracy and robustness of small targets by introducing multi-scale feature fusion and attention mechanisms. YOLOv8-DPMS is a target detection model based on YOLOv8, combined with the DPMS (Dynamic Position-aware Multi-scale) mechanism, aiming to improve the accuracy and efficiency of target detection. The main contributions of this method are as follows:
[0035] (1) A backbone network based on the MLCA attention mechanism is constructed based on two-dimensional convolution, C2f module, spatial pyramid pooling module and MLCA attention mechanism module. This enables the integration of attention to channel, spatial and location information during the small target feature extraction process, suppressing the interference of background noise and enhancing the expressive power of the extracted features.
[0036] (2) A small target feature fusion enhancement module DPC-Block based on deformable convolution and residual structure is proposed, which enables the fusion enhancement of shallow and deep features in the feature extraction network, and retains the rich detailed information represented by low-order features while learning high-order semantic features, thereby improving detection accuracy.
[0037] (3) Target classification and bounding box prediction module. To accelerate the convergence speed of the target detection model and improve the regression accuracy of the target prediction box, the ShapeIoU loss function is used, enabling YOLOv8-DPMS to better perform target classification and localization from the fused feature map obtained by the feature fusion module.
[0038] The remaining parts of this invention are organized as follows: The principles of DPC-Block, MLCA attention mechanism module, and ShapeIOU are described in detail below. The proposed YOLOv8-DPMS small object detection algorithm is highlighted. Object detection experiments are conducted on our own dataset, and the experimental results are analyzed.
[0039] Theoretical Background
[0040] DPC-Block: The Bottleneck residual module in the feature fusion network is mainly used to solve problems such as vanishing gradients, loss of feature information, and high computational complexity in deep networks. In deep learning models, as the number of network layers increases, gradients may gradually vanish or explode during backward propagation, leading to training difficulties. The Bottleneck structure, through residual connections, allows gradients to propagate directly from later layers to earlier layers, effectively alleviating the vanishing gradient problem. This skip connection allows the network to learn residual mappings during training, rather than directly learning the complete input-to-output mapping, making the optimization process more efficient and improving training stability. The feature maps of the Bottleneck residual module are typically processed through a series of convolutions, normalization, and activation operations. Each Bottleneck residual module contains two 3x3 convolutional layers, which transform the input feature maps to extract higher-level feature representations. However, this can lead to decreased model accuracy for objects with irregular scale variations or large scale differences.
[0041] Deformable convolution is a technique that extends traditional convolution operations, aiming to enhance the performance of convolutional neural networks (CNNs) in tasks involving complex shapes and spatial deformations. Traditional convolution uses a fixed-size kernel that samples the same regular grid at every location. However, objects or features in an image may have different shapes, scales, or locations, and a fixed convolution window may not capture these deformable features effectively. Deformable convolution adds 2D offsets to the regular grid sampling locations of standard convolution, allowing the sampling grid to deform freely. The offsets are learned, so the sampling positions of the convolution kernel are adjusted dynamically instead of being fixed, as shown in Figure 1.
[0042] The offsets of deformable convolution are produced by a separate network branch. This branch convolves the input feature map to generate a 2N-channel offset feature map, that is, one offset per spatial axis for each of the N kernel sampling points. Because the offsets are typically floating-point numbers, the feature map is sampled at the shifted positions by bilinear interpolation, which also allows the offsets to be optimized and updated by backpropagation. The input feature map is then convolved at the shifted sampling positions to obtain the output feature map. The specific process is shown in Figure 2.
[0043] The formula for the feature value output of deformable convolution kernel sampling is:
[0044] $$y(p) = \sum_{k=1}^{K} w_k \cdot x(p + p_k + \Delta p_k) \cdot \Delta m_k \tag{1}$$
[0045] Where $x$ is the input feature map, $K$ is the number of sampling points of the kernel, $w_k$ represents the weight at position $k$, $p_k$ represents the pre-specified offset of the $k$-th sampling point, and $p$ is the center position of the sampling point on the feature map. $\Delta p_k$ is the learned offset relative to the center position $p$; when it is 0, the operation reduces to a standard convolution kernel. $\Delta m_k$ represents the learnable weights, ranging from 0 to 1. For sampling points that do not require adjustment, the weights are set to 0, making the deformation of the convolution kernel more flexible.
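As a minimal sketch of the sampling rule described above, the following pure-Python functions evaluate one output location of a 3×3 kernel on a single-channel feature map. The function names, the treatment of samples outside the map as 0, and the hand-written inputs are assumptions for illustration; in the method itself the offsets and the 0-to-1 weights come from learned convolution branches.

```python
import math


def bilinear(fmap, py, px):
    """Bilinearly interpolate a 2D list at fractional (py, px); cells outside the map count as 0."""
    h, w = len(fmap), len(fmap[0])
    y0, x0 = math.floor(py), math.floor(px)
    val = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(py - yy)) * (1 - abs(px - xx))
                val += weight * fmap[yy][xx]
    return val


def deform_conv_at(fmap, weights, p, offsets, masks):
    """Value of a 3x3 modulated deformable convolution at one centre p = (row, col).

    weights: 9 kernel weights, offsets: 9 (dy, dx) learned shifts, masks: 9 values in [0, 1].
    """
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # fixed grid offsets
    out = 0.0
    for k, (dy, dx) in enumerate(grid):
        ody, odx = offsets[k]  # learned fractional offset for sampling point k
        sample = bilinear(fmap, p[0] + dy + ody, p[1] + dx + odx)
        out += weights[k] * sample * masks[k]
    return out
```

With all offsets at zero and all masks at 1 the function reduces to an ordinary 3×3 convolution at `p`, which is a convenient sanity check.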
[0046] Therefore, this invention proposes a feature fusion module, DPC-Block, which uses deformable convolution to enhance multi-scale feature fusion and capture complex geometric deformations. A comparison between the common Bottleneck module and the proposed DPC-Block module is shown in Figure 3.
[0047] DPC-Block employs a multi-level feature extraction and fusion strategy to enhance the model's ability to perceive objects with complex geometric deformations. First, a set of 3×3 deformable convolutional modules is introduced. Through an offset learning mechanism, the model's adaptability to non-rigid deformations is enhanced, enabling it to more effectively model the relationships between objects with complex geometric structures, thereby improving its ability to capture deformable targets. Next, a set of 1×1 basic convolutional modules is added to adjust low-level channel information, reduce computational complexity, and maintain effective feature representation.
[0048] Furthermore, to enhance the fusion capability of multi-scale information, DPC-Block uses Concat to connect feature maps of different scales, fully integrating low-level local details with high-level semantic information. This enriches target details while reducing spatial information loss, improving the accuracy of target localization. This design not only effectively enhances the fusion capability of multi-scale features but also improves the model's accuracy in capturing targets with complex geometric deformations, thereby further improving the performance of target detection tasks.
[0049] MLCA attention mechanism (Mixed Local Channel Attention):
[0050] In practical image object detection tasks, the size and shape of the same object vary with the camera's position, which reduces the accuracy of object detection. Furthermore, in real-world scenes objects occlude one another and are occluded by unrelated objects, and small objects face even greater challenges. Introducing an attention mechanism into the network can significantly reduce the negative impact of these issues on detection results and further improve model performance. To effectively model features at different levels, many researchers have proposed variations of attention mechanisms, including channel attention and spatial attention. However, most channel attention mechanisms only include channel feature information and ignore spatial feature information, leading to poor model representation or object detection performance, while spatial attention modules are often complex and costly. Therefore, this invention uses a lightweight attention mechanism, MLCA (Mixed Local Channel Attention), which enhances the detection capability for small objects by combining local spatial information and channel features. A schematic diagram of the MLCA algorithm is shown in Figure 4.
[0051] As shown in Figure 4, the working principle of the MLCA attention mechanism is as follows. First, the input image is processed through a convolutional layer to extract features, resulting in a feature map. In the figure, Conv1d represents a one-dimensional convolution operation, and the kernel size k depends on the channel dimension C. This setting means that in capturing local cross-channel interactions, the focus is only on the relationship between each channel and its k neighboring channels. The choice of k is expressed by formula (2):
[0052] $$k = \Phi(C) = \left| \frac{\log_2 C}{\gamma} + \frac{b}{\gamma} \right|_{\text{odd}} \tag{2}$$
[0053] Where C represents the number of channels, k represents the kernel size, and γ and b are hyperparameters, both preset to 2. k must be an odd number; if the computed value is even, it is increased by 1.
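The kernel-size rule can be sketched in a few lines. This is an illustration only: it assumes the ECA-style form in which k grows with log2(C) scaled by γ and shifted by b, and it assumes the fractional result is truncated to an integer before the odd-number adjustment; the function name is hypothetical.

```python
import math


def mlca_kernel_size(channels, gamma=2, b=2):
    """Adaptive Conv1d kernel size for a given channel count (assumed ECA-style rule)."""
    t = int(abs(math.log2(channels) + b) / gamma)  # truncate to an integer (assumption)
    return t if t % 2 == 1 else t + 1  # an even value is increased by 1 to make k odd


for c in (16, 64, 256, 1024):
    print(c, mlca_kernel_size(c))
```

Under these assumptions the kernel stays small (3 to 7 for typical backbone widths), so each channel interacts only with a handful of neighboring channels.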
[0054] For each feature map location, a local receptive field is set to define a local region. For each local region, its location features are calculated, including local center point and boundary features. The local center point feature represents the center point of the region, while the boundary features represent the relative position of the region to the overall image boundary. For each local region, its channel features are calculated to capture information from different channels. Two parallel fully connected layers are used to process the location features and channel features respectively. The correlation between the location features and channel features is learned through the fully connected layers, resulting in location attention weights and channel attention weights. The location attention weights and channel attention weights are multiplied to obtain the final hybrid local channel attention weights. Then, this weight is multiplied with the feature map to fuse features from different channels and locations. Finally, the fused feature map is input into subsequent network layers for further processing and classification tasks.
[0055] By using the mixed local channel attention mechanism, the model can focus more accurately on the importance of different channels in an image and extract more discriminative features, thereby improving the performance of small object detection.
[0056] ShapeIOU: Most mainstream object detection algorithms currently use CIoU as the loss function. This loss function provides accurate target localization and considers target integrity, but it has certain drawbacks for small and irregular targets. The YOLOv8-DPMS algorithm introduces Shape-IoU, which takes the shape and scale of the bounding box into account and thus provides a more accurate measurement. The Shape-IoU loss structure is shown in Figure 5.
[0057] ShapeIoU focuses on the shape and size information of the bounding box itself, incorporating this information into the IoU loss function. The calculation formula is as follows:
[0058]
[0059]
[0060]
[0061]
[0062]
[0063]
[0064] Here, `scale` is a scaling factor, related to the scale of the targets in the dataset. `ww` and `hh` are the weight coefficients in the horizontal and vertical directions, respectively, and their values depend on the shape of the GT box. The regression loss for the corresponding bounding box is as follows:
[0065] $$L_{\text{Shape-IoU}} = 1 - \text{IoU} + \text{distance}^{\text{shape}} + 0.5 \times \Omega^{\text{shape}} \tag{9}$$
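The terms that enter formula (9) can be written out as follows. This is a sketch that assumes they follow the published Shape-IoU formulation; the symbols $B$, $B^{gt}$, $(x_c, y_c)$, $c$, $\omega_w$, $\omega_h$ and $\theta$ are taken from that formulation rather than from the text above, and equation numbers are omitted because the correspondence is assumed. Here $B$ and $B^{gt}$ are the predicted and GT boxes, $(x_c, y_c)$ and $(x_c^{gt}, y_c^{gt})$ their centers, and $c$ the diagonal of the smallest box enclosing both.

```latex
\begin{aligned}
\text{IoU} &= \frac{|B \cap B^{gt}|}{|B \cup B^{gt}|} \\
ww &= \frac{2 \times (w^{gt})^{scale}}{(w^{gt})^{scale} + (h^{gt})^{scale}}, \qquad
hh = \frac{2 \times (h^{gt})^{scale}}{(w^{gt})^{scale} + (h^{gt})^{scale}} \\
\text{distance}^{\text{shape}} &= hh \times \frac{(x_c - x_c^{gt})^2}{c^2} + ww \times \frac{(y_c - y_c^{gt})^2}{c^2} \\
\Omega^{\text{shape}} &= \sum_{t \in \{w, h\}} \left(1 - e^{-\omega_t}\right)^{\theta}, \qquad \theta = 4 \\
\omega_w &= hh \times \frac{|w - w^{gt}|}{\max(w, w^{gt})}, \qquad
\omega_h = ww \times \frac{|h - h^{gt}|}{\max(h, h^{gt})}
\end{aligned}
```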
[0066] By taking into account the target's outline, ShapeIoU can effectively improve the detection accuracy of small targets and reduce detection errors caused by inaccurate bounding.
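A minimal sketch of the regression loss in formula (9) follows. It assumes the intermediate terms use the published Shape-IoU definitions: shape weights `ww` and `hh` computed from the GT box, a center distance normalized by the squared diagonal of the enclosing box, and a shape cost with exponent 4. Boxes are given as (center x, center y, width, height); the function name, the `eps` stabilizer and the default `scale=0.0` are assumptions for illustration.

```python
import math


def shape_iou_loss(box, gt, scale=0.0, theta=4, eps=1e-7):
    """Shape-IoU style regression loss for one predicted box and one GT box (xc, yc, w, h)."""
    x, y, w, h = box
    xg, yg, wg, hg = gt
    # Corner coordinates (x1, y1, x2, y2) of both boxes.
    b1 = (x - w / 2, y - h / 2, x + w / 2, y + h / 2)
    b2 = (xg - wg / 2, yg - hg / 2, xg + wg / 2, yg + hg / 2)
    # Plain IoU.
    iw = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    ih = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = iw * ih
    iou = inter / (w * h + wg * hg - inter + eps)
    # Shape weights derived from the GT box; scale controls how strongly shape matters.
    ww = 2 * wg ** scale / (wg ** scale + hg ** scale)
    hh = 2 * hg ** scale / (wg ** scale + hg ** scale)
    # Squared diagonal of the smallest enclosing box.
    cw = max(b1[2], b2[2]) - min(b1[0], b2[0])
    ch = max(b1[3], b2[3]) - min(b1[1], b2[1])
    c2 = cw ** 2 + ch ** 2 + eps
    distance = hh * (x - xg) ** 2 / c2 + ww * (y - yg) ** 2 / c2
    # Shape cost from the relative width and height mismatch.
    omega_w = hh * abs(w - wg) / max(w, wg)
    omega_h = ww * abs(h - hg) / max(h, hg)
    shape_cost = (1 - math.exp(-omega_w)) ** theta + (1 - math.exp(-omega_h)) ** theta
    return 1 - iou + distance + 0.5 * shape_cost
```

Unlike plain IoU, the distance term keeps the loss informative when the boxes do not overlap at all, which matters for small targets where a few pixels of error can remove the overlap entirely.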
[0067] The proposed YOLOv8-DPMS algorithm: YOLOv8-DPMS algorithm framework
[0068] Figure 6 shows the YOLOv8-DPMS algorithm structure of the present invention. To address the missed detections, false detections, and low detection accuracy of current small object detection algorithms on underground coal mine datasets, this invention focuses on feature extraction based on convolutional neural networks, attention mechanism embedding, and multi-scale feature fusion, and proposes a new small object detection algorithm, YOLOv8-DPMS. This algorithm comprises four modules:
[0069] (1) Image preprocessing module. This module preprocesses the acquired original images containing the target by using the Mosaic data augmentation method. The Mosaic method takes four images and splices them into one through random scaling, cropping and arrangement. This can improve the training speed of the model while expanding the dataset and reducing memory requirements.
[0070] (2) Small target feature extraction network module based on MLCA attention mechanism. As shown in Figure 6, the network consists of 5 Conv modules, 4 C2f modules, 4 MLCA modules, and 1 Spatial Pyramid Pooling (SPPF) module. This allows for attention to channel, spatial, and location information during small target feature extraction, suppressing background noise interference and enhancing the expressive power of the extracted features. The preprocessed image is input into this module to suppress background noise interference in small target images and extract features with high representational power. Then, features from different layers are input into the feature enhancement and fusion module.
[0071] (3) Small target feature fusion enhancement network module based on deformable convolution and residual structure. This network consists of 2 upsampling modules, 6 CDPC modules, and 3 Conv modules. Features of different scales output from the layers of the C2f and SPPF modules in the feature extraction network are input into the feature fusion enhancement network, which enables the fusion and enhancement of shallow and deep features in the feature extraction network. While learning high-order semantic features, it retains the rich detailed information represented by low-order features, thereby effectively improving the accuracy of small target detection.
[0072] (4) Target Classification and Bounding Box Prediction Network. This network is based on the output of the four MLCA modules of the Feature Fusion Enhancement Network Module, which enables the prediction network to focus on important features and form four target prediction branches. Different anchor boxes are used to predict targets of different sizes. Each anchor box corresponds to a target of a specific size. During training, in order to accelerate the convergence speed of the target detection model and improve the regression accuracy of the target prediction boxes, the ShapeIOU loss function is used to optimize the original loss function of the network to improve the detection accuracy.
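The Mosaic preprocessing in module (1) can be sketched as follows. This is a simplified illustration, assuming four single-channel images stored as nested lists, each at least as large as the output canvas; it performs the random cropping and arrangement but omits random scaling and the remapping of bounding box labels, and the function name `mosaic` and its parameters are hypothetical.

```python
import random


def mosaic(images, out_size, rng=None):
    """Splice four images into one out_size x out_size canvas around a random split point."""
    rng = rng or random.Random()
    assert len(images) == 4
    # Random split point, kept away from the border so that every tile is non-empty.
    cx = rng.randint(out_size // 4, 3 * out_size // 4)
    cy = rng.randint(out_size // 4, 3 * out_size // 4)
    order = list(images)
    rng.shuffle(order)  # random arrangement of the four tiles
    canvas = [[0] * out_size for _ in range(out_size)]
    # (x0, y0, x1, y1) for the top-left, top-right, bottom-left and bottom-right tiles.
    regions = [(0, 0, cx, cy), (cx, 0, out_size, cy),
               (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x0, y0, x1, y1) in zip(order, regions):
        h, w = y1 - y0, x1 - x0
        # Random crop of the tile size taken from the source image.
        oy = rng.randint(0, len(img) - h)
        ox = rng.randint(0, len(img[0]) - w)
        for r in range(h):
            canvas[y0 + r][x0:x1] = img[oy + r][ox:ox + w]
    return canvas, (cx, cy)
```

Because each training sample now contains content from four source images at reduced size, small targets appear more often per batch, which is one reason this augmentation suits small target detection.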
[0073] Feature extraction network based on MLCA attention mechanism:
[0074] As shown in Figure 6, the small target feature extraction network based on the MLCA attention mechanism consists of 5 Conv modules, 4 C2f modules, 4 MLCA modules, and 1 Spatial Pyramid Pooling (SPPF) module. The internal structures of the Conv, C2f, MLCA, and SPPF modules are shown in Figures 7 to 10. The Conv module is a commonly used basic module, consisting of two-dimensional convolution (Conv2d), batch normalization (BN), and the SiLU activation function. It adds non-linearity to the convolution operation and, as the most basic unit, ensures that the model can capture low-level features. The internal structure of the C2f module is shown in Figure 8. During forward propagation, C2f first divides the input feature map into two parts, then performs convolution and Bottleneck processing on these two parts respectively. The processed feature maps are then recombined and passed through the final convolutional layer to generate the output, enhancing the expressive power of the feature maps and improving the model's ability to recognize complex targets. The internal structure of the MLCA module is shown in Figure 9. This module employs a multi-scale channel-aware strategy, simultaneously considering channel and spatial information. By combining local and global information, it enhances the network's feature extraction capabilities and further improves the network's accuracy in recognizing small targets. The internal structure of the SPPF module is shown in Figure 10. This module receives the feature map of the layer preceding SPPF as input and uses multiple small pooling layers in place of a single large-kernel pooling operation. The main advantage of this design is that SPPF can extract richer multi-scale features through pooling operations of different scales. By concatenating the outputs of each pooling layer, a fixed-length feature vector is generated, which ultimately enhances the network's adaptability to targets of different sizes, thereby further improving the effect of small target detection.
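The statement that several small pooling layers can stand in for a single large-kernel pooling can be checked with a small sketch. It assumes the usual SPPF arrangement of three chained 5×5 stride-1 max-pooling layers (the kernel size is not stated above and is an assumption), works on one channel stored as nested lists, and leaves out the convolutions before and after the pooling; the function names are hypothetical.

```python
def max_pool2d(img, k):
    """Stride-1 max pooling that keeps the map size; cells outside the map are ignored."""
    h, w, r = len(img), len(img[0]), k // 2
    return [[max(img[yy][xx]
                 for yy in range(max(0, y - r), min(h, y + r + 1))
                 for xx in range(max(0, x - r), min(w, x + r + 1)))
             for x in range(w)]
            for y in range(h)]


def sppf_pools(img, k=5):
    """Return the input and three chained poolings, i.e. the maps SPPF concatenates."""
    p1 = max_pool2d(img, k)
    p2 = max_pool2d(p1, k)
    p3 = max_pool2d(p2, k)
    return [img, p1, p2, p3]


# Deterministic test pattern: two chained 5x5 pools match one 9x9 pool, three match 13x13.
img = [[(7 * r + 13 * c) % 31 for c in range(12)] for r in range(12)]
_, p1, p2, p3 = sppf_pools(img)
print(p2 == max_pool2d(img, 9), p3 == max_pool2d(img, 13))
```

Chaining reuses the previous result, so the effective 9×9 and 13×13 receptive fields are obtained at the cost of three 5×5 poolings.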
[0075] Based on the Conv, C2f, MLCA, and SPPF modules, this invention constructs a small target feature extraction network based on the MLCA attention mechanism, the structure of which is as follows: Figure 11 As shown in the figure. By using this structure to extract features from images, it is possible to incorporate attention to channel, spatial, and positional information during the extraction of small target features, better suppress the interference of background noise, and enhance the expressive power of the extracted features, thus providing highly expressive features for subsequent feature enhancement and fusion tasks.
[0076] Feature fusion enhancement network based on DPC-Block: While shallow neurons in convolutional neural networks can only learn low-level features of simple object details (such as edges and textures), this limitation is significant when dealing with targets in complex environments. For example, targets may be affected by occlusion, rotation, or deformation, especially in small object detection tasks. Traditional convolutional kernels are insufficient in capturing detailed features, easily leading to the loss of important information. This lack of detail directly affects the accuracy of small object detection, making it difficult for the model to maintain robustness in complex scenes. Deformable convolutions, through dynamic sampling capabilities, effectively adapt to irregular deformations of targets and complex backgrounds, helping to capture more detailed feature information. Therefore, a feature fusion enhancement module (DPC-Block) is used to improve the accuracy and feature representation ability of small object detection.
[0077] The complete network structure is shown in Figure 13. The network consists of 6 CDPC modules, 3 upsampling modules, 3 Conv modules, and 6 Concat modules. The internal structure of the CDPC module is shown in Figure 12. Through dynamic feature fusion and cross-scale interaction, the network enhances target detection capability in complex scenes and provides strong support for handling small targets. Figure 12 is a structural diagram of the CDPC module; Figure 13 shows the feature fusion enhancement network structure based on DPC-Block.
[0078] Improvements to the detection layer and loss function: Figure 14 is a diagram of the internal structure of the improved detection head; Figure 15 shows the detection layer network structure.
[0079] To improve the accuracy and robustness of small object detection, the detection head structure and loss function were redesigned. To make the detection head focus more on the features that matter for small object detection, an MLCA attention mechanism was added before the `detect` function. The detection layer network structure is shown in Figure 15. The improved detection head, shown in Figure 14, comprises multiple convolutional layers (Conv) and one two-dimensional convolutional layer (Conv2d), divided into a bounding box regression branch and a classification branch. The regression branch employs a combined loss based on ShapeIOU and Distribution Focal Loss (DFL) to optimize bounding box localization accuracy; the classification branch uses Binary Cross-Entropy Loss (BCE) to improve classification performance. ShapeIOU addresses the vanishing-gradient problem that traditional IoU exhibits when the overlapping region is small, while DFL further refines the predicted distribution of the bounding boxes. Together these shorten the model's convergence time and provide strong support for handling small targets.
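As an illustration of the DFL term in the regression branch, the sketch below follows the generic definition of Distribution Focal Loss: each box side is predicted as a discrete distribution over bins, and the loss is a cross-entropy on the two bins that bracket the continuous target. It is a minimal sketch and not the exact combined loss of this invention; the ShapeIOU term is not reproduced, and the bin count and example logits are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dfl_loss(logits: Sequence[float], target: float) -> float:
    """Distribution Focal Loss for one box side.

    logits: unnormalised scores over the discrete bins 0..n-1.
    target: continuous distance in [0, n-1], measured in bin units.
    """
    n = len(logits)
    if not 0.0 <= target <= n - 1:
        raise ValueError("target must lie within the bin range")
    probs = softmax(logits)
    left = min(int(math.floor(target)), n - 2)  # keep left+1 inside the bin range
    right = left + 1
    w_left, w_right = right - target, target - left
    return -(w_left * math.log(probs[left]) + w_right * math.log(probs[right]))


def expected_distance(logits: Sequence[float]) -> float:
    """Decode a predicted distribution back to a single distance (its expectation)."""
    return sum(i * p for i, p in enumerate(softmax(logits)))


flat = [0.0, 0.0, 0.0, 0.0]      # uninformative prediction
peaked = [-5.0, 3.0, 3.0, -5.0]  # mass concentrated around the target 1.5
print(dfl_loss(flat, 1.5), dfl_loss(peaked, 1.5), expected_distance(peaked))
```

The loss falls as probability mass concentrates on the two bins nearest the target, which is the sense in which DFL refines the predicted distribution of each bounding box edge.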
[0080] Experimental Analysis: Introduction to the Experimental Dataset
[0081] To verify the performance of the proposed YOLOv8-DPMS algorithm in small target detection, and to confirm that it enhances small target feature extraction and multi-scale feature fusion while keeping the parameter count relatively low, this study uses the self-built Coal-Helmet dataset (referred to below as Coal-H) for experimental verification. The experiments evaluate the algorithm's small target detection performance in complex environments, in particular the model's sensitivity to and accuracy on small targets under constrained environmental conditions.
[0082] Coal-H self-built dataset: The Coal-H dataset is designed specifically for small target detection tasks in underground mines and is based on data from real underground operating environments. It contains 500 experimental images, each annotated with the location and category of every target to ensure accuracy and completeness. The training and test sets are divided in an 8:2 ratio. Sample images are shown in Figure 16. The dataset covers seven categories: pipeline, track, miner, safety helmet, impact suit, belt conveyor area, and no impact suit. Figure 16 is a schematic diagram of a portion of the Coal-H dataset.
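The 8:2 division of the 500 images can be sketched as follows. The source states only the ratio, so the random shuffle, the fixed seed, and the file names are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    items: Sequence[str], train_ratio: float = 0.8, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle reproducibly, then cut the list into training and test subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical image identifiers standing in for the 500 Coal-H images.
image_ids = [f"img_{i:04d}" for i in range(500)]
train_ids, test_ids = split_dataset(image_ids)
print(len(train_ids), len(test_ids))
```

With 500 images this gives 400 training images and 100 test images.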
[0083] Experimental evaluation indicators:
[0084] The main evaluation metrics for the performance of object detection models are precision and recall.
[0085] This invention uses three metrics widely adopted in the research field, namely precision, recall, and mean average precision (mAP), to measure the performance of the model in object detection. Precision and recall are usually treated as evaluation metrics for binary classification problems, with the class of interest treated as positive samples and all other classes treated as negative samples. In object detection, Intersection over Union (IoU) is frequently used to classify samples: if the ratio of the intersection to the union of the candidate box and the ground-truth labeled box is greater than a set threshold, the candidate box is considered a positive sample; otherwise it is considered a negative sample. The prediction results are therefore divided into four categories: TP (True Positives), TN (True Negatives), FP (False Positives), and FN (False Negatives), as shown in Table 1.
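The IoU test described above can be sketched in a few lines. Boxes are assumed to be axis-aligned and given as (x1, y1, x2, y2) corner coordinates; the function names and the example boxes are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes have zero area."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def is_positive(candidate: Box, ground_truth: Box, threshold: float = 0.5) -> bool:
    """A candidate counts as a positive sample when its IoU exceeds the threshold."""
    return iou(candidate, ground_truth) > threshold


labelled = (0.0, 0.0, 10.0, 10.0)
half_shifted = (5.0, 0.0, 15.0, 10.0)    # overlaps half of the labelled box
slightly_shifted = (1.0, 0.0, 11.0, 10.0)
print(iou(labelled, half_shifted), is_positive(slightly_shifted, labelled))
```

For a small target such as a safety helmet, a shift of only a few pixels changes the IoU substantially, which is one reason small targets are harder to match at a fixed threshold.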
[0086] Table 1
[0087]
[0088] Precision measures how well the model's detections agree with the ground truth labels. It is computed over the samples predicted as positive, whether they are actually positive or negative. Its calculation formula is as follows. The higher the precision, the smaller FP is, indicating higher purity of the predicted positive samples and fewer false detections.
[0089] Precision = TP / (TP + FP)
[0090] Recall is the ratio of correctly detected targets to the actual number of targets. It is computed over the samples that are actually positive, whether they are predicted as positive or negative. Its calculation formula (11) is as follows. The larger the recall, the smaller FN is: fewer positive samples are predicted as negative, and there are fewer missed detections.
[0091] Recall = TP / (TP + FN) (11)
[0092] mAP is the mean of the average precision (AP) values of all classes in the dataset and is often used to reflect the accuracy of the model as a whole. Its formulas are given in (12) to (14). The PR curve is obtained by plotting recall on the horizontal axis and precision on the vertical axis; the larger the area enclosed by the PR curve and the coordinate axes, the larger the mAP value. The commonly used mAP@0.5 is the mAP value at an IoU threshold of 0.5.
[0093]
[0094]
[0095]
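The metrics described above can be sketched as follows. Precision and recall follow directly from the TP, FP, and FN counts. AP is computed here as the area under the PR curve using the common all-point interpolation, which is an assumption: the description does not state which interpolation formulas (12) to (14) use. The function names and example detections are hypothetical.

```python
from typing import List, Sequence, Tuple


def precision(tp: int, fp: int) -> float:
    """Share of predicted positives that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of actual positives that were detected."""
    return tp / (tp + fn) if tp + fn else 0.0


def average_precision(detections: Sequence[Tuple[float, bool]], num_gt: int) -> float:
    """Area under the PR curve for one class.

    detections: (confidence, is_true_positive) for every predicted box.
    num_gt: number of ground-truth boxes of this class.
    """
    ordered = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls: List[float] = []
    precisions: List[float] = []
    for _, hit in ordered:
        if hit:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Replace each precision by the best precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, previous_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - previous_recall) * p
        previous_recall = r
    return ap


def mean_average_precision(ap_per_class: Sequence[float]) -> float:
    """mAP is the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Three detections for a class with two ground-truth boxes.
example = [(0.9, True), (0.8, False), (0.7, True)]
print(average_precision(example, num_gt=2))
```

For mAP@0.5, a detection is marked as a true positive when its IoU with an unmatched ground-truth box exceeds 0.5; mAP@0.5-0.95 averages the same quantity over a range of IoU thresholds.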
[0096] (2) Regarding detection speed, the model's parameter count (Params) is used as the evaluation metric. Params is the number of parameters contained in the model. For a convolutional layer, the calculation formula is as follows:
[0097] Params = C_in × C_out × K × K (15)
[0098] In the above formula, K represents the kernel size, and C_in and C_out represent the number of input and output channels, respectively.
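Formula (15) can be evaluated directly, as in the sketch below. It counts only the convolution weights, exactly as the formula does; bias terms and batch normalization parameters are not included. The layer sizes are hypothetical examples.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weight count of a K x K convolution per formula (15); bias is not counted."""
    return c_in * c_out * k * k


# Hypothetical layers: (input channels, output channels, kernel size).
layers = [(3, 16, 3), (16, 32, 3), (32, 64, 1)]
total = sum(conv_params(c_in, c_out, k) for c_in, c_out, k in layers)
print(conv_params(3, 16, 3), total)
```

Summing the per-layer counts over all layers gives the model totals that the result tables report in millions (M).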
[0099] Experimental Results and Analysis:
[0100] To verify the effectiveness and advantages of the proposed YOLOv8-DPMS, comparative experiments against YOLOv5, YOLOv6, YOLOv7, YOLOv9, YOLOv10, and YOLOv11 were conducted, together with ablation experiments, as shown in Table 2. All object detection experiments were performed on the Coal-H dataset. The experiments were based on the PyTorch deep learning framework, with Python as the development language. The hardware and software environment configurations are shown in Table 3. The training strategy was as follows: the batch size was 16, the initial learning rate was 0.01, the decay coefficient was 0.0005, the minimum learning rate was 0.0005, and the number of training epochs was 300.
[0101] Table 2
[0102]
[0103] Table 3
[0104]
[0105]
[0106] The results of the comparative experiments are shown in Table 4. The results of the comparative experiments for the small-target safety helmet category are shown in Table 5.
[0107] Table 4
[0108]
[0109] Table 5
[0110]
[0111] Tables 4 and 5 show the experimental results of the seven compared algorithms on the Coal-H dataset for four evaluation metrics (mAP@0.5, mAP@0.5-0.95, Recall, Params). Based on the experimental data, the following conclusions can be drawn:
[0112] As can be seen from Table 4, YOLOv8-DPMS achieved significant improvements in Recall, mAP@0.5, and mAP@0.5-0.95. Compared to YOLOv5, YOLOv8-DPMS improved Recall by 2.1 percentage points and mAP@0.5 by 3.4 percentage points. Notably, on mAP@0.5-0.95, a measure of overall detection accuracy, YOLOv8-DPMS achieved 0.787, outperforming YOLOv7 (0.755) and YOLOv9 (0.779). This indicates that YOLOv8-DPMS performs better in fine-grained target detection, especially in the accuracy of target localization and classification, and shows that its module design (DPC-Block, ShapeIoU, and MLCA) improves detection capability without significantly increasing computational cost.
[0113] As shown in Table 5, YOLOv8-DPMS also demonstrates a significant advantage in detecting small targets such as safety helmets. First, Recall improved from 0.846 for YOLOv5 to 0.896, an increase of 5.0 percentage points. This indicates that YOLOv8-DPMS detects small targets such as safety helmets more completely, which is particularly important for detection tasks in mining environments with numerous obstructions and small-sized targets. Second, mAP@0.5 improved from 0.834 for YOLOv6 to 0.942, an increase of 10.8 percentage points. This shows that YOLOv8-DPMS localizes small targets considerably more accurately, especially targets like safety helmets that typically have strong contrast with the background, resulting in a greater improvement in accuracy. Finally, on mAP@0.5-0.95, YOLOv8-DPMS achieved 0.556, significantly better than YOLOv5 (0.462) and YOLOv6 (0.432), indicating that its overall performance on small targets has also improved greatly.
[0114] By comparison, YOLOv8-DPMS shows a more pronounced improvement in small target detection, both at the 0.5 IoU threshold (mAP@0.5) and across thresholds (mAP@0.5-0.95). This indicates that it adapts better to small targets in the complex environment of mines and copes better with the diversity and complexity of small targets in mine operation scenarios.
[0115] The ablation experiment results are shown in Table 6. The ablation experiment results for the small-target safety helmet category are shown in Table 7.
[0116] Table 6
[0117]
[0118] Table 7
[0119]
[0120] Tables 6 and 7 show the experimental results of the four compared models on the Coal-H dataset for four evaluation metrics (mAP@0.5, mAP@0.5-0.95, Recall, Params). The mAP@0.5 metric for each category is shown in Figure 17. Based on the experimental data, the following conclusions can be drawn:
[0121] As shown in Table 6, the results for the overall categories reveal a gradual improvement in object detection performance from Model A to Model D. Specifically, Recall increased from 0.936 to 0.956, mAP@0.5 improved from 0.963 to 0.981, and mAP@0.5-0.95 improved from 0.762 to 0.787. Particularly noteworthy are the improvements over the base model, Model A, of 2.5 percentage points in mAP@0.5-0.95 and 1.8 percentage points in mAP@0.5. This trend indicates that introducing DPC-Block, ShapeIoU, and MLCA significantly enhances feature extraction and multi-scale object representation. Furthermore, although the number of parameters increased from 3.0M to 4.2M, the model remains small relative to the performance gained, demonstrating the efficiency of the improved design and its practicality and robustness.
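The gains quoted above are absolute differences between metric values, expressed in percentage points. The short sketch below recomputes them from the figures stated in the paragraph; the dictionary keys and helper name are hypothetical, and only the numbers come from the text.

```python
# Metric values for Model A and Model D as stated in the paragraph above.
model_a = {"recall": 0.936, "mAP@0.5": 0.963, "mAP@0.5-0.95": 0.762}
model_d = {"recall": 0.956, "mAP@0.5": 0.981, "mAP@0.5-0.95": 0.787}
params_a_m, params_d_m = 3.0, 4.2  # parameter counts in millions


def point_gain(before: float, after: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal place."""
    return round((after - before) * 100, 1)


gains = {name: point_gain(model_a[name], model_d[name]) for name in model_a}
param_increase_pct = round((params_d_m - params_a_m) / params_a_m * 100, 1)
print(gains, param_increase_pct)
```

The same subtraction reproduces the per-step gains reported for the safety helmet category in Table 7.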
[0122] Table 7 shows the detection results for the small target category "safety helmet," demonstrating a significant improvement in the model's ability to handle small targets. Introducing DPC-Block on top of Model A gave mAP@0.5 and mAP@0.5-0.95 values of 0.928 and 0.512, improvements of 4.4 and 2.2 percentage points over Model A. Recall was 0.865, an improvement of 1.9 percentage points over Model A, and Params was 3.9M, only 0.9M more than Model A. This shows that DPC-Block enhances small target detection and reduces missed detections with only a small increase in the number of parameters. Further introducing Shape-IoU on top of DPC-Block gave mAP@0.5 and mAP@0.5-0.95 values of 0.933 and 0.537, improvements of 0.5 and 2.5 percentage points, indicating the effectiveness of Shape-IoU for small object detection in complex situations. Finally, Model D adds the MLCA attention mechanism on top of the Shape-IoU model to strengthen the model's grasp of both detail and global information. Its mAP@0.5 and mAP@0.5-0.95 values are 0.942 and 0.556, improvements of 0.9 and 1.9 percentage points, indicating the effectiveness of the MLCA attention mechanism for small object detection in complex situations.
[0123] Compared to the results for the overall categories, the performance improvement in small target detection is more pronounced, indicating that the introduced modules (especially DPC-Block) play an important role in capturing small target features and background information. Furthermore, the final model (Model D) approaches ideal values on Recall and mAP@0.5, validating the adaptability and advantages of YOLOv8-DPMS for small target detection tasks in complex underground environments. Figure 17 shows the per-category mAP@0.5 results of the ablation experiments.
[0124] This invention proposes a small object detection algorithm, YOLOv8-DPMS, based on conventional convolution, C2f modules, and spatial pyramid pooling modules. It integrates an MLCA attention mechanism module that focuses on channel, spatial, and positional information, constructing a small object feature extraction network based on the MLCA attention mechanism. A small object feature fusion enhancement module, DPC-Block, based on deformable convolution and residual structures, is proposed, incorporating the MLCA attention mechanism to fuse and enhance shallow and deep features in the feature extraction network, constructing a feature fusion enhancement network. Based on these two networks, combined with object classification and bounding box prediction modules, small object detection is achieved. Experimental results on object detection datasets show that:
[0125] The YOLOv8-DPMS algorithm proposed in this invention achieves mAP@0.5 and mAP@0.5-0.95 values of 98.1% and 78.7% respectively on the overall categories of the Coal-H dataset, improvements of up to 3.4 and 3.6 percentage points over the compared YOLO series models. For the small-target safety helmet category, it achieves mAP@0.5 and mAP@0.5-0.95 values of 94.2% and 55.6% respectively, improvements of up to 10.8 and 12.4 percentage points over the compared YOLO series models.
[0126] In summary, the YOLOv8-DPMS algorithm proposed in this invention can effectively improve the detection accuracy of small targets.
[0127] It will be understood by those skilled in the art that the above descriptions are merely preferred examples of the invention and are not intended to limit the invention. Although the invention has been described in detail with reference to the foregoing examples, those skilled in the art can still modify the technical solutions described in the foregoing examples or make equivalent substitutions for some of the technical features. All modifications and equivalent substitutions made within the spirit and principles of the invention should be included within the scope of protection of the invention. All technical features in this embodiment can be freely combined according to actual needs.
Claims
1. A mine small target detection method based on deformable convolution and residual structure, characterized in that it specifically includes:
(1) Image preprocessing module: preprocesses the acquired raw image containing the target;
(2) Small target feature extraction network module based on MLCA attention mechanism: It realizes the attention to channel, spatial and positional information in the process of small target feature extraction, suppresses the interference of background noise and enhances the expressive power of the extracted features; the preprocessed image is input into this module to suppress the interference of background noise in the small target image and extract features with high representation ability, and then the features of different layers are input into the feature enhancement and fusion module;
(3) Small target feature fusion enhancement network module based on deformable convolution and residual structure: The features of different scales output by the C2f module and the spatial pyramid pooling module in the feature extraction network are input into the feature fusion enhancement network, so that the shallow and deep features in the feature extraction network are fused and enhanced, and the rich detailed information represented by the low-order features is preserved while learning high-order semantic features;
(4) Target classification and bounding box prediction network: Based on the feature fusion enhancement network module, the output of the four MLCA modules is enhanced, so that the prediction network can focus on important features and form four target prediction branches. Different anchor boxes are used to predict targets of different sizes. Each anchor box corresponds to a target of a specific size.
During the training process, in order to accelerate the convergence speed of the target detection model and improve the regression accuracy of the target prediction box, the ShapeIOU loss function is used to optimize the original loss function of the network to improve the detection accuracy. The small target feature extraction network based on the MLCA attention mechanism consists of 5 Conv modules, 4 C2f modules, 4 MLCA modules, and 1 Spatial Pyramid Pooling module (SPPF). Among them, the Conv module is a commonly used basic module, which consists of two-dimensional convolution Conv2d, batch normalization (BN) and activation function SiLU. It is used to enhance the non-linearity of convolution operations and serves as a basic unit to ensure that the model can capture low-level features. During the forward propagation process, C2f divides the input feature map into two parts, performs convolution and Bottleneck processing on these two parts respectively, and then reassembles the processed feature maps. Finally, the output is generated through the convolutional layer, which enhances the expressive power of the feature map and improves the model's ability to recognize complex targets. The MLCA module adopts a multi-scale channel perception strategy, which considers both channel information and spatial information. By combining local and global information, it improves the feature extraction capability of the network and further enhances the network's recognition accuracy for small targets. The SPPF module takes the feature map from the previous SPPF layer as input and uses multiple small pooling layers to replace the pooling operation of a single large kernel. Through pooling operations of different scales, SPPF can extract richer multi-scale features. 
By concatenating the outputs of each pooling layer, a fixed-length feature vector is generated, which ultimately enhances the network's adaptability to targets of different sizes, thereby further improving the effect of small target detection.
2. The mine small target detection method based on deformable convolution and residual structure according to claim 1, characterized in that: The principle of the image preprocessing module specifically includes the following steps: the original image is processed using the Mosaic data augmentation method. The Mosaic method uses 4 images and splices them together in a random scaling, cropping and arrangement manner to combine multiple images.
3. The mine small target detection method based on deformable convolution and residual structure according to claim 1, characterized in that: The small target feature fusion enhancement network module based on deformable convolution and residual structure consists of 6 CDPC modules, 3 upsampling modules, 3 Conv modules, and 6 Concat modules. Through dynamic feature fusion and cross-scale interaction, it effectively enhances the target detection capability in complex scenes.
4. The mine small target detection method based on deformable convolution and residual structure according to claim 1, characterized in that: The improvements to the detection layer and loss function in the target classification and bounding box prediction network are as follows: The detection head structure and loss function have been improved. To make the detection head focus more on important features and thus perform small target detection, an MLCA attention mechanism is added before `detect`. The improved detection head structure includes multiple convolutional layers (Conv) and a two-dimensional convolutional layer (Conv2d), divided into a bounding box regression branch and a classification branch. The regression branch uses a combined loss based on ShapeIOU and Distribution Focal Loss to optimize bounding box localization accuracy. The classification branch introduces Binary Cross-Entropy Loss (BCE) to improve classification performance.