A target detection method based on an improved YOLOX algorithm
By improving the network structure and training method of the YOLOX algorithm, the accuracy of small object detection is enhanced, the shortcomings of the YOLOX algorithm in small object detection are addressed, and a more efficient detection effect is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- JIANGSU UNIV OF SCI & TECH
- Filing Date
- 2023-05-18
- Publication Date
- 2026-06-26
AI Technical Summary
Current mainstream object detection algorithms lack accuracy when detecting images of small objects.
The YOLOX algorithm is improved by replacing the spatial pyramid module in the network structure with a dilated convolution module and adding a receptive field module RFB, which strengthens the extraction of small target feature maps; the dataset is also partitioned and the best training weights are selected.
This improves the detection accuracy on small target images with virtually no decrease in inference speed.
Figure CN116704310B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of target detection technology, and in particular to a target detection method based on an improved YOLOX algorithm.
Background Technology
[0002] The YOLOX algorithm improves significantly on its predecessor, YOLO v5, primarily by introducing YOLOHead (a decoupled head). YOLOHead handles the confidence score and the bounding box in separate branches and integrates them at prediction time, which greatly improves detection accuracy and inference speed. It achieves a 3% improvement in accuracy compared to YOLO v5. However, current mainstream object detection algorithms are designed for naturally distributed images, so their accuracy in detecting small objects is insufficient.
Summary of the Invention
[0003] In view of this, the purpose of this invention is to propose a target detection method based on the improved YOLOX algorithm to solve the technical problem of insufficient detection accuracy in images of small targets.
[0004] A target detection method based on an improved YOLOX algorithm is proposed. The method involves acquiring a trademark dataset and converting the images within the dataset into a format directly usable for training the YOLOX network, thus obtaining a usable trademark dataset. The usable trademark dataset is preprocessed, dividing all trademark images into training, validation, and test sets. The training, validation, and test sets are analyzed to determine all trademark types. The YOLOX network structure is improved to obtain an improved YOLOX target detection algorithm. The improved YOLOX target detection algorithm is trained using the training dataset, and the optimal training weight file is evaluated to obtain the final YOLOX target detection algorithm.
[0005] As a further improvement to this application, the step of obtaining the trademark dataset and converting the image format within the trademark dataset into a format that can be directly input into the YOLOX network for training includes: converting the images within the trademark dataset into PNG format images, and uniformly converting the image data recorded in the trademark dataset into VOC dataset format.
[0006] As a further improvement to this application, the step of preprocessing the available trademark dataset to divide all trademark images into a training set, a validation set, and a test set includes: dividing the available trademark dataset into a training set, a validation set, and a test set in a 6:2:2 ratio.
[0007] As a further improvement to this application, the analysis of the training set, validation set, and test set to determine all trademark types includes: counting the number of all trademark types, counting the names of the trademarks, and writing the names of the trademarks into the VOC dataset.
[0008] As a further improvement to this application, the YOLOX-based network structure is improved to obtain an improved YOLOX target detection algorithm, including: improving the feature extraction backbone network of the YOLOX network structure, replacing the spatial pyramid module SPP in the YOLOX network structure with ASPP, adding a receptive field module RFB after the small target branch of the feature pyramid module in the YOLOX network structure, and placing the receptive field module RFB before the 80*80 classifier to obtain the improved YOLOX target detection algorithm.
[0009] As a further improvement of this application, the step of training the improved YOLOX object detection algorithm using a training dataset and evaluating the optimal training weight file to obtain the final YOLOX object detection algorithm includes: training the improved YOLOX object detection algorithm using the training dataset and, after 300 batches of training, selecting the optimal training weight file and loading it into the improved YOLOX object detection algorithm to obtain the final YOLOX object detection algorithm.
[0010] Based on the above objectives, the present invention provides the following beneficial effects: by improving the YOLOX-based network structure, an improved YOLOX target detection algorithm is obtained; the improved YOLOX target detection algorithm is trained using a training dataset, and the optimal training weight file is evaluated to obtain the final YOLOX target detection algorithm, thereby enhancing the detection accuracy of images of small targets.
Attached Figure Description
[0011] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only for this invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0012] Figure 1 is a flowchart of a target detection method based on an improved YOLOX algorithm according to an embodiment of the present invention;
[0013] Figure 2 is a YOLOX network structure diagram for the target detection method based on an improved YOLOX algorithm according to an embodiment of the present invention;
[0014] Figure 3 is a network structure diagram of the improved YOLOX algorithm for the target detection method based on an improved YOLOX algorithm according to an embodiment of the present invention;
[0015] Figure 4 is a structural diagram of the receptive field module RFB for the target detection method based on an improved YOLOX algorithm according to an embodiment of the present invention;
[0016] Figure 5 is a structural diagram of the improved receptive field module RFB-s for the target detection method based on an improved YOLOX algorithm according to an embodiment of the present invention.
Detailed Implementation
[0017] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to specific embodiments.
[0018] It should be noted that, unless otherwise defined, the technical or scientific terms used in this invention should have the ordinary meaning understood by one of ordinary skill in the art to which this invention pertains. The terms "first," "second," and similar terms used in this invention do not indicate any order, quantity, or importance, but are merely used to distinguish different components. Terms such as "comprising" or "including" mean that the element or object preceding the word encompasses the elements or objects listed following the word and their equivalents, without excluding other elements or objects. Terms such as "connected" or "linked" are not limited to physical or mechanical connections, but can include electrical connections, whether direct or indirect. Terms such as "upper," "lower," "left," and "right" are used only to indicate relative positional relationships; when the absolute position of the described object changes, the relative positional relationship may also change accordingly.
[0019] A target detection method based on an improved YOLOX algorithm includes: acquiring a trademark dataset and converting the image format within the trademark dataset into a format that can be directly input into the YOLOX network for training, thus obtaining a usable trademark dataset. This step includes converting the images within the trademark dataset into PNG format images, uniformly converting the image data recorded in the trademark dataset into VOC dataset format, and iterating through all files to check that their formats are correct.
[0020] The available trademark dataset is preprocessed and then divided into training, validation, and test sets in a 6:2:2 ratio.
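The 6:2:2 division can be sketched as follows. This is only an illustration: the text does not say whether the split is shuffled or how remainders are handled, so the fixed random seed, the function name `split_dataset`, and the placeholder image IDs are all assumptions.

```python
import random

def split_dataset(items, ratios=(6, 2, 2), seed=0):
    """Shuffle items and split them into train/val/test by integer ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)  # assumed: a reproducible random shuffle
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # any remainder goes to the test set
    return train, val, test

# Hypothetical image IDs standing in for the trademark images.
image_ids = [f"logo_{i:04d}" for i in range(100)]
train, val, test = split_dataset(image_ids)
print(len(train), len(val), len(test))  # 60 20 20
```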
[0021] The training set, validation set, and test set are analyzed to determine all trademark types; this includes: counting the number of all trademark types, counting the names of the trademarks, and writing the trademark names into the VOC dataset.
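As a rough illustration of this counting step, the sketch below collects class names from VOC-style XML annotations using only the Python standard library. The two inline annotations and the class names `brand_a` and `brand_b` are hypothetical; in practice the XML would be read from the annotation files on disk.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Two hypothetical VOC-style annotations; real files would be read from disk.
ANNOTATIONS = [
    """<annotation><filename>a.png</filename>
         <object><name>brand_a</name></object>
         <object><name>brand_b</name></object>
       </annotation>""",
    """<annotation><filename>b.png</filename>
         <object><name>brand_a</name></object>
       </annotation>""",
]

def count_classes(xml_strings):
    """Count how many labelled objects each class name has."""
    counts = Counter()
    for text in xml_strings:
        root = ET.fromstring(text)
        for obj in root.iter("object"):
            counts[obj.findtext("name")] += 1
    return counts

counts = count_classes(ANNOTATIONS)
class_names = sorted(counts)          # the names that would be written to the class list
print(len(class_names), class_names)  # 2 ['brand_a', 'brand_b']
print(counts["brand_a"])              # 2
```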
[0022] An improved YOLOX object detection algorithm is obtained by modifying the YOLOX network structure. In the feature extraction backbone network of the YOLOX network structure, the spatial pyramid module (SPP) is replaced with the ASPP module, and a receptive field module (RFB) is added after the small target branch of the feature pyramid module, placed before the classifier corresponding to the 80*80 feature map. The improved YOLOX object detection algorithm strengthens the extraction of small target feature maps, thereby improving the detection accuracy of small targets. Because enlarging the receptive field with dilated convolutions does not add kernel weights, the overall inference speed remains essentially unchanged.
[0023] ASPP consists of a 1×1 convolution, a pooling pyramid, and ASPP Pooling. The dilation factor of each layer of the pooling pyramid can be customized, thus enabling flexible multi-scale feature extraction.
[0024] The difference between dilated convolutional layers and regular convolutions lies in the dilation rate, which controls the padding and dilation during convolution. By varying the padding and dilation, receptive fields of different scales can be obtained, extracting multi-scale information. Note that the kernel size remains constant at 3×3.
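The relationship between dilation, padding, and receptive field can be checked with the standard convolution arithmetic: a kernel of size k with dilation d spans k + (k - 1)(d - 1) input positions. The sketch below applies it to a 3×3 kernel on a 20-wide feature map; the dilation rates 1, 2, and 4 are illustrative assumptions, since the text does not list the rates actually used.

```python
def effective_kernel(k, dilation):
    """Span of a dilated kernel along one axis: k + (k - 1) * (dilation - 1)."""
    return k + (k - 1) * (dilation - 1)

def output_size(n, k, dilation, padding, stride=1):
    """Output length of a convolution along one axis."""
    return (n + 2 * padding - effective_kernel(k, dilation)) // stride + 1

# Assumed dilation rates; with a 3x3 kernel, padding = dilation keeps the size.
for d in (1, 2, 4):
    print(d, effective_kernel(3, d), output_size(20, 3, d, padding=d))
# 1 3 20
# 2 5 20
# 4 9 20
```

The kernel still holds 3×3 weights in every case; only the spacing between the sampled positions changes.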
[0025] ASPP utilizes multiple parallel dilated convolutional layers with different sampling rates in the original module. Features extracted for each sampling rate are further processed in a separate branch and then fused to generate the final result. Convolutional kernels with different receptive fields are constructed using varying dilation rates to acquire multi-scale object information. Compared to the Spatial Pyramid module SPP, ASPP, by adding dilated convolutions, achieves a larger receptive field without increasing computational cost, thus improving detection accuracy.
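The parallel-branch idea can be illustrated with a one-dimensional toy version in plain Python. This is not the patented module: the rates (1, 2, 4), the all-ones weights, and the function names are assumptions chosen so that the sampling pattern of each branch is easy to see.

```python
def dilated_conv1d(signal, weights, dilation):
    """3-tap dilated convolution with zero padding equal to the dilation,
    so the output has the same length as the input."""
    n = len(signal)
    out = []
    for i in range(n):
        total = 0.0
        for tap, w in enumerate(weights):   # tap = 0, 1, 2
            j = i + (tap - 1) * dilation    # sampled positions: i-d, i, i+d
            if 0 <= j < n:                  # positions outside act as zero padding
                total += w * signal[j]
        out.append(total)
    return out

def aspp_1d(signal, rates=(1, 2, 4)):
    """Run one branch per dilation rate and stack the results as channels."""
    weights = [1.0, 1.0, 1.0]  # placeholder weights; learned in a real network
    return [dilated_conv1d(signal, weights, r) for r in rates]

signal = [0.0] * 9
signal[4] = 1.0  # unit impulse in the middle
for rate, branch in zip((1, 2, 4), aspp_1d(signal)):
    print(rate, [i for i, v in enumerate(branch) if v])
# 1 [3, 4, 5]
# 2 [2, 4, 6]
# 4 [0, 4, 8]
```

Each branch uses the same three weights, yet the branch with rate 4 responds across the whole 9-sample window, which is the sense in which a larger receptive field is obtained without a larger kernel.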
[0026] The improved YOLOX object detection algorithm is trained using a training dataset, validated using a validation set, and tested using a test set. The optimal training weight file is then evaluated to obtain the final YOLOX object detection algorithm.
[0027] The improved YOLOX object detection algorithm's network structure includes a feature extraction backbone network, a spatial pooling pyramid module, a feature pyramid module, a receptive field module, and a classification module. The feature extraction backbone network is a CSPDarknet53 network structure, which includes a Focus module and five residual convolutional modules, three of which are used to output feature maps of 80*80, 40*40, and 20*20 respectively.
[0028] The improved YOLOX object detection algorithm network structure includes a feature extraction backbone network and a feature fusion module. As shown in Figure 2, the feature fusion module is also known as the Feature Pyramid Network (FPN). Its main function is to fuse the outputs of the last three feature layers of the backbone network and then output the final result. In the feature utilization part, YOLOX extracts a total of three feature layers for target detection.
[0029] The three feature layers are located at different positions in the backbone CSPDarknet53, namely the middle layer, the lower middle layer, and the bottom layer. When the input is (640,640,3), the shapes of the three feature layers are feat1=(80,80,256), feat2=(40,40,512), and feat3=(20,20,1024).
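The three shapes correspond to downsampling strides of 8, 16, and 32 on a 640-pixel input, which the short sketch below reproduces. The 16-pixel object used to show why the 80*80 level matters for small targets is a hypothetical example, as are the function names.

```python
def feature_shapes(input_size=640, strides=(8, 16, 32), channels=(256, 512, 1024)):
    """Return (height, width, channels) for each backbone output level."""
    return [(input_size // s, input_size // s, c) for s, c in zip(strides, channels)]

def cells_covered(object_px, stride):
    """Approximate number of grid cells an object spans along one axis."""
    return object_px / stride

shapes = feature_shapes()
print(shapes)  # [(80, 80, 256), (40, 40, 512), (20, 20, 1024)]

# A hypothetical 16-pixel-wide logo spans 2 cells at stride 8
# but only half a cell at stride 32.
print(cells_covered(16, 8), cells_covered(16, 32))  # 2.0 0.5
```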
[0030] The first path: the backbone (20,20,1024) feature map is upsampled and concatenated with the backbone (40,40,512) feature map; a convolution then reduces the channels, fusing the information and extracting further features to obtain merge_01.
[0031] The second path: merge_01 passes through a 1x1 convolution that reduces its channels, is upsampled, and is concatenated with the backbone (80,80,256) feature map; a convolution then reduces the channels to obtain merge_02 of size (80,80,256), which is output to YOLOHead_01.
[0032] The third path: merge_02 is downsampled and concatenated with merge_01; a convolution then reduces the channels to obtain merge_03 of size (40,40,512), which is output to YOLOHead_02.
[0033] The fourth path: merge_03 is downsampled by convolution and concatenated with the backbone (20,20,1024) feature map; a convolution then reduces the channels to obtain merge_04 of size (20,20,1024), which is output to YOLOHead_03.
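The four paths can be followed with simple shape bookkeeping. The sketch below tracks only (height, width, channels) tuples; `conv` stands in for whatever convolution block performs the fusion, and the intermediate channel counts (for example the 256-channel reduction in the second path) are assumptions. Only the shapes of merge_02, merge_03, and merge_04 are stated in the text.

```python
def conv(shape, out_channels, stride=1):
    """Stand-in for a convolution block: sets channels, optionally downsamples."""
    h, w, _ = shape
    return (h // stride, w // stride, out_channels)

def upsample(shape, factor=2):
    h, w, c = shape
    return (h * factor, w * factor, c)

def concat(a, b):
    assert a[:2] == b[:2], "spatial sizes must match before concatenation"
    return (a[0], a[1], a[2] + b[2])

feat1, feat2, feat3 = (80, 80, 256), (40, 40, 512), (20, 20, 1024)

# Path 1: upsample feat3, concatenate with feat2, fuse.
merge_01 = conv(concat(upsample(feat3), feat2), 512)

# Path 2: reduce channels (assumed 256), upsample, concatenate with feat1, fuse.
merge_02 = conv(concat(upsample(conv(merge_01, 256)), feat1), 256)

# Path 3: downsample merge_02, concatenate with merge_01, fuse.
merge_03 = conv(concat(conv(merge_02, 256, stride=2), merge_01), 512)

# Path 4: downsample merge_03, concatenate with feat3, fuse.
merge_04 = conv(concat(conv(merge_03, 512, stride=2), feat3), 1024)

print(merge_02, merge_03, merge_04)
# (80, 80, 256) (40, 40, 512) (20, 20, 1024)
```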
[0034] The Feature Pyramid Network (FPN) fuses feature maps at three scales, but each output has its own emphasis, which yields better detection performance for targets of different scales. To improve the detection accuracy of small targets, the extraction capability for the (80, 80) feature map must be strengthened. This application proposes adding a receptive field module (RFB) between merge_02 of the second path and YOLOHead_01 to strengthen the extraction of small target feature maps.
[0035] The receptive field module (RFB) aims to simulate the receptive field of human vision in order to strengthen the network's feature extraction capability. As shown in Figure 4, the receptive field module RFB mainly adds dilated convolutions to the Inception architecture, thereby effectively enlarging the receptive field. Compared to the Inception structure, it also adds a cross-linking structure, and the outputs of the convolutional layers with different kernel sizes and dilation rates are finally concatenated. The dilated convolutional layers mentioned above must be assigned a coefficient, and experiments have shown that a coefficient value of 0.1 is optimal.
[0036] As a further improvement to this application, the receptive field module RFB can be further improved to obtain an improved receptive field module RFB-s. As shown in Figure 5, the improved receptive field module RFB-s has two main improvements compared to the receptive field module RFB. On the one hand, it uses a 3×3 convolutional layer instead of a 5×5 convolutional layer; on the other hand, it uses 1×3 and 3×1 convolutional layers instead of a 3×3 convolutional layer. The main purpose is to reduce the amount of computation. This application adopts the structure of the improved receptive field module RFB-s.
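The saving from the two substitutions can be estimated by counting kernel weights. The sketch below assumes equal input and output channel counts and ignores biases; the channel count of 64 is a hypothetical value, and the ratios do not depend on it.

```python
def conv_weights(kh, kw, c_in, c_out):
    """Number of kernel weights in a convolution layer (bias ignored)."""
    return kh * kw * c_in * c_out

c = 64  # hypothetical channel count

five = conv_weights(5, 5, c, c)
three = conv_weights(3, 3, c, c)
pair = conv_weights(1, 3, c, c) + conv_weights(3, 1, c, c)

print(five, three, pair)        # 102400 36864 24576
print(three / five)             # 0.36
print(round(pair / three, 3))   # 0.667
```

Under these assumptions, replacing 5×5 with 3×3 keeps 36% of the weights, and replacing 3×3 with a 1×3 plus 3×1 pair keeps about two thirds.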
[0037] The improved YOLOX object detection algorithm is trained using the training dataset. After 300 batches of training, the best training weight file is selected and loaded into the improved YOLOX object detection algorithm to obtain the final YOLOX object detection algorithm.
[0038] During the first 250 batches, the model is validated on the validation set every 25 batches to observe changes in accuracy and loss. During the last 50 batches, it is validated after every batch. This yields a complete set of weight files. By examining the accuracy curve across batches, the weight file with the highest accuracy is selected as the trademark detection model.
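The validation schedule described above can be written down directly. The sketch treats each "batch" in the text as one full training round, which is an assumption about the terminology; the function name and the accuracy values used to show the selection step are hypothetical.

```python
def validation_rounds(total=300, sparse_until=250, sparse_every=25):
    """Rounds at which validation runs: every 25th up to 250, then every round."""
    rounds = [r for r in range(1, sparse_until + 1) if r % sparse_every == 0]
    rounds += list(range(sparse_until + 1, total + 1))
    return rounds

schedule = validation_rounds()
print(len(schedule))                  # 60
print(schedule[:3], schedule[9:12])   # [25, 50, 75] [250, 251, 252]

# Hypothetical validation accuracies for three checkpoints.
accuracy = {250: 0.81, 287: 0.86, 300: 0.85}
best_round = max(accuracy, key=accuracy.get)
print(best_round)                     # 287
```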
[0039] The 300-batch training includes the following steps:
[0040] a. Use the Mosaic data augmentation method: four images are randomly scaled, cropped, and arranged, then stitched together, which enriches the background of the stitched image.
[0041] b. Use the Mixup data augmentation method: first read an image, pad its sides, and scale it to 640x640, keeping its detection box; then randomly select another image, pad its top and bottom, and scale it to 640x640, keeping its detection box as well; finally, set a fusion coefficient and blend the two images to obtain a new image with two detection boxes, thereby strengthening the training.
[0042] c. Use SGD as the optimizer with a cosine annealing learning rate schedule. Training begins with 5 batches of warm-up, and all data augmentation is turned off in the last 50 batches.
[0043] d. Use GIOU as the loss function: GIOU Loss = 1 - GIOU.
[0044] The function of GIOU:
[0045] For intersecting boxes, IOU can be backpropagated, meaning it can be used directly as the objective function for optimization. For non-intersecting boxes, however, IOU is 0 and its gradient is zero, making optimization impossible. GIOU avoids this problem, so it can be used as the objective function.
[0046] GIOU can also distinguish how the two boxes are aligned.
[0047] IOU takes values in the range [0,1], whereas GIOU has a symmetric range of [-1,1]. It reaches its maximum value of 1 when the two boxes coincide, and approaches its minimum value of -1 when they do not intersect and are infinitely far apart. GIOU is therefore a very good distance metric. GIOU takes into account not only the overlapping area but also the non-overlapping areas.
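The behaviour described for GIOU can be checked numerically with its standard definition, GIOU = IOU - (|C| - |A ∪ B|) / |C|, where C is the smallest box enclosing both A and B. The sketch below is a plain-Python illustration, not the training code; the (x1, y1, x2, y2) box format and the function names are assumptions.

```python
def giou(box_a, box_b):
    """GIoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # Smallest enclosing box C of the two boxes.
    c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (c_area - union) / c_area

def giou_loss(box_a, box_b):
    return 1.0 - giou(box_a, box_b)

print(giou((0, 0, 2, 2), (0, 0, 2, 2)))                 # 1.0 (identical boxes)
print(round(giou((0, 0, 1, 1), (2, 0, 3, 1)), 3))       # -0.333 (disjoint, close)
print(round(giou((0, 0, 1, 1), (99, 0, 100, 1)), 3))    # -0.98 (disjoint, far apart)
```

Both disjoint pairs have an IOU of 0, but their GIOU values differ, so the loss still tells the optimizer which pair is closer.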
[0048] e. Regarding the selection of predicted bounding boxes: the anchor boxes that YOLO v5 obtains by clustering are applicable only to a single dataset and are not universal, and anchor boxes increase the complexity of the detection head and the number of generated results. YOLOX instead adopts an anchor-free strategy. The anchor-free decoding logic is simpler and more readable, and it significantly reduces the number of design parameters that need heuristic tuning and the many techniques involved, such as anchor clustering and grid sensitivity. This makes the detector, especially its training and decoding phases, quite simple. With the anchor-free method, the predictions at each position are reduced from 3 to 1, and each prediction directly outputs 4 values: the two offsets relative to the top-left corner of the grid cell, and the height and width of the predicted bounding box. This modification reduces the detector's parameters and GFLOPs, making inference faster while achieving higher performance.
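A minimal sketch of anchor-free decoding for a single grid cell follows. It assumes the exponential width and height parameterization used in the public YOLOX code, which the text above does not spell out; the function name and the example values are hypothetical.

```python
import math

def decode(pred, grid_x, grid_y, stride):
    """Turn one raw prediction into (center_x, center_y, width, height) in pixels.

    pred = (dx, dy, log_w, log_h): offsets from the cell's top-left corner
    and log-scale width/height (assumed parameterization).
    """
    dx, dy, log_w, log_h = pred
    cx = (grid_x + dx) * stride
    cy = (grid_y + dy) * stride
    w = math.exp(log_w) * stride
    h = math.exp(log_h) * stride
    return cx, cy, w, h

print(decode((0.5, 0.5, 0.0, 0.0), grid_x=10, grid_y=20, stride=8))
# (84.0, 164.0, 8.0, 8.0)

# Predictions per 640x640 image: one per cell versus three anchors per cell.
cells = sum(s * s for s in (80, 40, 20))
print(cells, 3 * cells)  # 8400 25200
```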
[0049] The 20*20 feature map is input into the Spatial Pooling Pyramid (ASPP) module. After the ASPP module stitches the feature maps, the output feature map is input into the Feature Pyramid module. The 80*80, 40*40, and 20*20 feature maps are output from the Feature Pyramid module and then input into the classification module for model classification and regression detection, and the final result is output.
[0050] Those skilled in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope of the invention (including the claims) is limited to these examples; within the framework of the invention, the technical features of the above embodiments or different embodiments can also be combined, the steps can be implemented in any order, and there are many other variations of the different aspects of the invention as described above, which are not provided in the details for the sake of brevity.
[0051] This invention is intended to cover all such substitutions, modifications, and variations that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this invention should be included within the scope of protection of this invention.
Claims
1. A target detection method based on an improved YOLOX algorithm, characterized in that it comprises: obtaining a trademark dataset and converting the image format within the trademark dataset into a format that can be directly input into the YOLOX network for training, thus obtaining a usable trademark dataset; preprocessing the usable trademark dataset to divide all trademark images into a training set, a validation set, and a test set; analyzing the training set, validation set, and test set to determine all trademark categories; improving the YOLOX network structure to obtain an improved YOLOX target detection algorithm; and training the improved YOLOX target detection algorithm using the training dataset and evaluating the optimal training weight file to obtain the final YOLOX target detection algorithm; wherein improving the YOLOX network structure to obtain the improved YOLOX target detection algorithm includes: improving the feature extraction backbone network of the YOLOX network structure by replacing the spatial pyramid module (SPP) with ASPP; adding an improved receptive field module RFB-s after the small target branch of the feature pyramid module in the YOLOX network structure, the improved receptive field module RFB-s being placed before the 80*80 classifier and the small target branch being the 80*80 feature map branch; and, in the improved receptive field module RFB-s, using a 3×3 convolutional layer instead of a 5×5 convolutional layer, and a 1×3 and a 3×1 convolutional layer instead of a 3×3 convolutional layer.
2. The target detection method based on the improved YOLOX algorithm according to claim 1, characterized in that the step of obtaining the trademark dataset and converting the image format within the trademark dataset into a format that can be directly input into the YOLOX network for training includes: converting the images in the trademark dataset into PNG format images, and uniformly converting the image data recorded in the trademark dataset into VOC dataset format.
3. The target detection method based on the improved YOLOX algorithm according to claim 1, characterized in that the step of preprocessing the usable trademark dataset and dividing all trademark images into training, validation, and test sets includes: dividing the usable trademark dataset into training, validation, and test sets in a 6:2:2 ratio.
4. The target detection method based on the improved YOLOX algorithm according to claim 2, characterized in that the step of analyzing the training set, validation set, and test set to determine all trademark types includes: counting the number of all trademark types, counting the names of the trademarks, and writing the trademark names into the VOC dataset.
5. The target detection method based on the improved YOLOX algorithm according to claim 1, characterized in that the step of training the improved YOLOX object detection algorithm using a training dataset and evaluating the optimal training weight file to obtain the final YOLOX object detection algorithm includes: training the improved YOLOX object detection algorithm using the training dataset and, after 300 batches of training, selecting the best training weight file and loading it into the improved YOLOX object detection algorithm to obtain the final YOLOX object detection algorithm.