Cherry defect and grading detection method based on improved YOLOX model
By improving the YOLOX model and combining it with the FPN feature pyramid network, Focal Loss, and CBAM attention mechanism, the problems of detection accuracy and speed in cherry grading were solved, achieving high-precision cherry defect and grading detection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- DALIAN UNIV
- Filing Date
- 2022-09-26
- Publication Date
- 2026-06-23
AI Technical Summary
Existing technologies suffer from limitations in cherry grading efficiency and accuracy due to factors such as complex environments, inconspicuous defects, and imbalances between positive and negative samples. Consequently, it is difficult to achieve high-precision and high-speed cherry defect and grading detection.
An improved YOLOX model is adopted, which improves the accuracy of cherry defect and grade detection by configuring fusion factors and introducing FPN feature pyramid network, combined with cross-entropy loss function Focal Loss and attention mechanism CBAM.
The improved YOLOX model achieved an average accuracy of 97.59% in cherry defect detection and 95.92% in graded detection, representing improvements of 5.75% and 6.99% respectively compared to the original network, significantly enhancing detection accuracy.
Figure CN115496729B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of fruit grading technology, specifically relating to a cherry defect and grading detection method based on an improved YOLOX model.
Background Technology
[0002] Fruit grading has always been a crucial step in the sales of fruit and vegetable products, and with the growing success of e-commerce, fruits can now be distributed and sold globally. To capture a significant market share, industrialized fruit grading is essential.
[0003] Using image vision and neural network algorithms for defect detection and grading of fruits is currently a hot research topic, with many scholars both domestically and internationally conducting extensive work in this area. However, in practical applications of cherry grading, factors such as complex environments, inconspicuous defects, and imbalanced positive and negative samples can easily affect detection efficiency and accuracy. Therefore, improving the detection accuracy and speed of cherry grading is crucial for its application.
Summary of the Invention
[0004] To address the shortcomings of existing technologies, this invention provides a cherry defect and grading detection method based on an improved YOLOX model. By improving the multi-feature fusion module and loss function, and combining it with the attention mechanism CBAM, the accuracy of cherry defect and grading detection is improved.
[0005] The technical solution adopted by this invention to solve its technical problem is: a cherry defect and grading detection method based on an improved YOLOX model. The improved YOLOX model includes three parts: a CSPDarknet network, an FPN feature pyramid network, and a YOLO Head network. Among them, CSPDarknet serves as the backbone feature extraction network of YOLOX, the FPN feature pyramid network is used to solve the multi-scale problem in the detection process, and the YOLO Head network is used for feature point judgment.
[0006] The cherry defect and grading detection method includes: using an improved YOLOX model to detect defective fruits, configuring a fusion factor for the FPN feature pyramid network, and integrating the cross-entropy loss function Focal Loss into the loss function; using an improved YOLOX model to grade intact fruits, and introducing a CBAM attention mechanism module.
[0007] Furthermore, the CSPDarknet network first resizes each input cherry image to 640×640, then extracts features from it using a Focus network structure, adjusts the number of channels using convolutional normalization and activation functions, and then extracts features through four Resblock body structures, with an SPP structure added in the fourth Resblock body structure.
[0008] Furthermore, the feature fusion method of the FPN feature pyramid network is expressed by the following formula:
[0009] $$P_i = f_{layer_i}\left(f_{inner_i}(C_i) + \alpha_i^{i+1} \cdot f_{upsample}(P'_{i+1})\right)$$
[0010] Where $f_{inner}$ represents a 1×1 convolution operation for channel matching; $f_{upsample}$ represents a 2× upsampling operation for resolution matching; $f_{layer}$ represents a 3×3 convolution operation for feature processing; and $\alpha$ represents the fusion factor.
[0011] Furthermore, the fusion factor α is set to 0.5.
[0012] Furthermore, the integration of the cross-entropy loss function (Focal Loss) into the loss function is expressed as follows:
[0013]
[0014]
[0015] Where $p_t$ represents the probability that the predicted sample belongs to class 1; $\gamma$ is an adjustable parameter, $\gamma \geq 0$; and $(1-p_t)^{\gamma}$ represents the modulation coefficient.
[0016] Furthermore, the CBAM attention mechanism module includes a channel attention module and a spatial attention module. The output of the convolutional layer first passes through a channel attention module to obtain a weighted result, and then passes through a spatial attention module for final weighted summation to obtain the final result. Its mathematical expression is:
[0017] $$F' = M_C(F) \otimes F$$
[0018] $$F'' = M_S(F') \otimes F'$$
[0019] Where $\otimes$ represents element-wise multiplication; $F$ represents the input feature map; $M_C(F)$ represents the output of the channel attention module; $M_S(F')$ represents the output of the spatial attention module; and $F''$ represents the feature map output by the CBAM attention mechanism module.
[0020] Furthermore, the channel attention module first processes the input feature map through average pooling and max pooling operations, then feeds it into a multilayer perceptron (MLP) with shared weights. The MLP contains one hidden layer, equivalent to two fully connected layers. Finally, a sigmoid activation function is used to obtain the channel attention map, which is the output of the channel attention module. The expression is:
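[0021] $$M_C(F) = \sigma\bigl(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\bigr) = \sigma\bigl(W_1(W_0(F_{avg})) + W_1(W_0(F_{max}))\bigr)$$
[0022] Where $\sigma$ represents the Sigmoid activation function; $F_{avg}$ and $F_{max}$ are the average-pooled and max-pooled features; and $W_0$ and $W_1$ represent the weights of the MLP, with $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$.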
[0023] Furthermore, the spatial attention module first performs max pooling and average pooling along the channel axis at each feature point and stacks the two results to generate a feature map with 2 channels; then, it reduces the number of channels to 1 through a 7×7 convolution; finally, it applies a sigmoid activation function to obtain the spatial attention map, which is the output of the spatial attention module. The expression is:
[0024] $$M_S(F) = \sigma\bigl(f^{7\times7}([\mathrm{AvgPool}(F);\ \mathrm{MaxPool}(F)])\bigr)$$
[0025] Here, 7×7 represents the size of the convolution kernel of $f^{7\times7}$.
[0026] The beneficial effects of this invention include: for the defect detection network, configuring the FPN with a fusion factor improves the detection capability for cherries with inconspicuous defects, and integrating Focal loss into the loss function alleviates the problem of sample imbalance between different classes, achieving an average detection accuracy of 97.59%, an improvement of 5.75% compared to the original network. For the cherry grading detection network, incorporating an attention mechanism guides the model's focus, achieving an average detection accuracy of 95.92%, an improvement of 6.99% compared to the original network. Therefore, this invention significantly improves the accuracy of cherry defect and grading detection.
Attached Figure Description
[0027] Figure 1 is a diagram of the cherry types in defect detection;
[0028] Figure 2 is a diagram of cherry size, color, and type;
[0029] Figure 3 is a schematic diagram of the hardware device for cherry image acquisition;
[0030] Figure 4 is a flowchart of the cherry defect and grading detection system;
[0031] Figure 5 is a diagram of the YOLOX model structure;
[0032] Figure 6 is a diagram of the YOLO Head structure;
[0033] Figure 7 is a diagram of the FPN structure;
[0034] Figure 8 is a graph showing the effect of changes in the fusion factor on the experimental results;
[0035] Figure 9 shows the Eval mAP curves of models with different optimization strategies;
[0036] Figure 10 is a comparison chart of model losses;
[0037] Figure 11 is a structural diagram of the CBAM module;
[0038] Figure 12 is a diagram of the channel attention module structure;
[0039] Figure 13 is a diagram of the spatial attention module structure;
[0040] Figure 14 is a graph of the loss function.
Detailed Implementation
[0041] The technical solution of the present invention will now be clearly and completely described with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0042] Furthermore, the technical features involved in the different embodiments of the present invention described below can be combined with each other as long as they do not conflict with each other.
[0043] This invention presents a cherry defect and grading detection method based on an improved YOLOX model, aiming to achieve rapid cherry grading under industrial conditions. First, the YOLOX model is used to detect defective fruit. Appropriate fusion factors are set for the feature pyramid network to improve the detection accuracy of inconspicuous defects, and Focal Loss is integrated into the loss function. Then, the YOLOX model is used to grade intact fruit, and a CBAM attention mechanism is introduced to enhance network feature extraction. Based on this scheme, the average detection accuracy for cherry surface defects is 97.59%, and the average detection accuracy for size and color grading is 95.92%. The improved YOLOX model of this invention significantly improves the accuracy of cherry defect and grading detection.
[0044] Example 1
[0045] Because publicly available cherry datasets lack images of cherries with different defects, sizes, and colors, it is difficult to obtain suitable data by relying solely on publicly available datasets. To address this issue, data was collected from local cherry orchards in Jinzhou District, Dalian City, Liaoning Province, and the dataset was completed through photography and annotation using laboratory equipment.
[0046] The collected cherries were placed on a roller on the laboratory equipment. The roller was rotated to capture images of the samples from various angles. Each image contained 1 to 10 cherries, with a resolution of 2046 pixels × 1080 pixels, and all images were in JPG format. A total of 10,000 images were collected. As shown in Figure 1, defective fruits are classified into nine categories, in order: tip crack, crack, disease, irritated growth, rot, dry scar, deformity, mold, and twinning. The last category is intact fruit. As shown in Figure 2, intact fruits are categorized into six types based on their size and ripeness (color): large ripe fruit, large semi-ripe fruit, medium ripe fruit, medium semi-ripe fruit, small ripe fruit, and small semi-ripe fruit. LabelImg software was used to annotate the images, and the annotation information was saved as XML files in PASCAL VOC format.
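As a rough illustration of how annotations in PASCAL VOC format can be read back for training, the sketch below parses one annotation with Python's standard library. The file name, box coordinates, and the `parse_voc` helper are hypothetical and are not taken from the patent's dataset; only the class names and image resolution come from the description.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation for one image; the file name and box values are
# illustrative only and are not taken from the patent's dataset.
SAMPLE_XML = """
<annotation>
  <filename>cherry_0001.jpg</filename>
  <size><width>2046</width><height>1080</height><depth>3</depth></size>
  <object>
    <name>crack</name>
    <bndbox><xmin>100</xmin><ymin>200</ymin><xmax>220</xmax><ymax>330</ymax></bndbox>
  </object>
  <object>
    <name>intact</name>
    <bndbox><xmin>600</xmin><ymin>410</ymin><xmax>735</xmax><ymax>540</ymax></bndbox>
  </object>
</annotation>
"""

def parse_voc(xml_text):
    """Return (filename, [(label, xmin, ymin, xmax, ymax), ...])."""
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for obj in root.iter("object"):
        bb = obj.find("bndbox")
        coords = [int(bb.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append((obj.findtext("name"), *coords))
    return root.findtext("filename"), boxes

filename, boxes = parse_voc(SAMPLE_XML)
print(filename, boxes)
```

Each tuple carries one labelled cherry, so an image with several cherries yields several tuples.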
[0047] As shown in Figure 3, the aforementioned laboratory hardware mainly consists of two parts: an image acquisition device and a computer processing unit. Specifically, the computer processing unit 1 is connected to the synchronous light source controller 2 and the image acquisition device 6. The image acquisition device 6 consists of a CMOS industrial camera (acA2000-50g type), a lens (M1614-MP29(CH)3 type), a strobe controller, and an infrared trigger. The camera is triggered to take pictures through the strobe controller and the infrared trigger. The computer processing unit 1 mainly has a GeForce GTX 3080 graphics card, 16G of RAM, a POE gigabit network card, an Intel(R) Core(TM) i9-10900K processor, and 32G of DDR4 3000 RAM. An LED light source diffuser 8 and a tiered light source cover 7 are provided to ensure image quality. A conveyor gear 5 is provided to move the cherries to be photographed forward, and a roller 4 is provided to flip the cherries. A laser proximity sensor 3 connected to the computer processing unit 1 is provided to switch on the LED light source diffuser 8 when there are cherries at the current position.
[0048] Cherry defect and grading detection is primarily based on three criteria: the presence of defects, fruit size, and ripeness. In practice, defective cherries are first removed, and the good cherries are then graded. A cherry defect and grading system mainly consists of two parts: the first part detects cherry defects, and the second part grades the cherries based on size and color. The system flowchart is shown in Figure 4.
[0049] This invention utilizes an improved YOLOX network to detect cherry defects and achieve real-time cherry grading. The YOLOX network model mainly consists of three parts: CSPDarknet, FPN, and YOLO Head, as shown in Figure 5.
[0050] The feature extraction network is the backbone of object detection, determining the speed and accuracy of the detection model. The backbone feature extraction network generates three effective feature layers. YOLOX's backbone feature extraction network is CSPDarknet. For each input cherry image, it is first resized to 640×640 pixels, then features are extracted using a Focus network structure. Afterwards, convolutional normalization and activation functions are used to adjust the number of channels, followed by four Resblock body structures for further feature extraction. The Resblock body structure first uses a 3×3 convolution to compress the height and width and adjust the number of channels, then uses a CSPlayer structure for feature extraction. The fourth Resblock body structure incorporates an SPP structure, which uses max pooling with different kernels for feature extraction, stacks the pooled results, and then uses convolution to adjust the number of channels.
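The description names the Focus structure without detailing it. In the public YOLOX reference implementation, Focus is a space-to-depth slicing step that halves the height and width and quadruples the channel count before the first convolution, so a 640×640×3 input becomes 320×320×12. The sketch below shows only that slicing on nested lists; it assumes the patent uses the same structure, and `focus_slice` is a name chosen for this illustration.

```python
def focus_slice(image):
    """Space-to-depth slicing as used by the Focus layer.

    image: list of H rows, each a list of W pixels, each pixel a list of C values.
    Returns an (H/2) x (W/2) grid whose pixels have 4*C values.
    """
    h, w = len(image), len(image[0])
    assert h % 2 == 0 and w % 2 == 0, "height and width must be even"
    out = []
    for y in range(0, h, 2):
        row = []
        for x in range(0, w, 2):
            # Concatenate the four pixels of each 2x2 cell along the channel axis:
            # top-left, bottom-left, top-right, bottom-right.
            row.append(image[y][x] + image[y + 1][x]
                       + image[y][x + 1] + image[y + 1][x + 1])
        out.append(row)
    return out

# Toy 4x4 single-channel image whose pixel value encodes its position.
img = [[[10 * y + x] for x in range(4)] for y in range(4)]
sliced = focus_slice(img)
print(len(sliced), len(sliced[0]), len(sliced[0][0]))  # 2 2 4
print(sliced[0][0])  # [0, 10, 1, 11]
```

No pixel is discarded by the slicing, which is why it can reduce resolution without losing information before the convolution that follows.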
[0051] FPN primarily addresses the multi-scale problem in object detection by significantly improving the performance of small object detection through simple changes in network connections. Since lower-level features have less semantic information but accurate target location, while higher-level features have richer semantic information but less clear target location, the top-level features are upsampled and fused with lower-level features, and each layer is predicted independently.
[0052] In YOLOX, the YOLO Head is divided into two parts, which are then integrated during final prediction. The YOLO Head structure is shown in Figure 6. Cls. is used to determine the type of object contained in each feature point; Reg. is used to determine the regression parameters of each feature point, from which a prediction box is obtained after adjustment; and Obj. is used to determine whether each feature point contains an object.
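To make the roles of Cls., Reg., and Obj. concrete, the following sketch decodes the three outputs for a single feature point. It follows the decoding convention of the public YOLOX reference implementation (centre offsets added to the grid index and scaled by the stride, width and height predicted in log space, final score equal to objectness times class confidence). The patent does not spell this step out, so treat it as an assumption; `decode_point` and its arguments are illustrative names.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_point(reg, obj_logit, cls_logits, grid_x, grid_y, stride):
    """Decode the raw head outputs of one feature point.

    reg: (dx, dy, dw, dh) raw regression outputs.
    Returns (cx, cy, w, h, class_index, score) in input-image pixels.
    """
    dx, dy, dw, dh = reg
    cx = (grid_x + dx) * stride          # centre offset relative to the grid cell
    cy = (grid_y + dy) * stride
    w = math.exp(dw) * stride            # width and height are predicted in log space
    h = math.exp(dh) * stride
    cls_probs = [sigmoid(c) for c in cls_logits]
    best = max(range(len(cls_probs)), key=cls_probs.__getitem__)
    score = sigmoid(obj_logit) * cls_probs[best]   # objectness times class confidence
    return cx, cy, w, h, best, score

# One feature point on the stride-8 layer with two candidate classes.
box = decode_point((0.5, 0.5, 0.0, 0.0), 2.0, [-1.0, 3.0],
                   grid_x=10, grid_y=4, stride=8)
print(box)
```

Boxes whose score falls below a chosen threshold would then be dropped before non-maximum suppression.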
[0053] In cherry defect detection, the first step is to remove defective cherries. Generally, cherry defects can be divided into two categories. Category 1 defects include nose-tip cracks, splits, lesions, irritated growth, dry scars, deformities, and twinning. These defective cherries can be sold as substandard fruit at a reduced price. Category 2 defects include rot and mold. These defective cherries need to be separated from other cherries promptly to avoid further losses.
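The two defect categories imply a simple routing rule after the defect detector. A minimal sketch follows, assuming the detector emits the class names listed in the dataset description. The action labels `discard`, `substandard`, and `grade` and the `route_cherry` helper are hypothetical names for this illustration.

```python
# Class names follow the defect list in the description; the action labels
# ("discard", "substandard", "grade") are hypothetical names for this sketch.
CATEGORY_1 = {"tip crack", "crack", "disease", "irritated growth",
              "dry scar", "deformity", "twinning"}
CATEGORY_2 = {"rot", "mold"}

def route_cherry(defect_label):
    """Map the defect detector's label for one cherry to a sorting action."""
    if defect_label in CATEGORY_2:
        return "discard"        # separate promptly to avoid further losses
    if defect_label in CATEGORY_1:
        return "substandard"    # sold at a reduced price
    if defect_label == "intact":
        return "grade"          # passed on to the size and color grading network
    raise ValueError(f"unknown label: {defect_label}")

for label in ("rot", "crack", "intact"):
    print(label, "->", route_cherry(label))
```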
[0054] In this embodiment, a fusion factor is used to configure the FPN. The difficulty in detecting cherries with inconspicuous defects lies mainly in the small scale of the target itself, the limited information content, and the insufficient detail features. FPN, as a multi-scale detection method, is suitable for algorithms handling small target detection. Two main factors affect the performance of FPN in small target detection: the downsampling factor and the coupling degree between adjacent feature layers. For the downsampling factor, a lower downsampling ratio results in a larger feature map, which is more suitable for small target detection but requires more computation. The FPN feature fusion method is shown in Figure 7 and can be expressed by equation (1):
[0055] $$P_i = f_{layer_i}\left(f_{inner_i}(C_i) + \alpha_i^{i+1} \cdot f_{upsample}(P'_{i+1})\right) \tag{1}$$
[0056] Where $f_{inner}$ represents a 1×1 convolution operation for channel matching; $f_{upsample}$ represents a 2× upsampling operation for resolution matching; $f_{layer}$ represents a 3×3 convolution operation for feature processing; and $\alpha$ represents the fusion factor.
[0057] Typically, the fusion factor α is set to 1 based on the FPN detector. Since the FPN fuses features from levels P2, P3, P4, P5, and P6, there will be three different α values. These represent the fusion factors between two adjacent layers. Since P6 is generated by directly downsampling P5, there is no fusion factor between P5 and P6. During fusion, the fusion ratio between different feature layers can be adjusted by setting different α values. The fusion formula for different layers in FPN is:
[0058]
[0059]
[0060]
[0061]
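The per-layer formulas are not reproduced in this text. One plausible reading, obtained by instantiating equation (1) for levels P2 to P5, is sketched below. It rests on two assumptions: the top level P5 has no higher level to fuse with, and $P'_{i+1}$ denotes the merged feature of the level above before its 3×3 convolution. The result contains exactly three fusion factors, matching the count stated in the description.

```latex
\begin{aligned}
P_5 &= f_{layer_5}\bigl(f_{inner_5}(C_5)\bigr) \\
P_4 &= f_{layer_4}\bigl(f_{inner_4}(C_4) + \alpha_4^{5}\, f_{upsample}(P'_5)\bigr) \\
P_3 &= f_{layer_3}\bigl(f_{inner_3}(C_3) + \alpha_3^{4}\, f_{upsample}(P'_4)\bigr) \\
P_2 &= f_{layer_2}\bigl(f_{inner_2}(C_2) + \alpha_2^{3}\, f_{upsample}(P'_3)\bigr)
\end{aligned}
```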
[0062] A series of experiments revealed that adjusting the fusion factor can affect the performance of weak target detection. With the default α of 1 as the baseline, adjusting the value of α changes the average detection accuracy. It can be seen from Figure 8 that when α is 0.5, the experimental results are significantly improved. Therefore, the value of α is set to 0.5 in this scheme.
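The effect of the fusion factor on the merge step of equation (1) can be seen in a small numeric sketch. To keep the example short, the 1×1 and 3×3 convolutions are replaced by the identity, which is a simplification and not part of the method. The helper names `upsample2x` and `fuse` and the toy feature maps are illustrative.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D grid (list of rows)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def fuse(lateral, top, alpha):
    """Merge step of equation (1) with f_inner and f_layer taken as identity."""
    up = upsample2x(top)
    return [[l + alpha * u for l, u in zip(lrow, urow)]
            for lrow, urow in zip(lateral, up)]

c4 = [[1.0, 1.0, 1.0, 1.0] for _ in range(4)]   # lower-level, higher-resolution map
p5 = [[2.0, 4.0], [6.0, 8.0]]                   # top-level, lower-resolution map

print(fuse(c4, p5, alpha=1.0)[0])  # [3.0, 3.0, 5.0, 5.0]
print(fuse(c4, p5, alpha=0.5)[0])  # [2.0, 2.0, 3.0, 3.0]
```

With α = 0.5 the upsampled top-level feature contributes half as much to the fused map, so the lower-level feature that carries the fine detail of small defects has more relative weight.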
[0063] Integrating Focal loss into the loss function addresses the issue that in object detection algorithms, for each input image, numerous region proposals may be generated, but only a small portion of these contain the true target, leading to class imbalance. YOLOX is a one-stage method; compared to two-stage methods, it does not generate candidate boxes but directly classifies anchor boxes, resulting in faster speed but reduced accuracy.
[0064] The imbalance between positive and negative samples can severely reduce detection accuracy, and it is unavoidable: in practice, the number of intact cherry images will exceed the number of defective cherry images, and there will also be many samples with inconspicuous defects. This embodiment therefore introduces a cross-entropy-based loss function, Focal loss, to address the imbalance between positive and negative samples and improve the focus on hard training samples. The new loss function adjusts dynamically through an adjustable factor, automatically reducing the weight of easily classified samples and focusing primarily on difficult-to-classify samples.
[0065] The original loss is the direct sum of the cross entropy of each training sample, meaning that every sample carries the same weight:
[0066]
[0067] $p_t$ represents the probability that the predicted sample belongs to class 1. When $y_t = 1$, the t-th sample belongs to this class; when $y_t = 0$, the t-th sample does not belong to this class. Because every sample is weighted equally, sample imbalance skews the loss function. Therefore, the following function is used to improve the original loss function:
[0068]
[0069]
[0070] Where $\gamma$ represents an adjustable parameter, $\gamma \geq 0$; $(1-p_t)^{\gamma}$ represents the modulation coefficient, which reduces the weight of easily classified samples, allowing the model to focus more on difficult-to-classify samples during training.
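The loss formulas themselves are not reproduced in this text. The sketch below therefore uses the widely published form of Focal loss, $FL(p_t) = -(1-p_t)^{\gamma}\log(p_t)$, where $p_t$ is taken as the probability the model assigns to the sample's true class. Both that form and the value γ = 2 are assumptions for illustration; the patent does not state the γ it used.

```python
import math

def cross_entropy(p, y):
    """Binary cross entropy for one sample; p is the predicted probability of class 1."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

def focal_loss(p, y, gamma=2.0):
    """Cross entropy scaled by the modulation coefficient (1 - p_t) ** gamma."""
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# Compare an easily classified positive sample with a hard one.
for name, p in (("easy", 0.95), ("hard", 0.30)):
    ce, fl = cross_entropy(p, 1), focal_loss(p, 1)
    print(f"{name}: CE={ce:.4f} FL={fl:.4f} ratio={fl / ce:.4f}")
```

With γ = 2 the easy sample keeps a weight of (1 − 0.95)² = 0.0025 while the hard sample keeps (1 − 0.30)² = 0.49, so the many easy samples contribute far less to the total loss. Setting γ = 0 recovers plain cross entropy.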
[0071] To more comprehensively explore the impact of the experimental algorithm on detection accuracy and speed, experiments were conducted on a cherry defect dataset using various strategies. Table 1 shows that the improved cherry defect system enhances the detection results for all categories, achieving a significant improvement for each category, with an mAP of 97.59%, demonstrating a remarkable effect. This is mainly due to the improved network, which enhances the detection capability for inconspicuous features.
[0072] A fusion factor is proposed for the FPN detector to describe the coupling degree between adjacent layers in the feature pyramid. The top-down and lateral connection feature fusion mechanism helps the detector obtain better feature representation, while the hierarchical matching mechanism maps targets of different sizes to feature layers of different resolutions for learning. Each feature layer can then focus on learning targets whose size suits its resolution. Table 1 shows that after adding the fusion factor, the detection of lesions and decay improved significantly, with AP values increasing by 4.89% and 4.27%, respectively, and mAP increasing by 2.72%.
[0073] Building upon the aforementioned improvements, the cross-entropy loss function is further refined. This function reduces the weights of easily classified samples, allowing the model to focus more on difficult-to-classify samples during training, thus mitigating sample imbalance. While maintaining the original network's speed advantage, detection accuracy is further improved. It can be seen from Figure 9 that the greater the difference between the true value and the predicted value, the larger the loss; the smaller the loss, the faster the model's processing speed and the higher its accuracy. The loss curves of the model before and after the improvement are shown in Figure 10.
[0074] Table 1 Comparison of different improvement strategies on the test set
[0075]
[0076]
[0077] To verify the effectiveness and advancement of the proposed improved YOLOX model in cherry defect detection accuracy and efficiency, the model was compared, under consistent model parameters, with current high-performing object detection algorithms, namely YOLOv4, Faster R-CNN, and SSD. Results were evaluated on the test set. Table 2 shows that compared to Faster R-CNN, SSD, and YOLOv4, the proposed algorithm achieves higher detection accuracy in both AP and mAP values. In terms of detection speed, the proposed algorithm reaches 33.8 frames/s, a substantial improvement over the other three algorithms.
[0078] Table 2 Comparison of different algorithms on the Cherry test set
[0079]
[0080] In practical detection systems, the lower track is black. When the cherry color is dark, the target is similar to the surrounding background, leading to missed detections. Attention mechanisms focus on local information; as the task changes, the attention area often changes, thus effectively finding the most useful information. In the cherry grading detection network, an attention mechanism is added, allowing the network to focus on the size and color of the cherries. The experimental algorithm introduces a CBAM module to further enhance feature representation capabilities; the module structure is shown in Figure 11.
[0081] The output of the convolutional layer first passes through a channel attention module to obtain a weighted result, then passes through a spatial attention module for final weighting to obtain the final result. Its mathematical expression is:
[0082] $$F' = M_C(F) \otimes F, \qquad F'' = M_S(F') \otimes F'$$
[0083] Where $\otimes$ represents element-wise multiplication; $F$ represents the input feature map; $M_C(F)$ represents the output of the channel attention module; $M_S(F')$ represents the output of the spatial attention module; and $F''$ represents the feature map output by the CBAM attention mechanism module.
[0084] To compute channel attention efficiently, the spatial dimension of the input feature map is compressed. To aggregate spatial information, both average pooling and max pooling are used. The channel attention module structure is shown in Figure 12: the input feature map is first processed by average pooling and max pooling, and then fed into a multilayer perceptron (MLP) with shared weights. The MLP contains one hidden layer, which is equivalent to two fully connected layers. Finally, a sigmoid activation function is used to obtain the channel attention map. The mathematical expression is:
[0085] $$M_C(F) = \sigma\bigl(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\bigr) = \sigma\bigl(W_1(W_0(F_{avg})) + W_1(W_0(F_{max}))\bigr)$$
[0086] Where $\sigma$ represents the Sigmoid activation function; $F_{avg}$ and $F_{max}$ are the average-pooled and max-pooled features; and $W_0$ and $W_1$ represent the weights of the MLP, with $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$.
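The channel attention computation can be traced on a toy input. The sketch below assumes a ReLU activation in the hidden layer, as in the original CBAM design; the patent does not name the hidden activation. The weight values are arbitrary placeholders rather than learned parameters.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def channel_attention(F, W0, W1):
    """F: C feature maps (each a list of rows). W0: (C/r) x C, W1: C x (C/r).

    Returns one attention weight in (0, 1) per channel.
    """
    flat = [[v for row in ch for v in row] for ch in F]
    avg = [sum(ch) / len(ch) for ch in flat]      # global average pooling
    mx = [max(ch) for ch in flat]                 # global max pooling

    def mlp(v):
        hidden = [max(0.0, h) for h in matvec(W0, v)]   # ReLU hidden layer (assumed)
        return matvec(W1, hidden)

    # The same MLP (shared weights) processes both pooled vectors.
    return [sigmoid(a + m) for a, m in zip(mlp(avg), mlp(mx))]

# Toy input: C = 2 channels of size 2x2, reduction ratio r = 2.
F = [[[1.0, 2.0], [3.0, 4.0]],
     [[0.0, 0.0], [0.0, 1.0]]]
W0 = [[0.5, -0.5]]        # shape (C/r) x C = 1 x 2
W1 = [[1.0], [-1.0]]      # shape C x (C/r) = 2 x 1
weights = channel_attention(F, W0, W1)
print(weights)
```

Each returned weight multiplies every value of its channel, which corresponds to the step $F' = M_C(F) \otimes F$ in the CBAM formulation.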
[0087] Unlike channel attention, spatial attention focuses on where the meaningful features are located, and it complements channel attention. The spatial attention module structure is shown in Figure 13: first, max pooling and average pooling are performed along the channel axis at each feature point, and the two results are stacked to generate a feature map with 2 channels; then, a 7×7 convolution reduces the number of channels to 1; finally, a sigmoid activation function is applied to obtain the spatial attention map, the mathematical expression of which is:
[0088] $$M_S(F) = \sigma\bigl(f^{7\times7}([\mathrm{AvgPool}(F);\ \mathrm{MaxPool}(F)])\bigr)$$
[0089] Where 7×7 represents the size of the convolution kernel of $f^{7\times7}$.
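A small pure-Python sketch of the spatial attention step follows. It assumes zero padding so that the output keeps the input's height and width, and it uses a uniform kernel as a stand-in for learned weights. The stacking order (average first, then max) is also an assumption.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def spatial_attention(F, kernel, bias=0.0):
    """F: C feature maps of size H x W. kernel: [2][k][k] weights (k odd).

    Returns an H x W map of attention weights in (0, 1).
    """
    C, H, W = len(F), len(F[0]), len(F[0][0])
    k = len(kernel[0])
    pad = k // 2
    # Pool across channels at every position, giving a 2-channel map.
    avg = [[sum(F[c][y][x] for c in range(C)) / C for x in range(W)] for y in range(H)]
    mx = [[max(F[c][y][x] for c in range(C)) for x in range(W)] for y in range(H)]
    stacked = [avg, mx]
    out = []
    for y in range(H):
        row = []
        for x in range(W):
            acc = bias
            for ch in range(2):
                for i in range(k):
                    for j in range(k):
                        yy, xx = y + i - pad, x + j - pad
                        if 0 <= yy < H and 0 <= xx < W:      # zero padding
                            acc += kernel[ch][i][j] * stacked[ch][yy][xx]
            row.append(sigmoid(acc))
        out.append(row)
    return out

# Toy input: 2 channels of size 3x3; a uniform 7x7 kernel stands in for learned weights.
F = [[[0.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 0.0]],
     [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0], [1.0, 0.0, 1.0]]]
kernel = [[[0.01] * 7 for _ in range(7)] for _ in range(2)]
M_s = spatial_attention(F, kernel)
# Apply the map: every channel is scaled by the same per-position weight.
refined = [[[F[c][y][x] * M_s[y][x] for x in range(3)] for y in range(3)] for c in range(2)]
print(M_s[1][1], refined[0][1][1])
```

The last step multiplies every channel by the same spatial map, which corresponds to $F'' = M_S(F') \otimes F'$ when the input is the channel-refined feature.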
[0090] The attention mechanism is a plug-and-play module. Since placing it on the backbone would render the network's pre-trained weights unusable, we added the attention mechanism to the three effective feature layers extracted from the YOLOX backbone network, and also added it after the upsampling module.
[0091] By adding an attention mechanism module to the network, recalibrated features are obtained, emphasizing important features and suppressing unimportant ones. Table 3 shows that, while maintaining approximately the same grading detection speed, mAP is significantly improved. After the improvement of the cherry grading detection system, the AP values for all six cherry categories are improved, and the detection accuracy is relatively balanced across categories, with an mAP of 95.92%.
[0092] Table 3 Comparison of the system before and after improvement on the test set
[0093]
[0094] It can be seen from Figure 14 that the YOLOX loss gradually decreased to 0.4 after 30 iterations and finally stabilized at around 0.34. After introducing the CBAM module, the network loss value decreased and the convergence speed increased, eventually stabilizing at around 0.19, indicating that the improved algorithm proposed in the experiment achieved good results.
[0095] This invention proposes a cherry defect detection method based on an improved YOLOX by improving the multi-feature fusion module and loss function. It combines the attention mechanism CBAM to enhance the learning of key feature information, improve the accuracy of cherry size and maturity grading, and compares it with existing cherry grading algorithms to verify the feasibility of the experimental algorithm in cherry grading detection. The aim is to provide a theoretical basis and technical support for the later realization of automated cherry defect and grading detection.
[0096] For the defect detection network, a fusion factor configuration FPN was used to improve the detection capability of cherries with inconspicuous defects, and Focal loss was integrated into the loss function to improve the problem of sample imbalance between different classes. The average detection accuracy reached 97.59%, an improvement of 5.75% compared to the original network. For the cherry grading detection network, a fusion attention mechanism was adopted to guide the model's focus, and the average detection accuracy reached 95.92%, an improvement of 6.99% compared to the original network. Therefore, the accuracy of cherry defect and grading detection algorithms based on the YOLOX algorithm was significantly improved.
[0097] Obviously, the above embodiments are merely illustrative examples for clear explanation and are not intended to limit the implementation. Those skilled in the art will recognize that other variations or modifications can be made based on the above description. It is neither necessary nor possible to exhaustively list all possible implementations here. However, obvious variations or modifications derived therefrom are still within the scope of protection of this invention.
Claims
1. A method for detecting cherry defects and grading based on an improved YOLOX model, characterized in that the improved YOLOX model consists of three parts: a CSPDarknet network, an FPN feature pyramid network, and a YOLO Head network, wherein CSPDarknet serves as the backbone feature extraction network of YOLOX, the FPN feature pyramid network is used to solve the multi-scale problem in the detection process, and the YOLO Head network is used for feature point judgment; the method includes a first part performing cherry defect detection and a second part grading the cherries according to their size and color; the cherry defect detection of the first part specifically involves: using the improved YOLOX model to detect defective fruits, configuring a fusion factor for the FPN feature pyramid network, and integrating the cross-entropy loss function Focal Loss into the loss function; the feature fusion method of the FPN feature pyramid network is expressed by the following formula: $P_i = f_{layer_i}\left(f_{inner_i}(C_i) + \alpha_i^{i+1} \cdot f_{upsample}(P'_{i+1})\right)$; where $f_{inner}$ represents a 1×1 convolution operation for channel matching; $f_{upsample}$ represents a 2× upsampling operation for resolution matching; $f_{layer}$ represents a 3×3 convolution operation for feature processing; $\alpha$ represents the fusion factor between two adjacent layers, and its value is 0.5; the integration of the cross-entropy loss function Focal Loss into the loss function is expressed as follows: [formula]; [formula]; where $p_t$ represents the probability that the predicted sample belongs to class 1; $\gamma$ represents an adjustable parameter, $\gamma \geq 0$; $(1-p_t)^{\gamma}$ represents the modulation coefficient; the grading of cherries by size and color of the second part specifically involves: using the improved YOLOX model to grade intact cherries, and introducing a CBAM attention mechanism module; the CBAM attention mechanism module is added to the three effective feature layers extracted by the YOLOX backbone feature extraction network CSPDarknet, and the CBAM attention mechanism module is also added after the upsampling module of the FPN feature pyramid network; the CBAM attention mechanism module includes a channel attention module and a spatial attention module; the output of the convolutional layer first passes through the channel attention module to obtain a weighted result, and then passes through the spatial attention module for final weighting to obtain the final result; its mathematical expression is as follows: $F' = M_C(F) \otimes F$; $F'' = M_S(F') \otimes F'$; where $\otimes$ represents element-wise multiplication; $F$ represents the input feature map; $M_C(F)$ represents the output of the channel attention module; $M_S(F')$ represents the output of the spatial attention module; and $F''$ represents the feature map output by the CBAM attention mechanism module.
2. The cherry defect and grading detection method based on the improved YOLOX model according to claim 1, characterized in that the CSPDarknet network first resizes each input cherry image to 640×640, then extracts features from it using a Focus network structure, adjusts the number of channels using convolutional normalization and activation functions, and then extracts features through four Resblock body structures, with an SPP structure added in the fourth Resblock body structure.
3. The cherry defect and grading detection method based on the improved YOLOX model according to claim 1, characterized in that the channel attention module first processes the input feature map through average pooling and max pooling operations, then feeds it into a multilayer perceptron (MLP) with shared weights; the MLP contains one hidden layer, equivalent to two fully connected layers; finally, the channel attention map, i.e., the output of the channel attention module, is obtained through a sigmoid activation function; the expression is: $M_C(F) = \sigma\bigl(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\bigr) = \sigma\bigl(W_1(W_0(F_{avg})) + W_1(W_0(F_{max}))\bigr)$; where $\sigma$ represents the Sigmoid activation function; $W_0$ and $W_1$ represent the weights of the MLP, $W_0 \in \mathbb{R}^{C/r \times C}$, $W_1 \in \mathbb{R}^{C \times C/r}$.
4. The cherry defect and grading detection method based on the improved YOLOX model according to claim 3, characterized in that the spatial attention module first performs max pooling and average pooling along the channel axis at each feature point and stacks the two results to generate a feature map with 2 channels; then, it reduces the number of channels to 1 through a 7×7 convolution; finally, it applies a sigmoid activation function to obtain the spatial attention map, which is the output of the spatial attention module; the expression is: $M_S(F) = \sigma\bigl(f^{7\times7}([\mathrm{AvgPool}(F);\ \mathrm{MaxPool}(F)])\bigr)$; where 7×7 indicates the size of the convolution kernel.