X-ray security inspection image illegal article detection method based on improved YOLOv7

By adding the multidimensional efficient channel attention module (MECA) and the multi-scale feature aggregation module (MFA) to YOLOv7, and combining them with the precise bounding box regression loss EIoUer Loss, the method addresses the problem of blurred object features in X-ray security inspection images and achieves high-precision, efficient detection of prohibited items.

CN118397303B · Active · Publication Date: 2026-06-23 · CHONGQING UNIV OF POSTS & TELECOMM

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: CHONGQING UNIV OF POSTS & TELECOMM
Filing Date: 2024-02-05
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

Existing target detection models suffer from blurred object features in X-ray security inspection images, resulting in low detection accuracy, high false detection and false negative rates, and difficulty in achieving efficient automated security inspections.

Method used

A high-efficiency backbone network is constructed by combining the multidimensional efficient channel attention module MECA with the backbone network. A transition network is built between the backbone network and the neck network by the multi-scale feature aggregation module MFA. The precise bounding box regression loss EIoUer Loss is designed to replace CIoU Loss, and the MME-YOLO security inspection image prohibited item detection network model is constructed.

Benefits of technology

The model's detection accuracy and convergence speed were improved, while the false detection rate and false negative rate were reduced, resulting in more efficient detection of prohibited items.

✦ Generated by Eureka AI based on patent content.


Abstract

The present application relates to an X-ray security image contraband detection method based on improved YOLOv7, belonging to the field of target detection, comprising the following steps: S1: preprocessing the security data set and randomly dividing it into a training set and a validation set; S2: combining a multi-dimensional efficient channel attention module with a backbone network to construct an efficient backbone network; S3: constructing a transition network between the backbone network and the neck network through a multi-scale feature aggregation module; S4: designing a precise bounding box regression loss EIoUer Loss as the positioning loss of the model; S5: constructing an MME-YOLO security image contraband detection network model through an image preprocessing module, an efficient backbone network, a transition network, a neck network and a detection head; S6: training the MME-YOLO model; S7: verifying the MME-YOLO model.

Description

Technical Field

[0001] This invention belongs to the field of target detection and relates to a method for detecting prohibited items in X-ray security images based on an improved YOLOv7.

Background Technology

[0002] Due to the large passenger flow and semi-enclosed nature of public transportation, its safety has always been a major concern. When inspecting passengers' luggage, X-ray security screening equipment can generate images that show the internal structure and shape of objects. Security personnel can then manually inspect these images in real time to check for prohibited items.

[0003] However, with the advancement of technology and economic growth, China's transportation system has continued to expand and passenger flow has continued to increase. How to use security images to detect prohibited items automatically, economically, and efficiently while reducing the workload of security personnel has become a new challenge and requirement.

[0004] In today's era of rapid development in deep learning and artificial intelligence, fully automated, machine-assisted security checks are of great significance. However, in security check images the items are arranged in a disorderly manner and their features are unclear, which makes it very difficult for existing target detection models to automatically identify and locate prohibited items. As a result, the detection accuracy of current target detection models still needs to be improved, and the false detection rate and false negative rate also need to be reduced.

Summary of the Invention

[0005] In view of this, the purpose of the present invention is to provide a method for detecting prohibited items in X-ray security images based on the improved YOLOv7, thereby solving the problems existing in the prior art.

[0006] To achieve the above objectives, the present invention provides the following technical solution:

[0007] A method for detecting prohibited items in X-ray security images based on an improved YOLOv7 includes the following steps:

[0008] S1: Preprocess the security check dataset and randomly divide it into training and validation sets;

[0009] S2: Combine the Multidimensional High-Efficiency Channel Attention Module (MECA) with the backbone network to construct a high-efficiency backbone network;

[0010] S3: A transition network is constructed between the backbone network and the neck network using the multi-scale feature aggregation module (MFA).

[0011] S4: Design the precise bounding box regression loss EIoUer Loss to replace CIoU Loss as the model's localization loss;

[0012] S5: Construct the MME-YOLO security inspection image prohibited item detection network model through image preprocessing module, high-efficiency backbone network, transition network, neck network and detection head;

[0013] S6: Input the training set data into the MME-YOLO model and train it;

[0014] S7: Input the validation set data into the trained MME-YOLO model for validation to obtain the final detection results.

[0015] Furthermore, the preprocessing in step S1 includes removing images of the hammer category, which has only a small number of samples in the security inspection dataset; these images are not used for model training or validation.

[0016] Furthermore, the high-efficiency backbone network consists of a CBS module, an ELAN module, an MP module, and a multi-dimensional high-efficiency channel attention module (MECA).

[0017] The CBS module includes convolutional layers, batch normalization layers, and activation functions for extracting target features;

[0018] The ELAN module contains four branches: one CBS module in the first branch, one CBS module in the second branch, two CBS modules in the third branch, and two CBS modules in the fourth branch. Dense residual structures are used between different branches, and feature layers from different gradient paths are superimposed to enable the model to learn more information.

[0019] The MP module uses convolution and max pooling to preserve the target's feature information to the maximum extent during downsampling;

[0020] In the Multidimensional Efficient Channel Attention Module (MECA), the input feature map is first processed by global max pooling and global average pooling to obtain corresponding feature description vectors. Then, the obtained max-pooling and average-pooling feature description vectors are stacked dimensionally to maximize the preservation of effective feature information. Finally, one-dimensional convolution is used to complete information exchange between channels. The convolution kernel size is determined by an adaptive function, as shown below:

[0021]

[0022] Where k represents the kernel size, C represents the number of channels, b and γ are used to change the ratio between the number of channels and the kernel size, and odd indicates that the kernel size can only be an odd number.

[0023] Furthermore, the transition network consists of a CBS module, an SPPCSPC module, and a multi-scale feature aggregation module (MFA), and is positioned between the backbone network and the neck network to re-aggregate the feature maps output by the backbone network from both dimensional and spatial perspectives.

[0024] The SPPCSPC module uses spatial pyramid pooling to adapt the model to images of different resolutions, thereby increasing the receptive field.

[0025] The Multi-Scale Feature Aggregation Module (MFA) improves the network's attention to the feature information of contraband items from a multi-scale perspective by combining the Spatial Pyramid Pooling (SPP) module and the Convolutional Block Attention (CBAM) module. The SPP module has a variety of pooling kernels. After the feature map passes through pooling kernels of different sizes, the dimensions are stacked again and then passed through two convolutional layers of different scales to capture the global contextual prior information of the contraband items.

[0026] The convolutional block attention module (CBAM) is composed of a channel attention module (CAM) and a spatial attention module (SAM) connected in series. First, the input feature map is aggregated by global max pooling and global average pooling to learn the spatial information of the feature map and learn the importance of different channels. Then, a multilayer perceptron is used to capture the interdependence of feature information between multiple channels. Finally, an activation function is used to generate a channel attention map and it is combined with the input feature map to generate a channel attention feature map.

[0027] The channel attention feature map is input into the spatial attention module (SAM), and the average and maximum values are extracted along the channel dimension. This places the focus of the network on a local part of the feature map, making it easier for the network to pay attention to the feature information of contraband items.

[0028] Finally, the input feature map, the multi-scale feature map output by the Spatial Pyramid Pooling (SPP) module, and the attention feature map output by the Convolutional Block Attention (CBAM) module are combined, and the features in different dimensions are deeply fused through convolutional layers to form a multi-scale aggregated feature containing rich receptive field and feature information.

[0029] Furthermore, the construction of the precise bounding box regression loss EIoUer Loss first involves replacing the bounding box loss function CIoU Loss in YOLOv7 with EIoU Loss, calculating the width loss and height loss separately. Then, the EIoU Loss is improved by defining a width-height loss function that is only related to the width and height values of the predicted and ground truth boxes, and designing constraint functions to limit the values of the width and height losses to prevent loss explosion. Specifically, this is illustrated by the following formula:

[0030]

[0031]

[0032] Here, $b$ and $b^{gt}$ represent the center points of the predicted bounding box and the ground truth bounding box, respectively; $w$ and $w^{gt}$ represent the widths of the predicted bounding box and the ground truth bounding box, respectively; $h$ and $h^{gt}$ represent the heights of the predicted bounding box and the ground truth bounding box, respectively; $\rho$ represents the Euclidean distance between its two arguments; $c$ represents the diagonal length of the smallest enclosing rectangle containing the predicted bounding box and the ground truth bounding box; and $S(x)$ represents the constraint function.

[0033] Furthermore, the MME-YOLO security inspection image prohibited item detection network model consists of an image preprocessing module, a high-efficiency backbone network, a transition network, a neck network, and a detection head;

[0034] The image preprocessing module resizes all images in the security inspection dataset SIXray to 640×640, and uses Mosaic and Mixup as data augmentation strategies. After data augmentation, the training data is input into the efficient backbone network for feature extraction.

[0035] The efficient backbone network is responsible for capturing features at different levels from the input image and gradually reducing the resolution of the feature maps;

[0036] The transition network is responsible for re-aggregating the feature maps output by the backbone network, and outputting multi-scale aggregated features with rich receptive field and feature information.

[0037] The neck network first uses a feature pyramid network structure to progressively upsample from bottom to top and fuse the multi-scale aggregated features output by the transition network. Then, it uses a path aggregation network structure to progressively downsample from top to bottom and fuse the feature map output by the feature pyramid network. Finally, the feature map is divided into three different scales and sequentially fed into the detection head.

[0038] The detection head includes three detection heads designed for different target sizes, which respectively receive large-size features, medium-size features, and small-size features from the neck network, use the precise bounding box regression loss (EIoUer Loss) to measure the difference between the model's predictions and the ground truth bounding boxes, and finally output the detection results.

[0039] The beneficial effects of this invention are as follows:

[0040] First, this invention constructs an efficient backbone network by combining the multidimensional efficient channel attention (MECA) module with the backbone network, thereby enhancing feature extraction capabilities while maintaining computational efficiency.

[0041] Second, this invention constructs a transition network between the backbone network and the neck network through the multi-scale feature aggregation (MFA) module, thereby increasing the overall feature extraction capability of the model.

[0042] Third, this invention improves the model's convergence speed and accuracy by designing the precise bounding box regression loss EIoUer Loss to replace CIoU Loss as the model's localization loss, thereby enhancing the model's detection performance.

[0043] Fourth, the detection method of the present invention significantly improves the detection accuracy of prohibited items in X-ray security inspection images, and the predicted boxes fit the target objects more closely.

[0044] Other advantages, objectives, and features of the invention will be set forth in part in the description which follows, and in part will be apparent to those skilled in the art from the following examination, or may be learned from practice of the invention. The objectives and other advantages of the invention can be realized and obtained through the following description.

Attached Figure Description

[0045] To make the objectives, technical solutions, and advantages of the present invention clearer, the preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings, wherein:

[0046] Figure 1 is a flowchart of the method in this invention;

[0047] Figure 2 is a network structure diagram of the Multidimensional Efficient Channel Attention (MECA) module in this invention;

[0048] Figure 3 is a network structure diagram of the Multi-Scale Feature Aggregation (MFA) module in this invention;

[0049] Figure 4 is a structural diagram of the MME-YOLO network model in this invention;

[0050] Figure 5 is a comparison chart of the detection performance of the YOLOv7 network model and the MME-YOLO network model.

Detailed Implementation

[0051] The following specific examples illustrate the implementation of the present invention. Those skilled in the art can easily understand other advantages and effects of the present invention from the content disclosed in this specification. The present invention can also be implemented or applied through other different specific embodiments, and various details in this specification can be modified or changed based on different viewpoints and applications without departing from the spirit of the present invention. It should be noted that the illustrations provided in the following embodiments are only schematic representations of the basic concept of the present invention. Unless otherwise specified, the following embodiments and features can be combined with each other.

[0052] The accompanying drawings are for illustrative purposes only and are schematic diagrams, not actual pictures. They should not be construed as limiting the invention. To better illustrate the embodiments of the invention, some parts in the drawings may be omitted, enlarged, or reduced, and do not represent the actual product dimensions. It is understandable to those skilled in the art that some well-known structures and their descriptions may be omitted in the drawings.

[0053] In the accompanying drawings of the embodiments of the present invention, the same or similar reference numerals correspond to the same or similar components. In the description of the present invention, it should be understood that if terms such as "upper," "lower," "left," "right," "front," and "rear" indicate the orientation or positional relationship based on the orientation or positional relationship shown in the drawings, they are only for the convenience of describing the present invention and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, the terms used to describe positional relationships in the drawings are only for illustrative purposes and should not be construed as limiting the present invention. For those skilled in the art, the specific meaning of the above terms can be understood according to the specific circumstances.

[0054] Referring to Figures 1-5, the invention is a method for detecting prohibited items in X-ray security images based on an improved YOLOv7.

[0055] As shown in Figure 1, the method includes the following steps:

[0056] Step 1: Download the public X-ray security inspection dataset SIXray, remove hammer types with very few samples, and then randomly divide the SIXray dataset into training and validation sets.

[0057] In this step, the adjusted SIXray dataset contains 8909 X-ray images and corresponding label information, covering five categories of prohibited items: guns, knives, scissors, wrenches, and pliers. The adjusted SIXray dataset is randomly divided into training and validation sets in a 9:1 ratio. The training set, consisting of 90% of the adjusted SIXray dataset, will be used to train the MME-YOLO security image prohibited item detection network model. The validation set, consisting of 10% of the adjusted SIXray dataset, will be used to validate the MME-YOLO security image prohibited item detection network model, and the validation results will be used as the model's performance metric.
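The filtering and 9:1 random split described in this step can be sketched as follows. This is a minimal illustration, not the patent's code: the `(image_id, classes)` sample layout, the function name `split_dataset`, the fixed seed, and the toy hammer count are all assumptions, and whether whole images or only the hammer labels are dropped is not fully specified in the text.

```python
import random

def split_dataset(samples, excluded_classes=("hammer",), train_ratio=0.9, seed=0):
    """Drop samples containing an excluded class, then split randomly.

    samples: list of (image_id, set_of_class_names) pairs (hypothetical layout).
    """
    excluded = set(excluded_classes)
    kept = [s for s in samples if not (set(s[1]) & excluded)]
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    rng.shuffle(kept)
    n_train = int(len(kept) * train_ratio)
    return kept[:n_train], kept[n_train:]

# Toy data: 8909 retained images (the count given in the text) plus an
# arbitrary number of hammer images that get filtered out.
samples = [(f"img_{i:04d}", {"gun"}) for i in range(8909)]
samples += [(f"hammer_{i:02d}", {"hammer"}) for i in range(60)]

train_set, val_set = split_dataset(samples)
print(len(train_set), len(val_set))    # 8018 891
```

With 8909 images, a 9:1 split gives 8018 training images and 891 validation images under this rounding rule.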

[0058] Step 2: Combine the Multidimensional Efficient Channel Attention (MECA) module with the backbone network to construct an efficient backbone network.

[0059] In this step, referring to Figure 2, the efficient backbone network consists of a CBS module, an ELAN module, an MP module, and a multi-dimensional efficient channel attention (MECA) module, achieving better feature extraction performance. The CBS module incorporates convolutional layers, batch normalization layers, and activation functions, and is primarily used for extracting target features. The ELAN module contains four branches: one CBS module in the first branch, one in the second branch, two in the third branch, and two in the fourth branch. Dense residual structures are used between different branches, stacking feature layers from different gradient paths to enable the model to learn more information. The MP module uses convolution and max pooling to retain as much target feature information as possible during downsampling. The MECA module is an improvement upon the efficient channel attention (ECA) module. It first applies global max pooling and global average pooling to the input feature map to obtain the corresponding feature description vectors, then stacks the max-pooled and average-pooled feature description vectors dimensionally to preserve as much effective feature information as possible. One-dimensional convolution is then used to exchange information across channels. The kernel size is determined by an adaptive function, allowing longer feature description vectors to interact across more channels. The kernel adaptive function is shown below:

[0060]

[0061] Where k represents the kernel size, C represents the number of channels, b and γ are used to change the ratio between the number of channels and the kernel size, and odd indicates that the kernel size can only be an odd number.
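The adaptive function itself is not reproduced in this text. As a point of reference, the original ECA module, which MECA is said to improve upon, selects the kernel size as k = |log2(C)/γ + b/γ| rounded to an odd number. The sketch below assumes MECA keeps that form, which the text does not confirm; the defaults γ = 2 and b = 1 are the values commonly used with ECA and are likewise an assumption, as is the rule of bumping an even result up to the next odd number.

```python
import math

def adaptive_kernel_size(channels, gamma=2, b=1):
    """ECA-style kernel size (assumed form, not confirmed for MECA)."""
    t = int(abs((math.log2(channels) + b) / gamma))
    # The kernel size must be odd; an even value is raised to the next odd one.
    return t if t % 2 == 1 else t + 1

for c in (64, 128, 256, 512, 1024):
    print(c, adaptive_kernel_size(c))
```

Under these assumptions a 64-channel feature map gets a kernel of size 3 and wider maps (128 to 1024 channels) get a kernel of size 5, so the interaction range grows slowly with the channel count.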

[0062] Step 3: Construct a transition network between the backbone network and the neck network using the multi-scale feature aggregation (MFA) module.

[0063] In this step, referring to Figure 3, the transition network is composed of a CBS module, an SPPCSPC module, and a multi-scale feature aggregation (MFA) module, and is positioned between the backbone network and the neck network. Its purpose is to re-aggregate the feature maps output by the backbone network from both dimensional and spatial perspectives, providing the neck network with multi-scale aggregated features rich in receptive field and feature information, thereby increasing the overall feature extraction capability of the model. The SPPCSPC module uses spatial pyramid pooling to adapt the model to images of different resolutions, increasing the receptive field. The MFA module, by combining the spatial pyramid pooling (SPP) module and the convolutional block attention (CBAM) module, enhances the network's attention to contraband feature information from a multi-scale perspective. Specifically, the SPP module uses pooling kernels of several sizes; after the feature maps pass through the pooling kernels of different sizes, their dimensions are re-stacked, and they are then passed sequentially through two convolutional layers of different scales to better capture the global contextual prior information of the contraband. The Convolutional Block Attention Module (CBAM) is composed of a Channel Attention Module (CAM) and a Spatial Attention Module (SAM) connected in series. First, the input feature map is aggregated using global max pooling and global average pooling to gather spatial information of the feature map and learn the importance of different channels. Then, a multilayer perceptron captures the interdependencies of feature information between multiple channels. Finally, an activation function generates a channel attention map, which is combined with the input feature map to generate a channel attention feature map. The channel attention feature map is then input into the spatial attention module (SAM), where the average and maximum values are extracted along the channel dimension, placing the network's focus on a specific local area of the feature map and making the features of contraband items easier for the network to notice. Finally, the input feature map, the multi-scale feature map output by the SPP module, and the attention feature map output by the CBAM module are combined, and features from different dimensions are deeply fused through convolutional layers, ultimately forming a multi-scale aggregated feature map containing rich receptive field and feature information.
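The pooling-and-stacking part of the SPP module described above can be illustrated with a small pure-Python sketch on a single-channel map. The kernel sizes 5, 9 and 13 are the values commonly used in YOLO-style SPP blocks and are an assumption here, since the text does not list them; the function names are hypothetical, and the two convolutional layers that follow the stacking are omitted.

```python
def max_pool_same(fmap, k):
    """Stride-1 max pooling; windows are clipped at the border, so the
    output has the same height and width as the input."""
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            window = [fmap[y][x]
                      for y in range(max(0, i - r), min(h, i + r + 1))
                      for x in range(max(0, j - r), min(w, j + r + 1))]
            row.append(max(window))
        out.append(row)
    return out

def spp(fmap, kernel_sizes=(5, 9, 13)):
    """Stack the input with one pooled copy per kernel size (channel axis)."""
    return [fmap] + [max_pool_same(fmap, k) for k in kernel_sizes]

# Toy 6x6 map with a single activation at row 2, column 3.
fmap = [[0.0] * 6 for _ in range(6)]
fmap[2][3] = 1.0
stacked = spp(fmap)
print(len(stacked))  # 4 channels: the input plus three pooled maps
```

Larger kernels spread the single activation over a wider neighbourhood, which is how the stacked output mixes several receptive-field sizes at the same spatial resolution.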

[0064] Step 4: Design the precise bounding box regression loss EIoUer Loss to replace CIoU Loss as the model's localization loss.

[0065] In this step, the construction of the precise bounding box regression loss EIoUer Loss first replaces the bounding box loss function CIoU Loss in YOLOv7 with EIoU Loss, which calculates the width loss and height loss separately. This addresses the issue that the width and height loss term in CIoU Loss loses its constraining effect when the predicted and ground truth boxes have the same aspect ratio. EIoU Loss is then improved in turn: when the aspect ratios of the predicted and ground truth boxes remain constant, the width and height loss in EIoU Loss gradually increases as the predicted box approaches the ground truth box, producing erratic loss values that hinder model convergence. A width and height loss function is therefore defined that is only related to the width and height values of the predicted and ground truth boxes, and constraint functions are designed to limit the values of the width and height losses to prevent loss explosion. Compared to CIoU Loss and EIoU Loss, EIoUer Loss has better convergence speed and accuracy, as detailed below:

[0066]

[0067]

[0068] Here, $b$ and $b^{gt}$ represent the center points of the predicted bounding box and the ground truth bounding box, respectively; $w$ and $w^{gt}$ represent the widths of the predicted bounding box and the ground truth bounding box, respectively; $h$ and $h^{gt}$ represent the heights of the predicted bounding box and the ground truth bounding box, respectively; $\rho$ represents the Euclidean distance between its two arguments; $c$ represents the diagonal length of the smallest enclosing rectangle containing the predicted bounding box and the ground truth bounding box; and $S(x)$ represents the constraint function.
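The EIoUer formulas are not reproduced in this text, so they are not reconstructed here. For orientation only, the sketch below computes the standard EIoU Loss that this step starts from: 1 − IoU, plus the squared center distance over the squared enclosing-box diagonal, plus separate width and height terms normalized by the enclosing box's width and height. The `(x1, y1, x2, y2)` box format, the function name, and the `eps` stabilizer are assumptions; the patent's modified width-height term and constraint function S(x) are not modelled.

```python
def eiou_loss(pred, gt, eps=1e-9):
    """Standard EIoU loss for two boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Intersection over union.
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    iou = inter / (pw * ph + gw * gh - inter + eps)

    # Smallest enclosing rectangle: width, height, squared diagonal c^2.
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2

    # Squared Euclidean distance between the two center points.
    rho2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 \
         + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2

    return ((1.0 - iou)
            + rho2 / (c2 + eps)
            + (pw - gw) ** 2 / (cw ** 2 + eps)
            + (ph - gh) ** 2 / (ch ** 2 + eps))

print(eiou_loss((10, 10, 50, 50), (10, 10, 50, 50)))   # close to 0 for identical boxes
print(eiou_loss((10, 10, 50, 50), (20, 15, 70, 45)))   # positive for mismatched boxes
```

Because the width and height terms are computed separately, two boxes with the same aspect ratio but different sizes still receive a non-zero penalty, which is the property the text contrasts with CIoU Loss.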

[0069] Step 5: Construct the MME-YOLO security inspection image prohibited item detection network model through the image preprocessing module, high-efficiency backbone network, transition network, neck network, and detection head.

[0070] In this step, referring to Figure 4, the MME-YOLO security image prohibited item detection network model consists of an image preprocessing module, an efficient backbone network, a transition network, a neck network, and a detection head. The image preprocessing module resizes all images in the SIXray security dataset to 640×640 pixels and uses Mosaic and Mixup as data augmentation strategies. After data augmentation, the training data is input into the efficient backbone network for feature extraction. The efficient backbone network is responsible for capturing features at different levels from the input images and gradually reducing the resolution of the feature maps. The transition network is responsible for re-aggregating the feature maps output by the backbone network and outputting multi-scale aggregated features rich in receptive field and feature information. The neck network first uses a feature pyramid network structure to progressively upsample from bottom to top and fuse the multi-scale aggregated features output by the transition network. Then, it uses a path aggregation network structure to progressively downsample from top to bottom and fuse the feature map output by the feature pyramid network. Finally, the feature map is divided into three different scales and fed into the detection head in sequence. The detection head contains three detection heads designed for different target sizes, which respectively receive large-size features, medium-size features, and small-size features from the neck network, use the precise bounding box regression loss (EIoUer Loss) to measure the difference between the model's predictions and the ground truth bounding boxes, and finally output the detection results.
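To make the three detection scales concrete, the short sketch below computes the feature-map grid size each head would see for a 640×640 input. The strides 8, 16 and 32 are the usual YOLOv7 downsampling factors and are an assumption here, since the text does not state them; the function name is hypothetical.

```python
def grid_sizes(input_size=640, strides=(8, 16, 32)):
    """Map each assumed stride to the side length of its feature-map grid."""
    return {stride: input_size // stride for stride in strides}

print(grid_sizes())  # {8: 80, 16: 40, 32: 20}
```

Under this assumption the three heads operate on 80×80, 40×40 and 20×20 grids; finer grids are typically better suited to small objects and coarser grids to large ones.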

[0071] Step 6: Use the training set image data obtained in Step 1 to train the MME-YOLO security inspection image prohibited item detection network model obtained in Step 5.

[0072] In this step, the MME-YOLO security image contraband detection network model is trained on a Core™ i9-10920X processor and a GeForce RTX 3090 graphics card with the torch 1.8.0 deep learning framework. The experiment employs a transfer learning strategy, starting from weights pre-trained on the Pascal VOC2012 dataset. Training images are resized to a fixed resolution of 640×640. Training runs for 200 epochs with a batch size of 16 and a maximum learning rate of 0.01, using the stochastic gradient descent (SGD) algorithm for optimization and cosine annealing to decay the learning rate. This yields the trained weights of the MME-YOLO security image contraband detection network model.
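The cosine annealing schedule mentioned above can be sketched in a few lines. The maximum learning rate 0.01 and the 200 epochs come from the text; the minimum learning rate `lr_min` is a hypothetical value, since the text does not give one, and any warm-up phase is not modelled.

```python
import math

def cosine_lr(epoch, total_epochs=200, lr_max=0.01, lr_min=1e-4):
    """Cosine-annealed learning rate; lr_min is an assumed floor."""
    cos_term = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cos_term)

for epoch in (0, 50, 100, 150, 200):
    print(epoch, round(cosine_lr(epoch), 6))
```

The rate starts at the maximum, passes through the midpoint between maximum and minimum at epoch 100, and reaches the floor at epoch 200, decaying slowly at both ends and fastest in the middle.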

[0073] Step 7: Use the validation set image data obtained in Step 1 to validate the trained MME-YOLO security inspection image prohibited item detection network model obtained in Step 6, and obtain the final detection results.

[0074] In this step, to comprehensively evaluate the performance of the MME-YOLO security image prohibited item detection network model in automatically detecting prohibited items, Average Precision (AP), mean Average Precision (mAP), and frames per second (FPS) are selected as evaluation metrics. AP, calculated from precision and recall under a fixed IoU threshold, reflects the model's detection accuracy for a single category. mAP is obtained by averaging the AP values across all categories, reflecting the model's detection accuracy across all categories. FPS refers to the number of images that can be detected per second, reflecting the model's detection speed. Precision is the proportion of correctly predicted positive samples out of all samples predicted as positive, and recall is the proportion of correctly predicted positive samples out of all actual positive samples. The formulas for both are as follows:

[0075] $$\mathrm{Precision} = \frac{TP}{TP + FP}$$

[0076] $$\mathrm{Recall} = \frac{TP}{TP + FN}$$

[0077] Where TP is the number of samples that the classifier predicts as positive and which are actually positive, i.e., the number of positive samples correctly identified. FP is the number of samples that the classifier predicts as positive but are actually negative, i.e., the number of falsely detected negative samples. FN is the number of samples that the classifier predicts as negative but are actually positive, i.e., the number of missed positive samples. The formulas for calculating AP and mAP are as follows:

[0078] $$AP = \int_{0}^{1} P(R)\,dR$$

[0079] $$mAP = \frac{1}{n}\sum_{i=1}^{n} AP(i)$$

[0080] Where P represents precision, R represents recall, n represents the total number of categories, and AP(i) represents the AP value of the i-th category of prohibited items.
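The metrics defined in this step can be computed as in the following sketch. It is a simplified illustration: AP is approximated by a rectangle rule over a handful of (recall, precision) points, whereas practical evaluators usually interpolate the precision-recall curve, and the text does not say which variant was used. The function names and the toy numbers are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(pr_points):
    """Rectangle-rule area under a precision-recall curve.

    pr_points: iterable of (recall, precision) pairs.
    """
    ap, prev_recall = 0.0, 0.0
    for recall, precision in sorted(pr_points):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

def mean_average_precision(ap_per_class):
    """Average the per-category AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)

print(precision_recall(tp=8, fp=2, fn=2))               # (0.8, 0.8)
print(average_precision([(0.5, 1.0), (1.0, 0.5)]))      # 0.75
print(mean_average_precision({"gun": 0.9, "knife": 0.8}))
```

In a detection setting, a prediction counts as a true positive only if its IoU with a ground truth box of the same category reaches the chosen threshold, so the same detections can yield different AP values at different thresholds.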

[0081] In this step, the IoU threshold was set to 0.5. Ablation experiments were used to study the effects of the original backbone network, the high-efficiency channel attention (ECA) backbone network, and the multidimensional high-efficiency channel attention (MECA) backbone network on detection accuracy and detection speed. The experimental results are shown in Table 1, which compares the multidimensional high-efficiency channel attention (MECA) module with other methods.

[0082] Table 1

[0083]

[0084]

[0085] As shown in Table 1, compared to YOLOv7, adding the efficient channel attention (ECA) module to the backbone network adds only a small computational cost and increases mAP50 by 0.6% while keeping FPS at 61.5 frames/s. Furthermore, using the improved multidimensional efficient channel attention (MECA) module increases the model's mAP50 by 0.9% while keeping FPS at 61.1 frames/s. This verifies the effectiveness of the backbone network improvement strategy.

[0086] In this step, the IoU threshold was set to 0.5, and ablation experiments were used to study the effects of localization losses CIoU Loss, EIoULoss, and EIoUer Loss on detection accuracy and detection speed. The experimental results are shown in Table 2, which compares EIoUer Loss with other loss functions.

[0087] Table 2

[0088]

[0089] As shown in Table 2, replacing CIoU Loss with EIoU Loss as the model's localization loss improved the model's mAP50 by 0.5%; while using the improved EIoUer Loss improved the model's mAP50 by 0.7%, reaching 91.2%. This verifies the effectiveness of the loss function improvement strategy.

[0090] In this step, the IoU threshold was set to 0.5. Ablation experiments were conducted to study the impact of adding the multidimensional efficient channel attention (MECA) module, the multi-scale feature aggregation (MFA) module, and the precise bounding box regression loss (EIoUer Loss) on detection performance. The experimental results are shown in Table 3, which reports the ablation experiments of the MME-YOLO model on SIXray.

[0091] Table 3

[0092]

[0093] As shown in Table 3, compared with the YOLOv7 model, adding the multidimensional efficient channel attention (MECA) module to the backbone network improved the model's mAP50 by 0.9%. This is because the MECA module increases the attention paid to effective features, enhancing the backbone network's feature extraction capability. Using the multi-scale feature aggregation (MFA) module in the transition network improved the model's mAP50 by 0.9% while reducing FPS by 11.7 frames / s. This is because the MFA module outputs multi-scale aggregated features containing rich receptive field and feature information, improving the network's attention to contraband features from a multi-scale perspective; however, the increased number of parameters lowers the model's detection speed. Using the precise bounding box regression loss (EIoUer Loss) improved the model's mAP50 by 0.7%. This is because EIoUer Loss speeds up the convergence of the predicted boxes and improves their accuracy, improving the model's detection performance. The final improved model, MME-YOLO, achieved a 1.3% improvement in mAP50 over the YOLOv7 model, reaching 91.8%, with an FPS of 48.3 frames / s.

[0094] In this step, the IoU threshold was set to 0.5. A comparative experiment was conducted between the MME-YOLO model and five classic mainstream object detection models, namely Faster R-CNN, SSD, RetinaNet, YOLOv3, and YOLOv7. The experimental results are shown in Table 4, which compares the detection performance of the MME-YOLO model with that of the other network models.

[0095] Table 4

[0096]
Model           mAP50 (%)    FPS (frames / s)
SSD             79.7         62.2
YOLOv3          86.9         58.0
RetinaNet       87.4         24.7
Faster R-CNN    88.6         30.9
YOLOv7          90.5         62.3
MME-YOLO        91.8         48.3

[0097] Table 4 shows the performance comparison of MME-YOLO with other networks on the SIXray dataset. As can be seen from Table 4, MME-YOLO outperforms RetinaNet and Faster R-CNN in both detection accuracy and speed. Compared with SSD, YOLOv3, and YOLOv7, MME-YOLO's detection speed is lower, but 48.3 frames / s is sufficient for real-time detection tasks in practical applications, and MME-YOLO achieves the highest detection accuracy for contraband in X-ray security images. This verifies the effectiveness of the improvement strategy.
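To make the speed comparison concrete, the FPS values in Table 4 can be converted to per-frame latency. The sketch below uses only the figures reported in Table 4; the 30 frames / s real-time threshold is an assumption (a common rule of thumb), not a value stated in this document.

```python
# (mAP50 in %, FPS in frames/s), copied from Table 4.
TABLE_4 = {
    "SSD": (79.7, 62.2),
    "YOLOv3": (86.9, 58.0),
    "RetinaNet": (87.4, 24.7),
    "Faster R-CNN": (88.6, 30.9),
    "YOLOv7": (90.5, 62.3),
    "MME-YOLO": (91.8, 48.3),
}

REAL_TIME_FPS = 30.0  # assumed threshold, not taken from the patent


def latency_ms(fps):
    """Average processing time per frame in milliseconds."""
    return 1000.0 / fps


def is_real_time(fps, threshold=REAL_TIME_FPS):
    """True if the model is at least as fast as the assumed threshold."""
    return fps >= threshold


for name, (map50, fps) in TABLE_4.items():
    print(f"{name:<13} mAP50={map50:5.1f}%  "
          f"{latency_ms(fps):5.1f} ms/frame  real-time={is_real_time(fps)}")
```

Under this assumed threshold, RetinaNet is the only model in the table that falls short, and MME-YOLO spends roughly 4.7 ms more per frame than YOLOv7 in exchange for its higher mAP50.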

[0098] In this step, referring to Figure 5, the results show that the detection method of the present invention significantly improves the detection accuracy of prohibited items in X-ray security inspection images: the predicted boxes are closer to the target items, and both the false detection rate and the missed detection rate are reduced.

[0099] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the present invention, and all such modifications or substitutions should be covered within the scope of the claims of the present invention.

Claims

1. A method for detecting prohibited items in X-ray security images based on an improved YOLOv7, characterized in that it includes the following steps:
S1: Preprocess the security inspection dataset and randomly divide it into a training set and a validation set;
S2: Combine the multidimensional efficient channel attention module (MECA) with the backbone network to construct a high-efficiency backbone network;
S3: Construct a transition network between the backbone network and the neck network using the multi-scale feature aggregation module (MFA);
S4: Design the precise bounding box regression loss EIoUer Loss to replace CIoU Loss as the model's localization loss;
S5: Construct the MME-YOLO security inspection image prohibited item detection network model from an image preprocessing module, the high-efficiency backbone network, the transition network, a neck network and a detection head;
S6: Input the training set data into the MME-YOLO model and train it;
S7: Input the validation set data into the trained MME-YOLO model for validation to obtain the final detection results;
The transition network consists of a CBS module, an SPPCSPC module, and the multi-scale feature aggregation module (MFA), and is positioned between the backbone network and the neck network. It is used to re-aggregate the feature maps output by the backbone network from the perspectives of dimension and space. The SPPCSPC module uses spatial pyramid pooling to adapt the model to images of different resolutions, thereby increasing the receptive field. The multi-scale feature aggregation module (MFA) improves the network's attention to the feature information of contraband items from a multi-scale perspective by combining the spatial pyramid pooling (SPP) module and the convolutional block attention module (CBAM). The SPP module has pooling kernels of several sizes. After the feature map passes through the pooling kernels of different sizes, the results are stacked dimensionally and then passed through two convolutional layers of different scales to capture the global contextual prior information of the contraband items. The convolutional block attention module (CBAM) is composed of a channel attention module (CAM) and a spatial attention module (SAM) connected in series. First, the input feature map is aggregated by global max pooling and global average pooling to learn the spatial information of the feature map and the importance of different channels. Then, a multilayer perceptron is used to capture the interdependence of feature information between multiple channels. Finally, an activation function is used to generate a channel attention map, which is combined with the input feature map to generate a channel attention feature map. The channel attention feature map is input into the spatial attention module (SAM), and the average and maximum values are extracted from each channel. This places the focus of the network on a local part of the feature map, making it easier for the network to pay attention to the feature information of contraband items. Finally, the input feature map, the multi-scale feature map output by the spatial pyramid pooling (SPP) module, and the attention feature map output by the convolutional block attention module (CBAM) are combined, and the features in different dimensions are deeply fused through convolutional layers to form a multi-scale aggregated feature containing rich receptive field and feature information.
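The multi-kernel pooling and stacking step of the SPP branch can be sketched in plain Python on a single-channel feature map. This is only an illustration: the helper names are hypothetical, and the kernel sizes 5, 9 and 13 are an assumption borrowed from common YOLO configurations, since the claim only states that the kernels have different sizes. The convolutional layers and the CBAM branch are not shown.

```python
def max_pool_same(fmap, k):
    """Stride-1 max pooling with a k x k window.

    Windows are clipped at the borders, so the output keeps the
    height and width of the input feature map.
    """
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            window = [fmap[y][x]
                      for y in range(max(0, i - r), min(h, i + r + 1))
                      for x in range(max(0, j - r), min(w, j + r + 1))]
            row.append(max(window))
        out.append(row)
    return out


def spp_stack(fmap, kernel_sizes=(5, 9, 13)):
    """Return the input map plus one pooled copy per kernel size.

    The list index plays the role of the stacking dimension.
    """
    return [fmap] + [max_pool_same(fmap, k) for k in kernel_sizes]
```

Because every pooled copy has the same spatial size as the input, the copies can be stacked directly; larger kernels summarize a wider neighbourhood, which is how the branch gathers context at several scales.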

2. The method for detecting prohibited items in X-ray security images based on improved YOLOv7 according to claim 1, characterized in that: The preprocessing described in step S1 includes removing the images of the hammer category, which has only a small number of samples in the security inspection dataset, so that they are not used for model training and validation.

3. The method for detecting prohibited items in X-ray security images based on improved YOLOv7 according to claim 1, characterized in that: The high-efficiency backbone network consists of a CBS module, an ELAN module, an MP module, and a multidimensional efficient channel attention module (MECA). The CBS module includes convolutional layers, batch normalization layers, and activation functions for extracting target features; The ELAN module contains four branches: one CBS module in the first branch, one CBS module in the second branch, two CBS modules in the third branch, and two CBS modules in the fourth branch. Dense residual structures are used between different branches, and feature layers from different gradient paths are superimposed to enable the model to learn more information. The MP module uses convolution and max pooling to preserve the target's feature information to the maximum extent during downsampling; In the multidimensional efficient channel attention module (MECA), the input feature map is first processed by global max pooling and global average pooling to obtain the corresponding feature description vectors. Then, the obtained max-pooling and average-pooling feature description vectors are stacked dimensionally to maximize the preservation of effective feature information. Finally, one-dimensional convolution is used to complete information exchange between channels. The convolution kernel size is determined by an adaptive function, as shown below:
$$k = \psi(C) = \left| \frac{\log_2 C}{\gamma} + \frac{b}{\gamma} \right|_{odd}$$
where $k$ indicates the kernel size, $C$ indicates the number of channels, $\gamma$ and $b$ are used to change the ratio between the number of channels and the convolution kernel size, and $|\cdot|_{odd}$ indicates that the kernel size can only be an odd number.
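The claim describes a one-dimensional convolution whose kernel size adapts to the number of channels and is restricted to odd values. A minimal sketch, assuming the mapping used by the original ECA module, k = |log2(C)/γ + b/γ|_odd, and assuming ECA's default values γ = 2 and b = 1 (neither default is stated in this claim; the function name is hypothetical):

```python
import math


def adaptive_kernel_size(channels, gamma=2, b=1):
    """Odd 1-D convolution kernel size for a given channel count.

    Assumption: ECA-style mapping k = |log2(C)/gamma + b/gamma|_odd,
    truncated to an integer and bumped to the next odd number if even.
    """
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 == 1 else t + 1


for c in (64, 128, 256, 512, 1024):
    print(c, adaptive_kernel_size(c))
```

Under these assumed defaults the kernel grows slowly with the channel count (3 for 64 channels, 5 for 128 to 1024 channels), so wider layers exchange information across slightly more neighbouring channels without adding many parameters.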

4. The method for detecting prohibited items in X-ray security images based on improved YOLOv7 according to claim 1, characterized in that: The construction of the precise bounding box regression loss EIoUer Loss first involves replacing the bounding box loss function CIoU Loss in YOLOv7 with EIoU Loss, calculating the width loss and height loss separately. Then, the EIoU Loss is improved by defining a width-height loss function that is related only to the width and height values of the predicted and ground truth boxes, and by designing constraint functions that limit the values of the width and height losses to prevent loss explosion. Specifically, the loss is given by a formula whose symbols represent, in order: the center points of the predicted bounding box and the ground truth bounding box; the widths of the predicted bounding box and the ground truth bounding box; the heights of the predicted bounding box and the ground truth bounding box; the Euclidean distance between two of the parameters; the diagonal length of the smallest enclosing rectangle containing both the predicted and ground truth boxes; and the constraint function.
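The starting point named in the claim, EIoU Loss, can be sketched for a single pair of boxes as follows. This is the published EIoU formulation (IoU term, normalized center distance, and separate width and height terms normalized by the enclosing box), not the EIoUer Loss of the invention: the modified width-height term and its constraint function are not reproduced in this text and are therefore not implemented. The function name and the (x1, y1, x2, y2) box format are assumptions.

```python
def eiou_loss(pred, gt, eps=1e-9):
    """EIoU loss for two axis-aligned boxes given as (x1, y1, x2, y2).

    Baseline only: the EIoUer constraint function is NOT included.
    """
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Intersection over union.
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = pw * ph + gw * gh - inter
    iou = inter / (union + eps)

    # Width and height of the smallest box enclosing both boxes.
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)

    # Squared distance between the two box centers.
    center_dist2 = ((px1 + px2 - gx1 - gx2) ** 2
                    + (py1 + py2 - gy1 - gy2) ** 2) / 4.0

    loss_center = center_dist2 / (cw ** 2 + ch ** 2 + eps)
    loss_w = (pw - gw) ** 2 / (cw ** 2 + eps)  # width loss
    loss_h = (ph - gh) ** 2 / (ch ** 2 + eps)  # height loss
    return 1.0 - iou + loss_center + loss_w + loss_h
```

The width and height terms are computed separately, as the claim describes, so a box with the correct width but the wrong height is penalized only through the height term.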

5. The method for detecting prohibited items in X-ray security images based on improved YOLOv7 according to claim 1, characterized in that: The MME-YOLO security inspection image prohibited item detection network model consists of an image preprocessing module, a high-efficiency backbone network, a transition network, a neck network, and a detection head; The image preprocessing module resizes all images in the security inspection dataset SIXray to 640×640, and uses Mosaic and Mixup as data augmentation strategies. After data augmentation, the training data is input into the high-efficiency backbone network for feature extraction. The high-efficiency backbone network is responsible for capturing features at different levels from the input image and gradually reducing the resolution of the feature maps; The transition network is responsible for re-aggregating the feature maps output by the backbone network, and outputting multi-scale aggregated features with rich receptive field and feature information. The neck network first uses a feature pyramid network structure to progressively upsample from bottom to top and fuse the multi-scale aggregated features output by the transition network. Then, it uses a path aggregation network structure to progressively downsample from top to bottom and fuse the feature maps output by the feature pyramid network. Finally, the feature maps are divided into three different scales and fed into the detection head. The detection head includes three detection heads designed for different target sizes, which respectively receive large-size features, medium-size features, and small-size features from the neck network, use the precise bounding box regression loss (EIoUer Loss) to measure the difference between the model's predicted boxes and the ground truth bounding boxes, and finally output the detection results.
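For the 640×640 input described above, the spatial sizes of the three feature maps passed to the detection heads can be worked out with a short sketch. The downsampling strides of 8, 16 and 32 are an assumption based on the standard YOLOv7 configuration; the claim itself only states that three scales are used. The function name is hypothetical.

```python
def head_grid_sizes(input_size=640, strides=(8, 16, 32)):
    """Map each assumed stride to the side length of its feature map."""
    return {s: input_size // s for s in strides}


for stride, side in head_grid_sizes().items():
    print(f"stride {stride:>2}: {side} x {side} grid, {side * side} cells")
```

Under this assumption the three heads operate on 80×80, 40×40 and 20×20 grids. The finest grid has the most cells and is the one normally used for small objects, while the coarsest grid covers the largest receptive field per cell.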