X-ray security image contraband detection method and system based on scan line context

Using a scan-line context-based X-ray security image detection method, the invention combines a hierarchical multi-scale encoder, a lightweight feature recalibration module, and a contrast-driven feature aggregation module to address the insufficient detection accuracy caused by overlapping and occluded items in X-ray security images, thereby detecting contraband efficiently.

CN122265633A · Pending · Publication Date: 2026-06-23 · NANJING FOREST POLICE COLLEGE

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: NANJING FOREST POLICE COLLEGE
Filing Date: 2026-03-27
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

Existing X-ray security inspection image detection methods suffer from insufficient detection accuracy and high false negative rates when dealing with situations involving severe overlap of objects, low contrast between the target and the background, and difficulty in modeling long-range spatial dependencies. They also perform poorly in occluded scenarios.

Method used

A detection method based on scan line context is adopted. It improves detection accuracy and inference efficiency by using a hierarchical multi-scale encoder, a lightweight feature recalibration module, a contrast-driven feature aggregation module, and a bidirectional spatial context module, combined with a decoupled scale-aware detection head. The network is trained end to end with varifocal loss and complete intersection-over-union (CIoU) loss.

Benefits of technology

It significantly improves the detection accuracy of contraband in obscured and low-contrast scenarios, reduces the false negative rate, and maintains high inference efficiency, making it suitable for deployment in resource-constrained security inspection systems.

✦ Generated by Eureka AI based on patent content.

Abstract

The application discloses an X-ray security image contraband detection method and system based on scan line context, belonging to the field of computer vision and public safety detection. To address the limited detection precision on X-ray security images, which are characterized by severe overlap, low contrast, and difficulty in modeling long-range spatial dependence, an end-to-end network, SCXNet, is designed. A hierarchical multi-scale encoder integrates a lightweight feature recalibration module and a contrast-driven feature aggregation module to extract multi-scale features at strides 8 and 16. A bidirectional convolutional long short-term memory network aggregates global context along the scan line to establish long-range spatial dependence. A decoupled scale-aware detection head has independently designed classification and localization branches. Varifocal loss plus complete intersection-over-union (CIoU) loss is used for joint training, together with a customized label assignment strategy. The method achieves an mAP of 73.61% on the PIDray dataset and 91.67% on the OPIXray dataset, outperforming the compared existing methods. It improves detection precision in occluded and low-contrast scenes while keeping inference efficient enough for deployment on security equipment.

Description

Technical Field

[0001] This invention relates to the fields of computer vision and public security inspection technology, specifically to a method and system for detecting contraband in X-ray security images based on scan line context, applicable to X-ray baggage security inspection scenarios in public transportation hubs such as airports, train stations, and subways.

Background Technology

[0002] X-ray baggage security checks are a core component of maintaining public safety at public transportation hubs. Traditional security procedures rely on manual interpretation of X-ray images by security personnel, which is susceptible to factors such as fatigue and psychological state, leading to risks of missed or false detections. With the development of deep learning technology, automatic X-ray security image detection technology based on object detection networks has become a research hotspot, effectively improving security check efficiency.

[0003] However, X-ray security inspection images are fundamentally different from natural optical images. They are obtained through transmission imaging, where multiple items along the X-ray path are projected onto the same plane, resulting in severe overlap of items, cluttered images, and loss of depth information. Furthermore, the similar attenuation coefficients of adjacent materials can easily cause blurred or semi-transparent boundaries of items. Contraband may be obscured or visually split into multiple unconnected parts, posing challenges to automated detection.

[0004] Currently, most mainstream X-ray security inspection image detection methods are based on improvements to general target detection networks, which have three main drawbacks: First, they rely on local convolution operations, making it difficult to model long-range spatial dependencies and unable to associate the same object parts separated in cluttered scenes, resulting in a high false negative rate under severe occlusion; second, standard convolutional backbone networks process all channels uniformly, generating redundant features and weakening the contrast sensitivity of low-contrast X-ray images; and third, they treat X-ray images as isotropic static two-dimensional grids, failing to utilize the structured spatial layout features acquired line by line by the X-ray scanner.

[0005] In existing technologies, some solutions attempt to integrate edge feature extractors or attention modules to adapt to X-ray scenes, but these only address single problems and fail to achieve synergistic optimization of channel redundancy suppression, boundary enhancement, and long-range spatial dependency modeling. Furthermore, general loss functions and label assignment strategies cannot address the severe imbalance between foreground and background samples in X-ray security inspection datasets, resulting in insufficient detection accuracy. Therefore, a customized detection method specifically tailored to the characteristics of X-ray security inspection images is urgently needed to significantly improve the detection accuracy of contraband in occluded and low-contrast scenes while maintaining inference efficiency.

Summary of the Invention

[0006] The purpose of this invention is to overcome the shortcomings of the prior art and provide a method and system for detecting contraband in X-ray security inspection images based on scan line context. This solves the problem of insufficient detection accuracy in X-ray security inspection images caused by severe overlap of items, low contrast between the target and the background, and difficulty in modeling long-range spatial dependencies. At the same time, it ensures the inference efficiency of detection and adapts to the deployment requirements of security inspection systems with limited resources.

[0007] To achieve the above objectives, the present invention provides a method for detecting contraband in X-ray security images based on scan line context, comprising the following steps:

S1. The input three-channel X-ray security inspection image is fed into a hierarchical multi-scale encoder for feature extraction. The encoder contains four encoding stages, each consisting of two 3×3 convolutional layers combined with batch normalization and a ReLU activation function. The four stages have 64, 128, 256 and 512 channels, respectively. The first convolutional layer in each stage has a stride of 2 for spatial downsampling, and the second convolutional layer has a stride of 1 for feature refinement. A lightweight feature recalibration module is inserted after the second encoding stage for channel response recalibration, and a contrast-driven feature aggregation module is inserted after the fourth encoding stage to enhance boundary sensitivity. Finally, a shallow feature map with a stride of 8 and a deep feature map with a stride of 16 are output.

S2. The deep feature map with a stride of 16 output from step S1 is fed into the bidirectional spatial context module. The module first projects the input features to a reduced-dimensional channel space through a 1×1 convolution, and then divides the feature map along the X-ray scan line direction into an ordered strip sequence. A bidirectional convolutional long short-term memory network processes the strip sequence in the forward and backward directions to aggregate global context information. The hidden states of the two directions are concatenated, projected back to the original channel dimension through a 1×1 convolution, and added to the original input features through a residual connection to obtain the context-enhanced deep feature map. The shallow feature map with a stride of 8 remains unchanged.

S3. The context-enhanced deep feature map obtained in step S2 and the shallow feature map are fed into the decoupled scale-aware detection head. The detection head has independent classification and regression branches at each scale. Each branch contains two 3×3 convolutional layers and one 1×1 convolutional layer. The classification branch outputs the class confidence scores, and the regression branch outputs the bounding box parameters.

S4. Varifocal loss is used as the classification loss and complete intersection-over-union (CIoU) loss is used as the localization loss to train the network jointly end to end.

[0008] Furthermore, the processing procedure of the lightweight feature recalibration module in step S1 is as follows: The input feature map is equally divided into a first part and a second part along the channel dimension. The first part serves as the identity branch to preserve the original feature distribution. The second part is subjected to a 3×3 depthwise separable convolution to extract local spatial patterns, and then global average pooling and a sigmoid activation function are used to generate channel attention vectors. These vectors are then multiplied element-wise with the features of the second part to obtain recalibrated features. The identity branch and the recalibrated features are concatenated along the channel dimension for output.

[0009] Furthermore, the processing procedure of the contrast-driven feature aggregation module in step S1 is as follows:

S11. Edge enhancement feature extraction: Fixed horizontal and vertical Sobel operators are used as untrainable depthwise convolutions and applied to each channel of the input feature map. The gradient magnitude is obtained as the square root of the sum of the squares of the horizontal and vertical gradients. The gradient magnitude is then processed by a squeeze-and-excitation channel attention operator to generate channel attention weights, which are multiplied element-wise with the input feature map to obtain the edge enhancement features.

S12. Spatial context feature extraction: The receptive field of the input feature map is expanded by a 5×5 depthwise separable convolution, and channel attention weights are then generated by the same squeeze-and-excitation channel attention operator. The spatial context features are obtained by multiplying these weights element-wise with the input feature map.

S13. Adaptive spatial attention fusion: The edge enhancement features and the spatial context features are concatenated, and an attention generator consisting of two 1×1 convolutional layers produces a single-channel spatial attention map. The two types of features are weighted and fused with this map, and a residual connection from the input feature map is added to form the final output.

[0010] Furthermore, the squeeze-and-excitation channel attention operator works as follows: global average pooling is performed on the input features; a first 1×1 convolutional layer compresses the number of channels to one quarter of the original, followed by ReLU activation; a second 1×1 convolutional layer restores the original number of channels, followed by Sigmoid activation; the result is output as the channel attention weights.
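The operator described in [0010] can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the patent's implementation: the 1×1 convolutions are written as bias-free matrix products on the pooled descriptor, and the function name, toy feature map, and weight values are all hypothetical.

```python
import math

def se_channel_weights(feature, w1, w2):
    """Channel attention weights for a feature map.

    feature: list of C channels, each a 2-D list (H x W).
    w1: (C // r) x C matrix, w2: C x (C // r) matrix. On a 1x1xC descriptor,
    a 1x1 convolution reduces to a matrix product (biases omitted: assumption).
    """
    # Global average pooling: one scalar per channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]
    # First 1x1 convolution (C -> C/r) followed by ReLU.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w1]
    # Second 1x1 convolution (C/r -> C) followed by Sigmoid.
    return [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
            for row in w2]

# Toy input: C = 4 channels of size 2x2, reduction ratio r = 4 (one hidden unit).
feature = [[[float(c)] * 2 for _ in range(2)] for c in range(4)]
w1 = [[0.25, 0.25, 0.25, 0.25]]       # hypothetical weights
w2 = [[1.0], [0.0], [-1.0], [2.0]]    # hypothetical weights
weights = se_channel_weights(feature, w1, w2)
```

Each returned weight lies strictly between 0 and 1 and would scale the corresponding channel of the input feature map.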

[0011] Furthermore, in step S2, the bidirectional convolutional long short-term memory network uses 3×1 convolutional kernels for all gating operations, maintains independent learnable parameters for both forward and backward directions, does not use peephole connections, and sets the number of dimensionality reduction channels to 256.

[0012] Furthermore, the method also includes a label assignment strategy: at each feature level, for each ground truth box, the top 10 candidate positions ranked by a combined metric of classification confidence and intersection over union (IoU) are selected as positive samples. The quality target of each positive sample is the IoU between the predicted box and the matched ground truth box, and the remaining positions are negative samples.
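The top-k selection in [0012] can be illustrated with a short sketch. The text does not state how confidence and IoU are combined, so the product used here is an assumption, as is the rule that a candidate claimed by two ground truth boxes keeps the match with the higher IoU. Function names and the toy boxes are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def assign_labels(pred_boxes, pred_scores, gt_boxes, top_k=10):
    """Return {candidate index: (gt index, quality target)} for positive samples.

    pred_scores[i][g] is the confidence of candidate i for the class of gt box g.
    All candidates not present in the result are negatives.
    """
    positives = {}
    for g, gt in enumerate(gt_boxes):
        ious = [iou(p, gt) for p in pred_boxes]
        # Combined metric: confidence * IoU (assumed form, not given in the text).
        metric = [pred_scores[i][g] * ious[i] for i in range(len(pred_boxes))]
        ranked = sorted(range(len(pred_boxes)), key=lambda i: metric[i], reverse=True)
        for i in ranked[:top_k]:
            # Conflict rule (assumption): keep the ground truth with higher IoU.
            if i not in positives or ious[i] > positives[i][1]:
                positives[i] = (g, ious[i])
    return positives

# Toy example with one ground truth box and three candidates, top_k = 2.
gt_boxes = [(0, 0, 10, 10)]
pred_boxes = [(0, 0, 10, 10), (0, 0, 5, 10), (20, 20, 30, 30)]
pred_scores = [[0.9], [0.8], [0.9]]
positives = assign_labels(pred_boxes, pred_scores, gt_boxes, top_k=2)
```

The quality target stored for each positive is the IoU itself, matching the quality target q used by the classification loss.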

[0013] This invention also provides an X-ray security image contraband detection system based on scan line context for implementing the above method, characterized in that it includes:

Hierarchical multi-scale encoder module: used to extract a shallow feature map with a stride of 8 and a deep feature map with a stride of 16 from the input X-ray security inspection image. It integrates a lightweight feature recalibration module and a contrast-driven feature aggregation module.

Bidirectional spatial context module: used to establish bidirectional long-range spatial dependencies along the X-ray scan line direction for the deep feature map with a stride of 16, and to output context-enhanced features.

Decoupled scale-aware detection head module: used to independently perform classification and bounding box regression at two scales and to output the detection results.

Joint loss training module: used to optimize the system end to end with varifocal loss and complete intersection-over-union (CIoU) loss.

[0014] Furthermore, the lightweight feature recalibration module adopts a channel split-transform-merge structure: half of the channels are retained as the identity branch, and the other half are recalibrated through depthwise separable convolution and channel attention and then concatenated with the identity branch along the channel dimension for output.

[0015] Furthermore, the contrast-driven feature aggregation module uses a fixed, untrainable Sobel gradient operator to extract edge information as an explicit prior, and obtains spatial context features by expanding the receptive field with a 5×5 depthwise separable convolution. The two types of features are then weighted and fused through an adaptive spatial attention mechanism, and a residual connection is added to keep the gradient flow stable.

[0016] Furthermore, the bidirectional convolutional long short-term memory network in the bidirectional spatial context module uses a 3×1 convolutional kernel for gating operations, has no peephole connections, and only operates on deep feature maps with a stride of 16.

[0017] The beneficial effects of this invention are reflected in the following aspects: (1) The lightweight feature recalibration module recalibrates only half of the channels through the channel split-transform-merge structure, which enhances the informative channels while protecting the weak signal response of low-contrast items. Furthermore, the depthwise separable convolution introduces very few additional parameters, thus controlling the computational overhead while suppressing channel redundancy.

[0018] (2) The contrast-driven feature aggregation module introduces the fixed Sobel operator as an explicit gradient prior into feature extraction, which complements and fuses with the learned spatial context features, effectively enhancing the blurred object boundary information in overlapping scenes. Furthermore, the Sobel operator does not participate in training, and the attention generator is a lightweight structure with extremely low computational overhead.

[0019] (3) Based on the physical characteristics of X-ray scanners acquiring data line by line, the bidirectional spatial context module uses a bidirectional convolutional long short-term memory network to aggregate the global context along the scan line direction. It establishes long-range spatial dependencies that are difficult to capture efficiently by traditional convolution and standard self-attention mechanisms. It can associate objects and parts that are spatially separated due to occlusion or splitting, and significantly reduce the false negative rate in severely occluded scenarios.

[0020] (4) The decoupled scale-aware detection head designs the classification and localization branches independently, so that each branch learns specialized feature representations to avoid mutual interference. The classification branch focuses on material-sensitive channel patterns, while the localization branch focuses on robust spatial features that resist boundary ambiguity.

[0021] (5) The combined use of varifocal loss and complete intersection-over-union (CIoU) loss, along with a customized label assignment strategy, effectively alleviates the imbalance between foreground and background samples in X-ray security inspection datasets and improves the accuracy of classification and localization.

[0022] (6) Each module of the present invention is designed to address three approximately orthogonal technical issues: channel redundancy, boundary enhancement, and spatial context modeling. The synergistic effect achieves cumulative performance improvement. On the PIDray dataset, the overall mAP reaches 73.61%, which is 5.81% higher than the best baseline. On the OPIXray dataset, the mAP reaches 91.67%, which is 1.85% higher than the best baseline. At the same time, it maintains an acceptable inference efficiency of 35.37 GFLOPs and 83 FPS, balancing detection accuracy and deployment efficiency.

Attached Figure Description

[0023] Figure 1 is a flowchart of the overall implementation of the SCXNet detection framework of the present invention; Figure 2 is a schematic diagram of the structure of the hierarchical multiscale encoder (HMSE) of the present invention; Figure 3 is a schematic diagram of the lightweight feature recalibration module (LFRM) of the present invention; Figure 4 is a schematic diagram of the contrast-driven feature aggregation module (CDFA) of the present invention; Figure 5 is a schematic diagram of the structure of the bidirectional spatial context module (BSCM) of the present invention.

Detailed Implementation

[0024] To facilitate understanding and implementation of the present invention by those skilled in the art, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments.

I. Overall Framework

[0025] As shown in Figure 1, the overall process of the SCXNet detection framework proposed in this invention is as follows: Given a three-channel X-ray security inspection image with a resolution of H×W, it is first fed into a hierarchical multiscale encoder (HMSE) to extract hierarchical feature representations, generating two feature maps with strides of 8 and 16.

[0026] The feature map with a stride of 16 is then fed into the Bidirectional Spatial Context Module (BSCM) for long-range spatial context enhancement. Finally, the features at both scales (the unchanged stride-8 map and the context-enhanced stride-16 map) are fed into the Decoupled Scale-Aware Detection Head (DSADH) for classification and localization prediction. The entire network is jointly trained end to end using varifocal loss and complete intersection-over-union (CIoU) loss.

II. Hierarchical Multiscale Encoder

[0027] As shown in Figure 2, the hierarchical multiscale encoder comprises four encoding stages. Each encoding stage consists of two 3×3 convolutional layers, followed by batch normalization and a ReLU activation function. The numbers of output channels for the four stages are set to 64, 128, 256, and 512, respectively.

[0028] To match computational complexity with model efficiency, the stride of the first 3×3 convolutional layer in each stage is set to 2 to achieve spatial downsampling, and the stride of the second 3×3 convolutional layer is set to 1 for feature refinement. No additional 2×2 max pooling operation is used.

[0029] The spatial resolutions of the output feature maps in the four stages are H / 2×W / 2, H / 4×W / 4, H / 8×W / 8, and H / 16×W / 16, respectively. To balance the detection accuracy of targets at different scales, a shallow feature map (256 channels) with a stride of 8 is taken from the third stage, and a deep feature map (512 channels) with a stride of 16 is taken from the fourth stage after processing by the contrast-driven feature aggregation module.
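The stage resolutions in [0029] follow from one stride-2 convolution per stage. The sketch below reproduces that arithmetic for a 640×640 input, assuming a padding of 1 for the 3×3 convolutions (the padding is not stated in the text); the function name is hypothetical.

```python
def encoder_shapes(height, width, channels=(64, 128, 256, 512)):
    """Output (height, width, channels) of each encoding stage."""
    shapes = []
    h, w = height, width
    for c in channels:
        # First 3x3 conv of the stage: stride 2, padding 1 (padding assumed),
        # so out = floor((n + 2*1 - 3) / 2) + 1. The second conv (stride 1,
        # padding 1) leaves the spatial size unchanged.
        h = (h + 2 - 3) // 2 + 1
        w = (w + 2 - 3) // 2 + 1
        shapes.append((h, w, c))
    return shapes

shapes = encoder_shapes(640, 640)
shallow = shapes[2]   # stride-8 map taken from the third stage
deep = shapes[3]      # stride-16 map taken from the fourth stage
```

For the 640×640 training resolution this gives an 80×80×256 shallow map and a 40×40×512 deep map.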

[0030] In practice, the lightweight feature recalibration module is inserted after the second encoding stage, and the contrast-driven feature aggregation module is inserted after the fourth encoding stage. Both are serial inline modules and do not introduce parallel branches.

III. Lightweight Feature Recalibration Module

[0031] As shown in Figure 3, let the input feature map of the lightweight feature recalibration module be X, with dimensions H'×W'×C. The specific processing procedure of this module is as follows: First, X is equally divided into X1 and X2 along the channel dimension, with each part having a dimension of H'×W'×C/2, as shown in equation (1).

$$[X_1, X_2] = \mathrm{Split}(X), \quad X_1, X_2 \in \mathbb{R}^{H' \times W' \times C/2} \tag{1}$$

[0032] X1 is retained directly as the identity branch, maintaining the original feature distribution without any transformation. X2 enters the recalibration branch, first undergoing a 3×3 depthwise separable convolution that extracts the local spatial pattern to obtain X2', as shown in equation (2).

$$X_2' = \mathrm{DWConv}_{3 \times 3}(X_2) \tag{2}$$

[0033] where $\mathrm{DWConv}_{3 \times 3}$ denotes a depthwise separable convolution with a kernel size of 3×3. Then, global average pooling is applied to X2' to compress the spatial dimension and obtain a global descriptor with a dimension of 1×1×C/2, which is passed through the Sigmoid activation function $\sigma$ to generate the channel attention vector A, as shown in equation (3).

$$A = \sigma(\mathrm{GAP}(X_2')) \tag{3}$$

[0034] A and X2' are multiplied element by element to obtain the recalibrated features $\tilde{X}_2$, as shown in equation (4).

$$\tilde{X}_2 = A \odot X_2' \tag{4}$$

[0035] Finally, the identity branch X1 and the recalibrated features are concatenated along the channel dimension and output as Y, restoring the dimension to H'×W'×C, as shown in equation (5).

$$Y = \mathrm{Concat}(X_1, \tilde{X}_2) \in \mathbb{R}^{H' \times W' \times C} \tag{5}$$

[0036] The key to this design is retaining half of the channels without transformation. In X-ray imaging, different channels may carry differential information related to specific material categories. Traditional SE attention blocks apply attention weights to all channels, which may oversuppress the weaker channel responses that are crucial for identifying low-contrast objects.

[0037] This module enhances informative channels while protecting potentially valuable weak signals by recalibrating only half of the channels and maintaining identity propagation on the other half. Depthwise separable convolution introduces only 9×C / 2 additional parameters, resulting in extremely low computational overhead.
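The 9×C/2 figure in [0037] is the weight count of a 3×3 depthwise convolution over C/2 channels. A short check, assuming no bias terms and C = 128 (the channel count of the second encoding stage, after which the module is inserted); the function name is hypothetical.

```python
def depthwise_params(channels, kernel=3):
    """Weights of a depthwise convolution: one kernel x kernel filter per channel."""
    return channels * kernel * kernel

C = 128                                  # channels after the second encoding stage
lfrm_extra = depthwise_params(C // 2)    # recalibration branch sees C / 2 channels
```

The global average pooling and Sigmoid steps add no learnable parameters, so this count is the whole overhead of the recalibration branch under these assumptions.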

IV. Contrast-Driven Feature Aggregation Module

[0038] As shown in Figure 4, let the input feature map of the contrast-driven feature aggregation module be F, with dimensions H'×W'×C. The specific processing procedure of this module is as follows: For edge enhancement feature extraction, the horizontal Sobel operator $K_x$ and the vertical Sobel operator $K_y$ are implemented as depthwise convolutions with fixed weights, as shown in equation (6).

$$K_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \quad K_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix} \tag{6}$$

[0039] $K_x$ and $K_y$ are applied independently to each channel of F to obtain the horizontal gradient $G_x$ and the vertical gradient $G_y$, as shown in equation (7).

$$G_x = K_x * F, \quad G_y = K_y * F \tag{7}$$

[0040] where F denotes the input feature map and $*$ denotes the convolution operation. The gradient magnitude G is then calculated as

$$G = \sqrt{G_x^2 + G_y^2}$$

[0041] G is fed into the squeeze-and-excitation channel attention operator $\mathrm{SE}(\cdot)$. The operator sequentially performs global average pooling, a first 1×1 convolution (compressing the number of channels from C to C/4), ReLU activation, a second 1×1 convolution (restoring to C channels), and Sigmoid activation, outputting channel attention weights. These weights are then multiplied element-wise by F to obtain the edge enhancement features $F_{edge}$, as shown in equation (8).

$$F_{edge} = \mathrm{SE}(G) \odot F \tag{8}$$

[0042] where $\mathrm{SE}(\cdot)$ is the squeeze-and-excitation style channel attention operator. Its compression ratio r is defined by equation (9) and is set to r = 4 throughout the network.

$$\mathrm{SE}(Z) = \sigma\big(W_2\, \delta(W_1\, \mathrm{GAP}(Z))\big), \quad W_1 \in \mathbb{R}^{(C/r) \times C}, \; W_2 \in \mathbb{R}^{C \times (C/r)} \tag{9}$$

[0043] Here $\delta$ denotes ReLU and $\sigma$ denotes Sigmoid. For spatial context feature extraction, the receptive field of F is expanded through a 5×5 depthwise separable convolution, and the result is input into a squeeze-and-excitation channel attention operator with the same structure as above. The generated attention weights are then multiplied element-wise by F to obtain the spatial context features $F_{ctx}$, as in equation (10).

$$F_{ctx} = \mathrm{SE}\big(\mathrm{DWConv}_{5 \times 5}(F)\big) \odot F \tag{10}$$

[0044] where $\mathrm{DWConv}_{5 \times 5}$ denotes a depthwise separable convolution with a kernel size of 5×5. For adaptive fusion, $F_{edge}$ and $F_{ctx}$ are concatenated along the channel dimension to obtain joint features with 2C channels, and a lightweight attention generator $\phi(\cdot)$ is then used to generate a spatial attention map M, as shown in equation (11).

$$M = \phi\big(\mathrm{Concat}(F_{edge}, F_{ctx})\big) \tag{11}$$

[0045] The attention generator $\phi(\cdot)$ consists of two 1×1 convolutional layers, as shown in equation (12). The first 1×1 convolution compresses the number of channels from 2C to C/8, and after ReLU activation, the second 1×1 convolution generates a single-channel attention map.

$$\phi(Z) = \mathrm{Conv}_{1 \times 1}^{C/8 \to 1}\big(\delta(\mathrm{Conv}_{1 \times 1}^{2C \to C/8}(Z))\big) \tag{12}$$

[0046] The attention map M is used to weight and fuse $F_{edge}$ and $F_{ctx}$, and a residual connection from F is added to give the final output, as shown in equation (13).

[Equation (13) is not reproduced in the source text.]

[0047] The Sobel operator is fixed and not updated during training, serving as an explicit gradient prior that provides physical edge information to the learned convolutional features. This design is particularly important in X-ray images because the similar attenuation coefficients of adjacent materials can lead to extremely low contrast at object boundaries, which purely learned convolutional features may not adequately capture.
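The fixed-weight edge prior can be illustrated on a single channel in plain Python. This sketch uses the standard 3×3 Sobel kernels and a valid (unpadded) cross-correlation, which is what deep-learning frameworks call convolution; both the padding choice and the toy image are assumptions for illustration.

```python
import math

# Standard Sobel kernels, kept constant (never updated by training).
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def correlate3x3(img, kernel):
    """Valid 3x3 cross-correlation of a 2-D list with a 3x3 kernel."""
    rows, cols = len(img), len(img[0])
    return [[sum(kernel[a][b] * img[i + a][j + b]
                 for a in range(3) for b in range(3))
             for j in range(cols - 2)]
            for i in range(rows - 2)]

def gradient_magnitude(img):
    """Per-pixel sqrt(Gx^2 + Gy^2) for one channel."""
    gx = correlate3x3(img, SOBEL_X)
    gy = correlate3x3(img, SOBEL_Y)
    return [[math.sqrt(x * x + y * y) for x, y in zip(row_x, row_y)]
            for row_x, row_y in zip(gx, gy)]

# Toy channel with a vertical edge: left half 0, right half 1.
edge_img = [[0, 0, 1, 1] for _ in range(4)]
flat_img = [[1, 1, 1, 1] for _ in range(4)]
edge_mag = gradient_magnitude(edge_img)
flat_mag = gradient_magnitude(flat_img)
```

The response is large along the edge and zero on the flat region, which is the signal the channel attention operator then summarizes per channel.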

V. Bidirectional Spatial Context Module

[0048] As shown in Figure 5, this module is applied only to the feature map with a stride of 16, to balance context modeling capability with computational overhead. Let the input feature map be $F_{16} \in \mathbb{R}^{h \times w \times C}$, where $h = H/16$ and $w = W/16$.

[0049] First, the input features are projected onto a reduced-dimensional channel space by a 1×1 convolution, with the number of reduced channels $C_r$ set to 256, as in equation (14).

$$Z = \mathrm{Conv}_{1 \times 1}(F_{16}) \in \mathbb{R}^{h \times w \times C_r}, \quad C_r = 256 \tag{14}$$

[0050] Then, one spatial dimension (corresponding to the direction of conveyor belt movement) is selected, and the feature map Z is divided into w vertical strips of dimension $h \times 1 \times C_r$, forming an ordered sequence {x1, x2, ..., x_w}, as shown in equation (15).

$$Z \mapsto \{x_1, x_2, \ldots, x_w\}, \quad x_t \in \mathbb{R}^{h \times 1 \times C_r} \tag{15}$$

[0051] The sequence is processed using a bidirectional convolutional long short-term memory network. The forward ConvLSTM processes the strips sequentially from $x_1$ to $x_w$, and the backward ConvLSTM processes them sequentially from $x_w$ to $x_1$, as shown in equations (16) to (17).

$$(\overrightarrow{\mathbf{h}}_t, \overrightarrow{\mathbf{c}}_t) = \mathrm{ConvLSTM}_{fwd}(x_t, \overrightarrow{\mathbf{h}}_{t-1}, \overrightarrow{\mathbf{c}}_{t-1}), \quad t = 1, \ldots, w \tag{16}$$

$$(\overleftarrow{\mathbf{h}}_t, \overleftarrow{\mathbf{c}}_t) = \mathrm{ConvLSTM}_{bwd}(x_t, \overleftarrow{\mathbf{h}}_{t+1}, \overleftarrow{\mathbf{c}}_{t+1}), \quad t = w, \ldots, 1 \tag{17}$$

[0052] The gating operation at each time step includes the input gate $i_t$, the forget gate $f_t$, the output gate $o_t$, and the candidate memory cell $g_t$. All convolution operations in the gates use 3×1 convolution kernels, and no peephole connections are used. The forward and backward directions share the same gating structure but maintain their own independent learnable parameters, as shown in equations (18) to (23).

$$i_t = \sigma(W_{xi} * x_t + W_{hi} * \mathbf{h}_{t-1} + b_i) \tag{18}$$

$$f_t = \sigma(W_{xf} * x_t + W_{hf} * \mathbf{h}_{t-1} + b_f) \tag{19}$$

$$o_t = \sigma(W_{xo} * x_t + W_{ho} * \mathbf{h}_{t-1} + b_o) \tag{20}$$

$$g_t = \tanh(W_{xg} * x_t + W_{hg} * \mathbf{h}_{t-1} + b_g) \tag{21}$$

$$\mathbf{c}_t = f_t \odot \mathbf{c}_{t-1} + i_t \odot g_t \tag{22}$$

$$\mathbf{h}_t = o_t \odot \tanh(\mathbf{c}_t) \tag{23}$$

[0053] At each position t, the forward hidden state and the backward hidden state are concatenated along the channel dimension and projected back to the original channel dimension C through a 1×1 convolution to obtain the fused hidden state $\hat{\mathbf{h}}_t$ with dimension h×1×C, as shown in equation (24).

$$\hat{\mathbf{h}}_t = \mathrm{Conv}_{1 \times 1}\big(\mathrm{Concat}(\overrightarrow{\mathbf{h}}_t, \overleftarrow{\mathbf{h}}_t)\big) \in \mathbb{R}^{h \times 1 \times C} \tag{24}$$

[0054] The fused hidden states at all positions are stacked along the sequence dimension to restore a two-dimensional feature map of h×w×C, as shown in equation (25).

$$H_{ctx} = \mathrm{Stack}(\hat{\mathbf{h}}_1, \hat{\mathbf{h}}_2, \ldots, \hat{\mathbf{h}}_w) \in \mathbb{R}^{h \times w \times C} \tag{25}$$

[0055] Then, it is added to the original input features through a residual connection, as shown in equation (26).

$$F_{16}' = F_{16} + H_{ctx} \tag{26}$$

[0056] The stride-8 features $F_8$ remain unchanged, resulting in the final context-enhanced features, as shown in equation (27).

$$F_{out} = \{F_8, F_{16}'\} \tag{27}$$

[0057] The core design concept of this module is as follows: X-ray baggage images are acquired line by line through a conveyor belt system, and the images contain structured spatial layout patterns introduced by the acquisition process. Standard convolution aggregates information from small neighborhoods and cannot efficiently propagate context across long spatial regions. While standard self-attention mechanisms can model global relationships, they are invariant to permutations and do not naturally utilize this structured spatial pattern. The bidirectional ConvLSTM allows each spatial location to receive contextual information from the entire scan, and this information propagation follows a structured spatial order, which helps to reconnect object components that are spatially separated by occlusion or splitting.
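The bidirectional scan can be illustrated with a deliberately reduced sketch. Here each strip is a single scalar and the 3×1 convolutions of the ConvLSTM are replaced by scalar multiplications, so this is a scalar peephole-free LSTM run in both directions rather than the module itself. The parameter layout and values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_scan(xs, params):
    """Run a peephole-free LSTM cell over xs and return the hidden states.

    params maps each gate name ("i", "f", "o", "g") to (w_x, w_h, b).
    Scalar products stand in for the 3x1 convolutions of a ConvLSTM.
    """
    h, c, hidden = 0.0, 0.0, []
    for x in xs:
        pre = {k: w_x * x + w_h * h + b for k, (w_x, w_h, b) in params.items()}
        i, f, o = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
        g = math.tanh(pre["g"])
        c = f * c + i * g          # memory cell update
        h = o * math.tanh(c)       # hidden state
        hidden.append(h)
    return hidden

def bidirectional_scan(xs, params_fwd, params_bwd):
    """Pair forward and backward hidden states at every position."""
    forward = lstm_scan(xs, params_fwd)
    backward = lstm_scan(xs[::-1], params_bwd)[::-1]
    return list(zip(forward, backward))

# Independent parameter sets per direction (values are illustrative only).
params_fwd = {k: (1.0, 0.5, 0.0) for k in "ifog"}
params_bwd = {k: (1.0, 0.5, 0.0) for k in "ifog"}
strips = [0.0, 0.0, 1.0]           # the only non-zero strip is the last one
states = bidirectional_scan(strips, params_fwd, params_bwd)
```

At the first position the forward state is still zero because nothing informative has been seen yet, while the backward state already carries information from the last strip. That is the long-range propagation the module relies on.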

VI. Detection Head and Loss Function

[0058] The decoupled scale-aware detection head performs detection at two scales with strides of 8 and 16, respectively. At each scale, the classification and regression branches use the same network structure: two 3×3 convolutional layers (256 channels, ReLU activation) followed by a 1×1 convolutional layer. The classification branch outputs a score map with dimensions (H/s)×(W/s)×K, where K is the number of target categories, as shown in equation (28); the regression branch outputs a bounding box parameter map with dimensions (H/s)×(W/s)×4, as shown in equation (29). The two branches maintain independent parameters, enabling classification to focus on material-sensitive channel patterns and localization to focus on robust spatial features that resist boundary ambiguity.

$$P_{cls}^{(s)} \in \mathbb{R}^{(H/s) \times (W/s) \times K}, \quad s \in \{8, 16\} \tag{28}$$

$$P_{reg}^{(s)} \in \mathbb{R}^{(H/s) \times (W/s) \times 4}, \quad s \in \{8, 16\} \tag{29}$$

[0059] The classification loss uses the varifocal loss (VFL). Let p denote the predicted classification score. For positive samples, the quality target q is set as the IoU value between the predicted box and the ground truth box, and the loss is calculated as shown in equation (30). For negative samples, the loss is calculated as shown in equation (31), where α = 0.75 and γ = 2.0. The total classification loss is shown in equation (32).

$$L_{pos} = -q\,\big(q \log p + (1 - q) \log(1 - p)\big) \tag{30}$$

$$L_{neg} = -\alpha\, p^{\gamma} \log(1 - p) \tag{31}$$

[Equation (32) is not reproduced in the source text.]

[0060] The localization loss adopts the complete intersection-over-union (CIoU) loss, which is calculated as shown in equations (33) to (34).

$$L_{CIoU} = 1 - \mathrm{IoU} + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v \tag{33}$$

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2, \quad \alpha = \frac{v}{(1 - \mathrm{IoU}) + v} \tag{34}$$

[0061] where $\rho(b, b^{gt})$ is the Euclidean distance between the center points of the predicted bounding box and the ground truth bounding box, c is the length of the diagonal of the smallest enclosing rectangle, v measures aspect ratio consistency, and α is an adaptive weighting coefficient (distinct from the α of the varifocal loss).

[0062] The total loss is calculated as shown in equation (35).
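The two loss terms named in this section can be sketched per sample in plain Python, using the published definitions of varifocal loss and CIoU loss with the stated α = 0.75 and γ = 2.0. How the per-sample terms are summed, normalized, and weighted into the total loss is not recoverable from the text, so it is left out. The clamping constant and function names are assumptions.

```python
import math

def varifocal_loss(p, q, alpha=0.75, gamma=2.0, eps=1e-12):
    """Per-sample varifocal loss; q > 0 marks a positive with IoU target q."""
    p = min(max(p, eps), 1.0 - eps)   # clamp for numerical safety (assumption)
    if q > 0:
        return -q * (q * math.log(p) + (1.0 - q) * math.log(1.0 - p))
    return -alpha * p ** gamma * math.log(1.0 - p)

def ciou_loss(box, gt, eps=1e-12):
    """CIoU loss for boxes (x1, y1, x2, y2) with positive width and height."""
    w, h = box[2] - box[0], box[3] - box[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    iw = max(0.0, min(box[2], gt[2]) - max(box[0], gt[0]))
    ih = max(0.0, min(box[3], gt[3]) - max(box[1], gt[1]))
    inter = iw * ih
    iou = inter / (w * h + wg * hg - inter)
    # Squared distance between the two box centers.
    rho2 = (((box[0] + box[2]) - (gt[0] + gt[2])) ** 2
            + ((box[1] + box[3]) - (gt[1] + gt[3])) ** 2) / 4.0
    # Squared diagonal of the smallest enclosing rectangle.
    c2 = ((max(box[2], gt[2]) - min(box[0], gt[0])) ** 2
          + (max(box[3], gt[3]) - min(box[1], gt[1])) ** 2)
    v = 4.0 / math.pi ** 2 * (math.atan(wg / hg) - math.atan(w / h)) ** 2
    trade_off = v / ((1.0 - iou) + v + eps)   # the adaptive coefficient
    return 1.0 - iou + rho2 / c2 + trade_off * v

neg = varifocal_loss(0.5, 0.0)                      # confident false positive
pos = varifocal_loss(0.8, 0.8)                      # positive with IoU target 0.8
perfect = ciou_loss((0, 0, 2, 2), (0, 0, 2, 2))     # identical boxes
disjoint = ciou_loss((0, 0, 1, 1), (2, 2, 3, 3))    # no overlap
```

Negatives are down-weighted by p^γ, so easy background positions contribute little, while positives are weighted by their IoU target q.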

[Equation (35) is not reproduced in the source text.]

VII. Training Configuration

[0063] In a specific implementation, the network is trained with the PyTorch framework in distributed data-parallel mode on NVIDIA RTX 4090 GPU hardware.

[0064] The input image is scaled to 640×640 and normalized to [0,1]. Training runs for 300 epochs using the SGD optimizer with an initial learning rate of 0.01, momentum of 0.937, and weight decay of 5×10⁻⁴. The first three epochs use linear warm-up, followed by cosine annealing that decays the learning rate to 1×10⁻⁴. The batch size is 16. No data augmentation is used during training. During inference, non-maximum suppression is used for post-processing, with an IoU threshold of 0.65 and a confidence threshold of 0.25.
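The learning-rate schedule described above can be sketched as a function of the epoch index. Several details are assumptions: the schedule is stepped per epoch rather than per iteration, the warm-up ramps linearly from base_lr/3 to base_lr, and the final learning rate is read as 1×10⁻⁴ from the garbled exponent in the source. The function name is hypothetical.

```python
import math

def learning_rate(epoch, total_epochs=300, warmup_epochs=3,
                  base_lr=0.01, final_lr=1e-4):
    """Linear warm-up followed by cosine annealing, indexed by epoch (0-based)."""
    if epoch < warmup_epochs:
        # Linear warm-up: reaches base_lr at the last warm-up epoch (assumed form).
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine annealing from base_lr down to final_lr over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - 1 - warmup_epochs)
    return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [learning_rate(e) for e in range(300)]
```

The schedule rises during the first three epochs, peaks at 0.01, and then decreases monotonically to the final value at epoch 299.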

[0065] VIII. Experimental Verification
Evaluations were performed on the OPIXray dataset (8885 images, 5 categories) and the PIDray dataset (47677 images, 12 categories). OPIXray focuses on detection evaluation under occlusion conditions, while PIDray provides a hierarchical evaluation protocol with three difficulty levels: easy, hard, and hidden.

[0066] Experimental results show that SCXNet achieves an overall mAP of 73.61% on the PIDray dataset, with 78.77% on the easy subset, 77.92% on the hard subset, and 61.53% on the hidden subset. On the OPIXray dataset, the mAP reaches 91.67%. Compared with the best baseline, YOLO11, this represents gains of 5.81 percentage points overall on PIDray, 9.20 points on the hard subset, 3.84 points on the hidden subset, and 1.85 points on OPIXray. As the severity of occlusion increases, SCXNet's advantage over general detectors continues to widen, validating the effectiveness of its domain-specific architecture design.

[0067] The ablation experiments were conducted on the PIDray hidden subset. Starting from a general baseline with 54.21% mAP, three modules were added sequentially: the lightweight feature recalibration module (55.11%, +0.90 percentage points), the contrast-driven feature aggregation module (57.41%, +2.30 points), and the bidirectional spatial context module (61.53%, +4.12 points). The total improvement of 7.32 percentage points is approximately equal to the sum of the individual contributions of the modules, indicating that the three modules address three approximately orthogonal failure modes (channel recalibration, boundary enhancement, and spatial context modeling, respectively) and work together with minimal redundancy.
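The ablation arithmetic can be checked with a few lines of Python. The mAP values are those reported above; the variable names are hypothetical, and rounding to two decimals is used only to absorb floating-point noise.

```python
baseline_map = 54.21
# (module added, cumulative mAP on the PIDray hidden subset)
ablation_steps = [
    ("lightweight feature recalibration", 55.11),
    ("contrast-driven feature aggregation", 57.41),
    ("bidirectional spatial context", 61.53),
]

gains = []
previous = baseline_map
for name, cumulative_map in ablation_steps:
    gains.append(round(cumulative_map - previous, 2))  # incremental gain
    previous = cumulative_map

total_gain = round(ablation_steps[-1][1] - baseline_map, 2)
print(gains, total_gain)
```

The incremental gains are 0.90, 2.30, and 4.12 percentage points, which add up to the reported total of 7.32.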

Claims

1. A method for detecting contraband in X-ray security images based on scan line context, characterized in that it comprises the following steps:
S1. An input three-channel X-ray security inspection image is fed into a hierarchical multi-scale encoder for feature extraction. The hierarchical multi-scale encoder contains four encoding stages, each consisting of two 3×3 convolutional layers combined with batch normalization and a ReLU activation function. The numbers of channels in the four stages are 64, 128, 256 and 512, respectively. The first convolutional layer in each stage has a stride of 2 for spatial downsampling, and the second convolutional layer has a stride of 1 for feature refinement. A lightweight feature recalibration module is inserted after the second encoding stage for channel response recalibration, and a contrast-driven feature aggregation module is inserted after the fourth encoding stage to enhance boundary sensitivity. Finally, a shallow feature map with a stride of 8 and a deep feature map with a stride of 16 are output.
S2. The deep feature map with a stride of 16 output from step S1 is fed into the bidirectional spatial context module. The bidirectional spatial context module first projects the input features to a reduced-dimension channel space through a 1×1 convolution, and then divides the spatial dimension of the feature map along the X-ray scan line direction into an ordered strip sequence. A bidirectional convolutional long short-term memory network processes the strip sequence in the forward and backward directions to aggregate global context information. The hidden states of the two directions are concatenated and projected back to the original channel dimension through a 1×1 convolution, then added to the original input features through a residual connection to obtain the context-enhanced deep feature map. The shallow feature map with a stride of 8 remains unchanged.
S3. The context-enhanced deep feature map obtained in step S2 and the shallow feature map are input into the decoupled scale-aware detection head. The detection head provides independent classification and regression branches at each scale. Each branch contains two 3×3 convolutional layers and one 1×1 convolutional layer. The classification branch outputs the class confidence scores, and the regression branch outputs the bounding box parameters.
S4. Varifocal loss is used as the classification loss and Complete Intersection over Union (CIoU) loss is used as the localization loss to perform end-to-end joint training of the network.

2. The method according to claim 1, characterized in that the processing procedure of the lightweight feature recalibration module in step S1 is as follows: the input feature map is divided equally into a first part and a second part along the channel dimension. The first part serves as the identity branch to preserve the original feature distribution. The second part is passed through a 3×3 depthwise separable convolution to extract local spatial patterns, and global average pooling followed by a sigmoid activation function is then used to generate a channel attention vector. This vector is multiplied element-wise with the features of the second part to obtain the recalibrated features. The identity branch and the recalibrated features are concatenated along the channel dimension for output.

3. The method according to claim 1, characterized in that the processing procedure of the contrast-driven feature aggregation module in step S1 is as follows:
S11. Edge enhancement feature extraction: fixed horizontal and vertical Sobel operators are applied as non-trainable depthwise convolutions to each channel of the input feature map. The gradient magnitude is obtained as the square root of the sum of the squares of the horizontal and vertical gradients. The gradient magnitude is passed through a squeeze-and-excitation channel attention operator to generate channel attention weights, which are multiplied element-wise with the input feature map to obtain the edge enhancement features.
S12. Spatial context feature extraction: the receptive field of the input feature map is expanded by a 5×5 depthwise separable convolution, and channel attention weights are then generated by the same squeeze-and-excitation channel attention operator. The spatial context features are obtained by multiplying these weights element-wise with the input feature map.
S13. Adaptive spatial attention fusion: the edge enhancement features and the spatial context features are concatenated, and a single-channel spatial attention map is generated by an attention generator consisting of two 1×1 convolutional layers. The two types of features are weighted and fused using this map, and the input feature map is added through a residual connection to form the final output.

4. The method according to claim 3, characterized in that the squeeze-and-excitation channel attention operator is as follows: global average pooling is performed on the input features; a first 1×1 convolutional layer compresses the number of channels to one-quarter of the original and is followed by ReLU activation; a second 1×1 convolutional layer restores the original number of channels and is followed by sigmoid activation; and the channel attention weights are output.

5. The method according to claim 1, characterized in that the bidirectional convolutional long short-term memory network described in step S2 uses 3×1 convolutional kernels for all gating operations, maintains independent learnable parameters for the forward and backward directions, does not use peephole connections, and sets the number of reduced-dimension channels to 256.

6. The method according to claim 1, characterized in that it further includes a label assignment strategy: at each feature level, for each ground truth box, the top 10 candidate positions ranked by a combined metric of classification confidence and intersection over union (IoU) are selected as positive samples. The quality target of each positive sample is the IoU between the predicted box and the matched ground truth box, and the remaining positions are negative samples.

7. A contraband detection system for X-ray security images based on scan line context, characterized in that it comprises:
a hierarchical multi-scale encoder module, used to extract a shallow feature map with a stride of 8 and a deep feature map with a stride of 16 from an input X-ray security inspection image, and integrating a lightweight feature recalibration module and a contrast-driven feature aggregation module;
a bidirectional spatial context module, used to establish bidirectional long-range spatial dependencies along the X-ray scan line direction for the deep feature map with a stride of 16, and to output context-enhanced features;
a decoupled scale-aware detection head module, used to independently perform classification and bounding box regression at two scales and to output detection results;
a joint loss training module, used to optimize the system end-to-end using varifocal loss and Complete Intersection over Union (CIoU) loss.

8. The system according to claim 7, characterized in that the lightweight feature recalibration module adopts a channel split-transform-merge structure: half of the channels are retained as an identity branch, and the other half are recalibrated through depthwise separable convolution and channel attention and then concatenated with the identity branch along the channel dimension for output.

9. The system according to claim 7, characterized in that the contrast-driven feature aggregation module uses a fixed, non-trainable Sobel gradient operator to extract edge information as an explicit prior, and obtains spatial context features by expanding the receptive field with a 5×5 depthwise separable convolution. The two sets of features are then weighted and fused through an adaptive spatial attention mechanism, and a residual connection is added to keep the gradient flow stable.

10. The system according to claim 7, characterized in that the bidirectional spatial context module contains a bidirectional convolutional long short-term memory network whose gating operations use 3×1 convolutional kernels, which has no peephole connections, and which operates only on the deep feature map with a stride of 16.
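As an illustration of the fixed Sobel edge prior recited in claims 3 and 9, the following plain-Python sketch computes the gradient magnitude of a single feature channel as the square root of the sum of the squared horizontal and vertical responses. It is a sketch under stated assumptions rather than the claimed implementation: zero padding at the borders and the helper names `filter3x3` and `gradient_magnitude` are not specified in the text.

```python
import math

# Fixed (non-trainable) Sobel kernels for horizontal and vertical gradients.
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def filter3x3(channel, kernel):
    """Cross-correlate one 2D channel with a 3x3 kernel (zero padding assumed)."""
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky in range(3):
                for kx in range(3):
                    yy, xx = y + ky - 1, x + kx - 1
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[ky][kx] * channel[yy][xx]
            out[y][x] = acc
    return out


def gradient_magnitude(channel):
    """Per-pixel sqrt(gx^2 + gy^2); applied to each channel independently."""
    gx = filter3x3(channel, SOBEL_X)
    gy = filter3x3(channel, SOBEL_Y)
    return [[math.sqrt(a * a + b * b) for a, b in zip(row_x, row_y)]
            for row_x, row_y in zip(gx, gy)]


# A 4x4 channel containing a vertical step edge between columns 1 and 2.
step_edge = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
magnitude = gradient_magnitude(step_edge)

flat = [[5.0] * 3 for _ in range(3)]
flat_magnitude = gradient_magnitude(flat)
```

Applying the same two kernels to every channel separately corresponds to a depthwise convolution with frozen weights; interior pixels next to the step edge respond strongly, while the interior of a constant region responds with zero.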