A construction site personnel intrusion detection method based on an improved YOLO model
By improving the YOLOv13 model with multi-level feature fusion, dual-stream collaborative attention enhancement, and high-frequency edge feature enhancement, the method addresses target detection in complex construction site environments and achieves efficient, accurate detection of personnel intrusion at construction sites.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Application (China)
- Current Assignee / Owner: SHANDONG UNIV OF SCI & TECH
- Filing Date: 2026-05-19
- Publication Date: 2026-06-19
AI Technical Summary
Existing target detection algorithms struggle to achieve fast and accurate identification in construction site environments, especially in complex backgrounds, dense occlusion, and small target detection, where they suffer from missed detections and false detections. Furthermore, traditional YOLO series algorithms struggle to balance lightweight design with high detection accuracy.
An improved YOLOv13 model is adopted. A multi-level feature fusion detection module (MFFDM), a dual-stream collaborative attention enhancement module (DSSEAM), and a Sobel HyperACE module are designed in the backbone and neck networks to strengthen feature representation and improve detection accuracy and efficiency.
It achieves efficient and accurate target detection in complex construction site environments, reduces the rate of missed detections and false detections, is suitable for edge devices with limited computing power, and meets the real-time requirements of construction sites.
Figure CN122244805A_ABST
Description
Technical Field
[0001] This invention belongs to the field of personnel intrusion detection technology, specifically relating to a method for detecting personnel intrusion at construction sites based on an improved YOLO model.
Background Technology
[0002] With the rapid development of the construction industry and increasingly stringent safety management standards at construction sites, efficient and accurate detection and management of workers has become particularly urgent. Traditional detection methods (such as manual inspection and conventional video surveillance) heavily rely on the real-time observation and personal experience of managers, which is not only time-consuming and labor-intensive but also prone to missed detections and false detections, making it difficult to meet the efficient management needs of modern "smart construction sites." Therefore, the introduction of deep learning-based target detection technology has gradually become the mainstream solution to this problem. In recent years, although deep learning has made breakthrough progress in the field of target detection, conventional algorithms still face many challenges when applied to actual construction scenarios. Construction sites often have complex background environments, dense personnel distribution, and are prone to occlusion and overlap. In addition, the scale changes caused by the distance between the target and the camera, as well as the low resolution of the images captured by the camera, all contribute to the high likelihood of missed detections or false detections by the algorithm, making it difficult to achieve fast and accurate recognition.
[0003] In existing model architectures, two-stage detection algorithms (such as R-CNN, Faster R-CNN, and Mask R-CNN) first generate candidate regions, and then classify and regress these regions. While these algorithms have high accuracy, their inference speed is slow, making them unsuitable for real-time applications in construction sites. Emerging Transformer-based detection algorithms, while boasting excellent accuracy, often have large model parameters and rely heavily on massive amounts of data for training, hindering deployment on edge devices with limited computing power. In contrast, one-stage detection algorithms, represented by the YOLO series, offer a significant speed advantage; they can solve the object detection problem with just one regression step. These algorithms directly feed the image into a convolutional neural network to extract features, generate feature maps, and then generate a series of anchor points on the feature maps. They then obtain the class label and bounding box regression value for each anchor point, simultaneously completing the regression and classification tasks. These algorithms offer high speed and low computational complexity, making them suitable for scenarios with high real-time requirements. Although the YOLO series of algorithms perform well in inference speed, their ability to extract features from complex backgrounds, dense occlusions and small targets at construction sites is insufficient, resulting in a decrease in detection accuracy. Such algorithms cannot achieve an ideal balance between lightweight models and high detection accuracy, making them difficult to apply in practice on edge devices at construction sites with limited computing power.
Summary of the Invention
[0004] The purpose of this invention is to propose a method for detecting personnel intrusion at construction sites based on an improved YOLO model, so as to improve the accuracy and efficiency of personnel intrusion detection.
[0005] To achieve the above objectives, the present invention adopts the following technical solution. A construction site personnel intrusion detection method based on an improved YOLO model includes the following steps:
Step 1. Collect on-site image data, preprocess the on-site image data, and construct a training dataset.
Step 2. Build a construction site personnel intrusion detection model based on the improved YOLOv13. This model includes a backbone network, a neck network, and a detection head. The following improvements are made to the original YOLOv13 architecture:
- A multi-level feature fusion detection module (MFFDM) is designed in the backbone network and the neck network. It extracts and fuses features of different scales through the multi-scale fusion module (MSFM) to enhance the expressive power of the features. The MSFM module extracts features from multiple scales of an image through convolution and pooling operations, and then fuses them.
- A dual-stream collaborative attention enhancement module (DSSEAM) is designed in the backbone network. It uses a dual-stream collaborative processing approach for feature extraction and enhances the information in the input feature map by introducing an attention mechanism.
- The Sobel attention module is introduced into the original HyperACE module to enhance high-frequency edge features, thus forming the Sobel HyperACE module.
Step 3. Train the construction site personnel intrusion detection model on the training dataset from Step 1, and use the trained model to identify real-time images of the construction site, thereby achieving personnel intrusion detection.
[0006] Furthermore, based on the aforementioned construction site intrusion detection method based on the improved YOLO model, this invention also proposes a corresponding construction site personnel intrusion detection system, which adopts the following technical solution. The system includes the following modules:
- The preprocessing module is used to collect on-site image data, preprocess the on-site image data, and build a training dataset.
- The personnel intrusion detection module is used to build a construction site personnel intrusion detection model based on the improved YOLOv13. This model includes a backbone network, a neck network, and a detection head. It improves upon the original YOLOv13 architecture as follows. The Sobel attention module is introduced into the original HyperACE module to enhance high-frequency edge features, thus forming the Sobel HyperACE module. A multi-level feature fusion detection module (MFFDM) is designed in the backbone network and the neck network. It extracts and fuses features of different scales through the multi-scale fusion module (MSFM) to enhance the expressive power of the features. The MSFM module extracts features from multiple scales of an image through convolution and pooling operations, and then fuses them. A dual-stream collaborative attention enhancement module (DSSEAM) is designed in the backbone network. It uses a dual-stream collaborative processing approach for feature extraction and enhances the information in the input feature map by introducing an attention mechanism.
The intrusion detection model is trained on the training dataset, and the trained model is used to identify real-time images of construction sites, thereby realizing intrusion detection.
[0007] The present invention has the following advantages: As described above, this invention discloses a method for detecting intrusion at construction sites based on an improved YOLO model. This method constructs an intrusion detection model for construction sites based on an improved YOLOv13. In terms of model architecture, this invention designs an MFFDM module in both the backbone and neck networks. This module integrates features at different levels, such as shallow detail features and deep semantic features of the same modality, enabling the model to acquire more comprehensive and richer information, thereby more accurately understanding target objects in the construction site scene. The MFFDM module can detect small targets using shallow high-resolution features, and identify large targets using deep semantic features, effectively solving the problem of large target scale variations in construction site scenes and improving the model's robustness in complex environments. Meanwhile, this invention incorporates a DSSEAM module in the backbone network. This module utilizes an attention mechanism to assign weights to different regions in the feature map, enabling the model to focus more on key parts of the detected target (such as key body parts of construction workers or important components of mechanical equipment) and reduce attention to irrelevant background areas, thereby improving the accuracy and efficiency of personnel intrusion target detection. The dual-stream collaborative structure allows the DSSEAM module to analyze features from different angles. In complex construction site environments, one branch focuses on local detailed features, while the other branch considers overall contextual features. The two work together to enhance feature representation and effectively address issues such as target scale changes and occlusion. 
Furthermore, this invention designs a Sobel HyperACE module as a "global external brain," which uses the Sobel attention module to explicitly enhance high-frequency edge information, helping to separate targets from complex backgrounds. This module also introduces a dual residual injection mechanism, directly injecting high-order guidance features extracted from the backbone network across layers into the entry points of the neck network and the detection head. This design bypasses the purely serial fusion path of YOLO, so that edge and high-order features reach the prediction end directly and with less loss along the way. Furthermore, this invention can still work stably under limited computing power and can be deployed on edge devices at construction sites, making it highly valuable for practical application.
Description of the Attached Figures
[0008] Figure 1 is a flowchart of the construction site personnel intrusion detection method based on the improved YOLO model in an embodiment of the present invention;
Figure 2 is a structural diagram of the construction site personnel intrusion detection model based on the improved YOLOv13 in an embodiment of the present invention;
Figure 3 is a structural diagram of the MFFDM module in an embodiment of the present invention;
Figure 4 is a structural diagram of the MSFM module in an embodiment of the present invention;
Figure 5 is a structural diagram of the DSSEAM module in an embodiment of the present invention;
Figure 6 is a structural diagram of the Sobel HyperACE module in an embodiment of the present invention;
Figure 7 is a structural diagram of the Sobel attention module in an embodiment of the present invention;
Figure 8 is a structural diagram of the C3AH module in an embodiment of the present invention;
Figure 9 is a structural diagram of the adaptive hyperedge generation module in an embodiment of the present invention;
Figure 10 is a structural diagram of the hypergraph convolution module in an embodiment of the present invention.
Detailed Implementation
[0009] The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Example 1
This embodiment describes a method for detecting personnel intrusion at construction sites based on an improved YOLO model. As shown in Figure 1, the method includes the following steps: Step 1. Collect on-site image data, preprocess the on-site image data, and construct a training dataset.
[0010] Images of personnel are collected by image acquisition devices at the construction site, such as cameras and surveillance cameras, and then organized into a dataset. LabelImg (an annotation tool) is used to draw a bounding box around each person in every image, and the annotations are converted into the label format commonly used by YOLO (category and box coordinates). The labeled data is then divided into the training, validation, and test sets required by the model in a ratio of 8:1:1.
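The labeling and splitting steps above can be sketched as follows. This is a minimal illustration, not the patent's tooling: the function names, the fixed random seed, and the use of integer division for the split sizes are assumptions. The label line follows the widely used YOLO text convention of a class index followed by the box centre, width, and height, all normalized by the image size.

```python
import random


def to_yolo_label(class_id, box, img_w, img_h):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to a YOLO label line."""
    x_min, y_min, x_max, y_max = box
    x_c = (x_min + x_max) / 2 / img_w   # normalized box centre, x
    y_c = (y_min + y_max) / 2 / img_h   # normalized box centre, y
    w = (x_max - x_min) / img_w         # normalized box width
    h = (y_max - y_min) / img_h         # normalized box height
    return f"{class_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"


def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle items and split them into train/val/test by the given ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seed is an assumption, for repeatability
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # remainder goes to the test set
    return train, val, test
```

For example, `to_yolo_label(0, (100, 200, 300, 600), 640, 640)` describes a person box in a 640×640 image, and `split_dataset(range(100))` yields 80, 10, and 10 items.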
[0011] Step 2. Build a construction site personnel intrusion detection model based on the improved YOLOv13. As shown in Figure 2, the model includes a backbone network, a neck network, and a detection head; it improves on the original YOLOv13 architecture as follows: The Sobel attention module is introduced into the original HyperACE module to enhance high-frequency edge features, thus forming the Sobel HyperACE module; A multi-level feature fusion detection module (MFFDM) is designed in the backbone network and the neck network. It extracts and fuses features of different scales through the multi-scale fusion module (MSFM) to enhance the expressive power of the features. The MSFM module extracts features from multiple scales of an image through convolution and pooling operations, and then fuses them. A dual-stream collaborative attention enhancement module (DSSEAM) is designed in the backbone network. It adopts a dual-stream collaborative processing method, which sends the input feature map to two parallel branches, the left stream and the right stream, for processing. The information in the input feature map is extracted and enhanced through the attention mechanism.
[0012] In this embodiment, the backbone network includes five convolutional modules, two MFFDM modules, and two DSSEAM modules.
[0013] Define five convolutions as the first, second, third, fourth, and fifth convolutions; define two MFFDM modules as the first and second MFFDM modules; and define two DSSEAM modules as the first and second DSSEAM modules.
[0014] The processing flow of the backbone network is as follows: First, the input image undergoes downsampling and preliminary feature extraction through the first and second convolutions; then it is fed into the first MFFDM module to fuse features at different scales; the fused features are further processed through the third convolution and the second MFFDM module; then, they are processed sequentially through the fourth convolution, the first DSSEAM module, and the fifth convolution before being fed into the second DSSEAM module.
[0015] Define the output of the second MFFDM module as B3, the output of the first DSSEAM module as B4, and the output of the second DSSEAM module as B5.
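The backbone ordering and the three tapped outputs can be written down as a small data-flow sketch. The stage names (`conv1`, `mffdm1`, and so on) and the `run_backbone` helper are hypothetical; the layers themselves are supplied by the caller as plain callables, so nothing here assumes a particular deep learning framework or any layer hyperparameters.

```python
# Stage order as described in the text; names are illustrative only.
BACKBONE_STAGES = [
    ("conv1", "Conv"),
    ("conv2", "Conv"),
    ("mffdm1", "MFFDM"),
    ("conv3", "Conv"),
    ("mffdm2", "MFFDM"),    # output tapped as B3
    ("conv4", "Conv"),
    ("dsseam1", "DSSEAM"),  # output tapped as B4
    ("conv5", "Conv"),
    ("dsseam2", "DSSEAM"),  # output tapped as B5
]

# Which stage outputs are exported, and under which name.
TAPS = {"mffdm2": "B3", "dsseam1": "B4", "dsseam2": "B5"}


def run_backbone(x, layers):
    """Apply the stages in order; `layers` maps stage name -> callable."""
    taps = {}
    for name, _kind in BACKBONE_STAGES:
        x = layers[name](x)
        if name in TAPS:
            taps[TAPS[name]] = x
    return taps
```

Passing stand-in layers that simply record their own name makes it easy to check which stages contribute to each of B3, B4, and B5.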
[0016] B3, B4, and B5 are fed into the Sobel HyperACE module for processing to obtain feature tensors H3, H4, and H5; B3, B4, and B5 are then added element-wise to H3, H4, and H5, respectively.
[0017] In the backbone network, conventional convolutional modules undertake basic feature downsampling and preliminary extraction tasks, laying the foundation for subsequent feature processing. The MFFDM module plays a crucial role in fusing multi-scale features, greatly enhancing the expressive power of features by integrating feature information at different scales, enabling the model to better capture various details and structures in the image. The DSSEAM module focuses on spatial and channel dimensions, applying an attention mechanism that allows the model to pay more attention to important feature regions, effectively preserving feature information of small objects and preventing them from being weakened or lost during network propagation. After a series of processing steps, the backbone network outputs feature maps at three different scales: B3, B4, and B5. These feature maps contain semantic information at different levels of the image.
[0018] The Sobel HyperACE module, a key component of the personnel intrusion detection model, receives three feature maps (B3, B4, and B5) at different scales from the backbone network. It highlights edge information in the image through edge enhancement techniques, improves feature extraction efficiency through multi-dimensional parallel processing, and then leverages hypergraph convolutional high-order inference to mine high-order semantic relationships between features. Finally, it outputs feature tensors H3, H4, and H5 containing rich high-order semantic relationships and precise localization information. To ensure the effective utilization of high-order information during the detection process, this invention employs a dual residual injection mechanism. Before B3, B4, and B5 enter the neck network, and before the final fused feature map output from the neck network enters the detection head, they are added element-wise with H3, H4, and H5, respectively. This design allows high-order semantic priors and edge localization information to directly reach the prediction end, effectively preventing information loss during propagation in deep networks.
[0019] The neck network consists of two upsampling modules, four Concat modules, four MFFDM modules, and two convolutional modules.
[0020] Define two upsampling modules as the first and second upsampling modules; define four Concat modules as the first, second, third, and fourth Concat modules; define four MFFDM modules as the third, fourth, fifth, and sixth MFFDM modules; define two convolutions as the sixth and seventh convolutions. Define the feature obtained by adding B3 and H3 element by element as the first H3 feature; define the feature obtained by adding B4 and H4 element by element as the first H4 feature; define the feature obtained by adding B5 and H5 element by element as the first H5 feature.
The processing flow of the neck network is as follows: First, the first H5 feature is processed by the first upsampling module and then fused with the first H4 feature in the first Concat module; the fused feature is then processed by the third MFFDM module, and then processed by the second upsampling module before being fused with the first H3 feature in the second Concat module. The features fused by the second Concat module are then sent to the fourth MFFDM module for processing.
The features output by the third MFFDM module are added element-wise to H4 to obtain the second H4 feature; the features output by the fourth MFFDM module are added element-wise to H3 to obtain the second H3 feature.
Then, the second H3 feature is processed by the sixth convolution and then fused with the second H4 feature in the third Concat module. The features fused by the third Concat module are sent to the fifth MFFDM module for processing, and after the seventh convolution, they are fused with the first H5 feature in the fourth Concat module. The features fused by the fourth Concat module are then sent to the sixth MFFDM module for processing.
The outputs of the fourth, fifth, and sixth MFFDM modules are added element-wise to H3, H4, and H5 respectively and then sent to the detection head.
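The neck wiring described above, including both residual injections, can be traced with a short sketch. The function name `neck_forward` and the operator keys (`up1`, `concat1`, `mffdm3`, ...) are hypothetical labels for the modules named in the text; every operator is passed in as a callable, so the sketch shows only the connection order and not any layer internals.

```python
def neck_forward(B, H, ops):
    """Trace the neck data flow.

    B and H are dicts keyed by 3, 4, 5 (backbone and Sobel HyperACE outputs).
    ops maps module names to callables supplied by the caller.
    """
    add = ops["add"]

    # First residual injection: before the neck.
    h3_1 = add(B[3], H[3])
    h4_1 = add(B[4], H[4])
    h5_1 = add(B[5], H[5])

    # Top-down path.
    m3 = ops["mffdm3"](ops["concat1"](ops["up1"](h5_1), h4_1))
    m4 = ops["mffdm4"](ops["concat2"](ops["up2"](m3), h3_1))

    # Second H4 / H3 features.
    h4_2 = add(m3, H[4])
    h3_2 = add(m4, H[3])

    # Bottom-up path.
    m5 = ops["mffdm5"](ops["concat3"](ops["conv6"](h3_2), h4_2))
    m6 = ops["mffdm6"](ops["concat4"](ops["conv7"](m5), h5_1))

    # Second residual injection: before the detection head.
    # The fourth MFFDM output plus H3 is the second H3 feature computed above.
    return h3_2, add(m5, H[4]), add(m6, H[5])
```

Running it with string-building stand-ins for the operators produces a readable expression for each of the three head inputs, which is a convenient way to compare the wiring against Figure 2.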
[0021] The neck network is primarily responsible for bidirectional, multi-scale feature fusion. The top-down fusion process utilizes upsampling and concatenation operations to combine deep, high-semantic features with shallow, high-resolution features, resulting in features that possess both rich semantic information and high spatial resolution. The bottom-up fusion further integrates features from different scales through convolution and concatenation. After each concatenation and fusion, the MFFDM module processes the fused features again, effectively eliminating the aliasing effect caused by multi-scale feature fusion and further enhancing feature representation. Finally, the three fused feature maps output by the neck network are fed into three detection heads at different scales. Each detection head independently handles the classification, scoring, and bounding box regression tasks for large, medium, and small targets, achieving accurate classification and localization for target detection at the construction site. This design significantly improves the model's detection performance in complex construction environments such as occlusion, density, and multi-scale conditions, providing strong support for safety management and intelligent monitoring at construction sites.
[0022] As shown in Figure 3, the processing flow of the MFFDM module in this embodiment is as follows: First, the input feature map is subjected to preliminary feature extraction and smoothing through conventional convolution, laying the foundation for subsequent processing.
[0023] The feature map after convolution is fed into two parallel branches for further processing.
[0024] One branch processes data through the MSFM module, generating feature maps rich in multi-scale local details. This module, with its unique multi-scale feature extraction and fusion mechanism, can deeply mine feature information from different scales, and is particularly adept at capturing local features of targets and key features at different scales. This branch can accurately capture the local features of small targets, avoiding the loss of feature information during transmission in deep networks.
[0025] Another branch constructs a "Conv - Dilated Conv - Conv" feature extraction path, sequentially processing convolution, dilated convolution, and convolution to obtain a contextual feature map with a broad receptive field. Specifically, the features are first further transformed through the first convolutional layer (Conv); then, a dilated convolutional layer (Dilated Conv) is used to significantly expand the receptive field of the convolutional kernel without increasing the number of model parameters or computational complexity, thereby obtaining a wider range of contextual information, which is crucial for grasping the overall structure of the scene and identifying large-scale targets; subsequently, the features are further optimized and adjusted through a second convolutional layer (Conv). This branch, leveraging the broad contextual information obtained through dilated convolution, helps to better understand the overall scene situation and identify large-scale targets, such as surrounding security facilities and site layout.
[0026] The feature maps output from the two branches—the feature map rich in multi-scale local details output by the MSFM module and the context feature map with a wide receptive field—are concatenated along the channel dimension to obtain a composite feature map. This concatenated composite feature map serves as the final output of the MFFDM module.
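The claim that dilated convolution widens the receptive field without adding parameters can be checked with simple arithmetic. The sketch below uses the standard relation for the effective size of a dilated kernel, k + (k - 1)(d - 1). The 3×3 kernels and the dilation rate of 2 in the usage note are assumptions chosen for illustration; the text does not state the values used in the "Conv - Dilated Conv - Conv" path.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def conv_weight_count(in_channels, out_channels, kernel_size):
    """Number of weights in a square 2D convolution (bias excluded).

    Dilation does not appear here: it changes tap spacing, not tap count.
    """
    return in_channels * out_channels * kernel_size * kernel_size


def stacked_receptive_field(layers):
    """Receptive field of stride-1 layers given as (kernel_size, dilation)."""
    rf = 1
    for kernel_size, dilation in layers:
        rf += effective_kernel_size(kernel_size, dilation) - 1
    return rf
```

With these assumed values, `stacked_receptive_field([(3, 1), (3, 2), (3, 1)])` gives 9, compared with 7 for three plain 3×3 convolutions, while `conv_weight_count` is identical for both stacks.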
[0027] In construction site intrusion detection scenarios, the MFFDM module plays a crucial role. Construction sites present a wide range of target scales, from small, distant intruders or objects to larger, nearby equipment. By combining feature maps rich in multi-scale local details with contextual feature maps possessing a broad receptive field through a stitching operation, the output features contain both refined local features and comprehensive contextual information. This effectively enhances the model's ability to detect targets at different scales, reduces the probability of missed and false detections, and achieves efficient and accurate performance in construction site intrusion detection tasks.
[0028] The multi-level feature fusion detection module exhibits unique advantages in the field of feature processing, and is particularly suitable for tasks such as construction site intrusion detection, which have high requirements for multi-scale target recognition.
[0029] The core of the MSFM module lies in integrating and utilizing feature information of images at different scales. As shown in Figure 4, the processing flow of the MSFM module is as follows: First, the input feature map undergoes global average pooling, which compresses the spatial dimensional information and extracts the global statistical features of each channel, providing macro-level guidance for subsequent operations.
[0030] Then, the number of channels in the feature map is flexibly adjusted through 1×1 convolution to achieve preliminary feature transformation; and it is fused with the original input feature map to form residual connections, effectively avoiding gradient vanishing and ensuring smooth information transmission in the network.
[0031] The fused feature map is then processed by dividing it into main paths and auxiliary paths.
[0032] In the main path, the fused feature map undergoes two-dimensional max pooling to accurately extract local salient features while reducing the spatial size. It is then divided into two branches for processing. The upper branch further optimizes the features through 1×1 convolution, while the lower branch undergoes two-dimensional max pooling and convolution in sequence. Finally, the feature map output from the upper branch is fused with the feature map output from the lower branch to enhance expressive power.
[0033] In the auxiliary path, the fused feature map is initially processed by a 1×1 convolution, and then two parallel 3×3 and 5×5 convolutions are used to extract local fine features and broad context features, respectively. Then the outputs of the two parallel convolutions are concatenated. Finally, the feature maps output by the main path and the auxiliary path are concatenated along the channel dimension to obtain the concatenated feature map. The concatenated feature map is then fed into two parallel branches for processing: one branch undergoes global average pooling and convolution, while the other branch is left unprocessed. Finally, the outputs of these two branches are concatenated to obtain the multi-scale feature map.
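Global average pooling is used at the start of the MSFM flow and again in its final branch. As a minimal sketch of that one operation, the function below reduces each channel of a feature map to its mean value. Representing the feature map as a list of channels, each a 2D list, is an assumption made to keep the example free of external libraries; how the pooled values are later combined with the full-resolution map is not modelled here.

```python
def global_average_pool(feature_map):
    """Reduce each channel (a 2D list of numbers) to its mean value."""
    pooled = []
    for channel in feature_map:
        total = sum(sum(row) for row in channel)
        count = len(channel) * len(channel[0])
        pooled.append(total / count)
    return pooled
```

A two-channel 2×2 input such as `[[[1, 2], [3, 4]], [[0, 0], [0, 8]]]` is reduced to one statistic per channel.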
[0034] Multi-scale feature maps not only contain rich local details but also possess global context awareness. Ultimately, these multi-scale feature maps are used in subsequent detection or recognition tasks, significantly improving the model's detection performance and robustness for multi-scale targets.
[0035] At construction sites, intrusion targets vary in size. The MSFM module, with its multi-scale feature extraction and fusion capabilities, can accurately identify targets of different sizes, effectively reducing the rate of missed detections and false detections, and providing a reliable guarantee for building a strong technical defense for safety monitoring at construction sites.
[0036] The DSSEAM module performs fine-grained processing on the feature maps it receives from the previous network layer. As shown in Figure 5, the processing flow of the DSSEAM module is as follows: First, the input feature map is fed into two parallel branches, the left stream and the right stream, for dual-stream collaborative processing, which aims to comprehensively improve the feature representation capability.
[0037] In the left-side flow branch, the input feature map is first convolved to obtain a left-side flow local feature map. The convolution kernel slides across the feature map, extracting and transforming local features and making an initial adjustment to the number of channels and the spatial relationships of the features, laying the foundation for subsequent processing. In parallel, the MSFM module processes the input feature map to obtain a multi-scale feature map. The left-side flow local feature map is multiplied element-wise with the multi-scale feature map, followed by convolution, to obtain the left-side flow fused feature map. The MSFM module uses convolution kernels of different sizes together with pooling operations to analyze the input feature map at multiple scales, capturing both local details and global contextual information. Through the multiplication operation, this multi-scale information is integrated into the left-side flow, enhancing the feature's ability to represent targets at different scales.
[0038] In the right-side flow branch, the input feature map is first processed through convolution to extract and transform local features, adjusting the feature structure to obtain the right-side flow local feature map. Simultaneously, the Sobel attention module processes the input feature map to obtain an attention weight map. The right-side flow local feature map and the attention weight map are then multiplied element-wise, followed by convolution to obtain the right-side flow fused feature map. The Sobel attention module utilizes the classic Sobel operator, which includes horizontal and vertical convolution kernels. These kernels are convolved with the feature map to calculate spatial gradient approximations, explicitly extracting high-frequency edge information from the feature map. The multiplication operation highlights target edges and contour information, allowing the model to focus more on the target shape features.
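The Sobel step described above can be illustrated on a single-channel map. The sketch applies the classic horizontal and vertical 3×3 Sobel kernels and combines the two responses into a gradient magnitude. Working on plain nested lists without padding is an assumption made for brevity, and the step that turns the gradient into an attention weight map is not specified in this paragraph, so it is left out.

```python
import math

# Classic Sobel kernels for horizontal (x) and vertical (y) gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def correlate3x3(img, kernel):
    """Slide a 3x3 kernel over a 2D list without padding (valid mode)."""
    height, width = len(img), len(img[0])
    out = []
    for i in range(height - 2):
        row = []
        for j in range(width - 2):
            acc = 0
            for a in range(3):
                for b in range(3):
                    acc += kernel[a][b] * img[i + a][j + b]
            row.append(acc)
        out.append(row)
    return out


def sobel_magnitude(img):
    """Approximate gradient magnitude from the two Sobel responses."""
    gx = correlate3x3(img, SOBEL_X)
    gy = correlate3x3(img, SOBEL_Y)
    return [[math.hypot(x, y) for x, y in zip(row_x, row_y)]
            for row_x, row_y in zip(gx, gy)]
```

On an input with a vertical step edge, the response is large at the columns next to the edge and zero in the flat region, which is the high-frequency edge signal the text refers to.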
[0039] Finally, the left-side and right-side fused feature maps are concatenated along the channel dimension and then processed sequentially through convolution, two depthwise separable convolutions, and another convolution to obtain a refined feature map. The convolution further extracts and combines features, while the depthwise separable convolution significantly reduces the number of parameters and computation while ensuring feature expressiveness, thus gradually refining the feature map and fully exploring the intrinsic relationships between features.
[0040] The refined feature map is passed through the Sigmoid activation function, which maps each element value to the (0, 1) interval and produces a fused weight map that highlights key feature regions and suppresses irrelevant information. This fused weight map combines multi-scale features and edge attention information, forming a more discriminative and richer feature representation. The fused weight map is then added element-wise to the original input feature map to obtain the final output of the module, which is passed to subsequent networks to support target recognition and detection in complex scenes.
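The final two operations of the DSSEAM module, the Sigmoid mapping and the element-wise addition with the input, can be sketched as follows. The function name `dsseam_residual_output` and the 2D-list representation are assumptions for illustration; the Sigmoid is written in a numerically stable form so that large negative inputs do not overflow.

```python
import math


def sigmoid(x):
    """Map a real number into the open interval (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # safe for large negative x
    return z / (1.0 + z)


def dsseam_residual_output(refined, original):
    """Sigmoid of the refined map, added element-wise to the input map.

    Both arguments are 2D lists of the same shape.
    """
    return [[sigmoid(r) + o for r, o in zip(row_r, row_o)]
            for row_r, row_o in zip(refined, original)]
```

Because the weight map lies strictly between 0 and 1, each output element exceeds the corresponding input element by less than 1 under this additive form.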
[0041] Traditional YOLO neck networks (such as BiFPN or PANet) employ top-down and bottom-up implicit feature fusion, where boundary information at different scales is easily blurred or submerged during repeated convolution and concatenation. This invention designs the Sobel HyperACE module as a "global external brain." As shown in Figure 6, the processing flow of the Sobel HyperACE module in this embodiment is as follows: First, multi-scale feature maps from different depths of the backbone network are received, namely, the shallow feature map B3 output by the second MFFDM module, the mid-level feature map B4 output by the first DSSEAM module, and the high-level feature map B5 output by the second DSSEAM module. B3 is a shallow high-resolution feature map, rich in detailed information; B5 is a deep low-resolution feature map, containing sufficient global semantic information.
[0042] To enhance the model's ability to perceive target contours in complex backgrounds, adjacent scale features B3 and B5 are input into the Sobel attention mechanism module for processing. The Sobel attention mechanism utilizes the Sobel operator, a classic discrete differential operator, which can effectively extract high-frequency edge signals in space. Through this operation, an attention mask is generated, which explicitly enhances the boundary and contour features of the target at these two scales while effectively suppressing irrelevant background noise.
[0043] Because feature maps at different depths have varying resolutions, to enable their fusion in the same spatial dimension, the attention-enhanced B3 feature is downsampled to obtain an aligned B3, thus reducing its spatial size. Simultaneously, the attention-enhanced B5 feature is upsampled to obtain an aligned B5, thus increasing its spatial size. After this operation, the spatial resolutions of B3, B4 (the central reference feature without scale transformation), and B5 are perfectly aligned.
[0044] Then, the aligned B3, B5, and B4 are spliced together along the channel dimension to form a large channel feature map.
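The scale alignment and channel-wise concatenation can be sketched as follows. The specification does not name the resampling operators, so 2×2 average pooling and nearest-neighbour upsampling are assumptions, as are the channel counts and spatial sizes.

```python
import numpy as np


def downsample2x(x):
    """Halve height and width with 2x2 average pooling (assumed operator)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))


def upsample2x(x):
    """Double height and width by nearest-neighbour repetition (assumed operator)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


# Hypothetical shapes: B3 is shallow and large, B5 is deep and small.
b3 = np.ones((64, 80, 80))
b4 = np.ones((128, 40, 40))
b5 = np.ones((256, 20, 20))

aligned_b3 = downsample2x(b3)   # now 40 x 40
aligned_b5 = upsample2x(b5)     # now 40 x 40
large = np.concatenate([aligned_b3, aligned_b5, b4], axis=0)
print(large.shape)
```

B4 is left untouched and acts as the reference resolution, so only the channel axis grows during concatenation.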
[0045] The large-channel feature map undergoes a 1×1 convolution, which enables cross-channel information exchange and preliminary feature fusion while reducing channel dimensionality, thus alleviating subsequent computational burden. The preliminarily dimensionality-reduced feature map then enters a Split operation, uniformly dividing it into four equal-width feature sub-maps along the channel dimension. Assuming a total number of channels is c, each feature sub-map has c / 4 channels. The four equal-width feature sub-maps are fed into four parallel branches for differential processing to extract representations of different dimensions.
[0046] The feature sub-maps of the first and second branches are processed by the C3AH module. The C3AH module uses an adaptive hypergraph convolution method, which can effectively infer the high-order logical relationships between multiple target entities in complex, dense and occluded scenes.
[0047] The feature sub-map of the third branch is processed by the MSFM module; the MSFM module can analyze and process features from multiple scales, fully explore the multi-scale information in the features, accurately capture the feature performance of the target at different scales, and enhance the model's comprehensive perception of the target features.
[0048] The fourth branch, acting as a residual bypass, is passed directly without any processing. This design not only preserves the most original basic feature information but also provides an unobstructed shortcut for gradient backpropagation, effectively preventing degradation in deep networks.
[0049] The outputs of the four branches (high-order semantic features from the two C3AH modules, multi-scale features from the MSFM module, and basic features from the residual bypass) are then concatenated along the channel dimension and processed by a 1×1 convolution to obtain the output feature map, which is then used by the subsequent detection head. This convolutional layer can eliminate feature artifacts caused by direct concatenation of multiple branches and achieve deep and smooth fusion of representations in different dimensions.
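The split, four-branch processing, and fusion described above can be sketched as follows. The C3AH and MSFM modules are passed in as placeholder callables (identity functions in the example), so the sketch shows only the data flow, not the modules themselves; all names and channel counts are hypothetical.

```python
import numpy as np


def conv1x1(x, weight):
    """1x1 convolution as a per-pixel linear map; weight has shape (c_out, c_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


def four_branch_fusion(x, w_in, w_out, c3ah, msfm):
    y = conv1x1(x, w_in)                       # channel reduction to c
    s1, s2, s3, s4 = np.split(y, 4, axis=0)    # four sub-maps of c / 4 channels
    branches = [c3ah(s1), c3ah(s2), msfm(s3), s4]  # s4 is the residual bypass
    return conv1x1(np.concatenate(branches, axis=0), w_out)


def identity(t):
    return t                                   # stand-in for the real modules


rng = np.random.default_rng(0)
x = rng.standard_normal((448, 8, 8))
w_in = rng.standard_normal((64, 448))
w_out = rng.standard_normal((64, 64))
out = four_branch_fusion(x, w_in, w_out, identity, identity)
```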
[0050] The Sobel attention module aims to spatially recalibrate features using prior edge information from the image, receiving feature maps from the previous network layer. As shown in Figure 7, the processing flow of the Sobel attention module is as follows: First, the input feature map is divided into two parallel branches.
[0051] One branch is left unprocessed to preserve the identity mapping of the original information.
[0052] Another branch extracts high-frequency edge features using the Sobel module, then processes them using the Sigmoid activation function to obtain a spatial attention weight map based on edge saliency. Specifically, the input feature map is first processed by the Sobel module; this module typically contains horizontal and vertical Sobel convolution kernels. The Sobel convolution kernel is a classic discrete differential operator used to calculate the gradient approximation of the feature map in space. Through this operation, the module can explicitly extract high-frequency edge features (i.e., the edges, contours, and boundaries of the target) from the feature map. Subsequently, the extracted high-frequency edge features are processed by the Sigmoid activation function; this function maps the edge gradient values to the interval (0, 1), generating a spatial attention weight map (Attention Mask) based on edge saliency. In this map, regions with more pronounced edge contours have weight values closer to 1; while smooth background regions have weight values closer to 0.
[0053] Finally, the feature maps output from the two branches are multiplied element-wise, so that the edge-based weights modulate the original features; the modulated result, which this specification also refers to as the attention weight map, is used as the final output of the Sobel attention module and passed to subsequent networks.
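The Sobel attention flow can be sketched in NumPy as follows. Combining the horizontal and vertical responses as a gradient magnitude is an assumption, since the specification does not say how the two kernels are merged. Note also that a plain Sigmoid applied to a non-negative magnitude gives 0.5 in flat regions rather than a value near 0; reaching weights near 0 would require a learned bias or normalization that the specification does not describe.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T


def filter3x3(channel, kernel):
    """Zero-padded 3x3 cross-correlation of one 2-D channel."""
    h, w = channel.shape
    padded = np.pad(channel, 1)
    out = np.zeros((h, w))
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * padded[i:i + h, j:j + w]
    return out


def sobel_attention(x):
    """x: (channels, height, width). Returns the modulated features and the mask."""
    gx = np.stack([filter3x3(c, SOBEL_X) for c in x])
    gy = np.stack([filter3x3(c, SOBEL_Y) for c in x])
    edges = np.sqrt(gx ** 2 + gy ** 2)      # assumed way of merging both kernels
    mask = 1.0 / (1.0 + np.exp(-edges))     # Sigmoid, values in (0, 1)
    return x * mask, mask                   # identity branch times edge branch


# A vertical step edge: left half 0, right half 1.
step = np.zeros((1, 6, 6))
step[:, :, 3:] = 1.0
modulated, mask = sobel_attention(step)
```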
[0054] The C3AH module employs a dual-branch structure, splitting the input features for processing. This module receives feature maps from the previous network layer. As shown in Figure 8, the processing flow of the C3AH module is as follows: First, the input feature map is fed into two parallel branches for processing: the left branch and the right branch.
[0055] In the left branch, the input feature map goes through only one 1×1 convolutional layer to obtain basic local features. This operation is mainly used to compress the channel dimension of the input feature map, while preserving the spatial positioning information and basic local features in the original feature map to the greatest extent, providing a direct gradient backpropagation path for subsequent operations.
[0056] In the right-hand branch, the input feature map first undergoes a convolution to adjust its feature channels, reducing the number of parameters and computational overhead in subsequent hypergraph calculations. Subsequently, the channel-adjusted feature map enters the Adaptive Hyperedge Generation module, which combines global feature prototypes with local dynamic features to construct a high-order relational topology (i.e., hyperedge) adapted to the current scene. The feature map after hyperedge generation is further input into the Hypergraph Convolution module to obtain high-order inference features. In this structure, bidirectional information aggregation and distribution occur between nodes and hyperedges, completing high-order contextual information interaction and inference of deep features.
[0057] The basic local features output from the left branch and the higher-order inference features output from the right branch are concatenated along the channel dimension.
[0058] The concatenated features are fused and smoothed across channels using a 1×1 convolution, and then restored to the required output channel dimension as the final output of the C3AH module.
[0059] The adaptive hyperedge generation module receives input features and aims to construct adaptive higher-order relations (hyperedges) by combining global prototypes and local dynamic features. As shown in Figure 9, the processing flow of the adaptive hyperedge generation module is as follows: The input features are first flattened to reshape the multidimensional feature tensor into a sequence of feature vectors.
[0060] The flattened features are processed in two parallel branches: one is a pooling-projection branch (left and middle paths) for generating dynamic offsets, and the other is a direct projection branch (right path) for feature mapping.
[0061] In the pooling-projection branch, the flattened features are simultaneously input into the max pooling layer and the average pooling layer. The salient features output by the max pooling layer and the global background features output by the average pooling layer are then fused in the channel or feature dimension through a concat operation. The fused features are then passed through a projection layer (usually a fully connected layer or a linear transformation layer) for dimensionality reduction and feature mapping, ultimately generating a dynamic offset for the current input.
[0062] In the direct projection branch, the flattened features are directly passed through an independent projection layer, which maps them onto the feature space where the target hyperedge is located, serving as the basic features to be modulated.
[0063] Meanwhile, a learnable or pre-defined global prototype feature is introduced, and this global prototype feature is added element-wise to the dynamic offset generated above. This step generates adaptive adjustment weights that contain both global prior knowledge and adapt to the current specific input.
[0064] The output of the additive fusion is multiplied element-wise with the basic features obtained from the direct projection branch. The fused weights are then used to recalibrate and modulate the basic features. The final result after multiplication and modulation is the generated adaptive hyperedges.
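One possible reading of the adaptive hyperedge generation flow is sketched below. The specification gives no tensor shapes, pooling axis, or prototype layout, so everything about the dimensions is an assumption: pooling is taken over the flattened positions, a single prototype vector is used, and the adjustment weights are broadcast across all positions. The function and parameter names are hypothetical.

```python
import numpy as np


def adaptive_hyperedges(feature_map, prototype, w_offset, w_direct):
    c, h, w = feature_map.shape
    nodes = feature_map.reshape(c, h * w).T            # flatten to (N, C)

    # Pooling-projection branch: max and average pooling, concat, projection.
    pooled = np.concatenate([nodes.max(axis=0), nodes.mean(axis=0)])  # (2C,)
    offset = pooled @ w_offset                          # dynamic offset, (D,)

    # Global prototype plus dynamic offset gives the adjustment weights.
    weights = prototype + offset                        # (D,)

    # Direct projection branch gives the basic features to be modulated.
    base = nodes @ w_direct                             # (N, D)
    return base * weights                               # element-wise modulation


rng = np.random.default_rng(0)
fmap = rng.standard_normal((16, 4, 4))
w_offset = rng.standard_normal((32, 8))
w_direct = rng.standard_normal((16, 8))
prototype = rng.standard_normal(8)
hyperedges = adaptive_hyperedges(fmap, prototype, w_offset, w_direct)
```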
[0065] The hypergraph convolution module receives the flattened features output by the adaptive hyperedge generation module and updates the feature representations by exchanging information between the nodes and the hyperedges. As shown in Figure 10, the processing flow of the hypergraph convolution module is as follows: The module receives a flattened one-dimensional feature sequence as the initial input. The input features first enter the information aggregation stage. In this stage, the system uses the hypergraph's incidence matrix to aggregate all relevant nodes connected by the same hyperedge (or the information of the hyperedge itself). This step collects local neighborhood information, enabling each node to integrate contextual information from the nodes with which it has higher-order relationships.
[0066] The aggregated features are input to the feature projection layer. This layer performs linear or nonlinear mapping on the aggregated high-dimensional features to extract deeper semantic representations and adjust the feature dimensions.
[0067] After being processed by the feature projection layer, the features enter the information distribution / propagation stage. In this stage, the updated node features are back-broadcast or distributed to the hyperedges to which they belong, thereby updating the global state representation of the hyperedges. This bidirectional information flow of "node-hyperedge-node" completes the full interaction of higher-order information.
[0068] The distributed features re-enter the feature projection layer for final feature space transformation and smoothing, outputting a feature tensor modulated with higher-order relations.
[0069] Furthermore, the module introduces a residual structure: the original input features and the features output by the second feature projection layer are added element-wise, and the result of this residual fusion is used as the final output of the hypergraph convolution module.
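The aggregation, projection, distribution, projection, and residual steps can be sketched as follows. This is a minimal illustration under stated assumptions: a binary incidence matrix is supplied directly, aggregation and distribution are degree-normalized means, and both projection layers are plain linear maps. The specification does not fix any of these details.

```python
import numpy as np


def hypergraph_conv(x, incidence, w_edge, w_node):
    """x: (N, D) node features; incidence: (N, M), 1 where a node is in a hyperedge."""
    edge_deg = np.maximum(incidence.sum(axis=0, keepdims=True), 1.0)  # (1, M)
    node_deg = np.maximum(incidence.sum(axis=1, keepdims=True), 1.0)  # (N, 1)

    edge_feat = (incidence / edge_deg).T @ x        # aggregation: nodes -> hyperedges
    edge_feat = edge_feat @ w_edge                  # first feature projection
    node_feat = (incidence / node_deg) @ edge_feat  # distribution: hyperedges -> nodes
    node_feat = node_feat @ w_node                  # second feature projection
    return x + node_feat                            # residual fusion


# Three nodes, two hyperedges; the middle node belongs to both.
incidence = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
x = np.array([[1.0, 0.0], [3.0, 0.0], [5.0, 0.0]])
out = hypergraph_conv(x, incidence, np.eye(2), np.eye(2))
```

With identity projections, each node simply receives the mean of the hyperedge means it belongs to, which makes the "node-hyperedge-node" flow easy to trace by hand.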
[0070] Step 3. Train the construction site personnel intrusion detection model based on the training dataset from Step 1, and use the trained model to identify real-time images of the construction site, thereby achieving personnel intrusion detection.
[0071] The training set is fed into the model for training, and the validation set is used to validate the model after each training round. After a certain number of training rounds, the best model weights are obtained. The model is then evaluated on the test set, and the model with the optimal parameters is deployed.
[0072] The scope of the dangerous area on the actual construction site is marked, and its coordinates are calculated. When the coordinates of personnel detected by the model overlap with the dangerous area, a dangerous intrusion alarm is triggered.
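The alarm rule can be sketched as a point-in-polygon test. The specification does not say which coordinate represents a detected person, so the bottom-center of the bounding box (an approximate foot position) is an assumption, as are the polygon representation of the dangerous area and all names used here.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices in order."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):                     # edge straddles the ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def intrusion_alarms(boxes, danger_zone):
    """boxes: list of (x_min, y_min, x_max, y_max) in image pixels."""
    alarms = []
    for x_min, y_min, x_max, y_max in boxes:
        foot_x = (x_min + x_max) / 2                 # assumed reference point
        foot_y = y_max
        alarms.append(point_in_polygon(foot_x, foot_y, danger_zone))
    return alarms


zone = [(100, 100), (300, 100), (300, 300), (100, 300)]   # hypothetical area
detections = [(150, 50, 200, 250), (400, 50, 450, 250)]   # hypothetical boxes
alarms = intrusion_alarms(detections, zone)
print(alarms)
```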
[0073] Compared to advanced one-stage pure convolutional detection algorithms such as YOLOv8, this invention effectively overcomes their performance bottlenecks of "limited local feature perception" and "loss of deep edge information" in complex construction site conditions. By innovatively introducing the C3AH module, this invention extends traditional local feature extraction to "multi-entity high-order logical reasoning," enabling the inverse inference of occluded personnel using related objects in the scene, significantly reducing the false negative rate in densely occluded scenes. Simultaneously, this invention combines the explicit edge enhancement and dual residual injection mechanism of the Sobel HyperACE module with the multi-scale receptive field parallel processing of the MSFM module, solving the problems of blurred contours and poor adaptability to extreme size changes in deep feature fusion of traditional YOLO networks, achieving accurate stripping and localization of multi-scale targets in complex backgrounds.
[0074] Compared to global attention detection algorithms based on Transformers (such as the DETR series), this invention achieves effective fusion of multi-level features while maintaining the algorithm's lightweight characteristics. It avoids introducing excessive parameters and computational load due to the fusion of complex features, integrating information in a YOLO-like efficient manner. This enables the model to quickly process multi-source data on edge devices with limited computing power at construction sites, providing comprehensive information for accurate detection.
[0075] Embodiment 2. This embodiment describes a construction site personnel intrusion detection system based on an improved YOLO model. This system is based on the same inventive concept as the construction site personnel intrusion detection method based on an improved YOLO model in Embodiment 1 above.
[0076] The construction site personnel intrusion detection system based on the improved YOLO model includes the following modules: The preprocessing module is used to collect on-site image data, preprocess the on-site image data, and build a training dataset. The personnel intrusion detection module is used to build a construction site personnel intrusion detection model based on the improved YOLOv13 architecture. This model includes a backbone network, a neck network, and a detection head. It is improved upon the original YOLOv13 architecture as follows: The Sobel attention module is introduced into the original HyperACE module to enhance high-frequency edge features, thus forming the SobelHyperACE module; A multi-level feature fusion detection module (MFFDM) is designed in the backbone network and the neck network. It extracts and fuses features of different scales through the multi-scale fusion module (MSFM) to enhance the expressive power of the features. The MSFM module extracts features from multiple scales of an image through convolution and pooling operations, and then fuses them. A dual-stream collaborative attention enhancement module (DSSEAM) is designed in the backbone network. It uses a dual-stream collaborative processing approach for feature extraction and enhances the information in the input feature map by introducing an attention mechanism. The intrusion detection model for construction sites is trained based on the training dataset, and the trained model is used to identify real-time images of construction sites, thereby realizing intrusion detection.
[0077] It should be noted that any content not mentioned in the above-described functional modules of the system described in Embodiment 2 can be referred to the step description of the corresponding method in Embodiment 1 above, and will not be repeated in detail here.
[0078] Of course, the above description is only a preferred embodiment of the present invention. The present invention is not limited to the above-described embodiments. It should be noted that any equivalent substitutions or obvious modifications made by those skilled in the art under the guidance of this specification fall within the scope of this specification and should be protected by the present invention.
Claims
1. A construction site personnel intrusion detection method based on an improved YOLO model, characterized in that, Includes the following steps: Step 1. Collect on-site image data, preprocess the on-site image data, and construct a training dataset; Step 2. Build a construction site personnel intrusion detection model based on the improved YOLOv13. This model includes a backbone network, a neck network, and a detection head; it is obtained by improving the original YOLOv13 architecture as follows: A multi-level feature fusion detection module (MFFDM) is designed in the backbone network and the neck network. It extracts and fuses features of different scales through the multi-scale fusion module (MSFM) to enhance the expressive power of the features. The MSFM module extracts features from multiple scales of an image through convolution and pooling operations, and then fuses them. A dual-stream collaborative attention enhancement module (DSSEAM) is designed in the backbone network. It uses a dual-stream collaborative processing approach for feature extraction and enhances the information in the input feature map by introducing an attention mechanism. The Sobel attention module is introduced into the original HyperACE module to enhance high-frequency edge features, thus forming the Sobel HyperACE module; Step 3. Train the construction site personnel intrusion detection model based on the training dataset from Step 1, and use the trained model to identify real-time images of the construction site, thereby achieving personnel intrusion detection.
2. The construction site personnel intrusion detection method based on the improved YOLO model according to claim 1, characterized in that, In step 2, the backbone network includes five convolutional modules, two MFFDM modules, and two DSSEAM modules. Define five convolutions as the first, second, third, fourth, and fifth convolutions; define two MFFDM modules as the first and second MFFDM modules; define two DSSEAM modules as the first and second DSSEAM modules. The processing flow of the backbone network is as follows: First, the input image undergoes downsampling and preliminary feature extraction through the first and second convolutions; then it is fed into the first MFFDM module to fuse features at different scales; the fused features are further processed through the third convolution and the second MFFDM module; then, they are processed sequentially through the fourth convolution, the first DSSEAM module, and the fifth convolution before being fed into the second DSSEAM module. Define the output of the second MFFDM module as B3, the output of the first DSSEAM module as B4, and the output of the second DSSEAM module as B5; B3, B4, and B5 are fed into the Sobel HyperACE module for processing to obtain feature tensors H3, H4, and H5; B3, B4, and B5 are added element-wise to H3, H4, and H5 respectively.
3. The construction site personnel intrusion detection method based on the improved YOLO model according to claim 2, characterized in that, In step 2, the neck network includes two upsampling modules, four concat modules, four MFFDM modules, and two convolutions; the two upsampling modules are defined as the first and second upsampling modules; the four concat modules are defined as the first, second, third, and fourth concat modules; the four MFFDM modules are defined as the third, fourth, fifth, and sixth MFFDM modules; and the two convolutions are defined as the sixth and seventh convolutions. Define the feature obtained by adding B3 and H3 element by element as the first H3 feature; define the feature obtained by adding B4 and H4 element by element as the first H4 feature; define the feature obtained by adding B5 and H5 element by element as the first H5 feature; The processing flow of the neck network is as follows: First, the first H5 feature is processed by the first upsampling module and then fused with the first H4 feature in the first Concat module; the fused feature is then processed by the third MFFDM module, and then processed by the second upsampling module before being fused with the first H3 feature in the second Concat module. The features fused by the second Concat module are then sent to the fourth MFFDM module for processing. The features output by the third MFFDM module are added element-wise to H4 to obtain the second H4 feature; the features output by the fourth MFFDM module are added element-wise to H3 to obtain the second H3 feature. Then, the second H3 feature is processed by the sixth convolution and then fused with the second H4 feature in the third Concat module; The features fused by the third Concat module are sent to the fifth MFFDM module for processing, and after the seventh convolution process, they are fused with the first H5 features in the fourth Concat module. 
The features fused by the fourth Concat module are then sent to the sixth MFFDM module for processing. The outputs of the fourth, fifth, and sixth MFFDM modules are added element-wise to H3, H4, and H5 respectively and then sent to the detection head.
4. The construction site personnel intrusion detection method based on the improved YOLO model according to claim 1, characterized in that, In step 2, the processing flow of the MFFDM module is as follows: First, the input feature map undergoes preliminary feature extraction and smoothing through convolution. The convolution-processed feature map is then fed into two parallel branches. One branch processes the feature map through the MSFM module to generate a feature map containing multi-scale local details. The other branch processes the feature map through convolution, dilated convolution, and convolution in sequence to obtain a contextual feature map with a wide receptive field. The feature maps output from the two branches are then concatenated along the channel dimension to obtain a comprehensive feature map.
5. The construction site personnel intrusion detection method based on the improved YOLO model according to claim 4, characterized in that, The processing flow of the MSFM module is as follows: First, the input feature map is compressed using global average pooling to reduce its spatial dimensionality. Then, the number of channels in the feature map is adjusted using 1×1 convolution, and the feature map is then fused with the original input feature map. The fused feature map is processed by dividing it into a main path and an auxiliary path. In the main path, the fused feature map is processed by two-dimensional max pooling to extract local salient features, and then divided into two branches for processing. The upper branch further optimizes the features through 1×1 convolution, and the lower branch is processed by two-dimensional max pooling and convolution in sequence. Then the feature map output by the upper branch is fused with the feature map output by the lower branch to enhance the expressive power. In the auxiliary path, the fused feature map is initially processed by 1×1 convolution, and then local fine features and broad context features are extracted by parallel 3×3 convolution and 5×5 convolution respectively. Then the outputs of the two parallel convolutions are concatenated. Finally, the feature maps output by the main path and the auxiliary path are concatenated along the channel dimension to obtain the concatenated feature map. The concatenated feature map is fed into two parallel branches for processing: one branch performs global average pooling and convolution, while the other branch does not perform any processing; finally, the outputs of the two branches are concatenated to obtain a multi-scale feature map.
6. The construction site personnel intrusion detection method based on the improved YOLO model according to claim 1, characterized in that, In step 2, the processing flow of the DSSEAM module is as follows: First, the input feature map is fed into two parallel branches, the left stream and the right stream, for processing. In the left-side flow branch, the input feature map is first processed by convolution to obtain the left-side flow local feature map, and then processed by the MSFM module to obtain the multi-scale feature map. The left-side flow local feature map and the multi-scale feature map are multiplied element-wise, and then processed by convolution to obtain the left-side flow fused feature map. In the right-side flow branch, the input feature map is first processed by convolution to obtain the right-side flow local feature map, and then processed by the Sobel attention module to obtain the attention weight map. The right-side flow local feature map and the attention weight map are multiplied element-wise, and then processed by convolution to obtain the right-side flow fused feature map. The left and right flow fusion feature maps are concatenated along the channel dimension and then processed sequentially through convolution, two depthwise separable convolutions, and another convolution to obtain a refined feature map. The refined feature map is then processed by the Sigmoid activation function to obtain a fusion weight map. The fusion weight map is then added element-wise to the original input feature map to obtain the final output.
7. The construction site personnel intrusion detection method based on the improved YOLO model according to claim 1, characterized in that, In step 2, the processing flow of the Sobel HyperACE module is as follows: First, multi-scale feature maps from different depths of the backbone network are received, namely, shallow feature map B3 output by the second MFFDM module, mid-level feature map B4 output by the first DSSEAM module, and high-level feature map B5 output by the second DSSEAM module. B3 is then processed by the Sobel attention module and downsampling to obtain the aligned B3. B5 is then processed by the Sobel attention module and upsampling to obtain the aligned B5. Then, the aligned B3, aligned B5 and B4 are spliced together along the channel dimension to form a large channel feature map; After the large channel feature map is processed by 1×1 convolution, it is divided into four feature sub-maps of equal width in the channel dimension and fed into four parallel branches for processing: the feature sub-maps of the first and second branches are processed by the C3AH module; the feature sub-maps of the third branch are processed by the MSFM module; and the fourth branch is not processed. The outputs of the four branches are then concatenated along the channel dimension and processed by a 1×1 convolution to obtain the output feature map.
8. The method for detecting personnel intrusion at construction sites based on an improved YOLO model according to claim 1, characterized in that, In step 2, the processing flow of the Sobel attention module is as follows: First, the input feature map is divided into two parallel branches; one branch is left unprocessed to preserve the identity mapping of the original information; the other branch extracts high-frequency edge features through the Sobel module and then processes them through the Sigmoid activation function to obtain a spatial attention weight map based on edge saliency. Finally, the feature maps output from the two branches are multiplied element by element to obtain the attention weight map.
9. The method for detecting personnel intrusion at construction sites based on an improved YOLO model according to claim 1, characterized in that, In step 1, the preprocessing specifically involves: creating labels for the collected image data using LabelImg, and dividing the data into a training set, a validation set, and a test set in a ratio of 8:1:1.
10. A construction site personnel intrusion detection system based on an improved YOLO model, characterized in that, Includes the following modules: The preprocessing module is used to collect on-site image data, preprocess the on-site image data, and build a training dataset. The personnel intrusion detection module is used to build a construction site personnel intrusion detection model based on the improved YOLOv13. This model includes a backbone network, a neck network, and a detection head. It is improved upon the original YOLOv13 architecture as follows: The Sobel attention module is introduced into the original HyperACE module to enhance high-frequency edge features, thus forming the Sobel HyperACE module; A multi-level feature fusion detection module (MFFDM) is designed in the backbone network and the neck network. It extracts and fuses features of different scales through the multi-scale fusion module (MSFM) to enhance the expressive power of the features. The MSFM module extracts features from multiple scales of an image through convolution and pooling operations, and then fuses them. A dual-stream collaborative attention enhancement module (DSSEAM) is designed in the backbone network. It uses a dual-stream collaborative processing approach for feature extraction and enhances the information in the input feature map by introducing an attention mechanism. The intrusion detection model for construction sites is trained based on the training dataset, and the trained model is used to identify real-time images of construction sites, thereby realizing intrusion detection.