Three-dimensional safety monitoring method based on improved point column network under construction scene

By improving the point-column network for three-dimensional safety monitoring of construction scenarios, the problems of high false alarm rate, poor multi-scale target detection capability and insufficient dynamic prediction have been solved, realizing intelligent safety monitoring in all weather and all space, and improving the safety perception and early warning capabilities of construction sites.

CN122200540A · Pending · Publication Date: 2026-06-12 · 岳忠俊

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
岳忠俊
Filing Date
2026-03-11
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Safety monitoring in construction scenarios suffers from problems such as high false alarm rate, poor multi-scale target detection capability, lack of dynamic prediction capability, and insufficient fusion of point cloud and vision.

Method used

An improved point-column network, including an adaptive dynamic voxel partitioning module, a multi-scale feature extraction module, and an attention mechanism module, is adopted. Combined with a multi-target tracking algorithm and a BIM model, it is used for 3D target detection, dynamic hazardous area generation, and safety risk classification and early warning.

Benefits of technology

It achieves intelligent safety perception in all weather and all spaces, improves detection accuracy and robustness, has dynamic prediction capabilities, adapts to complex construction environments, and enhances the proactive prevention and control capabilities for construction safety.

✦ Generated by Eureka AI based on patent content.

Abstract

The application discloses a three-dimensional safety monitoring method based on an improved point-column network in a construction scenario, comprising: step one, acquiring the original three-dimensional point cloud data of the construction scene and preprocessing it; step two, constructing an improved point-column network for three-dimensional target detection, the network including an adaptive dynamic voxel partitioning module, a multi-scale feature extraction module, an attention mechanism module, and a prior knowledge constraint module based on construction safety regulations; step three, continuously tracking dynamic targets and predicting their motion trajectories; step four, combining the danger zones preset in the BIM model with the predicted trajectories to construct dynamic danger zones, and issuing graded early warnings. Through multi-level improvements to the network structure, the application addresses the low detection accuracy caused by large differences in target scale and complex backgrounds in construction scenes, extends perception from single-frame detection to spatiotemporally continuous trajectory prediction, and significantly improves the proactive warning capability of construction safety monitoring.

Description

Technical Field

[0001] This invention relates to the field of building construction safety monitoring technology, specifically to a three-dimensional safety monitoring method based on an improved point-column network in a construction scenario.

Background Technology

[0002] Construction sites are characterized by frequent dynamic flows of personnel, machinery, and materials, and by complex and ever-changing environments, making safety monitoring extremely difficult. According to statistics from the Ministry of Housing and Urban-Rural Development, accidents such as falls from heights, being struck by objects, and mechanical injuries account for more than 80% of all construction safety accidents. Most of these accidents are directly related to a lack of spatial awareness and delayed early warnings. Traditional construction safety monitoring mainly relies on manual inspections or two-dimensional video surveillance, both of which have significant technical limitations. Manual inspections cannot achieve 24/7 coverage and depend heavily on the experience and sense of responsibility of the inspectors. Two-dimensional video surveillance covers a wide area, but it cannot readily provide precise three-dimensional location information for targets, and its image quality deteriorates significantly under adverse conditions such as changing lighting, nighttime construction, and dusty weather, easily leading to blind spots and misjudgments. Furthermore, traditional monitoring cannot predict the movement trends of dynamic targets; it often provides only recordings for review after an accident has occurred, and so fails to achieve true proactive early warning.

[0003] In recent years, lidar technology has been gradually applied to engineering monitoring. It can accurately perceive the spatial structure of the scene through three-dimensional point cloud data, providing a new technical path for construction safety monitoring. However, the original point cloud data has the characteristics of sparsity, disorder and massiveness. Direct processing has a large computational cost and real-time performance is difficult to guarantee. Point-column network, as an efficient point cloud target detection method, divides the three-dimensional point cloud into vertical columns in the bird's-eye view and extracts features, which has achieved good results in the field of autonomous driving. However, its direct application to construction scenarios has the following key technical bottlenecks: (1) Severe dynamic background interference: There are a large number of non-target point clouds in the construction scenario. Direct application of the original network is difficult to effectively distinguish between the background and the target, which is prone to generating a large number of false alarms; (2) Significant differences between multi-scale targets: Construction safety monitoring needs to pay attention to large machinery and small targets at the same time. The feature extraction network of a single scale is difficult to take into account the feature expression of targets of different scales, resulting in a high false alarm rate for small targets; (3) Lack of spatiotemporal continuity information: Existing methods are mostly based on single-frame point clouds for independent detection. They lack the ability to continuously track the target's motion trajectory and predict its future position, which cannot meet the early warning needs of construction scenarios for dynamic risks.

[0004] In addition to the aforementioned issues, existing technologies still have significant shortcomings in multimodal data fusion and adaptability to construction scenarios. On the one hand, point cloud data alone inherently lacks color information and texture details, which affects the accurate identification of target categories. Furthermore, existing point cloud and image fusion methods often employ late-fusion strategies, failing to fully leverage the complementary advantages of multimodal data. On the other hand, construction scenarios exhibit distinct phased characteristics: as the project progresses, the distribution of hazardous areas, the location of machinery deployment, and the range of personnel activity all change dynamically. Existing monitoring methods often neglect deep integration with Building Information Models (BIM), failing to adaptively adjust monitoring priorities and warning thresholds according to the construction schedule. Moreover, harsh conditions in the construction environment, such as dust, rain, and fog, can lead to point cloud sparsity or even localized loss. The fixed grid partitioning strategy used in existing networks is ill-suited to such uneven density distributions, further restricting the reliability and practicality of the monitoring system. Therefore, there is an urgent need to develop a 3D safety monitoring method that can adapt to complex construction environments, accommodate multi-scale target detection, possess dynamic prediction capabilities, and deeply integrate with BIM.

Summary of the Invention

[0005] This invention aims to provide a three-dimensional safety monitoring method based on an improved point-column network in construction scenarios, in order to solve the problems of high false alarm rate, poor multi-scale target detection capability, lack of dynamic prediction capability, and insufficient fusion of point cloud and vision in existing construction safety monitoring technologies.

[0006] This invention introduces a three-dimensional safety monitoring method based on an improved point-column network in construction scenarios, comprising the following steps: Step 1: Acquire the original 3D point cloud data of the construction scene and perform preprocessing, including time synchronization, spatial coordinate calibration, and coordinate transformation for alignment with the BIM model; Step 2: Construct an improved point-column network to perform 3D target detection on the preprocessed point cloud data. The improved point-column network includes an adaptive dynamic voxel partitioning module, a multi-scale feature extraction module, and an attention mechanism module; Step 3: Based on the detection results of Step 2, use a multi-target tracking algorithm to continuously track dynamic targets in the construction scene and predict their motion trajectories; Step 4: Combine the preset danger zones in the BIM model with the motion trajectories predicted in Step 3 to construct dynamic danger zones and perform safety risk classification and early warning.

[0007] This invention provides a complete three-dimensional safety monitoring technology solution for construction scenarios. By organically integrating four core steps of "collection-detection-tracking-early warning", it achieves for the first time fully automated processing from raw point cloud data to safety risk classification and early warning. This method deeply integrates the improved point-column network with the special needs of construction scenarios, solving the fundamental defects of traditional monitoring methods that rely on manual inspection, have blind spots in two-dimensional monitoring, and lack spatial perception capabilities. It builds an all-weather, all-space, and intelligent safety perception capability for construction sites, laying the overall architectural foundation for subsequent technological innovations to play a synergistic role.

[0008] Preferably, the adaptive dynamic voxel partitioning module in step 2 is specifically used to: dynamically adjust the grid size in the bird's-eye view according to the local point cloud density; use a first-size grid in areas where the point cloud density is higher than a first threshold; and use a second-size grid larger than the first size in areas where the point cloud density is lower than a second threshold, wherein the first threshold is greater than the second threshold; the formula for calculating the local point cloud density is:

ρ(x, y) = N(x, y) / (Δx · Δy)

[0009] Where ρ(x, y) is the local point cloud density, N(x, y) represents the number of points in the neighborhood centered at coordinates (x, y), and Δx and Δy represent the length and width of the neighborhood. By introducing an adaptive dynamic voxel partitioning mechanism, this invention overcomes the processing defects caused by the fixed mesh size of traditional networks. It automatically refines the mesh in critical areas with dense crowds and frequent mechanical activity, accurately capturing the detailed features of small targets such as safety helmets and guardrails, thus improving the recall rate of small targets. In open or distant areas, it automatically coarsens the mesh, reducing redundant computation and significantly increasing the system's frame rate to over 25 FPS, meeting real-time monitoring requirements. This computational resource scheduling strategy achieves a dynamic balance between detection accuracy and processing efficiency.
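The density-driven grid selection can be illustrated with a minimal Python sketch. It assumes that density is the point count divided by the neighborhood area, and it approximates the neighborhood with fixed, non-overlapping bird's-eye-view cells rather than a window centered on each coordinate. The function names, the threshold values, and the `default` size used between the two thresholds are hypothetical; the patent does not specify them.

```python
from collections import Counter


def local_density(points, dx=1.0, dy=1.0):
    """Return {cell: density}, with density = N / (dx * dy) per BEV cell.

    points: iterable of (x, y, z) tuples. Cells are fixed tiles of size
    dx by dy (an assumption standing in for a centered neighborhood).
    """
    counts = Counter((int(p[0] // dx), int(p[1] // dy)) for p in points)
    area = dx * dy
    return {cell: n / area for cell, n in counts.items()}


def grid_size(rho, first_thr, second_thr, fine, coarse, default):
    """Pick a grid size from the density, with first_thr > second_thr."""
    if rho > first_thr:      # dense region: use the smaller, first-size grid
        return fine
    if rho < second_thr:     # sparse region: use the larger, second-size grid
        return coarse
    return default           # in-between case: hypothetical fallback size


pts = [(0.1, 0.1, 0.0), (0.2, 0.3, 0.0), (0.4, 0.4, 0.0),
       (0.9, 0.9, 0.0), (5.5, 5.5, 0.0)]
for cell, rho in local_density(pts).items():
    print(cell, rho, grid_size(rho, 3.0, 1.5, 0.1, 0.4, 0.2))
```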

[0010] Preferably, the multi-scale feature extraction module in step 2 includes a multi-branch dilated convolutional structure, where each branch uses convolutional kernels with different dilation rates to process feature maps in parallel, and the outputs of each branch are fused through an adaptive weight fusion module; the multi-branch dilated convolutional structure specifically includes: The first branch uses a convolutional kernel with a dilation rate of 1 to extract local detail features; The second branch uses a convolutional kernel with a dilation rate of 3 to extract medium-range contextual features; The third branch uses a convolutional kernel with a dilation rate of 5 to extract large-scale global features; The adaptive weight fusion module performs a weighted summation of the feature maps from each branch using learnable weight parameters, which are dynamically adjusted based on the content of the input feature maps. The multi-scale feature extraction module constructs a complete feature pyramid from local details to global context through parallel processing of three-branch dilated convolutions. The branch with a dilation rate of 1 finely depicts target edges and local textures, ensuring that small targets such as safety helmets and tools are not missed; the branch with a dilation rate of 3 captures medium-range context, effectively distinguishing adjacent stacked building materials from personnel; and the branch with a dilation rate of 5 perceives large-scale scene structures, accurately identifying the overall outline of large targets such as tower crane booms and excavator bodies. The adaptive weight fusion mechanism dynamically adjusts the contribution of each branch based on the input features, significantly improving the average accuracy of the network in complex construction scenarios compared to the original network.
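The adaptive weight fusion step can be sketched as follows. The patent states only that learnable weights are adjusted according to the input content; deriving each branch weight from a softmax over a learnable gain multiplied by the branch's mean activation is an assumption made for illustration, and `fuse_branches` and `gains` are hypothetical names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_branches(branch_maps, gains):
    """Weighted sum of branch feature maps with content-dependent weights.

    branch_maps: one flattened feature map (list of floats) per branch,
                 all of the same length.
    gains:       one learnable scalar per branch (hypothetical parameter).
    """
    # Content-dependent score: learnable gain times the branch's mean activation.
    scores = [g * (sum(fm) / len(fm)) for g, fm in zip(gains, branch_maps)]
    weights = softmax(scores)
    size = len(branch_maps[0])
    fused = [sum(w * fm[i] for w, fm in zip(weights, branch_maps))
             for i in range(size)]
    return fused, weights


# Three branches (dilation rates 1, 3, 5) on a toy two-element feature map.
fused, weights = fuse_branches([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
                               [0.0, 0.0, 0.0])
print(fused, weights)
```

With all gains at zero the branches are averaged equally; training would move the gains so that the weighting follows the input.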

[0011] Preferably, the attention mechanism module in step 2 is a hybrid domain attention module, used to apply attention weights in both the channel and spatial dimensions to the feature map. The channel attention module extracts channel descriptors through global average pooling and global max pooling, and generates channel weights via a shared multilayer perceptron. The spatial attention module performs average pooling and max pooling along the channel axis, generating spatial weights via convolutional layers. Finally, the channel and spatial weights are applied sequentially to the input feature map to suppress background point cloud noise and enhance the feature representation of key target regions. The hybrid domain attention module adaptively recalibrates the feature map in both the channel and spatial dimensions. The channel attention mechanism automatically enhances the response to key color features such as the orange of a safety helmet and the fluorescent yellow of a reflective vest, while suppressing interference from background materials such as reinforced concrete. The spatial attention mechanism focuses computational resources on accident-prone areas such as under tower cranes and near openings, automatically attenuating the feature response to non-monitored areas.
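A simplified Python sketch of sequential channel and spatial attention is given below. It is an illustration under stated assumptions, not the patented module: the shared multilayer perceptron is passed in as an arbitrary callable `mlp`, the spatial convolution is reduced to a per-pixel mix of the two pooled maps (`w_avg`, `w_max`, `bias` are hypothetical parameters), and feature maps are nested lists indexed as `fmap[channel][row][col]`.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def channel_attention(fmap, mlp):
    """One weight per channel from global average and max pooling."""
    flat = [[v for row in ch for v in row] for ch in fmap]
    avg = [sum(f) / len(f) for f in flat]
    mx = [max(f) for f in flat]
    # The same (shared) mlp is applied to both descriptors before summing.
    return [sigmoid(a + m) for a, m in zip(mlp(avg), mlp(mx))]


def spatial_attention(fmap, w_avg=1.0, w_max=1.0, bias=0.0):
    """One weight per pixel from pooling along the channel axis."""
    n_ch, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    weights = []
    for i in range(height):
        row = []
        for j in range(width):
            column = [fmap[c][i][j] for c in range(n_ch)]
            mixed = w_avg * sum(column) / n_ch + w_max * max(column) + bias
            row.append(sigmoid(mixed))
        weights.append(row)
    return weights


def hybrid_attention(fmap, mlp):
    """Apply channel weights first, then spatial weights."""
    cw = channel_attention(fmap, mlp)
    scaled = [[[v * cw[c] for v in row] for row in ch]
              for c, ch in enumerate(fmap)]
    sw = spatial_attention(scaled)
    return [[[v * sw[i][j] for j, v in enumerate(row)]
             for i, row in enumerate(ch)] for ch in scaled]


# Two channels, one pixel each; the identity function stands in for the MLP.
print(hybrid_attention([[[1.0]], [[0.0]]], lambda v: v))
```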

[0012] Preferably, the improved point-column network in step 2 further includes a prior knowledge constraint module based on construction safety regulations. This prior knowledge constraint module encodes the semantic prior relationships of targets in the construction scenario into the loss function of the detection network. These semantic prior relationships include: the spatial relative position of the safety helmet and the person's head in the vertical direction; the spatial relative position of the guardrail and the adjacent area; and the geometric connection relationship between the tower crane boom and the standard section of the tower body. When the detection result violates these semantic prior relationships, the loss function automatically increases the penalty term coefficient, guiding the network to learn target features that conform to engineering logic. In this way, the prior knowledge constraint module transforms engineers' industry experience into semantic rules that the network can learn, endowing the monitoring system with engineering common sense, so that abnormal results that violate physical laws are penalized and the network is guided toward detection results that conform to engineering logic.
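As a rough illustration of how such a prior could enter a loss, the sketch below penalizes a helmet detection whose vertical position is far from the top of the associated person box, and enlarges the penalty coefficient whenever any violation occurs. The box layout `(cx, cy, cz, length, width, height)`, the tolerance `tol`, and the coefficients `base_coeff` and `growth` are all assumptions; the patent does not give the form of the penalty term.

```python
def helmet_violation(person_box, helmet_box, tol=0.3):
    """Vertical mismatch in metres between helmet center and top of person.

    Boxes are (cx, cy, cz, length, width, height); returns 0.0 when the
    helmet lies within `tol` of the top of the person box.
    """
    person_top = person_box[2] + person_box[5] / 2.0
    return max(0.0, abs(helmet_box[2] - person_top) - tol)


def constrained_loss(detection_loss, violations, base_coeff=1.0, growth=2.0):
    """Detection loss plus a prior penalty whose coefficient grows on violation."""
    total = sum(violations)
    coeff = base_coeff * (growth if total > 0.0 else 1.0)
    return detection_loss + coeff * total


person = (0.0, 0.0, 0.9, 0.6, 0.6, 1.8)           # top of box at z = 1.8 m
plausible = (0.0, 0.0, 1.75, 0.25, 0.25, 0.15)    # helmet at head height
implausible = (0.0, 0.0, 0.5, 0.25, 0.25, 0.15)   # helmet at knee height
print(constrained_loss(2.0, [helmet_violation(person, plausible)]))
print(constrained_loss(2.0, [helmet_violation(person, implausible)]))
```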

[0013] Preferably, the preprocessing in step 1 further includes: Point cloud enhancement step: For sparse areas of the point cloud caused by construction dust, rain, and fog, the K-nearest neighbor interpolation algorithm is used to increase the point cloud density and fill in the missing points in the sparse areas; Dynamic background filtering step: A static background point cloud map of the construction scene, which includes the existing permanent structures, is constructed in advance. By comparing the current frame point cloud with the background map, points belonging to the static background are filtered out, while the points of dynamic targets such as personnel and machinery are retained; Coordinate system alignment step: The improved Fast-ICP algorithm is used to transform the point cloud coordinates to the construction coordinate system, so as to achieve accurate spatial alignment with the BIM model.

[0014] The aforementioned point cloud enhancement module employs the K-nearest neighbor interpolation algorithm to intelligently complete sparse areas caused by construction dust, improving the integrity of the point cloud and effectively avoiding missed detections due to data loss. Dynamic background filtering filters out more than 40% of interfering point clouds, such as building materials and temporary facilities, by constructing a static background map, allowing the network to focus on dynamic targets such as personnel and machinery. Coordinate system alignment accurately registers the point cloud with the BIM model, providing a spatial reference for subsequent BIM-based hazardous area identification.
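The dynamic background filtering step can be sketched with a voxel-occupancy lookup. This is one simple way to compare a frame against a static map, assumed here for illustration; the patent does not say how the comparison is implemented, and the voxel size of 0.2 m and the function names are hypothetical.

```python
def voxel_key(point, size):
    """Integer voxel index of an (x, y, z) point for a cubic voxel size."""
    return tuple(int(c // size) for c in point[:3])


def build_background(points, size=0.2):
    """Set of voxels occupied by the static background map."""
    return {voxel_key(p, size) for p in points}


def filter_dynamic(frame, background, size=0.2):
    """Keep only the points that fall outside background-occupied voxels."""
    return [p for p in frame if voxel_key(p, size) not in background]


background = build_background([(0.05, 0.05, 0.05), (1.0, 1.0, 0.0)])
frame = [(0.1, 0.1, 0.1), (3.0, 3.0, 1.0)]
print(filter_dynamic(frame, background))
```

A set lookup keeps the per-point cost constant, which matters when every frame has to be filtered in real time.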

[0015] Preferably, step 3 specifically includes: Data association sub-step: The Hungarian algorithm is used in combination with 3D intersection over union (IoU) and Mahalanobis distance to calculate the association cost matrix between the targets detected in the current frame and the historical trajectories, and to perform optimal matching; Motion state prediction sub-step: A Kalman filter or long short-term memory (LSTM) network is introduced to predict the target's position, velocity vector, and orientation angle within the next 1-3 seconds based on the target's historical trajectory data; Trajectory management sub-step: Trajectory status is maintained for targets that are briefly missed, a trajectory retention threshold is set, and the trajectory is terminated when the number of missed frames exceeds the threshold; a new trajectory is initialized for newly added targets, a trajectory confirmation threshold is set, and a stable trajectory is confirmed when the number of consecutive detection frames exceeds the threshold.
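The three sub-steps can be sketched in a few lines of Python. Note the substitutions: the assignment below is found by exhaustive search over permutations, which gives the same minimum-cost matching as the Hungarian algorithm on small matrices but does not scale; the prediction is a plain constant-velocity extrapolation rather than a Kalman filter or LSTM; and the thresholds `confirm_thr` and `retain_thr` are hypothetical values.

```python
from dataclasses import dataclass
from itertools import permutations


def associate(cost):
    """Minimum-cost one-to-one matching of tracks (rows) to detections (cols).

    Exhaustive stand-in for the Hungarian algorithm; needs rows <= cols.
    """
    n_rows, n_cols = len(cost), len(cost[0])
    best = min(permutations(range(n_cols), n_rows),
               key=lambda perm: sum(cost[r][c] for r, c in enumerate(perm)))
    return list(enumerate(best))


def predict(position, velocity, horizon):
    """Constant-velocity extrapolation of a position `horizon` seconds ahead."""
    return tuple(p + v * horizon for p, v in zip(position, velocity))


@dataclass
class Track:
    hits: int = 1          # consecutive frames with a matched detection
    misses: int = 0        # consecutive frames without one
    confirmed: bool = False
    alive: bool = True

    def update(self, detected, confirm_thr=3, retain_thr=5):
        if detected:
            self.hits += 1
            self.misses = 0
            self.confirmed = self.confirmed or self.hits > confirm_thr
        else:
            self.hits = 0
            self.misses += 1
            self.alive = self.misses <= retain_thr


print(associate([[1.0, 5.0], [4.0, 2.0]]))
print(predict((0.0, 0.0, 0.0), (1.0, 2.0, 0.0), 2.0))
```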

[0016] This solution's multi-target tracking module endows the monitoring system with memory and prediction capabilities, achieving a leap from single-frame perception to continuous cognition. The combination of the Hungarian algorithm and 3D IoU solves the identity swapping problem when targets are occluded or crossing paths; the motion prediction model using Kalman filtering or an LSTM network can predict the future position of targets 1-3 seconds in advance, gaining valuable reaction time for dynamic risk warnings; the trajectory management mechanism retains trajectory memory for targets that are briefly occluded, so that in scenarios such as tower crane booms obscuring personnel, the identity retention rate remains high when a lost target is recaptured, ensuring the continuity of risk assessment.

[0017] Preferably, the construction of the dynamic danger zone in step 4 specifically includes: Boom swing danger zone generation sub-step: Based on the real-time rotation speed, direction, and boom length of rotating machinery such as excavators and tower cranes, a rigid body kinematics model is used to calculate the spatial area swept by the boom in the next second, and a dynamic fan-shaped or cylindrical danger zone is generated; Vehicle driving hazard zone generation sub-step: Based on the real-time speed, steering angle, and vehicle size of the transport vehicle, a vehicle kinematics model is used to predict its driving trajectory and generate a dynamic strip-shaped hazard zone; Falling object risk zone generation sub-step: Based on the location, height, and stability coefficient of the load at height, together with real-time wind speed and direction data, the possible falling object coverage area is calculated using a parabolic motion model to generate a dynamic circular or elliptical danger zone; Regional fusion sub-step: The generated dynamic hazard zones are spatially overlaid and merged with the static hazard zones in the BIM model (edge openings, foundation pit edges, blasting warning zones) to form a comprehensive hazard map. This dynamic hazard zone generation technology overcomes the limitations of traditional static electronic fences, achieving precise prevention and control. Using the rigid body kinematics model of boom swing, it calculates the rotation hazard zones of excavators and tower cranes in real time; using trajectory prediction based on vehicle kinematics, it dynamically generates blind spot warning zones for vehicles; and using the falling object risk model based on wind speed, wind direction, and load stability, it assesses the coverage area of objects falling from height. The spatial overlay and fusion of the three types of dynamic hazard zones with the BIM static hazard zones constructs a comprehensive, accurate, and real-time construction site hazard map.
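Two of these zone models lend themselves to a short geometric sketch in Python. The swing zone is treated as a planar circular sector swept from the current heading over the prediction horizon, and the falling object reach uses drag-free projectile motion with an assumed initial horizontal speed. Wind, the load stability coefficient, and the cylindrical (3D) variant are deliberately left out, and all function and parameter names are hypothetical.

```python
import math


def swing_sector(heading, omega, arm_length, horizon=1.0, margin=0.0):
    """Sector swept by a rotating arm: (start angle, signed sweep, radius).

    heading in rad, omega in rad/s (positive = counter-clockwise).
    """
    return heading, omega * horizon, arm_length + margin


def in_sector(point, pivot, sector):
    """True if a 2D point lies inside the swept sector around `pivot`."""
    start, sweep, radius = sector
    dx, dy = point[0] - pivot[0], point[1] - pivot[1]
    if math.hypot(dx, dy) > radius:
        return False
    if abs(sweep) >= 2.0 * math.pi:
        return True
    # Angle of the point relative to the start heading, wrapped to [0, 2*pi).
    rel = (math.atan2(dy, dx) - start) % (2.0 * math.pi)
    if sweep >= 0.0:
        return rel <= sweep
    return rel >= 2.0 * math.pi + sweep or rel == 0.0


def fall_reach(height, v_horizontal, g=9.81):
    """Horizontal distance covered by a drag-free projectile dropped from height."""
    t_fall = math.sqrt(2.0 * height / g)
    return v_horizontal * t_fall


zone = swing_sector(heading=0.0, omega=math.pi / 2.0, arm_length=10.0)
print(in_sector((5.0, 5.0), (0.0, 0.0), zone))
print(in_sector((5.0, -5.0), (0.0, 0.0), zone))
print(fall_reach(19.62, 3.0))
```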

[0018] Preferably, the safety risk classification and early warning in step 4 specifically includes: Risk assessment sub-step: Calculate the spatial distance d between the dynamic target and the comprehensive hazard area, and the predicted time t for the dynamic target to enter the hazard area; when d is less than the first safety threshold or t is less than the first time threshold, a red warning is issued; when d is less than the second safety threshold and greater than the first safety threshold, or t is less than the second time threshold and greater than the first time threshold, an orange warning is issued; when d is less than the third safety threshold and greater than the second safety threshold, or t is less than the third time threshold and greater than the second time threshold, a yellow warning is issued; Multimodal early warning output sub-step: Output early warning information through various means such as audible and visual alarms, mobile terminal push notifications, and AR terminal overlay display in the tower crane operator's cab; Early warning visualization sub-step: The early warning level, danger zone boundary, and target trajectory prediction line are overlaid on the BIM model or construction site real-world image in a 3D rendering manner, allowing managers to view them in real time via web or mobile platforms.
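The dual-parameter grading rule can be written as a short cascade that checks the most severe level first, so that a target meeting a stricter threshold on either d or t always receives the higher level. The threshold values below are placeholders, not values from the patent, and a value exactly equal to a threshold is assigned to the next lower level, a boundary case the text leaves open.

```python
def warning_level(d, t, d_thr=(2.0, 5.0, 10.0), t_thr=(1.0, 2.0, 3.0)):
    """Grade a risk from distance d (m) and predicted entry time t (s).

    d_thr and t_thr hold the first, second, and third safety and time
    thresholds in increasing order (hypothetical values).
    """
    levels = ("red", "orange", "yellow")
    for level, d_limit, t_limit in zip(levels, d_thr, t_thr):
        if d < d_limit or t < t_limit:
            return level
    return "none"


print(warning_level(1.0, 10.0))   # close to the zone
print(warning_level(8.0, 1.5))    # still distant, but entering soon
print(warning_level(20.0, 10.0))  # far away and not approaching
```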

[0019] This solution's tiered early warning mechanism enables refined management and differentiated responses to safety risks. Based on a dual-parameter judgment model using spatial distance d and prediction time t, risks are classified into three levels: red, orange, and yellow. A red warning triggers emergency equipment shutdown and personnel evacuation; an orange warning prompts on-site supervisors to monitor closely; and a yellow warning is added to the daily inspection checklist, forming a closed-loop control system with tiered response and progressive escalation. Multimodal early warning output ensures that the information reaches each recipient: audible and visual alarms alert on-site personnel, mobile terminals push notifications to the responsible managers, and AR terminals overlay the warning in the tower crane operator's field of vision.

[0020] Preferably, step 1 further includes multimodal data acquisition and fusion, specifically including: Multiple lidars and high-definition cameras are deployed in the construction area, and the devices are time-synchronized through hardware triggering, with synchronization accuracy reaching the millisecond level. An improved Fast-ICP algorithm combined with a checkerboard calibration board is used to complete the joint spatial calibration of the lidar coordinate system and the camera coordinate system, and to establish a precise mapping between point cloud points and image pixels. Point cloud data is projected onto the image plane, and semantic segmentation is performed on the corresponding regions in the image to obtain the color and texture features of the target. The image features and point cloud features are fused at the feature layer to generate a multimodal fused feature map, which is then input into the improved point-column network. The multimodal fusion adopts an attention-driven feature fusion strategy, which dynamically adjusts the fusion weights of point cloud features and image features based on point cloud quality and lighting conditions. The image feature weights are reduced at night or in poor lighting conditions, and increased when the point cloud is sparse.
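The projection of lidar points onto the image plane can be sketched with a standard pinhole camera model. The rotation `R`, translation `t`, and intrinsics `fx`, `fy`, `cx`, `cy` stand for the results of the joint calibration; the values used in the example are made up, and lens distortion is ignored.

```python
def project_point(p_lidar, R, t, fx, fy, cx, cy):
    """Map a lidar point to pixel coordinates, or None if behind the camera.

    R is a 3x3 rotation (list of rows) and t a 3-vector taking lidar
    coordinates into the camera frame; fx, fy, cx, cy are pinhole intrinsics.
    """
    # Rigid transform from the lidar frame into the camera frame.
    xc, yc, zc = (sum(R[i][j] * p_lidar[j] for j in range(3)) + t[i]
                  for i in range(3))
    if zc <= 0.0:
        return None
    # Perspective division followed by the intrinsic mapping.
    return fx * xc / zc + cx, fy * yc / zc + cy


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(project_point((1.0, 2.0, 4.0), identity, (0.0, 0.0, 0.0),
                    100.0, 100.0, 320.0, 240.0))
```

Each projected point gives the pixel location at which image features (color, texture, segmentation label) can be looked up for fusion.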

[0021] The multimodal data fusion technology of this invention fully leverages the complementary advantages of LiDAR and visual cameras, ensuring precise alignment of point clouds and images in the spatiotemporal dimension, with a fusion error of less than 1 pixel. The feature layer fusion strategy preserves the three-dimensional spatial accuracy of the point cloud while introducing color and texture information from the image, thereby improving the accuracy of target classification. The attention-driven dynamic fusion weight mechanism fully utilizes the advantages of the image during good daylight conditions and automatically enhances the point cloud weight at night or in backlight conditions, increasing the stability of all-weather monitoring and solving the industry problem of single sensors being unsuitable for complex construction environments.

[0022] Compared with the prior art, the beneficial effects of the present invention are: 1. Significantly improves the accuracy and robustness of 3D target detection in complex construction scenarios. This invention overcomes the problems of missed detection of small targets and wasted computational resources caused by the fixed mesh size of traditional networks by introducing an adaptive dynamic voxel partitioning mechanism. It automatically refines the mesh in dense point cloud regions to capture detailed features and automatically coarsens the mesh in sparse regions to reduce redundant computation, achieving a dynamic balance between detection accuracy and computational efficiency. Simultaneously, the multi-scale feature extraction module designed in this invention processes feature maps under different receptive fields in parallel through multi-branch dilated convolution and adopts an adaptive weight fusion mechanism, effectively solving the problem of performance imbalance caused by the significant scale difference between large machinery and small targets in construction scenarios.

[0023] 2. This invention represents a technological leap from post-event traceability to pre-event early warning, enhancing proactive prevention and control capabilities for construction safety. This invention innovatively combines 3D target detection with multi-target tracking and motion trend prediction. Through the Hungarian algorithm and a 3D intersection over union (IoU) data association strategy, coupled with Kalman filtering or LSTM networks, it accurately predicts the position and velocity vectors of dynamic targets within the next 1-3 seconds, constructing a target motion trajectory with spatiotemporal continuity. Based on this, this invention innovatively proposes a dynamic hazardous area generation mechanism. It not only predefines static hazardous areas based on the BIM construction schedule but also dynamically predicts the activity range of machinery in the next second based on real-time motion parameters, generating electronic fences. When the predicted minimum distance between the personnel trajectory and the machinery trajectory is less than a safety threshold and there is a temporal overlap, the system can trigger a tiered early warning, providing valuable reaction time for on-site personnel.

[0024] 3. This invention achieves deep fusion of point cloud data with BIM models and visual information, endowing the monitoring system with stronger engineering semantic understanding capabilities and environmental adaptability. This invention achieves precise time synchronization between LiDAR and high-definition cameras through hardware triggering and uses an improved Fast-ICP algorithm to complete pixel-level spatial registration of point clouds and images, laying a solid foundation for subsequent multimodal feature fusion. More importantly, this invention is the first to introduce construction schedule plans and prior knowledge from BIM models into the target detection and risk identification process, embedding semantic prior constraints based on construction safety regulations into the detection head, improving the accuracy of violation identification; by constructing a static background point cloud map of the construction scene and aligning it with the BIM structural model, it achieves precise separation of dynamic targets and static structures, effectively suppressing interference from background point clouds such as stacked building materials and temporary facilities. Furthermore, addressing the point cloud sparsity problem caused by harsh environments such as construction dust, rain, and fog, this invention uses interpolation algorithms to enhance point cloud density and combines a hybrid domain attention module to adaptively weight the feature map in terms of channel and spatial dimensions, enabling the network to focus on key monitoring areas and improving the system's stability and reliability in complex environments.

Attached Figure Description

[0025] Figure 1 is an overall flowchart of the three-dimensional safety monitoring method based on an improved point-column network in the construction scenario of this invention; Figure 2 is a diagram of the architecture of the improved point-column network in this invention; Figure 3 is a flowchart of the multimodal data fusion processing in this invention.

Detailed Implementation

[0026] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0027] As shown in Figures 1-3, the three-dimensional safety monitoring method based on an improved point-column network in the construction scenario of the present invention includes the following steps: Step 1: Acquire the original 3D point cloud data of the construction scene and perform preprocessing, including time synchronization, spatial coordinate calibration, and coordinate transformation for alignment with the BIM model; Step 2: Construct an improved point-column network to perform 3D target detection on the preprocessed point cloud data. The improved point-column network includes an adaptive dynamic voxel partitioning module, a multi-scale feature extraction module, and an attention mechanism module; Step 3: Based on the detection results of Step 2, use a multi-target tracking algorithm to continuously track dynamic targets in the construction scene and predict their motion trajectories; Step 4: Combine the preset danger zones in the BIM model with the motion trajectories predicted in Step 3 to construct dynamic danger zones and perform safety risk classification and early warning.

[0028] In this invention, the adaptive dynamic voxel partitioning module in step 2 is specifically used to: dynamically adjust the grid size under the bird's-eye view according to the local point cloud density; use a first-size grid in areas where the point cloud density is higher than a first threshold; and use a second-size grid larger than the first size in areas where the point cloud density is lower than a second threshold, wherein the first threshold is greater than the second threshold; the formula for calculating the local point cloud density is:

$$\rho(x, y) = \frac{N(x, y)}{\Delta x \cdot \Delta y}$$

[0029] Where N(x, y) represents the number of points in the neighborhood centered at coordinates (x, y), and Δx and Δy are the length and width of the neighborhood.
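The density-driven grid selection described above can be sketched as follows. This is a minimal illustration, assuming the density is the point count divided by the neighborhood area (consistent with the symbols N(x, y), Δx, and Δy defined above). The function names, the threshold values, the three grid sizes, and the use of a medium grid between the two thresholds are all hypothetical choices that the text does not specify.

```python
from collections import Counter


def local_density(points, cell=1.0):
    """Density per bird's-eye-view cell: point count divided by cell area."""
    counts = Counter((int(x // cell), int(y // cell)) for x, y, _z in points)
    area = cell * cell
    return {key: n / area for key, n in counts.items()}


def choose_grid_size(rho, high_thr=4.0, low_thr=1.0,
                     fine=0.16, coarse=0.64, default=0.32):
    """Pick a grid size from the local density.

    high_thr > low_thr, as required by the text. All numeric values are
    illustrative assumptions, as is the medium `default` size used when
    the density lies between the two thresholds.
    """
    if rho > high_thr:
        return fine      # dense region: first (smaller) grid size
    if rho < low_thr:
        return coarse    # sparse region: second (larger) grid size
    return default


points = [(0.1, 0.1, 0.0), (0.2, 0.3, 0.0), (0.5, 0.5, 0.0),
          (0.9, 0.9, 0.0), (0.4, 0.6, 0.0), (5.2, 5.1, 0.0)]
density = local_density(points)
sizes = {cell: choose_grid_size(rho) for cell, rho in density.items()}
```

Here the cell containing five points receives the fine grid, while the cell with a single point falls between the thresholds and receives the default size.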

[0030] In this invention, the multi-scale feature extraction module in step 2 adopts a multi-branch dilated convolution structure and achieves comprehensive perception of multi-scale targets in the construction scene by processing feature maps under different receptive fields in parallel. The module takes the shared feature map output by the feature encoding network as input and divides it evenly into three parallel branches along the channel dimension. Each branch uses a dilated convolutional kernel with a different dilation rate for feature extraction. The first branch uses a standard 3×3 convolutional kernel with a dilation rate of 1, keeping the receptive field within a 3×3 local range; it focuses on local detail features such as the edge of a safety helmet, the details of a guardrail, and the outline of a worker's limbs, so that the spatial structure of small targets is fully preserved. The second branch uses a 3×3 dilated convolutional kernel with a dilation rate of 3, expanding the effective receptive field to 7×7; it captures medium-range contextual information, such as the relative position of personnel and adjacent machinery, the distribution pattern of stacked materials, and the density of local areas, providing spatial context for subsequent target discrimination. The third branch uses a 3×3 dilated convolutional kernel with a dilation rate of 5, expanding the effective receptive field to 11×11; it perceives large-scale global features as macroscopic structural information, including the overall direction of the tower crane boom, the continuous extension of the foundation pit boundary, and the overall outline of large machinery. Processing the three branches in parallel enables the network to acquire multi-scale feature representations, from fine detail to macroscopic structure, at the same level, avoiding the dilution of small-target features caused by the step-by-step transmission of information in traditional serial multi-scale structures.
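The receptive-field figures quoted for the three branches follow from the standard relation for a single dilated convolution, k_eff = d·(k − 1) + 1, where k is the kernel size and d the dilation rate. The short check below reproduces 3, 7, and 11 for a 3×3 kernel; the function name is illustrative only.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by one dilated convolution kernel."""
    return dilation * (kernel_size - 1) + 1


# Dilation rates of the three branches described in the text.
extents = {rate: effective_kernel_size(3, rate) for rate in (1, 3, 5)}
print(extents)
```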

[0031] After extracting features in parallel across the branches, this invention introduces an adaptive weight fusion module to dynamically fuse the output feature maps of each branch, replacing traditional simple channel concatenation or fixed-weight summation. Specifically, each of the three branches outputs its own feature map. The fusion module first compresses the feature map of each branch into a channel description vector through global average pooling, generates initial weight coefficients through a shared fully connected layer and a nonlinear activation function, and then normalizes them with the Softmax function to obtain learnable fusion weight parameters w1, w2, and w3. On this basis, the module introduces a content-aware dynamic adjustment mechanism that fine-tunes the weights according to the local statistical characteristics of the input feature map F: in regions rich in high-frequency details, the weight w1 of the first branch is increased to enhance detail features; in regions with highly repetitive texture, w1 is reduced while the weights w2 and w3 of the second and third branches are increased to enhance context discrimination; in open or distant regions, the module relies mainly on the global features of the third branch. This strategy enables the network to adapt the contribution of each scale to the content of the input feature map. Compared with fixed-weight fusion, it improves the average accuracy of target detection on the construction scene test set and shows stronger robustness, especially under complex working conditions such as drastic lighting changes and local occlusion.
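The pooling, shared layer, and Softmax steps of the fusion module can be illustrated with a deliberately small sketch. It assumes single-channel branch outputs stored as nested lists, a scalar "fully connected" layer with ReLU activation, and hypothetical parameter names `fc_weight` and `fc_bias`. The content-aware fine-tuning of the weights is not modeled here.

```python
import math


def global_average(fmap):
    """Global average pooling over a 2D feature map."""
    values = [v for row in fmap for v in row]
    return sum(values) / len(values)


def softmax(scores):
    """Numerically stable Softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(branches, fc_weight=1.0, fc_bias=0.0):
    """Weighted sum of branch outputs with Softmax-normalized weights.

    fc_weight and fc_bias stand in for the shared fully connected layer;
    ReLU is an assumed choice of nonlinear activation.
    """
    scores = [max(0.0, fc_weight * global_average(b) + fc_bias)
              for b in branches]
    weights = softmax(scores)
    rows, cols = len(branches[0]), len(branches[0][0])
    fused = [[sum(w * b[r][c] for w, b in zip(weights, branches))
              for c in range(cols)] for r in range(rows)]
    return weights, fused


flat = [[1.0, 1.0], [1.0, 1.0]]
equal_weights, equal_fused = fuse([flat, flat, flat])
strong = [[3.0, 3.0], [3.0, 3.0]]
skewed_weights, _ = fuse([strong, flat, flat])
```

With three identical branches the weights are equal; when one branch responds more strongly, its weight grows at the expense of the others while the weights still sum to one.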

[0032] In this invention, the attention mechanism module in step 2 employs a hybrid domain attention module that applies channel-dimension and spatial-dimension attention weights to the feature map in series, adaptively enhancing key features of the construction scene and suppressing background noise. The channel attention module first compresses the input feature map along the spatial dimension, applying global average pooling and global max pooling in parallel to extract global statistical information and salient response information. Global average pooling captures the overall response intensity of each channel, reflecting the average activation level of each feature type; global max pooling focuses on the strongest response position within each channel, capturing the most salient feature cues of the target. The two pooling operations generate two different channel descriptors, which are input into a shared multilayer perceptron. This multilayer perceptron adopts a bottleneck structure of "dimensionality reduction, activation, dimensionality restoration": a first fully connected layer compresses the channel dimension, a nonlinear activation function introduces a nonlinear transformation, and a second fully connected layer restores the original number of channels. The two processed channel descriptors are added element-wise and normalized to generate the final channel attention weights. These weights recalibrate the channels of the original feature map, enhancing important channels and suppressing background noise channels, so that the network pays more attention to feature types strongly related to construction safety monitoring.

[0033] After channel attention weighting, this invention further introduces a spatial attention module to adaptively focus on the spatial dimension of the feature map. The spatial attention module compresses the feature map along the channel axis, applying average pooling and max pooling in parallel to generate two two-dimensional spatial descriptors that represent the average and maximum response intensity of each spatial location across all channels. These two descriptors are concatenated along the channel dimension to form a joint spatial feature map, which is input into a convolutional layer with a larger kernel for feature fusion and mapping. The convolutional layer uses boundary padding to preserve the spatial size and outputs a single-channel spatial attention map, which is passed through a normalization function to generate the final spatial attention weights. These weights are applied element-wise to the channel-weighted feature map to recalibrate the spatial dimension. The overall mechanism is a progressive attention enhancement: channel attention first determines which types of features to focus on, filtering channels by importance; spatial attention then determines which locations to focus on, concentrating on the spatial region where the target is located while retaining the key channels. This serial structure enables the network to accurately locate key monitoring targets in the construction scenario. In the area beneath a tower crane, channel attention enhances the feature channels of the boom and personnel, while spatial attention focuses on the intersection of the tower crane's slewing radius and the personnel activity area. In the area near openings, channel attention enhances the structural feature channels of the guardrail, while spatial attention focuses on the boundary line of the opening. Through this dual attention mechanism, the network reduces the misjudgment rate for background point clouds and improves the stability of target detection under complex conditions such as dust, rain, fog, and backlighting. The average detection accuracy improves in particular for small targets, enhancing the reliability and practicality of the monitoring system in harsh construction environments.
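The serial channel-then-spatial weighting can be sketched in plain Python on a tiny feature map stored as `fmap[channel][row][col]`. This is an illustration under stated assumptions rather than the patented implementation. The sigmoid is assumed as the normalization function, ReLU as the activation in the shared perceptron, and the weight matrices `w1`, `w2` and the convolution `kernel` are hypothetical placeholders for learned parameters.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mlp(vec, w1, w2):
    """Shared bottleneck perceptron: reduce, ReLU, restore."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_attention(fmap, w1, w2):
    """One weight per channel from average- and max-pooled descriptors."""
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in fmap]
    peak = [max(map(max, ch)) for ch in fmap]
    logits = [a + m for a, m in zip(mlp(avg, w1, w2), mlp(peak, w1, w2))]
    return [sigmoid(z) for z in logits]


def spatial_attention(fmap, kernel, k):
    """One weight per location; kernel[0] acts on the avg map, kernel[1] on the max map."""
    n_ch, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    avg = [[sum(fmap[c][y][x] for c in range(n_ch)) / n_ch
            for x in range(width)] for y in range(height)]
    peak = [[max(fmap[c][y][x] for c in range(n_ch))
             for x in range(width)] for y in range(height)]
    pad = k // 2  # zero padding keeps the spatial size unchanged
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            acc = 0.0
            for d, desc in enumerate((avg, peak)):
                for i in range(k):
                    for j in range(k):
                        yy, xx = y + i - pad, x + j - pad
                        if 0 <= yy < height and 0 <= xx < width:
                            acc += kernel[d][i][j] * desc[yy][xx]
            row.append(sigmoid(acc))
        out.append(row)
    return out


def hybrid_attention(fmap, w1, w2, kernel, k=3):
    """Apply channel weights first, then spatial weights."""
    cw = channel_attention(fmap, w1, w2)
    weighted = [[[cw[c] * v for v in row] for row in ch]
                for c, ch in enumerate(fmap)]
    sw = spatial_attention(weighted, kernel, k)
    return [[[sw[y][x] * v for x, v in enumerate(row)]
             for y, row in enumerate(ch)] for ch in weighted]


fmap = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
w1 = [[1.0, 1.0]]          # 2 channels -> 1 hidden unit
w2 = [[1.0], [1.0]]        # 1 hidden unit -> 2 channels
kernel = [[[0.0] * 3 for _ in range(3)] for _ in range(2)]
out = hybrid_attention(fmap, w1, w2, kernel)
```

A 3×3 kernel is used here only to keep the example small; the text calls for a larger kernel in the spatial branch.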

[0034] In this invention, the improved point-column network in step 2 further introduces a prior knowledge constraint module based on construction safety regulations. This module explicitly encodes the semantic prior relationships of targets in the construction scenario into the loss function of the detection network, so that the network relies not only on data-driven feature learning but also on expert knowledge from the engineering field, and therefore outputs detection results that conform to construction safety logic. The semantic prior relationships cover three types of core constraints in construction scenarios. First, the spatial relative position of the safety helmet and the worker's head: under safety regulations, a worn safety helmet should lie within a vertical range of 0 to 0.3 meters directly above the head, and its horizontal projection should essentially coincide with the center of the head. If the detection results show that the helmet is too far from the head or that the relative position is abnormal, a violation of the spatial constraint is determined. Second, the spatial relative position of the guardrail and the edge area: under the safety regulations for working at heights, guardrails must be installed in dangerous areas such as edge openings and foundation pit edges, and the guardrail should sit close to the edge boundary, with a horizontal distance of no more than 0.1 meters from the edge area. If no guardrail features are detected around the edge area, or if the guardrail position is significantly offset from the edge area, missing protection or an abnormal position is determined. Third, the geometric connection between the tower crane boom and the standard tower section: by the structural principle of tower cranes, the boom should form a rigid connection with the standard tower section at the slewing platform, the two should coincide in spatial position at the connection point, and their orientation should conform to mechanical kinematic constraints. If the boom is detected as separated from the tower, misaligned, or connected at an abnormal angle, a structural recognition error is determined. In the specific implementation, the prior knowledge constraint module first extracts the category labels, 3D center point coordinates, 3D dimensions, and orientation angles of the targets from the candidate boxes output by the detection network. It then constructs spatial relationship judgment functions for the three types of semantic prior relationships and calculates the degree of deviation between the detection results and the specification requirements. Each deviation is quantified and introduced as a penalty term into the total loss function, participating in backpropagation optimization together with the classification loss and regression loss. When the detection result fully conforms to the semantic prior relationship, the penalty term coefficient is set to zero and the original loss calculation is unaffected; when the detection result partially deviates from the specification requirements, the penalty term coefficient increases linearly with the degree of deviation; when the detection result seriously violates physical common sense, the penalty term coefficient grows exponentially, forcing the network to correct erroneous outputs quickly in subsequent iterations. Through this optimization strategy, which integrates knowledge-driven and data-driven approaches, this invention improves the credibility of detection results at the engineering semantic level: the accuracy of safety helmet wearing recognition is improved, the false alarm rate of missing edge protection detection is reduced, and the consistency and stability of tower crane structure recognition are enhanced, so that the early warning information output by the monitoring system better matches the actual safety management needs of the construction site.
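The three-regime penalty coefficient (zero, linear, exponential) can be sketched for the helmet constraint as follows. The 0 to 0.3 meter vertical range comes from the text. The boundary `severe` between partial and serious deviation, the `slope` and `growth` parameters, and the form "coefficient times deviation" for the penalty term are illustrative assumptions.

```python
import math


def helmet_deviation(helmet_z, head_z, low=0.0, high=0.3):
    """Distance (m) by which the helmet leaves the allowed vertical band above the head."""
    offset = helmet_z - head_z
    if offset < low:
        return low - offset
    if offset > high:
        return offset - high
    return 0.0


def penalty_coefficient(deviation, severe=0.5, slope=1.0, growth=2.0):
    """Zero when compliant, linear for partial deviation, exponential beyond `severe`.

    The exponential branch is scaled so the coefficient is continuous at `severe`.
    """
    if deviation <= 0.0:
        return 0.0
    if deviation <= severe:
        return slope * deviation
    return slope * severe * math.exp(growth * (deviation - severe))


def total_loss(cls_loss, reg_loss, deviations):
    """Classification + regression + prior-knowledge penalty (assumed form)."""
    penalty = sum(penalty_coefficient(d) * d for d in deviations)
    return cls_loss + reg_loss + penalty


compliant = helmet_deviation(helmet_z=1.9, head_z=1.7)
violating = helmet_deviation(helmet_z=2.5, head_z=1.7)
loss = total_loss(0.4, 0.6, [compliant, violating])
```

A compliant detection leaves the loss unchanged, while the helmet detected 0.8 meters above the head contributes a positive penalty.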

[0035] In this invention, the preprocessing in step 1 further includes: Point cloud enhancement sub-step: For sparse point cloud regions caused by construction dust, rain, and fog, a K-nearest-neighbor interpolation algorithm is used to increase the point cloud density and fill in missing points; Dynamic background filtering sub-step: A static background point cloud map of the construction scene, which includes the existing permanent structures, is constructed in advance. By comparing the current frame point cloud with the background map, points belonging to the static background are filtered out, while dynamic target points such as personnel and machinery are retained; Coordinate system alignment sub-step: The improved Fast-ICP algorithm is used to transform the point cloud coordinates into the construction coordinate system, achieving accurate spatial alignment with the BIM model.
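The dynamic background filtering sub-step can be illustrated with a voxel occupancy comparison. The text does not specify how the current frame is compared with the background map, so the voxel hashing used here, the 0.2 meter voxel size, and all function names are assumptions made for illustration.

```python
def voxel_key(point, size):
    """Integer voxel index of a 3D point."""
    return tuple(int(c // size) for c in point)


def build_background(points, size=0.2):
    """Set of voxels occupied by the static background map."""
    return {voxel_key(p, size) for p in points}


def filter_dynamic(frame, background, size=0.2):
    """Keep only points whose voxel is not occupied in the background map."""
    return [p for p in frame if voxel_key(p, size) not in background]


background = build_background([(0.05, 0.05, 0.05), (1.1, 1.1, 0.0)])
frame = [(0.1, 0.1, 0.1), (3.0, 3.0, 1.0)]
dynamic_points = filter_dynamic(frame, background)
```

The first frame point falls in a voxel already occupied by the background and is dropped; the second is retained as a candidate dynamic target.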

[0036] In this invention, step 3 specifically includes: Data association sub-step: The Hungarian algorithm is used in combination with the 3D intersection over union (IoU) and the Mahalanobis distance to calculate the association cost matrix between the targets detected in the current frame and the historical trajectories, and to perform optimal matching; Motion state prediction sub-step: A Kalman filter or a long short-term memory network is introduced to predict the target's position, velocity vector, and orientation angle over the next 1-3 seconds based on the target's historical trajectory data; Trajectory management sub-step: For targets that are briefly missed, the trajectory state is maintained and a trajectory retention threshold is set, and the trajectory is terminated when the number of missed frames exceeds the threshold; for newly appearing targets, a new trajectory is initialized and a trajectory confirmation threshold is set, and the trajectory is confirmed as stable when the number of consecutive detection frames exceeds the threshold.
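The data association sub-step can be sketched with a cost that mixes 3D IoU and a Mahalanobis distance. To stay dependency-free, the sketch below replaces the Hungarian algorithm with an exhaustive search over permutations, which finds the same minimum-cost assignment but is only practical for a handful of targets. Axis-aligned boxes, a diagonal covariance, the mixing factor `alpha`, and the absence of any gating threshold are further simplifying assumptions.

```python
import itertools
import math


def iou_3d(a, b):
    """IoU of axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for i in range(3):
        overlap = min(a[i + 3], b[i + 3]) - max(a[i], b[i])
        if overlap <= 0:
            return 0.0
        inter *= overlap
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


def center(box):
    return [(box[i] + box[i + 3]) / 2 for i in range(3)]


def mahalanobis_diag(u, v, variances):
    """Mahalanobis distance under a diagonal covariance (assumption)."""
    return math.sqrt(sum((p - q) ** 2 / s for p, q, s in zip(u, v, variances)))


def cost(track_box, det_box, variances=(1.0, 1.0, 1.0), alpha=0.5):
    """Hypothetical mix of (1 - IoU) and Mahalanobis distance between centers."""
    dist = mahalanobis_diag(center(track_box), center(det_box), variances)
    return alpha * (1.0 - iou_3d(track_box, det_box)) + (1.0 - alpha) * dist


def associate(tracks, detections):
    """Minimum-cost matching; assumes len(tracks) <= len(detections)."""
    best, best_cost = (), float("inf")
    for perm in itertools.permutations(range(len(detections)), len(tracks)):
        total = sum(cost(tracks[t], detections[d]) for t, d in enumerate(perm))
        if total < best_cost:
            best, best_cost = perm, total
    return list(enumerate(best))


tracks = [(0, 0, 0, 1, 1, 1), (5, 5, 0, 6, 6, 1)]
detections = [(5.1, 5, 0, 6.1, 6, 1), (0.1, 0, 0, 1.1, 1, 1)]
matches = associate(tracks, detections)
```

Each pair in `matches` is (track index, detection index); here each track is matched to the detection that has shifted by 0.1 meters along x.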

[0037] In this invention, constructing the dynamic danger zone in step 4 specifically includes: Boom swing danger zone generation sub-step: Based on the real-time rotation speed, direction, and boom length of rotating machinery such as excavators and tower cranes, a rigid body kinematics model is used to calculate the spatial area swept by the boom in the next second, and a dynamic fan-shaped or cylindrical danger zone is generated; Vehicle driving danger zone generation sub-step: Based on the real-time speed, steering angle, and vehicle size of the transport vehicle, a vehicle kinematics model is used to predict its driving trajectory and generate a dynamic strip-shaped danger zone; Falling object danger zone generation sub-step: Based on the location, height, stability coefficient, and real-time wind speed and direction data of the high-altitude load, the possible coverage area of falling objects is calculated using a parabolic motion model to generate a dynamic circular or elliptical danger zone; Regional integration sub-step: The generated dynamic danger zones are spatially overlaid and integrated with the static danger zones in the BIM model (edge openings, foundation pit edges, blasting warning zones) to form a comprehensive danger zone map.
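The fan-shaped swing zone can be sketched in the horizontal plane by treating the boom as a rigid segment rotating about a fixed pivot at constant angular velocity over the prediction horizon. The constant-velocity assumption, the planar simplification, the optional `margin`, and all names are illustrative choices not stated in the text.

```python
import math

TWO_PI = 2.0 * math.pi


def swept_sector(heading, omega, horizon=1.0):
    """Return (start_angle, extent) of the sector swept within `horizon` seconds.

    Angles are in radians, measured counter-clockwise; extent is non-negative.
    """
    sweep = omega * horizon
    if abs(sweep) >= TWO_PI:
        return 0.0, TWO_PI
    start = heading if sweep >= 0 else heading + sweep
    return start % TWO_PI, abs(sweep)


def in_danger_zone(point, pivot, arm_length, heading, omega,
                   horizon=1.0, margin=0.0):
    """True if a ground point lies inside the swept fan-shaped zone."""
    dx, dy = point[0] - pivot[0], point[1] - pivot[1]
    if math.hypot(dx, dy) > arm_length + margin:
        return False
    start, extent = swept_sector(heading, omega, horizon)
    relative = (math.atan2(dy, dx) - start) % TWO_PI
    return relative <= extent


pivot = (0.0, 0.0)
quarter_turn = math.pi / 2  # rad/s, counter-clockwise
inside = in_danger_zone((5.0, 5.0), pivot, 10.0, 0.0, quarter_turn)
behind = in_danger_zone((-5.0, 5.0), pivot, 10.0, 0.0, quarter_turn)
too_far = in_danger_zone((20.0, 1.0), pivot, 10.0, 0.0, quarter_turn)
clockwise = in_danger_zone((5.0, -5.0), pivot, 10.0, 0.0, -quarter_turn)
```

A worker at (5, 5) lies inside the quarter-circle the boom will sweep in the next second, while points outside the angular range or beyond the boom length do not.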

[0038] In this invention, the safety risk classification and early warning in step 4 specifically includes: Risk assessment sub-step: Calculate the spatial distance d between the dynamic target and the comprehensive danger zone, and the predicted time t for the dynamic target to enter the danger zone; when d is less than the first safety threshold or t is less than the first time threshold, a red warning is issued; when d is less than the second safety threshold and greater than the first safety threshold, or t is less than the second time threshold and greater than the first time threshold, an orange warning is issued; when d is less than the third safety threshold and greater than the second safety threshold, or t is less than the third time threshold and greater than the second time threshold, a yellow warning is issued; Multimodal early warning output sub-step: Early warning information is output through multiple channels, such as audible and visual alarms, mobile terminal push notifications, and AR terminal overlay display in the tower crane operator's cab; Early warning visualization sub-step: The early warning level, danger zone boundary, and predicted target trajectory are overlaid on the BIM model or on live imagery of the construction site through 3D rendering, allowing managers to view them in real time via web or mobile platforms.
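The three-level grading can be expressed as a cascade that checks the most severe level first, which reproduces the banded conditions stated above (behavior exactly at a threshold is not specified in the text). The threshold values below are hypothetical placeholders.

```python
def warning_level(d, t, d_thresholds=(2.0, 5.0, 10.0), t_thresholds=(1.0, 2.0, 3.0)):
    """Return 'red', 'orange', 'yellow', or None.

    d: distance (m) from the target to the comprehensive danger zone.
    t: predicted time (s) until the target enters the zone.
    Thresholds are ordered first < second < third; values are illustrative.
    """
    levels = ("red", "orange", "yellow")
    for level, d_limit, t_limit in zip(levels, d_thresholds, t_thresholds):
        if d < d_limit or t < t_limit:
            return level
    return None


close_target = warning_level(d=1.0, t=10.0)
approaching = warning_level(d=3.0, t=10.0)
fast_mover = warning_level(d=20.0, t=2.5)
safe_target = warning_level(d=20.0, t=10.0)
```

Because distance and time are combined with "or", a distant but fast-approaching target can still trigger a warning through the time criterion.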

[0039] In this invention, step 1 further includes multimodal data acquisition and fusion, specifically including: Multiple LiDAR units and high-definition cameras are deployed in the construction area, and the devices are time-synchronized through hard triggering, with synchronization accuracy reaching the millisecond level. An improved Fast-ICP algorithm combined with a checkerboard calibration board is used to complete the joint spatial calibration of the LiDAR coordinate system and the camera coordinate system, establishing a precise mapping between point cloud points and image pixels. Point cloud data is projected onto the image plane, and semantic segmentation is performed on the corresponding image regions to obtain the color and texture features of the target. The image features and point cloud features are fused at the feature level to generate a multimodal fused feature map, which is then input into the improved point-column network. The multimodal fusion adopts an attention-driven feature fusion strategy that dynamically adjusts the fusion weights of point cloud features and image features based on point cloud quality and lighting conditions: the image feature weights are reduced at night or in poor lighting conditions and increased when the point cloud is sparse.
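Once the joint calibration is available, projecting a LiDAR point onto the image plane follows the standard pinhole model: the point is transformed into the camera frame with a rotation R and translation t, then mapped to pixels with the intrinsics. The sketch below assumes an undistorted pinhole camera; the parameter names `fx`, `fy`, `cx`, `cy` and the example values are hypothetical.

```python
def project_point(p_lidar, rotation, translation, fx, fy, cx, cy):
    """Project a LiDAR point to pixel coordinates, or None if behind the camera.

    rotation: 3x3 matrix (LiDAR frame -> camera frame), translation: 3-vector.
    Lens distortion is ignored in this sketch.
    """
    cam = [sum(rotation[i][j] * p_lidar[j] for j in range(3)) + translation[i]
           for i in range(3)]
    if cam[2] <= 0:
        return None
    u = fx * cam[0] / cam[2] + cx
    v = fy * cam[1] / cam[2] + cy
    return u, v


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixel = project_point((1.0, 2.0, 4.0), identity, (0.0, 0.0, 0.0),
                      fx=100.0, fy=100.0, cx=320.0, cy=240.0)
behind_camera = project_point((1.0, 2.0, -4.0), identity, (0.0, 0.0, 0.0),
                              fx=100.0, fy=100.0, cx=320.0, cy=240.0)
```

Points with non-positive depth in the camera frame are discarded, since they cannot appear in the image.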

[0040] In summary, this invention proposes a 3D safety monitoring method based on an improved point-column network for construction scenarios, aiming to address the technical bottlenecks of traditional construction monitoring methods in spatial perception, dynamic early warning, and environmental adaptability. The method constructs a complete technical closed loop from data acquisition to risk early warning through multi-level design. In the data preprocessing stage, multi-sensor synchronous calibration, dynamic background filtering, point cloud enhancement, and BIM coordinate alignment lay the data foundation for accurate detection. At the core detection network level, the classic point-column network is systematically improved: adaptive dynamic voxel partitioning adjusts the grid size according to the point cloud density; a multi-scale feature extraction module captures features from local details to global context through multi-branch dilated convolution; a hybrid domain attention mechanism suppresses background noise and focuses on key areas along the channel and spatial dimensions; and a prior knowledge constraint module based on construction safety specifications encodes engineering semantics, such as the relative position of the safety helmet and head and the spatial relationship between the guardrail and the edge, into the loss function, guiding the network to learn target features that conform to engineering logic. On the basis of target detection, multi-target tracking and motion trend prediction are introduced to perceive the spatiotemporal continuity of dynamic targets. Finally, by combining the danger zones preset in the BIM model with real-time predicted trajectories, multiple types of danger zones, such as boom swing zones and vehicle driving zones, are dynamically generated. Risk classification is then performed based on spatial distance and predicted time. Through multimodal early warning output and 3D visualization, intelligent safety monitoring is provided to construction sites around the clock and across all spaces, moving from passive post-event tracing to proactive pre-event early warning.

It should be noted that the above embodiments are only used to illustrate the technical solution of the present invention and not to limit it. Although the applicant has described the present invention in detail with reference to preferred embodiments, those skilled in the art should understand that any modifications or equivalent substitutions made to the technical solution of the present invention, without departing from its spirit and scope, should be covered by the claims of the present invention.

Claims

1. A three-dimensional safety monitoring method based on an improved point-column network in construction scenarios, characterized in that it includes the following steps: Step 1: Obtain the original 3D point cloud data of the construction scene and perform preprocessing, the preprocessing including time synchronization, spatial coordinate calibration, and coordinate transformation to align with the BIM model; Step 2: Construct an improved point-column network to perform 3D target detection on the preprocessed point cloud data, the improved point-column network including an adaptive dynamic voxel partitioning module, a multi-scale feature extraction module, and an attention mechanism module; Step 3: Based on the detection results of Step 2, use a multi-target tracking algorithm to continuously track dynamic targets in the construction scene and predict their motion trajectories; Step 4: Combine the preset danger zones in the BIM model with the motion trajectories predicted in Step 3 to construct dynamic danger zones and perform safety risk classification and early warning.

2. The three-dimensional safety monitoring method based on an improved point-column network in a construction scenario according to claim 1, characterized in that the adaptive dynamic voxel partitioning module in step 2 is specifically used to: dynamically adjust the grid size under the bird's-eye view based on the local point cloud density; use a first-size grid in areas where the point cloud density is higher than a first threshold, and use a second-size grid larger than the first size in areas where the point cloud density is lower than a second threshold, wherein the first threshold is greater than the second threshold; the formula for calculating the local point cloud density is: $\rho(x, y) = \frac{N(x, y)}{\Delta x \cdot \Delta y}$; where N(x, y) represents the number of points in the neighborhood centered at coordinates (x, y), and Δx and Δy are the length and width of the neighborhood.

3. The three-dimensional safety monitoring method based on an improved point-column network in a construction scenario according to claim 1, characterized in that the multi-scale feature extraction module in step 2 includes a multi-branch dilated convolutional structure, in which each branch uses convolutional kernels with a different dilation rate to process the feature maps in parallel, and the outputs of the branches are fused through an adaptive weight fusion module. The multi-branch dilated convolutional structure specifically includes: a first branch that uses a convolutional kernel with a dilation rate of 1 to extract local detail features; a second branch that uses a convolutional kernel with a dilation rate of 3 to extract medium-range contextual features; and a third branch that uses a convolutional kernel with a dilation rate of 5 to extract large-scale global features. The adaptive weight fusion module performs a weighted summation of the feature maps of each branch using learnable weight parameters, which are dynamically adjusted according to the content of the input feature maps.

4. The three-dimensional safety monitoring method based on an improved point-column network in a construction scenario according to claim 1, characterized in that the attention mechanism module in step 2 is a hybrid domain attention module, which applies attention weights of the channel dimension and the spatial dimension to the feature map in turn. The channel attention module extracts channel descriptors through global average pooling and global max pooling, and generates channel weights through a shared multilayer perceptron; the spatial attention module performs average pooling and max pooling along the channel axis, and generates spatial weights through a convolutional layer; finally, the channel weights and spatial weights are applied to the input feature map in sequence to suppress background point cloud noise and enhance the feature representation of key target regions.

5. The three-dimensional safety monitoring method based on an improved point-column network in a construction scenario according to claim 1, characterized in that the improved point-column network in step 2 also includes a prior knowledge constraint module based on construction safety specifications. The prior knowledge constraint module encodes the semantic prior relationships of the targets in the construction scenario into the loss function of the detection network. The semantic prior relationships include: the spatial relative position of the safety helmet and the person's head in the vertical direction, the spatial relative position of the guardrail and the edge area, and the geometric connection relationship between the tower crane boom and the standard section of the tower body. When the detection result violates the semantic prior relationships, the loss function automatically increases the penalty term coefficient to guide the network to learn target features that conform to engineering logic.

6. The three-dimensional safety monitoring method based on an improved point-column network in a construction scenario according to claim 1, characterized in that the preprocessing in step 1 also includes: Point cloud enhancement sub-step: For sparse point cloud regions caused by construction dust, rain, and fog, a K-nearest-neighbor interpolation algorithm is used to increase the point cloud density and fill in missing points; Dynamic background filtering sub-step: A static background point cloud map of the construction scene, which includes the existing permanent structures, is constructed in advance. By comparing the current frame point cloud with the background map, points belonging to the static background are filtered out, while dynamic target points such as personnel and machinery are retained; Coordinate system alignment sub-step: The improved Fast-ICP algorithm is used to transform the point cloud coordinates into the construction coordinate system, achieving accurate spatial alignment with the BIM model.

7. The three-dimensional safety monitoring method based on an improved point-column network in a construction scenario according to claim 1, characterized in that step 3 specifically includes: Data association sub-step: The Hungarian algorithm is used in combination with the 3D intersection over union (IoU) and the Mahalanobis distance to calculate the association cost matrix between the targets detected in the current frame and the historical trajectories, and to perform optimal matching; Motion state prediction sub-step: A Kalman filter or a long short-term memory network is introduced to predict the target's position, velocity vector, and orientation angle over the next 1-3 seconds based on the target's historical trajectory data; Trajectory management sub-step: For targets that are briefly missed, the trajectory state is maintained and a trajectory retention threshold is set, and the trajectory is terminated when the number of missed frames exceeds the threshold; for newly appearing targets, a new trajectory is initialized and a trajectory confirmation threshold is set, and the trajectory is confirmed as stable when the number of consecutive detection frames exceeds the threshold.

8. The three-dimensional safety monitoring method based on an improved point-column network in a construction scenario according to claim 1, characterized in that the construction of the dynamic danger zone in step 4 specifically includes: Boom swing danger zone generation sub-step: Based on the real-time rotation speed, direction, and boom length of rotating machinery such as excavators and tower cranes, a rigid body kinematics model is used to calculate the spatial area swept by the boom in the next second, and a dynamic fan-shaped or cylindrical danger zone is generated; Vehicle driving danger zone generation sub-step: Based on the real-time speed, steering angle, and vehicle size of the transport vehicle, a vehicle kinematics model is used to predict its driving trajectory and generate a dynamic strip-shaped danger zone; Falling object danger zone generation sub-step: Based on the location, height, stability coefficient, and real-time wind speed and direction data of the high-altitude load, the possible coverage area of falling objects is calculated using a parabolic motion model to generate a dynamic circular or elliptical danger zone; Regional integration sub-step: The generated dynamic danger zones are spatially overlaid and integrated with the static danger zones in the BIM model (edge openings, foundation pit edges, blasting warning zones) to form a comprehensive danger zone map.

9. The three-dimensional safety monitoring method based on an improved point-column network in a construction scenario according to claim 1, characterized in that the safety risk classification and early warning in step 4 specifically includes: Risk assessment sub-step: Calculate the spatial distance d between the dynamic target and the comprehensive danger zone, and the predicted time t for the dynamic target to enter the danger zone; when d is less than the first safety threshold or t is less than the first time threshold, a red warning is issued; when d is less than the second safety threshold and greater than the first safety threshold, or t is less than the second time threshold and greater than the first time threshold, an orange warning is issued; when d is less than the third safety threshold and greater than the second safety threshold, or t is less than the third time threshold and greater than the second time threshold, a yellow warning is issued; Multimodal early warning output sub-step: Early warning information is output through multiple channels, such as audible and visual alarms, mobile terminal push notifications, and AR terminal overlay display in the tower crane operator's cab; Early warning visualization sub-step: The early warning level, danger zone boundary, and predicted target trajectory are overlaid on the BIM model or on live imagery of the construction site through 3D rendering, allowing managers to view them in real time via web or mobile platforms.

10. The three-dimensional safety monitoring method based on an improved point-column network in a construction scenario according to claim 1, characterized in that step 1 further includes multimodal data acquisition and fusion, specifically including: Multiple LiDAR units and high-definition cameras are deployed in the construction area, and the devices are time-synchronized through hard triggering, with synchronization accuracy reaching the millisecond level. An improved Fast-ICP algorithm combined with a checkerboard calibration board is used to complete the joint spatial calibration of the LiDAR coordinate system and the camera coordinate system, establishing a precise mapping between point cloud points and image pixels. Point cloud data is projected onto the image plane, and semantic segmentation is performed on the corresponding image regions to obtain the color and texture features of the target. The image features and point cloud features are fused at the feature level to generate a multimodal fused feature map, which is then input into the improved point-column network. The multimodal fusion adopts an attention-driven feature fusion strategy that dynamically adjusts the fusion weights of point cloud features and image features based on point cloud quality and lighting conditions: the image feature weights are reduced at night or in poor lighting conditions and increased when the point cloud is sparse.