An intelligent identification and early warning system for unsafe behaviors of site personnel
By using multimodal data fusion and causal reasoning, the system addresses the high false alarm rate, high missed detection rate, and insufficient risk prediction of existing construction site safety management systems. It identifies and warns of unsafe behaviors in real time, copes with complex construction site environments, and adapts quickly to different scenarios.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- 陕西建工集团股份有限公司
- Filing Date
- 2026-05-21
- Publication Date
- 2026-06-19
AI Technical Summary
Existing construction site safety management systems suffer from high false alarm and missed detection rates, lack physiological state perception, cannot predict potential risks, and cannot effectively identify and warn of unsafe behaviors in complex construction site environments.
By employing a multimodal perception layer, a data preprocessing layer, a cross-modal spatiotemporal attention fusion layer, an unsafe behavior identification and prediction layer, and a hierarchical early warning and response layer, and combining physiological-behavioral-environmental data, potential risks are predicted through cross-modal spatiotemporal attention fusion and causal reasoning, thus achieving edge-cloud collaborative computing.
It effectively reduces false alarm and missed detection rates, can identify and predict unsafe behaviors in real time, adapts to complex construction site environments, has good modular design and adaptive capabilities, and supports rapid scene migration.
Figure CN122245076A_ABST
Description
Technical Field
[0001] This invention relates to the field of construction site safety technology, and more specifically, to an intelligent identification and early warning system for unsafe behaviors of construction site personnel.
Background Technology
[0002] The construction industry is one of the most dangerous industries globally. Statistics show that approximately 60,000 people die in construction accidents worldwide each year, accounting for more than 30% of all occupational deaths. Research indicates that up to 80% of construction accidents are caused by unsafe worker behaviors. Therefore, timely identification and warning of unsafe worker behaviors are crucial for reducing the incidence of construction accidents.
[0003] Traditional construction site safety management relies primarily on manual inspections, which suffer from low efficiency, limited coverage, susceptibility to omissions, and delayed response. With the development of computer vision and deep learning technologies, unsafe behavior recognition systems based on video surveillance have been widely adopted. However, existing technologies still have the following limitations:
(1) The construction site environment is complex, with dust, strong light, dynamic occlusion, and complex light and shadow. Traditional single-modal vision algorithms are prone to feature extraction failure in these environments, resulting in high false alarm and missed detection rates.
[0004] (2) Existing systems mainly focus on workers’ external unsafe behaviors, such as not wearing safety helmets or entering dangerous areas in violation of regulations, but cannot capture the internal physiological factors that cause these behaviors, such as fatigue, abnormal heart rate, and reduced blood oxygen.
[0005] (3) Existing systems only perform single-frame or short-time analysis, lack long-term spatiotemporal trajectory analysis and behavioral pattern learning of worker behavior, and cannot identify some complex unsafe behaviors that require contextual judgment, nor can they predict potential risks.
[0006] (4) Most existing systems are post-event identification, and can only issue alarms after unsafe behavior occurs. They cannot predict future unsafe behavior based on multimodal data and cannot achieve true proactive safety management.
[0007] In view of this, the present invention is proposed to solve the above-mentioned technical problems.
Summary of the Invention
[0008] The purpose of this invention is to provide an intelligent identification and early warning system for unsafe behaviors of construction workers, so as to solve the technical problems of high false alarm rate, high missed detection rate, lack of physiological state perception, and inability to predict potential risks in the prior art.
[0009] To achieve the above objectives, the present invention provides the following technical solution: An intelligent identification and early warning system for unsafe behaviors of construction workers includes:
The multimodal perception layer is used to collect physiological data, motion data, visual data, and environmental data of construction site personnel.
The data preprocessing layer communicates with the multimodal perception layer and is used to perform spatiotemporal registration, outlier removal, and standardization on the acquired multimodal data.
The cross-modal spatiotemporal attention fusion layer communicates with the data preprocessing layer to extract features from different modal data and dynamically fuse multimodal features through a cross-modal spatiotemporal attention mechanism.
The unsafe behavior identification and prediction layer communicates with the cross-modal spatiotemporal attention fusion layer to identify unsafe behaviors that have occurred based on the fused multimodal features and to predict potential unsafe behaviors based on causal reasoning.
The tiered early warning and response layer communicates with the unsafe behavior identification and prediction layer to generate early warning signals of different levels based on the identification and prediction results, and to execute corresponding response measures.
The edge-cloud collaborative computing layer communicates with the multimodal perception layer, data preprocessing layer, cross-modal spatiotemporal attention fusion layer, unsafe behavior identification and prediction layer, and tiered early warning and response layer, respectively, to realize collaborative computing between the edge and the cloud. Lightweight models are deployed at the edge for real-time detection, while large models are deployed in the cloud for complex analysis and model updates.
[0010] Furthermore, the multimodal sensing layer includes: The smart safety helmet terminal integrates a heart rate sensor, blood oxygen sensor, 6-axis IMU sensor, helmet removal detection sensor, front-facing high-definition camera, Beidou-3 positioning module, 5G communication module and voice prompt module. It is used to collect the wearer's heart rate, blood oxygen, movement posture, location information and first-person perspective video, and realize real-time data transmission and voice warning prompts. Fixed visual perception units include multiple high-definition cameras and infrared thermal imaging cameras deployed in key areas of the construction site to collect global visual data of key areas of the construction site. The environmental sensing unit includes temperature and humidity sensors, dust concentration sensors, wind speed sensors, and harmful gas sensors, which are used to collect environmental data at the construction site. Mobile sensing units, including multimodal sensors mounted on drones and inspection robots, are used to supplement monitoring of fixed sensing blind spots.
[0011] Furthermore, the data preprocessing layer includes: The spatiotemporal registration module is used to align data collected by different sensors in time and space based on the timestamps and location information of the BeiDou-3 positioning module. The outlier removal module is used to remove outliers from multimodal data using a sliding window-based statistical method. The data standardization module is used to standardize data from different modalities, giving them the same dimensions and distribution.
[0012] Furthermore, the cross-modal spatiotemporal attention fusion layer includes: The single-modal feature extraction module includes a visual feature extraction submodule, a physiological feature extraction submodule, a motion feature extraction submodule, and an environmental feature extraction submodule, which are used to extract features from different modal data, respectively. The cross-modal spatiotemporal attention module is used to calculate the attention weights of different modal features at different spatiotemporal locations, and to perform weighted fusion of multimodal features based on the attention weights to generate a fused feature map; The feature fusion output module is used to perform dimensionality transformation and normalization on the fused feature map to generate a unified fused multimodal feature vector.
[0013] Furthermore, the working process of the cross-modal spatiotemporal attention module includes: For each modality, the feature data is positionally encoded, and spatiotemporal location information is added; Calculate the similarity between each spatiotemporal location in each modal feature data and all spatiotemporal locations in other modal feature data; The attention weight for each spatiotemporal location is calculated based on similarity. The features of different modalities are weighted and summed according to the attention weights to obtain the fused feature map.
[0014] Furthermore, the unsafe behavior identification and prediction layer includes: The unsafe behavior identification module is used to identify unsafe behaviors that have occurred based on the fused multimodal feature vectors, including not wearing a safety helmet, not wearing a safety belt, entering a dangerous area without permission, smoking, fighting, falling, climbing, and operating equipment without permission; The behavior pattern learning module is used to learn the normal behavior patterns of construction site workers and establish a behavior baseline. Specifically, it collects multimodal data of workers under normal working conditions for 7 consecutive days, uses the DBSCAN clustering algorithm to cluster the data, sets the neighborhood radius to 0.5 and the minimum number of samples to 5, and obtains different normal behavior patterns. When the worker's behavior deviates from the normal behavior pattern, an early warning prompt is triggered. The causal reasoning prediction module is used to predict potential unsafe behaviors within a preset time period by analyzing the probability and timing of unsafe behaviors based on multimodal feature vectors and behavioral baselines through causal reasoning. The risk level assessment module is used to assess the risk level based on the type of unsafe behavior, its probability of occurrence, and its potential consequences.
[0015] Furthermore, the causal reasoning prediction module adopts a causal reasoning model based on graph neural networks. This model treats individual construction workers, construction equipment entities, construction site environment areas, and behavioral events as different types of nodes. It uses directed edges to represent the interaction between construction workers and construction equipment, the positional relationship between construction workers and the construction site environment, the triggering relationship between construction workers and behavioral events, the adaptation relationship between construction equipment and the construction site environment, the association relationship between construction equipment and behavioral events, and the inducing relationship between the construction site environment and behavioral events, in chronological order of causal occurrence. By learning the causal relationship weights in historical multimodal data, it predicts unsafe behaviors that may occur within a preset time period in the future.
[0016] Furthermore, the tiered early warning and response layer includes: The early warning level classification module is used to classify early warning signals into Level 1, Level 2, and Level 3 early warnings based on the risk level output by the risk level assessment module. The multi-channel early warning release module is used to release early warning signals through voice prompts from smart safety helmets, on-site sound and light alarms, mobile apps for managers, and large monitoring screens at construction sites. The graded response execution module is used to execute corresponding response measures according to the warning level. Specifically, Level 1 warning corresponds to voice reminder, Level 2 warning corresponds to on-site intervention, and Level 3 warning corresponds to emergency shutdown and emergency rescue activation.
[0017] Furthermore, the edge-cloud collaborative computing layer includes: Edge computing units, deployed on smart helmet terminals, edge servers, and mobile sensing units, are used to run lightweight unsafe behavior recognition models to achieve real-time detection and local early warning. The cloud computing unit, deployed on a cloud server, is used to run large models for complex behavioral analysis, causal reasoning, model training, and adaptive scene transfer learning. The model update module is used to update the parameters of the model trained in the cloud to the edge computing unit, so as to achieve continuous optimization of the model. The adaptive scene transfer learning module is deployed in the cloud computing unit to collect a small amount of labeled data in new scenes. It fine-tunes the model feature extraction layer parameters through the domain adaptive algorithm to enable the model to quickly adapt to different construction stages and different construction site scenarios.
[0018] Compared with the prior art, the beneficial effects of the present invention are as follows: 1. A multimodal perception architecture integrating physiology, behavior, and environment is adopted, which integrates data from multiple sensors and can fully utilize the complementarity of different modalities. In particular, the proposed cross-modal spatiotemporal attention fusion network can dynamically learn the importance weights of different modalities at different spatiotemporal locations, effectively solving the problem of feature extraction failure in complex construction site environments.
[0019] 2. An unsafe behavior prediction model based on causal reasoning was constructed, which can not only identify unsafe behaviors that have already occurred, but also predict unsafe behaviors that may occur in the future based on multimodal data (especially physiological data).
[0020] 3. A hierarchical inference architecture with edge-cloud collaboration is adopted. Lightweight models are deployed at the edge for real-time detection, while large models are deployed in the cloud for complex analysis and model updates. Model lightweighting employs a combination of knowledge distillation and INT8 quantization. The large model trained in the cloud serves as the teacher model, and the small model at the edge serves as the student model. Knowledge distillation transfers knowledge from the teacher model to the student model, and INT8 quantization converts the model parameters from 32-bit floating-point numbers to 8-bit integers, further compressing the model size. Ultimately, the model size is compressed to 1 / 10 of its original size, reducing the inference latency on the Jetson Nano device to 35ms, meeting the requirements for real-time construction safety alarms. The system of this invention adopts a layered modular architecture, with each layer interacting through standardized feature interfaces and the MQTT communication protocol. Each module has independent input and output definitions, allowing for easy addition of new sensors and new unsafe behavior recognition types without significant modifications to the overall system architecture.
[0021] 4. The system architecture of this invention features a well-designed modular architecture, allowing for easy addition of new sensors and new types of unsafe behavior recognition. Furthermore, the adoption of an adaptive scene transfer learning module enables rapid adaptation to changes in construction stages and site conditions without requiring retraining of the entire model.
[0022] 5. This invention can not only identify and warn of unsafe behaviors, but also record all warning information and processing procedures, forming a complete safety management data chain, providing data support for subsequent safety analysis and management decisions.
Attached Figure Description
[0023] The accompanying drawings, which form part of this application, are used to provide a further understanding of the invention. The illustrative embodiments and descriptions of the invention are used to explain the invention, but do not constitute an undue limitation of the invention. Obviously, the drawings described below are merely some embodiments, and those skilled in the art can obtain other drawings based on these drawings without creative effort. In the drawings: Figure 1 shows the system architecture diagram of the intelligent identification and early warning system for unsafe behaviors of construction workers provided in the embodiments of this application.
Detailed Implementation
[0024] The technical solutions in the embodiments of the present invention will be clearly and completely described below. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments.
[0025] As shown in Figure 1, an intelligent identification and early warning system for unsafe behaviors of construction workers includes: The multimodal perception layer is used to collect physiological data, motion data, visual data, and environmental data of construction site personnel. Specifically, (1) the safety helmets of construction workers are integrated with:
Heart rate sensor and blood oxygen sensor: a PPG optical sensor that collects workers' heart rate and blood oxygen saturation data in real time, with a sampling frequency of 1 Hz;
6-axis IMU sensor: an accelerometer and a gyroscope that collect the worker's motion posture data, with a sampling frequency of 50 Hz;
Helmet removal detection sensor: a capacitive sensor that detects whether the helmet is being worn correctly;
Front-facing HD camera: 1080P resolution, 30 fps frame rate, capturing first-person perspective video of workers;
BeiDou-3 positioning module: supports BeiDou-3 B1I and B1C signals, with outdoor positioning accuracy better than 1.5 meters, and provides accurate timestamps and location information;
5G communication module: supports 5G SA/NSA dual-mode, enabling real-time data transmission;
Voice prompt module: plays voice prompts and warning messages.
[0026] (2) Deploy multiple high-definition cameras and infrared thermal imaging cameras in key areas such as the entrances and exits of the construction site, the tower crane operation area, the high-altitude operation area, and the area around the deep foundation pit. The high-definition cameras have a resolution of 4K and a frame rate of 25fps, and are used to collect visual data during the day; the infrared thermal imaging cameras have a resolution of 640×512 and a frame rate of 15fps, and are used to collect visual data at night and under low light conditions.
[0027] (3) Deploy temperature and humidity sensors, dust concentration sensors, wind speed sensors, and hazardous gas sensors in different areas of the construction site to collect environmental data in real time. The temperature and humidity sensor has a measurement range of -40℃ to 85℃ and an accuracy of ±0.5℃; the dust concentration sensor has a measurement range of 0 to 1000 μg/m³ and an accuracy of ±10 μg/m³; the wind speed sensor has a measurement range of 0 to 60 m/s and an accuracy of ±0.5 m/s; the hazardous gas sensor can detect common hazardous gases such as carbon monoxide and hydrogen sulfide, with carbon monoxide having a measurement range of 0 to 1000 ppm and an accuracy of ±5 ppm, and hydrogen sulfide having a measurement range of 0 to 100 ppm and an accuracy of ±1 ppm.
[0028] (4) Mobile sensing unit, including multimodal sensors mounted on UAVs and inspection robots. Among them, the UAV is equipped with a high-definition camera, an infrared thermal imaging camera and a Beidou-3 positioning module, which can quickly inspect large areas of the construction site; the inspection robot is equipped with a high-definition camera, a lidar and an ultrasonic sensor. The lidar has a ranging range of 0.1 to 20 meters and an accuracy of ±2 cm; the ultrasonic sensor has a ranging range of 0.02 to 5 meters and an accuracy of ±1 cm, which can supplement the monitoring of fixed sensing blind spots.
[0029] The data preprocessing layer communicates with the multimodal perception layer and is used to perform spatiotemporal registration, outlier removal, and standardization on the acquired multimodal data. The data preprocessing layer includes: Spatiotemporal registration module: Due to the different sampling frequencies and locations of various sensors, the acquired data is asynchronous in time and space. Therefore, based on the precise timestamps and location information provided by the BeiDou-3 positioning module, a linear interpolation method is used to align the data from different sensors in time and space. For visual data, the Lucas-Kanade optical flow method is used for inter-frame alignment to ensure that the videos captured by different cameras are synchronized in time.
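The temporal half of the registration step can be illustrated with a minimal sketch, assuming every sample already carries a timestamp on a common clock. The function names `interpolate_at` and `align_to` and the sample values are hypothetical and only show how a 1 Hz heart rate series could be linearly interpolated onto the 50 Hz IMU time base; spatial alignment and the optical flow step are not covered.

```python
from bisect import bisect_left

def interpolate_at(timestamps, values, t):
    """Linearly interpolate a sampled signal at time t (timestamps ascending)."""
    if t <= timestamps[0]:
        return values[0]            # hold the first sample before the series starts
    if t >= timestamps[-1]:
        return values[-1]           # hold the last sample after the series ends
    i = bisect_left(timestamps, t)  # first index with timestamps[i] >= t
    t0, t1 = timestamps[i - 1], timestamps[i]
    v0, v1 = values[i - 1], values[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

def align_to(reference_ts, timestamps, values):
    """Resample one sensor series onto the timestamps of another sensor."""
    return [interpolate_at(timestamps, values, t) for t in reference_ts]

# Hypothetical data: heart rate at 1 Hz, IMU time base at 50 Hz over two seconds.
hr_ts = [0.0, 1.0, 2.0]
hr = [72.0, 76.0, 74.0]
imu_ts = [k / 50 for k in range(101)]
hr_on_imu = align_to(imu_ts, hr_ts, hr)
```

After this step both series share one time index, so each IMU sample has a matching heart rate value.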
[0030] Outlier Removal Module: Due to sensor malfunctions, electromagnetic interference, and other reasons, outliers may exist in the collected data. A sliding window-based statistical method is used to remove outliers. For each data sequence, the mean and standard deviation within the sliding window are calculated. If a data point deviates from the mean by more than three times the standard deviation, it is considered an outlier and removed. The average of the preceding and following data points is then used to replace it.
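A minimal sketch of the three-sigma rule described above follows. The window length (`half_window=5`) is an assumption, since the text does not state one. The sketch also computes the window statistics without the point under test, which is a design choice of this example rather than something the text specifies: in a short window a single spike inflates the standard deviation enough to hide itself.

```python
from statistics import mean, pstdev

def remove_outliers(series, half_window=5, k=3.0):
    """Replace points that lie more than k standard deviations from the local mean."""
    series = list(series)
    cleaned = list(series)
    n = len(series)
    for i, x in enumerate(series):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        # Assumption: the window excludes the point being tested.
        neighbours = series[lo:i] + series[i + 1:hi]
        if len(neighbours) < 2:
            continue
        mu, sigma = mean(neighbours), pstdev(neighbours)
        if abs(x - mu) > k * sigma:
            # Replace with the average of the preceding and following points;
            # at the ends of the series only one neighbour exists.
            prev_v = series[i - 1] if i > 0 else series[i + 1]
            next_v = series[i + 1] if i < n - 1 else series[i - 1]
            cleaned[i] = (prev_v + next_v) / 2
    return cleaned

# Hypothetical heart rate trace with one sensor glitch at index 5.
heart = [72.0, 73.0, 71.0, 72.0, 74.0, 180.0, 73.0, 72.0, 71.0, 73.0, 72.0]
cleaned = remove_outliers(heart)
```

Here the 180 bpm spike is replaced by the mean of its two neighbours, while the remaining samples are left unchanged.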
[0031] Data Standardization Module: Data from different modalities have different dimensions and distributions. To facilitate subsequent feature extraction and fusion, the data needs to be standardized. The Z-score standardization method is used to rescale each data sequence to a mean of 0 and a standard deviation of 1.
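As a brief illustration of Z-score standardization, the sketch below rescales one series. The function name `z_score` and the sample values are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def z_score(series):
    """Rescale a series to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]  # constant series: no spread to normalise
    return [(x - mu) / sigma for x in series]

# Hypothetical blood oxygen saturation readings in percent.
spo2 = [97.0, 98.0, 96.0, 99.0, 95.0]
z = z_score(spo2)
```

The rescaled series has mean 0 and standard deviation 1, so features from different sensors become comparable in scale. The shape of the underlying distribution is unchanged.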
[0032] The cross-modal spatiotemporal attention fusion layer communicates with the data preprocessing layer to extract features from different modal data and dynamically fuse multimodal features through a cross-modal spatiotemporal attention mechanism. Specifically, it includes: 1. Single-modal feature extraction module: Visual Feature Extraction Submodule: This submodule uses an improved YOLOv8n model as its backbone network to extract features from visual data. To adapt to edge deployment, the YOLOv8n model was lightweighted by removing the last two convolutional layers used for large object detection and their corresponding upsampling layers from the YOLOv8n backbone network, and replacing ordinary convolutions with depthwise separable convolutions. Furthermore, an ECA attention mechanism was added after each C2f module in the backbone network to improve the detection capability for small objects.
[0033] The physiological feature extraction submodule uses a one-dimensional convolutional neural network to extract features from heart rate and blood oxygen data. The network structure consists of three one-dimensional convolutional layers and two fully connected layers. The first convolutional layer has a kernel size of 3, a stride of 1, and 16 output channels; the second convolutional layer has a kernel size of 3, a stride of 1, and 32 output channels; the third convolutional layer has a kernel size of 3, a stride of 1, and 64 output channels. Each convolutional layer is followed by a batch normalization layer and a ReLU activation function. The first fully connected layer has 128 neurons, and the second fully connected layer has 64 neurons.
[0034] The motion feature extraction submodule uses an LSTM network to extract features from IMU data. LSTM networks are well-suited for processing temporal data and capturing dynamic changes in motion posture. The network structure consists of two LSTM layers and one fully connected layer. Each LSTM layer has 128 hidden units, and the fully connected layer has 64 neurons.
[0035] The environmental feature extraction submodule uses a multilayer perceptron to extract features from environmental data. The network structure consists of three fully connected layers: the first fully connected layer has 64 neurons, the second has 128 neurons, and the third has 64 neurons. Each fully connected layer is followed by a ReLU activation function.
[0036] 2. Cross-modal spatiotemporal attention module: This application proposes a novel cross-modal spatiotemporal attention mechanism that can dynamically learn the importance weights of different modal features at different spatiotemporal locations. Its working process is as follows: For each modality's feature map, positional encoding is performed, adding spatiotemporal positional information. For visual feature maps, two-dimensional positional encoding is used; for temporal feature maps (such as physiological and motion data), one-dimensional positional encoding is used.
[0037] Calculate the similarity between each location in each modality feature map and all locations in other modality feature maps. The similarity calculation formula is: $\mathrm{Sim}(m, p, n, q) = \dfrac{F_m^p \cdot F_n^q}{\lVert F_m^p \rVert \, \lVert F_n^q \rVert}$; In the formula, $\mathrm{Sim}(m, p, n, q)$ represents the cosine similarity between the feature vector at position p in the m-th modal feature map and the feature vector at position q in the n-th modal feature map; $F_m^p$ represents the feature vector at position p in the feature map of the m-th modality; $F_n^q$ represents the feature vector at position q in the feature map of the n-th modality. Attention weights are calculated for each position based on similarity. The formula for calculating attention weights is: $\alpha_m^p = \mathrm{softmax}\left(\sum_{n=1}^{M} \sum_{q=1}^{N_n} \mathrm{Sim}(m, p, n, q)\right)$; In the formula, $\alpha_m^p$ represents the attention weight at position p in the m-th modality feature map; softmax(·) represents the softmax activation function, where the · in parentheses is an input placeholder, and it converts the similarity values into a probability distribution between 0 and 1; $\sum_{n=1}^{M} \sum_{q=1}^{N_n}$ represents summing over all positions q in all modal feature maps; M represents the total number of modalities, and $N_n$ represents the total number of positions in the n-th modal feature map.
[0038] The features from different modalities are weighted and summed according to attention weights to obtain the fused feature map. The fusion formula is: $F_{\mathrm{fused}} = \sum_{m=1}^{M} \sum_{p=1}^{N_m} \alpha_m^p \cdot F_m^p$; In the formula, $F_{\mathrm{fused}}$ represents the fused feature map; $\alpha_m^p$ represents the attention weight at position p in the m-th modality feature map; $F_m^p$ represents the feature vector at position p in the feature map of the m-th modality; $\sum_{m=1}^{M} \sum_{p=1}^{N_m}$ represents summing over all positions p in all modal feature maps, and $N_m$ represents the total number of positions in the m-th modal feature map.
[0039] 3. Feature fusion output module: The weighted and fused multimodal features are integrated to generate a unified fused feature vector, which is then input into the subsequent unsafe behavior recognition and prediction layer.
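The three steps of the cross-modal spatiotemporal attention module described above (cosine similarity, softmax attention weights, weighted sum) can be sketched in plain Python. Several points are assumptions of this sketch: all modalities are taken to be already projected to a common feature dimension, the softmax is taken across all positions of all modalities so that the weights sum to 1, and the `include_same_modality` flag is hypothetical. The flag exists because the prose speaks of similarity to "other" modalities while the summation is described as running over all modal feature maps.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def fuse(feature_maps, include_same_modality=True):
    """feature_maps: one list of feature vectors per modality, all of equal dimension."""
    positions = [(m, p) for m, fmap in enumerate(feature_maps) for p in range(len(fmap))]
    # Step 1: summed similarity of each position to the positions of the other maps.
    scores = []
    for m, p in positions:
        scores.append(sum(
            cosine(feature_maps[m][p], vec)
            for n, fmap in enumerate(feature_maps)
            if include_same_modality or n != m
            for vec in fmap
        ))
    # Step 2: softmax over all (modality, position) pairs, shifted for stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Step 3: attention-weighted sum of all feature vectors.
    dim = len(feature_maps[0][0])
    fused = [0.0] * dim
    for (m, p), w in zip(positions, weights):
        for d in range(dim):
            fused[d] += w * feature_maps[m][p][d]
    return weights, fused

# Hypothetical 2-dimensional features: visual (2 positions), physiological, motion.
visual = [[1.0, 0.0], [0.0, 1.0]]
physio = [[1.0, 0.0]]
motion = [[0.8, 0.6]]
weights, fused = fuse([visual, physio, motion])
```

Positions whose features agree with many other positions receive larger weights, so a modality degraded by dust or glare contributes less to the fused vector.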
[0040] The unsafe behavior identification and prediction layer communicates with the cross-modal spatiotemporal attention fusion layer to identify unsafe behaviors that have occurred based on the fused multimodal features and to predict potential unsafe behaviors based on causal reasoning. Specifically, it includes: 1. Unsafe Behavior Recognition Module: This module employs a multi-class classifier to identify unsafe behaviors that have occurred. The input to the classifier is the fused feature vector, and the output is the probability of each unsafe behavior. The unsafe behaviors supported for recognition in this embodiment include: not wearing a safety helmet, not wearing a safety belt, unauthorized entry into a dangerous area, smoking, fighting, falling, climbing, and unauthorized operation of equipment. To improve recognition accuracy, an ensemble learning method is used to fuse the results of multiple different classifiers (such as support vector machines, random forests, and neural networks). Specifically, a weighted voting method is used for fusion, where the weight of the neural network classifier is 0.5, and the weights of the support vector machine and random forest classifiers are each 0.25.
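The weighted voting step can be illustrated with a short sketch that uses the stated weights (0.5, 0.25, 0.25). It assumes soft voting, where each classifier outputs a probability per behavior and the weighted probabilities are summed. The text only says "weighted voting", so this reading, the function name `weighted_vote`, and the sample probabilities are all hypothetical.

```python
# Weights as stated in the embodiment.
WEIGHTS = {"neural_network": 0.5, "svm": 0.25, "random_forest": 0.25}

def weighted_vote(probabilities):
    """probabilities: {classifier: {behavior: probability}} -> (label, fused scores)."""
    fused = {}
    for clf, scores in probabilities.items():
        for label, p in scores.items():
            fused[label] = fused.get(label, 0.0) + WEIGHTS[clf] * p
    best = max(fused, key=fused.get)  # behavior with the highest fused score
    return best, fused

# Hypothetical classifier outputs for one fused feature vector.
votes = {
    "neural_network": {"no_helmet": 0.25, "smoking": 0.75},
    "svm": {"no_helmet": 0.75, "smoking": 0.25},
    "random_forest": {"no_helmet": 0.5, "smoking": 0.5},
}
label, fused_scores = weighted_vote(votes)
```

In this example the support vector machine disagrees with the neural network, but the larger neural network weight decides the outcome.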
[0041] 2. Behavioral Pattern Learning Module: This module employs unsupervised learning to learn the normal behavioral patterns of construction site workers and establish a behavioral baseline. Specifically, it collects multimodal data on workers under normal working conditions for seven consecutive days, uses the DBSCAN clustering algorithm to cluster the data, sets the neighborhood radius to 0.5, and the minimum sample size to 5, to obtain different normal behavioral patterns. When a worker's behavior deviates from the normal behavioral pattern, the system will issue an alert.
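To make the baseline step concrete, the sketch below implements a minimal DBSCAN in plain Python with the stated parameters (neighborhood radius 0.5, minimum number of samples 5). It assumes the inputs are standardized feature vectors, since the radius is only meaningful on a known scale. The deviation test in `is_deviation` (no clustered baseline point within the radius) is a simplification chosen for this example, and the sample points are hypothetical.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps=0.5, min_samples=5):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_samples:
            labels[i] = -1                      # provisional noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster             # noise becomes a border point
            if labels[j] is not None:
                continue                        # already assigned
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_samples:
                queue.extend(j_neighbours)      # core point: keep expanding
    return labels

def is_deviation(sample, points, labels, eps=0.5):
    """Simplified check: True if no clustered baseline point lies within eps."""
    return not any(label != -1 and math.dist(sample, p) <= eps
                   for p, label in zip(points, labels))

# Hypothetical standardized 2-D behavior features from the baseline period.
baseline = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (-0.1, 0.0), (0.0, -0.1),
            (3.0, 3.0), (3.1, 3.0), (3.0, 3.1), (2.9, 3.0), (3.0, 2.9),
            (10.0, 10.0)]
labels = dbscan(baseline)
```

The two dense groups become two normal behavior patterns, the isolated point is treated as noise, and a new sample far from both groups is flagged as a deviation.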
[0042] 3. Causal Inference Prediction Module: This application embodiment constructs a causal inference model based on a graph neural network, specifically a graph attention network (GAT), to predict potential unsafe behaviors. The model treats site personnel, equipment, environment, and behavior as nodes, and the causal relationships between them as edges. By learning causal relationships from historical data, the model can analyze the probability and timing of unsafe behaviors in the current state. For example, when the model detects that a worker's heart rate exceeds 100 beats/minute for more than 5 minutes, blood oxygen saturation is below 90% for more than 3 minutes, and the worker is in a high-altitude work area, it predicts that the worker may be experiencing fatigue and issues a warning.
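The example condition in this paragraph can be written out as a small rule check. This is not the graph-based causal model itself, only a hand-written sketch of the one example it is said to detect. It assumes the physiological series are sampled at 1 Hz and that "for more than N minutes" refers to the most recent uninterrupted run of samples. The zone label `"high_altitude"` and the function names are hypothetical.

```python
def trailing_run(values, predicate):
    """Number of most recent consecutive samples that satisfy predicate."""
    run = 0
    for v in reversed(values):
        if not predicate(v):
            break
        run += 1
    return run

def fatigue_warning(heart_rate, spo2, zone, sample_hz=1.0):
    """True when all three example conditions from the text hold at the same time."""
    hr_seconds = trailing_run(heart_rate, lambda bpm: bpm > 100) / sample_hz
    spo2_seconds = trailing_run(spo2, lambda pct: pct < 90) / sample_hz
    return (hr_seconds > 5 * 60          # heart rate above 100 bpm for over 5 minutes
            and spo2_seconds > 3 * 60    # blood oxygen below 90% for over 3 minutes
            and zone == "high_altitude") # worker is in a high-altitude work area

# Hypothetical 1 Hz traces: elevated heart rate for 301 s, low SpO2 for 181 s.
heart_rate = [85] * 100 + [110] * 301
spo2 = [96] * 220 + [88] * 181
warn = fatigue_warning(heart_rate, spo2, "high_altitude")
```

A learned model would weigh such factors rather than apply fixed thresholds, but the rule shows which signals the example combines.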
[0043] 4. Risk Level Assessment Module: Risk levels are assessed using the risk matrix method. Risk value = probability of occurrence × severity of consequences. The probability of occurrence is divided into three levels: low (0.1), medium (0.5), and high (0.9). The severity of consequences is divided into three levels: minor (1), moderate (3), and severe (9). A level 1 warning is given when the risk value is ≤0.3, a level 2 warning is given when 0.3 < risk value ≤2.7, and a level 3 warning is given when the risk value >2.7. Based on the type of unsafe behavior, the probability of occurrence, and the possible consequences, the risk level is divided into three levels: Level 1 warning: Low risk, such as workers occasionally deviating from normal behavior patterns, but not causing immediate danger; Level 2 warning: Medium risk, such as workers not wearing safety helmets or entering general hazardous areas in violation of regulations; Level 3 warning: High risk, such as workers not wearing safety belts while working at heights or operating special equipment in violation of regulations.
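The risk matrix arithmetic can be checked with a short sketch. The probability and severity values and the thresholds are those stated above. The function name `warning_level` and the use of `Decimal` are choices of this example.

```python
from decimal import Decimal

# Level values as stated in the embodiment.
PROBABILITY = {"low": Decimal("0.1"), "medium": Decimal("0.5"), "high": Decimal("0.9")}
SEVERITY = {"minor": 1, "moderate": 3, "severe": 9}

def warning_level(probability, severity):
    """Map a (probability, severity) pair to warning level 1, 2 or 3."""
    risk = PROBABILITY[probability] * SEVERITY[severity]
    if risk <= Decimal("0.3"):
        return 1
    if risk <= Decimal("2.7"):
        return 2
    return 3

# All nine cells of the matrix.
matrix = {(p, s): warning_level(p, s) for p in PROBABILITY for s in SEVERITY}
```

Decimal arithmetic is used because the boundary cases sit exactly on the thresholds: with binary floating point, `0.1 * 3` evaluates to `0.30000000000000004`, which would push the low-probability, moderate-severity cell from Level 1 into Level 2. Under the stated thresholds, only the medium-severe and high-severe cells reach Level 3.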
[0044] The edge-cloud collaborative computing layer communicates with the multimodal perception layer, data preprocessing layer, cross-modal spatiotemporal attention fusion layer, unsafe behavior recognition and prediction layer, and hierarchical early warning and response layer, respectively, to realize collaborative computing between the edge and the cloud. Lightweight models are deployed at the edge for real-time detection, while large models are deployed in the cloud for complex analysis and model updates.
[0045] The multimodal sensing layer includes: The smart safety helmet terminal integrates a heart rate sensor, blood oxygen sensor, 6-axis IMU sensor, helmet removal detection sensor, front-facing high-definition camera, Beidou-3 positioning module, 5G communication module and voice prompt module. It is used to collect the wearer's heart rate, blood oxygen, movement posture, location information and first-person perspective video, and realize real-time data transmission and voice warning prompts. Fixed visual perception units include multiple high-definition cameras and infrared thermal imaging cameras deployed in key areas of the construction site to collect global visual data of key areas of the construction site. The environmental sensing unit includes temperature and humidity sensors, dust concentration sensors, wind speed sensors, and harmful gas sensors, which are used to collect environmental data at the construction site. Mobile sensing units, including multimodal sensors mounted on drones and inspection robots, are used to supplement monitoring of fixed sensing blind spots.
[0046] The data preprocessing layer includes: a spatiotemporal registration module, used to align data collected by different sensors in time and space based on the timestamps and location information provided by the BeiDou-3 positioning module; an outlier removal module, used to remove outliers from the multimodal data using a sliding-window statistical method; and a data standardization module, used to standardize data from different modalities so that they share a common scale and distribution.
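The specification does not say which sliding-window statistic or which standardization is used. As a minimal sketch, assuming a mean and standard deviation rule over a centred window and z-score standardization, the two steps might look as follows. The function names and the `window`, `k`, and `min_sigma` parameters are hypothetical.

```python
from statistics import mean, pstdev


def remove_outliers(values, window=5, k=3.0, min_sigma=1.0):
    """Drop samples lying more than k standard deviations from the mean of
    their neighbours in a centred sliding window (the sample itself excluded).

    min_sigma is a floor, in the signal's own units, so that a perfectly
    flat window does not reject ordinary jitter.
    """
    half = window // 2
    kept = []
    for i, x in enumerate(values):
        neighbours = values[max(0, i - half):i] + values[i + 1:i + 1 + half]
        if not neighbours:  # a single-sample series has nothing to compare with
            kept.append(x)
            continue
        mu = mean(neighbours)
        sigma = max(pstdev(neighbours), min_sigma)
        if abs(x - mu) <= k * sigma:
            kept.append(x)
    return kept


def standardize(values):
    """Z-score standardization: zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


# Example: a heart-rate trace (beats per minute) with one spurious spike.
heart_rate = [72, 74, 73, 180, 75, 74, 73]
clean = remove_outliers(heart_rate)
scaled = standardize(clean)
```

The sample under test is excluded from its own window because a single extreme value otherwise inflates the standard deviation enough to hide itself in a short window.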
[0047] The cross-modal spatiotemporal attention fusion layer includes: The single-modal feature extraction module includes a visual feature extraction submodule, a physiological feature extraction submodule, a motion feature extraction submodule, and an environmental feature extraction submodule, which are used to extract features from different modal data, respectively. The cross-modal spatiotemporal attention module is used to calculate the attention weights of different modal features at different spatiotemporal locations, and to perform weighted fusion of multimodal features based on the attention weights to generate a fused feature map; The feature fusion output module is used to perform dimensionality transformation and normalization on the fused feature map to generate a unified fused multimodal feature vector.
[0048] The working process of the cross-modal spatiotemporal attention module includes the following steps. First, the feature data of each modality is positionally encoded to add spatiotemporal position information: two-dimensional sine and cosine positional encoding is used for visual feature data, and one-dimensional sine and cosine positional encoding is used for time-series feature data such as physiological, motion, and environmental features. Second, the similarity between each spatiotemporal location in each modality's feature data and all spatiotemporal locations in the other modalities' feature data is calculated. Third, the attention weight for each spatiotemporal location is calculated from the similarity. Finally, the features of the different modalities are weighted and summed according to the attention weights to obtain the fused feature map.
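The steps in this paragraph can be illustrated for two time-series modalities. This is a sketch under stated assumptions: the similarity is taken to be a scaled dot product and the weights a softmax over it, neither of which the specification fixes; there are no learned projection matrices; and only the one-dimensional encoding is shown. All function names are hypothetical.

```python
import math


def positional_encoding(length, dim):
    """One-dimensional sine/cosine positional encoding, shape [length][dim]."""
    table = []
    for pos in range(length):
        row = []
        for j in range(dim):
            angle = pos / (10000 ** ((2 * (j // 2)) / dim))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_position(features):
    """Add the positional encoding to a [time][dim] feature sequence."""
    table = positional_encoding(len(features), len(features[0]))
    return [[f + p for f, p in zip(frow, prow)]
            for frow, prow in zip(features, table)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query_feats, other_feats):
    """For each query location, weight every location of the other modality
    by softmax(scaled dot-product similarity) and sum the weighted features."""
    dim = len(query_feats[0])
    fused = []
    for q in query_feats:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in other_feats]
        weights = softmax(scores)
        fused.append([sum(w * v[j] for w, v in zip(weights, other_feats))
                      for j in range(dim)])
    return fused


# Toy inputs: 3 physiological time steps attend over 5 motion time steps.
physio = [[0.1, 0.2, 0.3, 0.4], [0.2, 0.1, 0.0, 0.3], [0.5, 0.5, 0.1, 0.2]]
motion = [[0.0, 0.1, 0.2, 0.1], [0.3, 0.3, 0.1, 0.0], [0.9, 0.8, 0.7, 0.6],
          [0.2, 0.2, 0.2, 0.2], [0.1, 0.0, 0.1, 0.0]]
fused = cross_attend(add_position(physio), add_position(motion))
```

The output has one fused vector per query location, so the physiological sequence keeps its own length while absorbing information from the motion sequence.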
[0049] The unsafe behavior identification and prediction layer includes: The unsafe behavior identification module is used to identify unsafe behaviors that have occurred based on the fused multimodal feature vectors, including not wearing a safety helmet, not wearing a safety belt, entering a dangerous area without permission, smoking, fighting, falling, climbing, and operating equipment without permission; The behavior pattern learning module is used to learn the normal behavior patterns of construction site workers and establish a behavior baseline. Specifically, it collects multimodal data of workers under normal working conditions for 7 consecutive days, uses the DBSCAN clustering algorithm to cluster the data, sets the neighborhood radius to 0.5 and the minimum number of samples to 5, and obtains different normal behavior patterns. When the worker's behavior deviates from the normal behavior pattern, an early warning prompt is triggered. The causal reasoning prediction module is used to predict potential unsafe behaviors within a preset time period by analyzing the probability and timing of unsafe behaviors based on multimodal feature vectors and behavioral baselines through causal reasoning. The risk level assessment module is used to assess the risk level based on the type of unsafe behavior, its probability of occurrence, and its potential consequences.
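To make the baseline step concrete, the following is a plain-Python DBSCAN using the parameters stated above (neighborhood radius 0.5, minimum number of samples 5). It is a sketch only: the two-dimensional toy points stand in for standardized multimodal feature vectors, and the `deviates` rule (a new sample is anomalous when it lies farther than the radius from every clustered baseline point) is a simplifying assumption, since the specification does not define how deviation is measured.

```python
import math

NOISE = -1


def dbscan(points, eps=0.5, min_samples=5):
    """Plain DBSCAN; returns one cluster label per point, NOISE for outliers.
    As in common implementations, a point counts as its own neighbour."""
    def neighbours(i):
        return [j for j, p in enumerate(points)
                if math.dist(points[i], p) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_samples:  # j is itself a core point
                queue.extend(reachable)
    return labels


def deviates(sample, points, labels, eps=0.5):
    """True when the sample is not within eps of any clustered baseline point."""
    return all(math.dist(sample, p) > eps
               for p, label in zip(points, labels) if label != NOISE)


# Toy baseline: two normal behavior patterns and one stray observation.
baseline = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (0.05, 0.05), (0.2, 0.1),
            (5, 5), (5.1, 5), (5, 5.1), (5.1, 5.1), (5.05, 5.05), (5.2, 5.1),
            (10, 10)]
labels = dbscan(baseline)
```

Each cluster plays the role of one normal behavior pattern; the stray observation at (10, 10) is labelled as noise and does not contribute to the baseline.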
[0050] The causal reasoning prediction module adopts a causal reasoning model based on graph neural networks. This model treats individual construction workers, construction equipment entities, construction site environment areas, and behavioral events as different types of nodes. It uses directed edges to represent the interaction between construction workers and construction equipment, the positional relationship between construction workers and the construction site environment, the triggering relationship between construction workers and behavioral events, the adaptation relationship between construction equipment and the construction site environment, the association relationship between construction equipment and behavioral events, and the inducing relationship between the construction site environment and behavioral events, in chronological order of causal occurrence. By learning the causal relationship weights in historical multimodal data, it predicts unsafe behaviors that may occur within a preset time period in the future.
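The graph structure described in this paragraph can be sketched as a typed, directed, weighted graph. The code below is an illustration, not the trained model: relation names, node identifiers, scalar node states, and the fixed edge weights are all assumptions, and `event_score` performs a single hand-written message-passing step (weighted sum of incoming states followed by a sigmoid) rather than running a learned graph neural network.

```python
import math
from dataclasses import dataclass, field

# Allowed (source type, target type) pairs and a hypothetical relation name
# for each, following the six relationships listed in paragraph [0050].
RELATIONS = {
    ("worker", "equipment"): "interacts_with",
    ("worker", "area"): "located_in",
    ("worker", "event"): "triggers",
    ("equipment", "area"): "adapted_to",
    ("equipment", "event"): "associated_with",
    ("area", "event"): "induces",
}


@dataclass
class CausalGraph:
    node_type: dict = field(default_factory=dict)  # node id -> node type
    state: dict = field(default_factory=dict)      # node id -> scalar state
    edges: list = field(default_factory=list)      # (src, dst, relation, weight)

    def add_node(self, node_id, kind, state=0.0):
        self.node_type[node_id] = kind
        self.state[node_id] = state

    def add_edge(self, src, dst, weight):
        key = (self.node_type[src], self.node_type[dst])
        if key not in RELATIONS:
            raise ValueError(f"no causal relation defined for {key}")
        self.edges.append((src, dst, RELATIONS[key], weight))

    def event_score(self, event_id, bias=0.0):
        """Weighted sum of the states of incoming neighbours, then sigmoid."""
        total = bias + sum(w * self.state[s]
                           for s, d, _, w in self.edges if d == event_id)
        return 1.0 / (1.0 + math.exp(-total))


# Toy example: a fatigued worker near an unguarded edge.
g = CausalGraph()
g.add_node("w1", "worker", state=0.8)        # e.g. a fatigue indicator
g.add_node("edge_zone", "area", state=1.0)   # e.g. an unguarded edge
g.add_node("fall", "event")
g.add_edge("w1", "fall", weight=1.5)
g.add_edge("edge_zone", "fall", weight=0.5)
score = g.event_score("fall", bias=-1.0)
```

In a trained model the edge weights would be learned from historical multimodal data, as the paragraph states; here they are fixed by hand purely to show how typed edges constrain which relationships can contribute to an event.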
[0051] The hierarchical early warning and response layer includes: The early warning level classification module is used to classify early warning signals into Level 1, Level 2, and Level 3 early warnings based on the risk level output by the risk level assessment module. For Level 1 early warnings, workers are reminded via voice prompts from the smart safety helmets. For Level 2 early warnings, in addition to voice prompts from the smart safety helmets, the on-site sound and light alarms are triggered and the warning information is pushed to the management personnel's mobile app. For Level 3 early warnings, in addition to the above measures, the warning information is displayed on the construction site monitoring screen and the project manager is notified.
[0052] The multi-channel early warning release module is used to release early warning signals through voice prompts from smart safety helmets, on-site sound and light alarms, mobile apps for managers, and large monitoring screens at construction sites. The graded response execution module is used to execute response measures corresponding to the warning level, including voice reminders, on-site intervention, emergency shutdown, and emergency rescue activation: for a Level 1 warning, the system automatically records the warning information and reminds workers to pay attention to safety; for a Level 2 warning, the system notifies the nearby safety officer to intervene on site; for a Level 3 warning, the system immediately stops operations in the relevant area and activates the emergency rescue plan.
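The level-to-channel and level-to-response mappings in paragraphs [0051] and [0052] amount to a small lookup table, sketched below. The string identifiers and the `plan_warning` function are hypothetical. Channels accumulate from one level to the next as described; the responses are listed per level exactly as stated, since the text does not say whether they accumulate.

```python
# Release channels accumulate from Level 1 to Level 3 (paragraph [0051]).
_LEVEL_1 = ["helmet_voice_prompt"]
_LEVEL_2 = _LEVEL_1 + ["site_sound_light_alarm", "manager_app_push"]
_LEVEL_3 = _LEVEL_2 + ["site_monitoring_screen", "project_manager_notification"]

CHANNELS = {1: _LEVEL_1, 2: _LEVEL_2, 3: _LEVEL_3}

# Response measures per level (paragraph [0052]).
RESPONSES = {
    1: ["record_warning", "remind_worker"],
    2: ["notify_nearby_safety_officer"],
    3: ["stop_operations_in_area", "activate_emergency_rescue_plan"],
}


def plan_warning(level: int) -> dict:
    """Return the channels and response measures for a warning level."""
    if level not in CHANNELS:
        raise ValueError(f"unknown warning level: {level}")
    return {
        "level": level,
        "channels": list(CHANNELS[level]),
        "responses": list(RESPONSES[level]),
    }
```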
[0053] The edge-cloud collaborative computing layer includes: Edge computing units, deployed on the smart helmet terminals, edge servers, and mobile sensing units, which run lightweight unsafe behavior recognition models for real-time detection and local alerts. The smart helmet terminal runs the lightest model, for real-time detection of emergencies such as helmet removal and falls. Slightly larger models are deployed on the edge servers to process data from fixed cameras and environmental sensors. Medium-sized models are deployed on the mobile sensing units to process data collected by drones and inspection robots. Because detection and alerting happen locally, not all data has to be transmitted to the cloud, which significantly reduces network bandwidth requirements and response latency. The cloud computing unit, deployed on cloud servers, runs large models for complex behavioral analysis, causal inference, model training, and adaptive scene transfer learning. It handles complex tasks that the edge cannot, such as long-term behavioral pattern analysis and cross-regional safety risk assessment, and it also collects data from all edge computing devices for continuous model training and optimization. The model update module pushes the parameters of models trained in the cloud to the edge computing units, enabling continuous model optimization. It uses an incremental learning method based on Elastic Weight Consolidation (EWC) to update model parameters. Once a new model has been trained in the cloud, only the updated parameters, rather than the entire model, are transmitted to the edge, which significantly reduces the volume of data transmitted. The update process does not interrupt normal operation at the edge, so updates are seamless.
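For reference, the general form of the Elastic Weight Consolidation objective is given below. The symbols are not defined in this specification and are assumptions taken from the standard EWC formulation: \theta denotes the parameters being trained, \theta^{*} the parameters before the update, F_i the i-th diagonal entry of the Fisher information estimated on the earlier data, \lambda a weighting hyperparameter, and \mathcal{L}_{\text{new}} the loss on the newly collected data.

```latex
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\text{new}}(\theta)
  \;+\; \sum_{i} \frac{\lambda}{2}\, F_i \,\bigl(\theta_i - \theta^{*}_{i}\bigr)^{2}
```

The quadratic penalty discourages changes to parameters that were important for previously learned behavior, which is what lets an incremental update add new knowledge without overwriting the old.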
The adaptive scene transfer learning module, deployed in the cloud computing unit, consists of a data acquisition submodule, a feature extraction fine-tuning submodule, and a model validation submodule. It works as follows: when the system is deployed to a new construction site or enters a new construction phase, the data acquisition submodule automatically collects 100-200 labeled samples from the new scene; the feature extraction fine-tuning submodule uses a domain adversarial neural network algorithm to fine-tune only the bottom feature extraction layer of the pre-trained model in the cloud, keeping the parameters of the upper classification and prediction layers frozen; and the model validation submodule tests the accuracy of the fine-tuned model on a validation set. When the accuracy reaches the preset threshold (above 90%), the model update module sends the fine-tuned parameters to the edge. In this way, a small amount of labeled data from the new scene is enough for the model to adapt quickly to the visual features, environmental features, and behavioral patterns of different construction phases and construction sites, while retaining the original model's general recognition capabilities and without retraining the entire model.
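The release logic described in this paragraph (validate, then send only the changed parameters) can be sketched as follows. This is an illustration under simplifying assumptions: parameters are represented as a flat dictionary of scalars, the layer-name prefixes `classifier.` and `predictor.` are hypothetical, and the fine-tuning itself (the domain adversarial training) is out of scope.

```python
def changed_parameters(old, new, tol=0.0):
    """Return only the entries of `new` that differ from `old` by more than tol."""
    return {name: value for name, value in new.items()
            if name not in old or abs(value - old[name]) > tol}


def release_update(old, fine_tuned, accuracy, threshold=0.90,
                   frozen_prefixes=("classifier.", "predictor.")):
    """Return the parameter delta to send to the edge, or None when the
    validation accuracy is not above the threshold."""
    delta = changed_parameters(old, fine_tuned)
    # The upper layers are meant to stay frozen; treat any change as an error.
    if any(name.startswith(frozen_prefixes) for name in delta):
        raise ValueError("a frozen layer was modified during fine-tuning")
    return delta if accuracy > threshold else None


def apply_update(edge_params, delta):
    """Merge a delta into the parameters held at the edge."""
    merged = dict(edge_params)
    merged.update(delta)
    return merged


# Toy example with scalar "parameters".
cloud_old = {"features.w": 0.10, "features.b": 0.00, "classifier.w": 0.70}
cloud_new = {"features.w": 0.15, "features.b": 0.00, "classifier.w": 0.70}
delta = release_update(cloud_old, cloud_new, accuracy=0.93)
edge_params = apply_update(cloud_old, delta)
```

Because only the fine-tuned feature extraction parameters appear in the delta, the payload sent to the edge is smaller than the full model, which matches the bandwidth argument made for the model update module.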
[0054] To verify the effectiveness of this invention, we conducted a six-month experiment at three different types of construction sites (residential buildings, bridge projects, and subway construction). During the experiment, we collected more than 1,000 hours of multimodal data, including various complex construction site environments such as dust, strong light, nighttime, and dynamic occlusion, as well as 12 common unsafe behaviors such as not wearing a safety helmet, working while fatigued, and falls from heights.
[0055] Experimental results show that the system achieves a 97.2% accuracy rate in identifying common unsafe behaviors, with a false alarm rate of 4.3% and a false negative rate of 2.1%, demonstrating significant advantages over existing typical YOLOv8-based single-modal vision recognition systems (with an accuracy rate of approximately 85%, a false alarm rate of approximately 35%, and a false negative rate of approximately 28%). This system can predict typical potential unsafe behaviors such as fatigued work, falls from heights, and improper operation of special equipment 15-30 seconds in advance, achieving a prediction accuracy of 89.6%. On the Jetson Nano device, the system's inference latency is 35ms, meeting the requirements for real-time alarms. During the experiment, the accident rate at the three construction sites decreased by 72%, and safety management efficiency improved by 65%.
[0056] The foregoing has shown and described the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited to the above embodiments. The embodiments and descriptions in the specification are merely preferred examples and are not intended to limit the invention. Various changes and modifications can be made to the invention without departing from its spirit and scope, and all such changes and modifications fall within the scope of the present invention as claimed. The scope of protection of the present invention is defined by the appended claims and their equivalents.
Claims
1. An intelligent identification and early warning system for unsafe behaviors of construction workers, characterized in that it comprises: a multimodal perception layer, used to collect physiological data, motion data, visual data, and environmental data of construction site personnel; a data preprocessing layer, communicatively connected to the multimodal perception layer and used to perform spatiotemporal registration, outlier removal, and standardization on the acquired multimodal data; a cross-modal spatiotemporal attention fusion layer, communicatively connected to the data preprocessing layer and used to extract features from the data of the different modalities and to dynamically fuse the multimodal features through a cross-modal spatiotemporal attention mechanism; an unsafe behavior identification and prediction layer, communicatively connected to the cross-modal spatiotemporal attention fusion layer and used to identify unsafe behaviors that have occurred based on the fused multimodal features and to predict potential unsafe behaviors based on causal reasoning; a hierarchical early warning and response layer, communicatively connected to the unsafe behavior identification and prediction layer and used to generate early warning signals of different levels based on the identification and prediction results and to execute corresponding response measures; and an edge-cloud collaborative computing layer, communicatively connected to the multimodal perception layer, the data preprocessing layer, the cross-modal spatiotemporal attention fusion layer, the unsafe behavior identification and prediction layer, and the hierarchical early warning and response layer, respectively, to realize collaborative computing between the edge and the cloud, wherein lightweight models are deployed at the edge for real-time detection and large models are deployed in the cloud for complex analysis and model updates.
2. The intelligent identification and early warning system for unsafe behaviors of construction workers according to claim 1, characterized in that the multimodal perception layer includes: a smart safety helmet terminal, which integrates a heart rate sensor, a blood oxygen sensor, a 6-axis IMU sensor, a helmet removal detection sensor, a front-facing high-definition camera, a BeiDou-3 positioning module, a 5G communication module, and a voice prompt module, and which is used to collect the wearer's heart rate, blood oxygen, movement posture, location information, and first-person perspective video, and to provide real-time data transmission and voice warning prompts; fixed visual perception units, which include multiple high-definition cameras and infrared thermal imaging cameras deployed in key areas of the construction site to collect global visual data of those areas; an environmental sensing unit, which includes temperature and humidity sensors, dust concentration sensors, wind speed sensors, and harmful gas sensors, used to collect environmental data at the construction site; and mobile sensing units, which include multimodal sensors mounted on drones and inspection robots, used to supplement monitoring in the blind spots of the fixed sensing equipment.
3. The intelligent identification and early warning system for unsafe behaviors of construction workers according to claim 2, characterized in that the data preprocessing layer includes: a spatiotemporal registration module, used to align data collected by different sensors in time and space based on the timestamps and location information provided by the BeiDou-3 positioning module; an outlier removal module, used to remove outliers from the multimodal data using a sliding-window statistical method; and a data standardization module, used to standardize data from different modalities so that they share a common scale and distribution.
4. The intelligent identification and early warning system for unsafe behaviors of construction workers according to claim 3, characterized in that the cross-modal spatiotemporal attention fusion layer includes: a single-modal feature extraction module, which includes a visual feature extraction submodule, a physiological feature extraction submodule, a motion feature extraction submodule, and an environmental feature extraction submodule, used to extract features from the data of the respective modalities; a cross-modal spatiotemporal attention module, used to calculate the attention weights of the features of the different modalities at different spatiotemporal locations and to perform weighted fusion of the multimodal features based on the attention weights to generate a fused feature map; and a feature fusion output module, used to perform dimensionality transformation and normalization on the fused feature map to generate a unified fused multimodal feature vector.
5. The intelligent identification and early warning system for unsafe behaviors of construction workers according to claim 4, characterized in that the operation of the cross-modal spatiotemporal attention module includes: positionally encoding the feature data of each modality to add spatiotemporal location information; calculating the similarity between each spatiotemporal location in each modality's feature data and all spatiotemporal locations in the other modalities' feature data; calculating the attention weight for each spatiotemporal location based on the similarity; and weighting and summing the features of the different modalities according to the attention weights to obtain the fused feature map.
6. The intelligent identification and early warning system for unsafe behaviors of construction workers according to claim 4, characterized in that the unsafe behavior identification and prediction layer includes: an unsafe behavior identification module, used to identify unsafe behaviors that have occurred based on the fused multimodal feature vectors, including not wearing a safety helmet, not wearing a safety belt, entering a dangerous area without permission, smoking, fighting, falling, climbing, and operating equipment without permission; a behavior pattern learning module, used to learn the normal behavior patterns of construction site workers and establish a behavior baseline, specifically by collecting multimodal data of workers under normal working conditions for 7 consecutive days and clustering the data with the DBSCAN clustering algorithm, with the neighborhood radius set to 0.5 and the minimum number of samples set to 5, to obtain different normal behavior patterns, an early warning prompt being triggered when a worker's behavior deviates from the normal behavior patterns; a causal reasoning prediction module, used to predict potential unsafe behaviors within a preset time period by analyzing, through causal reasoning, the probability and timing of unsafe behaviors based on the multimodal feature vectors and the behavior baseline; and a risk level assessment module, used to assess the risk level based on the type of unsafe behavior, its probability of occurrence, and its potential consequences.
7. The intelligent identification and early warning system for unsafe behaviors of construction workers according to claim 6, characterized in that the causal reasoning prediction module adopts a causal reasoning model based on graph neural networks. This model treats individual construction workers, construction equipment entities, construction site environment areas, and behavioral events as different types of nodes. It uses directed edges to represent the interaction between construction workers and construction equipment, the positional relationship between construction workers and the construction site environment, the triggering relationship between construction workers and behavioral events, the adaptation relationship between construction equipment and the construction site environment, the association relationship between construction equipment and behavioral events, and the inducing relationship between the construction site environment and behavioral events, in chronological order of causal occurrence. By learning the causal relationship weights in historical multimodal data, it predicts unsafe behaviors that may occur within a preset time period in the future.
8. The intelligent identification and early warning system for unsafe behaviors of construction workers according to claim 7, characterized in that the hierarchical early warning and response layer includes: an early warning level classification module, used to classify early warning signals into Level 1, Level 2, and Level 3 early warnings based on the risk level output by the risk level assessment module; a multi-channel early warning release module, used to release early warning signals through voice prompts from smart safety helmets, on-site sound and light alarms, mobile apps for managers, and large monitoring screens at construction sites; and a graded response execution module, used to execute corresponding response measures according to the warning level, wherein a Level 1 warning corresponds to a voice reminder, a Level 2 warning corresponds to on-site intervention, and a Level 3 warning corresponds to emergency shutdown and emergency rescue activation.
9. The intelligent identification and early warning system for unsafe behaviors of construction workers according to claim 8, characterized in that the edge-cloud collaborative computing layer includes: edge computing units, deployed on smart helmet terminals, edge servers, and mobile sensing units, used to run lightweight unsafe behavior recognition models to achieve real-time detection and local early warning; a cloud computing unit, deployed on a cloud server, used to run large models for complex behavioral analysis, causal reasoning, model training, and adaptive scene transfer learning; a model update module, used to update the parameters of the model trained in the cloud to the edge computing units, so as to achieve continuous optimization of the model; and an adaptive scene transfer learning module, deployed in the cloud computing unit, used to collect a small amount of labeled data in new scenes and to fine-tune the parameters of the model's feature extraction layer through a domain adaptive algorithm, enabling the model to quickly adapt to different construction stages and different construction site scenarios.