[0052] Example 1
[0053] Supported by research and development funds from the Artificial Intelligence Research and Development Center of China Post and Communication Construction Consulting Co., Ltd., the invention applies deep learning and computer vision technology to safety management and control at construction sites. It uses high-performance image processing to intelligently monitor the construction site, identify and handle potential hazards, and ensure that safety information from the site is fed back promptly and comprehensively.
[0054] As shown in Figures 1 and 2, the intelligent hazard identification system for outdoor construction sites includes: a video surveillance module, a face recognition module, a mask recognition module, a comprehensive hazard identification module, an anti-intrusion detection module, a helmet identification module, and an alarm module. The alarm module is connected to the face recognition module, the mask recognition module, the comprehensive hazard identification module, the anti-intrusion detection module, and the helmet identification module, respectively. Among them:
[0055] 1. Face recognition module at the construction site. A camera collects the face image, which is input to the face recognition module.
[0056] Specific process: face recognition technology authenticates the identity of each person entering the construction site to determine whether that person is qualified for on-site construction, for example whether workers in special trades hold the required construction qualification certificates. The detection model is divided into two stages: face detection and face recognition. The two-stage design simplifies data collection and reduces project cost. The face detection model crops and saves the face region and passes the face image to the face recognition module, which uses metric learning to identify and authenticate the construction personnel.
[0057] 2. Mask recognition module. Since the outbreak of COVID-19, whether personnel wear masks has become a new safety concern. A cascaded system combining face detection and mask recognition was therefore developed to identify whether each construction worker is wearing a mask.
[0058] Specific process: first, the face region is cropped and input to the mask recognition system, which predicts whether the construction worker is wearing a mask. By using depthwise (layer-by-layer) convolution to compress the parameters of the neural network, the mask recognition algorithm reduces computation and improves the overall operating efficiency of the system.
[0059] Model training: heads of people with and without masks are collected as a data set. The prediction model is a binary classifier whose outputs are two classes: mask and no mask. Face alignment is first performed on the data set, which is then input into the prediction model for end-to-end training.
[0060] 3. Comprehensive hazard identification module, including a conventional detection module, a human-ladder operation detection module, and an in-depth analysis module. The conventional detection module detects whether construction workers are missing reflective clothing, whether warning signs are present on the construction site, and whether construction tools are insulated. The human-ladder operation detection module detects whether a human-ladder operation is under way on site and, if so, sends the detected video frame to the in-depth analysis module. The in-depth analysis module receives the video frame and judges whether it shows a single-person ladder operation, and/or more than one worker on the ladder, and/or non-standard movements by the ladder holder. The system thus supports the identification of common hazards in outdoor construction scenes, such as whether warning signs are placed as required, whether construction personnel wear reflective clothing, whether construction tools are insulated, and whether human-ladder operations are present on site.
[0061] Specific process: first, the high-definition camera on the construction site captures the real-time scene and transmits each frame to the comprehensive hazard identification module. The scene image is then preprocessed and passed to the prediction step, where target detection identifies whether potential hazards exist at the site. If the system detects a hazard, for example a worker not wearing reflective clothing as required, the detection result is input to the alarm module to warn the construction personnel.
[0062] If a human-ladder operation is recognized at the scene, the picture is transmitted to the in-depth analysis module for human-ladder scenes, which is based on a spatio-temporal convolutional network and analyzes whether the construction personnel meet the safety requirements for human-ladder operations.
[0063] Model training: photos of on-site hazards are collected as a data set, filtered, and manually labeled, and data augmentation is used to improve the generalization ability of the system. To improve the accuracy of small-target recognition, a gradient shunt algorithm supporting adaptive feature fusion learning is proposed. An attention loss function is also used to address the imbalance between positive and negative samples during training and improve recognition accuracy.
[0064] 4. In-depth analysis module for human-ladder operation scenes. The real-time scene captured by the high-definition camera is the input to this module, which performs in-depth semantic analysis of the human-ladder scene. The module first judges whether a human-ladder operation is present. If so, operation behavior detection is performed. If a single-person ladder operation is detected, or a multi-person ladder operation is detected but the operator is not supporting the ladder according to the specifications, the scene is judged to be a hazard.
[0065] Specific process: in the first stage, the comprehensive hazard identification module identifies whether a human-ladder operation scene exists on the construction site. In the second stage, if such a scene exists, the original human-ladder specification recognition algorithm determines whether at least two construction workers are present in the vertical spatial domain of the ladder. If a single-person ladder operation is found, the system directly judges it to be a hazard; otherwise, it enters the third stage. In the third stage, posture detection and behavior recognition are performed within the spatial domain of the ladder, and the high-level semantics of the picture are used to judge whether the workers are operating according to the specifications. If irregular operations are found, the alarm module gives a real-time alarm.
[0066] Model training: photos of the construction site are collected, a target detection model pre-trained on the COCO data set identifies the construction personnel in the photos, and the detected persons are cropped and saved to form a construction personnel data set. Posture points are manually marked on this data set to form a construction personnel posture data set. The construction personnel posture data set and the MPII data set are then used together as the training set, which is input into the single-person posture detection model for end-to-end training.
[0067] 5. Helmet identification module at the construction site.
[0068] Specific process: a safety helmet recognition system based on a convolutional neural network is proposed. The system uses a target detection algorithm to identify whether each construction worker is wearing a helmet; those without helmets are framed by a red warning box. After the module finds workers on site without helmets, it inputs this information into the alarm module, which gives a voice warning through its built-in voice system. Through this intelligent identification, the helmet-wearing rate on construction sites is increased by 50%.
[0069] Model training: construction scenes with and without helmets are collected and manually labeled to form a helmet data set. A multi-task detection model is established, and the helmet data set is input into the detection model for training.
[0070] 6. Anti-intrusion detection module. The on-site high-definition camera captures real-time images of high-risk areas and inputs them into the anti-intrusion detection module for analysis.
[0071] Specific process: multi-target dynamic tracking technology monitors high-risk areas of the construction site that personnel are not allowed to enter. The real-time scene captured by the high-definition camera is the input to this module, which performs anti-intrusion detection for high-risk construction areas. When a target appears in the camera's field of view, dynamic tracking follows the person; when the person approaches the critical boundary of the high-risk area, the information is transmitted to the alarm module, which automatically gives a voice warning.
[0072] 7. Weather-adaptive module. When construction personnel enter the site, the camera captures the working environment in real time, and the weather-adaptive module enhances the images of that environment. When the system judges that the current scene is underexposed or dark, it automatically performs tone mapping, using gamma correction and histogram equalization to adjust image contrast and tone. When the system judges that the weather is foggy, it automatically calls a dark-channel defogging algorithm to defog the video stream, with guided filtering added to the defogging algorithm to enhance image quality. This weather-adaptive image enhancement minimizes the impact of weather on system performance when the intelligent system operates outdoors and improves its robustness. With the image enhancement module, the overall operating efficiency of the system is increased by 20%, the recognition accuracy of each module is increased by an average of 10.3%, the recall rate by 5%, and the precision by 7%.
[0073] The intelligent hazard identification method for outdoor construction sites includes the following steps:
[0074] First, the real-time camera on the construction site acquires images of the job site, and the collected images are used as the training data set. A feature extraction network combined with gradient shunt technology then extracts the feature values of each video frame. An adaptive feature fusion network fuses the feature values from different stages to obtain fused feature values. Next, an adaptive training-sample sampling algorithm divides the anchor boxes into positive and negative samples to obtain the target values. Finally, the fused feature values and target values are substituted into the attention loss function, and an optimizer minimizes the loss to train the model. Once trained, the prediction model is used to perform perceptual analysis on real-time video of the construction site. The specific steps are as follows:
[0075] A new feature extractor is used. The hazard identification feature extractor uses 3*3 and 1*1 convolution kernels: the 1*1 kernels reduce the dimensionality of the channel layer, reducing the amount of computation, and the 3*3 kernels then extract features from local regions. In addition, following the residual network, long skip connections are introduced in the feature extraction stage to speed up model convergence. In total, the feature extractor contains 50 convolutional layers; the model architecture is shown in Figure 2.
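To see why the 1*1 reduction saves computation, a small weight-count comparison can be sketched; the channel sizes (256 reduced to 64) are illustrative assumptions, not values taken from the text.

```python
# Illustrative parameter count: plain 3*3 convolution vs. a bottleneck
# that reduces channels with a 1*1 kernel first (channel sizes assumed).

def conv_params(kernel, c_in, c_out):
    """Weights of a conv layer with a square kernel (biases ignored)."""
    return kernel * kernel * c_in * c_out

# Plain 3*3 convolution keeping 256 channels.
plain = conv_params(3, 256, 256)

# Bottleneck: 1*1 reduce to 64, 3*3 at 64, 1*1 restore to 256.
bottleneck = (conv_params(1, 256, 64)
              + conv_params(3, 64, 64)
              + conv_params(1, 64, 256))

print(plain, bottleneck)          # 589824 69632
print(round(plain / bottleneck))  # roughly 8x fewer weights
```

The same trade-off repeats in every bottleneck block, which is why the 50-layer extractor stays affordable.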
[0076] Gradient shunt technology is used. Convolutional neural networks are widely used in computer vision because of their modeling power, but that power depends on expensive computing resources, and reducing computational cost has become a focus of current target recognition research. In the hazard identification module, a new gradient shunt model is proposed that uses gradient shunting to eliminate redundant gradient information so that idle neurons in the model can be used efficiently. The gradient shunt algorithm divides the input features into two parts: one part participates in the computation of the local network, while the other part skips the local network entirely and is concatenated with the local network's output along the channel dimension. The gradient shunt algorithm not only reduces the computational cost and memory usage of the network, but also improves the accuracy and speed of the hazard identification module.
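The split-skip-concatenate flow can be sketched in a toy form; here a feature map is modeled as a flat list of channel values and `local_net` is a hypothetical stand-in for the local sub-network, both simplifications not in the text.

```python
# Toy sketch of the gradient shunt: split the channels, run only half
# through the local network, and concatenate the untouched half back.

def gradient_shunt(features, local_net):
    half = len(features) // 2
    x_shortcut = features[:half]   # crosses the local network untouched
    x_local = features[half:]      # participates in the local computation
    x_stage = local_net(x_local)
    # Concatenate along the "channel" dimension.
    return x_shortcut + x_stage

double = lambda part: [2 * v for v in part]  # stand-in local network
print(gradient_shunt([1, 2, 3, 4], double))  # [1, 2, 6, 8]
```

Because the shortcut half never enters the local network, its gradient path is separate, which is the redundancy-elimination effect described above.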
[0077] X_k = matmul(w_k, [X_0, X_1, …, X_{k-1}]);  X_stage = [X_shortcut, X_k]
[0078] The DenseNet-style shunt algorithm is as described in the above formula, which achieves the gradient shunt effect through cross-region connections. Here matmul denotes matrix multiplication; [X_0, X_1, …, X_k] denotes connecting k+1 matrices along the channel dimension; w_n, n = 1, 2, …, k, is the weight of each layer; X_n, n = 1, 2, …, k, is the input data of each layer; X_shortcut is the cross-region feature value that does not participate in the computation; and X_stage is the output value of the neural network in this region.
[0079] An adaptive feature fusion network is used. Traditional target detection algorithms fuse features from different levels simply with up-sampling and convolutional layers for channel fusion. However, features at different levels contribute differently to recognition performance, so a feature fusion network with learnable parameters is proposed. In the up-sampling path, a 1*1 convolution kernel first unifies the channels and the features are then up-sampled; in the down-sampling path, a 3*3 convolution kernel adjusts the channels and resolution at the same time. A learnable parameter w_ij is then introduced for each feature layer and normalized with softmax, and finally a weighted sum is taken over the feature layers. Through this adaptive fusion network, the correlations between features at different levels are modeled explicitly, which greatly improves model performance.
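The softmax-normalize-then-weighted-sum step can be sketched at a single spatial point; the feature values are assumed already resized to a common scale, and all numbers are illustrative.

```python
import math

# Sketch of softmax-normalized adaptive feature fusion at one point.

def softmax(weights):
    exps = [math.exp(w) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(layer_values, raw_weights):
    """Weighted sum of per-layer feature values with learnable weights."""
    w = softmax(raw_weights)  # normalize so the weights sum to 1
    return sum(wi * v for wi, v in zip(w, layer_values))

# Equal raw weights reduce to a plain average of the two layers.
print(fuse([2.0, 4.0], [0.0, 0.0]))  # 3.0
```

During training the raw weights are learned by backpropagation, so the network decides per point how much each level contributes.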
[0080] y^l_(i,j) = Σ_n w^(n,l)_(i,j) · x^(n→l)_(i,j),  with Σ_n w^(n,l)_(i,j) = 1 (softmax-normalized)
[0081] The feature fusion algorithm is as described in the above formula: w^(n,l)_(i,j) are the learnable weight parameters corresponding to the feature values of layer l at point (i, j); x^(n→l)_(i,j) denotes the feature value of the n-th layer transformed into the scale of the l-th layer through convolution and pooling operations; and y^l_(i,j) is the fused feature value of layer l at point (i, j).
[0082] An attention loss function is used. In target detection, the imbalance between positive and negative samples during training causes the positive samples to be overwhelmed by the negatives. During backpropagation in the hazard identification algorithm, negative samples contribute a far larger share of the parameter gradient than positive samples, which biases the optimization of the hazard identification model. The current mainstream techniques use two-stage cascades and biased sampling to filter positive and negative samples and sample them proportionally, but balancing samples this way increases model complexity. Therefore, an attention loss function is proposed: a sample weight factor is added to the cross-entropy loss and is dynamically and adaptively adjusted according to how difficult each sample is to classify. Samples the model finds hard to classify receive a large factor, and samples it classifies easily receive a small one. With the attention loss function, optimization focuses on the hard samples while the large number of easy samples is suppressed. Using the attention loss function reduces the complexity of the hazard identification model and improves its convergence speed.
[0083] Loss = -α_t · (1 - p_t)^γ · log(p_t)
[0084] p_t = p, if y = 1;  p_t = 1 - p, otherwise
[0085] α_t = α, if y = 1;  α_t = 1 - α, otherwise
[0086] The attention loss function (focal loss) is as described in the above formula, where α_t and γ are harmonic coefficients; p_t reflects the difficulty of predicting the target region; p is the logistic regression output of the model; y is the sample label value; and α is a constant between 0 and 1.
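A minimal numeric sketch of the formula above shows the down-weighting effect; the α and γ values are common defaults, not taken from the text.

```python
import math

# Minimal focal-loss sketch following the formula above.

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy, well-classified positive contributes far less loss than a hard
# one, so optimization concentrates on the hard samples.
easy = focal_loss(0.9, 1)  # confident correct prediction
hard = focal_loss(0.1, 1)  # confident wrong prediction
print(easy < hard)         # True
```

With γ = 0 and α = 0.5 the expression reduces to (half of) ordinary cross-entropy, which makes the role of the (1 - p_t)^γ factor easy to see.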
[0087] An adaptive training-sample sampling strategy is used. Unlike the traditional sample sampling strategy, an adaptive sampling method is proposed. Traditional sampling uses hyperparameters as the classification threshold between positive and negative samples, which adds hyperparameters to the model, and different thresholds produce different classification results, increasing the complexity of model optimization. The hazard identification algorithm therefore adopts an adaptive sampling strategy: for each ground-truth sample box, from each feature layer of the feature pyramid, the 9 anchor boxes whose center points are closest to the center point of the ground-truth box are taken. The IoUs between all these anchor boxes and the ground-truth boxes are computed, along with the mean and standard deviation of all the IoUs. Finally, the sum of the mean and the standard deviation is used as the adaptive classification threshold.
[0088] center_distance = sqrt((x_pred - x_gt)² + (y_pred - y_gt)²)
[0089] iou_mean = Mean(iou_i), i ∈ S_i
[0090] iou_std = Std(iou_i), i ∈ S_i
[0091] iou_thres = iou_mean + iou_std
[0092] The adaptive training-sample sampling strategy is as described in the above formulas, where center_distance is the distance from the center point of the predicted bounding box to the center point of the ground-truth bounding box; (x_gt, y_gt) are the coordinates of the center point of the ground-truth bounding box, and (x_pred, y_pred) are the coordinates of the center point of the predicted bounding box; iou_i is the similarity between the predicted bounding box and the ground-truth bounding box; S_i is the set of the 9 predicted boxes closest to the ground-truth box; iou_mean and iou_std are the mean and standard deviation of the 9 similarities, respectively; and iou_thres is the threshold distinguishing positive from negative samples.
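The mean-plus-standard-deviation threshold can be sketched directly; boxes are assumed to be (x1, y1, x2, y2) tuples, and the population standard deviation is an assumption since the text does not say which estimator is used.

```python
import statistics

# Sketch of the adaptive threshold: IoU of candidate anchors against a
# ground-truth box, then mean + standard deviation as the split point.

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def adaptive_threshold(ious):
    return statistics.mean(ious) + statistics.pstdev(ious)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))      # 1/7 ≈ 0.1429
print(adaptive_threshold([0.2, 0.4, 0.6]))  # ≈ 0.5633
```

Anchors with IoU above the returned threshold would be labeled positive, the rest negative, so the split point adapts to each ground-truth box instead of being a global hyperparameter.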
[0093] Furthermore, in anti-intrusion detection for high-risk areas, the video frame is first passed through the target recognition model to detect the location of each person. Kalman filtering and metric-learning-based target matching between frames are then used to dynamically track the targets in the picture. When a target is found entering the critical area, the system automatically issues an alarm.
[0094] x̂⁻_k = A·x̂_{k-1} + B·u_{k-1}
[0095] P⁻_k = A·P_{k-1}·Aᵀ + Q
[0096] K_k = P⁻_k·Hᵀ·(H·P⁻_k·Hᵀ + R)⁻¹
[0097] x̂_k = x̂⁻_k + K_k·(Z_k - H·x̂⁻_k)
[0098] P_k = (I - K_k·H)·P⁻_k
[0099] The Kalman filter algorithm is as described in the above formulas, where x̂_{k-1} and x̂_k are the posterior estimates at frames k-1 and k of the video; x̂⁻_k is the prior value predicted from the posterior estimate of the previous frame; A is the state transition matrix and Aᵀ its transpose; B is the matrix converting the control input into the state; u_{k-1} is the action of the outside world on the system, which is set to 0 in the target tracking algorithm of this module; P_k and P_{k-1} are the covariances of x̂_k and x̂_{k-1}, respectively, and P⁻_k is the covariance of the predicted value; H is the conversion matrix from the state coordinate system to the measurement coordinate system and Hᵀ its transpose; Z_k is the observation at frame k, taken here as the prediction of the target detection algorithm; K_k is the Kalman gain; and R and Q are the noise covariances. The Kalman filter predicts the position of a continuously moving object in the next frame, yielding a bounding box whose regional feature values are obtained through ROI pooling. ROI pooling is then also applied to the actual detection result of that frame to obtain its regional feature values. Finally, the Mahalanobis distance between the two sets of feature values is computed to determine whether they belong to the same construction worker, realizing tracking of the personnel appearing in the video.
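The predict/update cycle above can be sketched for a one-dimensional state (a single coordinate), with A = H = 1 and u = 0 as stated in the text; the Q and R values are assumed for illustration.

```python
# One-dimensional Kalman filter sketch following the equations above
# (scalar state; A = H = 1, u = 0; Q and R are assumed values).

def kalman_step(x_post, p_post, z, q=0.01, r=1.0):
    # Predict: prior from the previous posterior (A = 1, u = 0).
    x_prior = x_post
    p_prior = p_post + q
    # Update: blend prediction and observation with the Kalman gain.
    k = p_prior / (p_prior + r)
    x_new = x_prior + k * (z - x_prior)
    p_new = (1.0 - k) * p_prior
    return x_new, p_new

# Repeated observations at position 5 pull the estimate toward 5.
x, p = 0.0, 1.0
for _ in range(50):
    x, p = kalman_step(x, p, z=5.0)
print(round(x, 2))  # close to 5
```

In the tracker the state is the bounding-box position and the observation Z_k comes from the detector, but the blend of prediction and measurement works exactly as in this scalar case.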
[0100] Further, after the video frame is extracted, the real-time weather condition is judged and the frame is adjusted accordingly for image enhancement. The system has auto-focus, auto-exposure, and auto-white-balance functions, so it can adapt to changing weather and improve the accuracy and flexibility of the outdoor target identification modules, such as the hazard identification module and the helmet identification module. The enhancement itself follows the weather-adaptive module described above: tone mapping with gamma correction and histogram equalization for underexposed or dark scenes, and dark-channel defogging with guided filtering for foggy weather.
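The gamma-correction step for dark scenes can be sketched on a few 8-bit pixel values; the gamma value is illustrative, not taken from the text.

```python
# Gamma-correction sketch for dark scenes: gamma < 1 brightens
# under-exposed pixels (8-bit values; the gamma value is illustrative).

def gamma_correct(pixels, gamma):
    return [round(255 * (p / 255) ** gamma) for p in pixels]

dark = [10, 40, 80, 160]
brightened = gamma_correct(dark, 0.5)
print(brightened)  # every value is raised; dark values most strongly
```

Histogram equalization would complement this by spreading the corrected values over the full intensity range, but gamma mapping alone already recovers detail in the shadows.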
[0101] Further, after the video frame is extracted, the method also includes the following steps: extract the target face image, perform identity authentication on it, and compare the authenticated person against the database to determine whether he or she is qualified for construction; if not, issue a sound and/or light warning;
[0102] and recognize whether the target face image is wearing a mask; if not, issue a sound and/or light warning. Among them:
[0103] In the target face recognition process, when reviewing the qualifications of the operators detected in the video frames, the present invention proposes a cascaded neural network algorithm to detect and recognize the faces of on-site construction workers; persons whose recognition results do not meet the qualifications are prohibited from working on the construction site. The face detection stage uses a three-stage cascaded neural network consisting of a region proposal network, a region refinement network, and an output network. First, the region proposal network, with a 12*12 receptive field, performs a preliminary screening of face positions in the video. The preliminarily detected face regions are then cropped and scaled as the input of the refinement network, whose overall receptive field is 24*24 and which corrects the candidate face regions. The regions predicted in the second stage are cropped and scaled as the input of the output network, whose receptive field is 48*48 and whose final output is the location of the face region and the coordinates of the facial key points. In the face recognition stage, a loss function with an added angular margin is proposed. The algorithm is easy to implement and very efficient: by adding the angular margin m in the final prediction layer, the intra-class distance is reduced and the inter-class distance is increased during classification. The angular-margin loss function effectively enhances the discriminative ability of the feature embedding layer for human faces.
[0104] L_face = -(1/N) Σ_{i=1}^{N} log( e^{s·cos(θ_{yi} + m)} / ( e^{s·cos(θ_{yi} + m)} + Σ_{j≠yi} e^{s·cos θ_j} ) )
[0105] The face recognition loss function with the added margin is shown above, where L_face is the loss value; m is the angular margin, a constant; N is the batch size during training; θ_{yi} is the angle between the input vector of the final classification layer and the y_i-th row weight vector of that layer; θ_j is the angle between the input vector and the weight vectors of the other rows; and s is the product of the moduli of the normalized input vector and weight vector.
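The effect of the margin on a single logit can be sketched numerically; the m and s values are common choices, not taken from the text, and the comparison assumes θ + m stays below π.

```python
import math

# Sketch of the additive angular margin: the target-class logit uses
# cos(theta + m) instead of cos(theta) (m and s values are assumed).

def margin_logit(cos_theta, m=0.5, s=64.0):
    theta = math.acos(cos_theta)
    return s * math.cos(theta + m)

def plain_logit(cos_theta, s=64.0):
    return s * cos_theta

# Adding the margin lowers the target logit, so the model must push
# same-class features closer to the class weight vector to compensate.
print(margin_logit(0.8) < plain_logit(0.8))  # True
```

That compensation is exactly what shrinks the intra-class angle and widens the inter-class angle during training.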
[0106] In the face mask recognition process, the same multi-task cascaded neural network as in the face recognition module is used to recognize the face in the video. The recognized face is cropped and scaled to a uniform size of 224*224 and input into the classification model for binary detection of whether a mask is worn. The classification model uses an inverted residual block that differs from the original residual network. In the original residual network, the residual block first uses a 1*1 convolution kernel to compress the number of channels of the input data, then a 3*3 convolution kernel to extract local features, and finally a 1*1 convolution kernel to adjust the channels. In this module, the number of feature channels is first expanded and then reduced: the proposed inverse residual technique first uses a 1*1 convolution kernel to expand the channels of the input data, then applies a 3*3 depthwise (layered) convolution, and finally uses a 1*1 convolution kernel to produce the output. Intuitively, the features extracted by the preceding convolutional layers are already high-order feature information, and compressing them with a squashing activation function would lose some of this information; therefore, in the inverse residual technique, the activation function is removed from the output, i.e. the output features are not compressed by an activation function. The system uses artificial intelligence technology to automatically recognize whether on-site workers are wearing masks, improving the efficiency of epidemic prevention and control and reducing labor costs.
Through the combined use of the cascaded network model, the inverse residual technique, and depthwise convolution, the recognition accuracy of the mask recognition module reaches 98.7%, approaching the Bayes error level.
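The saving from the depthwise 3*3 plus pointwise 1*1 pair inside the inverted residual block can be sketched with a weight count; the channel size (144) is an illustrative assumption.

```python
# Illustrative weight count: standard 3*3 convolution vs. the depthwise
# 3*3 + pointwise 1*1 pair used in the inverted residual block
# (channel sizes assumed for illustration).

def standard_conv(k, c_in, c_out):
    return k * k * c_in * c_out

def depthwise_separable(k, c_in, c_out):
    depthwise = k * k * c_in  # one k*k filter per input channel
    pointwise = c_in * c_out  # 1*1 kernels mix the channels
    return depthwise + pointwise

print(standard_conv(3, 144, 144))        # 186624
print(depthwise_separable(3, 144, 144))  # 22032
```

The roughly eight-fold reduction at this layer is what lets the block afford the preceding 1*1 channel expansion.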
[0107] Furthermore, after the potential hazards at the construction site are identified in the video frame, the method also includes the following steps: detect whether a human-ladder operation exists at the site; if so, identify the construction personnel within the spatial domain of the ladder in the video frame, crop the video frame using the identified bounding boxes, apply the posture detection algorithm to the cropped frames, and perform behavior recognition to judge whether there is a single-person ladder operation, and/or more than one worker on the ladder, and/or non-standard movements by the ladder holder; if any of these conditions exists, issue a sound and/or light warning.
[0108] For human-ladder operation scenes, the computer must understand the relationship between the person and the ladder in the image and recognize the operator's behavior. A semantic understanding algorithm using posture detection and behavior recognition is proposed. Considering the real-time operating efficiency of the system, behavior recognition in this module uses a heuristic algorithm combining manual design and deep learning. With this algorithm, hazards in human-ladder operations, such as single-person ladder operations and non-standard ladder-holding movements, can be effectively eliminated in real time to ensure safety. The posture detection algorithm is a top-down single-person method, which first detects the region where the human body is located and then applies single-person posture detection to locate the posture points.
[0109] heatmap_gt(x, y) = exp( -((x - x_i)² + (y - y_i)²) / (2σ²) )
[0110] Loss = MSE(heatmap_pred, heatmap_gt)
[0111] The single-person posture detection algorithm is described by the above formulas: heatmap_gt is the ground-truth heat map of the sample generated from a Gaussian distribution; MSE is the mean-square-error loss function; heatmap_pred is the posture heat map predicted by the model; σ is the standard deviation of the Gaussian distribution; (x_i, y_i) are the coordinates of the i-th posture point; and x and y are the coordinates of a point on the heat map.
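The heat-map construction and loss above can be sketched on a tiny grid; the grid size and σ are illustrative assumptions.

```python
import math

# Sketch of the ground-truth heat map: a Gaussian centered on a posture
# point, plus the MSE loss against a predicted map (grid and sigma assumed).

def gaussian_heatmap(width, height, cx, cy, sigma):
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]

def mse(pred, gt):
    n = len(gt) * len(gt[0])
    return sum((p - g) ** 2
               for row_p, row_g in zip(pred, gt)
               for p, g in zip(row_p, row_g)) / n

gt = gaussian_heatmap(5, 5, 2, 2, 1.0)
print(gt[2][2])     # 1.0 at the posture point
print(mse(gt, gt))  # 0.0 for a perfect prediction
```

At inference the predicted posture point is recovered as the location of the maximum of heatmap_pred, which is why the soft Gaussian target is preferred over a single hot pixel.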