Intelligent recognition method and system for abnormal behavior of construction site based on deep learning

By using a deep learning-based background dynamic decoupling and consistency verification network, the problem of interference from non-human dynamic elements in construction site monitoring is solved, enabling accurate identification of abnormal behaviors on the construction site, ensuring that the input features are purely human behavioral features, and improving the accuracy of identification.

CN122244804A · Pending · Publication Date: 2026-06-19 · JIANGXI SHENGJIE MUNICIPAL ENG CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
JIANGXI SHENGJIE MUNICIPAL ENG CO LTD
Filing Date
2026-05-12
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies cannot effectively distinguish between non-human dynamic elements and human behavioral characteristics in construction site monitoring, resulting in mixed and erroneous foreground regions in behavior recognition data, which affects the accuracy of classification networks.

Method used

A deep learning-based approach is adopted, which uses a background dynamic decoupling network and a consistency verification network to estimate background motion and separate the foreground human body. Deformable convolution is used to predict the background pixel-level displacement field, stripping out dynamic background elements. A spatiotemporal feature similarity matrix is used to remove pseudo-motion masks and retain temporally consistent motion masks, which are then input into a behavior classification network for recognition.

Benefits of technology

It effectively eliminates dynamic background interference such as mechanical operation and dust, ensuring that the features of the input behavior classification network are only pure human behavior features, thereby improving the accuracy of behavior recognition and solving the problem of misidentification caused by non-human dynamic interference in the existing technology.

✦ Generated by Eureka AI based on patent content.

Figure CN122244804A_ABST

Abstract

This invention relates to the field of image recognition technology, specifically to a method and system for intelligent recognition of abnormal behavior at construction sites based on deep learning. The method acquires continuous image frames and extracts shallow feature maps, which are then input into a parallel decoupling network of a background dynamic decoupling network. A background motion estimation branch predicts the background pixel-level displacement field based on deformable convolution, and the shallow feature map of the current frame is reverse-distorted and aligned to generate an aligned feature map. A foreground human body separation branch extracts the foreground human body motion mask based on the aligned feature map. The spatiotemporal feature similarity matrix of the foreground human body motion masks across multiple consecutive frames is calculated, and pseudo-motion masks caused by sudden changes in illumination are removed, retaining temporally consistent motion masks. The temporally consistent motion masks are then input into a behavior classification network to output the recognition result. This invention removes non-human dynamic background interference at the feature level, eliminates mask anomalies caused by illumination changes, and ensures the purity of the input features to the behavior classification network.

Description

Technical Field

[0001] This invention relates to the field of image recognition technology, and more specifically to a method and system for intelligent recognition of abnormal behavior at construction sites based on deep learning.

Background Technology

[0002] In the field of construction site monitoring image recognition, conventional abnormal behavior recognition schemes typically employ background subtraction or optical flow methods to extract moving target regions from video sequences. These extracted moving target regions are then input into a classification network for behavior determination. In the implementation of background subtraction, the system establishes and updates a Gaussian mixture background model in real time, performs pixel-level difference operations between the currently acquired video frame and the background model, and segments out changing regions using a fixed threshold, identifying these changing regions as foreground moving targets. In the implementation of optical flow, the system calculates the gradient changes in pixel grayscale values between adjacent video frames to estimate the motion vector field of each pixel in the image, and delineates the foreground moving region based on the amplitude distribution of the motion vector field.

[0003] Construction sites contain numerous non-human dynamic elements, including the operation of large robotic arms, the movement of transport vehicles, and dust. When using conventional background subtraction or optical flow methods, Gaussian mixture models cannot adapt to the drastic local pixel changes caused by these non-human dynamic elements, misclassifying mechanical movement or dusty areas as foreground moving targets. Similarly, the motion vector field calculated by optical flow also includes motion vectors from machinery and dust, preventing the system from distinguishing between non-human dynamic interference in the background and actual foreground behavioral features at the pixel level. This mixture of non-human dynamic interference and human behavioral features results in a large number of erroneous foreground regions in the data input to the classification network, preventing the behavior classification network from acquiring purely human behavioral features for judgment.

Summary of the Invention

[0004] The purpose of this invention is to provide a method and system for intelligent identification of abnormal behavior at construction sites based on deep learning, which can solve the problems mentioned in the background art.

[0005] To achieve the above objectives, the technical solution adopted by the present invention is as follows:

[0006] A deep learning-based intelligent recognition method for abnormal behavior at construction sites includes: acquiring continuous image frames from construction site monitoring videos; inputting the continuous image frames into a preset feature encoder to extract shallow feature maps corresponding to two adjacent frames; inputting the shallow feature maps into a background dynamic decoupling network, processing the shallow feature maps through a parallel decoupling network included in the background dynamic decoupling network, wherein the parallel decoupling network includes a background motion estimation branch and a foreground human body separation branch; the background motion estimation branch predicts background pixel-level displacement fields based on deformable convolution, and uses the background pixel-level displacement fields to reverse-distort and align the shallow feature maps of the current frame to generate aligned feature maps; the foreground human body separation branch extracts a foreground human body motion mask based on the aligned feature maps; inputting the foreground human body motion masks corresponding to multiple consecutive frames into a consistency verification network, calculating the spatiotemporal feature similarity matrix of the foreground human body motion masks, removing pseudo-motion masks caused by sudden changes in illumination based on the spatiotemporal feature similarity matrix, and retaining temporally consistent motion masks; inputting the temporally consistent motion masks into a behavior classification network, and outputting the recognition result of abnormal behavior at the construction site.

[0007] Preferably, the background motion estimation branch predicts the background pixel-level displacement field based on deformable convolution, comprising: concatenating the shallow feature maps of the reference frame in the two adjacent frames with the shallow feature map of the current frame to generate a concatenated feature map; inputting the concatenated feature map into a multi-layer deformable convolutional layer in the background motion estimation branch, wherein the multi-layer deformable convolutional layer learns the spatial offset in the concatenated feature map to generate the background pixel-level displacement field that matches the size of the shallow feature map of the current frame, wherein the background pixel-level displacement field includes a horizontal displacement component and a vertical displacement component; and using a bilinear interpolation algorithm in conjunction with the background pixel-level displacement field to sample and offset the shallow feature map of the current frame to generate the aligned feature map.

[0008] Preferably, the foreground human body separation branch extracts the foreground human body motion mask based on the aligned feature map, comprising: inputting the aligned feature map into the encoder-decoder architecture in the foreground human body separation branch; the encoder-decoder architecture performing downsampling and upsampling processing on the aligned feature map to output a multi-scale feature map; performing pixel-by-pixel sigmoid activation processing on the multi-scale feature map to generate a foreground probability map with the same resolution as the aligned feature map; calculating the average probability of all pixels in the foreground probability map; using the average probability as a dynamic segmentation benchmark; marking pixel regions in the foreground probability map that are greater than the dynamic segmentation benchmark as foreground regions; marking pixel regions that are smaller than the dynamic segmentation benchmark as background regions; and generating a binarized foreground human body motion mask.
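The binarization step described above (pixel-wise sigmoid, mean probability as the dynamic segmentation benchmark) can be sketched in NumPy as follows. This is a minimal illustration rather than the applicant's implementation: the function name `foreground_mask` and the sample logits are hypothetical, and pixels exactly equal to the benchmark are assigned to the background, a case the text leaves open.

```python
import numpy as np

def foreground_mask(logits: np.ndarray) -> np.ndarray:
    """Binarize a map of raw scores using the mean probability as threshold."""
    prob = 1.0 / (1.0 + np.exp(-logits))        # pixel-by-pixel sigmoid activation
    benchmark = prob.mean()                     # dynamic segmentation benchmark
    # Assumption: ties (prob == benchmark) fall into the background.
    return (prob > benchmark).astype(np.uint8)  # 1 = foreground, 0 = background

# Hypothetical 2x3 score map: three confident foreground pixels, three background.
logits = np.array([[-4.0, -3.0, 5.0],
                   [-4.0,  4.0, 5.0]])
mask = foreground_mask(logits)
```

Because the threshold is recomputed per frame, a frame with uniformly low probabilities still yields some foreground pixels, which is one reason a later consistency check across frames is useful.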

[0009] Preferably, the method for calculating the spatiotemporal feature similarity matrix of the foreground human motion mask includes: extracting three foreground human motion masks corresponding to three consecutive frames of images, and flattening the three foreground human motion masks into one-dimensional mask vectors respectively; calculating the first cosine similarity between the one-dimensional mask vector corresponding to the first frame and the one-dimensional mask vector corresponding to the second frame, calculating the second cosine similarity between the one-dimensional mask vector corresponding to the second frame and the one-dimensional mask vector corresponding to the third frame, and combining the first cosine similarity and the second cosine similarity to construct the spatiotemporal feature similarity matrix; when there is a negative correlation between the first cosine similarity and the second cosine similarity in the spatiotemporal feature similarity matrix, determining that the foreground human motion mask corresponding to the intermediate frame is a pseudo-motion mask caused by a sudden change in illumination and removing it.

[0010] Preferably, the step of extracting shallow feature maps corresponding to two adjacent frames includes: inputting each frame of the continuous image frames into the residual network backbone of the feature encoder, and extracting a single-frame initial feature map through the initial convolutional layer of the residual network backbone; concatenating the two single-frame initial feature maps corresponding to two adjacent frames in the channel dimension to generate a dual-channel initial feature map; inputting the dual-channel initial feature map into the residual block group of the residual network backbone, and performing spatial feature extraction on the dual-channel initial feature map through multiple cascaded residual blocks in the residual block group, and outputting the shallow feature map corresponding to the two adjacent frames.

[0011] Preferably, the step of inputting the temporally consistent motion mask into the behavior classification network includes: performing element-wise multiplication of the temporally consistent motion mask with the original images in the consecutive image frames to generate a masked temporal image sequence; inputting the masked temporal image sequence into a three-dimensional spatiotemporal convolutional layer in the behavior classification network, wherein the three-dimensional spatiotemporal convolutional layer performs convolution processing on the masked temporal image sequence along the time and spatial dimensions to extract a spatiotemporal behavior feature map; inputting the spatiotemporal behavior feature map into a fully connected layer in the behavior classification network, wherein the fully connected layer outputs a classification probability vector corresponding to multiple preset abnormal behavior categories of the temporal image sequence, and the preset abnormal behavior category corresponding to the maximum probability value in the classification probability vector is taken as the construction site abnormal behavior recognition result.
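The element-wise multiplication that produces the masked temporal image sequence can be illustrated with NumPy broadcasting. The sketch below assumes frames stored as `(T, H, W, 3)` arrays and masks as `(T, H, W)` arrays of 0/1 values; both layouts and the name `apply_masks` are assumptions made for illustration.

```python
import numpy as np

def apply_masks(frames: np.ndarray, masks: np.ndarray) -> np.ndarray:
    """Keep only foreground pixels: frames (T, H, W, 3) times masks (T, H, W)."""
    # A trailing axis lets each mask value broadcast over the color channels.
    return frames * masks[..., None]

frames = np.ones((2, 2, 2, 3))   # two hypothetical 2x2 RGB frames
masks = np.zeros((2, 2, 2))
masks[0, 0, 0] = 1               # a single foreground pixel in the first frame
masked = apply_masks(frames, masks)
```

The result keeps the shape of the input sequence, so it can be passed to a three-dimensional convolution without further reshaping.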

[0012] Preferably, the multi-layer deformable convolutional layer learns the spatial offset in the concatenated feature map by: performing convolution operations on the concatenated feature map using regular convolutional kernels in the multi-layer deformable convolutional layer to output an initial offset feature map; performing channel rearrangement operations on the initial offset feature map to divide the channel dimension into sub-channel groups equal in number to the regular convolutional kernels, with each sub-channel group corresponding to one output two-dimensional coordinate offset; superimposing the two-dimensional coordinate offset onto the regular sampling grid coordinates of the regular convolutional kernels to generate deformable sampling grid coordinates; and using the deformable sampling grid coordinates to perform feature sampling on the concatenated feature map to generate the background pixel-level displacement field.
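How offsets are rearranged and superimposed on a regular sampling grid can be sketched as below. Several details are assumptions, since the text does not fix them: the offset map is taken to have one two-dimensional offset per kernel sampling position (K = 9 for a 3×3 kernel), channels are assumed to be ordered as (dy, dx) pairs, and the layout is channel-last. The function name `deformable_grid` is hypothetical.

```python
import numpy as np

def deformable_grid(offsets: np.ndarray, kernel_size: int = 3) -> np.ndarray:
    """Map an (H, W, 2*K) offset map to (H, W, K, 2) sampling coordinates."""
    h, w, ch = offsets.shape
    k = kernel_size * kernel_size
    assert ch == 2 * k, "expected one (dy, dx) pair per kernel sampling position"
    # Channel rearrangement: group channels into K pairs (assumed (dy, dx) order).
    off = offsets.reshape(h, w, k, 2)
    # Regular sampling grid: pixel position plus fixed kernel offsets in [-r, r].
    r = kernel_size // 2
    tap_dy, tap_dx = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1),
                                 indexing="ij")
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    regular = np.stack(
        [ys[..., None] + tap_dy.reshape(1, 1, k),
         xs[..., None] + tap_dx.reshape(1, 1, k)],
        axis=-1,
    ).astype(np.float64)
    # Deformable grid = regular grid + learned offsets (fractional values allowed).
    return regular + off

grid = deformable_grid(np.zeros((4, 5, 18)))   # zero offsets reproduce the regular grid
```

With all offsets at zero the coordinates are those of an ordinary 3×3 convolution; non-zero offsets move each sampling position independently, which is what lets the layer follow irregular background motion.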

[0013] Preferably, the encoder-decoder architecture performs downsampling and upsampling processing on the aligned feature map, comprising: in the encoding stage of the encoder-decoder architecture, downsampling the aligned feature map multiple times through a convolutional layer with a preset stride to extract multiple downsampled feature maps with different spatial resolutions; in the decoding stage of the encoder-decoder architecture, upsampling the last downsampled feature map through a transposed convolutional layer, concatenating the upsampling result with the preceding downsampled feature map that has the same spatial resolution, and repeating the upsampling and feature concatenation operations until the original spatial resolution of the aligned feature map is restored, thereby generating the multi-scale feature map.
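The data flow of the encoding and decoding stages can be traced with a shape-level sketch. To stay dependency-free it swaps in simpler operators, and these are stand-ins rather than the layers named in the text: 2×2 average pooling replaces the strided convolution and nearest-neighbor repetition replaces the transposed convolution. The channel-last layout, the function names, and the `levels` parameter are likewise illustrative assumptions.

```python
import numpy as np

def down2(x: np.ndarray) -> np.ndarray:
    """(H, W, C) -> (H/2, W/2, C); average pooling stands in for a stride-2 convolution."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def up2(x: np.ndarray) -> np.ndarray:
    """(H, W, C) -> (2H, 2W, C); pixel repetition stands in for a transposed convolution."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def encode_decode(aligned: np.ndarray, levels: int = 2) -> np.ndarray:
    # Encoding stage: collect downsampled maps at successively halved resolutions.
    downs, x = [], aligned
    for _ in range(levels):
        x = down2(x)
        downs.append(x)
    # Decoding stage: upsample, then concatenate with the map of equal resolution.
    x = downs[-1]
    for skip in reversed(downs[:-1]):
        x = np.concatenate([up2(x), skip], axis=-1)
    return up2(x)   # final upsampling restores the input resolution

multi_scale = encode_decode(np.ones((8, 8, 4)), levels=2)
```

Each concatenation widens the channel dimension, so the output carries features from every resolution visited on the way down.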

[0014] Preferably, the method of flattening the three foreground human motion masks into one-dimensional mask vectors includes: assigning preset positive values to the pixels in the foreground region and preset negative values to the pixels in the background region of each foreground human motion mask, thereby generating a binary mask matrix after assignment; performing one-dimensional expansion of the binary mask matrix in row-major order to generate an initial one-dimensional vector; calculating the L2 norm of each initial one-dimensional vector, and dividing each element of each initial one-dimensional vector by the corresponding L2 norm to generate a normalized one-dimensional mask vector.
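The mask vectorization and the three-frame check can be sketched together. The preset values +1 and -1 are assumed here, as are the function names. The decision rule is only one possible reading of "negative correlation": the middle frame is flagged when its mask vector has negative cosine similarity with both neighbors. The text does not spell the rule out, so treat it as a placeholder.

```python
import numpy as np

def mask_vector(mask: np.ndarray, pos: float = 1.0, neg: float = -1.0) -> np.ndarray:
    """Signed, row-major flattened, L2-normalized version of a binary mask."""
    signed = np.where(mask > 0, pos, neg)               # foreground -> pos, background -> neg
    vec = signed.ravel(order="C").astype(np.float64)    # row-major expansion
    return vec / np.linalg.norm(vec)                    # norm is non-zero since pos, neg != 0

def is_pseudo_motion(m1: np.ndarray, m2: np.ndarray, m3: np.ndarray) -> bool:
    """Flag the middle mask of three consecutive frames as a pseudo-motion mask."""
    v1, v2, v3 = (mask_vector(m) for m in (m1, m2, m3))
    s12 = float(v1 @ v2)   # first cosine similarity (vectors are unit length)
    s23 = float(v2 @ v3)   # second cosine similarity
    # Assumed rule: the middle frame disagrees with both of its neighbors.
    return s12 < 0.0 and s23 < 0.0

person = np.zeros((4, 4), dtype=np.uint8)
person[1:3, 1:3] = 1                         # small, stable foreground region
flash = np.ones((4, 4), dtype=np.uint8)      # whole frame flagged, as after a light change
```

Encoding the background as a negative value rather than zero is what allows the similarity to become negative: with 0/1 masks the cosine similarity could never drop below zero.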

[0015] A deep learning-based intelligent recognition system for abnormal behavior at construction sites, comprising: a feature extraction processor for acquiring continuous image frames from construction site monitoring videos, inputting the continuous image frames into a preset feature encoder, and extracting shallow feature maps corresponding to two adjacent image frames; a background decoupling processor for inputting the shallow feature maps into a background dynamic decoupling network and processing the shallow feature maps through a parallel decoupling network included in the background dynamic decoupling network, wherein the parallel decoupling network includes a background motion estimation branch and a foreground human body separation branch, the background motion estimation branch predicts the background pixel-level displacement field based on deformable convolution and uses the background pixel-level displacement field to reverse-distort and align the shallow feature map of the current frame to generate an aligned feature map, and the foreground human body separation branch extracts the foreground human body motion mask based on the aligned feature map; a consistency verification processor for inputting the foreground human body motion masks corresponding to multiple consecutive frames into a consistency verification network, calculating the spatiotemporal feature similarity matrix of the foreground human body motion masks, removing pseudo-motion masks caused by sudden changes in illumination based on the spatiotemporal feature similarity matrix, and retaining the temporally consistent motion masks; and a behavior classification processor for inputting the temporally consistent motion masks into a behavior classification network and outputting the construction site abnormal behavior recognition result.

[0016] Compared with the prior art, the beneficial effects of the present invention are as follows:

[0017] 1. This invention constructs a parallel decoupled network comprising a background motion estimation branch and a foreground human body separation branch. It utilizes multi-layer deformable convolution to predict the background pixel-level displacement field and performs reverse distortion alignment on the shallow feature map of the current frame based on the displacement field. This strips away non-human dynamic elements in the background at the feature level, eliminating dynamic background interference such as mechanical operation and dust. A consistency verification network calculates the cosine similarity of the one-dimensional mask vectors corresponding to three consecutive frames to construct a spatiotemporal feature similarity matrix. The negative correlation in the similarity matrix is used to identify and eliminate pseudo-motion masks caused by sudden changes in local illumination, preserving temporally consistent motion masks and eliminating mask anomalies caused by illumination changes. This ensures that the features input to the behavior classification network are purely human behavior features, solving the problem in existing technologies where non-human dynamic interference leads to erroneous foreground regions in behavior recognition data.

[0018] 2. By concatenating the initial feature maps of two adjacent frames along the channel dimension and inputting them into the residual block group to extract shallow feature maps, feature associations between adjacent frames are established. In multi-layer deformable convolution, the sampling path of the displacement field is changed by rearranging the channels of the initial offset feature map and superimposing it onto the regular sampling grid to generate deformable sampling grid coordinates. In the encoder-decoder architecture, the spatial details in the mask extraction process are restored by concatenating the upsampling results of the decoding stage with downsampling feature maps with the same spatial resolution. By performing L2 norm normalization on the one-dimensional mask vector, the numerical scale of motion masks of different sizes in the similarity calculation process is unified.

Attached Figure Description

[0019] Figure 1 is a flowchart of the overall process of the deep learning-based intelligent recognition method for abnormal behavior at construction sites provided in an embodiment of the present invention;

[0020] Figure 2 is a flowchart of predicting the background pixel-level displacement field using the background motion estimation branch provided in an embodiment of the present invention;

[0021] Figure 3 is a flowchart of extracting the foreground human motion mask using the foreground human body separation branch provided in an embodiment of the present invention;

[0022] Figure 4 is a flowchart of removing pseudo-motion masks using the consistency verification network provided in an embodiment of the present invention;

[0023] Figure 5 is a flowchart of the foreground human motion mask flattening and normalization processing provided in an embodiment of the present invention;

[0024] Figure 6 is an architecture diagram of the deep learning-based intelligent recognition system for abnormal behavior at construction sites provided in an embodiment of the present invention.

Detailed Implementation

[0025] Referring to Figures 1 to 6, this embodiment provides a method and system for intelligent recognition of abnormal behavior at construction sites based on deep learning. It acquires continuous image frames from construction site monitoring videos and inputs these frames into a preset feature encoder to extract shallow feature maps corresponding to adjacent frames. Specifically, the construction site monitoring video is acquired by fixed monitoring cameras deployed at the construction site. The acquired video stream is decoded at a preset frame rate to generate continuous time-series image frames. Each frame carries a timestamp, and the timestamp difference between adjacent frames equals the reciprocal of the preset frame rate. The decoded continuous image frames are input into the preset feature encoder, which is built on a residual network structure and extracts spatial structural features within each image frame and temporal correlation features between adjacent frames. For any pair of adjacent frames in the continuous image frames, the earlier frame is taken as the reference frame and the later frame as the current frame. Feature extraction is performed on the reference frame and the current frame separately to generate the corresponding single-frame initial feature maps. The single-frame initial feature maps of the two frames are concatenated along the channel dimension to generate the concatenated feature map. The residual block group in the feature encoder then performs feature extraction on the concatenated feature map and outputs the shallow feature map corresponding to the pair of adjacent frames.

[0026] The operation process for channel splicing is defined by the following formula:

[0027] $$F_{cat} = \mathrm{Concat}\left(F^{init}_{ref},\ F^{init}_{cur}\right)$$

[0028] where $F_{cat}$ is the stitched dual-channel initial feature map, $F^{init}_{ref}$ is the single-frame initial feature map corresponding to the reference frame, $F^{init}_{cur}$ is the single-frame initial feature map corresponding to the current frame, and $\mathrm{Concat}(\cdot)$ denotes the splicing operation along the channel dimension.
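A short NumPy sketch of the channel-wise splicing of adjacent frames follows. It assumes channel-last `(H, W, C)` feature maps, matching the H×W×C notation used in Table 1; the function name `concat_adjacent` and the toy feature maps are hypothetical.

```python
import numpy as np

def concat_adjacent(feature_maps: list) -> list:
    """Pair each frame with its successor and splice the pair along the channel axis."""
    # The earlier frame is the reference frame, the later frame is the current frame.
    return [np.concatenate([ref, cur], axis=-1)
            for ref, cur in zip(feature_maps[:-1], feature_maps[1:])]

# Three hypothetical single-frame initial feature maps of size 4x4 with C = 3.
feats = [np.full((4, 4, 3), float(t)) for t in range(3)]
pairs = concat_adjacent(feats)
```

Three frames give two adjacent pairs, each with twice the channel count and unchanged spatial resolution.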

[0029] The shallow feature map is input into a background dynamic decoupling network and processed by the parallel decoupling network included in it. This parallel decoupling network includes a background motion estimation branch and a foreground human body separation branch. The background motion estimation branch predicts the background pixel-level displacement field based on deformable convolution. The displacement field is then used to reverse-distort and align the current frame's shallow feature map to generate an aligned feature map. The foreground human body separation branch extracts the foreground human body motion mask based on the aligned feature map. Specifically, the two branches of the parallel decoupling network employ a parallel computing architecture. Both branches take the shallow feature maps as input, and the aligned feature map output by the background motion estimation branch is also fed to the foreground human body separation branch as the basis for foreground region extraction. The background motion estimation branch captures pixel-level motion patterns in the background region through deformable convolution operations, generating displacement field data for each pixel position. Based on the displacement field, it performs a reverse spatial coordinate mapping of the current frame's shallow feature map, aligning the background region in the current frame with the background region in the reference frame in feature space and thus eliminating feature differences caused by dynamic background elements. The reverse-distortion alignment operation is defined by the following formula:

[0030] $$F_{align}(x, y) = F^{s}_{cur}\big(x + \Delta x(x, y),\ y + \Delta y(x, y)\big)$$

[0031] where $F_{align}$ is the aligned feature map, $(x, y)$ are the pixel coordinates in the feature map, $\Delta x(x, y)$ is the horizontal displacement component of the background pixel-level displacement field at coordinates $(x, y)$, $\Delta y(x, y)$ is the vertical displacement component of the background pixel-level displacement field at coordinates $(x, y)$, and $F^{s}_{cur}$ is the shallow feature map corresponding to the current frame.
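Reverse-distortion alignment with bilinear interpolation can be sketched in NumPy as below. The sketch assumes a backward-warping convention, in which each output pixel samples the current-frame map at its own coordinates plus the displacement, and it clamps samples at the border. Both choices, the channel-last layout, and the name `warp_bilinear` are assumptions, not details confirmed by the text.

```python
import numpy as np

def warp_bilinear(feat: np.ndarray, disp: np.ndarray) -> np.ndarray:
    """feat: (H, W, C) map; disp: (H, W, 2) holding (dx, dy). Returns the aligned map."""
    h, w, _ = feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Sampling positions, clamped to the map (border replication is an assumption).
    sx = np.clip(xs + disp[..., 0], 0, w - 1)
    sy = np.clip(ys + disp[..., 1], 0, h - 1)
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (sx - x0)[..., None]    # horizontal interpolation weight
    wy = (sy - y0)[..., None]    # vertical interpolation weight
    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bottom = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bottom * wy

# Hypothetical map whose value equals its column index, shifted by half a pixel.
feat = np.tile(np.arange(4.0)[None, :, None], (3, 1, 1))
disp = np.zeros((3, 4, 2))
disp[..., 0] = 0.5
aligned = warp_bilinear(feat, disp)
```

Bilinear sampling keeps the operation differentiable with respect to the displacement values, which is what allows a displacement field of this kind to be learned end to end.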

[0032] After receiving the aligned feature map, the foreground human body separation branch performs multi-scale feature extraction on the aligned feature map to generate a pixel-by-pixel foreground probability distribution. Based on the foreground probability distribution, binarization segmentation is performed to generate a foreground human body motion mask with the same resolution as the aligned feature map. In the foreground human body motion mask, the pixel value corresponding to the foreground region is 1, and the pixel value corresponding to the background region is 0. In the aligned feature map, after reverse distortion alignment, the feature difference between the background region and the background region of the reference frame is significantly compressed. However, the features of the foreground human body region retain significant feature differences due to the inconsistency between its motion pattern and the background, providing a distinguishable feature basis for foreground and background segmentation.

[0033] The foreground human motion masks corresponding to multiple consecutive frames are input into a consistency verification network. The spatiotemporal feature similarity matrix of the foreground human motion masks is calculated. Based on the spatiotemporal feature similarity matrix, pseudo-motion masks caused by sudden changes in illumination are removed, and temporally consistent motion masks are retained. Specifically, the multiple consecutive frames are at least three consecutive frames in a time series, corresponding to at least three consecutive foreground human motion masks. The consistency verification network performs temporal dimension feature association calculations on the consecutive foreground human motion masks to generate a spatiotemporal feature similarity matrix. This matrix characterizes the degree of feature similarity between consecutive masks. Based on preset judgment rules, the numerical relationships in the similarity matrix are judged. When there are values that meet a preset negative correlation, the foreground human motion mask of the corresponding frame is marked as a pseudo-motion mask and removed. The remaining masks are the temporally consistent motion masks. The cosine similarity calculation between consecutive masks is defined by the following formula:

[0034] $$\mathrm{sim}(v_a, v_b) = \frac{v_a \cdot v_b}{\lVert v_a \rVert_2 \, \lVert v_b \rVert_2}$$

[0035] where $\mathrm{sim}(v_a, v_b)$ is the cosine similarity between two one-dimensional mask vectors, $v_a$ and $v_b$ are the one-dimensional mask vectors corresponding to the foreground human motion masks in the two frames, respectively, $\cdot$ denotes the dot product of vectors, and $\lVert \cdot \rVert_2$ denotes the L2 norm of a vector.

[0036] The temporally consistent motion mask is input into the behavior classification network, and the result of construction site abnormal behavior recognition is output. Specifically, the temporally consistent motion mask is multiplied element-wise with the original image frame corresponding to the timestamp to generate a temporal image sequence that retains only the foreground region. The behavior classification network is constructed based on a three-dimensional convolutional architecture, and performs spatiotemporal feature extraction on the temporal image sequence to generate feature vectors corresponding to the behavior. Classification operations are performed based on the feature vectors, and the classification result corresponding to the preset abnormal behavior category is output, which is the construction site abnormal behavior recognition result. The process of generating the classification probability is defined by the following formula:

[0037] $$p_i = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}$$

[0038] where $p_i$ is the classification probability of the i-th preset abnormal behavior category, $z_i$ is the raw score for the i-th category output by the fully connected layer, and $N$ is the total number of preset abnormal behavior categories, with $i = 1, 2, \dots, N$.
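The conversion of raw scores into classification probabilities, followed by selection of the most probable category, can be sketched as follows. The category names and scores are hypothetical placeholders, since the text does not enumerate the preset abnormal behavior categories. Subtracting the maximum score before exponentiation is a standard numerical-stability step that leaves the probabilities unchanged.

```python
import numpy as np

def classify(scores: np.ndarray, categories: list) -> tuple:
    """Softmax over raw scores, then pick the category with the highest probability."""
    z = scores - scores.max()             # shift for stability; softmax is shift-invariant
    probs = np.exp(z) / np.exp(z).sum()   # classification probability vector
    return categories[int(probs.argmax())], probs

# Hypothetical category names and fully connected layer outputs.
categories = ["category_a", "category_b", "category_c"]
label, probs = classify(np.array([1.0, 3.0, 2.0]), categories)
```

Because softmax preserves the ordering of the scores, the recognized category is the one with the largest raw score; the probabilities matter when a confidence value is also needed.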

[0039] Table 1. Dimensional mapping relationship for shallow feature extraction of continuous image frames

[0040]

| Frame sequence group number | Reference frame single-frame initial feature map size | Current frame single-frame initial feature map size | Stitched dual-channel feature map size | Output shallow feature map size | Number of feature channels |
| --- | --- | --- | --- | --- | --- |
| 1 | H×W×C | H×W×C | H×W×2C | H×W×C | C |
| 2 | H/2×W/2×C | H/2×W/2×C | H/2×W/2×2C | H/2×W/2×C | C |
| 3 | H/4×W/4×C | H/4×W/4×C | H/4×W/4×2C | H/4×W/4×C | C |
| 4 | H/8×W/8×C | H/8×W/8×C | H/8×W/8×2C | H/8×W/8×C | C |

[0041] Table 1 characterizes the dimensional correspondence of the output features at each stage when the feature encoder processes different frame sequences. This ensures that the features of adjacent frames maintain spatial resolution consistency during stitching and extraction, providing dimension-matched input features for subsequent background dynamic decoupling network processing. In the table, H represents the height of the original image frame, W represents the width of the original image frame, and C represents the basic number of channels in the initial feature map of a single frame. The spatial resolution of the features at each stage is downsampled by multiples of 2 to ensure the hierarchical transfer of spatial structure information during feature extraction.
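The size pattern in Table 1 (resolution halved from one group to the next, channel count doubled by stitching and restored to C at the output) can be reproduced with a few lines of Python. The function name `stage_shapes` and the example sizes are illustrative, and integer division is assumed for sizes that are not exact multiples of the downsampling factor.

```python
def stage_shapes(h: int, w: int, c: int, groups: int = 4) -> list:
    """Return the feature map sizes of each group as (height, width, channels) tuples."""
    rows = []
    for g in range(groups):
        scale = 2 ** g                        # groups 1..4 use factors 1, 2, 4, 8
        hh, ww = h // scale, w // scale
        rows.append({
            "group": g + 1,
            "single_frame": (hh, ww, c),      # reference and current frame maps
            "stitched": (hh, ww, 2 * c),      # after channel-wise concatenation
            "shallow": (hh, ww, c),           # encoder output
        })
    return rows

shapes = stage_shapes(64, 48, 16)   # hypothetical H = 64, W = 48, C = 16
```

Checking these sizes up front is a cheap way to confirm that the two maps being stitched in any group share the same spatial resolution.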

[0042] In this embodiment, the deep learning-based intelligent recognition system for abnormal behavior at construction sites includes a feature extraction processor, a background decoupling processor, a consistency check processor, and a behavior classification processor. The feature extraction processor receives continuous image frames generated by decoding construction site monitoring video at its input end, and its output end is connected to the input end of the background decoupling processor. The feature extraction processor has a built-in preset feature encoder for performing feature extraction operations on continuous image frames and outputting shallow feature maps corresponding to two adjacent image frames. The background decoupling processor has a built-in background dynamic decoupling network, which includes parallel background motion estimation branches and foreground human body separation branches. The output end of the background decoupling processor is connected to the input end of the consistency check processor, for receiving shallow feature maps and outputting foreground human body motion masks corresponding to multiple consecutive image frames. The consistency check processor has a built-in consistency check network, and its output end is connected to the input end of the behavior classification processor. It receives foreground human body motion masks from multiple consecutive frames, performs spatiotemporal feature similarity calculation and pseudo-mask removal operations, and outputs a temporally consistent motion mask. The behavior classification processor has a built-in behavior classification network, for receiving the temporally consistent motion mask, performing behavior classification operations, and outputting the construction site abnormal behavior recognition results.

[0043] In this embodiment, shallow feature maps of two adjacent frames are extracted by a feature encoder, establishing temporal feature associations between adjacent frames. The background motion estimation and foreground human body separation operations are performed separately through the parallel decoupling network in the background dynamic decoupling network, achieving decoupling of background dynamic elements and foreground human body features at the feature level. The temporal consistency verification network performs temporal consistency verification on the continuous mask, eliminating pseudo-motion masks caused by sudden changes in illumination. Finally, the abnormal behavior recognition result is output through the behavior classification network, thus fully realizing the intelligent recognition process of abnormal behavior at the construction site.

[0044] Referring to Figure 2, in a preferred embodiment, the background motion estimation branch predicts a background pixel-level displacement field based on deformable convolution. The shallow feature maps of the reference frame and the current frame, taken from the two adjacent frames, are concatenated along the channel dimension to generate a concatenated feature map. Specifically, the shallow feature maps of the reference frame and the current frame have the same spatial resolution and number of channels, so the concatenated feature map has twice the number of channels of a single shallow feature map while keeping the same spatial resolution. The concatenated feature map is then input into a multi-layer deformable convolutional layer in the background motion estimation branch. This multi-layer deformable convolutional layer learns the spatial offsets in the concatenated feature map to generate a background pixel-level displacement field matching the size of the current frame's shallow feature map. The background pixel-level displacement field includes both horizontal and vertical displacement components.

[0045] The multi-layer deformable convolutional layer consists of at least three cascaded deformable convolutional units. Each deformable convolutional unit includes a regular convolutional branch and an offset prediction branch. The regular convolutional branch is used to extract spatial features from the concatenated feature map, and the offset prediction branch is used to predict the spatial offset of each convolution kernel sampling point. The offset generation process is defined by the following formula:

[0046] $$\Delta p_n = \mathrm{Conv}_{\mathrm{off}}(F_{\mathrm{cat}})$$ where $\Delta p_n$ represents the two-dimensional coordinate offset corresponding to the n-th sampling point, $\mathrm{Conv}_{\mathrm{off}}$ is the regular convolution operation of the offset prediction branch, and $F_{\mathrm{cat}}$ is the input concatenated feature map.

[0047] In the multi-layer deformable convolutional layer, a regular convolution kernel performs a convolution operation on the concatenated feature map and outputs an initial offset feature map. The initial offset feature map then undergoes channel rearrangement: its channel dimension is divided into sub-channel groups, one group for each sampling point of the regular convolution kernel, and each sub-channel group holds the two-dimensional coordinate offset of its sampling point. The channel rearrangement operation is defined by the following formula:

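[0048] $$O_{\mathrm{re}} = \mathrm{Reshape}(O_{\mathrm{init}}), \quad O_{\mathrm{init}} \in \mathbb{R}^{2N \times H \times W}, \quad O_{\mathrm{re}} \in \mathbb{R}^{N \times 2 \times H \times W}$$

[0049] where $O_{\mathrm{re}}$ is the rearranged offset feature map, $O_{\mathrm{init}}$ is the initial offset feature map, $N$ is the number of sampling points of the regular convolution kernel, 2 corresponds to the offsets in the horizontal and vertical directions, and $H$ and $W$ are the height and width of the feature map, respectively.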

[0050] The two-dimensional coordinate offsets are superimposed onto the regular sampling grid coordinates of the regular convolution kernel to generate deformed sampling grid coordinates. The regular sampling grid coordinates are the fixed, uniform grid coordinates used in regular convolution operations, with each sampling point having a preset fixed coordinate value. Because the predicted offsets are added point by point to the corresponding regular grid coordinates, the deformed sampling grid coordinates change with the feature content. Feature sampling is then performed on the concatenated feature map using these deformed sampling grid coordinates to generate the background pixel-level displacement field. The spatial resolution of the background pixel-level displacement field is identical to that of the shallow feature map of the current frame. Each pixel position holds two values, a horizontal displacement component and a vertical displacement component, which characterize the motion displacement of the background pixel at that position from the reference frame to the current frame.
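The channel rearrangement and grid deformation described above can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the function names, the channel ordering (offsets for one sampling point stored in two adjacent channels) and the (vertical, horizontal) component order are assumptions made only for this illustration.

```python
import numpy as np

def rearrange_offsets(initial_offsets, num_points):
    """Reshape a (2N, H, W) initial offset map into (N, 2, H, W).

    Assumption: channels 2n and 2n+1 hold the two offset components
    of sampling point n.
    """
    two_n, h, w = initial_offsets.shape
    if two_n != 2 * num_points:
        raise ValueError("channel count must equal 2 * num_points")
    return initial_offsets.reshape(num_points, 2, h, w)

def regular_grid(kernel_size=3):
    """Fixed sampling positions of a k x k kernel, centred on (0, 0)."""
    r = kernel_size // 2
    points = [(dy, dx) for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
    return np.array(points, dtype=np.float32)  # shape (N, 2)

def deformed_grid(offsets, kernel_size=3):
    """Add the predicted offsets to the regular grid at every pixel."""
    grid = regular_grid(kernel_size)            # (N, 2)
    return grid[:, :, None, None] + offsets     # (N, 2, H, W)

# Example: a 3x3 kernel has N = 9 sampling points, so 18 offset channels.
initial = np.zeros((18, 4, 5), dtype=np.float32)
initial[0] = 0.5                                # shift point 0, component 0
offsets = rearrange_offsets(initial, num_points=9)
grid = deformed_grid(offsets)
print(offsets.shape, grid[0, :, 0, 0])
```

With all offsets equal to zero the deformed grid reduces to the regular 3×3 grid, which is the behavior of an ordinary convolution.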

[0051] A bilinear interpolation algorithm combined with a background pixel-level displacement field is used to sample and offset the shallow feature map of the current frame, generating an aligned feature map. The bilinear interpolation sampling process is defined by the following formula:

[0052] $$F_{\mathrm{align}}(x, y) = \sum_{i}\sum_{j} w_{ij}\, F_{\mathrm{cur}}(x_i, y_j)$$

[0053] where $F_{\mathrm{align}}(x, y)$ is the sampled feature value at coordinates $(x, y)$, $w_{ij}$ are the weighting coefficients of the bilinear interpolation, $x_i$ and $y_j$ are the integer coordinates of the points adjacent to the target sampling point in the shallow feature map of the current frame, and $F_{\mathrm{cur}}$ is the shallow feature map of the current frame.

[0054] The reverse warp alignment process maps each background pixel in the current frame's shallow feature map to its corresponding pixel position in the reference frame's shallow feature map, based on the background pixel-level displacement field. This spatially aligns the background region features of the current and reference frames and eliminates feature differences caused by background dynamic elements. For background pixels, the feature values after reverse warp alignment differ only minimally from the corresponding background pixel features in the reference frame. For foreground human body pixels, whose motion patterns are inconsistent with those of the background dynamic elements, the feature values after reverse warp alignment differ significantly from the corresponding feature values in the reference frame, which provides discriminative power for the subsequent mask extraction in the foreground human body separation branch.
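As an illustration of the sampling step, the following NumPy sketch warps a feature map with a per-pixel displacement field using bilinear interpolation. The function name, the (C, H, W) layout, the sign convention (each output pixel reads from its own position plus the displacement) and the clamping at the borders are assumptions of this sketch, not details stated in the embodiment.

```python
import numpy as np

def warp_bilinear(feature, displacement):
    """Sample `feature` (C, H, W) at positions shifted by `displacement`.

    displacement[0] is the horizontal component, displacement[1] the
    vertical component, both of shape (H, W). Sampling positions are
    clamped to the feature map borders (an assumption of this sketch).
    """
    _, h, w = feature.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    sx = np.clip(xs + displacement[0], 0, w - 1)
    sy = np.clip(ys + displacement[1], 0, h - 1)

    # Integer neighbours of each target sampling point.
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)

    # Bilinear weights from the fractional parts.
    ax = sx - x0
    ay = sy - y0
    top = (1 - ax) * feature[:, y0, x0] + ax * feature[:, y0, x1]
    bottom = (1 - ax) * feature[:, y1, x0] + ax * feature[:, y1, x1]
    return (1 - ay) * top + ay * bottom

feature = np.arange(12, dtype=float).reshape(1, 3, 4)
shift = np.zeros((2, 3, 4))
shift[0] = 0.5                       # half a pixel to the right
aligned = warp_bilinear(feature, shift)
print(aligned[0, 0, 0])              # midway between feature[0,0,0] and feature[0,0,1]
```

A zero displacement field returns the input unchanged, which corresponds to a perfectly static background.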

[0055] Table 2. Correspondence between operational parameters and output features of multilayer deformable convolutional layers

[0056]
| Convolutional unit number | Kernel size | Convolution stride | Number of input channels | Number of output channels | Offset output dimension | Output feature map size |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 3×3 | 1 | 2C | C | 2×9 | H×W |
| 2 | 3×3 | 1 | C | C | 2×9 | H×W |
| 3 | 3×3 | 1 | C | 2 | 2×1 | H×W |

[0057] Table 2 characterizes the correspondence between the computational parameters and output features of each convolutional unit in the multi-layer deformable convolutional layer, ensuring that the accuracy of the offset prediction matches the spatial resolution requirements of the displacement field output. In the table, C represents the basic number of channels in the shallow feature map, and H and W represent the height and width of the input concatenated feature map, respectively. Convolutional unit 3 has 2 output channels, corresponding to the horizontal and vertical displacement components of the background pixel-level displacement field. The size of the output feature map is identical to that of the input shallow feature map of the current frame, ensuring pixel-level matching for the subsequent reverse warp alignment operation.

[0058] In this embodiment, concatenating the shallow feature maps of the reference frame and the current frame provides temporally correlated feature inputs for background motion estimation. The spatial offsets in the concatenated feature map are learned through the multi-layer deformable convolutional layer to generate a background pixel-level displacement field, and the channel rearrangement operation assigns each offset to its convolution sampling point. Reverse warp alignment of the current frame's feature map is achieved through the bilinear interpolation algorithm. This completes the feature alignment and stripping of background dynamic elements and provides an aligned feature map, free of background interference, for extracting the foreground human motion mask.

[0059] Referring to Figure 3, in another preferred embodiment, the foreground human body separation branch extracts the foreground human body motion mask based on the aligned feature map. The aligned feature map is input into the encoder-decoder architecture in the foreground human body separation branch. The encoder-decoder architecture performs downsampling and upsampling processing on the aligned feature map and outputs a multi-scale feature map. Specifically, the encoder-decoder architecture adopts a U-shaped network structure with symmetrically arranged encoding and decoding stages. The encoding stage extracts multi-scale deep semantic features from the aligned feature map, and the decoding stage restores the deep semantic features to the spatial resolution of the original input. During decoding, the multi-scale features from the encoding stage are fused in to preserve the spatial detail information of the foreground region.

[0060] The encoding stage of the encoder-decoder architecture downsamples the aligned feature map multiple times using convolutional layers with a preset stride, extracting multiple downsampled feature maps with different spatial resolutions. The preset stride is 2: after each convolutional layer with a stride of 2, the spatial resolution of the feature map is halved while the number of channels is doubled. The encoding stage contains at least four downsampling convolutional blocks, each consisting of two 3×3 convolutional layers, a batch normalization layer, and a ReLU activation layer. The last convolutional layer of each block has a stride of 2 to perform the downsampling operation. Each downsampling convolutional block outputs a downsampled feature map at its spatial resolution, and all downsampled feature maps are stored in descending order of spatial resolution.

[0061] In the decoding stage of the encoder-decoder architecture, the last downsampled feature map is upsampled using a transposed convolutional layer. The upsampled result is then concatenated with the second-to-last downsampled feature map, which has the same spatial resolution. This upsampling and concatenation process is repeated until the original spatial resolution of the aligned feature map is restored, generating a multi-scale feature map. Each decoding step is defined by the following formula:

[0062] $$D_k = \mathrm{Concat}\big(\mathrm{Up}(D_{k+1}),\ E_k\big)$$

[0063] where $D_k$ is the feature map output by the k-th layer of the decoding stage, $\mathrm{Up}(\cdot)$ is the upsampling operation implemented by transposed convolution, $D_{k+1}$ is the feature map output by the (k+1)-th layer of the decoding stage, $E_k$ is the downsampled feature map output by the k-th layer of the encoding stage, which has the same spatial resolution as $\mathrm{Up}(D_{k+1})$, and $\mathrm{Concat}(\cdot)$ indicates concatenation along the channel dimension.

[0064] The decoding stage contains the same number of upsampling convolutional blocks as the encoding stage contains downsampling convolutional blocks. Each upsampling convolutional block consists of a transposed convolutional layer with a stride of 2, two 3×3 convolutional layers, a batch normalization layer, and a ReLU activation layer. The transposed convolutional layer upsamples the input feature map, doubling the spatial resolution and halving the number of channels. The upsampled feature map is then concatenated with the corresponding downsampled feature map from the encoding stage, and the concatenated feature map is fed into the subsequent convolutional layers for feature fusion, generating the decoded feature map of that layer. After processing by the last upsampling convolutional block, the output feature map has the same spatial resolution as the input aligned feature map, and its number of channels is the preset single-channel value. This output is the multi-scale feature map, which fuses semantic features and spatial detail features at different resolutions.

[0065] A pixel-wise sigmoid activation process is applied to the multi-scale feature map to generate a foreground probability map with the same resolution as the aligned feature map. The sigmoid activation process is defined by the following formula:

[0066] $$P(x, y) = \frac{1}{1 + e^{-M(x, y)}}$$

[0067] where $P(x, y)$ is the probability value of the foreground probability map at coordinates $(x, y)$, and $M(x, y)$ is the feature value of the multi-scale feature map at coordinates $(x, y)$. The value range of $P(x, y)$ is (0, 1). The closer the value is to 1, the higher the probability that the pixel belongs to the foreground human body area. The closer the value is to 0, the higher the probability that the pixel belongs to the background area.

[0068] The mean probability of all pixels in the foreground probability map is calculated. This mean probability is used as the dynamic segmentation benchmark. Pixel regions in the foreground probability map with probabilities greater than the dynamic segmentation benchmark are marked as foreground regions, and pixel regions with probabilities less than the dynamic segmentation benchmark are marked as background regions, generating a binarized foreground human motion mask. The calculation process of the dynamic segmentation benchmark is defined by the following formula:

[0069] $$T = \frac{1}{H \times W} \sum_{x=1}^{W} \sum_{y=1}^{H} P(x, y)$$

[0070] where $T$ is the dynamic segmentation benchmark, $H$ and $W$ represent the height and width of the foreground probability map, respectively, and $\sum_{x=1}^{W} \sum_{y=1}^{H} P(x, y)$ is the sum of the probability values of all pixels in the foreground probability map.

[0071] The dynamic segmentation benchmark changes in real time with the probability distribution of the foreground probability map. Compared with fixed threshold segmentation, it can adapt to changes in the proportion of the foreground region in different scenarios, improving the accuracy of mask segmentation. For each pixel in the foreground probability map, when the probability value of the pixel is greater than the dynamic segmentation benchmark, the pixel is assigned a value of 1 and marked as a foreground region; when the probability value of the pixel is less than or equal to the dynamic segmentation benchmark, the pixel is assigned a value of 0 and marked as a background region. The final generated binarized foreground human motion mask has the same spatial resolution as the aligned feature map. The foreground region in the mask corresponds to the human motion region in the original image frame, and the background region corresponds to the aligned static and dynamic background regions in the original image frame.
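The sigmoid activation and mean-based binarization described above can be sketched in a few lines of NumPy. The function name and the small input array are illustrative assumptions; only the operations themselves (pixel-wise sigmoid, mean of all probabilities as the benchmark, strict greater-than comparison) are taken from the description.

```python
import numpy as np

def foreground_mask(multi_scale_map):
    """Return (probability map, dynamic benchmark, binary mask)."""
    prob = 1.0 / (1.0 + np.exp(-multi_scale_map))   # pixel-wise sigmoid
    benchmark = prob.mean()                          # dynamic segmentation benchmark
    mask = (prob > benchmark).astype(np.uint8)       # 1 = foreground, 0 = background
    return prob, benchmark, mask

# Illustrative single-channel map: one strongly positive response.
scores = np.array([[-2.0, -2.0],
                   [-2.0,  3.0]])
prob, benchmark, mask = foreground_mask(scores)
print(mask)
```

One property worth noting: if every pixel has the same probability, no pixel is strictly greater than the mean, so the mask is entirely background.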

[0072] Table 3 Feature processing parameters for each layer of the encoding-decoding architecture

[0073]
| Stage | Layer number | Operation type | Input feature size | Output feature size | Number of output channels |
| --- | --- | --- | --- | --- | --- |
| Encoding stage | 1 | Convolution + downsampling | H×W×C | H/2×W/2×2C | 2C |
| Encoding stage | 2 | Convolution + downsampling | H/2×W/2×2C | H/4×W/4×4C | 4C |
| Encoding stage | 3 | Convolution + downsampling | H/4×W/4×4C | H/8×W/8×8C | 8C |
| Encoding stage | 4 | Convolution + downsampling | H/8×W/8×8C | H/16×W/16×16C | 16C |
| Decoding stage | 1 | Transposed convolution + upsampling | H/16×W/16×16C | H/8×W/8×8C | 8C |
| Decoding stage | 2 | Feature concatenation + convolution | H/8×W/8×16C | H/4×W/4×4C | 4C |
| Decoding stage | 3 | Feature concatenation + convolution | H/4×W/4×8C | H/2×W/2×2C | 2C |
| Decoding stage | 4 | Feature concatenation + convolution | H/2×W/2×4C | H×W×1 | 1 |

[0074] Table 3 characterizes the operation types, input / output feature sizes, and channel numbers of each layer in the encoding and decoding stages of the encoder-decoder architecture, ensuring the fusion of multi-scale features and accurate restoration of spatial resolution. In the table, H and W represent the height and width of the input aligned feature map, respectively, and C represents the base number of channels in the aligned feature map. Each layer's feature concatenation operation in the decoding stage fuses downsampled feature maps of the same resolution as those in the encoding stage, ensuring the restoration of spatial detail information during decoding. The fourth layer in the decoding stage has one output channel, corresponding to a single-channel multi-scale feature map, providing input for the subsequent generation of the foreground probability map.
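The size bookkeeping in Table 3 can be checked with a short pure-Python sketch. The function names are hypothetical, and the sketch assumes that H and W are divisible by 16 and that each skip connection contributes as many channels as the upsampled map, which is what the table's input sizes imply.

```python
def encoder_shapes(h, w, c, depth=4):
    """Feature sizes (H, W, C) before and after each downsampling block."""
    shapes = [(h, w, c)]
    for _ in range(depth):
        h, w, c = h // 2, w // 2, c * 2    # stride 2 halves H, W; channels double
        shapes.append((h, w, c))
    return shapes

def decoder_rows(h, w, c, depth=4):
    """(layer, input size, output size) for each decoding layer of Table 3."""
    rows = []
    current = (h // 2 ** depth, w // 2 ** depth, c * 2 ** depth)
    for layer in range(1, depth + 1):
        scale = 2 ** (depth - layer)
        out_channels = 1 if layer == depth else c * scale
        output = (h // scale, w // scale, out_channels)
        rows.append((layer, current, output))
        # The next layer receives this output concatenated with the
        # same-resolution encoder map, which doubles the channel count.
        current = (output[0], output[1], output[2] * 2)
    return rows

for row in decoder_rows(256, 256, 8):
    print(row)
```

For H = W = 256 and C = 8, decoding layer 2 receives a 32×32×128 input, which is the H/8×W/8×16C entry of the table.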

[0075] In this embodiment, the aligned feature map is downsampled and upsampled using an encoder-decoder architecture to extract a multi-scale feature map that integrates multi-scale semantic information and spatial detail information. A pixel-wise foreground probability map is generated through sigmoid activation. A dynamic segmentation benchmark is generated by calculating the probability mean of the foreground probability map. Binarization segmentation is performed based on the dynamic segmentation benchmark to generate a foreground human motion mask. This achieves pixel-level accurate separation between the foreground human region and the background region, providing accurate mask input for subsequent temporal consistency verification.

[0076] Referring to Figures 4 and 5, in another preferred embodiment, the foreground human motion masks corresponding to multiple consecutive image frames are input into a consistency verification network. The spatiotemporal feature similarity matrix of the foreground human motion masks is calculated. Based on the spatiotemporal feature similarity matrix, pseudo-motion masks caused by sudden changes in illumination are removed, and temporally consistent motion masks are retained. Specifically, the multiple consecutive image frames are three consecutive frames in a time series, corresponding to three consecutive foreground human motion masks. The three masks are arranged in chronological order according to their timestamps as the first frame mask, the second frame mask, and the third frame mask, where the second frame mask is the intermediate frame mask.

[0077] The three foreground human motion masks are each flattened into a one-dimensional mask vector. Pixels in the foreground region of each foreground human motion mask are assigned a preset positive value, and pixels in the background region are assigned a preset negative value, generating a binary mask matrix. The preset positive value is 1, and the preset negative value is -1. The binary mask matrix has the same size as the original foreground human motion mask, with foreground pixels having a value of 1 and background pixels having a value of -1. Compared to a 0-1 binary matrix, this improves the discriminative power of the subsequent similarity calculation. The binary mask matrix is then flattened in row-major order to generate an initial one-dimensional vector. The length of the initial one-dimensional vector is the product of the height and width of the binary mask matrix, which equals the total number of pixels in the original mask. The L2 norm of each initial one-dimensional vector is calculated, and each element of the initial one-dimensional vector is divided by that L2 norm to generate a normalized one-dimensional mask vector. The normalization process is defined by the following formula:

[0078] $$\hat{v} = \frac{v}{\lVert v \rVert_2}$$

[0079] where $\hat{v}$ is the normalized one-dimensional mask vector, $v$ is the initial one-dimensional vector, and $\lVert v \rVert_2$ is the L2 norm of the initial one-dimensional vector. The norm of the normalized one-dimensional mask vector is 1, which eliminates the influence of numerical scale differences between mask vectors on the similarity calculation.

[0080] The first cosine similarity is calculated between the one-dimensional mask vector corresponding to the first frame and the one-dimensional mask vector corresponding to the second frame, and the second cosine similarity is calculated between the one-dimensional mask vector corresponding to the second frame and the one-dimensional mask vector corresponding to the third frame. The first and second cosine similarities are combined to construct a spatiotemporal feature similarity matrix. The construction process of the spatiotemporal feature similarity matrix is defined by the following formula:

[0081] $$S = \begin{bmatrix} s_{12} & 0 \\ 0 & s_{23} \end{bmatrix}$$

[0082] where $S$ is the spatiotemporal feature similarity matrix, $s_{12}$ is the first cosine similarity, that is, the cosine similarity between the mask vectors of the first and second frames, and $s_{23}$ is the second cosine similarity, that is, the cosine similarity between the mask vectors of the second and third frames. The off-diagonal elements of the matrix are 0. The matrix characterizes the temporal similarity association between consecutive frames.
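The mask vectorization and similarity computation can be sketched with NumPy as follows. The function names are illustrative assumptions; the steps (foreground to 1, background to -1, row-major flattening, L2 normalization, cosine similarities placed on the diagonal of a 2×2 matrix) follow the description.

```python
import numpy as np

def mask_vector(mask):
    """Map a 0/1 mask to +1/-1, flatten row-major, and L2-normalize."""
    signed = np.where(np.asarray(mask) > 0, 1.0, -1.0)
    flat = signed.reshape(-1)                 # row-major flattening
    return flat / np.linalg.norm(flat)

def similarity_matrix(mask1, mask2, mask3):
    """Diagonal matrix holding the two adjacent-frame cosine similarities."""
    v1, v2, v3 = (mask_vector(m) for m in (mask1, mask2, mask3))
    s12 = float(v1 @ v2)   # unit vectors, so the dot product is the cosine
    s23 = float(v2 @ v3)
    return np.array([[s12, 0.0],
                     [0.0, s23]])

first = np.array([[1, 0], [0, 0]])
second = first.copy()
third = 1 - first                              # every pixel flipped
print(similarity_matrix(first, second, third))
```

Because the vectors use ±1 rather than 0/1, two masks that disagree on every pixel yield a similarity of -1 instead of 0, which is the extra discriminative range the description refers to.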

[0083] When the first cosine similarity and the second cosine similarity in the spatiotemporal feature similarity matrix are negatively correlated, the foreground human motion mask corresponding to the intermediate frame is determined to be a pseudo-motion mask caused by a sudden change in illumination and is removed. Specifically, negative correlation is determined when the absolute value of the difference between the first and second cosine similarities is greater than a preset similarity threshold, one of the two similarity values is greater than a preset positive correlation threshold, and the other similarity value is less than a preset negative correlation threshold. In construction site monitoring scenarios, a sudden change in illumination is usually an instantaneous, single-frame phenomenon: the mask distribution of the intermediate frame differs significantly from the mask distributions of the preceding and following frames, while the mask distributions of the preceding and following frames remain highly similar to each other. The similarity between the intermediate frame and its neighboring frames is therefore low, or even extremely low, which exhibits a negative temporal correlation characteristic. By evaluating the numerical relationship in the spatiotemporal feature similarity matrix, the pseudo-motion mask caused by a sudden change in illumination can be identified and removed from the continuous mask sequence. The remaining mask sequence is the temporally consistent motion mask. All masks in the temporally consistent motion mask have continuous temporal feature associations and correspond to real human motion areas.
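The determination rule can be written as a small predicate. The three threshold values below are placeholders chosen only for illustration, since the description calls them preset values without giving numbers, and the function names are hypothetical.

```python
def is_pseudo_motion(s12, s23,
                     diff_threshold=0.5,       # hypothetical preset similarity threshold
                     positive_threshold=0.6,   # hypothetical positive correlation threshold
                     negative_threshold=0.2):  # hypothetical negative correlation threshold
    """Apply the three-part negative-correlation rule to two similarities."""
    large_gap = abs(s12 - s23) > diff_threshold
    high, low = max(s12, s23), min(s12, s23)
    return large_gap and high > positive_threshold and low < negative_threshold

def temporally_consistent(masks, s12, s23, **thresholds):
    """Drop the intermediate mask of a three-mask window if it is flagged."""
    first, middle, third = masks
    if is_pseudo_motion(s12, s23, **thresholds):
        return [first, third]
    return [first, middle, third]

print(temporally_consistent(["m1", "m2", "m3"], s12=0.9, s23=-0.1))
print(temporally_consistent(["m1", "m2", "m3"], s12=0.9, s23=0.85))
```

In practice the thresholds would be tuned on site footage; the sketch only shows how the three conditions combine.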

[0084] The temporally consistent motion mask is input into the behavior classification network, where it is multiplied element-wise with the original images in the consecutive image frames to generate a masked temporal image sequence. Specifically, each temporally consistent motion mask corresponds to one original image frame. The mask is multiplied element-wise with each of the three color channels of the corresponding original image. Pixel values of the original image inside the foreground region of the mask remain unchanged, while pixel values inside the background region of the mask are set to 0, generating a masked image that retains only the foreground human body region. All masked images are arranged in timestamp order to form the masked temporal image sequence.
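A minimal NumPy sketch of the masking step is shown below. The array layout (T, H, W, 3) for the frames and (T, H, W) for the masks, and the function name, are assumptions of this illustration.

```python
import numpy as np

def apply_masks(frames, masks):
    """Multiply each 0/1 mask with all three color channels of its frame."""
    frames = np.asarray(frames)
    masks = np.asarray(masks)
    # The trailing axis added to the mask broadcasts over the color channels.
    return frames * masks[..., None].astype(frames.dtype)

frames = np.full((2, 2, 2, 3), 7, dtype=np.uint8)   # two tiny RGB frames
masks = np.zeros((2, 2, 2), dtype=np.uint8)
masks[0, 0, 0] = 1                                  # one foreground pixel
masked = apply_masks(frames, masks)
print(masked[0, 0, 0], masked[0, 1, 1])
```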

[0085] The masked temporal image sequence is input into a 3D spatiotemporal convolutional layer in the behavior classification network. The 3D spatiotemporal convolutional layer performs convolution processing on the masked temporal image sequence along both the temporal and spatial dimensions to extract spatiotemporal behavior feature maps. The convolution kernel size of the 3D spatiotemporal convolutional layer includes both temporal and spatial dimensions, enabling it to simultaneously capture temporal motion features and spatial pose features from the temporal image sequence. The operation of 3D spatiotemporal convolution is defined by the following formula:

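[0086] $$Y(t, x, y) = \sum_{i=0}^{K_t - 1} \sum_{j=0}^{K_s - 1} \sum_{k=0}^{K_s - 1} \omega(i, j, k)\, X(t + i,\ x + j,\ y + k)$$

[0087] where $Y(t, x, y)$ is the feature value of the spatiotemporal behavior feature map at coordinates $(t, x, y)$, $K_t$ is the temporal dimension size of the convolution kernel, $K_s$ is the spatial dimension size of the convolution kernel, $\omega$ denotes the weight parameters of the 3D convolution kernel, $X$ is the masked temporal image sequence, $i$ is the temporal dimension index, and $j$ and $k$ are the spatial dimension indices.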
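To make the triple sum concrete, the sketch below implements a single-channel 3D convolution directly in NumPy. It is a simplification for illustration only: it assumes one input channel, one kernel, stride 1 and no padding, and the function name is hypothetical. A trained network would use a deep learning framework's 3D convolution layer with many channels.

```python
import numpy as np

def conv3d_valid(sequence, kernel):
    """Slide a (Kt, Ks, Ks) kernel over a (T, H, W) sequence without padding."""
    kt, kh, kw = kernel.shape
    t, h, w = sequence.shape
    out = np.zeros((t - kt + 1, h - kh + 1, w - kw + 1))
    # Accumulate one kernel tap at a time: each tap multiplies a shifted
    # view of the input, which is the triple sum over (i, j, k).
    for i in range(kt):
        for j in range(kh):
            for k in range(kw):
                window = sequence[i:i + out.shape[0],
                                  j:j + out.shape[1],
                                  k:k + out.shape[2]]
                out += kernel[i, j, k] * window
    return out

sequence = np.ones((3, 4, 4))          # 3 frames of 4x4 pixels
kernel = np.ones((3, 3, 3))            # spans all 3 frames and a 3x3 patch
print(conv3d_valid(sequence, kernel))  # every output sums 27 ones
```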

[0088] The spatiotemporal behavior feature map is input into the fully connected layer of the behavior classification network. The fully connected layer outputs a classification probability vector of the temporal image sequence over multiple preset abnormal behavior categories. The preset abnormal behavior category corresponding to the highest probability value in the classification probability vector is taken as the abnormal behavior recognition result for the construction site. Specifically, the behavior classification network also includes a global average pooling layer. The global average pooling layer performs global average pooling on the spatiotemporal behavior feature map output by the 3D spatiotemporal convolutional layer to generate a one-dimensional feature vector. The feature vector is input into the fully connected layer, which performs a linear transformation on the feature vector and outputs a raw score for each preset abnormal behavior category. The raw scores are processed by softmax activation to generate the classification probability vector, in which each element is the classification probability of one preset abnormal behavior category. The preset abnormal behavior category corresponding to the element with the highest value is selected as the final abnormal behavior recognition result for the construction site. The preset abnormal behavior categories include, but are not limited to, common violations and abnormal behaviors in construction site scenarios such as not wearing a safety helmet, not wearing a safety belt, unauthorized climbing, unauthorized hot work, and entering a dangerous area.

[0089] Table 4 Mapping Table of Abnormal Behaviors at Construction Sites and Classification Output

[0090]
| Category number | Abnormal behavior name | Category label value | Classification probability vector dimension index |
| --- | --- | --- | --- |
| 1 | Not wearing a safety helmet | 0 | 0 |
| 2 | Not wearing a safety belt | 1 | 1 |
| 3 | Unauthorized climbing | 2 | 2 |
| 4 | Unauthorized hot work | 3 | 3 |
| 5 | Entering a dangerous area | 4 | 4 |
| 6 | Normal behavior | 5 | 5 |

[0091] Table 4 represents the correspondence between the preset abnormal behavior categories at the construction site and the output labels and probability dimensions of the classification network, ensuring accurate mapping and output of the classification results. The category label values in the table are the labels used during training of the behavior classification network. The dimension index identifies which element of the output classification probability vector holds the probability of each category. The normal behavior category represents the absence of any preset abnormal behavior in the temporal image sequence, ensuring that the classification network covers all possible input scenarios.
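The classification head and the Table 4 mapping can be sketched as follows. The feature layout (C, T, H, W), the weight and bias arrays, and the function name are assumptions for illustration; in a real system the weights would come from training.

```python
import numpy as np

# Index order follows the dimension index column of Table 4.
CATEGORIES = [
    "not wearing a safety helmet",
    "not wearing a safety belt",
    "unauthorized climbing",
    "unauthorized hot work",
    "entering a dangerous area",
    "normal behavior",
]

def classify(feature_map, weight, bias):
    """Global average pooling, linear layer, softmax, then argmax."""
    pooled = feature_map.mean(axis=(1, 2, 3))     # (C,) one value per channel
    scores = weight @ pooled + bias               # raw scores, one per category
    exp = np.exp(scores - scores.max())           # numerically stable softmax
    probs = exp / exp.sum()
    return CATEGORIES[int(np.argmax(probs))], probs

features = np.ones((2, 1, 2, 2))                  # toy (C, T, H, W) feature map
weight = np.zeros((len(CATEGORIES), 2))           # placeholder, untrained
bias = np.zeros(len(CATEGORIES))
bias[3] = 5.0                                     # favour index 3 for the demo
label, probs = classify(features, weight, bias)
print(label, probs.round(3))
```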

[0092] In this embodiment, one-dimensional mask vectors with a uniform scale are generated by assigning values to the foreground human motion masks and normalizing them with the L2 norm. A spatiotemporal feature similarity matrix is constructed by calculating the cosine similarities of the mask vectors of three consecutive frames. Based on the numerical relationship in the similarity matrix, pseudo-motion masks are identified and eliminated, while temporally consistent motion masks are retained. The temporally consistent motion mask is multiplied with the original images to generate a masked temporal image sequence. Spatiotemporal behavior features are extracted through the 3D spatiotemporal convolutional layer, and a classification probability vector is output through the fully connected layer. Finally, the abnormal behavior recognition result for the construction site is obtained, ensuring the accuracy and anti-interference capability of abnormal behavior recognition.

Claims

1. A method for intelligent identification of abnormal behavior at construction sites based on deep learning, characterized in that it comprises: acquiring continuous image frames from construction site monitoring video, inputting the continuous image frames into a preset feature encoder, and extracting shallow feature maps corresponding to two adjacent image frames; inputting the shallow feature maps into a background dynamic decoupling network and processing them with a parallel decoupling network included in the background dynamic decoupling network, wherein the parallel decoupling network includes a background motion estimation branch and a foreground human body separation branch, the background motion estimation branch predicts a background pixel-level displacement field based on deformable convolution, the background pixel-level displacement field is used to reverse-warp and align the shallow feature map of the current frame to generate an aligned feature map, and the foreground human body separation branch extracts a foreground human motion mask based on the aligned feature map; inputting the foreground human motion masks corresponding to multiple consecutive image frames into a consistency verification network, calculating a spatiotemporal feature similarity matrix of the foreground human motion masks, and, based on the spatiotemporal feature similarity matrix, removing pseudo-motion masks caused by sudden changes in illumination and retaining temporally consistent motion masks; and inputting the temporally consistent motion masks into a behavior classification network and outputting the abnormal behavior recognition result for the construction site.

2. The intelligent identification method for abnormal behavior at construction sites based on deep learning according to claim 1, characterized in that the background motion estimation branch predicting the background pixel-level displacement field based on deformable convolution includes: concatenating the shallow feature map of the reference frame in the two adjacent frames with the shallow feature map of the current frame along the channel dimension to generate a concatenated feature map; inputting the concatenated feature map into a multi-layer deformable convolutional layer in the background motion estimation branch, wherein the multi-layer deformable convolutional layer learns the spatial offsets in the concatenated feature map and generates a background pixel-level displacement field that matches the size of the shallow feature map of the current frame, the background pixel-level displacement field including horizontal displacement components and vertical displacement components; and generating the aligned feature map by sampling and offsetting the shallow feature map of the current frame using a bilinear interpolation algorithm combined with the background pixel-level displacement field.

3. The intelligent identification method for abnormal behavior at construction sites based on deep learning according to claim 1, characterized in that the foreground human body separation branch extracting the foreground human motion mask based on the aligned feature map includes: inputting the aligned feature map into the encoder-decoder architecture in the foreground human body separation branch, wherein the encoder-decoder architecture performs downsampling and upsampling processing on the aligned feature map and outputs a multi-scale feature map; applying pixel-wise sigmoid activation processing to the multi-scale feature map to generate a foreground probability map with the same resolution as the aligned feature map; and calculating the average probability of all pixels in the foreground probability map, using the average probability as a dynamic segmentation benchmark, marking the pixel regions in the foreground probability map that are greater than the dynamic segmentation benchmark as foreground regions and the pixel regions that are less than the dynamic segmentation benchmark as background regions, and generating a binarized foreground human motion mask.

4. The intelligent identification method for abnormal behavior at construction sites based on deep learning according to claim 1, characterized in that calculating the spatiotemporal feature similarity matrix of the foreground human motion mask includes: extracting the three foreground human motion masks corresponding to three consecutive frames of images, and flattening each of the three foreground human motion masks into a one-dimensional mask vector; calculating a first cosine similarity between the one-dimensional mask vector of the first frame and the one-dimensional mask vector of the second frame, calculating a second cosine similarity between the one-dimensional mask vector of the second frame and the one-dimensional mask vector of the third frame, and combining the first cosine similarity and the second cosine similarity to construct the spatiotemporal feature similarity matrix; and, when the first cosine similarity and the second cosine similarity in the spatiotemporal feature similarity matrix are negatively correlated, determining that the foreground human motion mask of the intermediate frame is a pseudo motion mask caused by a sudden change in illumination, and removing it.
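
The two cosine similarities of claim 4 can be sketched in NumPy as shown below. The claim does not define "negative correlation" numerically, so this sketch adopts one possible reading, namely that the two similarities have opposite signs; that rule, the function names, and the 1x2 layout of the similarity matrix are assumptions rather than the claimed method. The masks are assumed to be encoded with positive and negative values, so no vector has zero norm.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two masks after row-major flattening."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def check_middle_frame(m1, m2, m3):
    """Return the similarity pair and a pseudo-mask flag for frame 2."""
    s12 = cosine(m1, m2)  # first cosine similarity (frames 1 and 2)
    s23 = cosine(m2, m3)  # second cosine similarity (frames 2 and 3)
    similarity = np.array([s12, s23])
    # ASSUMPTION: "negative correlation" is read as opposite signs.
    is_pseudo = s12 * s23 < 0
    return similarity, is_pseudo
```

Under this reading, three identical masks give the pair (1, 1) and the intermediate mask is kept, whereas a pair such as (1, -1) flags the intermediate mask for removal.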

5. The intelligent identification method for abnormal behavior at construction sites based on deep learning according to claim 1, characterized in that extracting the shallow feature maps corresponding to two adjacent frames includes: inputting each frame of the consecutive image frames into the residual network backbone of the feature encoder, and extracting a single-frame initial feature map through the initial convolutional layer of the residual network backbone; concatenating the two single-frame initial feature maps corresponding to the two adjacent frames along the channel dimension to generate a dual-channel initial feature map; and inputting the dual-channel initial feature map into the residual block group of the residual network backbone, extracting spatial features from the dual-channel initial feature map through the multiple cascaded residual blocks of the residual block group, and outputting the shallow feature maps corresponding to the two adjacent frames.

6. The intelligent identification method for abnormal behavior at construction sites based on deep learning according to claim 1, characterized in that inputting the temporally consistent motion mask into the behavior classification network includes: performing element-wise multiplication of the temporally consistent motion mask with the original images of the consecutive image frames to generate a masked temporal image sequence; inputting the masked temporal image sequence into the three-dimensional spatiotemporal convolutional layer of the behavior classification network, wherein the three-dimensional spatiotemporal convolutional layer convolves the masked temporal image sequence along the temporal and spatial dimensions to extract a spatiotemporal behavior feature map; inputting the spatiotemporal behavior feature map into the fully connected layer of the behavior classification network, wherein the fully connected layer outputs a classification probability vector of the masked temporal image sequence over multiple preset abnormal behavior categories; and taking the preset abnormal behavior category corresponding to the maximum probability value in the classification probability vector as the construction site abnormal behavior recognition result.
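
The first and last steps of claim 6, element-wise masking and selection of the most probable category, can be sketched in NumPy as follows. The three-dimensional convolution and fully connected layers are omitted here. The array layouts, the function names, and any category labels passed in are hypothetical and not taken from the claims.

```python
import numpy as np

def mask_sequence(frames, masks):
    """Apply per-frame binary masks to an image sequence.

    frames: (T, H, W, C) original images (assumed layout).
    masks:  (T, H, W) temporally consistent motion masks in {0, 1}.
    """
    # Broadcast each mask over the channel axis and multiply element-wise.
    return frames * masks[..., None]

def pick_category(prob_vector, categories):
    """Return the category with the maximum classification probability."""
    return categories[int(np.argmax(prob_vector))]
```

Pixels outside the mask become zero in every channel, so only the masked human regions contribute to the subsequent spatiotemporal convolution.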

7. The intelligent identification method for abnormal behavior at construction sites based on deep learning according to claim 2, characterized in that the multi-layer deformable convolutional layer learning the spatial offsets in the stitched feature map includes: performing a convolution operation on the stitched feature map using the regular convolution kernels of the multi-layer deformable convolutional layer, and outputting an initial offset feature map; performing a channel rearrangement operation on the initial offset feature map, which divides the channel dimension into a number of sub-channel groups equal to the number of regular convolution kernels, each sub-channel group corresponding to one two-dimensional coordinate offset output; superimposing the two-dimensional coordinate offsets on the regular sampling grid coordinates of the regular convolution kernel to generate deformed sampling grid coordinates; and generating the background pixel-level displacement field by sampling features from the stitched feature map at the deformed sampling grid coordinates.
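
The channel rearrangement and grid deformation of claim 7 can be sketched in NumPy as shown below. This sketch assumes a single k x k kernel whose k*k sampling points each receive one two-dimensional offset, with channels ordered as (vertical, horizontal) pairs. That ordering, the grouping by sampling point, and the function names are assumptions and may differ from the claimed grouping.

```python
import numpy as np

def split_offsets(offset_map, k=3):
    """Rearrange a (2*k*k, H, W) offset map into (k*k, 2, H, W) groups."""
    c, h, w = offset_map.shape
    if c != 2 * k * k:
        raise ValueError("expected 2*k*k offset channels")
    # Each group holds one 2-D offset: index 0 = dy, index 1 = dx (assumed).
    return offset_map.reshape(k * k, 2, h, w)

def deformed_grid(groups, y, x, k=3):
    """Deformed sampling coordinates of a k x k kernel centred at (y, x)."""
    r = k // 2
    gy, gx = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1),
                         indexing="ij")
    # Regular sampling grid coordinates, one (row, column) pair per point.
    regular = np.stack([gy.ravel() + y, gx.ravel() + x], axis=1).astype(float)
    # Superimpose the learned offsets for this output position.
    return regular + groups[:, :, y, x]
```

The resulting coordinates are generally fractional, so reading features at them requires interpolation such as the bilinear scheme named in claim 2.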

8. The intelligent identification method for abnormal behavior at construction sites based on deep learning according to claim 3, characterized in that the encoder-decoder architecture performing downsampling and upsampling on the aligned feature map includes: in the encoding stage of the encoder-decoder architecture, downsampling the aligned feature map multiple times through convolutional layers with a preset stride to extract multiple downsampled feature maps with different spatial resolutions; and, in the decoding stage of the encoder-decoder architecture, upsampling the last downsampled feature map through a transposed convolutional layer, concatenating the upsampled result with the second-to-last downsampled feature map, which has the same spatial resolution, and repeating the upsampling and feature concatenation operations until the original spatial resolution of the aligned feature map is restored, thereby generating the multi-scale feature map.

9. The intelligent identification method for abnormal behavior at construction sites based on deep learning according to claim 4, characterized in that flattening the three foreground human motion masks into one-dimensional mask vectors includes: assigning a preset positive value to the pixels in the foreground region of each foreground human motion mask and a preset negative value to the pixels in the background region, thereby generating an assigned binary mask matrix; expanding the binary mask matrix in row-major order to generate an initial one-dimensional vector; and calculating the L2 norm of each initial one-dimensional vector and dividing each element of the initial one-dimensional vector by the corresponding L2 norm to generate the normalized one-dimensional mask vector.
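
The vectorization of claim 9 can be sketched in NumPy as follows. The preset values +1 and -1 and the function name `mask_to_unit_vector` are assumptions chosen for illustration; the claim only requires one positive and one negative preset value.

```python
import numpy as np

def mask_to_unit_vector(mask, pos=1.0, neg=-1.0):
    """Turn a binary (H, W) mask into an L2-normalized 1-D vector."""
    # Foreground pixels get the positive value, background the negative one.
    signed = np.where(np.asarray(mask) > 0, pos, neg)
    # Row-major ("C" order) expansion into an initial 1-D vector.
    vec = signed.ravel(order="C")
    # Divide every element by the vector's L2 norm.
    return vec / np.linalg.norm(vec)
```

Since every pixel maps to a non-zero value, the norm is never zero, and the dot product of two such unit vectors directly equals their cosine similarity.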

10. A deep learning-based intelligent identification system for abnormal behavior at construction sites, characterized by comprising: a feature extraction processor, configured to acquire consecutive image frames of a construction site monitoring video, input the consecutive image frames into a preset feature encoder, and extract shallow feature maps corresponding to two adjacent frames; a background decoupling processor, configured to input the shallow feature maps into a background dynamic decoupling network and process the shallow feature maps through a parallel decoupling network included in the background dynamic decoupling network, wherein the parallel decoupling network includes a background motion estimation branch and a foreground human body separation branch, the background motion estimation branch predicts a background pixel-level displacement field based on deformable convolution and uses the background pixel-level displacement field to perform reverse distortion alignment on the shallow feature map of the current frame to generate an aligned feature map, and the foreground human body separation branch extracts a foreground human motion mask based on the aligned feature map; a consistency check processor, configured to input the foreground human motion masks corresponding to multiple consecutive frames of images into a consistency check network, calculate the spatiotemporal feature similarity matrix of the foreground human motion masks, and, based on the spatiotemporal feature similarity matrix, remove pseudo motion masks caused by sudden changes in illumination while retaining temporally consistent motion masks; and a behavior classification processor, configured to input the temporally consistent motion mask into a behavior classification network and output the construction site abnormal behavior recognition result.