AI video monitoring method for intrusion into dangerous areas of construction sites
By constructing a three-dimensional pixel array and a spatiotemporal structure tensor matrix, the trajectory manifold geometric features in construction site videos are extracted, the abnormal behavior manifold index is calculated, and a pixel micro-motion fluctuation feature set is generated. This solves the problem of distinguishing between stationary objects and human bodies in construction site monitoring videos and achieves high-accuracy intrusion detection in complex environments.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- INNER MONGOLIA JIAYING CONSTR ENG CO LTD
- Filing Date
- 2026-03-16
- Publication Date
- 2026-06-12
AI Technical Summary
Existing image recognition technology has difficulty distinguishing between stationary objects with similar shapes and human bodies in construction site monitoring videos. It is prone to misjudgment, especially in environments with complex lighting, numerous piles of objects, and severe obstruction, leading to missed reports of intrusions into dangerous areas.
By constructing a three-dimensional pixel array, a spatiotemporal slice texture map is generated, the gradient covariance matrix and spatiotemporal structure tensor matrix are calculated, the trajectory manifold geometric feature vector is extracted, the abnormal behavior manifold index is calculated, static or quasi-static regions are screened, a temporal pixel grayscale sequence of ROI is generated, the pixel micro-motion fluctuation feature set is calculated, and finally, it is compared with a preset threshold to generate an intrusion assessment signal for dangerous areas at the construction site.
It improves the robustness and accuracy of intrusion detection in dangerous areas in complex scenarios, avoids the risk of missed detection due to stationary targets, and can accurately distinguish between stationary construction materials and stationary personnel with vital signs.
Figure CN122200546A_ABST
Description
Technical Field
[0001] This invention relates to the field of image recognition technology, and in particular to an AI video monitoring method for intrusion into dangerous areas of construction sites.
Background Technology
[0002] Image recognition technology uses computer vision theory and artificial intelligence algorithms to simulate the human visual system to intelligently process and analyze digital images or video streams.
[0003] Current image recognition technologies for processing construction site surveillance videos mostly employ detection modes based on inter-frame differences or target contour features. However, under the complex lighting, numerous piles of debris and severe occlusion found at construction sites, relying solely on visual contour features is highly susceptible to environmental noise interference, making it difficult to distinguish between similar-looking stationary objects and human bodies. For example, piled-up cement bags may be misidentified as squatting individuals, or a fallen person may be mistaken for construction waste. Therefore, improvements are needed.
Summary of the Invention
[0004] The purpose of this invention is to address the shortcomings of existing technologies by proposing an AI video monitoring method for intrusion into dangerous areas of construction sites.
[0005] To achieve the above objectives, the present invention adopts the following technical solution: an AI video monitoring method for intrusion into dangerous areas of construction sites, comprising the following steps: Based on the video stream data covering the dangerous boundary of the construction site, continuous frame images are extracted to construct a three-dimensional pixel array. The three-dimensional pixel array is sliced along the time axis to obtain the cross-sectional image at the specified spatial location, and a spatiotemporal slice texture map is generated. The partial derivatives of brightness with respect to the spatial and temporal dimensions are calculated for each pixel in the spatiotemporal slice texture map, and a gradient covariance matrix is constructed from these partial derivatives to generate the spatiotemporal structure tensor matrix. The eigenvalues and eigenvectors of the spatiotemporal structure tensor matrix are extracted to generate the trajectory manifold geometric feature vector; the Euclidean distance between the trajectory manifold geometric feature vector and the preset standard walking parameter set is calculated, abnormal large-amplitude movements are quantified, and the abnormal behavior manifold index is generated. The pixel coordinate regions whose abnormal behavior manifold index lies in the static or quasi-static range are screened out, the edge pixel brightness values of the dangerous boundary targets within those regions are collected, and the ROI time-domain pixel grayscale sequence is generated. Statistical operations are performed on the ROI time-domain pixel grayscale sequence to calculate the standard deviation, which represents the brightness dispersion, and the zero-crossing rate, which represents the signal oscillation frequency, generating the pixel micro-motion fluctuation feature set. Based on the pixel micro-motion fluctuation feature set, a target activity value is calculated. The target activity value is then compared with a preset non-biological background noise threshold and a biological sign lower limit threshold to generate an intrusion assessment signal for the dangerous area of the construction site.
[0006] Preferably, the steps for obtaining the spatiotemporal structure tensor matrix are as follows: Based on the video stream data covering the dangerous boundary of the construction site, time-continuous frame images are extracted and stacked into a three-dimensional pixel array according to pixel coordinates; the three-dimensional pixel array is divided into equal-length segments along the time axis, the cross-sectional image at the specified spatial location is located, and pixel interpolation correction is performed to generate a spatiotemporal slice texture map. Based on the spatiotemporal slice texture map, the brightness difference is calculated for each pixel along the horizontal pixel index, vertical pixel index and time index. The mean and variance of the brightness differences are aggregated by pixel position to generate component statistics, the component covariances are calculated from the component statistics, and the gradient covariance matrix is constructed. Based on the gradient covariance matrix, eigenvalues and eigenvectors are calculated pixel by pixel and the principal direction index is recorded. The local texture direction and intensity are quantized according to the principal direction index and mapped to structural components. The structural component matrix is reorganized by pixel position while keeping the temporal index consistent, generating the spatiotemporal structure tensor matrix.
[0007] Preferably, the steps for obtaining the trajectory manifold geometric feature vector are as follows: Based on the spatiotemporal structure tensor matrix, the principal eigenvalues and corresponding eigenvectors of each pixel are extracted. The eigenvectors are sorted in descending order of eigenvalues. The correspondence between each eigenvalue in the same spatial position is recorded. The ratio of the largest, second largest, and smallest eigenvalues is structured into a set of directional parameters to generate trajectory manifold geometric eigenvectors.
[0008] Preferably, the step of obtaining the abnormal behavior manifold index is as follows: Based on the trajectory manifold geometric feature vector, each element is paired and compared with the standard walking parameter set. The squared difference of each corresponding component is calculated and its square root is taken to obtain three action intensities representing dynamic differences, defined as the running action intensity, the falling action intensity and the loitering action intensity, which together form the action intensity triplet. The abnormal behavior manifold index is calculated from the action intensity triplet.
[0009] Preferably, the step of obtaining the temporal pixel grayscale sequence of the ROI is as follows: Based on the abnormal behavior manifold index, a lower limit and an upper limit of the static threshold are set. The interval relationship between the abnormal behavior manifold index and the lower and upper limits of the static threshold is compared pixel by pixel. The pixel coordinates that satisfy the interval relationship are marked and eight-neighbor connectivity is marked. Adjacent marked regions are merged to obtain a set of boundary coordinates, and a static or quasi-static pixel coordinate region is generated. Based on the static or quasi-static pixel coordinate region, the edge pixel position of the dangerous boundary target at the construction site is located in each frame. The edge pixel brightness value of the dangerous boundary target at the construction site is extracted and spliced in the order of time index and pixel coordinate to generate the ROI time domain pixel grayscale sequence.
[0010] Preferably, the step of obtaining the pixel micro-motion fluctuation feature set is as follows: Based on the ROI temporal pixel grayscale sequence, the brightness standard deviation is calculated for the zero-mean sequence using a sliding time window of uniform duration, and the proportion of sign changes to the number of samples is used as the zero-crossing rate. The brightness standard deviation and the zero-crossing rate are aggregated into entries according to pixel coordinates to generate a pixel micro-motion fluctuation feature set.
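The two statistics named above can be illustrated with a minimal sketch. It assumes a window stride of one sample and a population standard deviation, neither of which the text specifies; the function name `micro_motion_features` is hypothetical.

```python
import math

def micro_motion_features(gray_sequence, window):
    """Return one (std, zero_crossing_rate) pair per sliding window.

    gray_sequence: brightness samples of one ROI pixel over time.
    window: number of samples per window (uniform duration).
    Assumptions: stride of 1 sample, population standard deviation.
    """
    features = []
    for start in range(len(gray_sequence) - window + 1):
        segment = gray_sequence[start:start + window]
        mean = sum(segment) / window
        centered = [v - mean for v in segment]  # zero-mean sequence
        std = math.sqrt(sum(c * c for c in centered) / window)
        # A sign change between consecutive samples counts as one crossing.
        crossings = sum(1 for a, b in zip(centered, centered[1:]) if a * b < 0)
        zcr = crossings / window  # proportion of sign changes to sample count
        features.append((std, zcr))
    return features
```

A perfectly constant sequence yields zero for both values, while a sequence that flickers around its mean yields a non-zero standard deviation and a high zero-crossing rate.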
[0011] Preferably, the step of obtaining the target activity value is as follows: Based on the pixel micro-motion fluctuation feature set, the standard deviation value and zero-crossing rate value are extracted according to the pixel coordinate index, and missing value removal, time synchronization and amplitude standardization are performed to generate a standardized feature combination. The target activity value is calculated based on the standardized feature combination.
[0012] Preferably, the steps for obtaining the intrusion assessment signal in the hazardous area of the construction site are as follows: Based on the target activity value, a non-biological background noise threshold and a biological sign lower limit threshold are set. The target activity value is then judged within a range. If the target activity value is lower than the non-biological background noise threshold, it is determined to be stationary construction material. If it is between the two thresholds or higher than the biological sign lower limit threshold, it is determined to be stationary personnel with vital signs, thus generating an intrusion assessment signal for the dangerous area of the construction site.
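The range judgment can be sketched as follows. The function name, the label strings, the `intrusion` flag, and the treatment of a value exactly equal to a threshold are illustrative assumptions; the text does not give threshold values, so they are passed in as parameters.

```python
def assess_intrusion(activity, noise_threshold, vital_sign_lower_limit):
    """Map a target activity value to an assessment result (hypothetical labels)."""
    if noise_threshold > vital_sign_lower_limit:
        raise ValueError("noise threshold must not exceed the vital-sign lower limit")
    if activity < noise_threshold:
        # Below the non-biological background noise threshold.
        return {"label": "stationary_material", "intrusion": False}
    # Between the two thresholds, or above the vital-sign lower limit:
    # the text treats both cases as stationary personnel with vital signs.
    if activity >= vital_sign_lower_limit:
        band = "above_lower_limit"
    else:
        band = "between_thresholds"
    return {"label": "stationary_personnel", "intrusion": True, "band": band}
```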
[0013] Compared with the prior art, the advantages and positive effects of the present invention are as follows: In this invention, continuous frame images are extracted from video stream data covering the dangerous boundaries of a construction site to construct a three-dimensional pixel array, which is then sliced along the time axis to obtain spatiotemporal slice texture maps. This transforms dynamic temporal motion into static spatial texture features for analysis. By calculating the partial derivatives of spatiotemporal brightness, a gradient covariance matrix is constructed and a spatiotemporal structure tensor matrix is generated. This quantifies the local texture direction and intensity of pixels in the spatiotemporal domain and avoids the sensitivity of traditional inter-frame difference methods to changes in illumination. The trajectory manifold geometric feature vector is generated from the ratios of the eigenvalues, and its Euclidean distance from the standard walking parameter set is calculated. This transforms the abstract motion trajectory into numerical values representing the intensity of running, falling, or loitering actions, achieving precise quantification and numerical labeling of the amplitude of abnormal behavior. Edge pixel brightness values are then collected from the screened static or quasi-static regions to generate the ROI time-domain pixel grayscale sequence. Statistical operations are performed on this sequence to calculate the standard deviation representing the brightness dispersion and the zero-crossing rate representing the signal oscillation frequency. This allows the detection of minute pixel fluctuations caused by biological respiration or muscle tremors, generating a pixel micro-motion fluctuation feature set from which the target activity value is calculated. By comparing this value with the non-biological background noise threshold and the biological sign lower limit threshold, the system can distinguish between stationary construction materials and stationary personnel with vital signs, even when the target lacks displacement characteristics. This improves the robustness and accuracy of intrusion detection in dangerous areas under complex scenarios and avoids the risk of missed detections due to target stillness.
Attached Figure Description
[0014] Figure 1 is a schematic diagram of the steps of the present invention.
Detailed Implementation
[0015] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.
[0016] Referring to Figure 1, this invention provides a technical solution: an AI video monitoring method for detecting intrusions into dangerous areas of construction sites, comprising the following steps: Based on the video stream data covering the dangerous boundary of the construction site, continuous frame images are extracted to construct a three-dimensional pixel array. The three-dimensional pixel array is sliced along the time axis to obtain the cross-sectional image at the specified spatial location, and a spatiotemporal slice texture map is generated. The partial derivatives of brightness with respect to the spatial and temporal dimensions are calculated for each pixel in the spatiotemporal slice texture map, and a gradient covariance matrix is constructed from these partial derivatives to generate the spatiotemporal structure tensor matrix. The eigenvalues and eigenvectors of the spatiotemporal structure tensor matrix are extracted to generate the trajectory manifold geometric feature vector; the Euclidean distance between the trajectory manifold geometric feature vector and the preset standard walking parameter set is calculated, abnormal large-amplitude movements are quantified, and the abnormal behavior manifold index is generated. The pixel coordinate regions whose abnormal behavior manifold index lies in the static or quasi-static range are screened out, the edge pixel brightness values of the dangerous boundary targets within those regions are collected, and the ROI time-domain pixel grayscale sequence is generated. Statistical operations are performed on the ROI time-domain pixel grayscale sequence to calculate the standard deviation, which represents the brightness dispersion, and the zero-crossing rate, which represents the signal oscillation frequency, generating the pixel micro-motion fluctuation feature set. Based on the pixel micro-motion fluctuation feature set, the target activity value is calculated. The target activity value is then compared with the preset non-biological background noise threshold and the biological sign lower limit threshold to generate an intrusion assessment signal for the dangerous area of the construction site.
[0017] The steps to obtain the spatiotemporal structure tensor matrix are as follows: Based on the video stream data covering the dangerous boundary of the construction site, time-continuous frame images are extracted and stacked into a three-dimensional pixel array according to pixel coordinates; the three-dimensional pixel array is divided into equal-length segments along the time axis, the cross-sectional image at the specified spatial location is located, and pixel interpolation correction is performed to generate a spatiotemporal slice texture map. Based on the spatiotemporal slice texture map, the brightness difference is calculated for each pixel along the horizontal pixel index, vertical pixel index and time index. The mean and variance of the brightness differences are aggregated by pixel position to generate component statistics, the component covariances are calculated from the component statistics, and the gradient covariance matrix is constructed. Based on the gradient covariance matrix, eigenvalues and eigenvectors are calculated pixel by pixel and the principal direction index is recorded. The local texture direction and intensity are quantized according to the principal direction index and mapped to structural components. The structural component matrix is reorganized by pixel position while keeping the temporal index consistent, generating the spatiotemporal structure tensor matrix.
[0018] Specifically, based on the video stream data covering the dangerous boundaries of the construction site, the video stream acquisition parameters are set to a resolution of 1920×1080 pixels and a frame rate of 25 frames per second, and the time window length $T$ is set to 100 frames. The acquired video stream data is converted to a grayscale color space to reduce computational redundancy, and $T$ consecutive frame images are stacked in memory in chronological order to form a three-dimensional pixel array of size $H \times W \times T$, in which $H$ is the image height, $W$ is the image width, and the time dimension $T$ is defined as the depth. The danger boundary of the construction site is set as a polyline path $P$ composed of $N$ coordinate points. Along the time axis, the three-dimensional pixel array is cut along the vertical plane in which path $P$ lies: the path is traversed, and for each sampling point on the path and each frame on the time axis the brightness is read. For pixels with non-integer coordinates, a bilinear interpolation algorithm is used to calculate the brightness value of that point. The calculation formula is as follows:

$$I(x, y) = \frac{(x_2 - x)(y_2 - y)\,I(x_1, y_1) + (x - x_1)(y_2 - y)\,I(x_2, y_1) + (x_2 - x)(y - y_1)\,I(x_1, y_2) + (x - x_1)(y - y_1)\,I(x_2, y_2)}{(x_2 - x_1)(y_2 - y_1)}$$

where $I(x, y)$ is the interpolated brightness of the target pixel, $x_1$ and $x_2$ are the left and right x-coordinates adjacent to the target point, $y_1$ and $y_2$ are the upper and lower y-coordinates adjacent to the target point, $I(x_1, y_1)$, $I(x_2, y_1)$, $I(x_1, y_2)$ and $I(x_2, y_2)$ are the original brightness values of the adjacent integer coordinate points, and $(x, y)$ are the floating-point coordinates of the target point. All extracted pixels are rearranged with the spatial path index as the horizontal axis and the time index as the vertical axis to generate an initial spatiotemporal cross-sectional image. To address the issue of inconsistent target scales due to camera perspective effects, a perspective transformation matrix is used to correct the cross-sectional image, with the baseline scale parameter set to 1.0. Based on the pre-calibrated camera intrinsic parameter matrix and the distance between calibration points on site, the scaling factor for each row of pixels is calculated. The cross-sectional image is resampled to generate a spatiotemporal slice texture map.
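The slicing step can be sketched in plain Python as below. This is a minimal illustration, not the patented implementation: it assumes frames are given as nested lists of grayscale values, samples the polyline at a fixed spacing (the `step` parameter is an assumption), uses standard bilinear interpolation on the unit pixel grid, and omits the perspective correction and per-row rescaling. All function names are hypothetical.

```python
import math

def bilinear(frame, x, y):
    """Brightness at floating-point (x, y) from the four surrounding pixels."""
    h, w = len(frame), len(frame[0])
    x = min(max(x, 0.0), w - 1.0)  # clamp to the image area
    y = min(max(y, 0.0), h - 1.0)
    x1, y1 = int(math.floor(x)), int(math.floor(y))
    x2, y2 = min(x1 + 1, w - 1), min(y1 + 1, h - 1)
    fx, fy = x - x1, y - y1
    top = frame[y1][x1] * (1 - fx) + frame[y1][x2] * fx
    bottom = frame[y2][x1] * (1 - fx) + frame[y2][x2] * fx
    return top * (1 - fy) + bottom * fy

def sample_polyline(vertices, step=1.0):
    """Sample points along a polyline at roughly `step`-pixel spacing."""
    points = [vertices[0]]
    for (xa, ya), (xb, yb) in zip(vertices, vertices[1:]):
        length = math.hypot(xb - xa, yb - ya)
        n = max(1, int(round(length / step)))
        for i in range(1, n + 1):
            t = i / n
            points.append((xa + t * (xb - xa), ya + t * (yb - ya)))
    return points

def spatiotemporal_slice(frames, vertices, step=1.0):
    """Rows follow the time index, columns follow the spatial path index."""
    points = sample_polyline(vertices, step)
    return [[bilinear(frame, x, y) for (x, y) in points] for frame in frames]
```

Each output row corresponds to one frame, so a stationary object on the boundary appears as vertical stripes and a moving object as slanted ones.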
[0019] Based on the spatiotemporal slice texture map, the horizontal direction of the image is defined as the spatial dimension $x$ and the vertical direction as the time dimension $t$. A Sobel operator of size [size missing] is selected as the gradient calculation kernel to calculate the partial derivatives of brightness for each pixel in the spatial and temporal dimensions. The calculation formula is as follows:

$$G_x = I * S_x, \qquad G_t = I * S_t$$

where $G_x$ is the spatial gradient map, $G_t$ is the temporal gradient map, $I$ is the input spatiotemporal slice texture map, $*$ denotes the convolution operation, $S_x$ is the horizontal Sobel kernel, and $S_t$ is the vertical Sobel kernel. The products of the gradient components, $G_x^2$, $G_t^2$ and $G_x G_t$, are calculated pixel by pixel, where $G_x$ is the spatial gradient value and $G_t$ is the temporal gradient value of the corresponding pixel. The local smoothing window size is set to [size missing] pixels, and within this window a Gaussian weighting function is introduced for weighted summation. The expression for the Gaussian function is:

$$w(u, v) = \frac{1}{2\pi\sigma^2}\exp\left(-\frac{u^2 + v^2}{2\sigma^2}\right)$$

where $w(u, v)$ is the weight at position $(u, v)$ within the window, $u$ and $v$ are the horizontal and vertical offsets relative to the center of the window, and $\sigma$ is the standard deviation of the Gaussian distribution (e.g., 1.5), which controls the degree of smoothing. Weighted statistics are applied to the gradient products within the window, and the smoothed component statistics $J_{xx}$, $J_{tt}$ and $J_{xt}$ are calculated separately. The gradient covariance matrix (also known as the structure tensor) of each pixel is constructed from these three statistical components, and its form is:

$$J = \begin{bmatrix} J_{xx} & J_{xt} \\ J_{xt} & J_{tt} \end{bmatrix}$$

where $J_{xx}$ represents the variance of the spatial gradient, $J_{tt}$ represents the variance of the temporal gradient, and $J_{xt}$ represents the covariance of the spatial and temporal gradients. The gradient covariance matrix is thereby obtained.
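The gradient and smoothing steps can be sketched as follows. Several choices here are assumptions, since the source does not give them: a 3×3 Sobel kernel, a 5×5 smoothing window, Gaussian weights normalized to sum to one, and clamped borders. The filter is written as a correlation rather than a flipped-kernel convolution; this only changes the sign of the gradients, which leaves the products $G_x^2$, $G_t^2$ and $G_x G_t$ unchanged.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # derivative along the spatial axis
SOBEL_T = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # derivative along the time axis

def filter2d(img, kernel):
    """Correlate img with a square kernel; border pixels are clamped."""
    h, w, r = len(img), len(img[0]), len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += kernel[dy + r][dx + r] * img[yy][xx]
            out[y][x] = acc
    return out

def gaussian_kernel(size, sigma):
    """Gaussian weights over a size x size window, normalized to sum to 1."""
    r = size // 2
    k = [[math.exp(-(u * u + v * v) / (2.0 * sigma * sigma))
          for u in range(-r, r + 1)] for v in range(-r, r + 1)]
    total = sum(sum(row) for row in k)
    return [[wgt / total for wgt in row] for row in k]

def structure_tensor(slice_img, window=5, sigma=1.5):
    """Return the smoothed components (Jxx, Jtt, Jxt) for every pixel."""
    gx = filter2d(slice_img, SOBEL_X)
    gt = filter2d(slice_img, SOBEL_T)
    h, w = len(slice_img), len(slice_img[0])

    def product(a, b):
        return [[a[y][x] * b[y][x] for x in range(w)] for y in range(h)]

    g = gaussian_kernel(window, sigma)
    jxx = filter2d(product(gx, gx), g)
    jtt = filter2d(product(gt, gt), g)
    jxt = filter2d(product(gx, gt), g)
    return jxx, jtt, jxt
```

For a slice whose brightness varies only along the spatial axis, `jtt` and `jxt` are zero everywhere, which corresponds to a static texture with no temporal change.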
[0020] Based on the gradient covariance matrix, each pixel in the matrix map is traversed, its corresponding $2 \times 2$ gradient covariance matrix $J$ is obtained, and the characteristic equation of the matrix, $\det(J - \lambda E) = 0$, is solved to obtain the eigenvalues. The calculation formula is as follows:

$$\lambda_{1,2} = \frac{(J_{xx} + J_{tt}) \pm \sqrt{(J_{xx} - J_{tt})^2 + 4J_{xt}^2}}{2}$$

where $\lambda_1$ and $\lambda_2$ are the larger and smaller eigenvalues, respectively, $E$ is the identity matrix, and $J_{xx}$, $J_{tt}$ and $J_{xt}$ are the elements of the covariance matrix. The corresponding principal eigenvector $v_1$ is then calculated. This vector indicates the main direction of change (i.e., the direction of motion velocity) in the spatiotemporal texture, and it satisfies $J v_1 = \lambda_1 v_1$. From the components of the eigenvector $v_1 = (v_x, v_t)$, the principal direction index angle is calculated as $\theta = \arctan(v_t / v_x)$, where $v_x$ and $v_t$ are the components of the eigenvector along the spatial and temporal axes, respectively. The coherence measure of the local texture is calculated from the eigenvalues:

$$c = \frac{\lambda_1 - \lambda_2}{\lambda_1 + \lambda_2 + \varepsilon}$$

where $c$ is the coherence index, representing the degree of certainty of the texture direction, and $\varepsilon$ is an extremely small constant that prevents the denominator from being zero (e.g., [value missing]). The calculated principal eigenvalue $\lambda_1$, secondary eigenvalue $\lambda_2$, principal direction angle $\theta$ and coherence index $c$ are taken as structural components and recombined according to the spatial coordinates of the original pixels and the time index $t$ to construct a multi-channel feature map, generating the spatiotemporal structure tensor matrix.
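For a symmetric 2×2 gradient covariance matrix with entries `jxx`, `jtt` and `jxt`, the eigenvalues and principal direction have a closed form, sketched below. The coherence expression (difference of the eigenvalues over their sum plus a small constant) and the value `eps=1e-6` are assumptions; the function name is hypothetical.

```python
import math

def tensor_components(jxx, jtt, jxt, eps=1e-6):
    """Return (lam1, lam2, theta, coherence) for the matrix [[jxx, jxt], [jxt, jtt]]."""
    trace = jxx + jtt
    root = math.sqrt((jxx - jtt) ** 2 + 4.0 * jxt ** 2)
    lam1 = 0.5 * (trace + root)  # larger eigenvalue
    lam2 = 0.5 * (trace - root)  # smaller eigenvalue
    # Angle of the eigenvector belonging to lam1, measured from the spatial axis.
    theta = 0.5 * math.atan2(2.0 * jxt, jxx - jtt)
    # Assumed coherence form: near 1 for a single dominant direction, 0 if isotropic.
    coherence = (lam1 - lam2) / (lam1 + lam2 + eps)
    return lam1, lam2, theta, coherence
```

The `atan2` form avoids dividing by a zero spatial component, which the plain ratio of eigenvector components would require handling separately.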
[0021] The steps for obtaining the geometric eigenvectors of the trajectory manifold are as follows: Based on the spatiotemporal structure tensor matrix, the principal eigenvalues and corresponding eigenvectors of each pixel are extracted. The eigenvectors are sorted in descending order of eigenvalues. The correspondence between each eigenvalue in the same spatial position is recorded. The proportions of the largest, second largest, and smallest eigenvalues are structured into a set of directional parameters to generate trajectory manifold geometric eigenvectors.
[0022] Specifically, based on the spatiotemporal structure tensor matrix, each pixel node in the matrix is traversed, and a linear algebra library is used to perform eigenvalue decomposition on the tensor data at that location to obtain eigenvalues describing the local spatiotemporal geometry. The eigenvalues are sorted in descending order and labeled as the largest eigenvalue $\lambda_1$, the second largest eigenvalue $\lambda_2$ and the smallest eigenvalue $\lambda_3$. The numerical relationship of these three eigenvalues at the same spatial coordinates is recorded to analyze the dynamic texture structure of the current pixel. To eliminate the influence of different lighting intensities and contrasts on the absolute magnitude of the eigenvalues, L2-norm normalization is performed: the normalization factor $\sqrt{\lambda_1^2 + \lambda_2^2 + \lambda_3^2}$ is calculated, and each eigenvalue is divided by it to obtain the normalized eigenvalues. Based on these normalized values, geometric descriptors of the motion pattern are constructed from the differences between eigenvalues, which quantify the shape of the trajectory: a linearity index is calculated to characterize motion along a single direction, a flatness index to characterize extended motion within a specific plane, and a dispersion index to characterize disordered noise or isotropic motion. The proportional relationships of the largest, second largest and smallest eigenvalues are structured into a set of directional parameters. At the same time, the eigenvector corresponding to the largest eigenvalue is extracted as the main motion direction vector, generating the trajectory manifold geometric feature vector.
[0023] The steps for obtaining the abnormal behavior manifold index are as follows: Based on the trajectory manifold geometric feature vector, each element is paired and compared with the standard walking parameter set. The squared difference of each corresponding component is calculated and its square root is taken to obtain three action intensities representing dynamic differences, defined as the running action intensity, the falling action intensity and the loitering action intensity, which together form the action intensity triplet. The abnormal behavior manifold index is calculated from the action intensity triplet. The formula is as follows:

$$M = \sqrt{I_{run}^2 + I_{fall}^2 + I_{loiter}^2 + \alpha \cdot \frac{(I_{run} - I_{fall})^2}{I_{run}^2 + I_{fall}^2 + \varepsilon}}$$

where $M$ is the abnormal behavior manifold index; $I_{run}$ is the running action intensity, calculated as the square root of the squared difference between the temporal gradient principal component of the trajectory manifold geometric feature vector and the standard walking parameter set; $I_{fall}$ is the falling action intensity, calculated in the same way from the vertical gradient principal component; $I_{loiter}$ is the loitering action intensity, calculated in the same way from the low-frequency fluctuation principal component; $\alpha$ is a dimensionless relative difference balance factor used to adjust the contribution of the difference term between the running action intensity and the falling action intensity; and $\varepsilon$ is a very small positive real number used to prevent the denominator from being zero when $I_{run}$ and $I_{fall}$ both approach zero.
[0024] Specifically, based on the trajectory manifold geometric feature vector, a pre-stored standard walking parameter set is read. This parameter set is obtained by processing video data collected at the construction site under normal walking conditions through the same feature extraction and normalization steps and statistically averaging the extracted trajectory manifold geometric feature vectors. For example, the normalized reference value for the temporal gradient is set to 0.15, the normalized reference value for the vertical gradient is set to 0.05, and the normalized reference value for the low-frequency fluctuation is set to 0.02. The trajectory manifold geometric feature vector of the current pixel is paired and compared with the standard walking parameter set element by element, and the numerical difference of each corresponding component is calculated. To keep the dimensions uniform in subsequent calculations, these differences are divided by a preset limit deviation value (e.g., with the maximum possible deviation set to 1.0) as a second normalization, yielding dimensionless difference coefficients between 0 and 1. These difference coefficients are then classified and mapped according to their physical meaning. The normalized difference between the temporal gradient principal component and the standard value is defined as the running action intensity, because running produces drastic gradient changes in the time dimension. The normalized difference between the vertical gradient principal component and the standard value is defined as the falling action intensity, because falling produces significant abrupt changes in the vertical spatial direction. The normalized difference between the low-frequency fluctuation principal component and the standard value is defined as the loitering action intensity, because loitering is a non-directional, continuous low-frequency disturbance. These three dimensionless intensity values are combined in sequence to obtain the action intensity triplet.
[0025] The formula for calculating the abnormal behavior manifold index incorporates a weighted difference term, giving a composite index that combines "overall movement amplitude" with "specific morphological asymmetry". Writing the running, falling, and loitering action intensities as $I_r$, $I_f$, and $I_l$, the relative difference balance factor as $\lambda$, and the stabilizing constant as $\varepsilon$, the form consistent with the worked values in this paragraph is: $M = \sqrt{I_r^2 + I_f^2 + I_l^2 + \lambda \dfrac{(I_r - I_f)^2}{I_r^2 + I_f^2 + \varepsilon}}$. The term $I_r^2 + I_f^2 + I_l^2$ measures the total energy intensity of the abnormal behavior, while the term $\lambda (I_r - I_f)^2 / (I_r^2 + I_f^2 + \varepsilon)$ uses a relative difference structure to sharpen the distinction between the two high-risk behaviors, running and falling. When the intensity difference between the two is large (i.e., the behavioral characteristics clearly tend toward one type), the index increases significantly, which improves sensitivity to specific dangerous actions. All parameters are dimensionless coefficients, which keeps the calculation physically consistent.
$I_r$, the running action intensity, is a dimensionless coefficient. It is obtained by extracting the principal component of the time gradient from the trajectory manifold geometric feature vector (e.g., a normalized measurement value of 0.85) and subtracting the corresponding baseline value from the standard walking parameter set (e.g., 0.15), which gives a raw difference of 0.70. This difference is divided by a preset normalization scaling factor (set to 1.0), giving $I_r = 0.70$; $I_f$, the falling action intensity, is a dimensionless coefficient. It is obtained by extracting the principal component of the vertical gradient from the trajectory manifold geometric feature vector (e.g., a normalized measurement value of 0.30) and subtracting the corresponding baseline value from the standard walking parameter set (e.g., 0.05), which gives a raw difference of 0.25. After normalization, $I_f = 0.25$; $I_l$, the loitering action intensity, is a dimensionless coefficient. It is obtained by extracting the low-frequency fluctuation principal component from the trajectory manifold geometric feature vector (e.g., a normalized measurement value of 0.12) and subtracting the corresponding baseline value from the standard walking parameter set (e.g., 0.02), which gives a raw difference of 0.10. After normalization, $I_l = 0.10$; $\lambda$ is a dimensionless relative difference balance factor that adjusts the contribution of the difference term between the running and falling intensities. It is set to balance single-action judgments against mixed-action noise: if it is too large, the index is oversensitive to small differences; if it is too small, complex actions cannot be distinguished. Logistic regression fitting on a simulated dataset gave an optimal value of $\lambda = 1.5$; $\varepsilon$ is a very small positive real number used to prevent the denominator from being zero.
Calculation based on these parameters: first, calculate the square of each intensity (dimensionless): $I_r^2 = 0.49$; $I_f^2 = 0.0625$; $I_l^2 = 0.01$; calculate the basic intensity sum (dimensionless energy term): $0.49 + 0.0625 + 0.01 = 0.5625$; calculate the numerator of the difference term: $(0.70 - 0.25)^2 = 0.2025$; calculate the denominator of the difference term: $0.49 + 0.0625 + \varepsilon \approx 0.5525$; calculate the ratio of the difference term: $0.2025 / 0.5525 \approx 0.3665$; calculate the weighted difference term (dimensionless adjustment term): $1.5 \times 0.3665 \approx 0.5498$; calculate the sum inside the square root: $0.5625 + 0.5498 = 1.1123$; calculate the final index: $M = \sqrt{1.1123} \approx 1.0546$.
This result indicates that the abnormal behavior manifold index of the current pixel region is 1.0546. Because the inputs are normalized, an index greater than 1.0 means both that the intensity of the action itself is relatively high (the energy term contributes 0.5625) and that the action characteristics are clearly directional (the weighted difference term contributes about 0.5498), which indicates a high probability of clear running behavior. This value exceeds both the preset static threshold (usually 0.1 to 0.3) and the general abnormal threshold (0.8), and it is output as the abnormal behavior manifold index.
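The worked example in paragraph [0025] can be checked with a short Python sketch. It assumes the index has the form sqrt(I_r^2 + I_f^2 + I_l^2 + lambda * (I_r - I_f)^2 / (I_r^2 + I_f^2 + eps)), which is inferred from the described calculation steps and reproduces the stated result of 1.0546. The function names and the value `eps=1e-6` are illustrative assumptions, not taken from the source.

```python
import math


def intensity(measured, baseline, scale=1.0):
    """Dimensionless action intensity: (measurement - walking baseline) / scaling factor."""
    return (measured - baseline) / scale


def manifold_index(i_run, i_fall, i_loiter, lam=1.5, eps=1e-6):
    """Assumed form of the abnormal behavior manifold index (see lead-in).

    eps is a hypothetical small constant; the source does not give its value here.
    """
    energy = i_run ** 2 + i_fall ** 2 + i_loiter ** 2                  # basic energy term
    diff_ratio = (i_run - i_fall) ** 2 / (i_run ** 2 + i_fall ** 2 + eps)
    return math.sqrt(energy + lam * diff_ratio)                        # weighted difference added


# Example values from paragraph [0025].
i_run = intensity(0.85, 0.15)      # 0.70
i_fall = intensity(0.30, 0.05)     # 0.25
i_loiter = intensity(0.12, 0.02)   # 0.10
index = manifold_index(i_run, i_fall, i_loiter)
print(round(index, 4))             # prints 1.0546
```

With these inputs the energy term is 0.5625 and the weighted difference term is about 0.5498, so both contribute roughly equally to the index.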
[0026] The steps for obtaining the temporal pixel grayscale sequence of the ROI are as follows: Based on the abnormal behavior manifold index, a lower limit and an upper limit of the static threshold are set. The interval relationship between the abnormal behavior manifold index and the lower and upper limits of the static threshold is compared pixel by pixel. The pixel coordinates that satisfy the interval relationship are marked and eight-neighbor connectivity is marked. The adjacent marked regions are merged to obtain the boundary coordinate set, and a static or quasi-static pixel coordinate region is generated. Based on the static or quasi-static pixel coordinate regions, the edge pixel positions of the dangerous boundary targets at the construction site are located in each frame. The edge pixel brightness values of the dangerous boundary targets at the construction site are extracted and spliced in order of time index and pixel coordinates to generate the ROI temporal pixel grayscale sequence.
[0027] Specifically, based on the abnormal behavior manifold index, video samples recorded under historical non-intrusion conditions are analyzed and the background manifold index of every pixel in the samples is calculated. The distribution of this index is analyzed statistically, and its 99th percentile is taken as the baseline noise level. For example, if the calculated baseline noise value is 0.12, the lower limit of the static threshold, $T_{low}$, is set to 1.2 times the baseline noise value, i.e., 0.144. In parallel, samples containing slight environmental disturbances (such as swaying leaves or changes in light and shadow) are analyzed, and the median of their manifold index is taken as the upper limit of the static threshold, $T_{high}$, for example 0.45. The abnormal behavior manifold index map of the current frame is then scanned pixel by pixel: the index value $M(x, y)$ of each pixel is compared with the threshold range from $T_{low}$ to $T_{high}$, and if $M(x, y)$ falls within this range the pixel is judged to be in a quasi-static state and marked as a foreground candidate. Eight-neighbor connectivity analysis is performed on all marked foreground candidate pixels by examining the eight pixels surrounding each candidate. If any of them is an already marked connected pixel, the two are merged under the same region number. This process is executed recursively until every group of adjacent candidate pixels has been assigned to its own connected region. The area of each connected region is calculated as its pixel count, and a minimum area threshold $A_{min}$ is set to 50 pixels (determined from the image resolution and the minimum imaging size of distant targets). Isolated noise regions with an area smaller than $A_{min}$ are removed. The outer contour coordinates of the remaining valid connected regions are extracted, the spatial ranges of these regions are merged, and the static or quasi-static pixel coordinate regions are generated.
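The screening in paragraph [0027] can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the threshold band is treated as inclusive at both ends, the recursive merging is replaced by an equivalent iterative breadth-first search, and the function name `quasi_static_regions` is hypothetical.

```python
from collections import deque


def quasi_static_regions(index_map, t_low=0.144, t_high=0.45, min_area=50):
    """Return 8-connected regions of pixels whose index lies in [t_low, t_high].

    Regions smaller than min_area pixels are discarded as isolated noise.
    Inclusive bounds are an assumption; the source only says "within the range".
    """
    rows, cols = len(index_map), len(index_map[0])
    candidate = [[t_low <= index_map[r][c] <= t_high for c in range(cols)]
                 for r in range(rows)]
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if not candidate[r][c] or seen[r][c]:
                continue
            queue, pixels = deque([(r, c)]), []
            seen[r][c] = True
            while queue:                       # breadth-first flood fill
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):          # visit all eight neighbours
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if ((dy or dx) and 0 <= ny < rows and 0 <= nx < cols
                                and candidate[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(pixels) >= min_area:
                regions.append(pixels)
    return regions


# Synthetic index map: background noise at 0.05, below the lower limit.
index_map = [[0.05] * 12 for _ in range(12)]
for r in range(1, 9):
    for c in range(1, 9):
        index_map[r][c] = 0.30      # 8x8 quasi-static block
index_map[9][9] = 0.30              # touches the block only diagonally
index_map[11][11] = 0.30            # isolated pixel, removed as noise
index_map[0][11] = 0.90             # clearly moving pixel, above the band
regions = quasi_static_regions(index_map)
print(len(regions), len(regions[0]))   # prints 1 65
```

The diagonal pixel joins the block because eight-neighbor connectivity counts corner contact, while the single isolated pixel falls below the 50-pixel minimum area.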
[0028] Based on the static or quasi-static pixel coordinate regions, the Canny edge detection operator is applied in each frame of the original video stream to calculate the gradient of the image content within these regions. A low threshold of 30 and a high threshold of 80 are set to detect locations of abrupt brightness change within the regions, thereby locating the contour edges of the target and extracting the set of pixel coordinates at these edge locations, $\{(x_i, y_i)\}_{i=1}^{N}$, where $N$ is the number of edge points detected in the current frame. For each edge point $(x_i, y_i)$, its brightness value in the current frame is read. To construct a temporal sequence, the centroid of the region or a specific edge point is used as the tracking anchor point, and the corresponding spatial position is tracked over $K$ consecutive frames. If the target is stationary, its coordinates remain unchanged and the brightness at the corresponding position is read directly. If the target has a slight displacement, the coordinate offset is corrected using optical flow before the brightness is read. The edge pixel brightness values extracted over the $K$ consecutive frames are arranged in chronological order to form a one-dimensional time series vector for each edge point. The sequences of all edge points in the region are combined to generate the ROI temporal pixel grayscale sequence.
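The sequence assembly at the end of paragraph [0028] can be illustrated with a small Python sketch. It covers only the stationary-target case: edge detection and optical-flow correction are assumed to have been done elsewhere, frames are represented as nested lists of grayscale values, and the function name `roi_gray_sequences` is hypothetical.

```python
def roi_gray_sequences(frames, edge_points):
    """Build one brightness time series per edge point, ordered by frame index.

    frames: list of 2-D grayscale frames (frames[t][y][x]).
    edge_points: list of (y, x) edge coordinates, assumed fixed across frames
    because the target is stationary.
    """
    return {(y, x): [frame[y][x] for frame in frames] for (y, x) in edge_points}


# Three tiny 3x3 frames; the centre pixel flickers slightly, the corner does not.
frames = [
    [[10, 10, 10], [10, 200, 10], [10, 10, 10]],
    [[10, 10, 10], [10, 203, 10], [10, 10, 10]],
    [[10, 10, 10], [10, 198, 10], [10, 10, 10]],
]
edge_points = [(1, 1), (0, 2)]
sequences = roi_gray_sequences(frames, edge_points)
print(sequences[(1, 1)])   # prints [200, 203, 198]
```

Keying the result by pixel coordinate keeps each series addressable for the per-pixel statistics computed in the following step.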
[0029] The steps for obtaining the pixel micro-motion fluctuation feature set are as follows: Based on the ROI temporal pixel grayscale sequence, the brightness standard deviation is calculated for the zero-mean sequence using a sliding time window of uniform duration, and the proportion of sign changes to the number of samples is used as the zero-crossing rate. The brightness standard deviation and the zero-crossing rate are aggregated into entries according to pixel coordinates to generate a pixel micro-motion fluctuation feature set.
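The two statistics in paragraph [0029] can be sketched in Python as below. The sketch makes several assumptions that the text does not settle: windows do not overlap unless a `hop` is given, the standard deviation is the population form (division by the window length), and a sign change is counted only when two consecutive zero-mean samples have strictly opposite signs. The function name `window_features` is hypothetical.

```python
import math


def window_features(seq, window, hop=None):
    """Per-window (brightness std, zero-crossing rate) for one pixel's sequence.

    The zero-crossing rate is the number of sign changes of the zero-mean
    window divided by the number of samples in the window.
    """
    hop = hop or window                      # assumption: non-overlapping windows
    features = []
    for start in range(0, len(seq) - window + 1, hop):
        chunk = seq[start:start + window]
        mean = sum(chunk) / window
        centred = [v - mean for v in chunk]  # zero-mean sequence
        std = math.sqrt(sum(v * v for v in centred) / window)
        flips = sum(1 for a, b in zip(centred, centred[1:]) if a * b < 0)
        features.append((std, flips / window))
    return features


oscillating = [10, 12] * 4                   # alternates around a mean of 11
flat = [5] * 8                               # static material: no fluctuation
print(window_features(oscillating, 8))       # prints [(1.0, 0.875)]
print(window_features(flat, 8))              # prints [(0.0, 0.0)]
```

An oscillating pixel yields both a non-zero spread and a high zero-crossing rate, whereas a constant pixel yields zero for both.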
[0030] Specifically, based on the pixel micro-motion fluctuation feature set, each pixel feature record in the data storage area is traversed. First, a data cleaning procedure checks the integrity of the standard deviation and zero-crossing rate values. Abnormal records with more than 5 consecutive frames of missing data are removed in their entirety. Sporadic missing points are filled by linear interpolation, using the average of the valid values immediately before and after the gap. The feature data of all pixels are then aligned to the unified timecode of the video stream to eliminate minor misalignments caused by processing delays. Next, amplitude standardization is performed to eliminate differences in physical dimensions. The brightness standard deviations of all pixels within the current monitoring period are examined to obtain the global maximum $\sigma_{max}$ (set to 50.0 grayscale units) and the global minimum $\sigma_{min}$ (set to 0.0), and for the raw standard deviation $\sigma$ of each pixel the normalized value is calculated as $\hat{\sigma} = (\sigma - \sigma_{min}) / (\sigma_{max} - \sigma_{min})$. Similarly, the global maximum of the zero-crossing rate, $z_{max}$ (set to 1.0), and its global minimum, $z_{min}$ (set to 0.0), are obtained, and for the raw zero-crossing rate $z$ the normalized value is calculated as $\hat{z} = (z - z_{min}) / (z_{max} - z_{min})$. This maps all feature values to the dimensionless closed interval from 0 to 1. The normalized standard deviation and normalized zero-crossing rate are then repackaged by pixel coordinate to generate the standardized feature combination.
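The cleaning and standardization steps in paragraph [0030] might look like the following Python sketch. Missing samples are represented as `None`; gaps of up to five frames are interpolated linearly between the nearest valid neighbours (for a single missing point this equals the average of the two neighbours), and gaps at the start or end of a record copy the nearest valid value. That edge handling, the clipping to [0, 1], and the function names are assumptions made for illustration.

```python
def fill_gaps(values, max_gap=5):
    """Fill None runs by linear interpolation; return None to drop the record
    when a run is longer than max_gap consecutive frames."""
    out = list(values)
    n, i = len(out), 0
    while i < n:
        if out[i] is not None:
            i += 1
            continue
        j = i
        while j < n and out[j] is None:      # find the end of the missing run
            j += 1
        if j - i > max_gap:
            return None
        left = out[i - 1] if i > 0 else None
        right = out[j] if j < n else None
        if left is None and right is None:   # record has no valid value at all
            return None
        for k in range(i, j):
            if left is None:
                out[k] = right               # leading gap: copy next valid value
            elif right is None:
                out[k] = left                # trailing gap: copy last valid value
            else:
                t = (k - i + 1) / (j - i + 1)
                out[k] = left + t * (right - left)
        i = j
    return out


def min_max(value, lo, hi):
    """Min-max normalization to [0, 1]; clipping out-of-range input is an assumption."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return min(1.0, max(0.0, (value - lo) / (hi - lo)))


print(fill_gaps([1.0, None, 3.0]))           # prints [1.0, 2.0, 3.0]
print(fill_gaps([1.0] + [None] * 6 + [2.0])) # prints None
print(min_max(12.5, 0.0, 50.0))              # prints 0.25
```

The last call reproduces the example used later in the description, where a raw standard deviation of 12.5 is normalized against a global maximum of 50.0.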
[0031] The steps for obtaining the target activity value are as follows: Based on the pixel micro-motion fluctuation feature set, the standard deviation value and the zero-crossing rate value are extracted according to the pixel coordinate index. Missing values are removed, and time synchronization and amplitude standardization are performed to generate a standardized feature combination. The target activity value is calculated from the standardized feature combination using the following formula: $A = \alpha \hat{\sigma} + \beta \hat{z} + \gamma (\hat{\sigma} \hat{z}) + b$; where $A$ is the target activity value; $\hat{\sigma}$ is the standardized brightness standard deviation, representing the dispersion amplitude of the pixel grayscale variation; $\hat{z}$ is the standardized zero-crossing rate, representing the oscillation frequency of the pixel grayscale signal; $\hat{\sigma} \hat{z}$ is the product of the standard deviation and the zero-crossing rate, describing the degree of coupling between the two; $\alpha$ is the linear weighting coefficient of the standard deviation value, determined through sample fitting; $\beta$ is the linear weighting coefficient of the zero-crossing rate value, determined through sample fitting; $\gamma$ is the interaction weighting coefficient, used to adjust the influence of the coupling term in the model; and $b$ is a bias constant used to adjust the baseline level of the calculation result.
[0032] Specifically, in the formula for calculating the target activity value, $A = \alpha \hat{\sigma} + \beta \hat{z} + \gamma (\hat{\sigma} \hat{z}) + b$, normalization eliminates the difference in physical units of the original data. The formula not only superimposes the contributions of amplitude and frequency linearly; its interaction term $\gamma (\hat{\sigma} \hat{z})$ also specifically raises the weight of "high-frequency and high-amplitude" micro-motion signals, which typically correspond to the breathing or tremors of living bodies, and thereby separates them effectively from background noise.
$\hat{\sigma}$, the standardized brightness standard deviation, represents the dispersion amplitude of the pixel grayscale variation. It is read from the aforementioned standardized feature combination and is a dimensionless coefficient ranging from 0 to 1. For example, in one actual sampling the original grayscale standard deviation of a pixel is measured as 12.5; normalizing with the preset global maximum of 50.0 gives $\hat{\sigma} = 12.5 / 50.0 = 0.25$; $\hat{z}$, the standardized zero-crossing rate, represents the oscillation frequency of the pixel grayscale signal. It is read from the aforementioned standardized feature combination and is a dimensionless coefficient ranging from 0 to 1. For example, if the sign change rate of the pixel within a unit time window is detected to be 0.4 (i.e., a sign flip occurs at 40% of the time points), normalization gives $\hat{z} = 0.4$; $\alpha$, the linear weighting coefficient of the standard deviation, is obtained as follows: a sample set containing pure background interference (such as swaying leaves) and static human targets is constructed in a laboratory, the sample labels are fitted using a logistic regression algorithm, and the contribution weight of the dispersion feature to target classification is quantified and determined through calculation; $\beta$, the linear weighting coefficient of the zero-crossing rate, is obtained with the same regression analysis. Because weak biological signs (such as the rise and fall of the chest) have specific frequency characteristics that are significant for separating them from low-frequency background noise, the coefficient is determined through fitting; $\gamma$, the interaction weighting coefficient, is obtained as follows: to capture the nonlinear enhancement effect that appears when amplitude and frequency both increase significantly at the same time (a typical manifestation of biological signs), a cross-product term is introduced into the regression model and its coefficient is solved; $b$, the bias constant, serves as the intercept term of the model. It calibrates the baseline output in a pure background environment and suppresses the accumulation of random noise when there is no target, and it is set by inverting the mean of the model output over a large number of pure background scenes.
Calculation based on these parameters: calculate the linear contribution (dimensionless) of the standard deviation component, $\alpha \hat{\sigma}$; calculate the linear contribution (dimensionless) of the zero-crossing rate component, $\beta \hat{z}$; calculate the feature interaction term (dimensionless), $\hat{\sigma} \hat{z} = 0.25 \times 0.4 = 0.10$; calculate the weighted contribution (dimensionless) of the interaction component, $\gamma (\hat{\sigma} \hat{z})$; sum all contribution values, $\alpha \hat{\sigma} + \beta \hat{z} + \gamma (\hat{\sigma} \hat{z})$; perform baseline calibration by adding the bias constant $b$, which gives $A = 0.26$.
The result indicates that the target activity value of this pixel is 0.26, a dimensionless score that combines the amplitude and frequency characteristics of the temporal fluctuation. The higher the value, the greater the possibility of microscopic life activity at this location. This result is used directly for the subsequent intrusion determination.
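The target activity calculation can be sketched in Python as a linear model with one interaction term: a weighted standard deviation, a weighted zero-crossing rate, a weighted product of the two, and a bias. The fitted coefficient values are not reproduced in this text, so the four constants below are hypothetical placeholders, chosen only so that the example inputs (0.25 and 0.4) give the stated result of 0.26. They should not be read as the fitted values.

```python
# Hypothetical placeholder coefficients (NOT the fitted values from the source).
W_STD = 0.4      # linear weight of the normalized standard deviation
W_ZCR = 0.3      # linear weight of the normalized zero-crossing rate
W_INTER = 0.5    # weight of the amplitude-frequency interaction term
BIAS = -0.01     # intercept calibrating the pure-background baseline


def target_activity(std_norm, zcr_norm,
                    w_std=W_STD, w_zcr=W_ZCR, w_inter=W_INTER, bias=BIAS):
    """Linear terms plus an interaction term plus a bias, all dimensionless."""
    interaction = std_norm * zcr_norm
    return w_std * std_norm + w_zcr * zcr_norm + w_inter * interaction + bias


std_norm = 12.5 / 50.0          # 0.25, from the example in the description
zcr_norm = 0.4                  # 40% of time points show a sign flip
activity = target_activity(std_norm, zcr_norm)
print(round(activity, 2))       # prints 0.26 with the placeholder coefficients
```

Because the interaction term is a product, it only contributes noticeably when both the amplitude and the frequency features are high, which is the behaviour the description attributes to biological micro-motion.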
[0033] The steps for obtaining intrusion assessment signals in hazardous areas of a construction site are as follows: Based on the target activity value, a non-biological background noise threshold and a biological sign lower limit threshold are set. The target activity value is then judged within a range. If the target activity value is lower than the non-biological background noise threshold, it is determined to be stationary construction material. If it is between the two thresholds or higher than the biological sign lower limit threshold, it is determined to be stationary personnel with vital signs, thus generating an intrusion assessment signal for the dangerous area of the construction site.
[0034] Specifically, based on the target activity value, a judgment criterion based on statistical distributions is constructed. A one-hour video stream of a purely static background is collected during non-working hours at the construction site, the activity values of all pixels in the field of view are calculated, and a histogram is generated. The 99th percentile of this distribution is selected as the non-biological background noise threshold $T_{noise}$. For example, statistics show that the activity values of most environmental noise (such as slight changes in light) are below 0.08, so $T_{noise}$ is set to 0.08. In parallel, personnel are organized to simulate static squatting and standing postures on site, the activity values of the corresponding pixels are collected, and the lower 5% quantile of their distribution is selected as the biological sign lower limit threshold $T_{bio}$. For example, statistics show that even faint breathing produces an activity value higher than 0.25, so $T_{bio}$ is set to 0.25. During real-time pixel-by-pixel scanning, the calculated activity value $A$ (0.26 in the calculation above) is compared with the threshold range. If $A$ is below $T_{noise}$, the pixel is considered static building material or background and is ignored. If $A$ lies between $T_{noise}$ and $T_{bio}$, the pixel is marked as a suspected interference area. If $A$ reaches or exceeds $T_{bio}$ (in this example, 0.26 meets the condition), the pixel is determined to have vital signs and is identified as a stationary person. The coordinates of all pixels identified as persons are mapped back to the site plan, generating the intrusion assessment signal for the dangerous area of the construction site.
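The three-way decision described in paragraph [0034] can be written as a short Python sketch. The thresholds 0.08 and 0.25 are the example values from the text; the handling of values exactly equal to a threshold and the label strings are assumptions made for illustration.

```python
def classify_activity(activity, noise_thr=0.08, bio_thr=0.25):
    """Map a target activity value to one of three hypothetical labels."""
    if activity < noise_thr:
        return "static_material"          # building material or background, ignored
    if activity < bio_thr:
        return "suspected_interference"   # between the two thresholds
    return "stationary_person"            # vital signs detected


def intrusion_pixels(activity_by_pixel):
    """Coordinates of all pixels classified as a stationary person."""
    return [xy for xy, a in activity_by_pixel.items()
            if classify_activity(a) == "stationary_person"]


sample = {(10, 20): 0.03, (10, 21): 0.15, (11, 20): 0.26}
print(classify_activity(0.26))    # prints stationary_person
print(intrusion_pixels(sample))   # prints [(11, 20)]
```

The returned coordinate list corresponds to the pixels that would be mapped back to the site plan to raise the intrusion assessment signal.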
[0035] The above are merely preferred embodiments of the present invention and are not intended to limit the present invention in any other way. Any person skilled in the art may make changes or modifications to the above-disclosed technical content to create equivalent embodiments that can be applied to other fields. However, any simple modifications, equivalent changes, and modifications made to the above embodiments based on the technical essence of the present invention without departing from the scope of the present invention shall still fall within the protection scope of the present invention.
Claims
1. An AI video monitoring method for detecting intrusion into hazardous areas of construction sites, characterized in that it includes the following steps: based on the video stream data covering the dangerous boundary of the construction site, continuous frame images are extracted to construct a three-dimensional pixel array; the three-dimensional pixel array is segmented along the time axis to obtain the cross-sectional image at the specified spatial location, and a spatiotemporal slice texture map is generated; the brightness partial derivatives with respect to the spatial and temporal dimensions are calculated for each pixel in the spatiotemporal slice texture map, and the gradient covariance matrix is constructed from the partial derivatives to generate the spatiotemporal structure tensor matrix; the eigenvalues and eigenvectors of the spatiotemporal structure tensor matrix are extracted, the trajectory manifold geometric feature vector is calculated and generated, the Euclidean distance between the trajectory manifold geometric feature vector and the preset standard walking parameter set is calculated, and abnormal large-amplitude movements are calculated at the same time to generate the abnormal behavior manifold index; the pixel coordinate regions where the abnormal behavior manifold index is in the static or quasi-static range are filtered, the edge pixel brightness values of the dangerous boundary targets in the construction site within the region are collected, the ROI time domain pixel grayscale sequence is generated, statistical operations are performed on the ROI time domain pixel grayscale sequence, the standard deviation value representing the brightness dispersion and the zero-crossing rate value representing the signal oscillation frequency are calculated respectively, and the pixel micro-motion fluctuation feature set is generated; based on the pixel micro-motion fluctuation feature set, a target activity value is calculated and generated; the target activity value is then compared with a preset non-biological background noise threshold and a biological sign lower limit threshold to generate an intrusion assessment signal for the dangerous area of the construction site.
2. The AI video monitoring method for intrusion into hazardous areas of construction sites according to claim 1, characterized in that, The steps for obtaining the spatiotemporal structure tensor matrix are as follows: Based on the video stream data covering the dangerous boundary of the construction site, extract the time-continuous frame images, stack them into a three-dimensional pixel array according to pixel coordinates, divide the three-dimensional pixel array into equal lengths along the time axis, locate the cross-sectional image at the specified spatial location and perform pixel interpolation correction to generate a spatiotemporal slice texture map. Based on the spatiotemporal slice texture map, the brightness difference is calculated for each pixel according to the horizontal pixel index, vertical pixel index and time index. The mean and variance of the brightness difference are summarized according to the pixel position and component statistics are generated. The component covariance is calculated based on the component statistics and the gradient covariance matrix is constructed to obtain the gradient covariance matrix. Based on the gradient covariance matrix, eigenvalues and eigenvectors are calculated pixel by pixel and the main direction index is recorded. The local texture direction and intensity are quantized according to the main direction index and mapped to structural components. The structural component matrix is reorganized according to the pixel position while maintaining the consistency of the temporal index to generate a spatiotemporal structural tensor matrix.
3. The AI video monitoring method for intrusion into hazardous areas of construction sites according to claim 1, characterized in that the steps for obtaining the trajectory manifold geometric feature vector are as follows: Based on the spatiotemporal structure tensor matrix, the principal eigenvalues and corresponding eigenvectors of each pixel are extracted. The eigenvectors are sorted in descending order of eigenvalue. The correspondence between the eigenvalues at the same spatial position is recorded. The ratios of the largest, second largest, and smallest eigenvalues are structured into a set of directional parameters to generate the trajectory manifold geometric feature vector.
4. The AI video monitoring method for intrusion into hazardous areas of construction sites according to claim 1, characterized in that, The steps for obtaining the abnormal behavior manifold index are as follows: Based on the geometric feature vector of the trajectory manifold, each element is paired and compared with the standard walking parameter set. The squared difference of each corresponding component is calculated and the root mean square is taken to obtain three types of action intensity representing dynamic differences, which are defined as running action intensity, falling action intensity and loitering action intensity, respectively, to obtain action intensity triplet. The abnormal behavior manifold index is calculated based on the action intensity triplet.
5. The AI video monitoring method for intrusion into hazardous areas of construction sites according to claim 1, characterized in that, The steps for obtaining the temporal pixel grayscale sequence of the ROI are as follows: Based on the abnormal behavior manifold index, a lower limit and an upper limit of the static threshold are set. The interval relationship between the abnormal behavior manifold index and the lower and upper limits of the static threshold is compared pixel by pixel. The pixel coordinates that satisfy the interval relationship are marked and eight-neighbor connectivity is marked. Adjacent marked regions are merged to obtain a set of boundary coordinates, and a static or quasi-static pixel coordinate region is generated. Based on the static or quasi-static pixel coordinate region, the edge pixel position of the dangerous boundary target at the construction site is located in each frame. The edge pixel brightness value of the dangerous boundary target at the construction site is extracted and spliced in the order of time index and pixel coordinate to generate the ROI time domain pixel grayscale sequence.
6. The AI video monitoring method for intrusion into hazardous areas of construction sites according to claim 1, characterized in that, The steps for obtaining the pixel micro-motion fluctuation feature set are as follows: Based on the ROI temporal pixel grayscale sequence, the brightness standard deviation is calculated for the zero-mean sequence using a sliding time window of uniform duration, and the proportion of sign changes to the number of samples is used as the zero-crossing rate. The brightness standard deviation and the zero-crossing rate are aggregated into entries according to pixel coordinates to generate a pixel micro-motion fluctuation feature set.
7. The AI video monitoring method for intrusion into hazardous areas of construction sites according to claim 1, characterized in that, The steps for obtaining the target activity value are as follows: Based on the pixel micro-motion fluctuation feature set, the standard deviation value and zero-crossing rate value are extracted according to the pixel coordinate index, and missing value removal, time synchronization and amplitude standardization are performed to generate a standardized feature combination. The target activity value is calculated based on the standardized feature combination.
8. The AI video monitoring method for intrusion into hazardous areas of construction sites according to claim 1, characterized in that, The steps for obtaining the intrusion assessment signal in the hazardous area of the construction site are as follows: Based on the target activity value, a non-biological background noise threshold and a biological sign lower limit threshold are set. The target activity value is then judged within a range. If the target activity value is lower than the non-biological background noise threshold, it is determined to be stationary construction material. If it is between the two thresholds or higher than the biological sign lower limit threshold, it is determined to be stationary personnel with vital signs, thus generating an intrusion assessment signal for the dangerous area of the construction site.