Deep learning-based real-time video abnormal behavior detection method and system
By employing a lightweight spatiotemporal joint feature extraction and dynamic threshold update mechanism, the latency and adaptability issues of video abnormal behavior detection on edge devices are resolved, achieving low-latency and reliable video abnormal behavior detection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- DOMAIN INFORMATION TECHNOLOGY (XIAN) INFORMATION TECHNOLOGY CO LTD
- Filing Date
- 2026-05-15
- Publication Date
- 2026-06-19
AI Technical Summary
Existing deep learning-based video anomaly detection methods struggle to achieve low-latency processing on edge devices with limited computing resources, and they are poorly adaptable to changes in lighting, weather, and background dynamics, making it difficult to balance false positive and false negative rates and limiting their generalization capabilities.
Employing a lightweight spatiotemporal joint feature extraction, real-time motion activity assessment, and dynamic threshold update mechanism, this method generates compact feature maps by calculating full-frame difference statistics for video frames, performs reconstruction error analysis and temporal consistency verification, and generates a structured detection report.
Real-time detection with millisecond-level latency is achieved on edge computing devices, adapting to dynamic changes in scenarios, reducing computational redundancy, improving the reliability and robustness of detection results, and reducing the number of false alarms.
Figure CN122244956A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of video recognition technology, and in particular to a real-time detection method and system for abnormal video behavior based on deep learning.
Background Technology
[0002] Deep learning-based video anomaly detection is a core technology in the field of intelligent video surveillance, widely used in scenarios such as public safety, traffic management, and industrial inspection. Video surveillance systems acquire raw video streams through cameras deployed in various locations. Backend detection algorithms analyze the behavioral patterns of moving targets such as pedestrians and vehicles in the video stream, identifying behavioral events that deviate from normal patterns. Current technologies lack mechanisms for utilizing redundant information between video frames, making it difficult to achieve low-latency processing on edge devices with limited computing resources, resulting in significant detection latency and high equipment deployment costs.
[0003] Existing deep learning-based video anomaly detection methods suffer from significant limitations in adaptability to dynamic scenes. Most methods employ static thresholds for binary judgment of anomaly scores, pre-setting a fixed threshold that is used across all frames to distinguish between normal and abnormal behavior. This approach struggles to handle time-varying factors in monitored scenes, such as changes in lighting, weather, and background dynamics, leading to drastic fluctuations in detection accuracy and an imbalance between false positives and false negatives. Furthermore, existing methods lack sufficient spatiotemporal consistency modeling of abnormal behavior, often relying solely on single-frame scores for judgment. They lack a mechanism to verify the trajectory continuity of candidate anomaly regions across multiple frames, making it difficult to distinguish between genuine abnormal behavior and transient noise interference, resulting in numerous isolated false alarms in the detection results. Additionally, most methods use closed-set anomaly definitions, which are ill-suited to the dynamic changes in anomaly categories across different scenarios, limiting their generalization capabilities. Therefore, there is an urgent need to develop a real-time video anomaly detection method that can adapt to dynamic scene changes, reduce computational redundancy, and improve judgment reliability. This would address the problems of high computational cost, poor threshold adaptability, and numerous false alarms in existing technologies, thereby improving the real-time performance, robustness, and scene generalization capabilities of the detection system.
Summary of the Invention
[0004] This invention provides a real-time video abnormal behavior detection method and system based on deep learning to solve the problems mentioned in the background art.
[0005] To achieve the above objectives, the present invention provides a real-time video abnormal behavior detection method based on deep learning, comprising:
I1: Perform lightweight spatiotemporal joint feature extraction on the sequence of video frames to be processed, obtained from the original video stream, to obtain a compact feature map of the video;
I2: Perform reconstruction error analysis on the compact feature map to obtain a preliminary anomaly scoring map;
I3: Update the dynamic judgment threshold of the scene in the current video in real time, determine the areas in the preliminary anomaly scoring map that exceed the dynamic judgment threshold as candidate anomaly areas, mark the candidate anomaly areas that pass the temporal consistency verification in the final anomaly behavior marking map, and generate a structured detection report.
[0006] In a preferred embodiment, the video frame sequence to be processed includes: Real-time motion activity assessment is performed on consecutive raw frames after decoding the original video stream, and processing decision markers and local processing region coordinates are generated for each frame based on the assessment results. When the processing decision flag indicates to skip the current frame, the current frame is excluded, and the detection result of the previous frame is directly reused; When the processing decision flag indicates local processing, the corresponding local image block is cropped from the current frame according to the coordinates of the local processing area, and the local image block is used as the content to be processed in the current frame; When the processing decision flag indicates full-frame processing, the full-frame image of the current frame is used as the content to be processed in the current frame; The content to be processed in the same spatial location is arranged in chronological order and combined into a sequence of video frames to be processed.
[0007] In a preferred embodiment, the real-time motion activity assessment of consecutive raw frames after decoding the original video stream includes: Divide the current frame and the previous frame into multiple non-overlapping pixel blocks, calculate the sum of the luminance components of all pixels in each pixel block, and calculate the full-frame difference statistics of the current frame. The formula for calculating the full-frame difference statistics is as follows:
$$D_t = \sum_{i=1}^{M} \sum_{j=1}^{N} \left| S_t(i,j) - S_{t-1}(i,j) \right|$$
where $D_t$ is the full-frame difference statistic of the $t$-th frame; $M$ is the number of rows of pixel blocks; $N$ is the number of columns of pixel blocks; $S_t(i,j)$ is the sum of the luminance components of the pixel block in row $i$ and column $j$ of the $t$-th frame; and $S_{t-1}(i,j)$ is the sum of the luminance components of the corresponding pixel block in the $(t-1)$-th frame; When the full-frame difference statistics are within a preset activity range, the difference contribution of each pixel block in the current frame is evaluated. When pixel blocks in the evaluation result exceed the block-level threshold, the coordinates of the smallest bounding rectangle enclosing those pixel blocks are used as the coordinates of the local processing area, and a processing decision flag indicating local processing is generated. When the full-frame difference statistics are less than the low-activity threshold in the activity range, a processing decision flag indicating skipping the current frame is generated, and the coordinates of the local processing area are cleared. When the full-frame difference statistics are greater than or equal to the high-activity threshold in the activity range, a processing decision flag indicating full-frame processing is generated, and the coordinates of the local processing area are set to a coordinate range covering the entire frame.
[0008] In a preferred embodiment, the step of performing lightweight spatiotemporal joint feature extraction on the sequence of video frames to be processed from the original video stream in the video to obtain a compact feature map of the video includes: Separable convolution processing is performed on each frame in the video frame sequence to be processed, resulting in multiple sets of spatial feature maps corresponding to each frame. The spatial feature maps corresponding to two adjacent frames in the video frame sequence to be processed are differentially calculated according to pixel positions to obtain the inter-frame difference excitation map. The spatial feature map of the current frame and the inter-frame difference excitation map are summed pixel by pixel to obtain the fused feature map of the current frame. The fused feature map is subjected to channel compression transformation to generate an intermediate feature map of the video; The intermediate feature map is spatially downsampled to generate a compact feature map of the video.
[0009] In a preferred embodiment, the step of performing reconstruction error analysis on the compact feature map to obtain a preliminary anomaly scoring map includes: The similarity value of the pixel position is obtained by performing distance metric analysis between the feature vector corresponding to the pixel position in the compact feature map and the prototype feature vectors in the pre-built normal behavior feature memory library. The prototype feature vector with the highest similarity value is selected as the best matching prototype for the pixel position; Confirm the reconstruction error between the feature vector at each pixel location and the corresponding best-matching prototype; The reconstruction errors are arranged according to the original spatial coordinates of each pixel in the compact feature map to obtain an initial scoring map; The initial scoring map is filtered by local region maximum values, and the maximum error value in each local region is used as the representative score of the local region to obtain a preliminary anomaly scoring map.
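The reconstruction error analysis described above can be illustrated with a minimal sketch. It is not the claimed implementation: the Euclidean distance as the distance metric, the 3x3 sliding window for the local maximum filter, and the function names `nearest_error`, `initial_score_map`, and `local_max_filter` are all assumptions made for illustration. The highest similarity is taken to mean the smallest distance, and that distance is used as the reconstruction error.

```python
import math


def nearest_error(vec, prototypes):
    """Distance from vec to its best-matching (closest) prototype."""
    return min(math.dist(vec, p) for p in prototypes)


def initial_score_map(feature_map, prototypes):
    """Arrange per-location errors in the same layout as the feature map."""
    return [[nearest_error(v, prototypes) for v in row] for row in feature_map]


def local_max_filter(score_map, radius=1):
    """Replace each score with the maximum inside its local window."""
    h, w = len(score_map), len(score_map[0])
    return [
        [
            max(
                score_map[yy][xx]
                for yy in range(max(0, y - radius), min(h, y + radius + 1))
                for xx in range(max(0, x - radius), min(w, x + radius + 1))
            )
            for x in range(w)
        ]
        for y in range(h)
    ]


# Hypothetical memory with two prototypes and a 1 x 4 compact feature map.
prototypes = [(0.0, 0.0), (10.0, 0.0)]
feature_map = [[(0.0, 0.0), (3.0, 4.0), (10.0, 0.0), (10.0, 0.0)]]
initial = initial_score_map(feature_map, prototypes)
print(initial)                    # [[0.0, 5.0, 0.0, 0.0]]
print(local_max_filter(initial))  # [[5.0, 5.0, 5.0, 0.0]]
```

Only the second location deviates from every prototype, and the maximum filter spreads its score to the neighbouring locations inside the window.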
[0010] In a preferred embodiment, the pre-built normal behavior feature memory library includes: Obtain a training video sample set containing only normal behavior, and perform lightweight spatiotemporal joint feature extraction frame by frame for each training sample in the training video sample set to obtain a training compact feature map corresponding to each training sample. The feature vectors at each pixel location in all the training compact feature maps are aggregated into an initial feature vector set; The feature vectors are divided into multiple clusters according to their similarity, and the center vector of all feature vectors in each cluster is taken as the corresponding prototype feature vector. The prototype feature vector is associated with and stored together with the distribution radius of all feature vectors in the cluster, forming a key-value pair; All key-value pairs are integrated and stored in a vector database to generate the normal behavior feature memory library.
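A minimal sketch of how such a memory might be built is given below. The text does not name a clustering algorithm, so plain k-means (Lloyd's iterations) with a naive initialization is swapped in here as an assumption. The distribution radius is assumed to be the largest distance from the cluster center to any member, and an ordinary Python list of `(prototype, radius)` pairs stands in for the vector database. The function name `build_memory` is hypothetical.

```python
import math


def build_memory(vectors, k, iterations=10):
    """Cluster vectors and return a list of (prototype, radius) pairs."""
    centers = [list(v) for v in vectors[:k]]  # naive init: first k vectors
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            # Assign each vector to its closest current center.
            idx = min(range(k), key=lambda i: math.dist(v, centers[i]))
            clusters[idx].append(v)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster is empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    memory = []
    for center, members in zip(centers, clusters):
        # Assumed radius: distance to the farthest member of the cluster.
        radius = max((math.dist(center, m) for m in members), default=0.0)
        memory.append((tuple(center), radius))
    return memory


# Four hypothetical 2-D feature vectors forming two well-separated groups.
vectors = [(0, 0), (10, 10), (0, 2), (10, 12)]
print(build_memory(vectors, k=2))
# [((0.0, 1.0), 1.0), ((10.0, 11.0), 1.0)]
```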
[0011] In a preferred embodiment, the real-time updating of the dynamic determination threshold for the scene in the current video includes: When a new preliminary anomaly rating map is generated, the mean and standard deviation of the ratings of all pixels in the new preliminary anomaly rating map are calculated and stored as a set of statistics in the sliding time window. The average of the mean scores and the average of the standard deviations of the scores within the sliding time window are combined to form the scene stability baseline value; The dynamic judgment threshold for the current frame is determined based on the scene stability baseline value and the preset sensitivity adjustment coefficient.
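The sliding-window update can be sketched as follows. The text states only that the threshold is determined from the scene stability baseline and a preset sensitivity adjustment coefficient; the specific combination used here, windowed mean plus the coefficient times the windowed standard deviation, is an assumption, as are the class name `DynamicThreshold`, the window length, and the coefficient value.

```python
from collections import deque
from statistics import fmean, pstdev


class DynamicThreshold:
    """Keeps per-frame (mean, std) statistics in a sliding time window."""

    def __init__(self, window=30, sensitivity=3.0):
        self.stats = deque(maxlen=window)  # oldest entries drop out automatically
        self.sensitivity = sensitivity

    def update(self, score_map):
        """Store statistics of a new score map and return the new threshold."""
        scores = [s for row in score_map for s in row]
        self.stats.append((fmean(scores), pstdev(scores)))
        mean_avg = fmean([m for m, _ in self.stats])
        std_avg = fmean([s for _, s in self.stats])
        # Assumed rule: baseline mean plus a multiple of the baseline spread.
        return mean_avg + self.sensitivity * std_avg


tracker = DynamicThreshold(window=2, sensitivity=2.0)
print(tracker.update([[1, 1], [3, 3]]))  # mean 2, std 1      -> 4.0
print(tracker.update([[2, 2], [2, 2]]))  # window mean 2, std 0.5 -> 3.0
```

Because the baseline is averaged over the window, a single unusually noisy frame shifts the threshold only gradually.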
[0012] In a preferred embodiment, determining regions in the preliminary anomaly scoring map that exceed the dynamic determination threshold as candidate anomaly regions includes: The pixel score in the preliminary anomaly score map is compared with the dynamic judgment threshold. When the pixel score is greater than the dynamic judgment threshold, the pixel is marked as an abnormal pixel; otherwise, it is marked as a normal pixel, and a binarized abnormal pixel mask map is generated. Connectivity analysis is performed on the abnormal pixel mask image, and isolated noise points are removed from the connected regions to obtain candidate abnormal regions.
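A minimal sketch of the binarization and connectivity analysis is given below. The 4-neighbour connectivity, the minimum region size used to discard isolated noise points, and the function name `candidate_regions` are assumptions chosen for illustration.

```python
from collections import deque


def candidate_regions(score_map, threshold, min_pixels=2):
    """Return connected regions of above-threshold pixels as (row, col) lists."""
    h, w = len(score_map), len(score_map[0])
    # Binarized abnormal pixel mask: True where the score exceeds the threshold.
    mask = [[score_map[y][x] > threshold for x in range(w)] for y in range(h)]
    seen = [[False] * w for _ in range(h)]
    regions = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue, pixels = deque([(y, x)]), []
            while queue:  # breadth-first flood fill over 4-neighbours
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(pixels) >= min_pixels:  # drop isolated noise points
                regions.append(pixels)
    return regions


scores = [
    [0, 9, 9, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 9],
]
print(candidate_regions(scores, threshold=5))  # [[(0, 1), (0, 2)]]
```

The single abnormal pixel in the bottom-right corner forms a region of size one and is removed, while the two adjacent pixels in the first row survive as one candidate region.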
[0013] In a preferred embodiment, marking the candidate abnormal regions that have passed the time consistency verification as the final abnormal behavior labeling map and generating a structured detection report includes: The candidate anomaly region in the current frame is associated with the candidate anomaly region at the corresponding spatial location in the previous consecutive frames to establish a temporal trajectory linked list. The continuity of the time-series trajectory linked list is checked to obtain the positional overlap of candidate abnormal regions between adjacent frames, and the candidate abnormal regions are determined to pass the time consistency verification based on the magnitude of the positional overlap. The spatial masks in the verified candidate anomaly regions are merged according to pixel position to obtain the final anomaly behavior label map; Based on the spatiotemporal trajectory of the abnormal region in the final abnormal behavior marking map, the corresponding abnormal behavior category name is matched to obtain a structured text description; The final abnormal behavior marker map and the structured text description are encapsulated into a structured detection report.
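The continuity check on a temporal trajectory can be sketched as follows. The text speaks only of positional overlap between adjacent frames; measuring it as intersection over union of bounding boxes, the threshold values, and the use of a Python list in place of the linked list are assumptions, and the function names `iou` and `passes_temporal_check` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (top, left, bottom, right) boxes."""
    top, left = max(a[0], b[0]), max(a[1], b[1])
    bottom, right = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, bottom - top) * max(0, right - left)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def passes_temporal_check(track, min_overlap=0.5, min_length=3):
    """A track passes if it is long enough and every adjacent pair overlaps."""
    if len(track) < min_length:
        return False
    return all(iou(p, c) >= min_overlap for p, c in zip(track, track[1:]))


steady = [(0, 0, 10, 10), (0, 2, 10, 12), (0, 4, 10, 14)]   # slow drift
flicker = [(0, 0, 10, 10)]                                   # seen in one frame
jump = [(0, 0, 10, 10), (0, 2, 10, 12), (50, 50, 60, 60)]    # breaks continuity
print(passes_temporal_check(steady))   # True
print(passes_temporal_check(flicker))  # False
print(passes_temporal_check(jump))     # False
```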
[0014] To address the aforementioned problems, this invention also provides a real-time video abnormal behavior detection system based on deep learning, the system comprising: The joint feature extraction module is used to perform lightweight spatiotemporal joint feature extraction on the sequence of video frames to be processed from the original video stream in the video, so as to obtain a compact feature map of the video. The reconstruction error module is used to perform reconstruction error analysis on the compact feature map to obtain a preliminary anomaly score map. The threshold update and temporal verification module is used to update the dynamic judgment threshold of the scene in the current video in real time, determine the region in the preliminary anomaly scoring map that exceeds the dynamic judgment threshold as a candidate anomaly region, and mark the candidate anomaly region that passes the temporal consistency verification as the final anomaly behavior marking map, and generate a structured detection report.
[0015] Compared with the prior art, the present invention has the following beneficial effects: 1. This invention introduces a real-time motion activity assessment mechanism to calculate full-frame difference statistics for consecutive decoded original frames, and adaptively generates decision tags for skipping, local processing, or full-frame processing based on motion activity. When the scene is static or has weak motion, the current frame is skipped and the detection result of the previous frame is reused, avoiding full-frame computational redundancy. When motion is concentrated in a local area, only that local image block is cropped as the content to be processed, significantly reducing the computational scale of feature extraction. Simultaneously, a lightweight spatiotemporal joint feature extraction architecture is adopted. Through a series of operations such as separable convolution processing, inter-frame difference excitation map generation, pixel-wise weighted fusion, and channel compression transformation, a compact feature map with a dimension far lower than the original frame stacking is generated without relying on optical flow calculation and background modeling. The synergistic effect of these mechanisms enables this invention to complete end-to-end detection on edge computing devices with millisecond-level latency, effectively solving the problems of high computational overhead and poor real-time performance caused by centralized full-frame processing in existing technologies.
[0016] 2. This invention constructs a normal behavior feature memory library, performs pixel-by-pixel or region-by-region reconstruction error analysis on compact feature maps, and calculates the degree of deviation using pre-stored normal behavior prototype feature vectors, achieving unsupervised anomaly localization without relying on scarce anomaly samples. Building upon this, a dynamic threshold update mechanism based on a sliding time window is introduced. The mean and standard deviation of historical anomaly scoring maps are statistically analyzed in real time, adaptively calculating the judgment threshold for the current frame. This allows the threshold to automatically adjust with scene factors such as changes in lighting, weather, and dynamic background fluctuations, overcoming the shortcomings of existing technologies where fixed thresholds have poor adaptability, leading to an imbalance between false positive and false negative rates. Furthermore, by performing temporal consistency verification on candidate anomaly regions, a temporal trajectory linked list is used to record the positional changes of candidate regions in consecutive frames. The overlap between adjacent frames is calculated, and isolated noise points are eliminated, effectively distinguishing between real abnormal behavior and transient interference, significantly reducing the number of false alarms. These mechanisms collectively improve the reliability of the detection results, enabling this invention to adapt to dynamic changes in different monitoring scenarios and to possess strong generalization ability and robustness.
Attached Figure Description
[0017] Figure 1 is a flowchart of a real-time video abnormal behavior detection method based on deep learning, provided in an embodiment of the present invention. Figure 2 is a functional block diagram of a deep learning-based real-time video abnormal behavior detection system provided in an embodiment of the present invention. The realization of the objectives, functional features, and advantages of the present invention will be further explained in conjunction with the embodiments and with reference to the accompanying drawings.
Detailed Implementation
[0018] It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
[0019] This application provides a real-time video abnormal behavior detection method based on deep learning. The execution entity of the deep learning-based real-time video abnormal behavior detection method includes, but is not limited to, at least one of the following electronic devices that can be configured to execute the method provided in this application: a server, a terminal, etc. In other words, the deep learning-based real-time video abnormal behavior detection method can be executed by software or hardware installed on a terminal device or a server device. The server includes, but is not limited to, a single server, a server cluster, a cloud server, or a cloud server cluster. The server can be an independent server or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDNs), and big data and artificial intelligence platforms.
[0020] Referring to Figure 1, a flowchart of a real-time video abnormal behavior detection method based on deep learning according to an embodiment of the present invention is shown. In this embodiment, the real-time video abnormal behavior detection method based on deep learning includes: I1: Perform lightweight spatiotemporal joint feature extraction on the sequence of video frames to be processed, obtained from the original video stream, to obtain a compact feature map of the video; In this embodiment of the invention, the video frame sequence to be processed includes: Real-time motion activity assessment is performed on consecutive raw frames after decoding the original video stream, and a processing decision flag and local processing region coordinates are generated for each frame based on the assessment results. When the processing decision flag indicates to skip the current frame, the current frame is excluded, and the detection result of the previous frame is directly reused; When the processing decision flag indicates local processing, the corresponding local image block is cropped from the current frame according to the coordinates of the local processing area, and the local image block is used as the content to be processed in the current frame; When the processing decision flag indicates full-frame processing, the full-frame image of the current frame is used as the content to be processed in the current frame; The content to be processed at the same spatial location is arranged in chronological order and combined into a sequence of video frames to be processed.
[0021] The real-time motion activity assessment of consecutive raw frames after decoding the original video stream includes: Divide the current frame and the previous frame into multiple non-overlapping pixel blocks, calculate the sum of the luminance components of all pixels in each pixel block, and calculate the full-frame difference statistics of the current frame. The formula for calculating the full-frame difference statistics is as follows:
$$D_t = \sum_{i=1}^{M} \sum_{j=1}^{N} \left| S_t(i,j) - S_{t-1}(i,j) \right|$$
where $D_t$ is the full-frame difference statistic of the $t$-th frame; $M$ is the number of rows of pixel blocks; $N$ is the number of columns of pixel blocks; $S_t(i,j)$ is the sum of the luminance components of the pixel block in row $i$ and column $j$ of the $t$-th frame; and $S_{t-1}(i,j)$ is the sum of the luminance components of the corresponding pixel block in the $(t-1)$-th frame; When the full-frame difference statistics are within a preset activity range, the difference contribution of each pixel block in the current frame is evaluated. When pixel blocks in the evaluation result exceed the block-level threshold, the coordinates of the smallest bounding rectangle enclosing those pixel blocks are used as the coordinates of the local processing area, and a processing decision flag indicating local processing is generated. When the full-frame difference statistics are less than the low-activity threshold in the activity range, a processing decision flag indicating skipping the current frame is generated, and the coordinates of the local processing area are cleared. When the full-frame difference statistics are greater than or equal to the high-activity threshold in the activity range, a processing decision flag indicating full-frame processing is generated, and the coordinates of the local processing area are set to a coordinate range covering the entire frame.
[0022] The step of performing lightweight spatiotemporal joint feature extraction on the sequence of video frames to be processed from the original video stream in the video to obtain a compact feature map of the video includes: Separable convolution processing is performed on each frame in the video frame sequence to be processed, resulting in multiple sets of spatial feature maps corresponding to each frame. The spatial feature maps corresponding to two adjacent frames in the video frame sequence to be processed are differentially calculated according to pixel positions to obtain the inter-frame difference excitation map. The spatial feature map of the current frame and the inter-frame difference excitation map are summed pixel by pixel to obtain the fused feature map of the current frame. The fused feature map is subjected to channel compression transformation to generate an intermediate feature map of the video; The intermediate feature map is spatially downsampled to generate a compact feature map of the video.
[0023] The raw video stream is decoded and output as consecutive raw frames. The current frame and the previous frame are each divided into multiple non-overlapping pixel blocks of a fixed size, each block containing multiple rows and columns of pixels. The luminance component values of all pixels within each pixel block are counted and summed to obtain the sum of the luminance components of each pixel block in the current frame, and the sum of the luminance components of the corresponding pixel block in the previous frame. The sum of the luminance components of each pixel block in the current frame is subtracted from the sum of the luminance components of the corresponding pixel block in the previous frame, and the absolute value of the difference is taken. The absolute values of all pixel blocks are then summed to obtain the overall frame difference statistic for the current frame.
[0024] The parameters in the formula for calculating the full-frame difference statistics are defined as follows: $t$ indicates the sequence number of the current frame. $t-1$ indicates the sequence number of the frame preceding the current frame. $D_t$ indicates the full-frame difference statistic of the $t$-th frame, a numerical value used to measure the degree of motion change of the current frame relative to the previous frame. $M$ indicates the number of rows of pixel blocks obtained when dividing an image frame into pixel blocks along the vertical direction. $N$ indicates the number of columns of pixel blocks obtained when dividing an image frame into pixel blocks along the horizontal direction. $i$ is the row index variable, with values ranging from the first row to the $M$-th row, and is used to iterate through each row of pixel blocks. $j$ is the column index variable, with values ranging from the first column to the $N$-th column, and is used to iterate through each column of pixel blocks. $S_t(i,j)$ indicates the sum of the luminance components of the pixel block located in row $i$ and column $j$ of the $t$-th frame, that is, the sum of the luminance component values of all pixels within that pixel block. $S_{t-1}(i,j)$ indicates the sum of the luminance components of the pixel block located at the same row index $i$ and the same column index $j$ in the $(t-1)$-th frame.
[0025] First, the current frame and the previous frame are each divided, according to the same rule, into $M$ rows by $N$ columns of non-overlapping pixel blocks. For each pixel block, the system reads the luminance components of all pixels within that block and sums these luminance component values one by one to obtain the sum of the luminance components of that pixel block. For the pixel block in row $i$ and column $j$, the system takes the sum of the luminance components of that block in the current frame and the sum of the luminance components of the corresponding block in the previous frame. The system calculates the absolute difference between these two values by subtracting the smaller value from the larger value, ensuring the difference is non-negative. This absolute difference is then accumulated into a summation variable. The system iterates over all row indices $i$ from the first row to the $M$-th row and all column indices $j$ from the first column to the $N$-th column, repeating the differencing and accumulation for each pixel block. Once the absolute differences of all pixel blocks have been accumulated, the final sum is the full-frame difference statistic. A larger full-frame difference statistic indicates a more drastic change in the overall brightness distribution between the current and previous frames, meaning that moving objects in the scene cover a larger area or move with greater amplitude. Conversely, a smaller full-frame difference statistic indicates almost no change between the two frames, with the scene nearly static. This statistic provides a quantitative basis for subsequently determining whether to skip the current frame or only process local moving areas.
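The accumulation described above can be sketched in a few lines. This is only an illustration under stated assumptions: frames are represented as 2-D lists of luminance values whose dimensions are exact multiples of the block size, and the function names `block_luma_sums` and `full_frame_difference` are hypothetical.

```python
def block_luma_sums(frame, block):
    """Sum luminance over non-overlapping block x block tiles of a 2-D list."""
    rows, cols = len(frame) // block, len(frame[0]) // block
    return [
        [
            sum(
                frame[r * block + y][c * block + x]
                for y in range(block)
                for x in range(block)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def full_frame_difference(curr, prev, block):
    """Sum of absolute block-sum differences between two luminance frames."""
    s_curr = block_luma_sums(curr, block)
    s_prev = block_luma_sums(prev, block)
    return sum(
        abs(a - b)
        for row_c, row_p in zip(s_curr, s_prev)
        for a, b in zip(row_c, row_p)
    )


prev = [[10] * 4 for _ in range(4)]
curr = [row[:] for row in prev]
curr[0][0] = 30  # one pixel brightens by 20 inside the top-left block
print(full_frame_difference(curr, prev, block=2))  # prints 20
print(full_frame_difference(prev, prev, block=2))  # prints 0 (static scene)
```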
[0026] The system pre-determines low-activity and high-activity thresholds through a period of scene learning. During initial deployment, multiple consecutive frames of the monitored scene are collected when no abnormal events occur. The full-frame difference statistics for each frame are calculated, and the distribution range of these statistics is analyzed. The lower percentile value in the distribution is used as the low-activity threshold, and the higher percentile value as the high-activity threshold. This ensures that the statistics for normal static scenes are below the low-activity threshold, the statistics for normal active scenes are between the two thresholds, and the statistics for violently moving scenes are above the high-activity threshold. When the full-frame difference statistics are less than the low-activity threshold, a processing decision flag is generated to skip the current frame, and the coordinates of the local processing area are cleared. When the full-frame difference statistics are greater than or equal to the low-activity threshold but less than the high-activity threshold, the local processing decision process begins. When the full-frame difference statistics are greater than or equal to the high-activity threshold, a processing decision flag for full-frame processing is generated, and the coordinates of the local processing area are set to cover the entire frame's coordinate range.
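The calibration and the three-way decision can be sketched as follows. The text refers only to a lower and a higher percentile, so the 20th and 95th percentiles, the nearest-rank percentile rule, and the function names `percentile`, `calibrate_thresholds`, and `decide` are assumptions chosen for illustration.

```python
def percentile(values, q):
    """Nearest-rank percentile of a non-empty list; q is an integer in 1..100."""
    ordered = sorted(values)
    rank = max(1, -((-q * len(ordered)) // 100))  # ceiling of q * n / 100
    return ordered[rank - 1]


def calibrate_thresholds(normal_stats, low_q=20, high_q=95):
    """Derive (low, high) activity thresholds from statistics of normal footage."""
    return percentile(normal_stats, low_q), percentile(normal_stats, high_q)


def decide(stat, low, high):
    """Map a full-frame difference statistic to a processing decision flag."""
    if stat < low:
        return "skip"   # reuse the previous frame's detection result
    if stat < high:
        return "local"  # enter the local processing decision process
    return "full"       # process the entire frame


# Hypothetical statistics collected during scene learning.
normal_stats = [0, 1, 2, 3, 4, 5, 6, 7, 8, 100]
low, high = calibrate_thresholds(normal_stats)
print(low, high)                                       # 1 100
print([decide(s, low, high) for s in (0, 1, 50, 100)])
# ['skip', 'local', 'local', 'full']
```

The comparisons follow the boundaries stated above: strictly below the low-activity threshold skips the frame, and a value equal to the high-activity threshold already triggers full-frame processing.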
[0027] For cases within the local processing decision process, the system calculates the difference contribution value for each pixel block in the current frame, which is the absolute value of the difference between the sum of the luminance components of that pixel block and the sum of the luminance components of the corresponding pixel block in the previous frame. This difference contribution value is compared with a preset block-level threshold, which is determined based on the statistical fluctuation range of pixel block differences in normal scenes, typically taken as several times the average value of pixel block differences in normal scenes. Pixel blocks with difference contribution values exceeding the block-level threshold are marked as difference contribution blocks. All difference contribution blocks are collected, and the smallest bounding rectangle that can enclose these blocks is calculated. The four boundary coordinates of this rectangle are used as the coordinates of the local processing region. Simultaneously, a processing decision label indicating local processing is generated.
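A minimal sketch of the local region computation is shown below. It assumes the per-block luminance sums of both frames are already available as 2-D lists, returns pixel coordinates as `(top, left, bottom, right)` with exclusive bottom and right edges, and uses the hypothetical function name `local_region`. The threshold value in the example is arbitrary.

```python
def local_region(s_curr, s_prev, block_threshold, block):
    """Bounding rectangle of blocks whose difference exceeds the threshold.

    Returns (top, left, bottom, right) in pixels, or None if no block qualifies.
    """
    hits = [
        (r, c)
        for r, (row_c, row_p) in enumerate(zip(s_curr, s_prev))
        for c, (a, b) in enumerate(zip(row_c, row_p))
        if abs(a - b) > block_threshold  # difference contribution block
    ]
    if not hits:
        return None
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return (
        min(rows) * block,
        min(cols) * block,
        (max(rows) + 1) * block,
        (max(cols) + 1) * block,
    )


# 3 x 3 grid of block sums for 8 x 8 pixel blocks; two blocks change strongly.
s_prev = [[40, 40, 40] for _ in range(3)]
s_curr = [[40, 90, 40], [40, 40, 100], [40, 40, 40]]
print(local_region(s_curr, s_prev, block_threshold=30, block=8))
# (0, 8, 16, 24)
```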
[0028] Frame content filtering is performed based on processing decision markers. If marked as "skip the current frame," the current frame is discarded without any subsequent feature extraction operations, and the compact feature map, preliminary anomaly score map, and final anomaly behavior marker map generated in the previous frame are output as the detection results for the current frame. If marked as "local processing," an image patch at the corresponding location is cropped from the current frame based on the coordinates of the local processing region. This image patch only contains areas of significant motion. If marked as "full-frame processing," the entire image of the current frame is retained. The filtered valid content of the current frame is then combined with image patches at the same spatial location from several temporally adjacent previous frames in chronological order from earliest to latest to form a sequence of video frames to be processed.
[0029] Separable convolution is performed on each frame of the video frame sequence to be processed. Separable convolution consists of two stages: depthwise convolution and pointwise convolution. In the depthwise convolution stage, for each channel of the input image, a separate two-dimensional convolution kernel is used to perform spatial convolution on that channel. The size of the convolution kernel is fixed, and each convolution kernel is responsible for the output of only one channel, producing the same number of depth feature maps as the number of input channels.
[0030] In the pointwise convolution stage, a 1x1 convolution kernel is used to linearly combine all depth feature maps along the channel direction. That is, each 1x1 convolution kernel weighted sums the values of each depth feature map at the same pixel location to generate an output channel. By setting multiple 1x1 convolution kernels, multiple sets of output channels can be obtained. Finally, after each frame undergoes separable convolution processing, multiple sets of spatial feature maps are obtained, each corresponding to a spatial pattern, preserving the shape, edge, and texture information within the frame.
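The two stages can be illustrated with a small pure-Python sketch. It is an assumption-laden toy rather than the invention's network: it uses no padding and a stride of one, computes the convolution as cross-correlation (as most deep learning frameworks do), and the names `depthwise_conv` and `pointwise_conv` and all kernel values are made up for the example.

```python
def depthwise_conv(image, kernels):
    """image: list of C channels (each H x W); kernels: one k x k kernel per channel.
    Each kernel filters only its own channel, so C depth feature maps come out."""
    out = []
    for channel, kernel in zip(image, kernels):
        k = len(kernel)
        h, w = len(channel), len(channel[0])
        out.append([
            [sum(channel[y + dy][x + dx] * kernel[dy][dx]
                 for dy in range(k) for dx in range(k))
             for x in range(w - k + 1)]
            for y in range(h - k + 1)
        ])
    return out


def pointwise_conv(depth_maps, weights):
    """weights: one list of C coefficients per output channel (the 1x1 kernels)."""
    h, w = len(depth_maps[0]), len(depth_maps[0][0])
    return [
        [[sum(wc * depth_maps[c][y][x] for c, wc in enumerate(coeffs))
          for x in range(w)]
         for y in range(h)]
        for coeffs in weights
    ]


image = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],  # channel 0
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],  # channel 1
]
kernels = [
    [[1, 0], [0, 1]],  # channel 0: adds each pixel to its lower-right neighbour
    [[1, 1], [1, 1]],  # channel 1: 2x2 box sum
]
depth = depthwise_conv(image, kernels)
spatial = pointwise_conv(depth, weights=[[1, 1], [1, -1]])
```

For a k x k kernel with C_in input channels and C_out output channels, a standard convolution needs k·k·C_in·C_out multiplications per output position, whereas the separable form needs k·k·C_in + C_in·C_out, which is where the saving comes from.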
[0031] Extract the spatial feature maps of the next frame and the previous frame. For each identical spatial location, subtract the feature value of the same location in the previous frame from the feature value in the next frame to obtain the difference. The differences at all locations constitute an inter-frame difference excitation map of the same size as the spatial feature map. Locations with larger values in this excitation map indicate that the region has undergone significant motion changes between the two frames.
[0032] The spatial feature map of the current frame and the corresponding inter-frame difference excitation map are summed pixel-by-pixel with weights. During system initialization, a video segment containing normal motion is selected, and subsequent anomaly detection is performed separately using the spatial feature map and the inter-frame difference excitation map, comparing the contributions of the two features to the detection accuracy. Based on experimental statistics, features with greater contributions are assigned higher weights, and features with smaller contributions are assigned lower weights. Typically, the sum of the spatial weight and the motion weight is one, with the spatial weight set between 0.5 and 0.7, and the motion weight set between 0.3 and 0.5. In practice, for each pixel location, the value of the spatial feature map is multiplied by the spatial weight, and the value of the inter-frame difference excitation map is multiplied by the motion weight. The two products are then added together to obtain the fused value for that pixel location. The set of fused values for all pixel locations constitutes the fused feature map of the current frame.
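The inter-frame difference and the weighted fusion can be sketched as follows for a single-channel feature map. This is an illustrative sketch only; the function names and the spatial weight of 0.6 (chosen from the 0.5 to 0.7 range stated above) are assumptions.

```python
def frame_difference(later_map, earlier_map):
    """Inter-frame difference excitation map: later minus earlier, per location."""
    return [[a - b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(later_map, earlier_map)]


def fuse(spatial_map, excitation_map, spatial_weight=0.6):
    """Weighted per-location sum; the spatial and motion weights sum to one."""
    motion_weight = 1.0 - spatial_weight
    return [[spatial_weight * s + motion_weight * m for s, m in zip(row_s, row_m)]
            for row_s, row_m in zip(spatial_map, excitation_map)]


previous = [[1.0, 2.0], [3.0, 4.0]]
current = [[1.0, 4.0], [3.0, 9.0]]
excitation = frame_difference(current, previous)
fused = fuse(current, excitation, spatial_weight=0.6)
```

Locations that did not change between the two frames contribute only their spatial term, while locations that changed are boosted by the motion term.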
[0033] Channel compression transformation is performed on the fused feature map. Channel compression refers to reducing the values of multiple channels in the fused feature map to a smaller number of channels through linear combination. Specifically, for each output channel, the system presets a set of coefficients with the same number of input channels. The values of all input channels at the same pixel position are multiplied by their corresponding coefficients and then summed to obtain the value of the output channel at that pixel position. In this way, the fused feature map, which originally had dozens of channels, is compressed into a few channels to form an intermediate feature map.
[0034] Spatial downsampling is performed on the intermediate feature map. The system slides across the intermediate feature map with a fixed window size, and the window moves in steps equal to the window size. For each region covered by the window, the average value of all pixels within that region is calculated, and this average value is used as the value at the corresponding position in the output feature map. After sliding the window through the entire intermediate feature map, a compact feature map is obtained, with both its height and width reduced to a fraction of the original. The data volume of this compact feature map is much smaller than the data volume of the original stacked frames, facilitating subsequent fast processing.
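Channel compression and non-overlapping average pooling can be sketched together. The sketch below is a toy with two input channels and one output channel; the names `compress_channels` and `average_pool`, the coefficients, and the window size of two are assumptions for illustration.

```python
def compress_channels(channels, coefficients):
    """channels: list of C maps (H x W).
    coefficients: one list of C weights per output channel."""
    h, w = len(channels[0]), len(channels[0][0])
    return [[[sum(k * channels[c][y][x] for c, k in enumerate(coeffs))
              for x in range(w)]
             for y in range(h)]
            for coeffs in coefficients]


def average_pool(channel, window):
    """Slide a window x window box with stride equal to the window size."""
    h, w = len(channel), len(channel[0])
    return [[sum(channel[y + dy][x + dx]
                 for dy in range(window) for dx in range(window)) / (window * window)
             for x in range(0, w - window + 1, window)]
            for y in range(0, h - window + 1, window)]


fused = [
    [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]],
    [[2, 2, 2, 2], [2, 2, 2, 2], [2, 2, 2, 2], [2, 2, 2, 2]],
]
intermediate = compress_channels(fused, coefficients=[[0.5, 0.5]])
compact = [average_pool(channel, window=2) for channel in intermediate]
```

In this example a 2-channel 4x4 map is reduced to a 1-channel 2x2 compact feature map, one eighth of the original data volume.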
[0035] The beneficial effects are as follows: This invention uses a real-time motion activity assessment mechanism to calculate the full-frame difference statistics from pixel block luminance differences and, based on dynamic thresholds learned from the scene, adaptively decides to skip motionless frames or to process only local motion regions. This significantly reduces unnecessary computation. It also directly reuses the detection results of the previous frame in low-motion scenes, reducing the computational load and detection latency of edge devices. Lightweight spatiotemporal joint feature extraction uses separable convolution to decompose standard convolution into two stages, depthwise convolution and pointwise convolution, significantly reducing the number of multiplication operations. It also captures motion changes directly through inter-frame difference excitation maps, eliminating the need for optical flow calculations. Channel compression and spatial downsampling generate low-dimensional, compact feature maps, further improving the speed of feature extraction. These mechanisms work together, enabling the entire detection process to run in real time on resource-constrained security cameras and meeting the low-latency response requirements of 24/7 video surveillance.
[0036] I2: Perform reconstruction error analysis on the compact feature map to obtain a preliminary anomaly scoring map. In this embodiment of the invention, the step of performing reconstruction error analysis on the compact feature map to obtain a preliminary anomaly scoring map includes: performing distance metric analysis between the feature vector at each pixel position in the compact feature map and the prototype feature vectors in the pre-built normal behavior feature memory to obtain similarity values for that pixel position, and selecting the prototype feature vector with the highest similarity value as the best matching prototype for that pixel position; determining the reconstruction error between the feature vector at each pixel position and its best matching prototype; arranging the reconstruction errors according to the original spatial coordinates of each pixel in the compact feature map to obtain an initial scoring map; and filtering the initial scoring map by local region maximum values, using the maximum error value in each local region as the representative score of that region, to obtain the preliminary anomaly scoring map.
[0037] The pre-built normal behavior feature memory is constructed as follows: a training video sample set containing only normal behavior is obtained, and lightweight spatiotemporal joint feature extraction is performed frame by frame on each training sample to obtain a corresponding training compact feature map; the feature vectors at each pixel location in all the training compact feature maps are aggregated into an initial feature vector set; the feature vectors are divided into multiple clusters according to their similarity, and the center vector of all feature vectors in each cluster is taken as the corresponding prototype feature vector; each prototype feature vector is stored in association with the distribution radius of the feature vectors in its cluster, forming a key-value pair; and all key-value pairs are integrated and stored in a vector database to generate the normal behavior feature memory.
[0038] A compact feature map consists of multiple pixel locations, each containing a feature vector representing the spatial appearance and motion information of that location. For each pixel location in the compact feature map, the system performs distance metric analysis between its feature vector and every prototype feature vector stored in the normal behavior feature memory. The distance metric uses Euclidean distance, which is calculated by taking the square root of the sum of the squares of the differences between corresponding components of the two vectors. A smaller distance value indicates a higher similarity. The system then finds the prototype feature vector with the smallest distance among all prototype feature vectors for the current pixel location; this prototype feature vector is the best matching prototype for that pixel location.
[0039] The system pre-constructs a feature memory database of normal behavior. During the training phase, a batch of training video samples containing only normal behavior is acquired. For each frame in each training sample, a training compact feature map is obtained according to the aforementioned lightweight spatiotemporal joint feature extraction process. Feature vectors for each pixel location are extracted from all training compact feature maps, and these vectors are aggregated to form an initial feature vector set. The initial feature vector set contains a large number of feature vectors from different locations and times in the normal scene.
[0040] Clustering is performed on these feature vectors. The system randomly selects a preset number of feature vectors as initial cluster centers. This number is determined based on the complexity of the scenario, typically set to several hundred. Then, the distance between each feature vector and each cluster center is calculated using Euclidean distance, which is calculated by taking the square root of the sum of the squares of the differences between corresponding components of two vectors. The cluster center with the smallest distance is found, and the feature vector is assigned to the cluster containing that center. After all vectors are assigned, the system recalculates the center of each cluster by averaging all feature vectors within that cluster on the same component. The average of all components forms the new cluster center vector. The system repeats the assignment and center update steps until the cluster centers no longer change or the change is less than a preset small threshold. Ultimately, the system obtains multiple stable clusters, where the feature vectors within each cluster are similar to each other, representing a normal behavioral pattern.
[0041] For each cluster, the system calculates the prototype feature vector, which is the final cluster center vector of the cluster. Simultaneously, the system calculates the distribution radius of the cluster, which is the maximum distance among all feature vectors within the cluster to the prototype feature vector. The prototype feature vector and distribution radius are stored in memory as a key-value pair, with the prototype feature vector as the key and the distribution radius as the value. The system integrates all key-value pairs from all clusters into a vector database. This database uses a tree-indexed structure to quickly find the prototype feature vector closest to the input vector. This vector database is the normal behavior feature memory.
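The clustering and the prototype/radius key-value pairs can be sketched as follows. This is a minimal sketch, not the invention's implementation: it stores the pairs in a plain Python dictionary rather than a tree-indexed vector database, and the `init`, `tol`, `max_iter`, and `seed` parameters are assumptions added so the example is reproducible. When `init` is omitted, the initial centers are sampled at random as the text describes.

```python
import math
import random


def nearest(vector, centers):
    """Index of the center with the smallest Euclidean distance to vector."""
    return min(range(len(centers)), key=lambda i: math.dist(vector, centers[i]))


def assign(vectors, centers):
    clusters = [[] for _ in centers]
    for v in vectors:
        clusters[nearest(v, centers)].append(v)
    return clusters


def build_memory(vectors, k, init=None, tol=1e-6, max_iter=100, seed=0):
    """Return {prototype: distribution_radius} from k-means clustering."""
    centers = list(init) if init is not None else random.Random(seed).sample(vectors, k)
    for _ in range(max_iter):
        clusters = assign(vectors, centers)
        # New center = component-wise mean; an empty cluster keeps its old center.
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else center
            for members, center in zip(clusters, centers)
        ]
        shift = max(math.dist(a, b) for a, b in zip(centers, updated))
        centers = updated
        if shift < tol:
            break
    clusters = assign(vectors, centers)
    # Radius = largest member-to-prototype distance within the cluster.
    return {
        center: max((math.dist(v, center) for v in members), default=0.0)
        for center, members in zip(centers, clusters)
    }


vectors = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0),
           (10.0, 10.0), (10.0, 12.0), (12.0, 10.0), (12.0, 12.0)]
memory = build_memory(vectors, k=2, init=[(0.0, 0.0), (12.0, 12.0)])
```

The eight example vectors form two well-separated groups, so the memory ends up with two prototypes, each paired with the radius of its group.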
[0042] After obtaining a compact feature map during the detection phase, the system performs reconstruction error analysis on each pixel location within the compact feature map. For the feature vector of the current pixel location, the system inputs it into the normal behavior feature memory and quickly finds the nearest prototype feature vector using the index of the vector database. This nearest prototype feature vector is the best matching prototype for that pixel location. The system calculates the Euclidean distance between the feature vector of the current pixel location and the best matching prototype; this distance value is the reconstruction error. The larger the reconstruction error, the greater the deviation between the behavior pattern of the current pixel location and all normal behavior patterns, meaning a higher probability of an anomaly occurring at that location.
[0043] The reconstruction error values calculated for each pixel location are arranged according to the original spatial coordinates of that pixel in the compact feature map to generate an initial score map. The initial score map has the same spatial dimensions as the compact feature map, where the value at each location represents the reconstruction error at that location.
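A short sketch of the reconstruction error computation follows. It assumes a brute-force search over a small list of prototypes instead of the indexed vector database, and the name `initial_score_map` and the example vectors are illustrative only.

```python
import math


def initial_score_map(feature_map, prototypes):
    """feature_map: H x W grid of feature vectors.
    Returns an H x W grid holding, for each location, the Euclidean distance
    to the nearest prototype (the reconstruction error)."""
    return [[min(math.dist(vec, p) for p in prototypes) for vec in row]
            for row in feature_map]


prototypes = [(0.0, 0.0), (10.0, 10.0)]
feature_map = [
    [(0.0, 0.0), (3.0, 4.0)],
    [(10.0, 10.0), (10.0, 7.0)],
]
scores = initial_score_map(feature_map, prototypes)
```

Locations whose feature vector coincides with a prototype score zero, and the score grows with the distance from the closest normal pattern.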
[0044] The system presets a fixed-size local window, the size of which is determined based on the minimum spatial range of abnormal behavior in the monitored scene, typically set to cover the pixel area occupied by the upper body of a single person. The local window is slid across the initial scoring map, with each movement increment matching the window size to ensure no overlap between windows. For each local area covered by a window, the error values of all pixels within that area are iterated, and the maximum value is used as the representative score for that local area. All representative scores for all local areas are rearranged according to their spatial location, resulting in a reduced-size preliminary anomaly scoring map. This is done to make the anomaly areas more focused, avoid simultaneous alarms from multiple adjacent pixels, and reduce the amount of data processed subsequently.
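The local maximum screening amounts to non-overlapping max pooling, which can be sketched as below. The function name `max_pool` and the window size of two are assumptions for the example; in practice the window would be sized to the scene as described above.

```python
def max_pool(score_map, window):
    """Keep the largest error in each non-overlapping window x window region."""
    h, w = len(score_map), len(score_map[0])
    return [[max(score_map[y + dy][x + dx]
                 for dy in range(window) for dx in range(window))
             for x in range(0, w - window + 1, window)]
            for y in range(0, h - window + 1, window)]


initial = [
    [0.1, 0.2, 0.0, 0.0],
    [0.3, 0.9, 0.0, 0.1],
    [0.0, 0.0, 0.5, 0.4],
    [0.0, 0.2, 0.3, 0.6],
]
preliminary = max_pool(initial, window=2)
```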
[0045] The beneficial effects are as follows: This invention constructs a normal behavior feature memory library, transforming training samples containing only normal behavior into prototype feature vectors in a vector database, thus achieving unsupervised anomaly localization without relying on scarce abnormal samples. Distance metric analysis and the optimal matching prototype selection process can accurately quantify the degree of behavioral deviation at each pixel location, and the reconstruction error directly reflects the probability of anomaly. The local region maximum value screening mechanism effectively suppresses isolated noise points in the anomaly scoring map, making the detection results more focused on significant anomaly areas, thereby improving the accuracy and robustness of anomaly localization.
[0046] I3: Update the dynamic judgment threshold of the scene in the current video in real time, determine the area in the preliminary anomaly scoring map that exceeds the dynamic judgment threshold as a candidate anomaly area, and mark the candidate anomaly area that passes the temporal consistency verification as the final anomaly behavior marking map, and generate a structured detection report.
[0047] In this embodiment of the invention, the real-time updating of the dynamic judgment threshold of the scene in the current video includes: when a new preliminary anomaly scoring map is generated, calculating the mean and standard deviation of the scores of all pixels in the new preliminary anomaly scoring map and storing them as a set of statistics in the sliding time window; combining the average of the score means and the average of the score standard deviations within the sliding time window to form the scene stability baseline value; and determining the dynamic judgment threshold for the current frame based on the scene stability baseline value and the preset sensitivity adjustment coefficient.
[0048] The step of identifying regions in the preliminary anomaly scoring map that exceed the dynamic determination threshold as candidate anomaly regions includes: The pixel score in the preliminary anomaly score map is compared with the dynamic judgment threshold. When the pixel score is greater than the dynamic judgment threshold, the pixel is marked as an abnormal pixel; otherwise, it is marked as a normal pixel, and a binarized abnormal pixel mask map is generated. Connectivity analysis is performed on the abnormal pixel mask image, and isolated noise points are removed from the connected regions to obtain candidate abnormal regions.
[0049] The step of marking candidate anomalous regions that pass the time-series consistency verification as the final anomalous behavior labeling map and generating a structured detection report includes: The candidate anomaly region in the current frame is associated with the candidate anomaly region at the corresponding spatial location in the previous consecutive frames to establish a temporal trajectory linked list. The continuity of the time-series trajectory linked list is checked to obtain the positional overlap of candidate abnormal regions between adjacent frames, and the candidate abnormal regions are determined to pass the time consistency verification based on the magnitude of the positional overlap. The spatial masks in the verified candidate anomaly regions are merged according to pixel position to obtain the final anomaly behavior label map; Based on the spatiotemporal trajectory of the abnormal region in the final abnormal behavior marking map, the corresponding abnormal behavior category name is matched to obtain a structured text description; The final abnormal behavior marker map and the structured text description are encapsulated into a structured detection report.
[0050] When a new preliminary anomaly scoring map is generated, the system calculates the arithmetic mean of the scores of all pixels in the map to obtain the score mean. The system also calculates the square of the difference between each pixel's score and the score mean, sums the squares of all differences, divides by the total number of pixels, and then takes the square root to obtain the score standard deviation. The system stores the score mean and score standard deviation of the current frame as a set of statistics in a sliding time window.
[0051] The sliding time window is a fixed-length queue, the length of which is predetermined based on the dynamic change rate of the monitored scene. For indoor scenes with slow pedestrian flow, the window length is set to the number of video frames in thirty seconds; for rapidly changing scenes such as traffic intersections, the window length is set to the number of video frames in ten seconds. When a new set of statistics is added, if the queue is full, the oldest set of statistics in the queue is discarded, ensuring that the window always contains statistics from the most recent period.
[0052] The system reads all sets of statistics stored in the sliding time window and calculates the average of the score means and the average of the score standard deviations across those sets. These two values together form the scene stability baseline value for the current scene. This baseline value reflects the typical level and dispersion of the preliminary anomaly scoring maps over a recent period.
[0053] The sensitivity adjustment coefficient is set based on the monitoring scenario's tolerance for missed and false alarms. For high-security scenarios such as prisons and nuclear power plants, the sensitivity adjustment coefficient is set lower to reduce the threshold, increase the detection rate of abnormal events, and allow for a small number of false alarms. For conventional scenarios such as shopping malls and office buildings, the sensitivity adjustment coefficient is set to a medium level. For scenarios requiring extremely low false alarm rates, such as traffic monitoring, the sensitivity adjustment coefficient is set higher to increase the threshold and reduce false alarms. Specifically, the system adds the average of the score mean to the product of the sensitivity adjustment coefficient and the average of the score standard deviations to obtain the dynamic judgment threshold. This threshold is automatically adjusted according to changes in the scene's statistical characteristics.
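The threshold update can be sketched with a fixed-length queue. This is a minimal sketch under assumptions: the class name `DynamicThreshold`, the window length of two frames, and the sensitivity coefficient of 2.0 are chosen only to keep the example small. The threshold is computed as the average of the score means plus the coefficient times the average of the score standard deviations, as stated above.

```python
import math
from collections import deque


class DynamicThreshold:
    def __init__(self, window_length, sensitivity):
        # deque(maxlen=...) drops the oldest statistics when the queue is full.
        self.stats = deque(maxlen=window_length)
        self.sensitivity = sensitivity

    def update(self, score_map):
        """Store this frame's (mean, std) and return the new threshold."""
        values = [v for row in score_map for v in row]
        mean = sum(values) / len(values)
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        self.stats.append((mean, std))
        baseline_mean = sum(m for m, _ in self.stats) / len(self.stats)
        baseline_std = sum(s for _, s in self.stats) / len(self.stats)
        return baseline_mean + self.sensitivity * baseline_std


threshold = DynamicThreshold(window_length=2, sensitivity=2.0)
t1 = threshold.update([[1, 3], [1, 3]])   # window holds frame 1
t2 = threshold.update([[2, 6], [2, 6]])   # window holds frames 1 and 2
t3 = threshold.update([[0, 0], [0, 0]])   # frame 1 is discarded
```

A lower sensitivity coefficient lowers every threshold the object returns, which matches the high-security setting described above.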
[0054] After obtaining the dynamic judgment threshold, the system compares the score value of each pixel in the preliminary anomaly scoring map with the dynamic judgment threshold. If the score value of a pixel is greater than the dynamic judgment threshold, the pixel is marked as an anomaly pixel; if the score value is less than or equal to the dynamic judgment threshold, it is marked as a normal pixel. After all pixels are marked, the system generates a binary anomaly pixel mask map, where anomaly pixels are represented by a first value and normal pixels are represented by a second value.
[0055] The system scans every pixel in the mask image from top to bottom and left to right. When it encounters a pixel marked as abnormal that has not yet been visited, it uses that pixel as a seed and searches for other pixels marked as abnormal in its four adjacent directions (up, down, left, and right). These adjacent pixels are grouped into the same connected region. The search continues outward from the newly added pixel until no adjacent abnormal pixels are found. After this scanning is completed, the mask image is divided into multiple independent connected regions, and each connected region is a candidate abnormal region.
[0056] A pre-set area threshold is established, determined based on the minimum pixel area occupied by a single pedestrian or vehicle in the monitored image. Typically, the area threshold is set to a few pixels for distant pedestrians and larger for nearby targets. The system calculates the total number of pixels contained in each connected region. If this total is less than the pre-set area threshold, the connected region is identified as isolated noise and removed from the candidate anomaly region set. The remaining connected regions after removal are the valid candidate anomaly regions.
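Thresholding, four-neighbour connected region search, and area screening can be sketched together. The sketch uses a breadth-first search from each unvisited abnormal seed pixel; the function name `candidate_regions` and the example threshold and area values are assumptions.

```python
from collections import deque


def candidate_regions(score_map, threshold, min_area):
    """Return a list of regions, each a list of (row, col) abnormal pixels."""
    h, w = len(score_map), len(score_map[0])
    mask = [[score_map[y][x] > threshold for x in range(w)] for y in range(h)]
    visited = [[False] * w for _ in range(h)]
    regions = []
    for y in range(h):          # scan top to bottom, left to right
        for x in range(w):
            if not mask[y][x] or visited[y][x]:
                continue
            visited[y][x] = True
            queue, region = deque([(y, x)]), []
            while queue:
                cy, cx = queue.popleft()
                region.append((cy, cx))
                # Up, down, left, right neighbours only (4-connectivity).
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not visited[ny][nx]:
                        visited[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) >= min_area:  # smaller regions are isolated noise
                regions.append(region)
    return regions


scores = [
    [9, 9, 0, 0],
    [9, 0, 0, 0],
    [0, 0, 0, 9],
    [0, 0, 0, 0],
]
regions = candidate_regions(scores, threshold=5, min_area=2)
```

The three-pixel group in the top-left corner survives, while the single abnormal pixel on the right is removed as noise.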
[0057] A temporal trajectory linked list is maintained for each candidate anomaly region. Each node in this list records the coordinates of the minimum bounding rectangle of that region within a given frame. For a candidate anomaly region in the current frame, the system calculates the overlap between the minimum bounding rectangle of that region and the minimum bounding rectangles of all candidate anomaly regions in the previous frame. The overlap is calculated by dividing the area of the intersection of the two rectangles by the area of their union. The region in the previous frame with the highest overlap is selected as the associated object, and the position coordinates of the current frame region are appended to the end of the temporal trajectory linked list corresponding to that region. If the maximum overlap is lower than a preset association threshold, a new temporal trajectory linked list is created, with the current frame region as the first node of the list.
[0058] Continuity detection involves checking the positional overlap between adjacent frame nodes in a linked list. The system presets an overlap threshold, typically set to 0.5. For each pair of adjacent nodes in the linked list, their overlap is calculated. If the overlap of adjacent nodes in multiple consecutive frames is greater than the threshold, the candidate anomalous region is deemed to have passed the temporal consistency verification. If the overlap of any pair of adjacent nodes in the linked list is less than the threshold, or if the length of the linked list is less than the preset minimum number of consecutive frames, the candidate anomalous region is determined to be transient noise or a false detection, and is removed from the verification list.
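Trajectory association and the continuity check can be sketched with plain Python lists standing in for the linked lists. This is a simplified sketch under assumptions: boxes are (top, left, bottom, right) with exclusive bottom and right edges, the association threshold of 0.3 and the minimum of three consecutive frames are example values, an overlap exactly equal to the threshold is treated as passing (the text leaves that case open), and the sketch does not stop two regions from joining the same trajectory.

```python
def iou(a, b):
    """Intersection area divided by union area of two boxes."""
    ih = min(a[2], b[2]) - max(a[0], b[0])
    iw = min(a[3], b[3]) - max(a[1], b[1])
    if ih <= 0 or iw <= 0:
        return 0.0
    inter = ih * iw
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def associate(tracks, boxes, assoc_threshold=0.3):
    """Append each current-frame box to its best-overlapping track or start a new one."""
    previous = [(track, track[-1]) for track in tracks]  # snapshot before updates
    for box in boxes:
        best_track, best_overlap = None, 0.0
        for track, last_box in previous:
            overlap = iou(last_box, box)
            if overlap > best_overlap:
                best_track, best_overlap = track, overlap
        if best_track is not None and best_overlap >= assoc_threshold:
            best_track.append(box)
        else:
            tracks.append([box])
    return tracks


def passes_verification(track, overlap_threshold=0.5, min_frames=3):
    if len(track) < min_frames:
        return False
    return all(iou(a, b) >= overlap_threshold for a, b in zip(track, track[1:]))


tracks = []
associate(tracks, [(0, 0, 10, 10)])
associate(tracks, [(1, 1, 11, 11)])
associate(tracks, [(2, 2, 12, 12), (50, 50, 52, 52)])  # second box appears only once
verified = [t for t in tracks if passes_verification(t)]
```

The slowly drifting box forms a three-frame trajectory and is verified, while the box that appears in a single frame is rejected as transient noise.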
[0059] The spatial masks corresponding to all candidate anomaly regions that have passed the temporal consistency verification are merged according to their pixel positions. A spatial mask refers to the set of pixel positions covered by each candidate anomaly region in the current frame. During merging, the system creates an all-zero matrix with the same size as the frame image. For each verified candidate anomaly region, the value at its mask-covered position is set to the first value. If different regions overlap, the overlapping position remains the first value. After merging, a final binarized anomaly behavior marker map is obtained.
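The mask merging step can be sketched as follows. The function name `merge_masks` is hypothetical, and the first value is assumed to be 1 with 0 as the background.

```python
def merge_masks(regions, height, width, first_value=1):
    """regions: list of verified regions, each a list of (row, col) positions."""
    marker_map = [[0] * width for _ in range(height)]
    for region in regions:
        for y, x in region:
            marker_map[y][x] = first_value  # overlapping positions stay at first_value
    return marker_map


verified_regions = [[(0, 0), (0, 1)], [(0, 1), (1, 1)]]
marker_map = merge_masks(verified_regions, height=2, width=3)
```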
[0060] Based on the spatiotemporal trajectory of each abnormal region in the final abnormal behavior labeling map, the system matches the corresponding abnormal behavior category name. The spatiotemporal trajectory includes information such as the position change sequence, velocity change, and area change of the region over multiple consecutive frames. The system pre-maintains an abnormal behavior description library containing various typical abnormal behavior categories, such as running, falling, fighting, and loitering, with each category corresponding to a series of trajectory feature templates. The system compares the spatiotemporal trajectory to be matched with each feature template in the library, calculates the similarity between the trajectory and the template, and selects the category name corresponding to the template with the highest similarity as the abnormal behavior category name for the current abnormal region. The system generates a structured text description, which includes the start and end times of the abnormal behavior, the trajectory of the coordinate change of the abnormal region's center point, and the matched abnormal behavior category name.
[0061] The marked image data is converted into a transmittable binary stream according to a preset data format, and the text description is organized in a key-value pair format, for example, using timestamps, coordinate sequences, and category names as different keys. These two parts of data are combined to form a complete structured detection report. This report is output to a display terminal, where abnormal areas are highlighted with overlaid text descriptions on the monitoring screen. Simultaneously, based on the abnormal behavior category name in the report, corresponding alarm signals are triggered, such as playing a specific voice message or sending an alarm notification to the management platform.
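One possible shape for the structured detection report is sketched below. The preset data format is not specified in the text, so everything about the layout here is an assumption: the marker map is serialized as one byte per pixel in row-major order and hex-encoded, the key names are invented, and the timestamps and coordinates are placeholder values.

```python
import json


def build_report(marker_map, start_time, end_time, center_track, category):
    height, width = len(marker_map), len(marker_map[0])
    # Assumed binary format: one byte per pixel, row-major order.
    stream = bytes(v for row in marker_map for v in row)
    return {
        "marker_map": {"height": height, "width": width, "data_hex": stream.hex()},
        "description": {
            "start_time": start_time,
            "end_time": end_time,
            "center_track": center_track,
            "category": category,
        },
    }


report = build_report(
    [[1, 1, 0], [0, 1, 0]],
    start_time="12:00:01",
    end_time="12:00:04",
    center_track=[[5, 5], [6, 6], [7, 7]],
    category="running",
)
encoded = json.dumps(report)
```

A receiving terminal could decode the JSON, rebuild the marker map from the height, width, and byte stream, and choose an alarm action from the category name.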
[0062] The beneficial effects are as follows: This invention uses a sliding time window to statistically analyze the mean and standard deviation of the preliminary anomaly scoring map in real time, dynamically calculating the judgment threshold. This allows the threshold to automatically adjust with scene factors such as changes in lighting, weather, and background dynamics, avoiding a large number of missed or false alarms when the scene changes due to a fixed threshold. Based on connected component analysis and area screening of isolated noise points, false alarms of single pixels or extremely small areas are effectively eliminated. Temporal consistency verification distinguishes between real continuous abnormal behavior and transient interference through trajectory association and overlap detection. Only abnormal areas that appear stably over multiple consecutive frames are finally confirmed, significantly reducing the false alarm rate. The final generated structured detection report includes both anomaly marker maps and text descriptions, providing monitoring personnel with intuitive visualization results and detailed semantic information, facilitating rapid decision-making and post-event traceability.
[0063] Figure 2 is a functional block diagram of a real-time video abnormal behavior detection system based on deep learning provided in an embodiment of the present invention.
[0064] The deep learning-based real-time video abnormal behavior detection system 100 of this invention can be installed in an electronic device. Depending on the functions implemented, the deep learning-based real-time video abnormal behavior detection system 100 may include a joint feature extraction module 101, a reconstruction error module 102, and a threshold update and timing verification module 103. The module described in this invention can also be called a unit, which refers to a series of computer program segments that can be executed by the processor of an electronic device and can perform a fixed function, and which are stored in the memory of the electronic device.
[0065] In this embodiment, the functions of each module / unit are as follows: The joint feature extraction module 101 is used to perform lightweight spatiotemporal joint feature extraction on the video frame sequence to be processed from the original video stream in the video to obtain a compact feature map of the video. The reconstruction error module 102 is used to perform reconstruction error analysis on the compact feature map to obtain a preliminary anomaly score map. The threshold update and timing verification module 103 is used to update the dynamic judgment threshold of the scene in the current video in real time, determine the area in the preliminary anomaly scoring map that exceeds the dynamic judgment threshold as a candidate anomaly area, and mark the candidate anomaly area that passes the timing consistency verification as the final anomaly behavior marking map, and generate a structured detection report.
[0066] In the several embodiments provided by this invention, it should be understood that the disclosed methods and systems can be implemented in other ways. For example, the system embodiments described above are merely illustrative; for instance, the division of modules is only a logical functional division, and other division methods may be used in actual implementation.
[0067] The modules described as separate components may or may not be physically separate. The components shown as modules may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.
[0068] Furthermore, the functional modules in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or in the form of hardware plus software functional modules.
[0069] It will be apparent to those skilled in the art that the present invention is not limited to the details of the exemplary embodiments described above, and that the present invention can be implemented in other specific forms without departing from the spirit or essential characteristics of the present invention.
[0070] The embodiments of this application can acquire and process relevant data based on an artificial intelligence technology. Artificial intelligence is the theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.
[0071] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.
Claims
1. A real-time video abnormal behavior detection method based on deep learning, characterized in that the method includes: I1: performing lightweight spatiotemporal joint feature extraction on the sequence of video frames to be processed from the original video stream in the video to obtain a compact feature map of the video; I2: performing reconstruction error analysis on the compact feature map to obtain a preliminary anomaly scoring map; I3: updating the dynamic judgment threshold of the scene in the current video in real time, determining the area in the preliminary anomaly scoring map that exceeds the dynamic judgment threshold as a candidate anomaly area, marking the candidate anomaly area that passes the temporal consistency verification as the final anomaly behavior marking map, and generating a structured detection report.
2. The real-time video abnormal behavior detection method based on deep learning as described in claim 1, characterized in that the sequence of video frames to be processed is obtained by: performing real-time motion activity assessment on the consecutive raw frames decoded from the original video stream, and generating a processing decision flag and local processing region coordinates for each frame based on the assessment result; when the processing decision flag indicates skipping the current frame, excluding the current frame and directly reusing the detection result of the previous frame; when the processing decision flag indicates local processing, cropping the corresponding local image block from the current frame according to the local processing region coordinates and using the local image block as the content to be processed for the current frame; when the processing decision flag indicates full-frame processing, using the full-frame image of the current frame as the content to be processed for the current frame; and arranging the content to be processed at the same spatial location in chronological order to form the sequence of video frames to be processed.
3. The real-time video abnormal behavior detection method based on deep learning as described in claim 2, characterized in that the real-time motion activity assessment of the consecutive raw frames decoded from the original video stream comprises: dividing the current frame and the previous frame into multiple non-overlapping pixel blocks, calculating the sum of the luminance components of all pixels in each pixel block, and calculating the full-frame difference statistic of the current frame by a formula [formula not reproduced in the source text] whose terms are: the full-frame difference statistic of the t-th frame; the number of rows of pixel blocks; the number of columns of pixel blocks; the sum of the luminance components of the pixel block in row i, column j of the t-th frame; and the sum of the luminance components of the corresponding pixel block in the (t-1)-th frame; when the full-frame difference statistic lies within a preset activity range, evaluating the difference contribution of each pixel block in the current frame, and, when pixel blocks in the evaluation result exceed the block-level threshold, using the coordinates of the minimum bounding rectangle of those pixel blocks as the local processing region coordinates and generating a processing decision flag indicating local processing; when the full-frame difference statistic is less than the low activity threshold of the activity range, generating a processing decision flag indicating skipping the current frame and clearing the local processing region coordinates; and when the full-frame difference statistic is greater than or equal to the high activity threshold of the activity range, generating a processing decision flag indicating full-frame processing and setting the local processing region coordinates to a coordinate range covering the entire frame.
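The gating logic of claims 2 and 3 can be illustrated with a minimal Python sketch. It is not the claimed implementation: the source does not reproduce the formula, so the sketch assumes the full-frame difference statistic is the sum of absolute differences between per-block luminance sums. The names `block_sums` and `decide`, the thresholds `low`, `high`, and `block_thr`, the block-unit coordinates, and the fallback to "skip" when no block exceeds the block-level threshold are all hypothetical.

```python
def block_sums(frame, bs):
    """Sum luminance over non-overlapping bs x bs blocks (frame is a list of rows)."""
    rows, cols = len(frame) // bs, len(frame[0]) // bs
    return [[sum(frame[r * bs + y][c * bs + x]
                 for y in range(bs) for x in range(bs))
             for c in range(cols)] for r in range(rows)]


def decide(prev, curr, bs, low, high, block_thr):
    """Return (flag, region); region is (row0, col0, row1, col1) in block units."""
    sp, sc = block_sums(prev, bs), block_sums(curr, bs)
    diffs = [[abs(a - b) for a, b in zip(row_c, row_p)]
             for row_c, row_p in zip(sc, sp)]
    total = sum(map(sum, diffs))  # assumed form of the full-frame statistic
    if total < low:
        return "skip", None       # low activity: reuse the previous result
    if total >= high:
        return "full", (0, 0, len(diffs) - 1, len(diffs[0]) - 1)
    active = [(r, c) for r, row in enumerate(diffs)
              for c, d in enumerate(row) if d > block_thr]
    if not active:
        return "skip", None       # assumption: case not specified in the claim
    rs = [r for r, _ in active]
    cs = [c for _, c in active]
    return "local", (min(rs), min(cs), max(rs), max(cs))
```

With a 4x4 frame, 2x2 blocks, and a single changed pixel, `decide` returns a "local" flag whose region is the one block that changed.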
4. The real-time video abnormal behavior detection method based on deep learning as described in claim 1, characterized in that performing lightweight spatiotemporal joint feature extraction on the sequence of video frames to be processed to obtain a compact feature map of the video comprises: applying separable convolution to each frame in the sequence of video frames to be processed to obtain multiple sets of spatial feature maps corresponding to each frame; computing the difference, pixel position by pixel position, between the spatial feature maps of two adjacent frames in the sequence to obtain an inter-frame difference excitation map; summing the spatial feature map of the current frame and the inter-frame difference excitation map pixel by pixel to obtain the fused feature map of the current frame; applying a channel compression transformation to the fused feature map to generate an intermediate feature map of the video; and spatially downsampling the intermediate feature map to generate the compact feature map of the video.
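As a rough illustration of the fusion, compression, and downsampling steps of claim 4, the following Python sketch assumes the per-frame spatial feature maps have already been produced (the separable convolution is omitted). It further assumes that the inter-frame difference is an absolute difference, that channel compression is a weighted sum across channels (equivalent to a 1x1 convolution to a single channel), and that downsampling is 2x2 average pooling; none of these choices is stated in the source, and `fuse_and_compress` and `weights` are hypothetical names.

```python
def fuse_and_compress(prev_maps, curr_maps, weights):
    """prev_maps / curr_maps: list of channels, each an H x W list of rows."""
    # Inter-frame difference excitation added pixel-wise to the current maps.
    fused = [[[c + abs(c - p) for c, p in zip(row_c, row_p)]
              for row_c, row_p in zip(ch_c, ch_p)]
             for ch_c, ch_p in zip(curr_maps, prev_maps)]
    h, w = len(fused[0]), len(fused[0][0])
    # Channel compression: weighted sum over channels (assumed 1x1 convolution).
    inter = [[sum(wt * ch[r][c] for wt, ch in zip(weights, fused))
              for c in range(w)] for r in range(h)]
    # Spatial downsampling: 2x2 average pooling with stride 2 (assumed).
    return [[(inter[r][c] + inter[r][c + 1]
              + inter[r + 1][c] + inter[r + 1][c + 1]) / 4
             for c in range(0, w - 1, 2)]
            for r in range(0, h - 1, 2)]
```

Adding the difference excitation raises the response only where adjacent frames disagree, so static regions pass through unchanged.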
5. The real-time video abnormal behavior detection method based on deep learning as described in claim 1, characterized in that performing reconstruction error analysis on the compact feature map to obtain a preliminary anomaly score map comprises: performing distance metric analysis between the feature vector at each pixel position in the compact feature map and the prototype feature vectors in a pre-built normal behavior feature memory library to obtain similarity values for that pixel position, and selecting the prototype feature vector with the highest similarity value as the best-matching prototype for that pixel position; determining the reconstruction error between the feature vector at each pixel position and its best-matching prototype; arranging the reconstruction errors according to the original spatial coordinates of each pixel in the compact feature map to obtain an initial score map; and filtering the initial score map by local region maxima, using the maximum error value in each local region as the representative score of that region, to obtain the preliminary anomaly score map.
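The prototype matching and local maximum filtering of claim 5 can be sketched in Python as follows. The sketch assumes Euclidean distance as the distance metric (so the highest similarity corresponds to the smallest distance), takes the reconstruction error to be that distance, and uses a sliding square window for the local maximum; the source fixes none of these, and `score_map`, `local_max`, and the window half-width `k` are hypothetical.

```python
import math


def score_map(features, prototypes):
    """features: H x W grid of vectors; returns distance to the nearest prototype."""
    return [[min(math.dist(v, p) for p in prototypes) for v in row]
            for row in features]


def local_max(scores, k=1):
    """Replace each score by the maximum inside its (2k+1) x (2k+1) neighborhood."""
    h, w = len(scores), len(scores[0])
    return [[max(scores[y][x]
                 for y in range(max(0, r - k), min(h, r + k + 1))
                 for x in range(max(0, c - k), min(w, c + k + 1)))
             for c in range(w)] for r in range(h)]
```

Taking the local maximum rather than the mean keeps a small but strongly anomalous patch from being averaged away by its normal surroundings.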
6. The real-time video abnormal behavior detection method based on deep learning as described in claim 5, characterized in that the pre-built normal behavior feature memory library is constructed by: obtaining a training video sample set containing only normal behavior, and performing lightweight spatiotemporal joint feature extraction frame by frame on each training sample in the set to obtain a training compact feature map for each training sample; aggregating the feature vectors at every pixel position in all training compact feature maps into an initial feature vector set; dividing the feature vectors into multiple clusters according to their similarity, and taking the center vector of all feature vectors in each cluster as the corresponding prototype feature vector; associating each prototype feature vector with the distribution radius of the feature vectors in its cluster and storing them as a key-value pair; and integrating all key-value pairs and storing them in a vector database to generate the normal behavior feature memory library.
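A minimal Python sketch of the memory library construction in claim 6 is given below. The claim does not name a clustering algorithm, so the sketch substitutes plain k-means with a deterministic initialization (the first `k` vectors), defines the distribution radius as the largest member-to-center distance, and uses an in-memory dictionary in place of a vector database. `assign`, `build_memory`, `k`, and `iters` are hypothetical names.

```python
import math


def assign(vectors, centers):
    """Group vectors by their nearest center (Euclidean distance)."""
    clusters = [[] for _ in centers]
    for v in vectors:
        nearest = min(range(len(centers)),
                      key=lambda i: math.dist(v, centers[i]))
        clusters[nearest].append(v)
    return clusters


def build_memory(vectors, k, iters=10):
    """Return {prototype vector: distribution radius} as key-value pairs."""
    centers = [tuple(float(x) for x in v) for v in vectors[:k]]
    for _ in range(iters):
        clusters = assign(vectors, centers)
        # Move each center to the mean of its members; keep empty clusters fixed.
        centers = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl
                   else centers[i]
                   for i, cl in enumerate(clusters)]
    clusters = assign(vectors, centers)
    return {c: max((math.dist(c, v) for v in cl), default=0.0)
            for c, cl in zip(centers, clusters)}
```

Storing the radius alongside each prototype lets a later stage judge a distance relative to how widely normal features spread around that prototype.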
7. The real-time video abnormal behavior detection method based on deep learning as described in claim 1, characterized in that updating the dynamic judgment threshold of the scene in the current video in real time comprises: when a new preliminary anomaly score map is generated, calculating the mean and standard deviation of the scores of all pixels in the new preliminary anomaly score map and storing them as one set of statistics in a sliding time window; combining the average of the score means and the average of the score standard deviations within the sliding time window to form a scene stability baseline value; and determining the dynamic judgment threshold for the current frame based on the scene stability baseline value and a preset sensitivity adjustment coefficient.
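The sliding-window threshold update of claim 7 can be sketched in Python as shown below. The claim does not state how the baseline and the sensitivity adjustment coefficient are combined; the sketch assumes the common form "average mean plus coefficient times average standard deviation". The class name `DynamicThreshold`, the window length, and the coefficient `k` are hypothetical.

```python
from collections import deque
from statistics import fmean, pstdev


class DynamicThreshold:
    def __init__(self, window=30, k=3.0):
        self.stats = deque(maxlen=window)  # sliding window of (mean, std) pairs
        self.k = k                         # sensitivity adjustment coefficient

    def update(self, score_map):
        """Record statistics of a new score map and return the current threshold."""
        flat = [s for row in score_map for s in row]
        self.stats.append((fmean(flat), pstdev(flat)))
        mean_avg = fmean([m for m, _ in self.stats])
        std_avg = fmean([s for _, s in self.stats])
        # Assumed combination of baseline and coefficient.
        return mean_avg + self.k * std_avg
```

Because the window has a fixed length, old statistics drop out automatically, so the threshold follows gradual changes in lighting or background activity.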
8. The real-time video abnormal behavior detection method based on deep learning as described in claim 1, characterized in that determining the regions of the preliminary anomaly score map that exceed the dynamic judgment threshold as candidate anomaly regions comprises: comparing each pixel score in the preliminary anomaly score map with the dynamic judgment threshold, marking the pixel as an abnormal pixel when its score is greater than the dynamic judgment threshold and as a normal pixel otherwise, to generate a binarized abnormal pixel mask map; and performing connectivity analysis on the abnormal pixel mask map and removing isolated noise points from the connected regions to obtain the candidate anomaly regions.
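The binarization and connectivity analysis of claim 8 can be sketched in Python as follows. The sketch assumes 4-connectivity and treats any connected component smaller than `min_area` pixels as isolated noise; both choices, and the name `candidate_regions`, are hypothetical, since the claim does not specify them.

```python
from collections import deque


def candidate_regions(scores, thr, min_area=2):
    """Return connected groups of above-threshold pixels as lists of (row, col)."""
    h, w = len(scores), len(scores[0])
    mask = [[s > thr for s in row] for row in scores]  # binarized mask
    seen = [[False] * w for _ in range(h)]
    regions = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            queue, comp = deque([(r, c)]), []
            seen[r][c] = True
            while queue:  # breadth-first flood fill over 4-neighbors
                y, x = queue.popleft()
                comp.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w \
                            and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(comp) >= min_area:  # drop isolated noise points
                regions.append(comp)
    return regions
```

For example, two adjacent above-threshold pixels form one candidate region, while a single above-threshold pixel elsewhere in the map is discarded.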
9. The real-time video abnormal behavior detection method based on deep learning as described in claim 1, characterized in that marking the candidate anomaly regions that pass temporal consistency verification to form the final abnormal behavior marking map and generating a structured detection report comprises: associating each candidate anomaly region in the current frame with the candidate anomaly regions at the corresponding spatial location in the preceding consecutive frames to establish a temporal trajectory linked list; checking the continuity of the temporal trajectory linked list to obtain the positional overlap of the candidate anomaly regions between adjacent frames, and determining whether the candidate anomaly regions pass temporal consistency verification based on the magnitude of the positional overlap; merging the spatial masks of the verified candidate anomaly regions by pixel position to obtain the final abnormal behavior marking map; matching the spatiotemporal trajectory of each abnormal region in the final abnormal behavior marking map to the corresponding abnormal behavior category name to obtain a structured text description; and encapsulating the final abnormal behavior marking map and the structured text description into the structured detection report.
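One way to realize the positional overlap check of claim 9 is intersection over union (IoU) between the bounding boxes of a region in adjacent frames. The Python sketch below takes that approach as an assumption: the claim only refers to the "magnitude of the positional overlap", and the functions `iou` and `is_consistent`, the `(x0, y0, x1, y1)` box format, and the parameters `min_iou` and `min_len` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0


def is_consistent(track, min_iou=0.5, min_len=3):
    """A track passes if it is long enough and every adjacent pair overlaps."""
    if len(track) < min_len:
        return False
    return all(iou(p, q) >= min_iou for p, q in zip(track, track[1:]))
```

A region that appears in only one or two frames, or that jumps across the image between frames, fails the check and is treated as a transient false alarm.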
10. A real-time video abnormal behavior detection system based on deep learning, characterized in that the system is used to implement the deep learning-based real-time video abnormal behavior detection method of claim 1 and comprises: a joint feature extraction module, configured to perform lightweight spatiotemporal joint feature extraction on a sequence of video frames to be processed, taken from the original video stream, to obtain a compact feature map of the video; a reconstruction error module, configured to perform reconstruction error analysis on the compact feature map to obtain a preliminary anomaly score map; and a threshold update and temporal verification module, configured to update the dynamic judgment threshold of the scene in the current video in real time, determine the regions of the preliminary anomaly score map that exceed the dynamic judgment threshold as candidate anomaly regions, mark the candidate anomaly regions that pass temporal consistency verification to form the final abnormal behavior marking map, and generate a structured detection report.