An edge-intelligent method and device for collaborative detection of dynamic and static features in surveillance video

By having edge node devices and central cloud devices work together, and by using lightweight and multi-frame feature prediction neural networks for collaborative detection of dynamic and static features in surveillance videos, the computational and communication burden of anomaly detection in surveillance videos under IoT scenarios is alleviated, achieving efficient and accurate abnormal behavior detection.

CN119007066B · Active · Publication Date: 2026-06-19 · TSINGHUA UNIVERSITY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
TSINGHUA UNIVERSITY
Filing Date
2024-07-26
Publication Date
2026-06-19

Abstract

This application provides a method and apparatus for collaborative detection of dynamic and static features in edge-intelligent surveillance videos. The method includes: acquiring video data using a terminal device; using a lightweight single-frame dynamic feature prediction neural network in the edge node device to perform bidirectional motion constraints and dynamic feature extraction on a target video frame in the video data, obtaining a first reconstructed video frame; performing abnormal dynamic information detection based on the first reconstructed video frame to obtain a dynamic information anomaly detection result; using a multi-frame static feature prediction neural network in a central cloud device to perform spatiotemporal aggregation and decoding on multiple adjacent video frames before and after the target video, obtaining a second reconstructed video frame; performing abnormal static information detection based on the second reconstructed video frame to obtain a static information anomaly detection result; and obtaining a final anomaly detection result based on the dynamic information anomaly detection result and the static information anomaly detection result. This achieves efficient and accurate abnormal behavior detection.

Description

Technical Field

[0001] This application relates to the field of video processing technology, and in particular to an edge-intelligent collaborative detection method and apparatus for dynamic and static features of surveillance video.

Background Technology

[0002] Intelligent surveillance has been widely applied in daily life and industrial scenarios within the Internet of Things (IoT) framework, becoming an important tool in areas such as traffic monitoring, violence detection, and defect detection. However, the massive amounts of data generated can place a huge burden on subsequent storage, communication, and analysis; furthermore, the scarcity of abnormal events from the monitoring perspective exacerbates data redundancy problems. Therefore, automatically detecting abnormal events in surveillance videos is a crucial step in realizing advanced vision systems.

[0003] Due to the scarcity and unbounded definition of anomalous events, collecting and labeling a sufficient number of them is impractical. Unsupervised learning-based methods, however, only require normal samples and do not need labels during training, making them more feasible. In this context, anomalous events are naturally viewed as outliers that deviate from normal patterns.

[0004] Currently, unsupervised anomaly detection methods are mainly based on deep learning techniques using deep neural networks. The application of deep neural networks typically requires intensive computation, and in typical IoT scenarios, the infrastructure struggles to handle the computational and communication burdens brought by large-scale data and models. This has become a bottleneck for the development of efficient intelligent vision systems. Therefore, how to achieve efficient and accurate anomaly detection is a pressing technical problem that needs to be solved.

Summary of the Invention

[0005] In view of the above problems, this application provides an edge-intelligent collaborative detection method and apparatus for dynamic and static features of surveillance video, so as to overcome the above problems or at least partially solve the above problems.

[0006] A first aspect of this application discloses an edge-intelligent collaborative detection method for dynamic and static features of surveillance videos, the method comprising:

[0007] Acquire video data using terminal devices;

[0008] A lightweight single-frame dynamic feature prediction neural network in an edge node device is used to perform bidirectional motion constraints and dynamic feature extraction on the target video frame in the video data to obtain a first reconstructed video frame containing dynamic features of adjacent video frames. Abnormal dynamic information detection is performed based on the first reconstructed video frame to obtain a dynamic information anomaly detection result.

[0009] Using a multi-frame static feature prediction neural network in a central cloud device, spatiotemporal aggregation and decoding are performed on multiple adjacent video frames before and after the target video to obtain a second reconstructed video frame containing the static semantics of multiple adjacent video frames before and after. Based on the second reconstructed video frame, abnormal static information detection is performed to obtain a static information anomaly detection result.

[0010] Based on the dynamic information anomaly detection results and the static information anomaly detection results, the final anomaly detection result is obtained.

[0011] Optionally, the method further includes:

[0012] Monitor the network bandwidth status of the edge node devices and the central cloud devices;

[0013] When the network bandwidth does not meet the real-time requirements, a lightweight single-frame dynamic feature prediction neural network in the edge node device is used for anomaly detection, so as to provide real-time early warning based on the anomaly detection results of dynamic information.

[0014] Based on the dynamic information anomaly detection results and the static information anomaly detection results, the final anomaly detection results are obtained, including:

[0015] When the network bandwidth meets the real-time requirements, a lightweight single-frame dynamic feature prediction neural network in the edge node device and a multi-frame static feature prediction neural network in the central cloud device are used for collaborative anomaly detection to obtain the final anomaly detection result.
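The bandwidth-dependent switching described in paragraphs [0012] to [0015] can be illustrated with a minimal sketch. The application does not state a bandwidth threshold or a fusion rule, so the names `fuse_scores`, `required_mbps` and `static_weight`, the default values, and the weighted average are all assumptions made purely for illustration.

```python
from typing import Optional


def fuse_scores(
    dynamic_score: float,
    static_score: Optional[float],
    bandwidth_mbps: float,
    required_mbps: float = 10.0,  # hypothetical real-time bandwidth threshold
    static_weight: float = 0.5,   # hypothetical fusion weight for the cloud result
) -> float:
    """Return the score used for alerting under the current bandwidth."""
    bandwidth_ok = bandwidth_mbps >= required_mbps
    if not bandwidth_ok or static_score is None:
        # Edge-only mode: real-time early warning from the dynamic result alone.
        return dynamic_score
    # Collaborative mode: combine the edge (dynamic) and cloud (static) results.
    # A weighted average is an assumed rule, not one given in the application.
    return (1.0 - static_weight) * dynamic_score + static_weight * static_score
```

With these assumed defaults, `fuse_scores(0.8, 0.2, bandwidth_mbps=5.0)` falls back to the edge result, while the same call with `bandwidth_mbps=50.0` blends both results.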

[0016] Optionally, the abnormal dynamic information detection based on the first reconstructed video frame and the abnormal static information detection based on the second reconstructed video frame are performed according to the following steps:

[0017] The peak signal-to-noise ratio between the target video frame and the reconstructed video frame is used as the quality evaluation index of the target video frame. The quality evaluation index characterizes the normality of the target video frame. The reconstructed video frame includes: a first reconstructed video frame and a second reconstructed video frame.

[0018] The quality assessment metrics of all target video frames in the video data are normalized to obtain anomaly detection results, which include dynamic information anomaly detection results and static information anomaly detection results.
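The scoring in paragraphs [0017] and [0018] can be sketched as follows. PSNR is computed with its standard definition; the application only says the per-frame values are "normalized", so the min-max normalization over one video, the grayscale nested-list frame layout, and the function names are assumptions.

```python
import math
from typing import List, Sequence

Frame = Sequence[Sequence[float]]  # assumed layout: grayscale rows of pixel values


def psnr(target: Frame, reconstructed: Frame, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio (dB) between a target frame and its reconstruction."""
    squared_error = 0.0
    count = 0
    for row_t, row_r in zip(target, reconstructed):
        for p_t, p_r in zip(row_t, row_r):
            squared_error += (p_t - p_r) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0.0:
        return math.inf  # identical frames; callers should cap this before normalizing
    return 10.0 * math.log10(max_value ** 2 / mse)


def normalize_scores(psnr_values: List[float]) -> List[float]:
    """Min-max normalize finite per-frame PSNR values of one video to [0, 1].

    Higher values indicate more normal frames; min-max scaling is an assumption.
    """
    lo, hi = min(psnr_values), max(psnr_values)
    if hi == lo:
        return [1.0 for _ in psnr_values]
    return [(v - lo) / (hi - lo) for v in psnr_values]
```

The same two functions would apply to both the first and the second reconstructed video frame, giving the dynamic and static results respectively.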

[0019] Optionally, the single-frame dynamic feature prediction neural network includes: a first spatial encoder and a dynamic decoder;

[0020] A lightweight single-frame dynamic feature prediction neural network in an edge node device is used to perform inter-frame bidirectional motion constraints and dynamic feature extraction on target video frames in the video data, resulting in a first reconstructed video frame containing dynamic features of adjacent video frames, including:

[0021] The spatial features of the target video frame are extracted using the first spatial encoder to obtain a bidirectional dynamic feature map associated with the preceding and following video frames of the target video frame.

[0022] The dynamic decoder is used to reconstruct features based on the bidirectional dynamic feature map to obtain a first reconstructed video frame containing dynamic features of adjacent video frames.

[0023] Optionally, a first reconstructed video frame containing dynamic features of consecutive video frames is obtained by using a dynamic decoder to reconstruct features from the bidirectional dynamic feature map, including:

[0024] Based on the bidirectional dynamic feature map and the target video frame, construct the adjacent video frames before and after the target video frame;

[0025] An inverse mapping transformation is performed based on the preceding and following video frames of the target video frame and the bidirectional dynamic feature map to obtain a first reconstructed video frame containing the dynamic features of the preceding and following video frames.

[0026] Optionally, the multi-frame static feature prediction neural network includes: a second spatial encoder, a multi-scale spatiotemporal aggregator, and a static decoder;

[0027] Using a multi-frame static feature prediction neural network in a central cloud device, spatiotemporal aggregation and decoding are performed on multiple adjacent video frames before and after the target video to obtain a second reconstructed video frame containing the static semantics of multiple adjacent video frames before and after, including:

[0028] The spatial features of each adjacent video frame are obtained using the second spatial encoder.

[0029] The spatial features of each adjacent video frame are aggregated using the multi-scale spatiotemporal aggregator to obtain the static appearance features of the target video.

[0030] The static decoder is used to reconstruct video frames from the static appearance features of the target video, resulting in a second reconstructed video frame containing the static semantics of multiple adjacent video frames.

[0031] Optionally, the spatial features include: spatial features of the two adjacent preceding video frames and spatial features of the two adjacent following video frames; the multi-scale spatiotemporal aggregator includes: a forward multi-scale spatiotemporal aggregator and a backward multi-scale spatiotemporal aggregator;

[0032] The spatial features of each adjacent video frame are aggregated using a multi-scale spatiotemporal aggregator to obtain the static appearance features of the target video, including:

[0033] The spatial features of the two adjacent preceding video frames are input into the forward multi-scale spatiotemporal aggregator for aggregation to obtain forward spatiotemporal features;

[0034] The spatial features of the two adjacent following video frames are input into the backward multi-scale spatiotemporal aggregator for aggregation to obtain backward spatiotemporal features;

[0035] Based on the forward spatiotemporal features and the backward spatiotemporal features, the static appearance features of the target video are obtained.
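The data flow of paragraphs [0031] to [0035] can be sketched with placeholder components. The real aggregators are learned multi-scale networks whose internals this passage does not give, so the element-wise mean used here, the flat feature vectors, and the names `mean_aggregate` and `static_appearance_features` are stand-ins assumed only to show how the two branches are wired together.

```python
from typing import Callable, List, Sequence

Feature = List[float]  # assumed: a flattened spatial feature vector
Aggregator = Callable[[Sequence[Feature]], Feature]


def mean_aggregate(features: Sequence[Feature]) -> Feature:
    """Placeholder aggregator: element-wise mean of the given feature vectors."""
    return [sum(values) / len(values) for values in zip(*features)]


def static_appearance_features(
    preceding: Sequence[Feature],  # features of the two preceding frames
    following: Sequence[Feature],  # features of the two following frames
    forward_agg: Aggregator = mean_aggregate,
    backward_agg: Aggregator = mean_aggregate,
) -> Feature:
    """Combine forward and backward spatiotemporal features for the target frame."""
    forward = forward_agg(preceding)    # forward spatiotemporal features
    backward = backward_agg(following)  # backward spatiotemporal features
    # How the two branches are merged is not specified here; a mean is assumed.
    return mean_aggregate([forward, backward])
```

Passing the aggregators as parameters keeps the sketch agnostic about the actual network design: a trained forward or backward aggregator could be substituted without changing the wiring.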

[0036] Optionally, the single-frame dynamic feature prediction neural network and the multi-frame static feature prediction neural network are obtained according to the following dynamic and static joint training steps:

[0037] The first sample reconstructed video frame corresponding to the sample target video frame in the sample video data is obtained according to the single-frame dynamic feature prediction neural network, and the second sample reconstructed video frame corresponding to the sample target video frame is obtained according to the multi-frame static feature prediction neural network.

[0038] A dynamic loss is constructed based on the first sample reconstructed video frame and the sample target video frame, and a static loss is constructed based on the second sample reconstructed video frame and the sample target video frame.

[0039] The network parameters of the single-frame dynamic feature prediction neural network and the network parameters of the multi-frame static feature prediction neural network are updated based on the dynamic loss and the static loss.

[0040] Optionally, the dynamic loss and the static loss include: intensity constraint loss, gradient constraint loss, perceptual constraint loss, and adversarial constraint loss;

[0041] The dynamic loss and the static loss are constructed according to the following steps:

[0042] The intensity constraint loss is obtained based on the Euclidean distance between the sample reconstructed video frame and the sample target video frame. The sample reconstructed video frame includes: a first sample reconstructed video frame and a second sample reconstructed video frame.

[0043] The gradient constraint loss is obtained based on the gradient magnitude similarity between the sample reconstructed video frame and the sample target video frame.

[0044] The perceptual constraint loss is obtained based on the local structural similarity error between the sample reconstructed video frame and the sample target video frame.

[0045] Adversarial learning is performed on the sample reconstructed video frame and the sample target video frame to obtain the adversarial constraint loss.
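Two of the four loss terms in paragraphs [0042] and [0043] can be illustrated with a minimal sketch. The application gives no formulas, so several choices here are assumptions: the mean squared pixel distance for the intensity term, forward-difference gradients, the commonly used gradient magnitude similarity expression (2ab + c) / (a² + b² + c), the constant `c`, and the weights in `combined_loss`. The perceptual and adversarial terms are omitted.

```python
import math
from typing import List

Frame = List[List[float]]  # assumed layout: grayscale rows of pixel values


def intensity_loss(recon: Frame, target: Frame) -> float:
    """Mean squared (Euclidean) pixel distance between two frames."""
    total, n = 0.0, 0
    for row_r, row_t in zip(recon, target):
        for r, t in zip(row_r, row_t):
            total += (r - t) ** 2
            n += 1
    return total / n


def gradient_magnitude(frame: Frame) -> Frame:
    """Forward-difference gradient magnitude; differences past the border are zero."""
    h, w = len(frame), len(frame[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            gx = frame[y][x + 1] - frame[y][x] if x + 1 < w else 0.0
            gy = frame[y + 1][x] - frame[y][x] if y + 1 < h else 0.0
            row.append(math.hypot(gx, gy))
        out.append(row)
    return out


def gradient_loss(recon: Frame, target: Frame, c: float = 1e-3) -> float:
    """One minus the mean gradient magnitude similarity; c avoids division by zero."""
    g_r, g_t = gradient_magnitude(recon), gradient_magnitude(target)
    total, n = 0.0, 0
    for row_r, row_t in zip(g_r, g_t):
        for a, b in zip(row_r, row_t):
            total += (2.0 * a * b + c) / (a * a + b * b + c)
            n += 1
    return 1.0 - total / n


def combined_loss(recon: Frame, target: Frame,
                  w_int: float = 1.0, w_grad: float = 1.0) -> float:
    """Weighted sum of the two sketched terms; the weights are hypothetical."""
    return w_int * intensity_loss(recon, target) + w_grad * gradient_loss(recon, target)
```

The same functions would be evaluated once with the first sample reconstructed video frame (dynamic loss) and once with the second (static loss).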

[0046] A second aspect of this application discloses an edge-intelligent collaborative detection device for dynamic and static features of surveillance videos, the device comprising:

[0047] The data acquisition module is used to acquire video data using terminal devices;

[0048] The first detection module is used to use a lightweight single-frame dynamic feature prediction neural network in the edge node device to perform inter-frame bidirectional motion constraints and dynamic feature extraction on the target video frame in the video data, to obtain a first reconstructed video frame containing the dynamic features of adjacent video frames, and to perform abnormal dynamic information detection based on the first reconstructed video frame to obtain a dynamic information anomaly detection result.

[0049] The second detection module is used to use a multi-frame static feature prediction neural network in the central cloud device to perform spatiotemporal aggregation and decoding processing on multiple adjacent video frames before and after the target video to obtain a second reconstructed video frame containing the static semantics of multiple adjacent video frames before and after, and to perform abnormal static information detection based on the second reconstructed video frame to obtain a static information abnormality detection result.

[0050] The final result module is used to obtain the final anomaly detection result based on the dynamic information anomaly detection result and the static information anomaly detection result.

[0051] The embodiments of this application have the following advantages:

[0052] In this embodiment, combining the real-time characteristics of edge computing architecture, and based on the observation results of the differences between abnormal events and normal events in dynamic and static spaces, a collaborative detection method for dynamic and static features of surveillance video using edge intelligence is proposed. Video data is acquired using a terminal device; a lightweight single-frame dynamic feature prediction neural network in the edge node device is used to perform bidirectional motion constraints and dynamic feature extraction on the target video frame in the video data, resulting in a first reconstructed video frame containing the dynamic features of adjacent video frames. Anomaly detection of dynamic information is then performed based on the first reconstructed video frame to obtain a dynamic information anomaly detection result. A multi-frame static feature prediction neural network in the central cloud device is used to perform spatiotemporal aggregation and decoding on multiple adjacent video frames before and after the target video, resulting in a second reconstructed video frame containing the static semantics of multiple adjacent video frames. Anomaly detection of static information is then performed based on the second reconstructed video frame to obtain a static information anomaly detection result. Finally, the final anomaly detection result is obtained based on the dynamic information anomaly detection result and the static information anomaly detection result.

[0053] Thus, a cloud-edge collaborative abnormal behavior detection framework is implemented to improve the efficiency of video anomaly detection in IoT scenarios. By using a lightweight single-frame dynamic feature prediction neural network in the edge node device, the appearance dynamic information of a single target video frame (i.e., the dynamic features of adjacent video frames) can be obtained. Anomaly detection is then performed based on this dynamic information, yielding dynamic information anomaly detection results. Since the lightweight single-frame dynamic feature prediction neural network in the edge node device processes single target video frames from the video data, rapid abnormal behavior detection can be achieved in the edge node device.

[0054] Furthermore, by using a multi-frame static feature prediction neural network in the central cloud device to obtain the static semantics of several adjacent video frames before and after the target video frame, more accurate anomaly detection can be achieved based on the static semantics of these adjacent video frames. Finally, by combining the dynamic information anomaly detection results and the static information anomaly detection results, an accurate final anomaly detection result is obtained. This enables efficient and accurate anomaly behavior detection.

Attached Figure Description

[0055] To more clearly illustrate the technical solutions of the embodiments of this application, the drawings used in the description of the embodiments of this application will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0056] Figure 1 is a schematic diagram of an abnormal behavior detection framework for cloud-edge collaborative execution provided in an embodiment of this application;

[0057] Figure 2 is a flowchart illustrating the steps of a collaborative detection method for dynamic and static features of surveillance video using edge intelligence, as provided in an embodiment of this application;

[0058] Figure 3 is a flowchart illustrating the steps of another edge-intelligent collaborative detection method for dynamic and static features of surveillance video provided in an embodiment of this application;

[0059] Figure 4 is an overall architecture diagram of a collaborative detection method for dynamic and static features of surveillance video using edge intelligence, provided in an embodiment of this application;

[0060] Figure 5 is a schematic diagram of the structure of an edge-intelligent surveillance video dynamic and static feature collaborative detection device provided in an embodiment of this application.

Detailed Implementation

[0061] To make the above-mentioned objectives, features, and advantages of this application more apparent and understandable, the technical solutions in the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0062] In related technologies, anomaly detection is performed based solely on the dynamic or static features of video data, without considering the use of combined dynamic and static features for anomaly detection, resulting in poor accuracy. Furthermore, the network structure of existing methods is relatively complex and does not take into account issues of computational performance and real-time performance.

[0063] Therefore, to address the high-performance requirements of intelligent security systems for storing and analyzing large amounts of video data, this application proposes an edge-intelligent collaborative detection method for dynamic and static features of surveillance videos, based on the observation results of the differences between abnormal events and normal events in dynamic and static spaces, leveraging the real-time characteristics of edge computing architecture. This application deploys a lightweight single-frame dynamic feature prediction neural network on edge node devices to achieve low-latency abnormal behavior detection (i.e., obtaining dynamic information anomaly detection results), while deploying a multi-frame static feature prediction neural network on the central cloud device to correct the dynamic information anomaly detection results, thereby achieving efficient and accurate abnormal behavior detection.

[0064] The following description, in conjunction with the accompanying drawings, details the edge-intelligent collaborative detection method for dynamic and static features of surveillance video provided in this application.

[0065] Referring to Figure 1, Figure 1 is a schematic diagram of a cloud-edge collaborative abnormal behavior detection framework provided in an embodiment of this application. This application proposes a cloud-edge collaborative abnormal behavior detection framework to improve the efficiency of video anomaly detection in IoT scenarios. As shown in Figure 1(a), video data is acquired using a sensing device located at the terminal (i.e., the terminal device). To improve computing power, an edge node device with certain computing capabilities is configured at the edge, a short distance from the sensing device, for local downstream task analysis with shallow networks in the industrial field (e.g., early warning, real-time feedback, and privacy protection task processing). In one specific implementation, the edge node device is a Mini PC. To ensure smooth communication, the sensing device and the Mini PC are connected via a serial port. The Mini PC's hardware environment consists of an Intel i9 10900T CPU @ 1.9GHz processor and 16GB of memory.

[0066] To receive signals from the edge, the Mini PC connects via broadband to a central cloud device with powerful computing capabilities for global task processing in deep networks (e.g., model training and global system maintenance). In one specific implementation, two high-performance servers are used as the central cloud device, with hardware including an Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz processor, 32GB of RAM, and an NVIDIA GeForce A100 graphics card with 40GB of video memory. This cloud-edge collaborative abnormal behavior detection framework can improve the system's processing power to meet the high-performance analysis requirements of intelligent security systems.

[0067] Specifically, the planning of the anomaly detection task in this embodiment is as shown in Figure 1(b): the terminal device is mainly used for video data acquisition tasks; the edge node device is responsible for dynamic feature extraction and diagnosis tasks based on single-frame static information, as well as local downstream task analysis of shallow networks; the central cloud device is used for static feature prediction and diagnosis tasks based on multi-frame dynamic information; and the end user terminal can perform upper-layer tasks such as result analysis.

[0068] Referring to Figure 2, Figure 2 is a flowchart illustrating the steps of a collaborative detection method for dynamic and static features in edge-intelligent surveillance video provided in an embodiment of this application. As shown in Figure 2, the collaborative detection method for dynamic and static features of surveillance video using edge intelligence may include steps S110 to S140:

[0069] Step S110: Use the terminal device to acquire video data.

[0070] In this embodiment, the terminal device refers to a sensing device deployed on a terminal, including fixed-view surveillance cameras, cameras, and drones. The terminal device has communication connections with edge node devices and central cloud devices. After acquiring video data, the terminal device sends the collected video data to the edge node devices and central cloud devices through the communication connections.

[0071] Step S120: Utilize a lightweight single-frame dynamic feature prediction neural network in the edge node device to perform inter-frame bidirectional motion constraints and dynamic feature extraction on the target video frame in the video data, thereby obtaining a first reconstructed video frame containing dynamic features of adjacent video frames. Based on the first reconstructed video frame, perform abnormal dynamic information detection to obtain an abnormal dynamic information detection result.

[0072] In this embodiment, the edge node device refers to a computing device with certain computing capabilities configured at the edge. A lightweight single-frame dynamic feature prediction neural network is deployed in the edge node device. Utilizing the inherent consistency semantics between static and dynamic features, the target video frame is processed based on the single-frame dynamic feature prediction neural network. Specifically, for each target video frame (i.e., each video frame to be detected) in the video data, the lightweight single-frame dynamic feature prediction neural network performs bidirectional motion constraints and dynamic feature extraction to obtain the appearance dynamic information of the target video frame (i.e., the dynamic features of adjacent video frames). Based on the appearance dynamic information of the target video frame, a first reconstructed video frame is obtained. Finally, abnormal dynamic information is detected based on the difference between the first reconstructed video frame and the target video frame. Since the lightweight single-frame dynamic feature prediction neural network in the edge node device processes single target video frames in the video data, rapid abnormal behavior detection can be achieved in the edge node device.

[0073] In an optional embodiment, the single-frame dynamic feature prediction neural network includes: a first spatial encoder and a dynamic decoder; utilizing a lightweight single-frame dynamic feature prediction neural network in an edge node device, inter-frame bidirectional motion constraints and dynamic feature extraction are performed on the target video frame in the video data to obtain a first reconstructed video frame containing the dynamic features of adjacent video frames, including steps A1 to A2:

[0074] Step A1: Use the first spatial encoder to extract the spatial features of the target video frame to obtain a bidirectional dynamic feature map associated with the preceding and following video frames of the target video frame.

[0075] Step A2: Use the dynamic decoder to reconstruct features based on the bidirectional dynamic feature map to obtain a first reconstructed video frame containing dynamic features of adjacent video frames.

[0076] In this embodiment, the spatial features of the target video frame are extracted by the first spatial encoder (a 2D spatial encoder) and rendered as a bidirectional dynamic feature map associated with the preceding and following video frames of the target video frame. The bidirectional dynamic feature map includes dynamic features (i.e., motion flow features) from the target video frame to the adjacent preceding video frame and dynamic features from the target video frame to the adjacent following video frame. Using the bidirectional dynamic feature map, the dynamic decoder can convert the target video frame into its preceding and following video frames, from which the first reconstructed video frame corresponding to the target video frame is reconstructed.

[0077] Specifically, the first reconstructed video frame containing dynamic features of adjacent video frames is obtained by using a dynamic decoder to reconstruct features based on the bidirectional dynamic feature map. This includes: constructing adjacent video frames of the target video frame based on the bidirectional dynamic feature map and the target video frame; and performing an inverse mapping transformation based on the adjacent video frames of the target video frame and the bidirectional dynamic feature map to obtain the first reconstructed video frame containing dynamic features of adjacent video frames.

[0078] For example, adjacent video frames are represented as follows:

[0079] $I_{T\pm1} = W(I_T, M)$

[0080] where $M = \{M_{T\to T-1}, M_{T\to T+1}\}$ denotes the bidirectional dynamic feature map; $M_{T\to T-1}$ denotes the dynamic features from the target video frame $I_T$ to the adjacent preceding video frame $I_{T-1}$; and $M_{T\to T+1}$ denotes the dynamic features from the target video frame $I_T$ to the adjacent following video frame $I_{T+1}$. Using the bidirectional dynamic feature map $M$, the target video frame can be transformed into its adjacent preceding and following video frames. $W$ denotes the mapping transformation that, based on the bidirectional dynamic feature map $M$, maps each position of the target video frame $I_T$ to the adjacent preceding and following video frames $I_{T\pm1}$.
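As an illustration of a mapping transformation of the kind denoted by W, the sketch below builds a frame by sampling the target frame at displaced positions. The application does not specify the form of W, so the per-pixel integer `(dy, dx)` displacement layout, nearest-neighbor sampling, border clamping, and the name `warp` are all assumptions; a practical implementation would more likely use a learned sub-pixel motion field with bilinear interpolation.

```python
from typing import List, Tuple

Frame = List[List[float]]            # assumed layout: grayscale rows of pixel values
Flow = List[List[Tuple[int, int]]]   # hypothetical per-pixel (dy, dx) displacement


def warp(frame: Frame, flow: Flow) -> Frame:
    """Build a new frame by sampling `frame` at positions displaced by `flow`."""
    h, w = len(frame), len(frame[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            dy, dx = flow[y][x]
            src_y = min(max(y + dy, 0), h - 1)  # clamp to the frame border
            src_x = min(max(x + dx, 0), w - 1)
            row.append(frame[src_y][src_x])
        out.append(row)
    return out
```

Under this convention, calling `warp` once with a displacement field standing in for the map toward the preceding frame and once with one standing in for the map toward the following frame would yield the two constructed adjacent frames.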

[0081] For example, the first reconstructed video frame is represented as:

[0082]

[0083] Here, the left-hand side denotes the first reconstructed video frame; $W^{-1}$ denotes the inverse mapping transformation based on the bidirectional dynamic feature map; $I_t$ denotes the video frame at time $t$: when $t = T-1$, $I_t$ is the adjacent preceding video frame $I_{T-1}$ of the target video frame $I_T$, and when $t = T+1$, $I_t$ is the adjacent following video frame $I_{T+1}$ of the target video frame $I_T$; $M_{T\to t}$ denotes the dynamic features from the target video frame $I_T$ to the adjacent video frame $I_t$.

[0084] In this embodiment of the application, in order to utilize the semantic correlation between dynamic and static signals, a single-frame dynamic feature prediction neural network composed of a spatial encoder and a dynamic decoder is used to generate a bidirectional dynamic feature map as the target. The bidirectional dynamic feature map is converted into a first reconstructed video frame using an inverse mapping transformation. Finally, the difference between the first reconstructed video frame and the target video frame is used as an anomaly criterion.

[0085] Step S130: Using the multi-frame static feature prediction neural network in the central cloud device, spatiotemporal aggregation and decoding processing are performed on the adjacent video frames before and after the target video to obtain a second reconstructed video frame containing the static semantics of the adjacent video frames before and after. Abnormal static information detection is performed based on the second reconstructed video frame to obtain the static information anomaly detection result.

[0086] In this embodiment, the central cloud device refers to a high-performance computing device deployed in the cloud, and a multi-frame static feature prediction neural network is deployed in the central cloud device. To utilize spatiotemporal normality, the static features of the target video frame are predicted using the contextual semantics of several adjacent video frames preceding and following the target video frame. Specifically, for several adjacent video frames preceding and following the target video frame, the multi-frame static feature prediction neural network acquires the spatial features of each video frame, and then performs spatiotemporal aggregation and decoding on the spatial features of these adjacent video frames to obtain a second reconstructed video frame containing the static semantics of the adjacent video frames. Finally, abnormal static information detection is performed based on the difference between the second reconstructed video frame and the target video frame. Since the multi-frame static feature prediction neural network acquires the static semantics of several adjacent video frames preceding and following the target video frame, more accurate anomaly detection can be achieved based on the static semantics of these adjacent video frames.

[0087] In an optional embodiment, the multi-frame static feature prediction neural network includes: a second spatial encoder, a multi-scale spatiotemporal aggregator, and a static decoder; using the multi-frame static feature prediction neural network in the central cloud device, spatiotemporal aggregation and decoding are performed on multiple adjacent video frames before and after the target video to obtain a second reconstructed video frame containing the static semantics of multiple adjacent video frames before and after, including steps B1 to B3:

[0088] Step B1: Use the second spatial encoder to obtain the spatial features of each adjacent video frame.

[0089] Step B2: Use the multi-scale spatiotemporal aggregator to aggregate the spatial features of each adjacent video frame to obtain the static appearance features of the target video.

[0090] Step B3: Use the static decoder to reconstruct the static appearance features of the target video to obtain a second reconstructed video frame containing the static semantics of multiple adjacent video frames.

[0091] In this embodiment, the second spatial encoder is a weighted 2D convolutional spatial encoder. It extracts features from each adjacent video frame to obtain the spatial features of each frame. Then, based on the continuity of motion, a multi-scale spatiotemporal aggregator is used to predict the static appearance features of the target video frame by simultaneously combining the spatial features of preceding and following video frames. A static decoder reconstructs the target video frame from its static appearance features to generate a second reconstructed video frame. The difference between the second reconstructed video frame and the target video frame is used as an anomaly criterion.

[0092] Specifically, the spatial features include: the spatial features of the two adjacent video frames preceding the target video frame and the spatial features of the two adjacent video frames following the target video frame; the multi-scale spatiotemporal aggregator includes: a forward multi-scale spatiotemporal aggregator and a backward multi-scale spatiotemporal aggregator.

[0093] The spatial features of each adjacent video frame are aggregated using a multi-scale spatiotemporal aggregator to obtain the static appearance features of the target video. This includes: inputting the spatial features of the first two adjacent video frames into the forward multi-scale spatiotemporal aggregator for aggregation to obtain forward spatiotemporal features; inputting the spatial features of the last two adjacent video frames into the backward multi-scale spatiotemporal aggregator for aggregation to obtain backward spatiotemporal features; and obtaining the static appearance features of the target video based on the forward and backward spatiotemporal features.

[0094] In this embodiment, the spatial features of the two preceding adjacent video frames and of the two following adjacent video frames are fed into the forward multi-scale spatiotemporal aggregator and the backward multi-scale spatiotemporal aggregator, respectively, so that spatial features are aggregated bidirectionally, yielding forward and backward spatiotemporal features. Each multi-scale spatiotemporal aggregator consists of a stack of two convolutional gated recurrent units (ConvGRUs). Each ConvGRU receives the hidden state from the previous time step, and at each time step the spatial features at four scales are propagated to the next time step.

[0095] For example, the computation performed on the spatial features by a convolutional gated recurrent unit can be represented as follows:

[0096]

[0097] Here, the operator "*" represents the convolution operation; the operator ⊙ represents element-wise multiplication; tanh represents the hyperbolic tangent; σ represents the sigmoid activation function; $w_{zx}$, $w_{zh}$, $w_{rx}$, $w_{rh}$, $w_{ox}$, and $w_{oh}$ all represent weight matrices; $b_z$, $b_r$, and $b_o$ all represent bias vectors; $z_t^{(i)}$ represents the update gate feature at time t for the i-th scale; $r_t^{(i)}$ represents the reset gate feature at time t for the i-th scale; $\tilde{h}_t^{(i)}$ represents the candidate hidden state feature at time t for the i-th scale; and $h_t^{(i)}$ represents the output hidden state feature at time t for the i-th scale.
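As a hedged illustration, one commonly used ConvGRU formulation that is consistent with the weight and bias names above is sketched below. The input symbol $F_t^{(i)}$ (the spatial feature at the i-th scale at time t), the state symbols $z$, $r$, $\tilde{h}$, and $h$, and the arrangement of the update gate in the last line are assumptions rather than details confirmed by this text; ⊙ denotes element-wise multiplication.

```latex
\begin{aligned}
z_t^{(i)} &= \sigma\left(w_{zx} * F_t^{(i)} + w_{zh} * h_{t-1}^{(i)} + b_z\right) \\
r_t^{(i)} &= \sigma\left(w_{rx} * F_t^{(i)} + w_{rh} * h_{t-1}^{(i)} + b_r\right) \\
\tilde{h}_t^{(i)} &= \tanh\left(w_{ox} * F_t^{(i)} + w_{oh} * \left(r_t^{(i)} \odot h_{t-1}^{(i)}\right) + b_o\right) \\
h_t^{(i)} &= \left(1 - z_t^{(i)}\right) \odot h_{t-1}^{(i)} + z_t^{(i)} \odot \tilde{h}_t^{(i)}
\end{aligned}
```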

[0098] Finally, based on the forward and backward spatiotemporal features, the static appearance features of the target video are obtained. The static appearance features of the target video include output hidden states at multiple scales. The output hidden state for each scale is represented as follows:

[0099]

[0100] Here, the two terms are the output hidden state of the forward convolutional gated recurrent unit at the i-th scale and the output hidden state of the backward convolutional gated recurrent unit at the i-th scale, respectively.

[0101] Step S140: Based on the dynamic information anomaly detection results and the static information anomaly detection results, the final anomaly detection result is obtained.

[0102] In this embodiment, the dynamic information anomaly detection result is a result quickly obtained by the edge node device based on a single target video frame, while the static information anomaly detection result is a result obtained by the central cloud device based on the static semantic information of several adjacent video frames before and after the target video frame. The dynamic information anomaly detection result is corrected using the static information anomaly detection result to obtain a more accurate final anomaly detection result.

[0103] In this embodiment, a lightweight single-frame dynamic feature prediction neural network in the edge node device can acquire the appearance dynamic information of a single target video frame. Anomaly detection is then performed based on this dynamic information, yielding a dynamic information anomaly detection result. Since the lightweight single-frame dynamic feature prediction neural network in the edge node device processes a single target video frame from the video data, rapid anomaly detection can be achieved in the edge node device. Furthermore, a multi-frame static feature prediction neural network in the central cloud device acquires the static semantics of several adjacent video frames before and after the target video frame. More accurate anomaly detection can be achieved based on the static semantics of these adjacent video frames. Finally, by combining the dynamic information anomaly detection result and the static information anomaly detection result, an accurate final anomaly detection result is obtained. This achieves efficient and accurate anomaly behavior detection.

[0104] In one optional embodiment, the anomaly detection mode is determined based on the network bandwidth status of the edge node device and the central cloud device. Refer to Figure 3, which is a flowchart of another edge-intelligent collaborative detection method for dynamic and static features of surveillance video provided in an embodiment of this application. The method includes steps S310 to S330:

[0105] Step S310: Monitor the network bandwidth status of the edge node device and the central cloud device.

[0106] Step S320: When the network bandwidth status does not meet the real-time requirements, anomaly detection is performed using a lightweight single-frame dynamic feature prediction neural network in the edge node device, so as to provide real-time early warning based on the anomaly detection results of dynamic information.

[0107] Step S330: When the network bandwidth status meets the real-time requirements, collaborative anomaly detection is performed using a lightweight single-frame dynamic feature prediction neural network in the edge node device and a multi-frame static feature prediction neural network in the central cloud device to obtain the final anomaly detection result.

[0108] In this embodiment, the single-frame dynamic feature prediction neural network and the multi-frame static feature prediction neural network are assigned to different devices (edge node devices and central cloud devices) to optimize the computational burden of abnormal behavior detection. Since the single-frame dynamic feature prediction neural network follows a lightweight design principle, requiring only processing a single target video frame, it is deployed on edge node devices, which are typically located close to the terminal device and can directly acquire video data. In contrast, the multi-frame static feature prediction neural network has relatively high computational complexity; therefore, it is assigned to the central cloud device, which has more powerful computing capabilities, and acquires data through communication transmission. With the edge node devices and central cloud devices collaborating, the single-frame dynamic feature prediction neural network and the multi-frame static feature prediction neural network can perform parallel inference.

[0109] Because communication bandwidth is unstable and computing resources on edge node devices are limited, the end-to-end latency of the single-frame dynamic feature prediction neural network and the multi-frame static feature prediction neural network varies with the network bandwidth status. If the network is congested (i.e., the network bandwidth status does not meet the real-time requirements), step S320 is executed: the lightweight single-frame dynamic feature prediction neural network in the edge node device performs anomaly detection, so that its faster response provides an early warning. If the network bandwidth status is good (i.e., the network bandwidth status meets the real-time requirements), video data can be uploaded to both the edge node device and the central cloud device, and step S330 is executed: the lightweight single-frame dynamic feature prediction neural network in the edge node device and the multi-frame static feature prediction neural network in the central cloud device perform collaborative anomaly detection to obtain a high-precision final anomaly detection result.
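The bandwidth-dependent choice between edge-only early warning and cloud-edge collaborative detection can be sketched as follows. This is a minimal illustration, not the claimed implementation: the threshold test on a measured bandwidth value, the names `select_mode` and `detect`, and the two scoring callables are all hypothetical, since the text does not specify how the real-time requirement is evaluated.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class DetectionResult:
    mode: str                       # "edge_only" or "collaborative"
    dynamic_score: float            # score from the edge (single-frame) network
    static_score: Optional[float]   # score from the cloud (multi-frame) network, if used


def select_mode(bandwidth_mbps: float, required_mbps: float) -> str:
    """Assumed rule: bandwidth at or above the requirement counts as real-time capable."""
    return "collaborative" if bandwidth_mbps >= required_mbps else "edge_only"


def detect(
    bandwidth_mbps: float,
    required_mbps: float,
    edge_score_fn: Callable[[object], float],
    cloud_score_fn: Callable[[Sequence[object]], float],
    target_frame: object,
    adjacent_frames: Sequence[object],
) -> DetectionResult:
    mode = select_mode(bandwidth_mbps, required_mbps)
    # The edge network always runs, because it only needs the single target frame.
    dynamic = edge_score_fn(target_frame)
    # The cloud network runs only when the link can carry the adjacent frames in time.
    static = cloud_score_fn(adjacent_frames) if mode == "collaborative" else None
    return DetectionResult(mode, dynamic, static)
```

In the edge-only case the caller would raise an early warning from `dynamic_score` alone; in the collaborative case both scores are available for fusion.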

[0110] In an optional embodiment, the abnormal dynamic information detection based on the first reconstructed video frame and the abnormal static information detection based on the second reconstructed video frame are performed according to the following steps C1 to C2:

[0111] Step C1: Use the peak signal-to-noise ratio between the target video frame and the reconstructed video frame as the quality evaluation index of the target video frame. The quality evaluation index characterizes the normality of the target video frame. The reconstructed video frame includes: a first reconstructed video frame and a second reconstructed video frame.

[0112] Step C2: Normalize the quality assessment metrics of all target video frames in the video data to obtain anomaly detection results, which include dynamic information anomaly detection results and static information anomaly detection results.

[0113] In this embodiment, a first reconstructed video frame is reconstructed using a single-frame dynamic feature prediction neural network, and a second reconstructed video frame is reconstructed using a multi-frame static feature prediction neural network. Anomaly detection is performed based on the error between the reconstructed video frame and the target video frame. Specifically, the error between the target video frame and the reconstructed video frame is measured by the peak signal-to-noise ratio (PSNR), thus the PSNR between the target video frame and the reconstructed video frame is used as the quality evaluation index of the target video frame.

[0114] For example, the peak signal-to-noise ratio between the target video frame and the reconstructed video frame is expressed as:

[0115]

[0116] Here, PSNR denotes the peak signal-to-noise ratio, which serves as the quality assessment metric and characterizes the normality of the target video frame: the higher its value, the greater the probability that the target video frame is normal. $I_t$ denotes the target video frame; the reconstructed video frame is either the first reconstructed video frame or the second reconstructed video frame; and N represents the number of target video frames in the video data.

[0117] Then, the quality assessment metrics of all target video frames in the video data are normalized to the range [0,1] to obtain the anomaly detection result (i.e., the normality score), expressed as:

[0118]

[0119] Here, the minimum term is the smallest quality assessment metric among all target video frames in the video data, and the maximum term is the largest quality assessment metric among all target video frames in the video data.
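Steps C1 and C2 can be illustrated with the following sketch, which computes a PSNR per frame and then min-max normalizes the values over the video. It uses the standard PSNR definition with a fixed peak value; the `peak` parameter, the small floor on the mean squared error, and the representation of a frame as a flat sequence of pixel values are assumptions made for the example and may differ from the formula in this application.

```python
import math
from typing import List, Sequence


def psnr(target: Sequence[float], recon: Sequence[float], peak: float = 1.0) -> float:
    """Standard PSNR in dB between two equally sized frames (flat pixel sequences)."""
    if len(target) != len(recon) or len(target) == 0:
        raise ValueError("frames must be non-empty and have the same size")
    mse = sum((a - b) ** 2 for a, b in zip(target, recon)) / len(target)
    mse = max(mse, 1e-12)  # assumed floor so identical frames give a finite value
    return 10.0 * math.log10(peak ** 2 / mse)


def normalize_scores(values: Sequence[float]) -> List[float]:
    """Min-max normalize per-frame PSNR values to [0, 1] (higher means more normal)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Degenerate case: every frame has the same quality; treat all as normal.
        return [1.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

A frame whose normalized score is close to 0 is the one reconstructed worst relative to the rest of the video, and is therefore the most likely to be abnormal.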

[0120] Finally, obtaining the final anomaly detection result based on the dynamic information anomaly detection result and the static information anomaly detection result includes: performing weighted fusion, that is, computing a weighted sum of the dynamic information anomaly detection result and the static information anomaly detection result, to obtain the final anomaly detection result.

[0121] For example, the final anomaly detection result S(T) is expressed as:

[0122]

[0123] Here, the two fused terms are the dynamic information anomaly detection result and the static information anomaly detection result, and λ represents the fusion weight.
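A weighted fusion of the two normalized scores might look like the sketch below. The convex combination with weights `lam` and `1 - lam`, and the choice of applying `lam` to the dynamic score, are assumptions: the text only states that the two results are weighted and summed with a fusion weight λ.

```python
from typing import List, Sequence


def fuse_scores(dynamic: Sequence[float], static: Sequence[float], lam: float = 0.5) -> List[float]:
    """Per-frame weighted sum of dynamic and static normality scores (assumed convex form)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(dynamic) != len(static):
        raise ValueError("score sequences must have the same length")
    return [lam * d + (1.0 - lam) * s for d, s in zip(dynamic, static)]
```

With both inputs in [0, 1], the fused score also stays in [0, 1], so a single threshold can be applied to it.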

[0124] In an optional embodiment, the single-frame dynamic feature prediction neural network and the multi-frame static feature prediction neural network are obtained according to the following dynamic and static joint training steps D1 to D3:

[0125] Step D1: Obtain the first sample reconstructed video frame corresponding to the sample target video frame in the sample video data according to the single-frame dynamic feature prediction neural network, and obtain the second sample reconstructed video frame corresponding to the sample target video frame according to the multi-frame static feature prediction neural network.

[0126] Step D2: Construct a dynamic loss based on the first sample reconstructed video frame and the sample target video frame, and construct a static loss based on the second sample reconstructed video frame and the sample target video frame.

[0127] Step D3: Update the network parameters of the single-frame dynamic feature prediction neural network and the multi-frame static feature prediction neural network based on the dynamic loss and the static loss.

[0128] In this embodiment, to achieve accurate prediction of normal events during the training phase, a joint dynamic and static training strategy is adopted. The second sample reconstructed video frame generated by the multi-frame static feature prediction neural network is used directly to approximate the sample target video frame $I_T$. The first sample reconstructed video frame is reconstructed from the bidirectional dynamic feature map M generated by the single-frame dynamic feature prediction neural network, which ensures semantic consistency between static and dynamic features.

[0129] To make the reconstructed video frames similar to the target video frames, a dynamic loss is constructed based on the first reconstructed video frames and the target video frames, and a static loss is constructed based on the second reconstructed video frames and the target video frames. The network parameters of the single-frame dynamic feature prediction neural network and the multi-frame static feature prediction neural network are then updated based on the dynamic loss and the static loss to complete the training.

[0130] Specifically, the dynamic loss and the static loss each include: intensity constraint loss, gradient constraint loss, perceptual constraint loss, and adversarial constraint loss.

[0131] For example, the dynamic loss and static loss are expressed as follows:

[0132] $l = \omega_{int}\, l_{int} + \omega_{grad}\, l_{grad} + \omega_{per}\, l_{per} + \omega_{adv}\, l_{adv}$

[0133] Where $l$ represents the dynamic loss or the static loss; $l_{int}$ represents the intensity constraint loss; $l_{grad}$ represents the gradient constraint loss; $l_{per}$ represents the perceptual constraint loss; $l_{adv}$ represents the adversarial constraint loss; and $\omega_{int}$, $\omega_{grad}$, $\omega_{per}$, and $\omega_{adv}$ are hyperparameters representing the relative contributions of the intensity constraint loss, the gradient constraint loss, the perceptual constraint loss, and the adversarial constraint loss, respectively.
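The weighted combination of the four constraint terms can be sketched as below. The dictionary keys and the function name `combined_loss` are illustrative assumptions; the weight values are placeholders, as no hyperparameter values are given in this text.

```python
from typing import Mapping

LOSS_TERMS = ("int", "grad", "per", "adv")


def combined_loss(terms: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted sum of intensity, gradient, perceptual and adversarial constraint losses."""
    missing = [k for k in LOSS_TERMS if k not in terms or k not in weights]
    if missing:
        raise KeyError(f"missing loss terms or weights: {missing}")
    return sum(weights[k] * terms[k] for k in LOSS_TERMS)
```

The same function would be evaluated once with the terms computed from the first sample reconstructed video frame (dynamic loss) and once with those from the second sample reconstructed video frame (static loss).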

[0134] Specifically, the dynamic loss and the static loss are constructed according to steps E1 to E4:

[0135] Step E1: Obtain the intensity constraint loss based on the Euclidean distance between the sample reconstructed video frame and the sample target video frame. The sample reconstructed video frame includes: a first sample reconstructed video frame and a second sample reconstructed video frame.

[0136] In this embodiment, the Euclidean distance between two frames (i.e., the reconstructed video frame and the target video frame) in the intensity space is used as a basic commonality constraint to obtain the intensity constraint loss.

[0137] For example, the intensity constraint loss is expressed as:

[0138]

[0139] Here, the sample reconstructed video frame is either the first sample reconstructed video frame or the second sample reconstructed video frame, and $I_T$ represents the sample target video frame.

[0140] Step E2: Based on the gradient magnitude similarity between the reconstructed video frame and the target video frame, obtain the gradient constraint loss.

[0141] In this embodiment, gradient magnitude similarity is used as another penalty signal to ensure the sharpness preservation of the reconstructed video frame, resulting in gradient constraint loss.

[0142] For example, the gradient-constrained loss is expressed as:

[0143]

[0144] Where $C_0$ represents a constant that ensures numerical stability, and the gradient operator denotes computing the gradient magnitude map using a 3×3 Prewitt filter.
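The gradient constraint can be illustrated with the following sketch, which computes Prewitt gradient magnitude maps and a gradient magnitude similarity term. The similarity ratio used here is the commonly used form, and taking one minus its mean as the loss is an assumption; the unnormalized kernels, the valid-mode border handling, and the default value of `c0` are likewise illustrative choices.

```python
import math
from typing import List

Frame = List[List[float]]

PREWITT_X = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
PREWITT_Y = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]


def _filter3x3(img: Frame, kernel: List[List[int]]) -> Frame:
    """Valid-mode 3x3 cross-correlation; the output shrinks by 2 in each dimension."""
    h, w = len(img), len(img[0])
    return [
        [
            sum(kernel[di][dj] * img[i + di][j + dj] for di in range(3) for dj in range(3))
            for j in range(w - 2)
        ]
        for i in range(h - 2)
    ]


def gradient_magnitude(img: Frame) -> Frame:
    """Per-pixel gradient magnitude from horizontal and vertical Prewitt responses."""
    gx, gy = _filter3x3(img, PREWITT_X), _filter3x3(img, PREWITT_Y)
    return [[math.hypot(a, b) for a, b in zip(rx, ry)] for rx, ry in zip(gx, gy)]


def gradient_loss(recon: Frame, target: Frame, c0: float = 1e-4) -> float:
    """One minus the mean gradient magnitude similarity (assumed loss form)."""
    g_r, g_t = gradient_magnitude(recon), gradient_magnitude(target)
    sims = [
        (2 * a * b + c0) / (a * a + b * b + c0)
        for row_r, row_t in zip(g_r, g_t)
        for a, b in zip(row_r, row_t)
    ]
    return 1.0 - sum(sims) / len(sims)
```

A reconstruction that blurs an edge present in the target produces a smaller gradient magnitude at that location, which lowers the similarity and raises the loss.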

[0145] Step E3: Based on the local structural similarity error between the reconstructed video frame and the target video frame, obtain the perceptual constraint loss.

[0146] In this embodiment, the local structural similarity (SSIM) error, which reflects the perceptual degradation of video frames, is used as a penalty term to obtain the perceptual constraint loss.

[0147] For example, the perceptual constraint loss is expressed as:

[0148]

[0149] Here, the first pair of symbols denotes the mean and variance of the sample target video frame, respectively, and the second pair denotes the mean and variance of the sample reconstructed video frame, respectively; the constants $C_1$ and $C_2$ ensure numerical stability; the SSIM value is typically calculated at each pixel using the sliding window strategy described in [the document / section].
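For orientation, the sketch below evaluates the standard SSIM expression over a single window covering the whole frame and uses one minus that value as the perceptual penalty. This is a simplification: the text describes a per-pixel sliding-window computation, and the covariance term, the default constants, and the flat-sequence frame representation are assumptions taken from the usual SSIM definition rather than from this application.

```python
from typing import Sequence


def ssim_global(x: Sequence[float], y: Sequence[float],
                c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> float:
    """Standard SSIM computed once over whole frames (pixel values assumed in [0, 1])."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("frames must be non-empty and have the same size")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) * (a - mx) for a in x) / n
    vy = sum((b - my) * (b - my) for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx * mx + my * my + c1) * (vx + vy + c2))


def perceptual_loss(recon: Sequence[float], target: Sequence[float]) -> float:
    """Assumed penalty form: 1 - SSIM, so a perfect reconstruction gives 0."""
    return 1.0 - ssim_global(recon, target)
```

A sliding-window version would apply the same expression to each local patch and average the resulting map.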

[0150] Step E4: Perform adversarial learning based on the sample reconstructed video frame and the sample target video frame to obtain the adversarial constraint loss.

[0151] In this embodiment, adversarial learning is introduced to produce more realistic prediction results, resulting in adversarial constraint loss.

[0152] For example, the adversarial constraint loss is expressed as:

[0153]

[0154] Where D represents the additional discriminator network; G represents the additional generator network; $P_g$ represents the distribution of the generated images; and $P_r$ represents the distribution of the real images. Since the bidirectional dynamic feature map contains an inherent correspondence between static and dynamic features, constraints are applied to the first sample reconstructed video frame to ensure the consistency of static and dynamic features.

[0155] Finally, the total constraint loss L, consisting of dynamic loss and static loss, can be expressed as:

[0156]

[0157] Here, the two terms are the dynamic loss and the static loss. The network parameters of the single-frame dynamic feature prediction neural network and the multi-frame static feature prediction neural network are then updated based on the total constraint loss to complete the training.

[0158] It should be noted that, in the embodiments of this application, the first spatial encoder and the second spatial encoder are shared components of the single-frame dynamic feature prediction neural network and the multi-frame static feature prediction neural network, and thus the intrinsic representations of static and dynamic can be learned simultaneously through this multi-task learning.

[0159] For example, Figure 4 is an overall architecture diagram of an edge-intelligent collaborative detection method for dynamic and static features in surveillance videos, provided in an embodiment of this application. Specifically, the overall architecture includes a single-frame dynamic feature prediction neural network and a multi-frame static feature prediction neural network. The single-frame dynamic feature prediction neural network includes a first spatial encoder and a dynamic decoder, and the multi-frame static feature prediction neural network includes a second spatial encoder, a multi-scale spatiotemporal aggregator, and a static decoder.

[0160] Among them, the first spatial encoder and the second spatial encoder are shared components of the single-frame dynamic feature prediction neural network and the multi-frame static feature prediction neural network; and all convolutional layers and transposed convolutional layers (T-Conv2D) in the first spatial encoder, dynamic decoder, second spatial encoder, multi-scale spatiotemporal aggregator and static decoder use 3×3 convolutional kernels, max pooling layers with a pooling size of 2 are used for downsampling operations, while transposed convolutional layers are used for upsampling operations.

[0161] The process for anomaly detection of video data acquired by terminal devices is as follows: At the edge node device, a lightweight single-frame dynamic feature prediction neural network is used to process the target video frame in the video data. The single-frame dynamic feature prediction neural network extracts the spatial features of the target video frame through a first spatial encoder to obtain a bidirectional dynamic feature map associated with the preceding and following video frames of the target video frame. Then, a dynamic decoder is used to reconstruct the features based on the bidirectional dynamic feature map to obtain a first reconstructed video frame containing the dynamic features of the preceding and following video frames. Finally, anomaly dynamic information detection is performed based on the first reconstructed video frame to obtain the dynamic information anomaly detection result.

[0162] At the central cloud device, a multi-frame static feature prediction neural network is used to perform spatiotemporal aggregation and decoding on multiple adjacent video frames before and after the target video. Specifically: the multi-frame static feature prediction neural network uses a second spatial encoder to obtain the spatial features of each adjacent video frame; then, the spatial features of the first two adjacent video frames are input into a forward multi-scale spatiotemporal aggregator for aggregation to obtain forward spatiotemporal features, and the spatial features of the last two adjacent video frames are input into a backward multi-scale spatiotemporal aggregator for aggregation to obtain backward spatiotemporal features. Thus, based on the forward and backward spatiotemporal features, the static appearance features of the target video are obtained; the static decoder is used to reconstruct the video frames based on the static appearance features of the target video to obtain a second reconstructed video frame containing the static semantics of multiple adjacent video frames before and after.

[0163] Finally, based on the results of dynamic information anomaly detection and static information anomaly detection, the final anomaly detection result is obtained.

[0164] Thus, a cloud-edge collaborative anomaly detection framework is implemented to improve the efficiency of video anomaly detection in IoT scenarios. A lightweight single-frame dynamic feature prediction neural network in the edge node device can acquire the appearance dynamic information of a single target video frame (i.e., the dynamic features of adjacent video frames). Anomaly detection is then performed based on this dynamic information, yielding a dynamic information anomaly detection result. Since the lightweight single-frame dynamic feature prediction neural network in the edge node device processes a single target video frame from the video data, rapid anomaly detection can be achieved in the edge node device. Furthermore, a multi-frame static feature prediction neural network in the central cloud device acquires the static semantics of multiple adjacent video frames before and after the target video frame. More accurate anomaly detection can be achieved based on the static semantics of these multiple adjacent video frames. Finally, the dynamic and static information anomaly detection results are combined to obtain an accurate final anomaly detection result. This achieves efficient and accurate anomaly detection.

[0165] This application also provides an edge-intelligent collaborative detection device for dynamic and static features of surveillance videos. Refer to Figure 5, which is a schematic diagram of the structure of an edge-intelligent surveillance video dynamic and static feature collaborative detection device provided in an embodiment of this application. The device includes:

[0166] Data acquisition module 510 is used to acquire video data using terminal devices;

[0167] The first detection module 520 is used to use a lightweight single-frame dynamic feature prediction neural network in the edge node device to perform inter-frame bidirectional motion constraints and dynamic feature extraction on the target video frame in the video data, to obtain a first reconstructed video frame containing dynamic features of adjacent video frames, and to perform abnormal dynamic information detection based on the first reconstructed video frame to obtain a dynamic information abnormality detection result.

[0168] The second detection module 530 is used to use a multi-frame static feature prediction neural network in the central cloud device to perform spatiotemporal aggregation and decoding processing on multiple adjacent video frames before and after the target video to obtain a second reconstructed video frame containing the static semantics of multiple adjacent video frames before and after, and to perform abnormal static information detection based on the second reconstructed video frame to obtain a static information abnormality detection result.

[0169] The final result module 540 is used to obtain the final anomaly detection result based on the dynamic information anomaly detection result and the static information anomaly detection result.

[0170] In an optional embodiment, the device further includes:

[0171] The status detection module is used to monitor the network bandwidth status of the edge node devices and the central cloud devices;

[0172] The edge early warning module is used to perform anomaly detection using a lightweight single-frame dynamic feature prediction neural network in the edge node device when the network bandwidth status does not meet the real-time requirements, so as to provide real-time early warning based on the anomaly detection results of dynamic information.

[0173] The final result module includes:

[0174] The collaborative early warning module is used to perform collaborative anomaly detection by utilizing a lightweight single-frame dynamic feature prediction neural network in the edge node device and a multi-frame static feature prediction neural network in the central cloud device, when the network bandwidth status meets the real-time requirements, so as to obtain the final anomaly detection result.

[0175] In an optional embodiment, the device further includes:

[0176] The quality assessment index module is used to take the peak signal-to-noise ratio between the target video frame and the reconstructed video frame as the quality assessment index of the target video frame. The quality assessment index characterizes the normality of the target video frame. The reconstructed video frame includes: a first reconstructed video frame and a second reconstructed video frame.

[0177] The normalization processing module is used to normalize the quality evaluation indicators of all target video frames in the video data to obtain anomaly detection results, which include dynamic information anomaly detection results and static information anomaly detection results.

[0178] In one optional embodiment, the single-frame dynamic feature prediction neural network includes: a first spatial encoder and a dynamic decoder; the first detection module includes:

[0179] The first extraction module is used to extract the spatial features of the target video frame using the first spatial encoder to obtain a bidirectional dynamic feature map associated with the preceding and following video frames of the target video frame.

[0180] The first reconstruction module is used to perform feature reconstruction based on the bidirectional dynamic feature map using the dynamic decoder to obtain a first reconstructed video frame containing dynamic features of adjacent video frames.

[0181] In one optional embodiment, the first reconstruction module includes:

[0182] A construction module is used to construct adjacent video frames before and after the target video frame based on the bidirectional dynamic feature map and the target video frame;

[0183] The mapping module is used to perform an inverse mapping transformation based on the preceding and following video frames of the target video frame and the bidirectional dynamic feature map to obtain a first reconstructed video frame containing the dynamic features of the preceding and following video frames.

[0184] In one optional embodiment, the multi-frame static feature prediction neural network includes: a second spatial encoder, a multi-scale spatiotemporal aggregator, and a static decoder; the second detection module includes:

[0185] The spatial feature acquisition module is used to acquire the spatial features of each adjacent video frame using the second spatial encoder.

[0186] The feature aggregation module is used to aggregate the spatial features of each of the adjacent video frames using the multi-scale spatiotemporal aggregator to obtain the static appearance features of the target video.

[0187] The second reconstruction module is used to reconstruct video frames by utilizing the static decoder to reconstruct the static appearance features of the target video, thereby obtaining a second reconstructed video frame containing the static semantics of multiple adjacent video frames.

[0188] In one optional embodiment, the spatial features include: the spatial features of the two adjacent video frames preceding the target video frame and the spatial features of the two adjacent video frames following the target video frame; the multi-scale spatiotemporal aggregator includes: a forward multi-scale spatiotemporal aggregator and a backward multi-scale spatiotemporal aggregator; the feature aggregation module includes:

[0189] The forward aggregation module is used to input the spatial features of the first two adjacent video frames into the forward multi-scale spatiotemporal aggregator for aggregation to obtain forward spatiotemporal features.

[0190] The backward aggregation module is used to input the spatial features of the next two adjacent video frames into the backward multi-scale spatiotemporal aggregator for aggregation to obtain backward spatiotemporal features.

[0191] The static appearance feature module is used to obtain the static appearance features of the target video based on the forward spatiotemporal features and the backward spatiotemporal features.
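By way of non-limiting illustration, the data flow of paragraphs [0188] to [0191] can be sketched as follows. The sketch assumes one-dimensional feature vectors, and it uses plain averaging at two pooling scales as a stand-in for the learned multi-scale spatiotemporal aggregator, whose internal structure is not specified here. The function names `pool`, `upsample`, `aggregate`, and `static_appearance`, as well as the averaging used to fuse the forward and backward features, are hypothetical.

```python
from typing import List, Sequence

Feature = List[float]  # simplified 1-D spatial feature of one video frame


def pool(feature: Sequence[float], scale: int) -> Feature:
    """Average non-overlapping windows of `scale` elements (one pooling scale)."""
    return [
        sum(feature[i:i + scale]) / len(feature[i:i + scale])
        for i in range(0, len(feature), scale)
    ]


def upsample(feature: Sequence[float], scale: int, length: int) -> Feature:
    """Repeat each pooled value `scale` times and crop to the original length."""
    return [feature[i // scale] for i in range(length)]


def aggregate(frames: Sequence[Sequence[float]], scales=(1, 2)) -> Feature:
    """Stand-in aggregator: average the frames at each scale, then average the scales."""
    length = len(frames[0])
    per_scale = []
    for s in scales:
        resized = [upsample(pool(f, s), s, length) for f in frames]
        per_scale.append([sum(vals) / len(vals) for vals in zip(*resized)])
    return [sum(vals) / len(vals) for vals in zip(*per_scale)]


def static_appearance(prev2: Sequence[Sequence[float]],
                      next2: Sequence[Sequence[float]]) -> Feature:
    """Fuse forward and backward spatiotemporal features (assumed: element-wise mean)."""
    forward = aggregate(prev2)    # forward multi-scale spatiotemporal aggregator
    backward = aggregate(next2)   # backward multi-scale spatiotemporal aggregator
    return [(f + b) / 2 for f, b in zip(forward, backward)]
```

In an actual embodiment the two aggregators would be trained network components and the fusion step may differ; the sketch only shows that the two preceding frames and the two following frames are processed by separate branches before being combined.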

[0192] In an optional embodiment, the apparatus includes a training module for training the single-frame dynamic feature prediction neural network and the multi-frame static feature prediction neural network, the training module comprising:

[0193] The video frame reconstruction module is used to obtain a first sample reconstructed video frame corresponding to a sample target video frame in the sample video data based on the single-frame dynamic feature prediction neural network, and to obtain a second sample reconstructed video frame corresponding to the sample target video frame based on the multi-frame static feature prediction neural network.

[0194] The loss construction module is used to construct a dynamic loss based on the first sample reconstructed video frame and the sample target video frame, and to construct a static loss based on the second sample reconstructed video frame and the sample target video frame.

[0195] The parameter update module is used to update the network parameters of the single-frame dynamic feature prediction neural network and the network parameters of the multi-frame static feature prediction neural network based on the dynamic loss and the static loss.

[0196] In one optional embodiment, the dynamic loss and the static loss each include: an intensity constraint loss, a gradient constraint loss, a perceptual constraint loss, and an adversarial constraint loss; the parameter update module includes:

[0197] The intensity constraint loss module is used to obtain the intensity constraint loss based on the Euclidean distance between the sample reconstructed video frame and the sample target video frame. The sample reconstructed video frame includes: a first sample reconstructed video frame and a second sample reconstructed video frame.

[0198] The gradient constraint loss module is used to obtain the gradient constraint loss based on the gradient magnitude similarity between the sample reconstructed video frame and the sample target video frame.

[0199] The perceptual constraint loss module is used to obtain the perceptual constraint loss based on the local structural similarity error between the sample reconstructed video frame and the sample target video frame.

[0200] The adversarial constraint loss module is used to perform adversarial learning on the sample reconstructed video frame and the sample target video frame to obtain the adversarial constraint loss.
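By way of non-limiting illustration, two of the four loss terms described in paragraphs [0197] and [0198] can be sketched as follows. The sketch makes several assumptions that are not stated in this specification: frames are grayscale lists of rows, the intensity term is the mean squared Euclidean distance, gradients are forward differences, and the gradient magnitude similarity uses the common form (2ab + c) / (a² + b² + c) with a small constant `c`. The perceptual and adversarial terms are omitted because they require a structural similarity window and a discriminator network, respectively. All function names and the weights `w_int` and `w_grad` are hypothetical.

```python
import math
from typing import List

Frame = List[List[float]]  # grayscale frame as rows of pixel intensities


def intensity_loss(recon: Frame, target: Frame) -> float:
    """Mean squared Euclidean distance between corresponding pixels."""
    n = sum(len(row) for row in target)
    total = sum((r - t) ** 2
                for r_row, t_row in zip(recon, target)
                for r, t in zip(r_row, t_row))
    return total / n


def gradient_magnitude(frame: Frame) -> Frame:
    """Forward-difference gradient magnitude; border differences are clamped to zero."""
    h, w = len(frame), len(frame[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            dx = frame[y][min(x + 1, w - 1)] - frame[y][x]
            dy = frame[min(y + 1, h - 1)][x] - frame[y][x]
            row.append(math.hypot(dx, dy))
        out.append(row)
    return out


def gradient_loss(recon: Frame, target: Frame, c: float = 1e-4) -> float:
    """One minus the mean gradient magnitude similarity of the two frames."""
    g_r, g_t = gradient_magnitude(recon), gradient_magnitude(target)
    sims = [(2 * a * b + c) / (a * a + b * b + c)
            for r_row, t_row in zip(g_r, g_t)
            for a, b in zip(r_row, t_row)]
    return 1.0 - sum(sims) / len(sims)


def total_loss(recon: Frame, target: Frame,
               w_int: float = 1.0, w_grad: float = 1.0) -> float:
    """Weighted sum of the two sketched terms (weights are assumptions)."""
    return w_int * intensity_loss(recon, target) + w_grad * gradient_loss(recon, target)
```

The same functions would be applied to the first sample reconstructed video frame to form the dynamic loss and to the second sample reconstructed video frame to form the static loss.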

[0201] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. The same or similar parts between the various embodiments can be referred to each other.

[0202] This application is described with reference to flowcharts and/or block diagrams of the methods and apparatus according to its embodiments. It should be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing terminal device, produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0203] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0204] These computer program instructions can also be loaded onto a computer or other programmable data processing terminal equipment, causing a series of operational steps to be performed on the computer or other programmable terminal equipment to produce a computer-implemented process, such that the instructions executed on the computer or other programmable terminal equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0205] Although preferred embodiments of the present application have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of the embodiments of the present application.

[0206] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes said element.

[0207] The above provides a detailed description of the edge-intelligent collaborative detection method and apparatus for dynamic and static features of surveillance video. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of this application. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this application. Therefore, the content of this specification should not be construed as a limitation of this application.

Claims

1. A collaborative detection method for dynamic and static features of edge-intelligent surveillance videos, characterized in that the method includes: acquiring video data using a terminal device; using a lightweight single-frame dynamic feature prediction neural network in an edge node device to perform bidirectional motion constraints and dynamic feature extraction on a target video frame in the video data to obtain a first reconstructed video frame containing dynamic features of adjacent video frames; performing abnormal dynamic information detection based on the first reconstructed video frame to obtain a dynamic information anomaly detection result; using a multi-frame static feature prediction neural network in a central cloud device to perform spatiotemporal aggregation and decoding on multiple adjacent video frames before and after the target video to obtain a second reconstructed video frame containing the static semantics of the multiple adjacent video frames; performing abnormal static information detection based on the second reconstructed video frame to obtain a static information anomaly detection result; and obtaining a final anomaly detection result based on the dynamic information anomaly detection result and the static information anomaly detection result; wherein the single-frame dynamic feature prediction neural network includes a first spatial encoder and a dynamic decoder, and using the lightweight single-frame dynamic feature prediction neural network in the edge node device to perform bidirectional motion constraints and dynamic feature extraction on the target video frame in the video data to obtain the first reconstructed video frame containing dynamic features of adjacent video frames includes: extracting spatial features of the target video frame using the first spatial encoder to obtain a bidirectional dynamic feature map associated with the adjacent video frames; and reconstructing features based on the bidirectional dynamic feature map using the dynamic decoder to obtain the first reconstructed video frame containing dynamic features of the adjacent video frames; and wherein the multi-frame static feature prediction neural network includes a second spatial encoder, a multi-scale spatiotemporal aggregator, and a static decoder, and using the multi-frame static feature prediction neural network in the central cloud device to perform spatiotemporal aggregation and decoding on the multiple adjacent video frames before and after the target video to obtain the second reconstructed video frame containing the static semantics of the multiple adjacent video frames includes: acquiring the spatial features of each adjacent video frame using the second spatial encoder; aggregating the spatial features of each adjacent video frame using the multi-scale spatiotemporal aggregator to obtain the static appearance features of the target video; and reconstructing a video frame from the static appearance features of the target video using the static decoder to obtain the second reconstructed video frame containing the static semantics of the multiple adjacent video frames.

2. The edge-intelligent collaborative detection method for dynamic and static features of surveillance video according to claim 1, characterized in that the method further includes: monitoring the network bandwidth status of the edge node device and the central cloud device; and, when the network bandwidth does not meet the real-time requirements, using the lightweight single-frame dynamic feature prediction neural network in the edge node device for anomaly detection, so as to provide real-time early warning based on the dynamic information anomaly detection result; and wherein obtaining the final anomaly detection result based on the dynamic information anomaly detection result and the static information anomaly detection result includes: when the network bandwidth meets the real-time requirements, using the lightweight single-frame dynamic feature prediction neural network in the edge node device and the multi-frame static feature prediction neural network in the central cloud device for collaborative anomaly detection to obtain the final anomaly detection result.
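By way of non-limiting illustration, the bandwidth-dependent switching recited in claim 2 can be sketched as follows. The threshold comparison, the unit of megabits per second, the names `Mode`, `select_mode`, and `final_result`, and the weighted average used to combine the two detection results are all assumptions made for the sketch; the claim itself does not specify how the real-time requirement is tested or how the two results are fused.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    EDGE_ONLY = "edge_only"          # edge node detects alone for real-time warning
    COLLABORATIVE = "collaborative"  # edge node and central cloud detect together


def select_mode(bandwidth_mbps: float, required_mbps: float) -> Mode:
    """Pick the detection mode from the monitored bandwidth (assumed threshold test)."""
    if bandwidth_mbps >= required_mbps:
        return Mode.COLLABORATIVE
    return Mode.EDGE_ONLY


def final_result(mode: Mode, dynamic_score: float,
                 static_score: Optional[float] = None,
                 weight: float = 0.5) -> float:
    """Return the dynamic result alone, or an assumed weighted fusion of both results."""
    if mode is Mode.EDGE_ONLY or static_score is None:
        return dynamic_score
    return weight * dynamic_score + (1.0 - weight) * static_score
```

For example, `select_mode(1.0, 4.0)` yields `Mode.EDGE_ONLY`, in which case only the dynamic information anomaly detection result is used for early warning.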

3. The edge-intelligent collaborative detection method for dynamic and static features of surveillance video according to claim 1, characterized in that the abnormal dynamic information detection based on the first reconstructed video frame and the abnormal static information detection based on the second reconstructed video frame are performed according to the following steps: using the peak signal-to-noise ratio between the target video frame and the reconstructed video frame as the quality evaluation index of the target video frame, wherein the quality evaluation index characterizes the normality of the target video frame, and the reconstructed video frame includes the first reconstructed video frame and the second reconstructed video frame; and normalizing the quality evaluation indices of all target video frames in the video data to obtain anomaly detection results, which include the dynamic information anomaly detection result and the static information anomaly detection result.
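By way of non-limiting illustration, the two steps recited in claim 3 can be sketched as follows. The sketch assumes 8-bit grayscale frames stored as lists of rows, a small floor on the mean squared error so that identical frames do not produce an infinite value, and min-max normalization over the video; the claim states only that the indices are normalized, so the specific normalization and the names `psnr` and `normalize_scores` are hypothetical.

```python
import math
from typing import List

Frame = List[List[float]]  # grayscale frame as rows of pixel intensities


def psnr(target: Frame, recon: Frame, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between a target frame and its reconstruction."""
    n = sum(len(row) for row in target)
    mse = sum((t - r) ** 2
              for t_row, r_row in zip(target, recon)
              for t, r in zip(t_row, r_row)) / n
    mse = max(mse, 1e-10)  # assumed floor to keep the result finite
    return 10.0 * math.log10(max_val ** 2 / mse)


def normalize_scores(psnrs: List[float]) -> List[float]:
    """Min-max normalize per-frame PSNR values to [0, 1]; higher means more normal."""
    lo, hi = min(psnrs), max(psnrs)
    if hi == lo:
        return [1.0 for _ in psnrs]
    return [(p - lo) / (hi - lo) for p in psnrs]
```

Under these assumptions, a frame whose normalized score is close to 0 is reconstructed poorly relative to the rest of the video and is therefore a candidate anomaly. The same computation would be run once with the first reconstructed video frame and once with the second.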

4. The edge-intelligent collaborative detection method for dynamic and static features of surveillance video according to claim 1, characterized in that reconstructing features based on the bidirectional dynamic feature map using the dynamic decoder to obtain the first reconstructed video frame containing dynamic features of adjacent video frames includes: constructing the preceding and following adjacent video frames of the target video frame based on the bidirectional dynamic feature map and the target video frame; and performing an inverse mapping transformation based on the preceding and following video frames of the target video frame and the bidirectional dynamic feature map to obtain the first reconstructed video frame containing the dynamic features of the preceding and following video frames.

5. The edge-intelligent collaborative detection method for dynamic and static features of surveillance video according to claim 1, characterized in that the spatial features include: the spatial features of the two preceding adjacent video frames and the spatial features of the two following adjacent video frames; the multi-scale spatiotemporal aggregator includes: a forward multi-scale spatiotemporal aggregator and a backward multi-scale spatiotemporal aggregator; and aggregating the spatial features of each adjacent video frame using the multi-scale spatiotemporal aggregator to obtain the static appearance features of the target video includes: inputting the spatial features of the two preceding adjacent video frames into the forward multi-scale spatiotemporal aggregator for aggregation to obtain forward spatiotemporal features; inputting the spatial features of the two following adjacent video frames into the backward multi-scale spatiotemporal aggregator for aggregation to obtain backward spatiotemporal features; and obtaining the static appearance features of the target video based on the forward spatiotemporal features and the backward spatiotemporal features.

6. The edge-intelligent collaborative detection method for dynamic and static features of surveillance video according to claim 1, characterized in that the single-frame dynamic feature prediction neural network and the multi-frame static feature prediction neural network are obtained according to the following joint dynamic and static training steps: obtaining a first sample reconstructed video frame corresponding to a sample target video frame in sample video data according to the single-frame dynamic feature prediction neural network, and obtaining a second sample reconstructed video frame corresponding to the sample target video frame according to the multi-frame static feature prediction neural network; constructing a dynamic loss based on the first sample reconstructed video frame and the sample target video frame, and constructing a static loss based on the second sample reconstructed video frame and the sample target video frame; and updating the network parameters of the single-frame dynamic feature prediction neural network and the network parameters of the multi-frame static feature prediction neural network based on the dynamic loss and the static loss.

7. The edge-intelligent collaborative detection method for dynamic and static features of surveillance video according to claim 6, characterized in that the dynamic loss and the static loss each include: an intensity constraint loss, a gradient constraint loss, a perceptual constraint loss, and an adversarial constraint loss; and the dynamic loss and the static loss are constructed according to the following steps: obtaining the intensity constraint loss based on the Euclidean distance between the sample reconstructed video frame and the sample target video frame, wherein the sample reconstructed video frame includes the first sample reconstructed video frame and the second sample reconstructed video frame; obtaining the gradient constraint loss based on the gradient magnitude similarity between the sample reconstructed video frame and the sample target video frame; obtaining the perceptual constraint loss based on the local structural similarity error between the sample reconstructed video frame and the sample target video frame; and performing adversarial learning on the sample reconstructed video frame and the sample target video frame to obtain the adversarial constraint loss.

8. A collaborative detection device for dynamic and static features of edge-intelligent surveillance video, characterized in that the device includes: a data acquisition module, used to acquire video data using a terminal device; a first detection module, used to use a lightweight single-frame dynamic feature prediction neural network in an edge node device to perform inter-frame bidirectional motion constraints and dynamic feature extraction on a target video frame in the video data to obtain a first reconstructed video frame containing the dynamic features of adjacent video frames, and to perform abnormal dynamic information detection based on the first reconstructed video frame to obtain a dynamic information anomaly detection result; a second detection module, used to use a multi-frame static feature prediction neural network in a central cloud device to perform spatiotemporal aggregation and decoding on multiple adjacent video frames before and after the target video to obtain a second reconstructed video frame containing the static semantics of the multiple adjacent video frames, and to perform abnormal static information detection based on the second reconstructed video frame to obtain a static information anomaly detection result; and a final result module, used to obtain a final anomaly detection result based on the dynamic information anomaly detection result and the static information anomaly detection result; wherein the single-frame dynamic feature prediction neural network includes a first spatial encoder and a dynamic decoder, and using the lightweight single-frame dynamic feature prediction neural network in the edge node device to perform bidirectional motion constraints and dynamic feature extraction on the target video frame in the video data to obtain the first reconstructed video frame containing dynamic features of adjacent video frames includes: extracting spatial features of the target video frame using the first spatial encoder to obtain a bidirectional dynamic feature map associated with the adjacent video frames; and reconstructing features based on the bidirectional dynamic feature map using the dynamic decoder to obtain the first reconstructed video frame containing dynamic features of the adjacent video frames; and wherein the multi-frame static feature prediction neural network includes a second spatial encoder, a multi-scale spatiotemporal aggregator, and a static decoder, and using the multi-frame static feature prediction neural network in the central cloud device to perform spatiotemporal aggregation and decoding on the multiple adjacent video frames before and after the target video to obtain the second reconstructed video frame containing the static semantics of the multiple adjacent video frames includes: acquiring the spatial features of each adjacent video frame using the second spatial encoder; aggregating the spatial features of each adjacent video frame using the multi-scale spatiotemporal aggregator to obtain the static appearance features of the target video; and reconstructing a video frame from the static appearance features of the target video using the static decoder to obtain the second reconstructed video frame containing the static semantics of the multiple adjacent video frames.