Video behavior key frame sampling method based on effective motion prior

By adaptively selecting key frames using inter-frame motion information and similarity analysis, the method addresses the high computational complexity and low robustness of existing techniques and enables efficient video behavior recognition in complex environments.

CN118334551B · Active · Publication Date: 2026-06-26 · XIDIAN UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: XIDIAN UNIV
Filing Date: 2024-04-12
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing video behavior recognition methods have significant problems in terms of computational resource consumption and complexity, and cannot accurately describe the entire process of target movement in complex environments. In particular, neural network-based methods consume a large amount of training data and time, and cannot analyze videos with unknown behaviors.

Method used

By calculating local motion information and overall similarity between video frames, key frames are adaptively selected to reduce the impact of abnormal event frames, thereby reducing computational complexity and improving robustness.

Benefits of technology

It enables accurate description of the entire process of target motion in complex environments, reduces computational complexity and resource consumption, and improves the accuracy of behavior recognition.

Abstract

The application discloses a video behavior key frame sampling method based on an effective motion prior, which mainly addresses the large computational resource requirements, high complexity and low robustness of the prior art. The implementation scheme is as follows: 1) extracting local motion information between adjacent frames of a video; 2) detecting abnormal event frames and adjusting the corresponding local motion information; 3) normalizing the local motion information of all frames and accumulating it along the time dimension; 4) uniformly dividing the accumulated motion information into multiple intervals; 5) selecting a key information frame from each interval and using the key frames selected from all intervals to form a frame subset, which completes key frame sampling of the input video. The method introduces no additional network parameters, can select key information frames from videos of complex scenes, provides high-quality original video subsequence data for accurately identifying the target action category, and can be used for video auditing, intelligent security and sports action analysis.

Description

Technical Field

[0001] This invention belongs to the field of video processing technology and further relates to a keyframe sampling method. It can be used in video review, intelligent security, and sports analysis scenarios to extract keyframes from a given video sequence for subsequent behavior recognition, providing an important reference for monitoring illegal videos and violent behavior.

Background Technology

[0002] With the rapid development of large-scale datasets and deep neural networks, video action recognition has made significant progress. However, existing action recognition models are typically large in scale, and using all frames in the video during the recognition process would incur enormous computational costs, making them difficult to apply in real-world production. To address this issue, frame sampling offers a promising solution for video action recognition.

[0003] In existing action recognition frameworks, commonly used frame sampling methods include dense sampling and uniform sampling. The former samples frames at equal intervals, while the latter divides the video evenly into segments and selects a frame from each segment. Both methods lack flexibility because they ignore the varying importance of each frame. To overcome this deficiency, some research focuses on designing keyframe selection modules based on neural networks. These sampling modules are incorporated into basic classification models, and the entire framework is optimized through mechanisms such as reinforcement learning. The drawback is that neural network-based methods require a large amount of training data and significant training time, and cannot parse videos containing unknown behaviors during subsequent testing. Other research selects key information frames in videos based on prior motion information or inter-frame differences. For example, Nanjing Zhihu Information Technology Co., Ltd. disclosed a method for extracting key information frames from surveillance videos in its patent application (application number 201510062263.X), which distinguishes key frames from non-key frames by calculating the brightness difference between adjacent frames and applying a difference threshold. The shortcomings of this method are that it simply treats the brightness difference between adjacent frames as the motion information of moving objects, and that its applicability is limited to surveillance videos. In practice, abnormal events such as camera shake, scene transitions, and lighting fluctuations produce invalid pseudo-motion information. Therefore, in complex environments, such methods cannot robustly select key motion frames from the video or accurately describe the entire process of the target's motion.

[0004] Patent application CN202310199402.8 discloses a "method, system, medium, device, and terminal for extracting keyframes from surveillance videos." This method samples a set of video frames and filters the resulting image frame set. It then adaptively clusters the filtered image frame set and finally collects the clustering results to form a video summary. However, this method has high computational complexity when processing videos containing a large number of frames, because it repeatedly calculates the similarity between some frames and other frames during adaptive clustering, which reduces the preprocessing efficiency of the input video in behavior recognition.

Summary of the Invention

[0005] The purpose of this invention is to address the shortcomings of the existing technology by proposing a video behavior keyframe sampling method based on effective motion priors, so as to accurately describe the entire process of target motion, reduce the computational complexity of video processing, and improve the keyframe sampling efficiency.

[0006] To achieve the above objectives, the technical approach of this invention is to acquire effective local motion prior information between adjacent frames in a video, perform global analysis on it, and adaptively select a subset of key information frames in the video.

[0007] Based on the above ideas, the technical solution of the present invention is as follows:

[0008] (1) Calculate the motion difference between each frame of the video and the previous frame to characterize the amount of local motion information M between adjacent frames;

[0009] (2) Calculate the overall similarity S between each frame of the video and its previous frame, and compare it with the set similarity threshold T to detect abnormal events that cause drastic changes in the frame during scene switching or lighting fluctuations.

[0010] If S≥T, the frame is determined to be a normal active frame with local motion information $M_a$, and step (3) is executed directly;

[0011] Otherwise, the frame is determined to be an abnormal event frame. Its local motion information $M_b$ is first multiplied by a weighting coefficient α to obtain the adjusted local motion information $M'_b$, which reduces its amount of local motion information, and step (3) is then executed.

[0012] (3) Normalize the local motion information $M_a$ of normal active frames and the adjusted local motion information $M'_b$ of abnormal event frames;

[0013] (4) Accumulate the normalized local motion information frame by frame along the time dimension, divide the accumulated local motion information evenly into multiple sub-intervals, and select a key information frame from each sub-interval to form a frame subset that accurately describes the entire process of the video behavior.

[0014] Compared with existing technologies, the present invention has the following advantages:

[0015] First, this invention can perceive frames containing rich motion information in any video by calculating the motion difference between each frame of the video and its previous frame without relying on an additional network model, thus reducing the complexity of sampling key frames of the video and the demand on graphics card memory resources.

[0016] Second, by calculating the overall similarity between each frame of a video and its previous frame, this invention detects abnormal event frames in the video and adjusts the amount of motion information corresponding to these abnormal event frames, thereby reducing the negative impact of abnormal event frames and improving the robustness of key information frame selection in complex scenes.

[0017] Third, in the implementation of this invention, the processing of each frame only involves calculating the motion difference and overall similarity with the previous frame, without repeatedly calculating the motion difference and overall similarity with other frames, which reduces the computational complexity of video processing and improves the preprocessing efficiency of input video in the behavior recognition process.

Description of the Drawings

[0018] Figure 1 is a flowchart of the present invention.

Detailed Implementation

[0019] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, other embodiments obtained by those skilled in the art without creative effort should fall within the protection scope of the present invention.

[0020] This example is based on a video containing human behavior and actions. It extracts and adjusts the local motion information, performs a global analysis of the adjusted local motion information from a temporal perspective, and finally selects a subset of frames from the video based on the global analysis results to accurately describe the entire process of the target behavior and actions.

[0021] Referring to Figure 1, the implementation steps of this example are as follows:

[0022] Step 1: Mark all frames in the video in chronological order: $f_1, f_2, \ldots, f_n$.

[0023] Step 2: Obtain inter-frame motion information by sequentially calculating the motion difference between each frame and its predecessor to characterize the local motion information M between adjacent frames.

[0024]

[0025] where $\mu_i$ and $\sigma_i$ denote the mean and variance of all pixels in the i-th frame, respectively; $\mu_{i-1}$ and $\sigma_{i-1}$ denote the mean and variance of all pixels in the (i-1)-th frame, respectively; $\sigma_{(i-1)i}$ denotes the pixel covariance between the i-th frame and the (i-1)-th frame; and $\zeta_1$ and $\zeta_2$ are two very small constants that prevent the denominator from being zero. The index i ranges from 1 to n, where n is the total number of frames in the video. In particular, the motion difference of the first frame is set to 0.
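The formula referenced at [0024] did not survive extraction. The quantities defined in [0025] (per-frame mean and variance, a cross-frame covariance, and two small stabilizing constants) match the terms of the global structural similarity index (SSIM), so the sketch below assumes the motion difference is one minus the global SSIM of two adjacent frames. That form, the function names, the constant values, and the representation of a frame as a flat list of grayscale values in [0, 1] are all assumptions for illustration, not the patent's confirmed formula.

```python
def mean(values):
    return sum(values) / len(values)


def global_ssim(frame_a, frame_b, zeta1=1e-4, zeta2=9e-4):
    """Global SSIM of two equally sized frames given as flat pixel lists.

    zeta1 and zeta2 are small constants (assumed values) that keep the
    denominator away from zero.
    """
    mu_a, mu_b = mean(frame_a), mean(frame_b)
    var_a = mean([(p - mu_a) ** 2 for p in frame_a])
    var_b = mean([(q - mu_b) ** 2 for q in frame_b])
    cov = mean([(p - mu_a) * (q - mu_b) for p, q in zip(frame_a, frame_b)])
    numerator = (2 * mu_a * mu_b + zeta1) * (2 * cov + zeta2)
    denominator = (mu_a ** 2 + mu_b ** 2 + zeta1) * (var_a + var_b + zeta2)
    return numerator / denominator


def motion_differences(frames):
    """Local motion information M for every frame (assumed M = 1 - SSIM)."""
    if not frames:
        return []
    diffs = [0.0]  # the first frame has no predecessor, so its value is 0
    for previous, current in zip(frames, frames[1:]):
        diffs.append(1.0 - global_ssim(current, previous))
    return diffs
```

Under this assumption, two identical frames give a motion difference of 0, and the value grows as the pixel statistics of adjacent frames diverge. Each frame is compared only with its predecessor, so the cost is linear in the number of frames.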

[0026] Step 3: Calculate the overall similarity S between each frame of the video and its previous frame.

[0027] 3.1) Convert the adjacent video frames $f_i$ and $f_{i-1}$ from the RGB color space to the HSV color space to obtain $f'_i$ and $f'_{i-1}$; each of the HSV video frames $f'_i$ and $f'_{i-1}$ contains three pixel channels;

[0028] 3.2) Compute the pixel histograms of the three pixel channels of the HSV video frames $f'_i$ and $f'_{i-1}$, and flatten the histograms obtained from the two frames into one-dimensional statistics $h_i$ and $h_{i-1}$;

[0029] 3.3) Normalize the flattened one-dimensional statistics $h_i$ and $h_{i-1}$ to obtain two normalized one-dimensional statistics $h'_i$ and $h'_{i-1}$;

[0030] 3.4) Compute the area of the intersection of $h'_i$ and $h'_{i-1}$, which is the inter-frame similarity S.
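Steps 3.1) to 3.4) can be sketched with the Python standard library alone. The sketch assumes a frame is a list of (r, g, b) tuples with components in [0, 1], uses 8 bins per channel, and normalizes the flattened histogram so that it sums to 1. The bin count and the normalization choice are assumptions, since the text does not specify them; the function names are hypothetical.

```python
import colorsys


def hsv_histogram(frame_rgb, bins=8):
    """Flattened, sum-normalized H, S and V histograms of one frame."""
    hist = [0.0] * (3 * bins)
    pixel_count = 0
    for r, g, b in frame_rgb:
        hsv = colorsys.rgb_to_hsv(r, g, b)  # each component lies in [0, 1]
        for channel, value in enumerate(hsv):
            index = min(int(value * bins), bins - 1)  # put value == 1.0 in the last bin
            hist[channel * bins + index] += 1.0
        pixel_count += 1
    total = 3 * pixel_count  # every pixel contributes one count per channel
    return [h / total for h in hist] if total else hist


def histogram_intersection(hist_a, hist_b):
    """Area shared by two normalized histograms, in [0, 1]."""
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))


def frame_similarity(frame_a, frame_b, bins=8):
    """Inter-frame similarity S as the intersection of HSV histograms."""
    return histogram_intersection(
        hsv_histogram(frame_a, bins), hsv_histogram(frame_b, bins)
    )
```

With this normalization, identical frames give S = 1, and frames whose color distributions do not overlap in any channel give S = 0, which fits a threshold T chosen between 0 and 1.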

[0031] Step 4: Detect abnormal event frames based on the inter-frame similarity S.

[0032] Set a similarity threshold T, with a value ranging from 0 to 1. Users can make appropriate changes according to different application scenarios. In this example, the similarity threshold T is set to 0.6 based on the inter-frame similarity in a general scenario.

[0033] The inter-frame similarity S is compared with the set similarity threshold T to determine whether there are abnormal events that cause drastic changes in the frame during scene switching or lighting fluctuations, thereby detecting abnormal event frames.

[0034] If S≥T, the frame is determined to be a normal active frame, and Step 6 is executed directly;

[0035] Otherwise, the frame is determined to be an abnormal event frame, and Step 5 is executed first, followed by Step 6.

[0036] Step 5: Adjust the amount of local motion information in the abnormal event frame.

[0037] The local motion information $M_b$ of the abnormal event frame is multiplied by a weighting coefficient α to adjust the amount of local motion information of the abnormal event frame, giving the adjusted local motion information $M'_b$:

[0038] $$M'_b = \alpha \cdot M_b$$

[0039] where $M_b$ is the local motion information of the abnormal event frame; α is a weighting coefficient ranging from 0 to 1, which users can adjust according to the intensity of the motion (in this example, α is set to 0.8 based on the motion variation in a general scenario); and $M'_b$ is the amount of local motion information of the abnormal event frame after adjustment.
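Steps 4 and 5 amount to a per-frame threshold test followed by a scalar weighting. A minimal sketch, assuming the per-frame motion values and similarities are already available as two parallel lists; the function name and return shape are hypothetical, while the defaults T = 0.6 and α = 0.8 are the values given in this example.

```python
def adjust_motion(motion, similarity, threshold=0.6, alpha=0.8):
    """Down-weight the motion of frames flagged as abnormal events.

    A frame is abnormal when its similarity to the previous frame is
    below the threshold; its motion value is then multiplied by alpha.
    Returns the adjusted motion values and the per-frame abnormal flags.
    """
    adjusted = []
    abnormal_flags = []
    for m, s in zip(motion, similarity):
        is_abnormal = s < threshold
        adjusted.append(alpha * m if is_abnormal else m)
        abnormal_flags.append(is_abnormal)
    return adjusted, abnormal_flags
```

For example, a frame with motion 1.0 and similarity 0.3 is flagged as abnormal and its motion becomes 0.8, while frames with similarity at or above 0.6 pass through unchanged.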

[0040] Step 6: Normalize the local motion information of all video frames.

[0041] For a normal active frame, its local motion information $M_a$ is normalized to obtain the normalized local motion information of the normal active frame:

[0042]

[0043] For an abnormal event frame, its adjusted local motion information $M'_b$ is normalized to obtain the normalized local motion information of the abnormal event frame:

[0044]
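The normalization formulas at [0042] and [0044] did not survive extraction, so the exact normalization is not recoverable from this text. As one plausible reading, the sketch below applies min-max normalization over the motion values of all frames (normal values and adjusted abnormal values together). Both the choice of min-max scaling and the function name are assumptions.

```python
def normalize_motion(values):
    """Min-max normalize per-frame motion values to [0, 1] (assumed scheme)."""
    if not values:
        return []
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All frames carry the same motion; avoid dividing by zero.
        return [0.0 for _ in values]
    span = highest - lowest
    return [(v - lowest) / span for v in values]
```

Normalizing all frames with the same statistics keeps normal and abnormal frames on one scale, which the later accumulation step relies on.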

[0045] Step 7: Select a subset of keyframes based on the normalized local motion information.

[0046] 7.1) Accumulate the normalized local motion information frame by frame along the time dimension:

[0047]

[0048] Here, the summed term is the normalized local motion information of the j-th frame, and the accumulated result is the cumulative local motion information of the i-th frame; specifically, the initial cumulative value is 0, and the final cumulative value is the sum of the normalized local motion information of all frames in the video.
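The accumulation in step 7.1) reads as a running sum over the normalized motion values, where the i-th cumulative value is the sum of the values of frames 1 through i. A minimal sketch under that reading; the function name is hypothetical.

```python
from itertools import accumulate


def cumulative_motion(normalized):
    """Running sum of normalized motion values, one entry per frame."""
    return list(accumulate(normalized))
```

Because the motion difference of the first frame is set to 0, the first cumulative value is 0, and the last entry equals the sum over all frames. The sequence is non-decreasing as long as the normalized values are non-negative.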

[0049] 7.2) Select a subset of keyframes based on the cumulative local motion information:

[0050] 7.2.1) Divide the cumulative local motion information evenly into multiple sub-intervals, and select the median value of each sub-interval of cumulative local motion information;

[0051] 7.2.2) Among all video frames, select the frame whose cumulative local motion information is closest to the median value as the keyframe of that interval;

[0052] 7.2.3) Repeat step 7.2.2) to select keyframes from all intervals to form a frame subset, and complete the keyframe sampling of the input video.
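Steps 7.2.1) to 7.2.3) can be sketched as follows. The sketch assumes the cumulative range runs from 0 to the final cumulative value, interprets the "median value" of a sub-interval as its midpoint, and returns 0-based frame indices. These interpretations and the function name are assumptions.

```python
def select_keyframes(cumulative, num_keyframes):
    """Pick one frame index per equal-width interval of cumulative motion."""
    if not cumulative or num_keyframes <= 0:
        return []
    width = cumulative[-1] / num_keyframes  # equal share of total motion
    selected = []
    for k in range(num_keyframes):
        target = (k + 0.5) * width  # midpoint of the k-th sub-interval
        closest = min(
            range(len(cumulative)),
            key=lambda i: abs(cumulative[i] - target),
        )
        selected.append(closest)
    return selected
```

Because intervals are equal in accumulated motion rather than in time, stretches of the video with strong motion receive more keyframes than static stretches. With very few distinct cumulative values, two intervals can map to the same frame; deduplicating or padding the result is left to the caller.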

[0053] The effects of the present invention will be further explained below with reference to simulation comparison experiments.

[0054] 1. Simulation experiment conditions:

[0055] The hardware platform for the simulation experiment is as follows: the CPU is an Intel Xeon E5-2640 v4 with 20 cores, a main frequency of 2.4GHz and 64GB of memory; the GPU is an NVIDIA GeForce RTX 3080 Ti with 12GB of video memory.

[0056] The software platform used for the simulation experiment was: Ubuntu 16.04 LTS operating system, OpenCV version 3.2.0, PyTorch version 1.12.1, and CUDA version 11.3.

[0057] The simulation experiment used the HMDB51 dataset, which consists of video clips extracted from movies, YouTube videos, and other sources. It contains 6849 video instances across 51 action categories, covering a wide range of human activities such as "climbing stairs," "combing hair," "drinking water," and "playing guitar." These categories include both simple and complex actions, presenting diverse challenges to action recognition algorithms. Each action category contains at least 101 video clips. The duration of these clips varies, typically ranging from a few seconds to several minutes, capturing different instances of the same action.

[0058] 2. Content and Result Analysis of the Simulation Experiment:

[0059] The present invention, along with existing dense sampling methods, uniform sampling methods, and methods based on ordinary motion prior sampling, were used to perform behavior recognition on the test set of the HMDB51 dataset. The accuracy of the recognition results of the present invention and the other three methods was calculated using the following evaluation metrics:

[0060]

[0061] Here, a video is counted as correctly identified when the label predicted during testing matches its label in the test set of the HMDB51 dataset. If the label predicted in the simulation experiment differs from the video's label in the HMDB51 dataset, the behavior recognition result is counted as incorrect.
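The metric formula at [0060] did not survive extraction. The description in [0061] is consistent with the usual classification accuracy, which is the number of correctly identified test videos divided by the total number of test videos, and the sketch below assumes that definition. The function name and the labels in the example are hypothetical.

```python
def recognition_accuracy(predicted_labels, true_labels):
    """Fraction of test videos whose predicted label equals the true label."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("predicted and true label lists must have equal length")
    if not true_labels:
        return 0.0
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)
```

For instance, three matching labels out of four test videos give an accuracy of 0.75, which would be reported as 75.00% in the style of Table 1.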

[0062] The accuracy calculation results of the above four methods are shown in Table 1:

[0063] Table 1. Comparison of the accuracy of different methods for behavior recognition

[0064]
Method                                   Behavior recognition accuracy
Dense sampling method                    66.51%
Uniform sampling method                  74.18%
Ordinary motion prior sampling method    74.36%
Method of the present invention          74.73%

[0065] As shown in Table 1, the behavior recognition accuracy of this invention is higher than that of existing methods. This indicates that by extracting effective local motion information from videos and further performing global analysis on this local motion information, this invention can adaptively select a set of key information frames in the video to accurately describe the entire process of target motion. It improves the accuracy of behavior recognition without introducing additional network parameters, while also solving the problems of poor flexibility and low robustness of existing sampling methods.

[0066] It should be noted that the step numbers in the specification and claims of this invention serve only to describe the implementation clearly and to aid understanding; they do not limit the order of the steps.

Claims

1. A video behavior keyframe sampling method based on effective motion priors, characterized in that it comprises the following steps:
(1) Calculate the motion difference between each frame of the video and the previous frame to characterize the amount of local motion information M between adjacent frames;
(2) Calculate the overall similarity S between each frame of the video and the previous frame, and compare it with the set similarity threshold T to detect abnormal events that cause drastic changes in the frame during scene switching or lighting fluctuations: if S≥T, the frame is determined to be a normal active frame with local motion information $M_a$, and step (3) is executed directly; otherwise, the frame is determined to be an abnormal event frame, and its local motion information $M_b$ is first multiplied by a weighting coefficient α to obtain the adjusted local motion information $M'_b$, which reduces its amount of local motion information, after which step (3) is executed;
(3) Normalize the local motion information $M_a$ of normal active frames and the adjusted local motion information $M'_b$ of abnormal event frames;
(4) Accumulate the normalized local motion information frame by frame along the time dimension, divide the accumulated local motion information evenly into multiple sub-intervals, and select a key information frame from each sub-interval to form a frame subset that accurately describes the entire process of the video behavior.

2. The method according to claim 1, characterized in that in step (1) the motion difference between each frame of the video and its previous frame is calculated using the following formula: , where $\mu_i$ and $\sigma_i$ denote the mean and variance of all pixels in the i-th frame, respectively; $\mu_{i-1}$ and $\sigma_{i-1}$ denote the mean and variance of all pixels in the (i-1)-th frame, respectively; $\sigma_{(i-1)i}$ denotes the pixel covariance between frame i and frame i-1; and $\zeta_1$ and $\zeta_2$ are two very small constant terms that prevent the denominator from being 0. The index i ranges from 1 to n, where n is the total number of frames in the video. In particular, the motion difference of the first frame is set to 0.

3. The method according to claim 1, characterized in that in step (2) the overall similarity S between each frame of the video and its previous frame is calculated as follows:
(2a) Convert the adjacent video frames $f_i$ and $f_{i-1}$ from the RGB color space to the HSV color space to obtain $f'_i$ and $f'_{i-1}$;
(2b) Compute the pixel histograms of the three channels of $f'_i$ and $f'_{i-1}$ and flatten them into one-dimensional statistics $h_i$ and $h_{i-1}$;
(2c) Normalize the two statistics $h_i$ and $h_{i-1}$ to obtain $h'_i$ and $h'_{i-1}$;
(2d) Compute the area of the intersection of $h'_i$ and $h'_{i-1}$, which is the inter-frame similarity S.

4. The method according to claim 1, characterized in that in step (2) the local motion information $M_b$ of the abnormal event frame is adjusted by a weighting coefficient α according to the following formula: $M'_b = \alpha \cdot M_b$, where $M_b$ is the amount of local motion information of the abnormal event frame, α is the weighting coefficient, and $M'_b$ is the amount of local motion information of the abnormal event frame after adjustment.

5. The method according to claim 1, characterized in that in step (3) the local motion information $M_a$ of the normal active frame is normalized using the following formula: , where $M_a$ is the amount of local motion information of a normal active frame, $M'_b$ is the amount of local motion information of the abnormal event frame after adjustment, and the result is the amount of local motion information of the normal active frame after normalization.

6. The method according to claim 1, characterized in that in step (3) the adjusted local motion information $M'_b$ of the abnormal event frame is normalized using the following formula: , where $M_a$ is the amount of local motion information of a normal active frame, $M'_b$ is the amount of local motion information of the abnormal event frame after adjustment, and the result is the amount of local motion information of the abnormal event frame after normalization.

7. The method according to claim 1, characterized in that in step (4) the normalized local motion information is accumulated frame by frame along the time dimension, as shown in the following formula: ; where the summed term is the normalized local motion information of the j-th frame and the accumulated result is the cumulative local motion information of the i-th frame; the initial cumulative value is 0, and the final cumulative value is the sum of the normalized local motion information of all frames in the video.

8. The method according to claim 1, characterized in that in step (4) the accumulated local motion information is evenly divided into multiple sub-intervals, and a key information frame is selected from each interval to form a frame subset, as follows:
(4a) Divide the cumulative local motion information evenly into multiple sub-intervals;
(4b) Select the median value of each sub-interval of cumulative local motion information;
(4c) Among all video frames, select the frame whose cumulative local motion information is closest to the median value as the keyframe of that interval;
(4d) Repeat (4b) and (4c) to select keyframes from all intervals to form a subset of frames.