Action detection method and device, electronic equipment and storage medium
By using a pre-trained action localization model, the action categories and the corresponding consecutive frames in a video are obtained, which addresses the low computational efficiency of existing technologies and enables accurate localization and efficient detection of highlight video segments.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANGHAI JINSHENG COMM TECH CO LTD
- Filing Date
- 2022-10-26
- Publication Date
- 2026-06-26
Figure CN115661928B_ABST
Abstract
Description
Technical Field
[0001] This application belongs to the field of computer technology, and specifically relates to an action detection method, device, electronic device, and readable storage medium.
Background Technology
[0002] Existing temporal action detection methods typically first generate temporal action proposals using a sliding window, and then use classifiers such as SVM to predict the start and end times of the actions and classify them. Because the sliding window method is computationally inefficient and, to some extent, constrains the temporal boundaries of the actions, it is difficult to accurately locate highlight segments. Accurately locating these highlights requires more information, so such methods fail to meet the demand for high-quality highlight videos of competitions.
Summary of the Invention
[0003] In view of the above problems, this application proposes an action detection method, apparatus, electronic device and storage medium to address them.
[0004] In a first aspect, embodiments of this application provide an action detection method applied to an electronic device. The method includes: firstly acquiring a video to be detected; then inputting the video to be detected into a pre-trained action localization model, obtaining localization results corresponding to multiple actions included in the video to be detected, as output by the action localization model, wherein the localization results include the action category of the corresponding action and multiple consecutive video frames corresponding to the action category; and finally, based on the localization results corresponding to the multiple actions, obtaining and outputting multiple consecutive video frames corresponding to a preset action category from the video to be detected.
[0005] In a second aspect, embodiments of this application provide an action detection device, operating in an electronic device, the device comprising: a video acquisition unit for acquiring a video to be detected; a localization result acquisition unit for inputting the video to be detected into a pre-trained action localization model, acquiring localization results corresponding to multiple actions included in the video to be detected, the localization results including action categories corresponding to the multiple actions and multiple consecutive video frames corresponding to each action category; and a continuous video frame output unit for acquiring and outputting multiple consecutive video frames corresponding to preset action categories from the video to be detected based on the localization results corresponding to the multiple actions.
[0006] In a third aspect, embodiments of this application provide an electronic device including one or more processors, a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs are configured to perform the above-described method.
[0007] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing program code, wherein the above-described method is executed when the program code is run.
[0008] This application provides an action detection method, apparatus, electronic device, and storage medium. The action detection method includes: first, acquiring a video to be detected; then, inputting the video to be detected into a pre-trained action localization model, obtaining localization results corresponding to multiple actions included in the video to be detected, as output by the action localization model, wherein the localization results include the action category of the corresponding action and multiple consecutive video frames corresponding to the action category; finally, based on the localization results corresponding to the multiple actions, extracting and outputting multiple consecutive video frames corresponding to a preset action category from the video to be detected. Through the above method, by using a pre-trained action localization model to obtain localization results for multiple actions, and by obtaining consecutive video frames corresponding to a preset action category based on the localization results, the method accurately locates exciting segments of a video.
Attached Figure Description
[0009] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0010] Figure 1 shows a flowchart of an action detection method according to an embodiment of this application;
[0011] Figure 2 shows a schematic diagram of the process described in steps S110-S130 of this application;
[0012] Figure 3 shows a flowchart of an action detection method according to another embodiment of this application;
[0013] Figure 4 shows a flowchart of an action detection method according to another embodiment of this application;
[0014] Figure 5 shows a structural block diagram of an action detection device according to another embodiment of this application;
[0015] Figure 6 shows a structural block diagram of an action detection device according to another embodiment of this application;
[0016] Figure 7 shows a structural block diagram of an electronic device for performing, in real time, the action detection method of the embodiments of this application;
[0017] Figure 8 shows a storage unit for storing or carrying program code that implements the action detection method according to the embodiments of this application.
Detailed Implementation
[0018] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.
[0019] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or server comprising a series of steps or units is not necessarily limited to those explicitly listed, but may include other steps or units not explicitly listed or inherent to these processes, methods, products, or devices.
[0020] When editing highlight videos of matches, motion detection is typically performed to extract key segments. Motion detection requires first locating the target, then identifying it. The relevant algorithms need to detect motion segments within a given, unsegmented long video, including start and end times and motion categories. The main task of these algorithms is to find the start and end frames of the motion and classify them.
[0021] The inventors, in their research on related action detection methods, found that these methods generally rely on the VGG16 model to classify and locate actions in competition videos. The system divides the competition video into fixed-length segments, labels these segments chronologically, and then inputs each segment into a pre-trained VGG16 model for scoring. The system determines whether each segment's score meets a preset threshold. If it does, the segment is acquired, and all segments meeting the threshold are arranged chronologically into a continuous video group, which is then output as a highlight reel. However, in this method the segment length is fixed, while the time before and after an action's label corresponds to the preparation for the action and the action itself, respectively. The system therefore does not fully utilize the information before and after the action, resulting in inaccurate localization of highlight segments.
[0022] Therefore, the inventors have proposed an action detection method, apparatus, electronic device, and storage medium in the embodiments of this application. First, a video to be detected is acquired; then, the video to be detected is input into a pre-trained action localization model, and the localization results corresponding to multiple actions included in the video to be detected are obtained from the action localization model. The localization results include the action categories corresponding to each of the multiple actions and multiple consecutive video frames corresponding to each action category; finally, based on the localization results corresponding to the multiple actions, multiple consecutive video frames corresponding to preset action categories are obtained from the video to be detected and output. Through the above method, by using a pre-trained action localization model to obtain the localization results of multiple actions, and by obtaining consecutive video frames corresponding to preset action categories based on the localization results, the exciting segments of the video can be accurately located.
[0023] The embodiments of this application will now be described in detail with reference to the accompanying drawings.
[0024] Referring to Figure 1, this application provides an action detection method applied to an electronic device, the method comprising:
[0025] Step S110: Obtain the video to be detected.
[0026] In this embodiment, the video to be detected can be a video captured in real time by a video acquisition device, which can be a camera or a mobile phone camera; it can also be a pre-prepared video that needs to be processed, and no specific limitation is made here.
[0027] As one approach, when the video to be detected is captured in real time by a video capture device, the device is activated when action detection is required and acquires the consecutive video frames of the scene within a certain time period as the video to be detected. For example, when the current scene is a sports competition, the camera records the actions of each athlete from the start to the end of the competition, and the recorded video is used as the video to be detected.
[0028] Step S120: Input the video to be detected into a pre-trained action localization model, and obtain the localization results of each action in the video to be detected, which are output by the action localization model. The localization results include the action category of the corresponding action and multiple consecutive video frames corresponding to the action category.
[0029] In this embodiment, the pre-trained action localization model may include two parts: a feature extraction module and an action localization module. The feature extraction module extracts action features from the video to be detected to obtain feature representations, while the action localization module obtains the localization results corresponding to the feature representations in the video to be detected.
[0030] In this embodiment, the system acquires a video to be detected. First, the video is input into a pre-trained feature extraction module in the action localization model, which outputs a feature representation of the video. Then, the multiple feature representations of the video are input into a pre-trained action localization module, which outputs action categories corresponding to multiple actions included in the video and multiple consecutive video frames corresponding to those action categories. In these consecutive video frames, a key video frame serves as the first frame, followed by multiple consecutive video frames.
[0031] Step S130: Based on the localization results corresponding to each of the multiple actions, obtain and output multiple consecutive video frames corresponding to the preset action category from the video to be detected.
[0032] In this embodiment, after the action localization model outputs the action categories of the multiple actions in the video to be detected and the consecutive video frames corresponding to each action category, these outputs are filtered according to the pre-selected action categories. The action categories that match the pre-selected action categories are selected, the consecutive video frames corresponding to the selected action categories are obtained and sorted in chronological order, and the sorted consecutive video frames are then output.
[0033] For example, the processes described in steps S110, S120, and S130 can be as shown in Figure 2. The video to be detected is input into a diversified feature module (equivalent to the feature extraction module), which outputs multiple feature representations corresponding to the video. These feature representations are then input into an action localization module, which outputs the localization result for each action and its corresponding action category. The first and last frames of each localization result are marked as the start and end points of the corresponding action category. The action categories output by the action localization module are then filtered using the pre-selected action categories, and those that match are selected. The selected action categories and their corresponding localization results are then input into a video highlights module, where the localization results for the selected action categories are cropped and sorted chronologically before being output. In addition, the video to be detected, together with the localization results and corresponding action categories output by the action localization module, is also input into the video highlights module as a backup.
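The flow of steps S110 to S130 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation of this application: `Localization`, `detect_highlights`, the stub functions `fake_extract` and `fake_localize` (which stand in for the trained feature extraction and action localization modules), and the category names "goal" and "foul" are all hypothetical, and frames are represented only by integer indices.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Set


@dataclass(frozen=True)
class Localization:
    """One localization result: an action category plus its frame span."""
    category: str
    start_frame: int  # first of the consecutive frames (inclusive)
    end_frame: int    # last of the consecutive frames (inclusive)


def detect_highlights(
    video: Sequence[int],
    extract_features: Callable[[Sequence[int]], list],
    localize: Callable[[list], List[Localization]],
    preset_categories: Set[str],
) -> List[Localization]:
    """Features -> localization results -> category filter -> time order."""
    features = extract_features(video)   # feature extraction module
    results = localize(features)         # action localization module
    selected = [r for r in results if r.category in preset_categories]
    return sorted(selected, key=lambda r: r.start_frame)


# Hypothetical stand-ins so the sketch runs without a trained model.
def fake_extract(video):
    return [list(video)]


def fake_localize(features):
    return [
        Localization("goal", 300, 420),
        Localization("foul", 50, 110),
        Localization("goal", 10, 40),
    ]


highlights = detect_highlights(range(500), fake_extract, fake_localize, {"goal"})
```

Passing the two modules in as callables keeps the filtering and sorting logic independent of whichever models are actually used.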
[0034] For example, steps S110, S120 and S130 can be as shown in steps S111, S121 and S131.
[0035] Step S111: Obtain the surveillance video.
[0036] In this embodiment of the application, the surveillance video can be video captured in real time by a video acquisition device, which can be a camera.
[0037] Step S121: Input the surveillance video into the pre-trained action localization model, and obtain the localization results, output by the action localization model, corresponding to each of the multiple abnormal segments included in the surveillance video.
[0038] In this embodiment, the system acquires surveillance video and first inputs it into a pre-trained feature extraction module in the action localization model. The module outputs a feature representation of the surveillance video. Then, the multiple feature representations of the surveillance video are input into the pre-trained action localization module, which outputs action categories corresponding to multiple abnormal segments in the surveillance video and multiple consecutive video frames corresponding to these action categories. In these consecutive video frames, a key video frame serves as the first frame, followed by multiple consecutive video frames.
[0039] Anomaly segments are used to characterize scenes that would not appear in normal surveillance video. For example, in a security environment, there are generally no people in the monitored scene. When a person appears in the surveillance video, that moment can be identified as an anomaly, and the video is considered an anomaly segment until the person is no longer present in the surveillance video.
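The security example above (a segment starts when a person appears and lasts until the person is no longer present) can be illustrated with a short sketch. It assumes that a per-frame flag indicating whether a person is visible is already available; how that flag is produced is outside the scope of the sketch, and the function name `anomaly_segments` is hypothetical.

```python
from typing import List, Sequence, Tuple


def anomaly_segments(person_present: Sequence[bool]) -> List[Tuple[int, int]]:
    """Return (first_frame, last_frame) for each run of frames with a person."""
    segments = []
    start = None
    for index, present in enumerate(person_present):
        if present and start is None:
            start = index                        # a person has just appeared
        elif not present and start is not None:
            segments.append((start, index - 1))  # the person has left
            start = None
    if start is not None:                        # still present at the last frame
        segments.append((start, len(person_present) - 1))
    return segments


flags = [False, True, True, False, False, True]
segments = anomaly_segments(flags)
```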
[0040] Step S131: Based on the localization results corresponding to each of the multiple abnormal segments, obtain and output multiple consecutive video frames corresponding to the preset action category from the surveillance video.
[0041] In this embodiment of the application, after the action localization model outputs the action categories of the multiple abnormal segments in the surveillance video and the multiple consecutive video frames corresponding to each action category, these outputs are filtered according to the pre-selected action categories. The action categories that match the pre-selected action categories are selected, the multiple consecutive video frames corresponding to the selected action categories are obtained and sorted in chronological order, and the sorted consecutive video frames are then output.
[0042] This application provides an action detection method. First, a video to be detected is acquired. Then, the video is input into a pre-trained action localization model to obtain the localization results of multiple actions within the video. Each localization result includes the action category of the corresponding action and multiple consecutive video frames corresponding to that action category. Finally, based on the localization results of the multiple actions, multiple consecutive video frames corresponding to a preset action category are extracted from the video and output. Through this method, by using a pre-trained action localization model to obtain localization results for multiple actions and extracting consecutive video frames corresponding to a preset action category based on the localization results, exciting segments of the video can be accurately located.
[0043] Referring to Figure 3, this application provides an action detection method applied to an electronic device, the method comprising:
[0044] Step S210: Obtain the video to be detected.
[0045] For a detailed explanation of step S210, refer to the above embodiments; it is not repeated in this embodiment.
[0046] Step S220: Input the video to be detected into the feature extraction module to obtain multiple feature representations corresponding to the video to be detected output by the feature extraction module.
[0047] In the embodiments of this application, the feature extraction module may employ algorithms such as Video Swin Transformer and Vision Transformer, and no specific limitations are made here.
[0048] For example, if the algorithm in the feature extraction module is Video Swin Transformer, it has three components: video to token, model stages, and head. Video to token groups the video into 2*2*4 blocks and applies linear embedding and position embedding. The model stages consist of multiple repeated stages, each including a Video Swin Transformer block and patch merging. Patch merging merges adjacent token features and then reduces their dimensionality using linear layers. The head takes the high-dimensional features of multiple frames produced by the model stages; if the model is used for video classification, a simple fusion across frames is additionally required.
[0049] In this embodiment, when the feature extraction module recognizes the input video to be detected, it extracts action features from the video using multiple feature extractors. Based on the multiple action labels corresponding to the multiple feature extractors, it obtains multiple action categories contained in the multiple action labels and the key video frame corresponding to each action. The multiple feature extractors are trained from the multiple action labels, and the output of the feature extraction module is the feature representation of the video to be detected.
[0050] For example, there can be three feature extractors and three action labels, with a one-to-one correspondence between the feature extractors and the action labels. The action labels can be 6s, 3s style1, and 3s style2, where the 6s action label corresponds to 18 action categories, the 3s style1 action label corresponds to 35 action categories, and the 3s style2 action label corresponds to 52 action categories.
[0051] In this embodiment, the feature extraction module can be a pre-trained extraction model based on a neural network model. The training process of the feature extraction model includes:
[0052] Step S221: Obtain a first training dataset. The first training dataset includes multiple videos. Each video includes multiple action categories, key video frames corresponding to each action, and a first video frame, a second video frame, and a third video frame corresponding to each key video frame. The first video frame includes video frames within a first preset time period before each key video frame and video frames within a second preset time period after each key video frame. The second video frame includes video frames within a third preset time period before each key video frame and video frames within a third preset time period after each key video frame. The third video frame includes video frames within a third preset time period before each key video frame, video frames within a third preset time period after each key video frame, and video frames within a third preset time period after the last frame of the video frames within a third preset time period after each key video frame.
[0053] In this embodiment, the first training dataset can be multiple videos related to the application scenario, which can be obtained from a pre-existing database. For example, if the application scenario is to output a compilation of exciting football match videos, multiple football match videos can be used as the first training dataset.
[0054] Each video in the first training dataset can contain multiple action categories, key video frames corresponding to each action, and first, second, and third video frames corresponding to each key video frame. The first video frame includes the video frames within a first preset time before each key video frame and the video frames within a second preset time after each key video frame. The first preset time can be set to 2 seconds, and the second preset time can be set to 4 seconds. The first video frame includes 18 action categories. The second video frame includes the video frames within a third preset time before each key video frame and the video frames within the third preset time after each key video frame. The third preset time can be 3 seconds. The second video frame includes 35 action categories: 17 action categories in the video frames within the third preset time before the key video frame, 17 action categories in the video frames within the third preset time after the key video frame, and 1 sub-action category. The third video frame includes the video frames within the third preset time before each key video frame, the video frames within the third preset time after each key video frame, and the video frames within a further third preset time following the last of those frames. The third video frame includes 52 action categories: 17 action categories in the video frames within the third preset time before the key video frame, 17 action categories in the video frames within the third preset time after the key video frame, 17 action categories in the video frames within the further third preset time, and 1 sub-action category.
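To make the three kinds of training windows concrete, the sketch below computes their frame ranges around a single key video frame, using the example durations given above (2 s and 4 s for the first video frame, 3 s for the third preset time). The function name `label_windows`, the frame rate parameter, and the pairing of the windows with the labels 6s, 3s style1, and 3s style2 are assumptions made for illustration; the pairing is inferred from the window lengths and category counts rather than stated explicitly.

```python
from typing import Dict, Tuple


def label_windows(key_frame: int, fps: int) -> Dict[str, Tuple[int, int]]:
    """Frame ranges (inclusive start, exclusive end) around one key frame."""
    first_preset, second_preset, third_preset = 2, 4, 3  # seconds (example values)

    def window(before_s: int, after_s: int) -> Tuple[int, int]:
        start = max(0, key_frame - before_s * fps)  # clamp at the video start
        return (start, key_frame + after_s * fps)

    return {
        # first video frame: 2 s before and 4 s after the key frame
        "6s": window(first_preset, second_preset),
        # second video frame: 3 s before and 3 s after
        "3s style1": window(third_preset, third_preset),
        # third video frame: 3 s before, 3 s after, plus a further 3 s
        "3s style2": window(third_preset, 2 * third_preset),
    }


windows = label_windows(key_frame=250, fps=25)
```

The end index is not clamped here because the video length is not known to the function; a caller that knows the total frame count would also need to limit the end of each range.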
[0055] Step S222: Input the first training dataset into the neural network model to be trained, and train the neural network model until the training termination condition is met, thereby obtaining the feature extraction module.
[0056] In this embodiment, when training the neural network model to be trained, a first training dataset is input into the neural network model to be trained, and the neural network model is trained until the preset number of training iterations is reached, at which point the feature extraction module can be considered to have finished training. For example, the preset number of training iterations can be 30.
[0057] The neural network model to be trained may include three feature extractors: a first feature extractor, a second feature extractor, and a third feature extractor. When training the neural network model based on a first training dataset, different feature extractors can be trained using different training data. Specifically, the first feature extractor can be trained using multiple videos, each video including multiple action categories, key video frames corresponding to each action, and a first video frame corresponding to each key video frame; the second feature extractor can be trained using multiple videos, each video including multiple action categories, key video frames corresponding to each action, and a second video frame corresponding to each key video frame; the third feature extractor can be trained using multiple videos, each video including multiple action categories, key video frames corresponding to each action, and a third video frame corresponding to each key video frame; this process continues until the training termination condition is met, resulting in the trained first, second, and third feature extractors.
[0058] Step S230: Input the multiple feature representations into the action localization module to obtain the localization results corresponding to each of the multiple actions included in the video to be detected, output by the action localization module.
[0059] In this embodiment, the algorithm model in the action localization module can be a temporal action detection model, which can be any one of S-CNN, R-CNN, R-C3D, CDC, and Faster-TAD, without specific limitation. Given an unsegmented long video, the temporal action detection model needs to detect the segment of each action included in the video, including the start time, end time, and action category of each action. In other words, its task is to find the start and end frames of each action and classify each action.
[0060] In this embodiment of the application, when the action localization module receives the multiple feature representations of the video to be detected, it uses them, according to the relevant algorithm model, to predict the video frames that follow each key video frame and the action category corresponding to those frames. It then outputs the key video frame, the video frames after the key video frame, and the corresponding action category.
[0061] In this embodiment, the action localization module can be a pre-trained localization module based on a neural network model. The training process for the action localization module includes:
[0062] Step S231: Obtain the second training dataset, which includes multiple videos. Each video includes multiple action categories, key video frames for each action, and video frames within a fourth preset time period after each key video frame.
[0063] In this embodiment, the second training dataset can be multiple videos related to the application scenario, which can be obtained from a pre-existing database. For example, if the application scenario is to output a compilation of exciting football match videos, multiple football match videos can be used as the second training dataset.
[0064] Each video in the second training dataset may include multiple action categories, key video frames for each action, and video frames within a fourth preset time after each key video frame. The fourth preset time can be set to 4 seconds.
[0065] Step S232: Input the second training dataset into the neural network model to be trained, and train the neural network model until the training termination condition is met to obtain the action localization module.
[0066] In this embodiment, when training the neural network model to be trained, the second training dataset is input into the neural network model to be trained, and the neural network model is trained until the preset number of training iterations is reached, at which point the action localization module can be considered to have finished training. For example, the preset number of training iterations can be 10.
[0067] When training the feature extraction module and the action localization module, they can be trained independently or jointly. When jointly training the feature extraction module and the action localization module, to determine whether the training termination condition is met, corresponding weights can be pre-set for the loss functions of the feature extraction module and the action localization module. The training termination condition is determined to be met when the loss function values of the feature extraction module and the action localization module meet the preset values.
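One way to read the joint-training termination condition is sketched below. The equal default weights, the interpretation of "meeting the preset value" as the loss being less than or equal to that value, and the function names `joint_loss` and `should_stop` are all assumptions for illustration; the text above does not fix any of them.

```python
def joint_loss(
    feature_loss: float,
    localization_loss: float,
    feature_weight: float = 0.5,       # assumed weight, not from the source
    localization_weight: float = 0.5,  # assumed weight, not from the source
) -> float:
    """Weighted sum of the two module losses, used as the joint objective."""
    return feature_weight * feature_loss + localization_weight * localization_loss


def should_stop(
    feature_loss: float,
    localization_loss: float,
    feature_target: float,
    localization_target: float,
) -> bool:
    """Training ends once both loss values have reached their preset values."""
    return feature_loss <= feature_target and localization_loss <= localization_target


total = joint_loss(1.0, 3.0)
done = should_stop(0.1, 0.2, feature_target=0.15, localization_target=0.25)
```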
[0068] Step S240: Obtain the preset action category.
[0069] In this embodiment, the system detects the positioning result in the video to be detected output by the motion positioning module. After obtaining the motion category and the corresponding multiple consecutive video frames from the positioning result output by the motion positioning module, the system obtains a pre-set motion category. The pre-set motion category can be one type of motion or multiple types of motion, and is not specifically limited here.
[0070] As another approach, the timing for setting the action category can be determined after the action localization module outputs the action category in the video to be detected and the corresponding consecutive video frames, and then the required action category is set. In this way, the action category in the output video to be detected can be selected based on the set action category.
[0071] Step S250: Obtain the positioning result that is the same as the preset action category from the positioning results corresponding to each of the multiple actions.
[0072] In this embodiment of the application, when the system obtains the action category in the video to be detected and the multiple consecutive video frames corresponding to the action category output by the action positioning module, it compares the obtained action category in the video to be detected with the preset action category, and then selects the action category in the video to be detected that is the same as the preset action category. At the same time, the first and last frames of the multiple consecutive video frames corresponding to the selected action category are marked on the time axis.
[0073] Step S260: Obtain and output multiple consecutive video frames corresponding to the positioning result from the video to be detected.
[0074] In this embodiment of the application, after the system has selected the action categories in the video to be detected according to the preset action category, it crops the video at the marked first and last frames of the multiple consecutive video frames corresponding to each selected action category, arranges the cropped runs of consecutive video frames in time order, and then outputs them.
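The cropping and time-ordered arrangement described above may, for instance, be sketched as follows. The function name and the representation of the video as an indexable sequence of frames are assumptions of this sketch; the marks are taken to be inclusive (first, last) frame indices.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def crop_and_arrange(frames: Sequence[Frame],
                     marks: Sequence[Tuple[int, int]]) -> List[List[Frame]]:
    """Cut each marked (first, last) span out of the video, in time order."""
    clips: List[List[Frame]] = []
    # Sorting the (first, last) pairs arranges the clips by start time.
    for first, last in sorted(marks):
        if not 0 <= first <= last < len(frames):
            raise ValueError(f"invalid span ({first}, {last})")
        clips.append(list(frames[first:last + 1]))
    return clips
```

Each returned clip is one run of consecutive video frames, so the output can be written out clip by clip or concatenated.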
[0075] This application provides an action detection method. First, a video to be detected is acquired. Then, the video is input into a feature extraction module to obtain multiple feature representations corresponding to the video. These feature representations are then input into an action localization module to obtain localization results for each action within the video. A preset action category is then obtained. Next, localization results matching the preset action category are selected from the localization results for each action. Finally, multiple consecutive video frames corresponding to the localization results are extracted from the video and output. Through this method, a pre-trained action localization model is used to obtain localization results for multiple actions, and consecutive video frames corresponding to preset action categories are obtained based on these results, thereby accurately locating key video segments.
[0076] Referring to Figure 4, this application provides an action detection method applied to an electronic device, the method comprising:
[0077] Step S310: Perform model pruning or compression operations on the motion localization model to obtain a lightweight motion localization model.
[0078] In this embodiment, the motion localization model occupies a large amount of storage space, requires a large amount of computation, and has too many data parameters, thus taking a considerable amount of time to process the video to be detected. To reduce the storage space occupied by the motion localization model and improve its computation speed, model compression is performed.
[0079] Model compression methods include network pruning, knowledge distillation, parameter quantization, architecture design, and dynamic computation. For example, when network pruning is used as a model compression method, the importance of weights and neurons in the action localization model is first evaluated. In evaluating weights, a weight with a value close to 0 is likely a low-importance weight; a large positive or negative value indicates a high-importance weight. In evaluating neurons, given a dataset, if a neuron's output is almost entirely 0, it is likely a low-importance neuron. Then, weights and neurons are ranked according to importance, and unimportant weights and neurons are removed. Since pruning the action localization model may decrease accuracy, the pruned model needs to be readjusted using the dataset until satisfactory performance is achieved.
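The importance evaluation described above may be illustrated by the following simplified sketch, which operates on plain lists rather than on a real network. The function names, the pruning fraction, and the thresholds `eps` and `min_zero_ratio` are hypothetical; the subsequent readjustment of the pruned model on the dataset is not shown.

```python
from typing import List


def prune_by_magnitude(weights: List[float], fraction: float) -> List[float]:
    """Zero out the given fraction of weights with the smallest |value|."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    n_prune = int(len(weights) * fraction)
    # Rank indices from least to most important: |w| near 0 is low importance.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def low_importance_neurons(activations: List[List[float]],
                           eps: float = 1e-6,
                           min_zero_ratio: float = 0.95) -> List[int]:
    """Find neurons whose output is ~0 on almost every sample.

    activations[s][n] is the output of neuron n for dataset sample s.
    """
    n_samples = len(activations)
    n_neurons = len(activations[0]) if activations else 0
    result = []
    for n in range(n_neurons):
        zeros = sum(1 for s in range(n_samples) if abs(activations[s][n]) <= eps)
        if zeros / n_samples >= min_zero_ratio:
            result.append(n)
    return result
```

Setting weights to zero stands in for removing them; an actual implementation would typically also drop the identified neurons from the network structure.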
[0080] Step S320: Obtain the video to be detected.
[0081] For a detailed explanation of step S320, refer to the above embodiments; it is not repeated in this embodiment.
[0082] Step S330: Input the video to be detected into the lightweight motion localization model, and obtain the localization results corresponding to each of the multiple actions included in the video to be detected output by the lightweight motion localization model.
[0083] In this embodiment, model compression yields a lightweight motion localization model, and the video to be detected is input into it. After compression, the model occupies less storage space and computes faster, so it processes the video to be detected faster than the uncompressed motion localization model and obtains the localization results corresponding to each of the multiple actions more quickly. The trade-off is that compression lowers computational accuracy, so the accuracy of those localization results is also reduced.
[0084] Step S340: Obtain the preset action category.
[0085] For a detailed explanation of step S340, refer to the above embodiments; it is not repeated in this embodiment.
[0086] Step S350: From the localization results corresponding to each of the multiple actions, obtain the localization results whose action category is the same as the preset action category.
[0087] For a detailed explanation of step S350, refer to the above embodiments; it is not repeated in this embodiment.
[0088] Step S360: Obtain and output multiple consecutive video frames corresponding to the positioning result from the video to be detected.
[0089] For a detailed explanation of step S360, refer to the above embodiments; it is not repeated in this embodiment.
[0090] This application provides an action detection method. First, a motion localization model is pruned or compressed to obtain a lightweight motion localization model. Then, a video to be detected is acquired and input into the lightweight motion localization model. The localization results corresponding to multiple actions in the video to be detected are obtained from the output of the lightweight motion localization model. Next, a preset action category is obtained. Then, from the localization results corresponding to the multiple actions, the localization results corresponding to the preset action category are obtained. Finally, multiple consecutive video frames corresponding to the localization results are obtained from the video to be detected and output. Through this method, using a pre-trained motion localization model, localization results of multiple actions are obtained, and consecutive video frames corresponding to preset action categories are obtained based on the localization results, thereby accurately locating exciting segments of the video.
[0091] Referring to Figure 5, this application provides an action detection device 400, which operates in an electronic device. The device 400 includes:
[0092] The video acquisition unit 410 is used to acquire the video to be detected.
[0093] In one manner, the video acquisition unit 410 is also used to acquire a surveillance video.
[0094] The localization result acquisition unit 420 is used to input the video to be detected into a pre-trained action localization model, and acquire the localization results of each of the multiple actions included in the video to be detected output by the action localization model. The localization results include the action categories corresponding to each of the multiple actions and multiple consecutive video frames corresponding to each action category.
[0095] In one manner, the localization result acquisition unit 420 is also used to input the video to be detected into the feature extraction module to obtain multiple feature representations corresponding to the video to be detected output by the feature extraction module; input the multiple feature representations into the action localization module to obtain localization results corresponding to each of the multiple actions included in the video to be detected output by the action localization module.
[0096] Optionally, the localization result acquisition unit 420 is further configured to perform model pruning or compression operations on the motion localization model to obtain a lightweight motion localization model; input the video to be detected into the lightweight motion localization model, and obtain the localization results corresponding to each of the multiple actions included in the video to be detected output by the lightweight motion localization model.
[0097] Optionally, the localization result acquisition unit 420 is further configured to input the surveillance video into a pre-trained action localization model, and acquire the localization results corresponding to each of the multiple abnormal segments included in the surveillance video output by the action localization model.
[0098] The continuous video frame output unit 430 is used to obtain and output a series of consecutive video frames corresponding to preset action categories from the video to be detected based on the positioning results corresponding to the multiple actions.
[0099] In one manner, the continuous video frame output unit 430 is also used to obtain a preset action category; obtain a positioning result that is the same as the preset action category from the positioning results corresponding to the plurality of actions; and obtain and output a plurality of consecutive video frames corresponding to the positioning result from the video to be detected.
[0100] Optionally, the continuous video frame output unit 430 is further configured to obtain and output multiple consecutive video frames corresponding to preset action categories from the surveillance video based on the localization results corresponding to the multiple abnormal segments.
[0101] Referring to Figure 6, the device 400 further includes:
[0102] The feature extraction module training unit 440 is used to acquire a first training dataset, which includes multiple videos. Each video includes multiple action categories, key video frames corresponding to each action, and a first video frame, a second video frame, and a third video frame corresponding to each key video frame. The first video frame includes video frames within a first preset time period before each key video frame and video frames within a second preset time period after each key video frame. The second video frame includes video frames within a third preset time period before each key video frame and video frames within a third preset time period after each key video frame. The third video frame includes video frames within a third preset time period before each key video frame, video frames within a third preset time period after each key video frame, and video frames within a third preset time period after the last frame of the video frames within a third preset time period after each key video frame. The first training dataset is input into the neural network model to be trained, and the neural network model to be trained is trained until the training termination condition is met, thus obtaining the feature extraction module.
[0103] The action localization module training unit 450 is used to acquire a second training dataset, which includes multiple videos. Each video includes multiple action categories, key video frames for each action, and video frames within a fourth preset time period after each key video frame. The second training dataset is input into the neural network model to be trained, and the neural network model to be trained is trained until the training termination condition is met, thereby obtaining the action localization module.
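To make the three sampling windows of the first training dataset concrete, the following sketch computes their frame ranges around a key video frame. It assumes that the preset time periods `t1`, `t2`, and `t3` are expressed in frames, that each window includes the key frame itself, and that ranges are clamped to the video length; these choices and all names are hypothetical.

```python
from typing import Dict, Tuple


def training_windows(key: int, t1: int, t2: int, t3: int,
                     num_frames: int) -> Dict[str, Tuple[int, int]]:
    """Return inclusive (start, end) frame ranges for the three windows."""
    def clamp(i: int) -> int:
        # Keep indices inside the video.
        return max(0, min(i, num_frames - 1))

    return {
        # First: t1 before the key frame, t2 after it.
        "first": (clamp(key - t1), clamp(key + t2)),
        # Second: t3 before and t3 after the key frame.
        "second": (clamp(key - t3), clamp(key + t3)),
        # Third: as the second, extended by a further t3 after its last frame.
        "third": (clamp(key - t3), clamp(key + 2 * t3)),
    }
```

For example, with a key frame at index 100 and `t1=10`, `t2=20`, `t3=15`, the windows are (90, 120), (85, 115), and (85, 130).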
[0104] It should be noted that the device embodiments in this application correspond to the aforementioned method embodiments. The specific principles in the device embodiments can be found in the content of the aforementioned method embodiments, and will not be repeated here.
[0105] An electronic device provided by this application is described below with reference to Figure 7.
[0106] Referring to Figure 7, based on the aforementioned action detection method and apparatus, this application also provides an electronic device 500 capable of performing the aforementioned action detection method. The electronic device 500 includes one or more (only one shown in the figure) processors 502, a memory 504, and a network module 506 coupled together. The memory 504 stores programs capable of executing the contents of the aforementioned embodiments, and the processors 502 can execute the programs stored in the memory 504.
[0107] The processor 502 may include one or more processing cores. The processor 502 connects to various parts within the electronic device 500 using various interfaces and lines, and executes various functions of the electronic device 500 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 504, and by calling data stored in the memory 504. Optionally, the processor 502 may be implemented using at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), or Programmable Logic Array (PLA). The processor 502 may integrate one or a combination of several of the following: Central Processing Unit (CPU), Graphics Processing Unit (GPU), and modem. The CPU primarily handles the operating system, user interface, and applications; the GPU is responsible for rendering and drawing the displayed content; and the modem handles wireless communication. It is understood that the modem may also not be integrated into the processor 502 and may be implemented separately using a communication chip.
[0108] The memory 504 may include random access memory (RAM) or read-only memory (ROM). The memory 504 can be used to store instructions, programs, code, code sets, or instruction sets. The memory 504 may include a program storage area and a data storage area. The program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as touch functionality, sound playback functionality, image playback functionality, etc.), and instructions for implementing the various method embodiments described above. The data storage area may store data created by the electronic device 500 during use (such as phonebook data, audio and video data, chat log data, etc.).
[0109] The network module 506 is used to receive and transmit electromagnetic waves, realizing the mutual conversion between electromagnetic waves and electrical signals, thereby communicating with communication networks or other devices, such as audio playback devices. The network module 506 may include various existing circuit elements for performing these functions, such as antennas, radio frequency transceivers, digital signal processors, encryption / decryption chips, SIM cards, memory, etc. The network module 506 can communicate with various networks such as the Internet, corporate intranets, and wireless networks, or communicate with other devices via wireless networks. The aforementioned wireless networks may include cellular telephone networks, wireless local area networks, or metropolitan area networks. For example, the network module 506 can interact with base stations.
[0110] Please refer to Figure 8, which shows a structural block diagram of a computer-readable storage medium provided in an embodiment of this application. The computer-readable storage medium 600 stores program code that can be called by a processor to execute the methods described in the above method embodiments.
[0111] The computer-readable storage medium 600 may be an electronic memory such as flash memory, EEPROM (Electrically Erasable Programmable Read-Only Memory), EPROM, hard disk, or ROM. Optionally, the computer-readable storage medium 600 includes a non-transitory computer-readable storage medium. The computer-readable storage medium 600 has storage space for program code 610 that performs any of the method steps described above. This program code can be read from or written to one or more computer program products. The program code 610 may be compressed, for example, in a suitable form.
[0112] This application provides an action detection method, apparatus, electronic device, and storage medium. The action detection method includes: first, acquiring a video to be detected; then, inputting the video to be detected into a pre-trained action localization model, obtaining localization results corresponding to multiple actions included in the video to be detected, as output by the action localization model, wherein the localization results include the action category of the corresponding action and multiple consecutive video frames corresponding to the action category; finally, based on the localization results corresponding to the multiple actions, extracting and outputting multiple consecutive video frames corresponding to a preset action category from the video to be detected. Through the above method, by using a pre-trained action localization model to obtain localization results for multiple actions, and by obtaining consecutive video frames corresponding to a preset action category based on the localization results, the method accurately locates exciting segments of a video.
[0113] The embodiments of the present invention have been described above with reference to the accompanying drawings. However, the present invention is not limited to the specific embodiments described above. The specific embodiments described above are merely illustrative and not restrictive. Those skilled in the art can make many other modifications under the guidance of the present invention without departing from the spirit and scope of the claims, and all of these modifications are protected by the present invention.
Claims
1. A motion detection method, characterized in that, Applied to electronic devices, the method includes: Obtain the video to be tested; The video to be detected is input into a pre-trained action localization model, and the localization results of each action in the video to be detected are obtained from the action localization model. The localization results include the action category of the corresponding action and multiple consecutive video frames corresponding to the action category. The action localization model includes a feature extraction module and an action localization module. The step of inputting the video to be detected into a pre-trained action localization model and obtaining the localization results corresponding to each of the multiple actions included in the video to be detected, as output by the action localization model, includes: Obtain a first training dataset, which includes multiple videos. Each video includes multiple action categories, key video frames corresponding to each action, and a first video frame, a second video frame, and a third video frame corresponding to each key video frame. The first video frame includes video frames within a first preset time period before each key video frame and video frames within a second preset time period after each key video frame. The second video frame includes video frames within a third preset time period before each key video frame and video frames within a third preset time period after each key video frame. The third video frame includes video frames within a third preset time period before each key video frame, video frames within a third preset time period after each key video frame, and video frames within a third preset time period after the last frame of the video frames within a third preset time period after each key video frame. 
The first training dataset is input into the neural network model to be trained, and the neural network model is trained until the training termination condition is met, thus obtaining the feature extraction module. The feature extraction module includes a first feature extractor, a second feature extractor, and a third feature extractor. The first feature extractor is trained based on multiple videos, each video including multiple action categories, key video frames corresponding to each action, and a first video frame corresponding to each key video frame; the second feature extractor is trained based on multiple videos, each video including multiple action categories, key video frames corresponding to each action, and a second video frame corresponding to each key video frame; the third feature extractor is trained based on multiple videos, each video including multiple action categories, key video frames corresponding to each action, and a third video frame corresponding to each key video frame; this process continues until the training termination condition is met, resulting in the trained first, second, and third feature extractors. Obtain a second training dataset, which includes multiple videos. Each video includes multiple action categories, key video frames for each action, and video frames within a fourth preset time period after each key video frame. The second training dataset consists of multiple videos related to the application scenario. The second training dataset is input into the neural network model to be trained, and the neural network model to be trained is trained until the training termination condition is met, thereby obtaining the action localization module. 
The video to be detected is input into the feature extraction module to obtain multiple feature representations corresponding to the video to be detected output by the feature extraction module; The multiple feature representations are input into the action localization module to obtain the localization results corresponding to each of the multiple actions included in the video to be detected, output by the action localization module. Based on the localization results corresponding to each of the multiple actions, a series of consecutive video frames corresponding to the preset action categories are obtained from the video to be detected and output.
2. The method according to claim 1, characterized in that the step of obtaining and outputting multiple consecutive video frames corresponding to preset action categories from the video to be detected based on the localization results corresponding to the multiple actions includes: obtaining the preset action category; obtaining, from the localization results corresponding to each of the multiple actions, the localization result whose action category is the same as the preset action category; and obtaining and outputting, from the video to be detected, multiple consecutive video frames corresponding to that localization result.
3. The method according to claim 1, characterized in that, before acquiring the video to be detected, the method further includes: performing model pruning or compression on the action localization model to obtain a lightweight action localization model; and the step of inputting the video to be detected into a pre-trained action localization model and obtaining the localization results corresponding to each of the multiple actions included in the video to be detected, as output by the action localization model, includes: inputting the video to be detected into the lightweight action localization model, and obtaining the localization results corresponding to each of the multiple actions included in the video to be detected, as output by the lightweight action localization model.
4. The method according to claim 1, characterized in that acquiring the video to be detected includes: obtaining a surveillance video; the step of inputting the video to be detected into a pre-trained action localization model and obtaining the localization results corresponding to each of the multiple actions included in the video to be detected, as output by the action localization model, includes: inputting the surveillance video into the pre-trained action localization model to obtain the localization results corresponding to each of the multiple abnormal segments included in the surveillance video, as output by the action localization model; and the step of obtaining and outputting multiple consecutive video frames corresponding to preset action categories from the video to be detected based on the localization results corresponding to the multiple actions includes: based on the localization results corresponding to each of the multiple abnormal segments, obtaining and outputting, from the surveillance video, multiple consecutive video frames corresponding to the preset action categories.
5. A motion detection device, characterized in that, Operating in an electronic device, the device includes: The video acquisition unit is used to acquire the video to be detected. The localization result acquisition unit is used to input the video to be detected into a pre-trained action localization model, and acquire the localization results of each of the multiple actions included in the video to be detected output by the action localization model. The localization results include the action categories corresponding to each of the multiple actions and multiple consecutive video frames corresponding to each action category. The action localization model includes a feature extraction module and an action localization module. The feature extraction module training unit is used to acquire a first training dataset, which includes multiple videos. Each video includes multiple action categories, key video frames corresponding to each action, and a first, second, and third video frame corresponding to each key video frame. The first video frame includes video frames within a first preset time period before each key video frame and video frames within a second preset time period after each key video frame. The second video frame includes video frames within a third preset time period before each key video frame and video frames within a third preset time period after each key video frame. The third video frame includes video frames within a third preset time period before each key video frame, video frames within a third preset time period after each key video frame, and video frames within a third preset time period after the last frame of the video frames within a third preset time period after each key video frame. The first training dataset is then input into the neural network model to be trained. The neural network model to be trained is trained until the training termination condition is met, resulting in the feature extraction module. 
The feature extraction module includes a first feature extractor, a second feature extractor, and a third feature extractor. The first feature extractor is trained based on multiple videos, each video including multiple action categories, key video frames corresponding to each action, and a first video frame corresponding to each key video frame. The second feature extractor is trained based on multiple videos, each video including multiple action categories, key video frames corresponding to each action, and a second video frame corresponding to each key video frame. The third feature extractor is trained based on multiple videos, each video including multiple action categories, key video frames corresponding to each action, and a third video frame corresponding to each key video frame. This process continues until the training termination condition is met, resulting in the trained first, second, and third feature extractors. The action localization module training unit is used to acquire a second training dataset, which includes multiple videos. Each video includes multiple action categories, key video frames for each action, and video frames within a fourth preset time period after each key video frame. The second training dataset consists of multiple videos related to the application scenario. The second training dataset is input into the neural network model to be trained, and the neural network model to be trained is trained until the training termination condition is met, thereby obtaining the action localization module. 
The localization result acquisition unit is used to input the video to be detected into the feature extraction module to obtain multiple feature representations corresponding to the video to be detected output by the feature extraction module; input the multiple feature representations into the action localization module to obtain localization results corresponding to each of the multiple actions included in the video to be detected output by the action localization module. The continuous video frame output unit is used to obtain and output a series of consecutive video frames corresponding to preset action categories from the video to be detected based on the positioning results corresponding to the multiple actions.
6. An electronic device, characterized in that it includes one or more processors and a memory, wherein one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the method according to any one of claims 1-4.
7. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores program code, which includes instructions for performing the method as claimed in any one of claims 1-4.