Violent video recognition method for bimodal task learning based on attention mechanism

A video recognition technology using an attention mechanism, applied in the field of violent video recognition. It addresses problems such as weakened classifier generalization ability and ignored feature interdependence, achieving the effects of improving generalization ability, enhancing useful feature expression, and suppressing irrelevant feature expression.

Pending Publication Date: 2020-11-06
COMMUNICATION UNIVERSITY OF CHINA


Problems solved by technology

Existing research methods basically use only video labels as supervisory signals to build and train network structures that output a violence/non-violence label for a video, but they ignore the interdependence between features, which affects the generalization ability of the classifier.

Abstract

The invention discloses a violent video recognition method based on attention-mechanism bimodal task learning, which belongs to the technical field of natural interaction and intelligent image recognition. Taking an analysis of the characteristics of violent scene videos as its starting point, the method extracts video features that are suited to describing violent scenes and that carry spatio-temporal correlation; it then establishes an attention mechanism module for the violent video features, guided by the principle of capturing global feature information; finally, with the fusion of spatio-temporal features carrying global attention relationships as the means of achieving multi-modal information complementarity, it carries out the violent video recognition steps of multi-task learning that combine attention over violent video features with violent video classification, forming a complete violent video recognition and detection framework. The method realizes intelligent and effective detection of violent videos.

Application Domain

Character and pattern recognition; Neural architectures

Technology Topic

Task learning; Video recognition

Examples

  • Experimental program (2)

Example Embodiment

[0024] Embodiment 1: As shown in Figures 1, 2, 3 and 4, the violent video recognition method based on attention-mechanism bimodal task learning includes the following steps:
[0025] Step 1: Add an attention mechanism module to the spatial-stream deep neural network to capture the interdependence among the violent features of static frame images and form the attention weights;
[0026] Step 2: Add an attention mechanism module to the temporal-stream deep neural network to capture the interdependence among the violent features of the optical-flow sequence and form the attention weights;
[0027] Step 3: Extract the violent-video feature information from single-frame images and establish a violent video recognition model based on single-frame images;
[0028] Step 4: Extract the violent-video feature information from the motion optical flow and establish a violent video recognition model based on motion optical flow;
[0029] Step 5: Spatio-temporal feature fusion. Using average fusion, combine the prediction scores of the single-frame-image model and of the motion-optical-flow model to give the final violence classification score.
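Purely as an illustration of how these five steps fit together, the skeleton below sketches the overall flow; every function name in it is a hypothetical placeholder, not code from the patent or from any library.

```python
# Skeleton only: every function below is a hypothetical placeholder standing
# in for the real TSN+GCNet models and training described in the patent.

def build_stream_with_attention(modality: str):
    """Steps 1-2: a TSN stream (spatial or temporal) with attention modules embedded."""
    return f"{modality}-stream model with attention"

def train_and_score(model, videos):
    """Steps 3-4: train the stream on the violent video samples and return
    one violence score per video (dummy constant scores here)."""
    return [0.5 for _ in videos]

def average_fusion(rgb_scores, flow_scores):
    """Step 5: average the scores of the two modalities into the final score."""
    return [(a + b) / 2.0 for a, b in zip(rgb_scores, flow_scores)]

spatial = build_stream_with_attention("spatial")     # step 1
temporal = build_stream_with_attention("temporal")   # step 2
videos = ["clip_001.mp4", "clip_002.mp4"]
rgb_scores = train_and_score(spatial, videos)        # step 3
flow_scores = train_and_score(temporal, videos)      # step 4
print(average_fusion(rgb_scores, flow_scores))       # step 5
```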
[0030] Specifically, adding the attention mechanism module to the spatial-stream deep neural network comprises the following steps:
[0031] Step 11: Construct the spatial-stream deep neural network for capturing violent attention relationships. Using the TSN network as the base network, embed the attention mechanism module GCNet at the conv_bn_3c, conv_bn_4e and conv_bn_5b layers to complete the spatial-stream deep neural network for capturing violent attention relationships;
[0032] Step 12: Learn the attention relationship weights. Train the spatial-stream attention network of step 11 on the violent video sample data set to obtain the spatial-stream violent attention relationship weights.
[0033] Step 13: Form the attention features. Fuse the original features with the spatial-stream violent attention relationship weights learned in step 12 by element-wise addition to obtain spatial-stream features with attention interdependence.
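The attention module named here is GCNet (a global context block). As a minimal PyTorch sketch only, assuming the standard GCNet design of softmax-weighted global pooling followed by a bottleneck transform and element-wise addition back onto the original features (the addition-based fusion of step 13), such a block could look like the following; it is an illustration, not the exact module used in the patent.

```python
import torch
import torch.nn as nn

class GlobalContextBlock(nn.Module):
    """GCNet-style global context attention (sketch).

    Context modelling: a 1x1 conv produces one attention map per image,
    softmax-normalised over all H*W positions; the feature map is pooled
    with these weights into a single context vector. A bottleneck transform
    then produces a per-channel term that is added back to every position
    (element-wise addition, as in step 13 / step 23 below).
    """

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.context_conv = nn.Conv2d(channels, 1, kernel_size=1)
        hidden = max(channels // reduction, 1)
        self.transform = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.LayerNorm([hidden, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # attention weights over all spatial positions
        attn = self.context_conv(x).view(b, 1, h * w)           # (B, 1, HW)
        attn = torch.softmax(attn, dim=-1)
        # weighted pooling -> global context vector
        feat = x.view(b, c, h * w)                               # (B, C, HW)
        context = torch.bmm(feat, attn.transpose(1, 2))          # (B, C, 1)
        context = context.view(b, c, 1, 1)
        # bottleneck transform, then fuse by element-wise addition
        return x + self.transform(context)

if __name__ == "__main__":
    block = GlobalContextBlock(channels=64)
    x = torch.randn(2, 64, 28, 28)
    print(block(x).shape)  # torch.Size([2, 64, 28, 28])
```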
[0034] Specifically, adding the attention mechanism module to the temporal-stream deep neural network comprises the following steps:
[0035] Step 21: Construct the temporal-stream deep neural network for capturing violent attention relationships. Using the TSN network as the base network, embed the attention mechanism module GCNet at the conv_bn_3c, conv_bn_4e and conv_bn_5b layers to complete the temporal-stream deep neural network for capturing violent attention relationships;
[0036] Step 22: Learn the attention relationship weights. Train the temporal-stream attention network of step 21 on the violent video sample data set to obtain the temporal-stream violent attention relationship weights.
[0037] Step 23: Form the attention features. Fuse the original features with the temporal-stream violent attention relationship weights learned in step 22 by element-wise addition to obtain temporal-stream features with attention interdependence.
[0038] Specifically, extracting the violent-video feature information from single-frame images comprises the following steps:
[0039] Step 31: Build a deep neural network for single-frame image classification with attention relationships by combining the TSN network with the attention mechanism module GCNet;
[0040] Step 32: Train the network of step 31 on the violent video sample data set to obtain the deep neural network model for single-frame image classification;
[0041] Step 33: Use the model obtained in step 32 to output prediction scores for the violent video sample data.
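As a rough sketch of step 33, the snippet below scores a video with TSN-style segment sampling: a few frames are sampled, each is scored by the spatial model, and the per-frame scores are averaged into a video-level prediction. The ResNet-18 stand-in backbone, the two-class output and the class ordering are assumptions for illustration; the patent's model is TSN (with GCNet blocks) trained on the violent video sample set.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# lightweight stand-in backbone (assumption); the patent uses TSN (BN-Inception)
# with GCNet blocks embedded, which is not reproduced here.
num_classes = 2                      # assumed order: [non-violent, violent]
spatial_model = models.resnet18(weights=None)
spatial_model.fc = nn.Linear(spatial_model.fc.in_features, num_classes)
spatial_model.eval()

def predict_video_rgb(frames: torch.Tensor, num_segments: int = 3) -> torch.Tensor:
    """frames: (T, 3, H, W) decoded RGB frames of one video.

    TSN-style inference sketch: sample one frame per temporal segment,
    score each sampled frame, and average the per-frame class scores to
    get the video-level prediction (segment consensus by averaging).
    """
    t = frames.shape[0]
    # middle frame of each of `num_segments` equal segments
    idx = [min(t - 1, (2 * s + 1) * t // (2 * num_segments)) for s in range(num_segments)]
    with torch.no_grad():
        logits = spatial_model(frames[idx])          # (num_segments, num_classes)
    return torch.softmax(logits.mean(dim=0), dim=0)  # video-level score

# example with dummy frames
scores = predict_video_rgb(torch.randn(30, 3, 224, 224))
print(scores)  # e.g. tensor([p_non_violent, p_violent])
```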
[0042] Specifically, extracting the violent-video feature information from the motion optical flow comprises the following steps:
[0043] Step 41: Build a deep neural network for motion optical flow classification with attention relationships by combining the TSN network with the attention mechanism module GCNet;
[0044] Step 42: Train the network of step 41 on the violent video sample data set to obtain the deep neural network model for motion optical flow classification;
[0045] Step 43: Use the model obtained in step 42 to output prediction scores for the violent video sample data.
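A temporal-stream model consumes stacked optical-flow frames rather than RGB. The sketch below extracts dense flow with OpenCV and stacks the x/y components into the (2·L, H, W) input layout commonly used with TSN-style temporal streams; the Farneback algorithm and the stack length are assumptions, since the patent does not specify how the optical flow is computed.

```python
import cv2
import numpy as np

def extract_flow_stack(video_path: str, length: int = 5) -> np.ndarray:
    """Compute dense optical flow between consecutive frames and stack the
    x/y flow channels into a (2*length, H, W) array for the temporal stream.
    Farneback flow is used purely for illustration."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        raise IOError(f"cannot read {video_path}")
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

    channels = []
    while len(channels) < 2 * length:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        channels.append(flow[..., 0])   # horizontal displacement
        channels.append(flow[..., 1])   # vertical displacement
        prev_gray = gray
    cap.release()
    return np.stack(channels, axis=0)   # (2*length, H, W)
```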
[0046] Specifically, spatio-temporal feature fusion comprises the following steps:
[0047] Step 51: Obtain the violence prediction scores from the two modality networks, i.e. the single-frame image prediction score from the spatial-stream network and the motion optical flow prediction score from the temporal-stream network;
[0048] Step 52: Perform late fusion of the spatio-temporal results. The violence prediction scores of the two modalities obtained in step 51 are averaged to give the final violence prediction score.
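Step 52 is plain average (late) fusion. A minimal sketch, assuming each modality outputs a normalized two-class score vector with the class order [non-violent, violent]:

```python
import numpy as np

def fuse_scores(rgb_score: np.ndarray, flow_score: np.ndarray) -> np.ndarray:
    """Average the single-frame-image score and the motion-optical-flow score
    to obtain the final violence prediction score (step 52)."""
    return (rgb_score + flow_score) / 2.0

# example: the spatial stream is unsure, the temporal stream is confident
fused = fuse_scores(np.array([0.55, 0.45]), np.array([0.10, 0.90]))
print(fused, "-> class", int(np.argmax(fused)))   # [0.325 0.675] -> class 1
```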
[0049] Figure 1 is the flowchart of attention-based bimodal task learning. Following the flow, the specific implementation of each step of the algorithm is as follows:
[0050] Read the video stream:
[0051] The system first obtains video stream data; the acquisition source may be a video file collected in advance.
[0052] Feature extraction with attention relationship weights:
[0053] Extract single-frame images from the video and feed them into the single-frame image feature extraction network based on TSN+GCNet to extract features with attention relationship weights;
[0054] extract the motion optical flow from the video and feed the optical flow information into the motion optical flow feature extraction network based on TSN+GCNet to extract features with attention relationship weights.
[0055] Fusion of spatio-temporal features:
[0056] The two kinds of feature information obtained in the previous step are used to train the two network models under the spatio-temporal features;
[0057] each of the two models then gives its own violent-video prediction score;
[0058] the prediction scores given by the two models are averaged and fused, and the classification result of the violent video is output.

Example Embodiment

[0059] Embodiment 2: As shown in Figures 1, 2, 3 and 4, the violent video recognition method based on attention-mechanism bimodal task learning includes the following steps:
[0060] Step S101, adding an attention mechanism module to the deep neural network to capture interdependence between violent features;
[0061] Step S102, using a deep neural network with an attention mechanism to extract the features of the violent video on a single frame image;
[0062] Step S103, using a deep neural network with an attention mechanism to extract the features of the violent video on the motion optical flow;
[0063] Step S104, build the violence recognition system based on a post-fusion strategy that averages the multi-modal results.
[0064] The base convolutional neural network used is the TSN network, which consists of a spatial-stream convolutional neural network and a temporal-stream convolutional neural network. The attention mechanism module, here the GCNet module, is added to both modality networks to capture global feature relationships and obtain the attention relationship weights. The positions at which the attention mechanism module is added to the networks are designed as follows:
[0065] add the attention mechanism module at conv_bn_3c, conv_bn_4e and conv_bn_5b of the spatial-stream convolutional neural network;
[0066] add the attention mechanism module at conv_bn_3c, conv_bn_4e and conv_bn_5b of the temporal-stream convolutional neural network.
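One way to realise this placement, sketched below, is to replace each named stage of the backbone with a Sequential of the original stage followed by an attention block. The bninception() constructor and the channel counts in the usage comment are assumptions; only the stage names conv_bn_3c, conv_bn_4e and conv_bn_5b come from the patent, and the AttentionStub stands in for the GCNet-style block sketched earlier.

```python
import torch.nn as nn

class AttentionStub(nn.Identity):
    """Stand-in for the GCNet-style global context block sketched earlier;
    it only records the stage's output channel count so a real attention
    block with the same constructor could be dropped in."""
    def __init__(self, channels: int):
        super().__init__()
        self.channels = channels

def embed_attention(backbone: nn.Module, stage_channels: dict) -> nn.Module:
    """Replace each named stage with Sequential(stage, attention block), i.e.
    insert the attention module right after that stage's output."""
    for name, channels in stage_channels.items():
        stage = getattr(backbone, name)
        setattr(backbone, name, nn.Sequential(stage, AttentionStub(channels)))
    return backbone

# usage sketch -- the bninception() constructor and channel counts are
# assumptions; only the stage names are taken from the patent:
# backbone = embed_attention(bninception(),
#                            {"conv_bn_3c": 576, "conv_bn_4e": 1056, "conv_bn_5b": 1024})
```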
[0067] Step S102 also has the following features. First, the violent video sample library composed of positive and negative samples is split into frames and the single-frame image data of the videos are saved; the single-frame data are then fed into a deep neural network with an attention mechanism for classification training, yielding a spatial-stream feature extraction model with attention relationships (a frame-splitting sketch follows these steps). Here, the design of the deep convolutional neural network and the extraction of attention relationship features include the following steps:
[0068] add the attention mechanism module at conv_bn_3c, conv_bn_4e and conv_bn_5b of the spatial-stream convolutional neural network;
[0069] after passing through the network layers carrying the attention module, the attention relationship weights are obtained and then fused with the original features to obtain the attention relationship features of the single-frame images.
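For the frame-splitting part of step S102 noted above, a minimal sketch of dumping single-frame images from one sample video might look like this; the file layout and sampling stride are assumptions, since the patent only states that single-frame image data are saved.

```python
import os
import cv2

def dump_frames(video_path: str, out_dir: str, stride: int = 10) -> int:
    """Split one video of the positive/negative sample library into frames and
    save every `stride`-th frame as a JPEG (layout and stride are assumptions)."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    saved, index = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % stride == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{index:06d}.jpg"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved
```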
[0070] Step S103 also has the following features. First, optical flow extraction is performed on the violent video sample library composed of positive and negative samples and the optical flow data of the videos are saved; the optical flow data are then fed into a deep neural network with an attention mechanism module for classification training, yielding a feature extraction model based on motion optical flow. Here, the design of the deep convolutional neural network and the extraction of attention relationship features include the following steps:
[0071] add the attention mechanism module at conv_bn_3c, conv_bn_4e and conv_bn_5b of the temporal-stream convolutional neural network;
[0072] after passing through the network layers carrying the attention module, the attention relationship weights are obtained and then fused with the original features to obtain the attention relationship features of the motion optical flow.
[0073] Step S104 also has the following features. First, the two kinds of features extracted in steps S102 and S103 are fed into the corresponding neural networks for training to obtain a model for each modality, and each model then gives its decision score. Finally, the decision scores of the two modalities are post-fused to give the final video decision result, where the post-fusion is realized mainly by average fusion.
