[0024] Embodiment 1: As shown in Figure 1, Figure 2, Figure 3 and Figure 4, the violent video recognition method based on attention-mechanism dual-modal task learning comprises the following steps:
[0025] Step 1: Add an attention mechanism module to the spatial-stream deep neural network to capture the interdependencies among the violence features of static frame images, forming the attention weights;
[0026] Step 2: Add an attention mechanism module to the temporal-stream deep neural network to capture the interdependencies among the violence features of the optical flow sequence, forming the attention weights;
[0027] Step 3: Extract the feature information of the violent video from single frame images, and establish a violent video recognition model based on single frame images;
[0028] Step 4: Extract the feature information of the violent video from the motion optical flow, and establish a violent video recognition model based on motion optical flow;
[0029] Step 5: Spatio-temporal feature fusion. Using the average fusion method, the scores of the violent video recognition model based on single frame images and the scores of the violent video recognition model based on motion optical flow are fused to give the final violence classification score.
[0030] Specifically, the steps of adding the attention mechanism module to the spatial-stream deep neural network are as follows:
[0031] Step 11: Construct a deep neural network for capturing violence attention relationships on the spatial stream. Using the TSN network as the backbone, the attention mechanism module GCNet is embedded at the conv_bn_3c, conv_bn_4e and conv_bn_5b layers of the network to complete the deep neural network for capturing violence attention relationships on the spatial stream;
[0032] Step 12: Learn the attention relationship weights. Train the deep neural network constructed in Step 11 on the violent video sample data set to obtain the spatial-stream violence attention relationship weights;
[0033] Step 13: Form the attention features. Fuse the original features with the spatial-stream violence attention relationship weights learned in Step 12 by element-wise addition, obtaining spatial-stream features with attention interdependencies. A sketch of such an attention block is given below.
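The patent gives no code, but the GCNet module it names follows a published design; the PyTorch sketch below shows the kind of global context block that would be embedded at layers such as conv_bn_3c, ending with the element-wise addition described in Step 13. The class name GlobalContextBlock, the reduction ratio of 16, and the channel handling are illustrative assumptions, not details taken from the text.

```python
# A minimal GCNet-style global context block (sketch, not the patented code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalContextBlock(nn.Module):
    """Global context pooling -> channel transform -> element-wise addition
    with the original feature map (the fusion named in Step 13)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.context_mask = nn.Conv2d(channels, 1, kernel_size=1)  # attention weights
        hidden = max(channels // reduction, 1)
        self.transform = nn.Sequential(                            # channel transform
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.LayerNorm([hidden, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        # 1. Context modeling: softmax attention over all spatial positions.
        mask = F.softmax(self.context_mask(x).view(n, 1, h * w), dim=2)
        context = torch.bmm(x.view(n, c, h * w), mask.transpose(1, 2))
        context = context.view(n, c, 1, 1)                         # (N, C, 1, 1)
        # 2. Transform the context, then 3. fuse by broadcast element-wise addition.
        return x + self.transform(context)
```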
[0034] Specifically, the steps of adding the attention mechanism module to the temporal-stream deep neural network are as follows:
[0035] Step 21: Construct a deep neural network for capturing violence attention relationships on the temporal stream. Using the TSN network as the backbone, the attention mechanism module GCNet is embedded at the conv_bn_3c, conv_bn_4e and conv_bn_5b layers of the network to complete the deep neural network for capturing violence attention relationships on the temporal stream;
[0036] Step 22: Learn the attention relationship weights. Train the deep neural network constructed in Step 21 on the violent video sample data set to obtain the temporal-stream violence attention relationship weights;
[0037] Step 23: Form the attention features. Fuse the original features with the temporal-stream violence attention relationship weights learned in Step 22 by element-wise addition, obtaining temporal-stream features with attention interdependencies.
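The temporal stream reuses the spatial-stream construction but consumes stacked optical flow rather than an RGB image. A common TSN practice, assumed here rather than stated in the patent, is to rebuild the backbone's first convolution for 2L flow channels (e.g. five x/y flow pairs, ten channels) and initialise it from the averaged pretrained RGB kernels:

```python
# Sketch of adapting a pretrained RGB first conv for stacked-flow input
# (TSN cross-modality initialisation; assumed practice, not patent text).
import torch
import torch.nn as nn

def adapt_first_conv(conv_rgb: nn.Conv2d, flow_channels: int = 10) -> nn.Conv2d:
    conv_flow = nn.Conv2d(flow_channels, conv_rgb.out_channels,
                          kernel_size=conv_rgb.kernel_size,
                          stride=conv_rgb.stride,
                          padding=conv_rgb.padding,
                          bias=conv_rgb.bias is not None)
    with torch.no_grad():
        # Average the RGB kernels over the input-channel axis and replicate
        # the mean kernel across all flow channels.
        mean_kernel = conv_rgb.weight.mean(dim=1, keepdim=True)
        conv_flow.weight.copy_(mean_kernel.repeat(1, flow_channels, 1, 1))
        if conv_rgb.bias is not None:
            conv_flow.bias.copy_(conv_rgb.bias)
    return conv_flow
```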
[0038] Specifically, the steps of extracting the feature information of a violent video from single frame images are as follows:
[0039] Step 31: Construct a deep neural network for single-frame image classification with attention relationships. Combine the TSN network with the attention mechanism module GCNet to complete the deep neural network for single-frame image classification with attention relationships;
[0040] Step 32: Train the deep neural network for single-frame image classification with attention relationships from Step 31 on the violent video sample data set to obtain the single-frame image classification model;
[0041] Step 33: Use the single-frame image classification model obtained in Step 32 to output prediction scores for the violent video sample data.
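A minimal sketch of Step 33, assuming a two-class (non-violent/violent) softmax output and TSN's segmental consensus by averaging; spatial_model is a placeholder for the network trained in Step 32:

```python
# Producing a violence prediction score from the trained spatial model (sketch).
import torch

@torch.no_grad()
def predict_scores(spatial_model, frames: torch.Tensor) -> torch.Tensor:
    """frames: (num_segments, 3, H, W) frames sampled from one video.
    Returns class probabilities, e.g. [p_nonviolent, p_violent]."""
    spatial_model.eval()
    logits = spatial_model(frames)        # (num_segments, num_classes)
    consensus = logits.mean(dim=0)        # TSN segmental consensus (average)
    return torch.softmax(consensus, dim=0)
```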
[0042] Specifically, the steps of extracting the feature information of a violent video from the motion optical flow are as follows:
[0043] Step 41: Construct a deep neural network for motion optical flow classification with attention relationships. Combine the TSN network with the attention mechanism module GCNet to complete the deep neural network for motion optical flow classification with attention relationships;
[0044] Step 42: Train the deep neural network for motion optical flow classification with attention relationships from Step 41 on the violent video sample data set to obtain the motion optical flow classification model;
[0045] Step 43: Use the motion optical flow classification model obtained in Step 42 to output prediction scores for the violent video sample data.
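The patent does not name an optical flow algorithm. TSN pipelines often use TV-L1; the sketch below substitutes OpenCV's dense Farneback method as a dependency-light illustration of turning consecutive frames into the x/y flow maps consumed by the temporal stream:

```python
# Building a stacked optical-flow input from consecutive frames (sketch).
import cv2
import numpy as np

def flow_stack(frames: list, length: int = 5) -> np.ndarray:
    """frames: list of >= length+1 grayscale uint8 frames of shape (H, W).
    Returns a (2*length, H, W) stack of x/y flow fields."""
    maps = []
    for prev, nxt in zip(frames[:length], frames[1:length + 1]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        maps.extend([flow[..., 0], flow[..., 1]])   # x then y component
    return np.stack(maps, axis=0)
```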
[0046] Specifically, spatio-temporal feature fusion includes the following steps:
[0047] Step 51: Obtain the violence prediction scores under the two modal networks. First, obtain the single-frame image prediction score from the spatial-stream network and the motion optical flow prediction score from the temporal-stream network;
[0048] Step 52: Perform late fusion of the spatio-temporal features. After Step 51, the violence prediction scores of the two modalities are averaged to give the final violence prediction score.
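A minimal sketch of the average fusion in Steps 51-52; the two-class probability vectors and the equal stream weights are assumptions consistent with the "average fusion method" named in Step 5:

```python
# Late average fusion of the two streams' prediction scores (sketch).
import numpy as np

def fuse_scores(spatial_scores: np.ndarray, temporal_scores: np.ndarray) -> np.ndarray:
    """Both inputs are per-class probability vectors from the two streams."""
    return (spatial_scores + temporal_scores) / 2.0

# Example: spatial says 0.70 violent, temporal says 0.90 -> fused 0.80.
fused = fuse_scores(np.array([0.30, 0.70]), np.array([0.10, 0.90]))
label = "violent" if fused.argmax() == 1 else "non-violent"
```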
[0049] Figure 1 is a flowchart of attention-based dual-modal task learning. Following the flow sequence, each step of the algorithm is implemented as follows:
[0050] Read the video stream;
[0051] The system first obtains video stream data. The video data source may be a video file collected in advance.
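A minimal sketch of reading a pre-collected video file with OpenCV; the path handling is illustrative:

```python
# Reading all frames of a video file into memory (sketch).
import cv2

def read_frames(path: str) -> list:
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:                  # end of stream or read error
            break
        frames.append(frame)        # BGR frame as a NumPy array
    cap.release()
    return frames
```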
[0052] Extract features with attention relationship weights;
[0053] Extract single frame images from the video and feed them into the single-frame image feature extraction network based on the TSN+GCNet model to extract features with attention relationship weights;
[0054] Extract the motion optical flow from the video and feed the optical flow information into the motion optical flow feature extraction network based on the TSN+GCNet model to extract features with attention relationship weights; the sparse frame sampling used by TSN is sketched below.
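TSN samples a video sparsely: it is divided into equal-length segments and one frame (or one short flow stack) is drawn from each. The sketch below assumes this standard TSN sampling; the default of three segments is an assumption, not a figure from the patent:

```python
# Sparse TSN-style segment sampling (sketch; assumes num_frames >= num_segments).
import random

def sample_segment_indices(num_frames: int, num_segments: int = 3) -> list:
    seg_len = num_frames // num_segments
    # One random index per equal-length segment (use the midpoint at test time).
    return [i * seg_len + random.randrange(seg_len) for i in range(num_segments)]
```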
[0055] Fuse the spatio-temporal features;
[0056] The two kinds of feature information obtained in the feature extraction step are used to train two network models, one under each of the spatial and temporal features;
[0057] Each of the two models then gives its own violent video prediction score;
[0058] The prediction scores given by the two models are averaged and fused, and the classification result of the violent video is output.