Action recognition method based on attention mechanism of convolution recurrent neural network

A technique combining recurrent and convolutional neural networks, applied in the field of computer-vision action recognition. It addresses the problem that salient regions cannot be effectively extracted, and achieves the effect of improving recognition accuracy.

Active Publication Date: 2017-10-20
DALIAN UNIV OF TECH
Cites: 3 · Cited by: 74


Problems solved by technology

[0007] Aiming at the problem that salient regions cannot be effectively extracted during action recognition, the present invention proposes an action recognition method based on the attention mechanism of a convolutional recurrent neural network.


Abstract

The present invention belongs to the field of computer-vision action recognition and proposes an action recognition method based on the attention mechanism of a convolutional recurrent neural network, in order to solve the problem that salient regions cannot be effectively extracted during action recognition and to improve classification accuracy. The method comprises: using a convolutional neural network to automatically extract features from the action video; using a spatial transformer network to realize an attention mechanism over the feature maps, extracting salient feature regions with this attention mechanism to generate target feature maps; and inputting the target feature maps into a convolutional recurrent neural network to produce the final action recognition result. Experiments show that the proposed method achieves good results on benchmark action video test sets such as UCF-11 and HMDB-51, and improves the accuracy of action recognition.

Application Domain

Character and pattern recognition; Neural architectures

Technology Topic

Activity recognition; Convolution (+9 more)


Examples


Example Embodiment

[0030] An embodiment of the present invention provides an action recognition method based on an attention mechanism. The specific embodiments discussed are merely illustrative of implementations of the invention, and do not limit the scope of the invention. Embodiments of the present invention will be described in detail below in conjunction with the accompanying drawings, specifically including the following steps:
[0031] 1 Data preprocessing. The RGB images of the original video frames vary in size, which is unsuitable for subsequent processing. The present invention therefore crops the original images to a uniform size. At the same time, to speed up subsequent processing, the present invention normalizes the images.
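A minimal sketch of such a preprocessing pipeline, written with torchvision (the library choice, the 224×224 crop size, and the ImageNet normalization statistics are assumptions, not values stated in the patent):

```python
from torchvision import transforms

# Illustrative preprocessing: unify frame size by cropping, then normalize.
# The crop size and normalization statistics below are assumed, not from
# the patent (they are the conventional ImageNet values).
preprocess = transforms.Compose([
    transforms.Resize(256),           # scale the shorter side to 256 px
    transforms.CenterCrop(224),       # crop every frame to a uniform 224x224
    transforms.ToTensor(),            # HWC uint8 image -> CHW float in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
# tensor = preprocess(frame) for each PIL-image video frame
```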
[0032] 2 Feature extraction. Given the success of the GoogLeNet neural network in image feature representation, the present invention treats a video as an image collection composed of multiple frames and uses a convolutional neural network to extract per-frame features. GoogLeNet is selected as the feature-extraction model: it is first pre-trained on the ImageNet data set, and the trained model is then used to extract features from the video frames. The present invention extracts features from the last convolutional layer of the GoogLeNet model. Figure 2 gives an example of using GoogLeNet to extract video feature maps.
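A hedged sketch of this step using torchvision's pre-trained GoogLeNet (the use of torchvision, the hook on `inception5b` as "the last convolutional layer", and the clip length are assumptions; the patent only specifies GoogLeNet pre-trained on ImageNet):

```python
import torch
from torchvision.models import googlenet, GoogLeNet_Weights

# Load GoogLeNet pre-trained on ImageNet.
model = googlenet(weights=GoogLeNet_Weights.IMAGENET1K_V1).eval()

# Capture the output of the final inception block, i.e. the last
# convolutional stage before global pooling (1024 x 7 x 7 for 224x224 input).
features = {}
model.inception5b.register_forward_hook(
    lambda module, inputs, output: features.update(maps=output)
)

frames = torch.randn(16, 3, 224, 224)   # a clip of 16 preprocessed frames
with torch.no_grad():
    model(frames)
feature_maps = features["maps"]          # shape (16, 1024, 7, 7)
```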
[0033] 3 Process the feature maps with the attention mechanism. The present invention uses a spatial transformer network (Spatial Transformer Network) to realize the attention mechanism. The spatial transformer network is a differentiable module that performs spatial transformation operations on the video feature maps during forward propagation, applying a different transformation for each input. A spatial transformer can be divided into three parts: a positioning network, a grid generator, and a sampler. Figure 3 gives the model structure diagram of the spatial transformer network.
[0034] (1) Positioning network
[0035] The present invention uses a recurrent neural network to realize the positioning network, as shown in Figure 4. Its input is the feature map $U \in \mathbb{R}^{H \times W \times C}$ generated in step 2, where H, W and C denote the height, width and number of channels of the feature map extracted from the last convolutional layer of GoogLeNet. The positioning network processes the feature map to obtain the transformation parameters $\theta = f_{loc}(U)$. First, average pooling (mean pooling) is applied to the input feature map to reduce it to a 1-dimensional feature vector; the feature vectors of successive frames are then fed into a long short-term memory model (LSTM); finally, a fully connected layer (FC) with a linear activation function generates the transformation parameters θ for each frame.
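An illustrative sketch of such a positioning network (the hidden size, the batch-first tensor layout, and the identity-transform initialization are assumptions not given in the patent):

```python
import torch
import torch.nn as nn

class PositioningNetwork(nn.Module):
    """Mean-pool each frame's feature map, run an LSTM over the frame
    sequence, and emit 6 affine parameters per frame via a linear FC."""

    def __init__(self, channels=1024, hidden=256):   # sizes are assumptions
        super().__init__()
        self.lstm = nn.LSTM(channels, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 6)   # linear activation: no nonlinearity
        # Assumed initialization: start from the identity transform,
        # a common choice for spatial transformer networks.
        self.fc.weight.data.zero_()
        self.fc.bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, feats):            # feats: (B, T, C, H, W)
        pooled = feats.mean(dim=(3, 4))  # mean pooling -> (B, T, C)
        out, _ = self.lstm(pooled)       # (B, T, hidden)
        theta = self.fc(out)             # (B, T, 6) parameters per frame
        return theta.view(-1, 2, 3)      # one 2x3 affine matrix per frame
```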
[0036] (2) Grid generator
[0037] The present invention uses a 2D affine transformation $A_\theta$ to implement the grid generator, as shown in the formula:
[0038]
$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$$
[0039] where $(x_i^t, y_i^t)$ are the target coordinates of the regular grid in the output feature map, $(x_i^s, y_i^s)$ are the coordinates of the sampling points in the input feature map, and $A_\theta$ is the affine transformation matrix. The present invention first normalizes the height and width so that $-1 \le x_i^t, y_i^t \le 1$ and $-1 \le x_i^s, y_i^s \le 1$; then, from the transformation parameters θ generated by the positioning network together with the target coordinate values, the sampling coordinates required by the sampler are generated.
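For illustration, PyTorch's built-in `affine_grid` computes exactly this mapping from θ to normalized sampling coordinates (the library call and the example θ, which attends to the center of the feature map, are assumptions for demonstration):

```python
import torch
import torch.nn.functional as F

theta = torch.tensor([[[0.5, 0.0, 0.0],   # example 2x3 affine matrix that
                       [0.0, 0.5, 0.0]]]) # zooms into the central region

# affine_grid pushes the regular target grid through A_theta, yielding
# normalized source coordinates in [-1, 1] for every output location.
grid = F.affine_grid(theta, size=(1, 1024, 7, 7), align_corners=False)
print(grid.shape)   # (1, 7, 7, 2): one (x_s, y_s) pair per target pixel
```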
[0040] (3) Sampler
[0041] The present invention uses a bilinear kernel to sample at the sampling points generated by the grid generator. The bilinear kernel is as follows:
[0042]
$$V_i^c = \sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm}^c \, \max(0, 1 - |x_i^s - m|) \, \max(0, 1 - |y_i^s - n|)$$
[0043] H, W and C denote the height, width and number of channels of the input feature map, $U_{nm}^c$ is the value of the input feature map at coordinate position (n, m) in channel c, and $V_i^c$ is the pixel value of the output feature map at coordinate position $(x_i^t, y_i^t)$ in channel c. The present invention samples each channel of the input feature map identically, so every channel is transformed in the same way, maintaining the spatial consistency between channels. This sampling kernel is differentiable and can be optimized simply by backpropagation.
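A minimal sketch of the sampler via PyTorch's `grid_sample`, which evaluates a bilinear kernel of this form at each sampling point and applies it identically to every channel (the library choice and the tensor shapes are assumptions):

```python
import torch
import torch.nn.functional as F

U = torch.randn(1, 1024, 7, 7)            # input feature map from GoogLeNet

theta = torch.tensor([[[0.5, 0.0, 0.0],   # transformation parameters from
                       [0.0, 0.5, 0.0]]]) # the positioning network

# Bilinear sampling at the grid points; the operation is differentiable,
# so gradients flow back through both U and theta.
grid = F.affine_grid(theta, size=U.shape, align_corners=False)
V = F.grid_sample(U, grid, mode="bilinear", align_corners=False)
print(V.shape)                            # (1, 1024, 7, 7): target feature map
```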
[0044] (4) Model the video feature sequence. As shown in Figure 5, the present invention uses a convolutional recurrent neural network (ConvLSTM) to model the sequence. This network model replaces the original fully connected operations with convolution operations, so that both the input-to-state and state-to-state transitions use a convolutional structure, and a sequence classification structure is formed by stacking multiple ConvLSTM layers. The key equations of ConvLSTM are shown in the following formulas, where "*" denotes the convolution operator and "∘" denotes the Hadamard product:
[0045]
$$\begin{aligned} i^{(t)} &= \sigma\left(W_{xi} * X^{(t)} + W_{hi} * h^{(t-1)} + W_{ci} \circ c^{(t-1)} + b_i\right) \\ f^{(t)} &= \sigma\left(W_{xf} * X^{(t)} + W_{hf} * h^{(t-1)} + W_{cf} \circ c^{(t-1)} + b_f\right) \\ c^{(t)} &= f^{(t)} \circ c^{(t-1)} + i^{(t)} \circ \tanh\left(W_{xc} * X^{(t)} + W_{hc} * h^{(t-1)} + b_c\right) \\ o^{(t)} &= \sigma\left(W_{xo} * X^{(t)} + W_{ho} * h^{(t-1)} + W_{co} \circ c^{(t)} + b_o\right) \\ h^{(t)} &= o^{(t)} \circ \tanh\left(c^{(t)}\right) \end{aligned}$$
[0046] $W_{x\sim}$ and $W_{h\sim}$ denote the convolution kernels; the input gate $i^{(t)}$, forget gate $f^{(t)}$, output gate $o^{(t)}$, memory cells $c^{(t)}$ and $c^{(t-1)}$, and hidden states $h^{(t)}$ and $h^{(t-1)}$ are all 3D tensors.
[0047] The convolution operation would make the size of the states inconsistent with that of the input. The present invention therefore pads the states of the ConvLSTM before applying the convolution operation, so that the states have the same size as the input. The present invention uses the convolutional recurrent neural network to generate a category prediction for each frame in a video.
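A simplified ConvLSTM cell sketch consistent with the equations and padding above (the peephole terms $W_{c\sim} \circ c$ are omitted for brevity, and the channel counts and kernel size are assumptions):

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """One ConvLSTM step: convolutions replace the fully connected
    input-to-state and state-to-state transitions of a standard LSTM."""

    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # padding=k//2 keeps the states the same spatial size as the
        # input, matching the padding described in [0047].
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, h, c):           # all tensors: (B, C, H, W)
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)     # "*" here is the Hadamard product
        h = o * torch.tanh(c)
        return h, c
```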
[0048] (5) Action classification. From step (4) the present invention obtains category predictions for the video frames, and these predictions are used to classify the action. For an action video, the present invention counts which category occurs most often across all frames of the video and takes that category as the final classification result for the video. Figure 6 shows the flow chart of the action recognition algorithm, based on the attention mechanism of the convolutional recurrent neural network, provided by the embodiment of the present invention.
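An illustrative sketch of this majority vote (the function name and the `(T, num_classes)` per-frame logits layout are hypothetical):

```python
import torch

def classify_video(frame_logits: torch.Tensor) -> int:
    """Majority vote: pick the class predicted most often across frames.
    frame_logits is a hypothetical (T, num_classes) tensor of per-frame
    class scores produced by the ConvLSTM."""
    frame_preds = frame_logits.argmax(dim=1)      # predicted class per frame
    return torch.mode(frame_preds).values.item()  # most frequent class wins
```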
