A two-stream video classification method and device based on a cross-modal attention mechanism

A video classification and attention-mechanism technology, applied in the field of computer vision, that addresses the difficulty of locating key objects quickly and accurately, the lack of a dedicated way to model "moving objects", and the scarcity of related research, achieving improved video classification accuracy and better compatibility.

Active Publication Date: 2021-06-22
PEKING UNIV +2

AI Technical Summary

Problems solved by technology

[0005] Second, current technology still struggles to locate key objects quickly and accurately. Few studies address how to capture key clues, that is, how to introduce an attention mechanism into video classification. The most representative is the non-local neural network (Non-local Neural Networks), but that network can only attend to important information within a single modality and has no dedicated way to model "moving objects".


Embodiment Construction

[0035] In order to make the above objects, features and advantages of the present invention more comprehensible, the present invention will be further described in detail below through specific embodiments and accompanying drawings.

[0036] 1. Configuration of cross-modal attention module

[0037] The cross-modal attention module can handle input of any dimensionality and guarantees that its output has the same shape as its input, so it has excellent compatibility. Taking the 2-dimensional configuration as an example, Q, K, and V are each obtained by a 1x1 2-dimensional convolution (for 3-dimensional models, a 1x1x1 3-dimensional convolution). To reduce computational complexity and save GPU memory, these convolutions also reduce the channel dimension while producing Q, K, and V. In order to further simplify the operation, a max-pooling operation can be performed before the convolu...
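The module described in [0037] can be illustrated with a minimal sketch in plain Python. The text here confirms only the 1x1 convolutions, the channel reduction, and the matching input and output shapes. Everything else is an assumption made for illustration: Q is taken from one stream while K and V are taken from the other, the attention weights are normalized with softmax, and a final 1x1 convolution restores the channel count before a residual addition, as in a non-local block. The names `conv1x1`, `cross_modal_attention`, `stream_a`, `stream_b`, and all weight values are hypothetical. The optional max-pooling step is left out.

```python
import math

def conv1x1(features, weight):
    """1x1 convolution as a per-position linear map over channels.
    features: N positions, each a list of C_in values (flattened H x W grid).
    weight:   C_out rows of C_in values."""
    return [[sum(w * c for w, c in zip(row, pos)) for row in weight]
            for pos in features]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_attention(stream_a, stream_b, wq, wk, wv, wo):
    """Update stream_a with information attended from stream_b.
    Assumed layout: Q from stream_a, K and V from stream_b."""
    q = conv1x1(stream_a, wq)   # N x d, channels reduced from C to d
    k = conv1x1(stream_b, wk)   # M x d
    v = conv1x1(stream_b, wv)   # M x d
    attended = []
    for qi in q:
        # Similarity of this position to every position of the other stream.
        weights = softmax([sum(a * b for a, b in zip(qi, kj)) for kj in k])
        attended.append([sum(w * vj[c] for w, vj in zip(weights, v))
                         for c in range(len(v[0]))])
    restored = conv1x1(attended, wo)   # N x C, channels restored (assumed)
    # Residual addition keeps the output shape equal to the input shape.
    return [[xc + rc for xc, rc in zip(xi, ri)]
            for xi, ri in zip(stream_a, restored)]

# Toy data: 3 positions, 4 channels, reduced to 2 channels inside the module.
stream_a = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
stream_b = [[0.5, 0.5, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 1.0]]
wq = [[0.1, 0.0, 0.2, 0.0], [0.0, 0.3, 0.0, 0.1]]
wk = [[0.2, 0.1, 0.0, 0.0], [0.0, 0.0, 0.1, 0.3]]
wv = [[0.1, 0.1, 0.1, 0.1], [0.3, 0.0, 0.0, 0.2]]
wo = [[0.5, 0.0], [0.0, 0.5], [0.1, 0.1], [0.2, 0.0]]

out = cross_modal_attention(stream_a, stream_b, wq, wk, wv, wo)
```

Passing the same features as both `stream_a` and `stream_b` turns this sketch into single-modality self-attention, which is the case the abstract describes as equivalent to a non-local neural network.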


Abstract

The invention relates to a two-stream video classification method and device based on a cross-modal attention mechanism. Unlike the traditional two-stream method, the invention fuses the information of two (or more) modalities before predicting the result, so the fusion is more efficient and more thorough. Because the branches exchange information at an earlier stage, each single branch already carries the important information of the other branch by the later stages; the accuracy of a single branch equals or even exceeds that of the traditional two-stream method, with far fewer parameters. Compared with the non-local neural network, the attention module designed by the invention works across modalities instead of applying attention only within a single modality. When the two modalities are the same, the proposed method is equivalent to a non-local neural network.
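For contrast, the following is a minimal sketch of the kind of late fusion the traditional two-stream method is commonly associated with: each branch predicts independently and only the class scores are combined. Averaging softmax probabilities is assumed here as one common form of that fusion. The function name `late_fusion` and the logit values are hypothetical, and this is the baseline being compared against, not the method of the invention.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def late_fusion(logits_a, logits_b):
    """Average the per-class probabilities of two independent branches."""
    probs_a = softmax(logits_a)
    probs_b = softmax(logits_b)
    return [(a + b) / 2 for a, b in zip(probs_a, probs_b)]

# Hypothetical logits for three classes from two branches.
fused = late_fusion([2.0, 0.5, 0.1], [0.2, 1.5, 0.3])
predicted = max(range(len(fused)), key=fused.__getitem__)
```

In this scheme the two branches never see each other's features, so the only interaction happens at the score level.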

Description

Technical field

[0001] The invention relates to a video classification method, in particular to a two-stream video classification method and device using an attention mechanism, and belongs to the field of computer vision.

[0002] Technical background

[0003] With the rapid development of deep learning in the image field, deep learning methods have gradually been introduced into the video field and have achieved some success. However, the current state of the art is still far from the desired effect, and the problems it faces fall mainly into the following two aspects:

[0004] First, current technology has yet to take full advantage of dynamic information. Unlike a still image, a video carries dynamic information between frames, which is unique to video and very important. For example, even for humans, it is difficult to judge various sub-categories of dances (such as tango and salsa) only by looking at one frame of images, and if the motion trajectory informatio...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/75, G06F16/73, G06K9/62, G06N3/04
CPC: G06N3/045, G06F18/214
Inventor: 迟禄, 严慧, 田贵宇, 穆亚东, 陈刚, 王成成, 黄波, 韩峻, 糜俊青
Owner: PEKING UNIV