Multi-modal video dense event description algorithm with an interactive Transformer

An event description and multi-modal technology, applied in the field of video algorithms, that addresses problems such as sequential architectures hindering parallelization across training samples, and achieves good semantic description, good video segmentation, and improved scores on the related evaluation metrics.

Pending Publication Date: 2022-05-10
苏州零样本智能科技有限公司

AI Technical Summary

Problems solved by technology

[0003] Most dense video description methods are based on encoder-decoder architectures built from RNNs, LSTMs and their variants, whose inherently sequential nature hinders parallelization across training samples.




Detailed Description of the Embodiments

[0021] To make the technical means, creative features, objectives and effects of the present invention easy to understand, the invention is further described below in conjunction with specific embodiments.

[0022] As shown in Figure 1, the main process of the multi-modal dense video description algorithm with an interactive Transformer according to the present invention is as follows:

[0023] 1. The dense video description task is carried out on the ActivityNet Captions dataset. First, the visual features, audio features, and speech features in the video are extracted with the I3D model, the VGGish model, and an ASR system, respectively. Extracting multi-modal features allows the information in the video to be expressed more fully.
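A minimal sketch of this step, assuming the three pretrained extractors (I3D for vision, VGGish for audio, a text encoder over ASR transcripts for speech) are already available. The patent gives no code, so the stub modules, dimensions, and names below are hypothetical stand-ins that only mirror the shape of such per-step features.

```python
import torch
import torch.nn as nn

class StubBackbone(nn.Module):
    """Placeholder for a pretrained extractor (I3D / VGGish / ASR-text encoder).

    The real backbones are not reproduced here; each stub simply maps
    pre-pooled inputs to one feature vector per temporal step, which is all
    the downstream pipeline needs.
    """
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (steps, in_dim)
        return self.proj(x)                              # (steps, out_dim)

# Hypothetical feature dimensions for the three modalities.
visual_net = StubBackbone(in_dim=2048, out_dim=512)   # stands in for I3D
audio_net  = StubBackbone(in_dim=128,  out_dim=512)   # stands in for VGGish
speech_net = StubBackbone(in_dim=768,  out_dim=512)   # stands in for an ASR-transcript encoder

# Dummy per-step inputs for one video: 32 clips / audio frames / transcript chunks.
rgb_clips  = torch.randn(32, 2048)
audio_bins = torch.randn(32, 128)
asr_tokens = torch.randn(32, 768)

visual_feats = visual_net(rgb_clips)    # (32, 512) visual stream
audio_feats  = audio_net(audio_bins)    # (32, 512) audio stream
speech_feats = speech_net(asr_tokens)   # (32, 512) speech stream
print(visual_feats.shape, audio_feats.shape, speech_feats.shape)
```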

[0024] 2. The extracted features are encoded and decoded with the interactive Transformer: an interactive attention module fuses the visual features with the audio and speech features, and the fused video features are further encoded.
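The patent does not disclose the internals of its interactive Transformer, so the following is only one plausible reading: an interactive-attention block in which the visual stream cross-attends to the audio and speech streams, followed by a stock Transformer encoder-decoder that produces caption states. Module names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class InteractiveFusion(nn.Module):
    """Hypothetical interactive-attention block: the visual features attend to
    the audio and the speech features, and the three results are merged."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.merge = nn.Linear(3 * dim, dim)

    def forward(self, v, a, s):                        # each: (batch, time, dim)
        va, _ = self.v2a(query=v, key=a, value=a)      # visual attends to audio
        vs, _ = self.v2s(query=v, key=s, value=s)      # visual attends to speech
        return self.merge(torch.cat([v, va, vs], dim=-1))  # fused video features

fusion = InteractiveFusion()
# A stock encoder-decoder stands in for the captioning Transformer itself;
# the patent only states that encoding and decoding are performed.
transformer = nn.Transformer(d_model=512, nhead=8,
                             num_encoder_layers=2, num_decoder_layers=2,
                             batch_first=True)

v = torch.randn(2, 32, 512)             # visual stream  (batch, time, dim)
a = torch.randn(2, 32, 512)             # audio stream
s = torch.randn(2, 32, 512)             # speech stream
caption_emb = torch.randn(2, 20, 512)   # embedded target caption tokens

fused = fusion(v, a, s)                    # (2, 32, 512) fused video features
decoded = transformer(fused, caption_emb)  # (2, 20, 512) decoder states
print(decoded.shape)
```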

[0025] 3. Model training is completed in two stages. First, the description model is trained on the ground-truth (real) segment proposals; then the encoder weights of the trained description model are frozen and the segment proposal model is trained.
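A hedged sketch of this two-stage schedule, assuming a shared feature encoder: stage one trains the description (captioning) model on ground-truth segments, stage two freezes the trained encoder weights and trains the segment proposal model on top of it. The heads, losses, and target shapes are placeholders, not the patent's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder components standing in for the patent's models.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=2)
caption_head  = nn.Linear(512, 1000)    # per-step vocabulary logits (assumed vocab size)
proposal_head = nn.Linear(512, 2)       # per-step segment scores (assumed parameterization)

features  = torch.randn(4, 32, 512)             # fused multi-modal features (batch, time, dim)
word_ids  = torch.randint(0, 1000, (4, 32))     # dummy ground-truth caption tokens
prop_tgts = torch.rand(4, 32, 2)                # dummy ground-truth proposal targets

# --- Stage 1: train the description model on ground-truth segment proposals.
opt1 = torch.optim.Adam(list(encoder.parameters()) + list(caption_head.parameters()), lr=1e-4)
for _ in range(2):                                      # a couple of toy steps
    logits = caption_head(encoder(features))            # (4, 32, vocab)
    loss = F.cross_entropy(logits.flatten(0, 1), word_ids.flatten())
    opt1.zero_grad(); loss.backward(); opt1.step()

# --- Stage 2: freeze the trained encoder, then train the segment proposal model.
for p in encoder.parameters():
    p.requires_grad = False                             # encoder weights are frozen
opt2 = torch.optim.Adam(proposal_head.parameters(), lr=1e-4)
for _ in range(2):
    with torch.no_grad():
        enc = encoder(features)                         # reuse the frozen encoder
    prop = proposal_head(enc)                           # (4, 32, 2)
    loss = F.mse_loss(prop, prop_tgts)
    opt2.zero_grad(); loss.backward(); opt2.step()
```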


Abstract

The invention relates to a multi-modal dense video event description algorithm with an interactive Transformer, and belongs to the technical field of video algorithms. The method comprises the following steps: 1. extracting the visual features, audio features, and speech features in a video, so that the information in the video is better utilized through multi-modal feature extraction; 2. fusing the visual features with the audio features and the speech features through an interactive attention module in the interactive Transformer, and further encoding the video features; 3. completing model training in two stages: first a description model is trained based on real video segments, then the encoder weights of the trained description model are frozen, and a segment proposal model is trained. The method makes full use of the feature information in the video, interactively fuses the multi-modal features, and shows a good dense video description effect.

Description

Technical Field

[0001] The invention relates to a multi-modal dense video event description algorithm with an interactive Transformer, and belongs to the technical field of video algorithms.

Background

[0002] Existing dense video description algorithms extract only the visual information in a video. However, a video contains not only visual information but also audio information and even speech information, so extracting the visual information alone does not make full use of the information in the video.

[0003] Most dense video description methods are based on the encoder-decoder architecture of RNNs, LSTMs and their variants, whose inherently sequential nature hinders parallelization across training samples. For long sequences, the limits of machine memory further hinder batch processing of training samples and make training time-consuming. The above problems are addressed by using the interactive Transformer architecture.


Application Information

Patent Type & Authority: Application (China)
IPC(8): G06F16/738, G06F16/783, G06K9/62, G06N3/04, G06N3/08
CPC: G06F16/739, G06F16/7834, G06F16/7847, G06N3/08, G06N3/047, G06F18/253
Inventors: 陈国文, 杨昊
Owner: 苏州零样本智能科技有限公司