Multi-modal video dense event description algorithm with an interactive Transformer

An event description and multi-modal technology, applied in the field of video algorithms, that addresses problems such as sequential architectures hindering parallelization across training samples, and achieves good semantic description, good video segmentation, and improved scores on the related evaluation metrics.

Pending Publication Date: 2022-05-10
苏州零样本智能科技有限公司

AI Technical Summary

Problems solved by technology

[0003] Most dense video description methods are based on encoder-decoder architectures built from RNNs, LSTMs and their variants, whose inherently sequential nature hinders parallelization across training samples.




Detailed Description of the Embodiments

[0021] To make the technical means, creative features, objectives and effects of the present invention easy to understand, the invention is further described below in conjunction with specific embodiments.

[0022] As shown in Figure 1, the main process of the multi-modal dense video description algorithm with an interactive Transformer according to the present invention is as follows:

[0023] 1. The dense video description task is carried out on the ActivityNet Captions dataset. First, the visual features, audio features, and speech features in the video are extracted with the I3D model, the VGGish model, and an ASR system, respectively. Extracting multi-modal features allows the information in the video to be expressed more fully.
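A minimal sketch of this step, assuming the three pretrained extractors (I3D for vision, VGGish for audio, a text encoder over ASR transcripts for speech) are already available. The patent gives no code, so the stub modules, dimensions, and names below are hypothetical stand-ins that only mirror the shape of such per-step features.

```python
import torch
import torch.nn as nn

class StubBackbone(nn.Module):
    """Placeholder for a pretrained extractor (I3D / VGGish / ASR-text encoder).

    The real backbones are not reproduced here; each stub simply maps
    pre-pooled inputs to one feature vector per temporal step, which is all
    the downstream pipeline needs.
    """
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (steps, in_dim)
        return self.proj(x)                              # (steps, out_dim)

# Hypothetical feature dimensions for the three modalities.
visual_net = StubBackbone(in_dim=2048, out_dim=512)   # stands in for I3D
audio_net  = StubBackbone(in_dim=128,  out_dim=512)   # stands in for VGGish
speech_net = StubBackbone(in_dim=768,  out_dim=512)   # stands in for an ASR-transcript encoder

# Dummy per-step inputs for one video: 32 clips / audio frames / transcript chunks.
rgb_clips  = torch.randn(32, 2048)
audio_bins = torch.randn(32, 128)
asr_tokens = torch.randn(32, 768)

visual_feats = visual_net(rgb_clips)    # (32, 512) visual stream
audio_feats  = audio_net(audio_bins)    # (32, 512) audio stream
speech_feats = speech_net(asr_tokens)   # (32, 512) speech stream
print(visual_feats.shape, audio_feats.shape, speech_feats.shape)
```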

[0024] 2. The extracted features are encoded and decoded with the interactive Transformer: an interactive attention module fuses the visual features with the audio and speech features, and the fused video features are further encoded.
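The patent does not disclose the internals of its interactive Transformer, so the following is only one plausible reading: an interactive-attention block in which the visual stream cross-attends to the audio and speech streams, followed by a stock Transformer encoder-decoder that produces caption states. Module names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class InteractiveFusion(nn.Module):
    """Hypothetical interactive-attention block: the visual features attend to
    the audio and the speech features, and the three results are merged."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.merge = nn.Linear(3 * dim, dim)

    def forward(self, v, a, s):                        # each: (batch, time, dim)
        va, _ = self.v2a(query=v, key=a, value=a)      # visual attends to audio
        vs, _ = self.v2s(query=v, key=s, value=s)      # visual attends to speech
        return self.merge(torch.cat([v, va, vs], dim=-1))  # fused video features

fusion = InteractiveFusion()
# A stock encoder-decoder stands in for the captioning Transformer itself;
# the patent only states that encoding and decoding are performed.
transformer = nn.Transformer(d_model=512, nhead=8,
                             num_encoder_layers=2, num_decoder_layers=2,
                             batch_first=True)

v = torch.randn(2, 32, 512)             # visual stream  (batch, time, dim)
a = torch.randn(2, 32, 512)             # audio stream
s = torch.randn(2, 32, 512)             # speech stream
caption_emb = torch.randn(2, 20, 512)   # embedded target caption tokens

fused = fusion(v, a, s)                    # (2, 32, 512) fused video features
decoded = transformer(fused, caption_emb)  # (2, 20, 512) decoder states
print(decoded.shape)
```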

[0025] 3. Model training is completed in two stages. First, the description model is trained on the ground-truth (real) segment proposals; then the encoder weights of the trained description model are frozen and the segment proposal model is trained.
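A hedged sketch of this two-stage schedule, assuming a shared feature encoder: stage one trains the description (captioning) model on ground-truth segments, stage two freezes the trained encoder weights and trains the segment proposal model on top of it. The heads, losses, and target shapes are placeholders, not the patent's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder components standing in for the patent's models.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=2)
caption_head  = nn.Linear(512, 1000)    # per-step vocabulary logits (assumed vocab size)
proposal_head = nn.Linear(512, 2)       # per-step segment scores (assumed parameterization)

features  = torch.randn(4, 32, 512)             # fused multi-modal features (batch, time, dim)
word_ids  = torch.randint(0, 1000, (4, 32))     # dummy ground-truth caption tokens
prop_tgts = torch.rand(4, 32, 2)                # dummy ground-truth proposal targets

# --- Stage 1: train the description model on ground-truth segment proposals.
opt1 = torch.optim.Adam(list(encoder.parameters()) + list(caption_head.parameters()), lr=1e-4)
for _ in range(2):                                      # a couple of toy steps
    logits = caption_head(encoder(features))            # (4, 32, vocab)
    loss = F.cross_entropy(logits.flatten(0, 1), word_ids.flatten())
    opt1.zero_grad(); loss.backward(); opt1.step()

# --- Stage 2: freeze the trained encoder, then train the segment proposal model.
for p in encoder.parameters():
    p.requires_grad = False                             # encoder weights are frozen
opt2 = torch.optim.Adam(proposal_head.parameters(), lr=1e-4)
for _ in range(2):
    with torch.no_grad():
        enc = encoder(features)                         # reuse the frozen encoder
    prop = proposal_head(enc)                           # (4, 32, 2)
    loss = F.mse_loss(prop, prop_tgts)
    opt2.zero_grad(); loss.backward(); opt2.step()
```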


Abstract

The invention relates to a multi-modal dense video event description algorithm with an interactive Transformer, and belongs to the technical field of video algorithms. The method comprises the following steps: 1. extracting the visual features, audio features, and speech features in a video, so that the information in the video is better utilized through multi-modal feature extraction; 2. fusing the visual features with the audio features and the speech features through an interactive attention module in the interactive Transformer, and further encoding the video features; 3. completing model training in two stages: first a description model is trained based on real video segments, then the encoder weights of the trained description model are frozen, and a segment proposal model is trained. The method makes full use of the feature information in the video, interactively fuses the multi-modal features, and shows a good dense video description effect.

Description

Technical Field

[0001] The invention relates to a multi-modal dense video event description algorithm with an interactive Transformer, and belongs to the technical field of video algorithms.

Background

[0002] Existing dense video description algorithms extract only the visual information in a video. However, a video contains not only visual information but also audio information and even speech information, so extracting the visual information alone does not make full use of the information in the video.

[0003] Most dense video description methods are based on the encoder-decoder architecture of RNNs, LSTMs and their variants, whose inherently sequential nature hinders parallelization across training samples. For long sequences, the limits of machine memory further hinder batch processing of training samples and make training time-consuming. The above problems are addressed by using the interactive Transformer architecture.


Application Information

Patent Type & Authority: Application (China)
IPC(8): G06F16/738, G06F16/783, G06K9/62, G06N3/04, G06N3/08
CPC: G06F16/739, G06F16/7834, G06F16/7847, G06N3/08, G06N3/047, G06F18/253
Inventors: 陈国文, 杨昊
Owner: 苏州零样本智能科技有限公司