A method for multimodal video question answering using frame-subtitle self-supervision

A multimodal technology combining video frames and subtitles, applied in the field of video question answering, which addresses problems such as expensive time-tag annotation and the neglect of the correspondence between frames and subtitles.

Active Publication Date: 2022-07-08
STATE GRID ZHEJIANG ELECTRIC POWER +3


Problems solved by technology

[0004] This scheme requires time tags to train the decoder and improve performance, but manually annotating time tags is expensive. In addition, the above method treats video frames and subtitles separately, ignoring the correspondence between frames and subtitles.




Embodiment Construction

[0071] The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, rather than all the embodiments. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.

[0072] An embodiment of the present invention proposes a method for multimodal video question answering using frame-subtitle self-supervision. Referring to Figure 1, the method includes the following steps:

[0073] S1: For the input video, question and answer text, and...
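The embodiment description is truncated on this page, but the Abstract below enumerates the remaining steps: attending over frame and subtitle features, fusing them, scoring time steps, and deriving the time boundary and answer. As an illustration only, the following is a minimal PyTorch-style sketch of the attention-fusion step; the module and parameter names (CrossModalFusion, dim, num_heads) and the use of multi-head attention are assumptions, not the patent's exact formulation:

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Hypothetical sketch: question-guided attention over frame and
    subtitle features, concatenated into per-time-step fusion features."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.sub_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frames, subs, question):
        # frames: (1, T, dim) frame features; subs: (1, T, dim) subtitle
        # features; question: (1, L, dim) question-answer features.
        f_att, _ = self.frame_attn(frames, question, question)  # frame features with attention
        s_att, _ = self.sub_attn(subs, question, question)      # subtitle features with attention
        return torch.cat([f_att, s_att], dim=-1)                # (1, T, 2*dim) fusion features
```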



Abstract

The invention belongs to the field of video question answering, and in particular relates to a method for multimodal video question answering using frame-subtitle self-supervision. The method comprises the following steps: extracting video frame features, question-answer features, subtitle features, and subtitle proposal features; obtaining attended frame features and attended subtitle features and combining them into fusion features; calculating temporal attention scores from the fusion features; computing the time boundary of the question from the temporal attention scores; predicting the answer to the question using the fusion features and the temporal attention scores; training the neural network with the computed time boundary and the answer; and optimizing the network parameters, so that the optimal neural network can be used for video question answering and for delineating time boundaries. Instead of using expensive temporal annotations, the present invention generates question-related time boundaries from the designed temporal attention scores. In addition, the present invention obtains more accurate answers by mining the relationship between subtitles and the corresponding video content.
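The key self-supervision idea in the abstract is that the time boundary is derived from the temporal attention scores rather than from ground-truth annotations. As a hedged illustration of how per-step scores can yield a boundary, here is a minimal sketch; the scoring head and the threshold rule (keeping steps above a fraction of the peak score) are assumptions for illustration, not the patent's formulation:

```python
import torch
import torch.nn as nn

class TemporalScorer(nn.Module):
    """Hypothetical sketch: one attention score per time step from fusion features."""

    def __init__(self, dim: int):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, fused):
        # fused: (T, dim) fusion features -> (T,) normalized attention scores
        return torch.softmax(self.head(fused).squeeze(-1), dim=0)

def boundary_from_scores(scores, ratio: float = 0.5):
    """Assumed rule: keep steps scoring at least `ratio` of the peak, and
    return the first and last kept indices as the (start, end) boundary."""
    kept = (scores >= ratio * scores.max()).nonzero().squeeze(-1)
    return int(kept.min()), int(kept.max())

# Usage with the hypothetical fusion features above:
# scores = TemporalScorer(dim=512)(fused)    # fused: (T, 512)
# start, end = boundary_from_scores(scores)  # question-related time boundary
```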

Description

Technical field

[0001] The invention belongs to the field of video question answering, and particularly relates to a method for multimodal video question answering using frame-subtitle self-supervision.

Background technique

[0002] Multimodal video question answering is a challenging task that currently attracts much attention. It spans the two fields of computer vision and natural language processing, requiring a system to answer a question about a given video and to delineate the time boundary in the video that corresponds to the question. Video question answering is still a relatively novel task, and research on it remains immature.

[0003] Existing multimodal video question answering methods generally use a convolutional neural network to encode the video, and a recurrent neural network to encode the question-answer text and the subtitles in the video. A decoder is then designed and trained with the q...
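Paragraph [0003] describes the common encoding pipeline: a convolutional network for video and a recurrent network for subtitles and question-answer text. A minimal sketch of that baseline setup follows; the dimensions and the choice of a bidirectional LSTM over pre-extracted CNN frame features are assumptions for illustration:

```python
import torch
import torch.nn as nn

class BaselineEncoders(nn.Module):
    """Hypothetical sketch of the encoding scheme described in [0003]."""

    def __init__(self, frame_dim: int = 2048, emb_dim: int = 300, hidden: int = 256):
        super().__init__()
        # Frame features are assumed pre-extracted by a CNN (e.g. a ResNet).
        self.frame_proj = nn.Linear(frame_dim, 2 * hidden)
        # Recurrent encoder shared by subtitles and question-answer text.
        self.text_rnn = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, frame_feats, text_emb):
        # frame_feats: (1, T, frame_dim); text_emb: (1, L, emb_dim)
        v = self.frame_proj(frame_feats)   # (1, T, 2*hidden) video encoding
        s, _ = self.text_rnn(text_emb)     # (1, L, 2*hidden) text encoding
        return v, s
```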


Application Information

Patent Type & Authority Patents(China)
IPC IPC(8): G06F16/783G06V10/80G06V10/82G06V10/771G06K9/62G06N3/08
CPCG06F16/7844G06F16/783G06N3/08G06F18/2135G06F18/253
Inventor: 张宏达, 胡若云, 沈然, 叶上维, 丁麒, 王庆娟, 陈金威, 熊剑峰, 丁莹, 赵洲, 陈哲乾, 李一夫, 丁丹翔, 姜伟昊
Owner: STATE GRID ZHEJIANG ELECTRIC POWER