
Method for performing multi-mode video question answering by using frame-subtitle self-supervision

A multi-modal, frame-subtitle self-supervision technology applied in the field of video question answering. It addresses the problems of costly time-boundary annotation and of ignoring the correspondence between video frames and subtitles.

Active Publication Date: 2021-05-28
STATE GRID ZHEJIANG ELECTRIC POWER +3

AI Technical Summary

Problems solved by technology

[0004] This scheme requires time annotations to train the decoder and improve performance, but annotating time boundaries is subjective and expensive.
In addition, the above method treats video frames and subtitles separately, ignoring the correspondence between frames and subtitles.


Embodiment Construction

[0071] The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or its uses. All other embodiments obtained by persons of ordinary skill in the art, based on the embodiments of the present invention and without creative effort, fall within the protection scope of the present invention.

[0072] An embodiment of the present invention proposes a method for multimodal video question answering using frame-subtitle self-supervision which, as shown in Figure 1, includes the following steps:

[0073] S1: For the input video, question-answer text, and subtitle text, extract the video frame features, question and answer features, subtitle features, and subtitle suggestion features.
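The remaining steps are not included in this extract, and the application publishes no reference code. As a rough illustration of step S1 only, the sketch below extracts per-frame features with a CNN backbone and encodes the question-answer and subtitle text with a GRU; all module names, dimensions, and the toy vocabulary are assumptions made for illustration, not details disclosed by the patent.

```python
# Illustrative sketch of step S1 (not the patent's implementation).
# Assumptions: ResNet-18 as frame encoder, a bidirectional GRU over a toy vocabulary for text.
import torch
import torch.nn as nn
from torchvision import models

class FeatureExtractor(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=256):
        super().__init__()
        cnn = models.resnet18()  # load pretrained weights in practice
        self.frame_encoder = nn.Sequential(*list(cnn.children())[:-1])  # drop classifier head
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.text_encoder = nn.GRU(embed_dim, hidden_dim,
                                   batch_first=True, bidirectional=True)

    def forward(self, frames, qa_tokens, subtitle_tokens):
        # frames: (T, 3, 224, 224) sampled video frames
        frame_feat = self.frame_encoder(frames).flatten(1)            # (T, 512)
        qa_feat, _ = self.text_encoder(self.embed(qa_tokens))         # (1, Lq, 2*hidden_dim)
        sub_feat, _ = self.text_encoder(self.embed(subtitle_tokens))  # (S, Ls, 2*hidden_dim)
        return frame_feat, qa_feat, sub_feat

extractor = FeatureExtractor()
frame_feat, qa_feat, sub_feat = extractor(
    torch.randn(8, 3, 224, 224),       # 8 sampled frames
    torch.randint(0, 10000, (1, 20)),  # tokenized question + candidate answer
    torch.randint(0, 10000, (4, 15)),  # 4 tokenized subtitle lines
)
print(frame_feat.shape, qa_feat.shape, sub_feat.shape)
```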



Abstract

The invention belongs to the field of video question answering, and in particular relates to a method for performing multi-modal video question answering by using frame-subtitle self-supervision. The method includes the following steps: extracting video frame features, question and answer features, subtitle features and subtitle suggestion features; computing frame features with attention and subtitle features with attention, and combining them into fusion features; computing a time attention score from the fusion features; deriving the time boundary of the question from the time attention score; obtaining answers to the question from the fusion features and the time attention scores; training a neural network with the derived time boundary and the answer to the question; and optimizing the network parameters, then performing video question answering and delimiting the time boundary with the optimized network. The time boundary related to the question is generated from the purpose-designed time attention score instead of relying on costly time annotations. In addition, more accurate answers are obtained by mining the relation between the subtitles and the corresponding video content.
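No reference implementation accompanies the application, so the following is only a minimal sketch of the abstract's core idea: a per-timestep time attention score is computed from the fusion features, the question's time boundary is obtained by thresholding that score rather than from manual time annotation, and the answer is classified from the attention-pooled fusion feature. The module names, dimensions, and the specific thresholding rule are assumptions.

```python
# Illustrative sketch of the time-attention / time-boundary idea from the abstract.
# Dimensions, module names, and the thresholding rule are assumptions, not the patent's.
import torch
import torch.nn as nn

class TimeAttentionQA(nn.Module):
    def __init__(self, fuse_dim=512, num_answers=5):
        super().__init__()
        self.score = nn.Linear(fuse_dim, 1)            # per-timestep time attention score
        self.classifier = nn.Linear(fuse_dim, num_answers)

    def forward(self, fusion, threshold=0.5):
        # fusion: (T, fuse_dim) fused frame-subtitle-question features per timestep
        att = torch.softmax(self.score(fusion).squeeze(-1), dim=0)   # (T,)
        # Time boundary: timesteps whose normalized attention exceeds a threshold.
        norm = att / att.max()
        keep = (norm >= threshold).nonzero(as_tuple=True)[0]
        start, end = keep.min().item(), keep.max().item()
        # Answer: classify from the attention-weighted pooled fusion feature.
        pooled = (att.unsqueeze(-1) * fusion).sum(dim=0)             # (fuse_dim,)
        logits = self.classifier(pooled)                             # (num_answers,)
        return logits, (start, end), att

model = TimeAttentionQA()
fusion = torch.randn(16, 512)            # 16 timesteps of fused features
logits, boundary, att = model(fusion)
print(boundary, logits.argmax().item())
```

Because the boundary is derived from the learned attention score rather than from labels, the network can be trained with question-answer supervision and the frame-subtitle correspondence alone, which matches the abstract's claim of avoiding costly time annotation.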

Description

Technical field

[0001] The invention belongs to the field of video question answering, and in particular relates to a method for multimodal video question answering using frame-subtitle self-supervision.

Background technique

[0002] The multimodal video question answering task is a challenging task that currently attracts much attention. It spans the two fields of computer vision and natural language processing: the system is required to give the answer to a question about a given video and to delineate the time boundary in the video that corresponds to the question. At present, video question answering is still a relatively new task, and research on it is still immature.

[0003] Existing multi-modal video question answering methods generally use a convolutional neural network to encode the video, and a recurrent neural network to encode the question-answer text and the subtitles in the video. The question and answer encoding, subtitle encoding and video...
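For contrast with the proposed method, the baseline pipeline sketched in this background section can be pictured roughly as below: each modality is encoded and pooled on its own and only then fused, which is where the frame-subtitle correspondence criticized in paragraph [0004] is lost. The code is illustrative only; none of the names or dimensions come from a cited system.

```python
# Rough sketch of the background's baseline: separate encoders, then late fusion
# of independently pooled features (illustrative; not taken from any cited system).
import torch
import torch.nn as nn

class LateFusionBaseline(nn.Module):
    def __init__(self, video_dim=512, text_dim=512, num_answers=5):
        super().__init__()
        self.fuse = nn.Linear(video_dim + 2 * text_dim, 512)
        self.classifier = nn.Linear(512, num_answers)

    def forward(self, frame_feat, sub_feat, qa_feat):
        # Each modality is pooled independently, so the per-frame / per-subtitle
        # alignment is discarded before fusion.
        v = frame_feat.mean(dim=0)   # (video_dim,)
        s = sub_feat.mean(dim=0)     # (text_dim,)
        q = qa_feat.mean(dim=0)      # (text_dim,)
        fused = torch.relu(self.fuse(torch.cat([v, s, q], dim=-1)))
        return self.classifier(fused)  # answer logits

baseline = LateFusionBaseline()
logits = baseline(torch.randn(16, 512), torch.randn(4, 512), torch.randn(20, 512))
print(logits.shape)
```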


Application Information

Patent Type & Authority: Application (China)
IPC(8): G06F16/783; G06K9/62; G06N3/08
CPC: G06F16/7844; G06F16/783; G06N3/08; G06F18/2135; G06F18/253
Inventor: 张宏达胡若云沈然叶上维丁麒王庆娟陈金威熊剑峰丁莹赵洲陈哲乾李一夫丁丹翔姜伟昊
Owner: STATE GRID ZHEJIANG ELECTRIC POWER