Method and system for improving video question-answering precision based on multi-modal fusion model

A multi-modal fusion model technology in the field of natural language processing and deep learning, addressing problems such as large answer errors, incomplete extraction of visual information, and difficulty in meeting the accuracy requirements of video question answering, with the effect of improving test accuracy.

Active Publication Date: 2021-03-26
SHANDONG NORMAL UNIV


Problems solved by technology

[0006] However, in real life, the questions people ask about pictures are often related to the target entities in those pictures, yet the information extracted by current video question answering cannot realize the extraction of visual...



Examples


Embodiment 1

[0045] As shown in Figure 1, the present disclosure provides a method for improving video question-answering accuracy based on a multi-modal fusion model, including:

[0046] Collect video data and question features, and obtain the video question-answering questions;

[0047] Extract visual features and subtitle features from the video data;

[0048] Fuse the visual features and the subtitle features to obtain fused visual features and fused subtitle features;

[0049] Input the fused visual features, the fused subtitle features, and the question features into the multi-modal fusion model for training, obtaining a trained multi-modal fusion model;

[0050] Input the video question-answering questions into the trained multi-modal fusion model, obtain the answers to the questions, and predict the probability of each answer being the correct answer.
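
The five steps above can be pictured as one small neural pipeline. The following is a minimal PyTorch sketch under stated assumptions: LSTM encoders, a gated fusion of visual and subtitle features, and a five-way softmax over candidate answers (TVQA-style). These concrete choices are illustrative assumptions, not the architecture disclosed in the patent.

```python
import torch
import torch.nn as nn

class VideoQAPipeline(nn.Module):
    def __init__(self, vis_dim=2048, txt_dim=300, hid=256, n_answers=5):
        super().__init__()
        # Step [0047]: encode per-frame visual features (e.g. from a
        # pretrained CNN) and subtitle/question token embeddings.
        self.vis_enc = nn.LSTM(vis_dim, hid, batch_first=True)
        self.sub_enc = nn.LSTM(txt_dim, hid, batch_first=True)
        self.q_enc = nn.LSTM(txt_dim, hid, batch_first=True)
        # Step [0048]: gated fusion of visual and subtitle features
        # (an assumed fusion; the patent excerpt does not specify one).
        self.gate = nn.Linear(2 * hid, hid)
        # Step [0050]: score the candidate answers.
        self.scorer = nn.Linear(3 * hid, n_answers)

    def forward(self, vis_feats, sub_embs, q_embs):
        # vis_feats: (B, T, vis_dim); sub_embs/q_embs: (B, L, txt_dim)
        _, (v, _) = self.vis_enc(vis_feats)
        _, (s, _) = self.sub_enc(sub_embs)
        _, (q, _) = self.q_enc(q_embs)
        v, s, q = v[0], s[0], q[0]            # final hidden states, (B, hid)
        g = torch.sigmoid(self.gate(torch.cat([v, s], dim=-1)))
        fused_vis = g * v                     # fused visual features
        fused_sub = (1 - g) * s               # fused subtitle features
        logits = self.scorer(torch.cat([fused_vis, fused_sub, q], dim=-1))
        # Probability of each candidate being the correct answer.
        return logits.softmax(dim=-1)

model = VideoQAPipeline()
probs = model(torch.randn(2, 30, 2048),   # 30 frames of visual features
              torch.randn(2, 50, 300),    # 50 subtitle-token embeddings
              torch.randn(2, 12, 300))    # 12 question-token embeddings
# probs: (2, 5), one probability per candidate answer
```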

[0051] Further, the collection of video data and question features, and the acquisition of the video question-answering questio...

Embodiment 2

[0091] As shown in Figure 2, the framework of this disclosure aims to select the correct answer in video question answering.

[0092] The TVQA dataset is a benchmark for video question answering. It contains 152,545 human-annotated multiple-choice question-answer pairs (84,768 what, 13,644 how, 17,777 where, 15,798 why, and 17,654 who questions) drawn from 21.8K video clips of six TV shows (The Big Bang Theory, Castle, How I Met Your Mother, Grey's Anatomy, House M.D., and Friends). Each question in the TVQA dataset has five candidate answers, only one of which is correct. The format of the test questions in the dataset is designed as follows:

[0093]"[What / How / Where / Why / who]___[when / before / after / …]___", both parts of the question require visual and verbal understanding. There are a total of 122,039 QAs in the training set, 15,253 QAs in the validation set, and 7,623 QAs in the test set.

[0094] Evaluations for this disclosure were performed on a computer equipped with an I...

Embodiment 3

[0101] A system for improving the accuracy of video question answering based on a multi-modal fusion model, including:

[0102] The data collection module is configured to: collect video data and question features, and obtain the video question-answering questions;

[0103] The data processing module is configured to: extract visual features and subtitle features from the video data;

[0104] The feature fusion module is configured to: fuse the visual features and the subtitle features to obtain fused visual features and fused subtitle features;

[0105] The model training module is configured to: input the fused visual features, the fused subtitle features, and the question features into the multi-modal fusion model for training, obtaining a trained multi-modal fusion model;

[0106] The output module is configured to: input the question of the video question answering into the trained multi-modal fusion model, and use the multi-head self-attention mechanism to obta...
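
Paragraph [0106] mentions a multi-head self-attention mechanism in the output module. The sketch below shows one plausible way to apply PyTorch's nn.MultiheadAttention over the fused visual, fused subtitle, and question features; treating the three features as a three-token sequence is an assumption for illustration, not the disclosed design.

```python
import torch
import torch.nn as nn

hid, n_heads, n_answers = 256, 4, 5
attn = nn.MultiheadAttention(embed_dim=hid, num_heads=n_heads,
                             batch_first=True)
scorer = nn.Linear(hid, n_answers)

# Placeholder feature vectors, shape (B, 1, hid) each.
fused_vis = torch.randn(8, 1, hid)
fused_sub = torch.randn(8, 1, hid)
question  = torch.randn(8, 1, hid)

# Stack the three features as a short sequence and let every feature
# attend to the others with multi-head self-attention.
seq = torch.cat([fused_vis, fused_sub, question], dim=1)  # (B, 3, hid)
ctx, _ = attn(seq, seq, seq)
probs = scorer(ctx.mean(dim=1)).softmax(dim=-1)  # (B, 5) answer probabilities
```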



Abstract

The invention provides a method and system for improving video question-answering precision based on a multi-modal fusion model. The method comprises the steps of: collecting video data and question features, and obtaining a video question-answering question; extracting visual features and subtitle features from the video data; performing fusion processing on the visual features and the subtitle features to obtain fused visual features and fused subtitle features; inputting the fused visual features, the fused subtitle features, and the question features into a multi-modal fusion model for training to obtain a trained multi-modal fusion model; and inputting the questions of the video question answering into the trained multi-modal fusion model to obtain answers to the questions. Different target entity instances are focused on for different questions according to the characteristics of the questions, so that the accuracy of the model's answer selection is improved.
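
The abstract's claim that different target entity instances are focused on for different questions corresponds to a question-conditioned attention over entity features. A minimal sketch, assuming simple dot-product attention (the excerpt does not specify the exact mechanism):

```python
import torch
import torch.nn.functional as F

def attend_entities(entity_feats, question_feat):
    """entity_feats: (N, d) features of N detected entity instances;
    question_feat: (d,) encoded question. Returns a question-conditioned
    summary of the entities."""
    scores = entity_feats @ question_feat   # (N,) relevance per instance
    weights = F.softmax(scores, dim=0)      # which instances to focus on
    return weights @ entity_feats           # (d,) weighted entity summary

entities = torch.randn(12, 256)   # e.g. 12 detected entity instances
question = torch.randn(256)
summary = attend_entities(entities, question)
```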

Description

technical field

[0001] The disclosure belongs to the technical field of natural language processing and deep learning, and relates to a method and system for improving the accuracy of video question answering based on a multi-modal fusion model.

Background technique

[0002] The statements in this section merely provide background information related to the present disclosure and do not necessarily constitute prior art.

[0003] In recent years, research on video question answering (Video-QA) based on visual and linguistic content has benefited greatly from deep neural networks. The task is the inference process of selecting the correct answer from a set of candidate answers for a video. Much as babies learn to speak, machine understanding of images and videos is transitioning from labeling images with a few words to generating complete sentences. Unlike traditional image-captioning tasks, multimodal video questio...


Application Information

IPC(8): G06F16/332 G06K9/00 G06K9/62 G06F40/289 G06N3/04
CPC: G06F16/3329 G06F40/289 G06V20/40 G06V30/10 G06N3/045 G06F18/253
Inventors: 徐卫志, 蔡晓雅, 曹洋, 于惠, 庄须强, 刘志远, 孙中志, 赵晗, 龙开放
Owner: SHANDONG NORMAL UNIV