Method and system for improving video question-answering precision based on multi-modal fusion model

A multi-modal fusion model technology in the field of natural language processing and deep learning, addressing problems such as large answer errors, incomplete extraction of visual information, and difficulty in meeting the accuracy requirements of video question answering, with the effect of improving test accuracy.

Active Publication Date: 2021-03-26
SHANDONG NORMAL UNIV


Problems solved by technology

[0006] However, in real life, the questions people ask about pictures are often related to the target entities in those pictures, yet the information extracted by current video question answering cannot realize the extraction of visual...



Examples


Embodiment 1

[0045] As shown in Figure 1, the present disclosure provides a method for improving video question-answering accuracy based on a multi-modal fusion model, including:

[0046] Collect video data and question features, and obtain the video question-answering questions;

[0047] Extract visual features and subtitle features from the video data;

[0048] Fuse the visual features and the subtitle features to obtain fused visual features and fused subtitle features;

[0049] Input the fused visual features, the fused subtitle features, and the question features into the multi-modal fusion model for training, obtaining a trained multi-modal fusion model;

[0050] Input the video question-answering questions into the trained multi-modal fusion model, obtain the answers to the questions, and predict the probability of each answer being the correct answer.
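
The five steps above can be pictured as one small neural pipeline. The following is a minimal PyTorch sketch under stated assumptions: LSTM encoders, a gated fusion of visual and subtitle features, and a five-way softmax over candidate answers (TVQA-style). These concrete choices are illustrative assumptions, not the architecture disclosed in the patent.

```python
import torch
import torch.nn as nn

class VideoQAPipeline(nn.Module):
    def __init__(self, vis_dim=2048, txt_dim=300, hid=256, n_answers=5):
        super().__init__()
        # Step [0047]: encode per-frame visual features (e.g. from a
        # pretrained CNN) and subtitle/question token embeddings.
        self.vis_enc = nn.LSTM(vis_dim, hid, batch_first=True)
        self.sub_enc = nn.LSTM(txt_dim, hid, batch_first=True)
        self.q_enc = nn.LSTM(txt_dim, hid, batch_first=True)
        # Step [0048]: gated fusion of visual and subtitle features
        # (an assumed fusion; the patent excerpt does not specify one).
        self.gate = nn.Linear(2 * hid, hid)
        # Step [0050]: score the candidate answers.
        self.scorer = nn.Linear(3 * hid, n_answers)

    def forward(self, vis_feats, sub_embs, q_embs):
        # vis_feats: (B, T, vis_dim); sub_embs/q_embs: (B, L, txt_dim)
        _, (v, _) = self.vis_enc(vis_feats)
        _, (s, _) = self.sub_enc(sub_embs)
        _, (q, _) = self.q_enc(q_embs)
        v, s, q = v[0], s[0], q[0]            # final hidden states, (B, hid)
        g = torch.sigmoid(self.gate(torch.cat([v, s], dim=-1)))
        fused_vis = g * v                     # fused visual features
        fused_sub = (1 - g) * s               # fused subtitle features
        logits = self.scorer(torch.cat([fused_vis, fused_sub, q], dim=-1))
        # Probability of each candidate being the correct answer.
        return logits.softmax(dim=-1)

model = VideoQAPipeline()
probs = model(torch.randn(2, 30, 2048),   # 30 frames of visual features
              torch.randn(2, 50, 300),    # 50 subtitle-token embeddings
              torch.randn(2, 12, 300))    # 12 question-token embeddings
# probs: (2, 5), one probability per candidate answer
```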

[0051] Further, the collection of video data and question features, and the acquisition of the video question-answering questio...

Embodiment 2

[0091] As shown in Figure 2, the framework of this disclosure aims to select the correct answer in video question answering.

[0092] The TVQA dataset is a benchmark for video question answering. It contains 152,545 human-annotated multiple-choice question-answer pairs (84,768 what, 13,644 how, 17,777 where, 15,798 why, and 17,654 who questions) drawn from 21.8K video clips of six TV shows (The Big Bang Theory, Castle, How I Met Your Mother, Grey's Anatomy, House M.D., and Friends). Each question in the TVQA dataset has five candidate answers, only one of which is correct. The format of the test questions in the dataset is designed as follows:

[0093]"[What / How / Where / Why / who]___[when / before / after / …]___", both parts of the question require visual and verbal understanding. There are a total of 122,039 QAs in the training set, 15,253 QAs in the validation set, and 7,623 QAs in the test set.

[0094] Evaluations for this disclosure were performed on a computer equipped with an I...

Embodiment 3

[0101] A system for improving the accuracy of video question answering based on a multi-modal fusion model, including:

[0102] The data collection module is configured to: collect video data and question features, and obtain the video question-answering questions;

[0103] The data processing module is configured to: extract visual features and subtitle features from the video data;

[0104] The feature fusion module is configured to: fuse the visual features and the subtitle features to obtain fused visual features and fused subtitle features;

[0105] The model training module is configured to: input the fused visual features, the fused subtitle features, and the question features into the multi-modal fusion model for training, obtaining a trained multi-modal fusion model;

[0106] The output module is configured to: input the question of the video question answering into the trained multi-modal fusion model, and use the multi-head self-attention mechanism to obta...
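
Paragraph [0106] mentions a multi-head self-attention mechanism in the output module. The sketch below shows one plausible way to apply PyTorch's nn.MultiheadAttention over the fused visual, fused subtitle, and question features; treating the three features as a three-token sequence is an assumption for illustration, not the disclosed design.

```python
import torch
import torch.nn as nn

hid, n_heads, n_answers = 256, 4, 5
attn = nn.MultiheadAttention(embed_dim=hid, num_heads=n_heads,
                             batch_first=True)
scorer = nn.Linear(hid, n_answers)

# Placeholder feature vectors, shape (B, 1, hid) each.
fused_vis = torch.randn(8, 1, hid)
fused_sub = torch.randn(8, 1, hid)
question  = torch.randn(8, 1, hid)

# Stack the three features as a short sequence and let every feature
# attend to the others with multi-head self-attention.
seq = torch.cat([fused_vis, fused_sub, question], dim=1)  # (B, 3, hid)
ctx, _ = attn(seq, seq, seq)
probs = scorer(ctx.mean(dim=1)).softmax(dim=-1)  # (B, 5) answer probabilities
```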



Abstract

The invention provides a method and system for improving video question-answering precision based on a multi-modal fusion model. The method comprises the steps of: collecting video data and question features, and obtaining a video question-answering question; extracting visual features and subtitle features from the video data; performing fusion processing on the visual features and the subtitle features to obtain fused visual features and fused subtitle features; inputting the fused visual features, the fused subtitle features, and the question features into a multi-modal fusion model for training to obtain a trained multi-modal fusion model; and inputting the questions of the video question answering into the trained multi-modal fusion model to obtain answers to the questions. Different target entity instances are focused on for different questions according to the characteristics of the questions, so that the accuracy of the model's answer selection is improved.
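
The abstract's claim that different target entity instances are focused on for different questions corresponds to a question-conditioned attention over entity features. A minimal sketch, assuming simple dot-product attention (the excerpt does not specify the exact mechanism):

```python
import torch
import torch.nn.functional as F

def attend_entities(entity_feats, question_feat):
    """entity_feats: (N, d) features of N detected entity instances;
    question_feat: (d,) encoded question. Returns a question-conditioned
    summary of the entities."""
    scores = entity_feats @ question_feat   # (N,) relevance per instance
    weights = F.softmax(scores, dim=0)      # which instances to focus on
    return weights @ entity_feats           # (d,) weighted entity summary

entities = torch.randn(12, 256)   # e.g. 12 detected entity instances
question = torch.randn(256)
summary = attend_entities(entities, question)
```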

Description

technical field

[0001] The disclosure belongs to the technical field of natural language processing and deep learning, and relates to a method and system for improving the accuracy of video question answering based on a multi-modal fusion model.

Background technique

[0002] The statements in this section merely provide background information related to the present disclosure and do not necessarily constitute prior art.

[0003] In recent years, research on video question answering (Video-QA) based on visual and linguistic content has benefited greatly from deep neural networks. The task is the inference process of selecting the correct answer from a set of candidate answers for a video. Much as babies learn to speak, machine understanding of images and videos is transitioning from labeling images with a few words to generating complete sentences. Unlike traditional image-captioning tasks, multimodal video questio...


Application Information

IPC(8): G06F16/332 G06K9/00 G06K9/62 G06F40/289 G06N3/04
CPC: G06F16/3329 G06F40/289 G06V20/40 G06V30/10 G06N3/045 G06F18/253
Inventors: 徐卫志, 蔡晓雅, 曹洋, 于惠, 庄须强, 刘志远, 孙中志, 赵晗, 龙开放
Owner: SHANDONG NORMAL UNIV