Visual question and answer fusion enhancement method based on multi-modal fusion

A multi-modal visual question answering technology, applied in the fields of natural language processing and computer vision. It addresses the problem that answer features are ignored and cannot play their due role, and it improves answer accuracy as well as the accuracy and diversity of picture retrieval.

Active Publication Date: 2019-10-25

HANGZHOU DIANZI UNIV


Problems solved by technology

[0011] Disadvantages: These methods consider almost exclusively the relationship between the question and the image, and ignore the hidden relationships within the image-question-answer triplet. Knowing the specific answer may make it possible to infer the question, so the answer can play a very important role in the reasoning process, yet these methods ignore this information. Some works do take this relationship into account, either by simply concatenating the image, question, and answer feature representations, or by fusing the image and question and then mapping the result to the answer features, but neither approach can fully express the relationships within the triplet.
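The two prior strategies criticized above can be sketched roughly as follows. This is only an illustration: the function names `concat_fusion` and `fuse_then_score` are hypothetical, and the element-wise product used as the image-question fusion is an assumption, since the source does not say which fusion operator those works use.

```python
def concat_fusion(image, question, answer):
    """Strategy 1: simple splicing of the three feature vectors."""
    return list(image) + list(question) + list(answer)


def fuse_then_score(image, question, answer):
    """Strategy 2: fuse image and question, then compare with the answer.

    The element-wise product and the dot-product score are assumptions
    chosen for brevity, not the operators of any specific prior work.
    """
    fused = [i * q for i, q in zip(image, question)]
    return sum(f * a for f, a in zip(fused, answer))


if __name__ == "__main__":
    img, qst, ans = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]
    print(concat_fusion(img, qst, ans))
    print(fuse_then_score(img, qst, ans))
```

In both sketches the answer enters only at the very end, either as extra dimensions or as a comparison target, which is the limitation the paragraph points out.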

[0012] In summary, current visual question answering relies mainly on the fusion of image features and question features, while the answer features do not play their due role, even though they can sometimes greatly improve task accuracy. The complex relationships among image, question, and answer cannot be fully expressed, which leads to the following disadvantages:

[0013] 1. The answer information is not used effectively, so its considerable potential is not realized;

[0014] 2. When question features and image features are fused multi-modally, the attention mechanism cannot be used concisely and effectively to select the content most worthy of attention.

Embodiment Construction

[0022] In order to make the above objects, features and advantages of the present invention more comprehensible, the present invention will be further described in detail below in conjunction with the accompanying drawings and specific embodiments.

[0023] The multi-modal fusion-based visual question answering fusion enhancement method proposed by the present invention, as shown in Figures 1-4, includes the following three steps:

[0024] Step 1. Use a GRU (Gated Recurrent Unit) structure to build a sequence model and learn the feature representation of the question, and use the output of the bottom-up attention model extracted from Faster R-CNN as the feature representation of the image. In the present invention, the words of the sentence are input into the GRU model one by one in order, and the GRU output for the last word can represent the entire sentence.
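The question-encoding half of Step 1 can be sketched as below. This is a minimal illustration, assuming the commonly used GRU formulation with an update gate `z`, a reset gate `r`, and a candidate state `n`; the class `GRUCell`, the helper `encode_question`, the random weight initialization, and the toy dimensions are all hypothetical and not taken from the patent. Word embedding lookup and the Faster R-CNN image features are out of scope here.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def init_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """A single GRU cell with randomly initialized (untrained) weights."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        # One (W, U, b) triple per component: update z, reset r, candidate n.
        self.params = {
            name: (init_matrix(hidden_size, input_size, rng),
                   init_matrix(hidden_size, hidden_size, rng),
                   [0.0] * hidden_size)
            for name in ("z", "r", "n")
        }

    def _affine(self, name, x, h):
        W, U, b = self.params[name]
        return [a + c + d for a, c, d in zip(matvec(W, x), matvec(U, h), b)]

    def step(self, x, h):
        z = [sigmoid(v) for v in self._affine("z", x, h)]   # update gate
        r = [sigmoid(v) for v in self._affine("r", x, h)]   # reset gate
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(v) for v in self._affine("n", x, reset_h)]
        # Blend the previous state with the candidate state.
        return [(1.0 - zi) * hi + zi * ni for zi, hi, ni in zip(z, h, n)]


def encode_question(word_vectors, cell):
    """Feed the words in order; the last hidden state represents the sentence."""
    h = [0.0] * cell.hidden_size
    for x in word_vectors:
        h = cell.step(x, h)
    return h


if __name__ == "__main__":
    cell = GRUCell(input_size=3, hidden_size=4)
    words = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(encode_question(words, cell))
```

Returning only the final hidden state mirrors the statement that the GRU output for the last word stands in for the whole sentence; in a trained system the weights would be learned rather than sampled.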

[0025] As shown in Figure 1, there are two gates in the GRU, one is the...

Abstract

The invention discloses a visual question answering fusion enhancement method based on multi-modal fusion. The method comprises the following steps: 1. constructing a time-series model using a GRU structure to learn the feature representation of the question, and using the output of the bottom-up attention model extracted from Faster R-CNN as the feature representation of the picture; 2. performing multi-modal reasoning based on the Transformer attention model, introducing the attention model to perform multi-modal fusion on the picture-question-answer triplet and establish an inference relation; and 3. providing different reasoning processes and result outputs for different implicit relationships, and performing label distribution regression learning on the result outputs to determine the answer. Because the method obtains answers from specific pictures and questions, it can be applied directly in applications serving the blind, helping blind or visually impaired people better perceive their surroundings; it can also be applied in a picture retrieval system to improve the accuracy and diversity of picture retrieval.
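Steps 2 and 3 of the abstract can be illustrated with a small sketch. It assumes scaled dot-product attention (the attention used in the Transformer) with the question vector as the query over image region features, an element-wise product as the fusion, and a KL divergence against a soft answer distribution as the label distribution regression loss. These choices, and the names `attend`, `answer_distribution`, and `kl_divergence`, are assumptions for illustration; the abstract does not give the exact fusion or loss.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, regions):
    """Scaled dot-product attention of one query over region features."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, r) / scale for r in regions])
    dim = len(regions[0])
    attended = [sum(w * r[i] for w, r in zip(weights, regions))
                for i in range(dim)]
    return attended, weights


def answer_distribution(question, regions, answers):
    """Fuse question and attended image, then score each candidate answer."""
    attended, _ = attend(question, regions)
    # Element-wise product as the fusion step (an assumption).
    fused = [q * v for q, v in zip(question, attended)]
    return softmax([dot(fused, a) for a in answers])


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted), used here as a label distribution loss."""
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, predicted) if t > 0)


if __name__ == "__main__":
    question = [1.0, 0.0]
    regions = [[1.0, 0.0], [0.0, 1.0]]      # two image regions
    answers = [[1.0, 0.0], [0.0, 1.0]]      # two candidate answer embeddings
    predicted = answer_distribution(question, regions, answers)
    soft_labels = [0.9, 0.1]                # hypothetical annotator agreement
    print(predicted, kl_divergence(soft_labels, predicted))
```

Training against a soft distribution rather than a single correct label lets several plausible answers share probability mass, which is one common reading of "label distribution regression".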

Description

Technical field

[0001] The invention belongs to the technical fields of computer vision and natural language processing. In particular, the invention relates to a visual question answering fusion enhancement method based on multi-modal fusion.

Background

[0002] Visual Question Answering (VQA) is a task that combines the fields of computer vision and natural language processing. Its goal is to reason out the answer to a specific question asked about a specific picture. VQA has many potential application scenarios. The most direct are those that help blind and visually impaired users: a VQA system can interpret the surrounding environment for them and, through interactive programs, let them perceive both the Internet and real-life scenes. Another obvious application is to integrate VQA into an image retrieval system so that image retrieval can be driven by natural language, which has a large social and commercial impact.

[0003] The VQA task mainly solves the...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F16/332, G06K9/62

CPC: G06F16/3329, G06F18/253

Inventor: 颜成钢, 俞灵慧, 孙垚棋, 张继勇, 张勇东

Owner: HANGZHOU DIANZI UNIV
