
Visual question and answer fusion enhancement method based on multi-modal fusion

A multi-modal and visual technology, applied in the fields of natural language processing and computer vision, that addresses the problem that answer features are ignored in existing methods and the large amount of information they carry goes unused, and achieves improved accuracy and diversity.

Active Publication Date: 2019-10-25
HANGZHOU DIANZI UNIV

AI Technical Summary

Problems solved by technology

[0011] Disadvantages: these methods almost exclusively consider the relationship between the question and the image, ignoring the hidden relationship within the image-question-answer triplet. Intuitively, when the specific answer is known, one may also be able to infer the question, so the answer can play a very important role in the reasoning process; yet these methods discard that information. Some works do take this relationship into account and try to represent it, either by simply concatenating the image feature representation, the question feature representation, and the answer feature representation, or by fusing the image and question and then mapping the result onto the answer features, but both approaches struggle to fully express the relationship within the triplet.
[0012] Combining the above technologies, it is not difficult to see that current visual question answering relies mainly on fusing image features with question features; the answer features do not play their due role, even though they can sometimes greatly improve task accuracy, and the complex relationship among image, question, and answer cannot be fully expressed. This leads to the following disadvantages:
[0013] 1. The answer information is not used effectively, so the considerable role it could play is never realized;
[0014] 2. When performing multi-modal fusion of question features and image features, the attention mechanism is not applied concisely and effectively to pick out the content most worthy of attention.


Embodiment Construction

[0022] In order to make the above objects, features and advantages of the present invention more comprehensible, the present invention will be further described in detail below in conjunction with the accompanying drawings and specific embodiments.

[0023] The multi-modal fusion-based visual question-and-answer fusion enhancement method proposed by the present invention, as shown in Figures 1-4, includes the following three steps:

[0024] Step 1. Use the GRU (Gated Recurrent Unit) structure to construct a time-series model and learn the feature representation of the question, and use the output of the bottom-up attention model extracted from Faster R-CNN as the feature representation of the image. In the present invention, each word in the sentence is fed into the GRU model in order, and the GRU output at the last word of the sentence represents the entire sentence.
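A minimal sketch of Step 1 in PyTorch follows. The patent names GRU for the question and Faster R-CNN bottom-up attention features for the image, but all class names, dimensions, and the 36-region convention below are assumptions for illustration, not taken from the patent.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Encode a question by feeding its words through a GRU in order."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, tokens):
        # tokens: (batch, seq_len) word indices, in sentence order
        emb = self.embed(tokens)        # (batch, seq_len, embed_dim)
        _, h_last = self.gru(emb)       # h_last: (1, batch, hidden_dim)
        # The hidden state after the last word stands in for the whole sentence
        return h_last.squeeze(0)        # (batch, hidden_dim)

# Image side: precomputed bottom-up attention region features (assumed shape)
image_feats = torch.randn(8, 36, 2048)  # (batch, regions, feat_dim)
q_feat = QuestionEncoder(vocab_size=10000)(torch.randint(0, 10000, (8, 14)))
```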

[0025] As shown in Figure 1, there are two gates in the GRU, one is the...
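For reference, the two gates of a standard GRU are conventionally the update gate and the reset gate; the equations below are the textbook formulation, not quoted from the patent text:

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) && \text{(update gate)}\\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) && \text{(reset gate)}\\
\tilde{h}_t &= \tanh\!\left(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\right)\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```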


Abstract

The invention discloses a visual question-and-answer fusion enhancement method based on multi-modal fusion. The method comprises the following steps: 1, constructing a time-series model using a GRU structure to obtain the feature representation of the question, and using the output of the bottom-up attention model extracted from Faster R-CNN as the feature representation of the image; 2, performing multi-modal reasoning based on the Transformer attention model, introducing the attention model to perform multi-modal fusion on the image-question-answer triplet and establish an inference relation; and 3, providing different reasoning processes and result outputs for different implicit relationships, and performing label distribution regression learning on the result outputs to determine the answer. The method derives answers from specific pictures and questions and can be applied directly to applications serving the blind, helping blind or visually impaired people better perceive their surroundings; it can also be applied to a picture retrieval system, improving the accuracy and diversity of picture retrieval.
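Step 2 of the abstract describes Transformer-style attention fusing the image-question-answer triplet. A hedged sketch of that idea using a standard multi-head self-attention layer follows; the patent's exact architecture is not given in this excerpt, so every layer choice here is an assumption, and the projection of all three modalities to a shared width is implied rather than specified.

```python
import torch
import torch.nn as nn

class TripletFusion(nn.Module):
    """Assumed sketch: self-attention over [question, answer, image regions]."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, img, q, a):
        # img: (batch, regions, dim); q, a: (batch, dim), all already
        # projected to a common width by upstream encoders (assumption)
        seq = torch.cat([q.unsqueeze(1), a.unsqueeze(1), img], dim=1)
        fused, _ = self.attn(seq, seq, seq)   # attention across the triplet
        return self.score(fused.mean(dim=1))  # one relevance score per answer
```

Step 3's label distribution regression could then be trained by regressing such scores, computed over all candidate answers, toward a soft ground-truth answer distribution (e.g. with a KL-divergence loss); that loss choice is likewise an assumption here.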

Description

Technical field
[0001] The invention belongs to the technical fields of computer vision and natural language processing. In particular, the invention relates to a multi-modal fusion-based visual question-answer fusion enhancement method.
Background technique
[0002] Visual Question Answering (VQA for short) is a task combining the fields of computer vision and natural language processing: given a specific picture and a specific question about it, the task is to reason out the answer. VQA has many potential application scenarios. The most direct are those that help blind and visually impaired users: it can interpret the surrounding environment for them and, through interactive programs, let them perceive the Internet and real-life scenes. Another obvious application is to integrate VQA into an image retrieval system, so that image retrieval can be driven by natural language, which has great social and commercial impact.
[0003] The VQA task mainly solves the...


Application Information

Patent Type & Authority: Application (China)
Active Publication Date: 2019-10-25
IPC(8): G06F16/332, G06K9/62
CPC: G06F16/3329, G06F18/253
Inventors: 颜成钢, 俞灵慧, 孙垚棋, 张继勇, 张勇东
Owner: HANGZHOU DIANZI UNIV