Visual question and answer fusion enhancement method based on multi-modal fusion

A multi-modal fusion technique, applied in the fields of natural language processing and computer vision, addresses the problem that answer features are ignored and answer information therefore cannot play its full role, and it improves the accuracy and diversity of results.

CN110377710A | Active | Publication Date: 2019-10-25 | HANGZHOU DIANZI UNIV

Embodiment Construction

[0022] In order to make the above objects, features and advantages of the present invention more comprehensible, the present invention will be further described in detail below in conjunction with the accompanying drawings and specific embodiments.

[0023] The multi-modal fusion-based visual question-and-answer fusion enhancement method proposed by the present invention, as shown in Figures 1-4, includes the following three steps:

[0024] Step 1. Use the GRU (Gated Recurrent Unit) structure to construct a time series model and learn a feature representation of the question, and use the output of the bottom-up attention model extracted from Faster R-CNN as the feature representation of the image. In the present invention, the words of the sentence are input into the GRU model one by one in order, and the GRU output for the last word of the sentence is taken to represent the entire sentence.
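The question-encoding part of Step 1 can be illustrated with a minimal sketch. It assumes the standard GRU formulation (update gate, reset gate, candidate state) with biases omitted and small random weights; the class name `TinyGRU`, the dimensions, and the toy word vectors are hypothetical and are not taken from the patent. It shows only the idea stated above: words are fed in order, and the hidden state after the last word serves as the sentence representation.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    # Plain matrix-vector product on nested lists.
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def init_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TinyGRU:
    """Hypothetical minimal GRU cell (standard formulation, no biases)."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        # One input matrix W and one recurrent matrix U per component:
        # "z" = update gate, "r" = reset gate, "n" = candidate state.
        self.W = {g: init_matrix(hidden_dim, input_dim, rng) for g in "zrn"}
        self.U = {g: init_matrix(hidden_dim, hidden_dim, rng) for g in "zrn"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in
             zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in
             zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in
             zip(matvec(self.W["n"], x), matvec(self.U["n"], reset_h))]
        # Blend old state and candidate; some libraries swap the roles
        # of z and (1 - z), which is an equivalent convention.
        return [(1 - zi) * hi + zi * ni for zi, hi, ni in zip(z, h, n)]

    def encode(self, word_vectors):
        # Feed words in order; the state after the last word
        # is used as the representation of the whole sentence.
        h = [0.0] * self.hidden_dim
        for x in word_vectors:
            h = self.step(x, h)
        return h


# Toy 3-dimensional embeddings for a three-word question (made-up values).
question = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, -0.2]]
gru = TinyGRU(input_dim=3, hidden_dim=4)
question_feature = gru.encode(question)
```

In practice the weights would be trained jointly with the rest of the model, and the image side would supply region features from a detector such as Faster R-CNN; neither is modelled here.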

[0025] As shown in Figure 1, there are two gates in the GRU, one is the...

Abstract

The invention discloses a visual question and answer fusion enhancement method based on multi-modal fusion. The method comprises the following steps: 1, constructing a time sequence model by utilizing a GRU structure to learn a feature representation of the question, and using the output of the bottom-up attention model extracted from Faster R-CNN as the feature representation of the picture; 2, performing multi-modal reasoning based on the attention model Transformer, introducing the attention model to perform multi-modal fusion on the picture-question-answer triple, and establishing an inference relation; and 3, providing different reasoning processes and result outputs for different implicit relationships, and performing label distribution regression learning on the result outputs to determine answers. The method obtains answers from specific pictures and questions and can be applied directly to applications serving the blind, helping blind or visually impaired people to better perceive the surrounding environment. It can also be applied to a picture retrieval system, improving the accuracy and diversity of picture retrieval.
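Steps 2 and 3 can be sketched in a simplified form. This excerpt does not give the patent's exact formulation, so the sketch below makes two assumptions: fusion is shown as a single scaled dot-product attention (the core operation of the Transformer) of one question vector over image region features, and label distribution regression is shown as a KL divergence between a soft target distribution over candidate answers and the predicted distribution, which is one common choice. All function names and numeric values are hypothetical.

```python
import math


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, regions):
    """Scaled dot-product attention of one query vector over region vectors.

    Returns the attention weights and the weighted sum of the regions.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * v for q, v in zip(query, region)) / scale
              for region in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    fused = [sum(w * region[i] for w, region in zip(weights, regions))
             for i in range(dim)]
    return weights, fused


def kl_divergence(target, predicted, eps=1e-12):
    # KL(target || predicted); terms with zero target mass contribute nothing.
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, predicted) if t > 0)


# Made-up question vector and three image-region vectors.
query = [1.0, 0.0]
regions = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
weights, fused = attend(query, regions)

# Made-up soft answer distribution and model scores over three answers.
target = [0.7, 0.3, 0.0]
predicted = softmax([2.0, 1.0, 0.1])
loss = kl_divergence(target, predicted)
```

The region most aligned with the question receives the largest weight, and the loss reaches zero only when the predicted distribution matches the target, which is what makes a soft answer distribution usable as a regression target.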

Description

Technical field

[0001] The invention belongs to the technical fields of computer vision and natural language processing. In particular, the invention relates to a multi-modal fusion-based visual question-and-answer fusion enhancement method.

Background technique

[0002] Visual Question Answering (VQA for short) is a task that combines the fields of computer vision and natural language processing. Its goal is to take a specific question about a specific picture and reason out the answer. VQA has many potential application scenarios. The most direct are applications that help blind and visually impaired users: through interactive programs, VQA can interpret the surrounding environment for these users and help them perceive both the Internet and real-life scenes. Another obvious application is to integrate VQA into an image retrieval system, so that image retrieval can be driven by natural language, which has a significant social and commercial impact.

[0003] The VQA task mainly solves the...

Application Information

Patent Timeline
25 Oct 2019
Publication
CN110377710A
IPC
G06F16/332; G06K9/62
CPC
G06F16/3329; G06F18/253
Inventors
颜成钢; 俞灵慧