
Visual question and answer method optimized by using position information

A technology combining position information and vision, applied in the field of visual question answering. It addresses problems such as the complexity of real scenes, narrow scope of application, and poor performance, and achieves the effects of improved time efficiency and improved performance.

Pending Publication Date: 2022-07-29
SOUTH CHINA UNIV OF TECH


Problems solved by technology

In recent years, with the growth of the image and natural-language communities, the cross-disciplinary field of visual question answering has produced a variety of methods:
1) Multi-modal fusion methods based on the attention mechanism let the model focus on key objects in the picture and relevant nouns in the question text, but the model lacks a non-selection mechanism, and the normalization inside attention prevents counting information from being introduced into the model.
2) Module-network methods that imitate the human reasoning process parse the grammar of the question, design several types of basic module networks, and route the question to the corresponding module according to the parse result. They offer good interpretability, but they perform poorly on real-image datasets.
3) Graph-learning methods treat objects in the picture as nodes and the relationships between objects as edges to build a graph, but the improvement is limited.
4) Methods that introduce external knowledge target the many questions that require outside knowledge to answer, but how to make the model learn external knowledge effectively remains difficult.
5) Optimizations of the attention mechanism for counting problems alleviate, to some extent, the aforementioned inability of attention to introduce counting information into the model, but their scope of application is narrow.
In addition, one method uses a matching algorithm between the visual-modality scene graph and the language-modality text graph to predict the answer. It resembles the module networks popular in academia a few years ago and has good interpretability, but because real scenes are complex, it performs worse than the end-to-end methods popular today. Moreover, it only performs matching between the visual and language modalities and lacks a step for reasoning about the question.

Method used



Examples


Embodiment 1

[0120] A visual question answering method optimized with position information, as shown in Figure 1, comprises the following steps:

[0121] S1. Collect training data, including pictures and questions related to each given picture, then manually annotate the answers to the questions; this embodiment directly uses the VQA v2.0 dataset.

[0122] S2. Build a question preprocessing module to preprocess the input question and obtain its semantic feature vectors and position feature vectors, as follows:

[0123] S2.1. Compute the semantic feature vector of each word in the input question: initialize each word with a GloVe word embedding, then feed it into a long short-term memory network (LSTM) to obtain the semantic feature vector of each word. Since questions differ in length, zero vectors are used to pad (or the sequence is truncated) so that each question is represented by an N×d1 semantic feature matrix, where N is t...
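The padding/truncation step of S2.1 can be sketched as follows. This is an illustrative numpy version under stated assumptions: the helper name `pad_or_truncate` is hypothetical, and the GloVe lookup and LSTM the patent uses are not reproduced here — only the shaping of variable-length questions into a fixed N×d1 matrix.

```python
import numpy as np

def pad_or_truncate(word_vecs, n_max):
    """Zero-pad or truncate a question's word vectors to a fixed N x d1 matrix,
    so that questions of different lengths share one shape (cf. step S2.1)."""
    d1 = word_vecs.shape[1]
    out = np.zeros((n_max, d1), dtype=np.float32)
    k = min(len(word_vecs), n_max)
    out[:k] = word_vecs[:k]
    return out

# Example: a 3-word question embedded in d1 = 4 dimensions, padded to N = 6.
q = np.random.default_rng(0).normal(size=(3, 4)).astype(np.float32)
Q = pad_or_truncate(q, n_max=6)  # shape (6, 4); rows 3..5 are zero vectors
```

In practice the word vectors would come from GloVe embeddings passed through the LSTM before (or after) this shaping step.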

Embodiment 2

[0228] This embodiment differs from Embodiment 1 in that it uses the VQA v1.0 dataset: b in step S7 is 248349 (the size of the VQA v1.0 training set), and r in step S7 is 2410 (the number of candidate answers for VQA v1.0).

Embodiment 3

[0230] This embodiment differs from Embodiment 1 in that it adopts the COCO-QA dataset: b in step S7 is 78736 (the size of the COCO-QA training set), and r in step S7 is 435 (the number of candidate answers for COCO-QA).
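Across the embodiments, only the dataset-dependent constants change: b (training-set size) and r (number of candidate answers). The answer-prediction step the abstract describes — predict an answer, compare with the ground truth, back-propagate — can be pictured as an r-way classifier. The numpy sketch below is an illustration only; the function names and the single linear scoring layer are assumptions, not the patent's exact architecture.

```python
import numpy as np

def answer_distribution(fused, W, bias):
    """Score a fused multimodal feature vector against r candidate answers
    and normalise the scores into a probability distribution (softmax)."""
    z = fused @ W + bias
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(p, target_idx):
    """Difference between the predicted distribution and the annotated answer,
    i.e. the training loss minimised by back-propagation."""
    return -np.log(p[target_idx] + 1e-12)

# Example with this embodiment's setting: r = 435 candidate answers (COCO-QA).
rng = np.random.default_rng(0)
d, r = 16, 435  # d is a toy fused-feature size, chosen for illustration
p = answer_distribution(rng.normal(size=d), rng.normal(size=(d, r)), np.zeros(r))
loss = cross_entropy(p, target_idx=7)
```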



Abstract

The invention discloses a visual question answering method optimized by using position information. The method comprises the following steps: collect training data, including pictures and questions related to each given picture; preprocess the input question; preprocess the input picture; perform a multi-head position self-attention operation to obtain fused feature vectors for the words in the question; perform a position self-attention operation and fuse the visual and language modalities with a position joint attention mechanism to obtain fused feature vectors for the objects in the picture; compress and then fuse the object fused feature vectors and the word fused feature vectors; construct a visual question answering model, predict answers to the questions, compute the difference between the predicted answers and the ground truth, train the model by back-propagation, and feed data into the trained model to perform visual question answering. The proposed method understands questions better and helps the model grasp sentence semantics.
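As a rough illustration of the "multi-head position self-attention" idea in the abstract, the sketch below adds a pairwise position bias to standard multi-head self-attention logits. This is an assumption about how position information could enter the attention, not the patent's exact formulation; all names and shapes here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def position_self_attention(X, P, Wq, Wk, Wv, n_heads):
    """Multi-head self-attention over features X (N x d), with a pairwise
    position bias P (N x N) added to every head's attention logits."""
    N, d = X.shape
    dh = d // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        s = slice(h * dh, (h + 1) * dh)
        logits = Q[:, s] @ K[:, s].T / np.sqrt(dh) + P  # position bias here
        heads.append(softmax(logits) @ V[:, s])
    return np.concatenate(heads, axis=1)  # back to shape (N, d)

# Example: 5 tokens with d = 8 features and 2 heads.
rng = np.random.default_rng(1)
N, d, H = 5, 8, 2
X, P = rng.normal(size=(N, d)), rng.normal(size=(N, N))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = position_self_attention(X, P, Wq, Wk, Wv, n_heads=H)
```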

Description

Technical field
[0001] The invention relates to the technical field of visual question answering, and in particular to a visual question answering method optimized by using position information.
Background technique
[0002] Visual question answering has been a research hotspot in artificial intelligence in recent years and has attracted extensive attention from scholars. Given a picture and a question about its content, the task requires a machine to correctly understand both the image and the question and to answer appropriately. This demands a fine-grained understanding of visual and linguistic information and the ability to perform cross-modal reasoning like a human: associating objects in the picture with words in the question, understanding logical relationships in the question, and inferring the answer from the picture's content. The research results in this field have been wi...

Claims


Application Information

Patent Type & Authority: Application (China)
IPC (8): G06F40/35; G06K9/62; G06N3/04; G06N3/08
CPC: G06F40/35; G06N3/084; G06N3/045; G06F18/253
Inventor: 毛爱华, 林肯
Owner: SOUTH CHINA UNIV OF TECH