Visual question-answer method based on equal attention graph network
The method applies an attention-based graph network to image visual question answering. It addresses the problems of ignoring the image structure and being unable to effectively lock onto scene targets, and achieves better-supported answers and improved performance.
Examples
Embodiment 1
[0022] A method for visual question answering based on an equal attention graph network, comprising the following steps:
[0023] Step 1. Preprocess the input image I and send it to the feature extraction network to obtain the regional target features, composed of the features of the K regions with the highest confidence;
[0024] As a preferred solution, the feature extraction network used in step 1 is a Faster R-CNN network, the value of K is 36, and each regional target feature is represented by a 2048-dimensional vector.
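The selection in step 1 can be illustrated with a minimal sketch. It assumes the detector has already produced one feature vector and one confidence score per candidate region. The function name `select_top_k_regions` and its arguments are hypothetical and not taken from the patent. Only K = 36 and the 2048-dimensional feature size come from the text.

```python
from typing import List, Sequence, Tuple

K = 36              # number of regions kept, per the preferred solution
FEATURE_DIM = 2048  # per-region feature size, per the preferred solution


def select_top_k_regions(
    features: Sequence[Sequence[float]],
    scores: Sequence[float],
    k: int = K,
) -> Tuple[List[int], List[Sequence[float]]]:
    """Keep the k candidate regions with the highest confidence scores.

    `features` and `scores` are hypothetical detector outputs (assumption):
    one feature vector and one confidence value per candidate region.
    """
    if len(features) != len(scores):
        raise ValueError("features and scores must have the same length")
    # Sort candidate indices by descending confidence; ties keep input order.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = order[:k]
    return kept, [features[i] for i in kept]
```

In practice each kept feature vector would have `FEATURE_DIM` entries. The sketch does not enforce that, so it can be exercised with short dummy vectors.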
[0025] Step 2. To obtain the input feature representation, convert the image I into a graph representation G using the regional target features obtained in step 1, where G includes nodes representing the target objects and relationship edges corresponding to the relationships between objects; apply word embedding and encoding to the input question text Q to obtain the question feature q;
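As a rough illustration of the graph representation G, the sketch below stores one node per regional feature and a list of directed edges. The excerpt does not specify how relationship edges are chosen, so the sketch assumes a fully connected graph without self-loops. That choice, the `RegionGraph` type, and the function name are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class RegionGraph:
    nodes: List[Sequence[float]]  # one feature vector per detected region
    edges: List[Tuple[int, int]]  # directed (source, target) node index pairs


def build_fully_connected_graph(
    region_features: Sequence[Sequence[float]],
) -> RegionGraph:
    """Build a graph with one node per region and an edge for every ordered pair.

    Assumption: every pair of distinct regions is connected. The source text
    only states that edges correspond to relationships between objects.
    """
    n = len(region_features)
    edges = [(i, j) for i in range(n) for j in range(n) if i != j]
    return RegionGraph(nodes=list(region_features), edges=edges)
```

With K = 36 regions this assumption yields 36 × 35 = 1260 directed edges. A sparser rule, for example one based on spatial overlap, would only change how `edges` is filled.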
[0026] As a preferred solution, th...
Embodiment 2
[0034] A method for visual question answering based on an equal attention graph network, comprising the following steps:
[0035] Step 1. Preprocess the input image I, send the image I to the feature extraction network, and obtain the regional target features composed of the features of K regions with the highest confidence. The feature extraction network used here is the Faster R-CNN network, the value of K is 36, and each regional target feature is represented by a 2048-dimensional vector.
[0036] Specifically, the Faster R-CNN network is trained by first initializing the Faster R-CNN model with a ResNet-101 network pre-trained on the ImageNet dataset, and then training the model using the annotation information of the Visual Genome dataset.
[0037] Step 2. In order to obtain the input feature representation, the image I is converted into a graph representation G by using the regional target features obtained in step 1. G is composed of th...