Visual question answering method, visual question answering model training method, and corresponding device, equipment and storage medium

A training method for a visual question answering model, applied in the computer field, addresses problems such as the inability to capture the relationships among multiple text questions and the inability to effectively understand the connection between natural language text and visual content.

Pending Publication Date: 2021-09-14

ALIBABA GRP HLDG LTD

0 Cites, 8 Cited by

Problems solved by technology

[0004] However, because existing VQA systems learn visual content and text questions independently, their learning method cannot capture the relationships among the multiple text questions mentioned above. As a result, they cannot effectively understand the connection between natural language text and visual content, and thus cannot perform efficient reasoning to obtain valid VQA answers.

Examples

Embodiment 1

[0032] Referring to Figure 1A, a flowchart of the steps of a method for training a visual question answering model according to Embodiment 1 of the present invention is shown.

[0033] The training method of the visual question answering model of the present embodiment comprises the following steps:

[0034] Step S102: Receive training samples and input them to the visual question answering model through the input part.

[0035] In this embodiment, the visual question answering model includes an input part, a feature extraction part, an expression learning part and an output part. The input part may be an input layer of the visual question answering model, through which the data to be processed is input to the model. In this embodiment, the input is a training sample, which includes a sample image and multiple text questions corresponding to the sample image.
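As an illustration of the training sample described in [0035], the following is a minimal sketch of one possible in-memory representation. The class name `TrainingSample`, its field names, and the choice to store each image as a flat list of floats are assumptions made for this example only; the patent text does not prescribe any data layout.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class TrainingSample:
    """One training sample: sample image(s) plus the text questions about them.

    Hypothetical container; names and layout are not taken from the patent.
    """
    images: List[Sequence[float]]  # each image as a flat vector (assumption)
    questions: List[str]           # multiple text questions for the image(s)

    def __post_init__(self) -> None:
        # The text requires a sample image and *multiple* text questions.
        if not self.images:
            raise ValueError("a training sample needs at least one sample image")
        if len(self.questions) < 2:
            raise ValueError("a training sample needs multiple text questions")


sample = TrainingSample(
    images=[[0.1, 0.5, 0.9]],
    questions=["What is the person holding?", "What color is it?"],
)
```

Validating the question count at construction time keeps a single-question sample from silently reaching the later parts of the model, which rely on there being several questions per image.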

[0036] It should be noted that, in the embodiments of the pr...

Embodiment 2

[0056] Referring to Figure 2, a flowchart of the steps of a method for training a visual question answering model according to Embodiment 2 of the present invention is shown.

[0057] The training method of the visual question answering model of the present embodiment comprises the following steps:

[0058] Step S202: Receive training samples and input them to the visual question answering model through the input part.

[0059] The training samples include sample images and multiple text questions corresponding to the sample images. As mentioned above, there can be one or more sample images, and each sample image corresponds to multiple text questions.

[0060] Since there may be multiple sample images, in one feasible implementation the sample images are multiple continuous video key frame images, and the multiple text questions are texts converted from the audio corresponding to those video key frame images. Multiple continuous video frames can form a video clip. In a video clip, ...

Embodiment 3

[0085] Referring to Figure 3A, a flowchart of the steps of a method for training a visual question answering model according to Embodiment 3 of the present invention is shown.

[0086] In this embodiment, the sample images are taken to be continuous video key frame images, as an example, to illustrate the training method of the visual question answering model provided by the embodiment of the present invention.

[0087] In this example, the visual question answering model used is shown in Figure 3B. It includes an input part, a feature extraction part, an expression learning part and an output part.

[0088] In Figure 3B, the word vectors of 5 text questions and the corresponding vectors of 16 continuous video key frame images are input to the visual question answering model through the input part.

[0089] For the text questions, the feature extraction part uses a Bi-GRU layer followed by an FC layer; the FC layer is in turn followed by a ReLU layer and a Dropout layer (the following is a brief ...
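The text branch named in [0089] (Bi-GRU, then FC, ReLU and Dropout) can be sketched without any external library as follows. This is an illustrative sketch under several assumptions that do not come from the patent: the gates omit bias terms, the final forward and backward hidden states are concatenated to form the sentence representation, weights are random rather than trained, and all dimensions and names (`GRUCell`, `bi_gru_encode`, `fc_relu_dropout`) are hypothetical. Only the count of 5 questions follows [0088].

```python
import math
import random


def _matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class GRUCell:
    """Minimal GRU cell without bias terms (assumption for brevity)."""

    def __init__(self, input_dim, hidden_dim, rng):
        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
        # One (W, U) pair each for update gate z, reset gate r, candidate n.
        self.W = {g: mat(hidden_dim, input_dim) for g in "zrn"}
        self.U = {g: mat(hidden_dim, hidden_dim) for g in "zrn"}

    def step(self, x, h):
        z = [_sigmoid(a + b) for a, b in zip(_matvec(self.W["z"], x), _matvec(self.U["z"], h))]
        r = [_sigmoid(a + b) for a, b in zip(_matvec(self.W["r"], x), _matvec(self.U["r"], h))]
        n = [math.tanh(a + ri * b)
             for a, ri, b in zip(_matvec(self.W["n"], x), r, _matvec(self.U["n"], h))]
        return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


def bi_gru_encode(word_vectors, fwd, bwd, hidden_dim):
    """Run one GRU left-to-right and one right-to-left; concatenate final states."""
    h_f = [0.0] * hidden_dim
    for x in word_vectors:
        h_f = fwd.step(x, h_f)
    h_b = [0.0] * hidden_dim
    for x in reversed(word_vectors):
        h_b = bwd.step(x, h_b)
    return h_f + h_b


def fc_relu_dropout(v, weights, bias, drop_p=0.0, rng=None):
    """FC layer, then ReLU, then (inverted) dropout when drop_p > 0."""
    out = [max(0.0, a + b) for a, b in zip(_matvec(weights, v), bias)]
    if drop_p > 0.0 and rng is not None:
        out = [0.0 if rng.random() < drop_p else o / (1.0 - drop_p) for o in out]
    return out


rng = random.Random(0)
EMB, HID, OUT = 4, 3, 5  # hypothetical sizes
fwd, bwd = GRUCell(EMB, HID, rng), GRUCell(EMB, HID, rng)
fc_w = [[rng.uniform(-0.1, 0.1) for _ in range(2 * HID)] for _ in range(OUT)]
fc_b = [0.0] * OUT

# 5 text questions, each a sequence of 6 random word vectors (placeholder data).
questions = [[[rng.uniform(-1, 1) for _ in range(EMB)] for _ in range(6)] for _ in range(5)]
semantic_vectors = [
    fc_relu_dropout(bi_gru_encode(q, fwd, bwd, HID), fc_w, fc_b) for q in questions
]
```

Dropout is disabled by default here (`drop_p=0.0`), as it would be at inference time; passing a nonzero `drop_p` together with an `rng` applies it as in training.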

Abstract

The embodiments of the invention provide a visual question answering method, a training method for a visual question answering model, and corresponding devices, electronic equipment and computer storage media. The training method comprises: receiving a training sample and inputting it to the visual question answering model through an input part, wherein the training sample comprises a sample image and a plurality of text questions corresponding to the sample image; through a feature extraction part of the model, performing feature extraction on the plurality of text questions to obtain a plurality of corresponding semantic vectors, and performing feature extraction on the sample image to obtain a corresponding image feature vector; in an expression learning part of the model, processing the image feature vector and the semantic vectors with an attention mechanism to obtain an image feature expression vector and a question feature expression vector; and finally, through an output part of the model, predicting question results from the image feature expression vector and the question feature expression vector, and training the visual question answering model according to the prediction result.
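The expression learning and output parts summarized above can be illustrated with a small sketch. The abstract only says that "an attention mechanism" is used, so everything specific here is an assumption: scaled dot-product attention, questions attending over one another to form the question feature expression, each question expression then attending over the image features, concatenation followed by a linear layer and softmax for prediction, and cross-entropy as the training signal. All function names, shapes and numbers are hypothetical, and image features and semantic vectors are assumed to share one dimension.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Scaled dot-product attention of one query over key/value vectors."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def expression_learning(image_feats, semantic_vecs):
    # Question expression: every question attends over all questions (assumption).
    question_expr = [attend(q, semantic_vecs, semantic_vecs) for q in semantic_vecs]
    # Image expression: each question expression attends over the image features.
    image_expr = [attend(q, image_feats, image_feats) for q in question_expr]
    return image_expr, question_expr


def predict(image_expr, question_expr, answer_weights):
    """Output part: per-question distribution over candidate answers."""
    preds = []
    for img, qst in zip(image_expr, question_expr):
        fused = img + qst  # concatenate the two expression vectors
        preds.append(softmax([dot(w, fused) for w in answer_weights]))
    return preds


def cross_entropy(preds, labels):
    return -sum(math.log(p[y]) for p, y in zip(preds, labels)) / len(labels)


image_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # e.g. 3 frame features
semantic_vecs = [[1.0, 0.2], [0.1, 0.9]]            # 2 question semantic vectors
answer_weights = [                                  # 3 candidate answers
    [0.3, -0.1, 0.2, 0.0],
    [-0.2, 0.4, 0.1, 0.3],
    [0.0, 0.1, -0.3, 0.2],
]

image_expr, question_expr = expression_learning(image_feats, semantic_vecs)
preds = predict(image_expr, question_expr, answer_weights)
loss = cross_entropy(preds, labels=[0, 1])
```

In an actual training loop the loss would be back-propagated to update the weights; that step is omitted here because it depends on the framework used.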

Description

Technical field

[0001] The embodiments of the present invention relate to the field of computer technology, and in particular to a training method for a visual question answering model and a visual question answering method, as well as the devices, electronic equipment, and computer storage media corresponding to each of these methods.

Background

[0002] Visual Question Answering (VQA) is a learning task involving computer vision and natural language processing. A VQA system takes as input an image and a free-form, open-ended natural language question about the image, and generates a natural language answer as output. In short, VQA is question answering for a given image.

[0003] In a traditional VQA system, an image and text questions are taken as input, the two kinds of information are combined, and an answer is generated as output. In a specific implementation, it summariz...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F16/9032, G06F40/30, G06K9/00, G06N3/04, H04N21/2187, H04N21/466, H04N21/44

CPC: G06F16/90332, H04N21/2187, H04N21/4668, H04N21/44008, G06N3/045

Inventor: 雷陈奕, 王国鑫, 李朝, 唐海红

Owner: ALIBABA GRP HLDG LTD
