A visual question answering method based on stacked attention and gate fusion

By combining stacked attention and gating fusion in a visual question answering method, and integrating multimodal skip connection attention and gating fusion networks, the robustness and flexibility of visual question answering models in multimodal information interaction and fusion are addressed, achieving significant improvements on the VQA-v2 dataset.

CN116595133B · Active · Publication Date: 2026-06-19 · ZHEJIANG SCI-TECH UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
ZHEJIANG SCI-TECH UNIV
Filing Date
2023-04-18
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing visual question answering models lack robustness and flexibility in multimodal information interaction and fusion, making it difficult to achieve good results on the VQA-v2 dataset without using additional datasets.

Method used

We adopt a stacked attention and gating fusion approach, which uses a multimodal skip connection attention and gating fusion network, combined with a two-stream interaction model and a single-stream interaction model, to perform early and late feature fusion, thereby enhancing the computational efficiency and robustness of the model.

Benefits of technology

Significant improvements were achieved on the VQA-v2 dataset, demonstrating strong robustness and flexibility, and achieving good results without using additional datasets.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention belongs to the field of cross-modal task technology combining natural language processing and computer vision. The purpose is to provide a visual question answering method based on stacked attention and gating fusion. This invention can achieve good results on different visual question answering tasks, and has strong robustness and flexibility. It can achieve good results on the VQA-v2 dataset without using additional datasets. The technical solution is: a visual question answering method based on stacked attention and gating fusion, including the following steps: (1): partitioning the dataset; (2): constructing text features of the question; (3): constructing visual features of the image; (4): performing early fusion of text features and visual features; (5): constructing a stacked attention network; (6): constructing an attention ablation network; (7): constructing a gating fusion network for late fusion; (8): predicting the answer.

Description

Technical Field

[0001] This invention belongs to the field of cross-modal task technology that combines natural language processing and computer vision, and specifically relates to a visual question answering method based on stacked attention and gating fusion.

Background Technology

[0002] Visual question answering (VQA) is a challenging multimodal task that requires an algorithm to understand the semantics of both the image and the question at a fine-grained level and to predict the answer accurately through complex reasoning. Most current image tasks, such as image classification, object detection, and object segmentation, can achieve good results without a complete understanding of the information contained in the image. Questions in VQA, by contrast, are arbitrary and therefore encompass many image tasks: if the question is "What's in the image?", it is an object recognition task; if the question is "Is there a dog in the image?", it is an object detection task. Beyond these, there are more complex questions, such as "What's between the man and the TV?" and "Why is the woman crying?", which involve spatial relationships and common-sense reasoning. VQA is therefore very complex and requires solving several AI problems together.

[0003] The rapid development of deep learning in recent years has enabled computers to handle complex multimodal tasks, giving rise to tasks that combine computer vision and natural language processing, such as image captioning and visual question answering. These tasks require a fine-grained understanding of both vision and language. Image captioning requires only a relatively low-level reading of a given image followed by the generation of a free-form text description. In contrast, visual question answering is a more challenging multimodal task that requires an algorithm to understand the semantics of the image and the question at a fine-grained level and to predict the answer accurately through complex reasoning. Visual question answering therefore presents the following three difficulties:

[0004] (1) How to effectively construct a robust model. The mainstream methods in visual question answering can be broadly divided into two categories: two-stream interaction models and single-stream interaction models. In a two-stream interaction model, the text encoder and the visual encoder each form one stream. Generally, a Long Short-Term Memory (LSTM) network is used to capture text features, while either a convolutional neural network from image classification, such as a Deep Residual Network (ResNet), is used to obtain image grid features, or an object detection model such as Faster R-CNN is used to obtain image bounding box features. In a single-stream interaction model, the text encoder and the visual encoder obtain their respective features, which are then concatenated so that both visual and text features are fed into a single model. However, the performance of both two-stream and single-stream interaction models depends heavily on the specific problem, resulting in relatively poor robustness, diversity, and flexibility.

[0005] (2) How to effectively carry out the interaction between multimodal information. The MCAN model proposes a deep modular collaborative attention network to learn the dense interaction relationships within and between modalities. However, the MCAN model ignores the influence of image features on text features, resulting in insufficient interaction between modalities.

[0006] (3) How to effectively fuse multiple modalities. There are three main approaches to multimodal fusion: simple operations, attention-based methods, and tensor-based methods. Simple operations fuse features from multiple modalities through concatenation or weighted summation, but the fusion quality is limited. Attention-based methods use attention to compute weights dynamically and sum the features accordingly, which improves accuracy over simple operations at the cost of a large number of parameters and slow computation. Tensor-based methods, such as bilinear pooling, still have shortcomings in mathematical effectiveness and leave significant room for improvement.

Summary of the Invention

[0007] The purpose of this invention is to overcome the shortcomings of the above-mentioned background technology and to propose a visual question answering method based on stacked attention and gating fusion. This invention can achieve good results on different visual question answering tasks and has strong robustness and flexibility. It can achieve good results on the VQA-v2 dataset without using additional datasets.

[0008] The technical solution adopted in this invention is a visual question answering method based on stacked attention and gating fusion, comprising the following steps:

[0009] Step (1): Divide the dataset;

[0010] Step (2): Construct the textual features of the question;

[0011] Step (3): Construct the visual features of the image;

[0012] Step (4): Perform early fusion of textual and visual features;

[0013] Step (5): Construct a stacked attention network;

[0014] Step (6): Construct an attention ablation network;

[0015] Step (7): Construct a gated fusion network for late-stage fusion;

[0016] Step (8): Predict the answer.

[0017] Furthermore, the partitioning of the dataset described in step (1) is as follows:

[0018] The dataset in question is the VQA-v2 dataset, derived from the MS-COCO dataset. The dataset is divided into three subsets: training set, validation set, and test set, with the data volume of the three subsets accounting for 40%, 20%, and 40%, respectively.

[0019] Furthermore, the construction of the textual features of the question described in step (2) is as follows:

[0020] For the input text, since only 0.25% of the questions in the VQA-v2 dataset are longer than 14 words, the length k of each question is limited to 14 words (questions shorter than 14 words are padded with 0s, and excess words are discarded). Word embedding is performed using the large-scale pre-trained GloVe word vectors, converting each word into a word vector. The word vector features of each question are $Y_{question} \in \mathbb{R}^{k \times d_y}$, where $d_y = 300$ is the dimension of the word vector features. The text input features of the question, $Y_{input} \in \mathbb{R}^{k \times d}$, are obtained through a single-layer LSTM network, where $d = 512$ is the dimension of the text input features. The specific formula is as follows:

[0021] $Y_{input} = \mathrm{LSTM}(Y_{question})$.
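The length rule in step (2) can be illustrated with a short sketch. It assumes that "discarded if more" means truncating to the first 14 tokens, and that questions are already mapped to integer token ids; the function name `pad_or_truncate` is hypothetical and not part of the patent.

```python
from typing import List

PAD_ID = 0    # padding value stated in the text
MAX_LEN = 14  # question length k stated in the text


def pad_or_truncate(token_ids: List[int], max_len: int = MAX_LEN,
                    pad_id: int = PAD_ID) -> List[int]:
    """Return exactly max_len ids: extra tokens are dropped, short inputs are zero-padded."""
    clipped = token_ids[:max_len]  # assumption: truncation keeps the first max_len tokens
    return clipped + [pad_id] * (max_len - len(clipped))


short_question = pad_or_truncate([5, 9, 2])
long_question = pad_or_truncate(list(range(1, 21)))
print(len(short_question), short_question[:4])  # 14 [5, 9, 2, 0]
print(len(long_question), long_question[-1])    # 14 14
```

Every question then has the same length k = 14 before the embedding lookup and the LSTM.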

[0022] Furthermore, the construction of visual features of the image in step (3) is as follows:

[0023] For the input image, the open-source object detection model Faster R-CNN is used to extract objects from the image. A series of candidate boxes are selected, and the image regions corresponding to the candidate boxes are fed back into the object detection network to obtain the image target region features $X_{image} \in \mathbb{R}^{m \times d_x}$.

[0024] Here m is the number of candidate boxes, and $d_x = 2048$ is the dimension of the visual features of the input image;

[0025] Subsequently, a linear layer further processes the image target region features $X_{image}$ so that their dimension matches that of the text input features, yielding the visual input features $X_{input} \in \mathbb{R}^{m \times d}$.

[0026] The specific formula is as follows:

[0027] $X_{input} = \mathrm{Linear}(X_{image})$.

[0028] Furthermore, the early fusion of textual and visual features described in step (4) is as follows:

[0029] The text-visual fusion input features $Z_{input}$ are formed by concatenating the text input features $Y_{input}$ and the visual input features $X_{input}$; specifically:

[0030] $Z_{input} = [Y_{input}, X_{input}]$.

[0031] Furthermore, the construction of the stacked attention network in step (5) is as follows:

[0032] The constructed stacked attention network includes a text depth stacked attention layer, a visual depth stacked attention layer, and a visual-text hybrid depth stacked attention layer;

[0033] The text depth stacked attention layer described in step (5.1) is composed of stacked self-attention (SA) modules.

[0034] The self-attention module is defined by the following formula:

[0035] $Y' = \mathrm{LN}(Y + \mathrm{MHSA}(Y))$

[0036] $Y'' = \mathrm{LN}(Y' + \mathrm{FFN}(Y'))$

[0037] As can be seen from the above formulas, the self-attention (SA) module consists of a multi-head self-attention unit (MHSA) and two fully connected layers (FFN).

[0038] Here MHSA is a multi-head self-attention unit, FFN is a 2-layer fully connected layer, LN is a layer normalization operation, $Y$ is the input vector of the formulas, $Y'$ is a temporary vector, and $Y''$ is the output vector of the formulas.

[0039] The formula for MHSA is as follows:

[0040]

[0041] $Q = Y W_Q$

[0042] $K = Y W_K$

[0043] $V = Y W_V$

[0044] where $W_Q$, $W_K$, and $W_V$ are the linear transformation weight matrices, and Q, K, and V are all obtained by linear transformation of Y;

[0045] The formula for FFN is as follows:

[0046] $\mathrm{FFN}(Y') = W'_{FFN}\,\mathrm{Dropout}(\mathrm{GELU}(W''_{FFN}\,Y'))$

[0047] where $W'_{FFN}$ and $W''_{FFN}$ are the linear transformation weight matrices, Dropout is the random deactivation layer, GELU is the activation function, and $Y'$ is the input vector of the formula;

[0048] The formula for stacked self-attention (SA) modules is as follows:

[0049] $U^{(n)} = \mathrm{SA}(U^{(n-1)})$

[0050] where $U^{(0)} = Y_{input}$; that is, the input of the first SA module is the text input feature $Y_{input}$, and the output of the last SA module is the text output feature $Y_{output}$.
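The first sub-layer of the SA module, $Y' = \mathrm{LN}(Y + \mathrm{MHSA}(Y))$, can be sketched in plain Python. The attention formula itself is not present in this text, so the sketch assumes the standard scaled dot-product form with a single head, and a layer normalization without learned scale and shift; all function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of an (n x p) and a (p x q) matrix."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(row: List[float], eps: float = 1e-6) -> List[float]:
    """LN without learned parameters: zero mean and (near) unit variance per row."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]


def self_attention(y: Matrix, w_q: Matrix, w_k: Matrix, w_v: Matrix) -> Matrix:
    """Single-head attention (assumed form) with Q, K, V all projected from y."""
    q, k, v = matmul(y, w_q), matmul(y, w_k), matmul(y, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)


def sa_first_sublayer(y: Matrix, w_q: Matrix, w_k: Matrix, w_v: Matrix) -> Matrix:
    """Y' = LN(Y + attention(Y)): residual connection, then layer normalization."""
    attended = self_attention(y, w_q, w_k, w_v)
    return [layer_norm([a + b for a, b in zip(y_row, att_row)])
            for y_row, att_row in zip(y, attended)]


identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]  # 3 toy tokens with d = 2
y_prime = sa_first_sublayer(tokens, identity, identity, identity)
print(len(y_prime), len(y_prime[0]))  # 3 2
```

The output keeps the input shape, which is what allows SA modules to be stacked as $U^{(n)} = \mathrm{SA}(U^{(n-1)})$.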

[0051] The visual depth stacked attention layer described in step (5.2) is composed of stacked guided self-attention (SGA) modules.

[0052] A guided self-attention module is defined by the following formula:

[0053] $X' = \mathrm{LN}(X + \mathrm{MHSA}(X))$

[0054] $X'' = \mathrm{LN}(X' + \mathrm{MHGA}(X', Y))$

[0055] $X''' = \mathrm{LN}(X'' + \mathrm{FFN}(X''))$

[0056] As can be seen from the above formulas, the guided self-attention (SGA) module consists of a multi-head self-attention unit (MHSA), a multi-head guided self-attention unit (MHGA), and two fully connected layers (FFN).

[0057] The formula for MHSA is as follows:

[0058]

[0059] $Q = X W_Q$

[0060] $K = X W_K$

[0061] $V = X W_V$

[0062] where $W_Q$, $W_K$, and $W_V$ are the linear transformation weight matrices, $X$ is the input vector of the module, $X'$ and $X''$ are temporary vectors within the module, and $X'''$ is the output vector of the module.

[0063] The formula for MHGA is as follows:

[0064]

[0065] $Q = X' W_Q$

[0066] $K = Y W_K$

[0067] $V = Y W_V$

[0068] where $W_Q$, $W_K$, and $W_V$ are the linear transformation weight matrices, $d_q$ is the dimension of $X'$, Q is obtained by linear transformation of $X'$, K and V are obtained by linear transformation of Y, and $X'$ and Y are the input vectors of the formula.

[0069] The formula for FFN is as follows:

[0070] $\mathrm{FFN}(Y'') = W'_{FFN}\,\mathrm{Dropout}(\mathrm{GELU}(W''_{FFN}\,Y''))$

[0071] where $W'_{FFN}$ and $W''_{FFN}$ are the linear transformation weight matrices, Dropout is the random deactivation layer, GELU is the activation function, and $Y''$ is the input vector of the formula;

[0072] The formula for the stacked guided self-attention (SGA) modules is as follows:

[0073] $U^{(n)} = \mathrm{SGA}(U^{(n-1)}, Y)$

[0074] where $U^{(0)} = X_{input}$; that is, the input of the first SGA module is the visual input feature $X_{input}$, Y is the text output feature $Y_{output}$ of the text depth stacked attention layer, and the output of the last SGA module is the visual output feature $X_{output}$.

[0075] The visual text mixing deep stacked attention layer described in step (5.3) is composed of stacked multimodal skip connection attention (MSCA) modules.

[0076] The multimodal skip connection attention module is defined by the following formula:

[0077] $Z_{input} = [Y_{input}, X_{input}]$

[0078] $Y' = \mathrm{LN}(Y_{input} + \mathrm{MHSA}(Y_{input}))$

[0079] $X' = \mathrm{LN}(X + \mathrm{MHSA}(X))$

[0080] $X'' = \mathrm{LN}(X' + \mathrm{MHGA}(X', Y'))$

[0081] $X''' = \mathrm{LN}(X'' + \mathrm{FFN}(X''))$

[0082] $Z' = [Y_{input}, X''']$

[0083] $Z_{output} = \mathrm{LN}(Z' + \mathrm{MHSA}(Z'))$

[0084] As can be seen from the above formulas, the multimodal skip connection attention (MSCA) module consists of a multi-head self-attention unit (MHSA), a multi-head guided self-attention unit (MHGA), and two fully connected layers (FFN).

[0085] Here MHSA is a multi-head self-attention unit, FFN is a 2-layer fully connected layer, LN is a layer normalization operation, $X_{input}$ and $Y_{input}$ are the input vectors of the formulas, $Z_{input}$ is formed by concatenating $X_{input}$ and $Y_{input}$, $X'$, $X''$, $X'''$, $Y'$, and $Z'$ are temporary vectors in the formulas, and $Z_{output}$ is the output vector of the formulas;

[0086] The formula for MHSA is as follows:

[0087]

[0088] $Q = Z W_Q$

[0089] $K = Z W_K$

[0090] $V = Z W_V$

[0091] where $W_Q$, $W_K$, and $W_V$ are the linear transformation weight matrices, Z is the input vector of the formula, and Q, K, and V are all obtained by linear transformation of Z.

[0092] The formula for FFN is as follows:

[0093] $\mathrm{FFN}(Z') = W'_{FFN}\,\mathrm{Dropout}(\mathrm{GELU}(W''_{FFN}\,Z'))$

[0094] where $W'_{FFN}$ and $W''_{FFN}$ are the linear transformation weight matrices, Dropout is the random deactivation layer, GELU is the activation function, and $Z'$ is the input vector of the formula.

[0095] The formula for MHGA is as follows:

[0096]

[0097] $Q = X' W_Q$

[0098] $K = Y' W_K$

[0099] $V = Y' W_V$

[0100] where $W_Q$, $W_K$, and $W_V$ are the linear transformation weight matrices, $d_q$ is the dimension of $X'$, Q is obtained by linear transformation of $X'$, and K and V are obtained by linear transformation of $Y'$.

[0101] The formula for stacked multimodal skip connection attention (MSCA) modules is as follows:

[0102] $U^{(n)} = \mathrm{MSCA}(U^{(n-1)})$

[0103] where $U^{(0)} = Z_{input}$; that is, the input of the first MSCA module is the text-visual fusion input feature $Z_{input}$, and the output of the last MSCA module is the text-visual fusion output feature $Z_{output}$.

[0104] Furthermore, the construction of the attention ablation network in step (6) is as follows:

[0105] Step (6.1) Construct a text feature attention ablation network:

[0106] The text output feature $Y_{output}$ of the previous layer is passed through two linear layers, Linear1 and Linear2, in the text feature attention ablation network to compute the attention weights for all text features. The specific formula is as follows:

[0107] $W_Y = \mathrm{softmax}(\mathrm{Linear1}(\mathrm{Linear2}(Y_{output})))$

[0108] $\quad = \mathrm{softmax}(\mathrm{GELU}(Y_{output} W'_Y)\, W''_Y)$

[0109] where $W'_Y$ and $W''_Y$ are the linear transformation weight matrices of the two linear layers, and GELU is the activation function;

[0110] The text attention weights $W_Y$ and the text output features $Y_{output}$ are combined by weighted summation to obtain the comprehensive text feature $Y_{final}$. The specific formula is as follows:

[0111]

[0112] Step (6.2) Construct a visual feature attention ablation network:

[0113] The visual output feature $X_{output}$ of the previous layer is passed through two linear layers, Linear1 and Linear2, in the visual feature attention ablation network to compute the attention weights for all visual features. The specific formula is as follows:

[0114] $W_X = \mathrm{softmax}(\mathrm{Linear1}(\mathrm{Linear2}(X_{output})))$

[0115] $\quad = \mathrm{softmax}(\mathrm{GELU}(X_{output} W'_X)\, W''_X)$

[0116] where $W'_X$ and $W''_X$ are the linear transformation weight matrices of the two linear layers, and GELU is the activation function;

[0117] The visual attention weights $W_X$ and the visual output features $X_{output}$ are combined by weighted summation to obtain the comprehensive visual feature $X_{final}$. The specific formula is as follows:

[0118]

[0119] Step (6.3) Construct a text-visual fusion feature attention ablation network:

[0120] The text-visual fusion output feature $Z_{output}$ is passed through two linear layers, Linear1 and Linear2, in the text-visual fusion feature attention ablation network to compute the attention weights for all text-visual fusion features. The specific formula is as follows:

[0121] $W_Z = \mathrm{softmax}(\mathrm{Linear1}(\mathrm{Linear2}(Z_{output})))$

[0122] $\quad = \mathrm{softmax}(\mathrm{GELU}(Z_{output} W'_Z)\, W''_Z)$

[0123] where $W'_Z$ and $W''_Z$ are the linear transformation weight matrices of the two linear layers, and GELU is the activation function;

[0124] The text-visual fusion attention weights $W_Z$ and the text-visual fusion output features $Z_{output}$ are combined by weighted summation to obtain the comprehensive text-visual fusion feature $Z_{final}$. The specific formula is as follows:

[0125]
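The three attention ablation networks in step (6) share one pattern: score each row of the feature matrix, apply softmax over the rows, and take the weighted sum. The weighted-sum formulas are missing from this text, so the following is only a sketch. It assumes the second linear layer produces a single score per row, omits biases, and uses the exact (erf-based) GELU; the names `attention_pool`, `w1`, and `w2` are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def attention_pool(features: Matrix, w1: Matrix, w2: List[float]) -> Tuple[List[float], List[float]]:
    """Return (weights, pooled): softmax(GELU(F w1) w2) over rows, then the weighted row sum."""
    hidden = [[gelu(sum(f * w for f, w in zip(row, col))) for col in zip(*w1)]
              for row in features]
    scores = [sum(h * w for h, w in zip(h_row, w2)) for h_row in hidden]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # one weight per row, summing to 1
    dim = len(features[0])
    pooled = [sum(wt * row[j] for wt, row in zip(weights, features)) for j in range(dim)]
    return weights, pooled


toy_features = [[1.0, 2.0], [3.0, 4.0]]
weights, pooled = attention_pool(toy_features, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
print(weights, pooled)  # zero scoring weights give uniform attention: [0.5, 0.5] [2.0, 3.0]
```

Applied to $Y_{output}$, $X_{output}$, and $Z_{output}$ in turn, this kind of pooling collapses each feature matrix into a single vector, which is the form the gated fusion network in step (7) operates on.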

[0126] Furthermore, step (7) involves constructing a gated fusion network for late-stage fusion; specifically as follows:

[0127] In the gated fusion network, the comprehensive text feature $Y_{final}$ and the comprehensive visual feature $X_{final}$ from the upstream network are concatenated to obtain $[Y_{final}, X_{final}]$, which is fed into a linear layer to generate a gating vector, denoted $V_t$. The specific formula is as follows:

[0128] $V_t = \mathrm{Linear}([Y_{final}, X_{final}])$

[0129] $\quad = \sigma(W_{yx}[Y_{final}, X_{final}] + b_{yx})$

[0130] where $W_{yx}$ is the linear transformation weight matrix of the linear layer, $b_{yx}$ is the bias coefficient, and $\sigma(\cdot)$ is the sigmoid nonlinear activation function;

[0131] The gating vector $V_t$ is multiplied by the comprehensive visual feature $X_{final}$ to obtain the non-text offset vector $H_1$, in which the text modality is primary and the visual modality is auxiliary; specifically:

[0132] $H_1 = V_t X_{final}$

[0133] To prevent the magnitude of the non-text offset vector from being too large relative to the original word vectors, a scaling factor $\gamma$ is introduced to keep the magnitude of $H_1$ within a suitable range; the specific formula is as follows:

[0134]

[0135] The output vector $F_t$ is obtained by a weighted sum of the comprehensive text feature $Y_{final}$ and the non-text offset vector $H_1$. The details are as follows:

[0136] $F_t = Y_{final} + \gamma H_1$

[0137] $F_t$ and the comprehensive text-visual fusion feature $Z_{final}$ are concatenated to obtain $[F_t, Z_{final}]$, which is fed into a linear layer to generate a gating vector, denoted $V_m$. The specific formula is as follows:

[0138] $V_m = \mathrm{Linear}([F_t, Z_{final}])$

[0139] $\quad = \sigma(W_{fz}[F_t, Z_{final}] + b_{fz})$

[0140] where $W_{fz}$ is the linear transformation weight matrix of the linear layer, $b_{fz}$ is the bias coefficient, and $\sigma(\cdot)$ is the sigmoid nonlinear activation function;

[0141] The gating vector $V_m$ is multiplied by $F_t$ to obtain the offset vector $H_2$; the details are as follows:

[0142] $H_2 = V_m F_t$

[0143] Similarly, a scaling factor $\gamma$ is introduced to keep the magnitude of $H_2$ within a suitable range; the specific formula is as follows:

[0144]

[0145] $F_t$ and $H_2$ are combined by a weighted sum to obtain the final output feature $F_{final}$. The details are as follows:

[0146] $F_{final} = F_t + \gamma H_2$.
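The two gating stages of step (7) can be sketched on plain vectors. Two points are assumptions rather than statements of the patent: the products $V_t X_{final}$ and $V_m F_t$ are taken element-wise, and $\gamma$ is passed in as a fixed number because its formula is not present in this text. The function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def gate(primary: Vector, auxiliary: Vector, weight: Matrix, bias: Vector) -> Vector:
    """sigma(W [primary, auxiliary] + b): one gate value per output dimension."""
    joined = primary + auxiliary  # list concatenation plays the role of [., .]
    return [sigmoid(sum(w * v for w, v in zip(row, joined)) + b)
            for row, b in zip(weight, bias)]


def gated_shift(base: Vector, source: Vector, gate_values: Vector, gamma: float) -> Vector:
    """base + gamma * (gate * source), with the product taken element-wise (assumption)."""
    return [b + gamma * g * s for b, g, s in zip(base, gate_values, source)]


def late_fusion(y_final: Vector, x_final: Vector, z_final: Vector,
                w_yx: Matrix, b_yx: Vector, w_fz: Matrix, b_fz: Vector,
                gamma: float) -> Vector:
    v_t = gate(y_final, x_final, w_yx, b_yx)
    f_t = gated_shift(y_final, x_final, v_t, gamma)  # F_t = Y_final + gamma * H1
    v_m = gate(f_t, z_final, w_fz, b_fz)
    return gated_shift(f_t, f_t, v_m, gamma)         # F_final = F_t + gamma * H2, H2 = V_m F_t


zeros = [[0.0] * 4, [0.0] * 4]  # zero weights make every gate value sigmoid(0) = 0.5
result = late_fusion([1.0, 2.0], [2.0, 4.0], [0.0, 0.0],
                     zeros, [0.0, 0.0], zeros, [0.0, 0.0], gamma=1.0)
print(result)  # [3.0, 6.0]
```

In the toy call, the first stage gives $F_t = [1, 2] + 0.5 \cdot [2, 4] = [2, 4]$ and the second gives $F_{final} = [2, 4] + 0.5 \cdot [2, 4] = [3, 6]$.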

[0147] Furthermore, the predicted answer in step (8) is as follows:

[0148] Layer normalization is applied to $F_t$, and the result is fed into a linear layer to obtain the final output. Specifically:

[0149]

[0150] $W_{final}$ is the linear transformation weight matrix of the linear layer, and LN is layer normalization.

[0151] The score distribution of the output $F'_{final}$ is computed against the ground-truth answers, and the word corresponding to the index with the largest score is output as the predicted answer. The loss function used is binary cross-entropy (BCE), and the formula is as follows:

[0152]
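The BCE formula is not present in this text. As a reference point, the sketch below uses the standard binary cross-entropy, averaged over the candidate answers, together with an argmax over the scores to pick the predicted answer. The mean reduction, the clipping constant `eps`, and the function names are assumptions.

```python
import math
from typing import List


def bce_loss(probs: List[float], targets: List[float], eps: float = 1e-12) -> float:
    """Mean binary cross-entropy between predicted probabilities and target scores."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(probs)


def predict_answer(scores: List[float], vocabulary: List[str]) -> str:
    """Return the candidate answer whose score is largest."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return vocabulary[best]


print(predict_answer([0.1, 0.7, 0.2], ["yes", "no", "2"]))  # no
print(bce_loss([0.5, 0.5], [1.0, 0.0]))
```

With probabilities of 0.5 for every candidate, the loss equals $\ln 2$ regardless of the targets, which is a convenient sanity check for an untrained model.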

[0153] The beneficial effects of this invention are:

[0154] This invention proposes a visual question answering method based on stacked attention and gating fusion. It combines a two-stream interaction model and a single-stream interaction model, and improves the computational efficiency of stacked attention by proposing multimodal skip connection attention. Stacked attention aligns visual and textual features, and during late fusion the gating fusion network adaptively selects features from both the two-stream and single-stream interaction models to answer questions, making the trained model more robust. This invention achieves significant improvements in visual question answering tasks, demonstrating strong performance across various datasets.

Description of the Drawings

[0155] Figure 1 is a network model structure diagram of the method described in this invention.

[0156] Figure 2 is a schematic diagram of the stacked attention method described in this invention.

[0157] Figure 3 is a schematic diagram of the self-attention module of the method described in this invention.

[0158] Figure 4 is a schematic diagram of the guided self-attention module in the method described in this invention.

[0159] Figure 5 is a schematic diagram of the multimodal skip connection attention module in the method described in this invention.

[0160] Figure 6 is a schematic diagram of the gated fusion network in the method described in this invention.

Detailed Implementation

[0161] This invention combines a two-stream interaction model and a single-stream interaction model. It performs an early multimodal feature fusion to form early fused features, and then introduces the features into a stacked attention in a three-stream manner. It proposes a multimodal skip connection attention to improve the computational efficiency of the stacked attention. In the stacked attention, features within and between modalities are fully interacted and aligned. After passing through an attention ablation network, a comprehensive feature is obtained. Finally, a gated fusion network is used for late-stage fusion of multimodal features.

[0162] The embodiments shown in the accompanying drawings are described in further detail below.

[0163] As shown in Figure 1, the visual question answering method based on stacked attention and gating fusion includes the following steps:

[0164] Step (1) Split the dataset, as follows:

[0165] The dataset in question is the VQA-v2 dataset, derived from the MS-COCO dataset. The dataset is divided into three subsets: training set, validation set, and test set, with the data volume of the three subsets accounting for 40%, 20%, and 40%, respectively.

[0166] Step (2) Construct the textual features of the question, as follows:

[0167] For the input text, since only 0.25% of the questions in the VQA-v2 dataset are longer than 14 words, the length k of each question is limited to 14 words (questions shorter than 14 words are padded with 0s, and excess words are discarded). Word embedding is performed using the large-scale pre-trained GloVe word vectors, converting each word into a word vector. The word vector features of each question are $Y_{question} \in \mathbb{R}^{k \times d_y}$, where $d_y = 300$ is the dimension of the word vector features. The text input features of the question, $Y_{input} \in \mathbb{R}^{k \times d}$, are obtained through a single-layer Long Short-Term Memory (LSTM) network, where $d = 512$ is the dimension of the text input features. The specific formula is as follows:

[0168] $Y_{input} = \mathrm{LSTM}(Y_{question})$.

[0169] Step (3) Construct the visual features of the image, as follows:

[0170] For the input image, the open-source object detection model Faster R-CNN is used to extract objects from the image. A series of candidate boxes are selected, and the image regions corresponding to the candidate boxes are fed back into the object detection network to obtain the image target region features $X_{image} \in \mathbb{R}^{m \times d_x}$, where m is the number of candidate boxes and $d_x = 2048$ is the dimension of the visual features of the input image. Subsequently, a linear layer further processes the image target region features $X_{image}$ so that their dimension matches that of the text input features, yielding the visual input features $X_{input} \in \mathbb{R}^{m \times d}$. The specific formula is as follows:

[0171] $X_{input} = \mathrm{Linear}(X_{image})$.

[0172] Step (4) Perform early fusion of the semantic features of the question and the visual features of the image, as follows:

[0173] The question-image fusion features $Z_{input}$ are formed by concatenating the semantic features of the question $Y_{input}$ and the visual features of the image $X_{input}$, as shown below:

[0174] $Z_{input} = [Y_{input}, X_{input}]$.
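Because the linear layer in step (3) brings the visual features to the same dimension d as the text features, one natural reading of the concatenation is along the sequence axis, giving a matrix with k + m rows and d columns. The text does not state the axis, so this is an assumption, and `early_fusion` is a hypothetical name.

```python
from typing import List

Matrix = List[List[float]]


def early_fusion(y_input: Matrix, x_input: Matrix) -> Matrix:
    """Stack the k text rows on top of the m visual rows (sequence-axis concatenation)."""
    if y_input and x_input and len(y_input[0]) != len(x_input[0]):
        raise ValueError("text and visual features must share the feature dimension d")
    return y_input + x_input


text_rows = [[0.1, 0.2, 0.3, 0.4]] * 2    # toy sizes: k = 2, d = 4
visual_rows = [[1.0, 1.0, 1.0, 1.0]] * 3  # toy sizes: m = 3, d = 4
z_input = early_fusion(text_rows, visual_rows)
print(len(z_input), len(z_input[0]))  # 5 4
```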

[0175] Step (5) Construct a stacked attention network, as follows:

[0176] As shown in Figure 2, the stacked attention network includes a text depth stacked attention layer, a visual depth stacked attention layer, and a visual-text hybrid depth stacked attention layer; the text depth stacked attention layer is composed of stacked self-attention (SA) modules, the visual depth stacked attention layer is composed of stacked guided self-attention (SGA) modules, and the visual-text hybrid depth stacked attention layer is composed of stacked multimodal skip connection attention (MSCA) modules.

[0177] Step (5.1) Construct a text depth stacked attention layer:

[0178] The text depth stacked attention layer is composed of stacked self-attention (SA) modules:

[0179] As shown in Figure 3, a self-attention module is defined by the following formulas:

[0180] $Y' = \mathrm{LN}(Y + \mathrm{MHSA}(Y))$

[0181] $Y'' = \mathrm{LN}(Y' + \mathrm{FFN}(Y'))$

[0182] As can be seen from the above formulas, the self-attention (SA) module consists of a multi-head self-attention unit (MHSA) and two fully connected layers (FFN).

[0183] Here MHSA is a multi-head self-attention unit, FFN is a 2-layer fully connected layer, LN is a layer normalization operation, $Y$ is the input vector of the formulas, $Y'$ is a temporary vector, and $Y''$ is the output vector of the formulas.

[0184] The formula for MHSA is as follows:

[0185]

[0186] $Q = Y W_Q$

[0187] $K = Y W_K$

[0188] $V = Y W_V$

[0189] where $W_Q$, $W_K$, and $W_V$ are the linear transformation weight matrices, Y is the input vector of the formula, and Q, K, and V are all obtained by linear transformation of Y.

[0190] The formula for FFN is as follows:

[0191] $\mathrm{FFN}(Y') = W'_{FFN}\,\mathrm{Dropout}(\mathrm{GELU}(W''_{FFN}\,Y'))$

[0192] where $W'_{FFN}$ and $W''_{FFN}$ are the linear transformation weight matrices, Dropout is the random deactivation layer, GELU is the activation function, and $Y'$ is the input vector of the formula;

[0193] The formula for stacked self-attention (SA) modules is as follows:

[0194] $U^{(n)} = \mathrm{SA}(U^{(n-1)})$

[0195] where $U^{(0)} = Y_{input}$; that is, the input of the first SA module is the text input feature $Y_{input}$, and the output of the last SA module is the text output feature $Y_{output}$.

[0196] Step (5.2) Construct the visual depth stacked attention layer:

[0197] The visual depth stacked attention layer is composed of stacked guided self-attention (SGA) modules:

[0198] As shown in Figure 4, a guided self-attention module is defined by the following formulas:

[0199] $X' = \mathrm{LN}(X + \mathrm{MHSA}(X))$

[0200] $X'' = \mathrm{LN}(X' + \mathrm{MHGA}(X', Y))$

[0201] $X''' = \mathrm{LN}(X'' + \mathrm{FFN}(X''))$

[0202] As can be seen from the above formulas, the guided self-attention (SGA) module consists of a multi-head self-attention unit (MHSA), a multi-head guided self-attention unit (MHGA), and two fully connected layers (FFN).

[0203] The formula for MHSA is as follows:

[0204]

[0205] $Q = X W_Q$

[0206] $K = X W_K$

[0207] $V = X W_V$

[0208] where $W_Q$, $W_K$, and $W_V$ are the linear transformation weight matrices, $X$ is the input vector of the module, $X'$ and $X''$ are temporary vectors within the module, and $X'''$ is the output vector of the module.

[0209] The formula for MHGA is as follows:

[0210]

[0211] $Q = X' W_Q$

[0212] $K = Y W_K$

[0213] $V = Y W_V$

[0214] where $W_Q$, $W_K$, and $W_V$ are the linear transformation weight matrices, $d_q$ is the dimension of $X'$, Q is obtained by linear transformation of $X'$, K and V are obtained by linear transformation of Y, and $X'$ and Y are the input vectors of the formula.

[0215] The formula for FFN is as follows:

[0216] $\mathrm{FFN}(Y'') = W'_{FFN}\,\mathrm{Dropout}(\mathrm{GELU}(W''_{FFN}\,Y''))$

[0217] where $W'_{FFN}$ and $W''_{FFN}$ are the linear transformation weight matrices, Dropout is the random deactivation layer, GELU is the activation function, and $Y''$ is the input vector of the formula;

[0218] The formula for the stacked guided self-attention (SGA) modules is as follows:

[0219] $U^{(n)} = \mathrm{SGA}(U^{(n-1)}, Y)$

[0220] where $U^{(0)} = X_{input}$; that is, the input of the first SGA module is the visual input feature $X_{input}$, Y is the text output feature $Y_{output}$ of the text depth stacked attention layer, and the output of the last SGA module is the visual output feature $X_{output}$.
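What distinguishes MHGA from MHSA is where the projections come from: queries from the visual side, keys and values from the text side. The sketch below shows this for a single head, assuming the standard scaled dot-product form (the attention formula itself is missing from this text); the names are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: List[float]) -> List[float]:
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def guided_attention(x_prime: Matrix, y: Matrix, w_q: Matrix, w_k: Matrix,
                     w_v: Matrix) -> Tuple[Matrix, Matrix]:
    """Return (weights, output): each visual row attends over all text rows."""
    q = matmul(x_prime, w_q)  # m x d_q, projected from the visual features X'
    k = matmul(y, w_k)        # k x d_q, projected from the text features Y
    v = matmul(y, w_v)        # k x d_v, projected from the text features Y
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k]
              for q_row in q]
    weights = [softmax(row) for row in scores]  # m x k, each row sums to 1
    return weights, matmul(weights, v)          # output is m x d_v


identity = [[1.0, 0.0], [0.0, 1.0]]
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # m = 3 toy image regions
words = [[4.0, 0.0], [0.0, 4.0]]                # k = 2 toy question words
weights, guided = guided_attention(regions, words, identity, identity, identity)
print(len(weights), len(weights[0]), len(guided))  # 3 2 3
```

The output has one row per image region, so the result can be added back to $X'$ in the residual connection $X'' = \mathrm{LN}(X' + \mathrm{MHGA}(X', Y))$.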

[0221] Step (5.3) Construct a visual-text blending deep stacked attention layer:

[0222] The visual-text hybrid deep stacked attention layer is composed of stacked multimodal skip connection attention (MSCA) modules:

[0223] As shown in Figure 5, the multimodal skip connection attention module is defined by the following formulas:

[0224] $Z_{input} = [Y_{input}, X_{input}]$

[0225] $Y' = \mathrm{LN}(Y_{input} + \mathrm{MHSA}(Y_{input}))$

[0226] $X' = \mathrm{LN}(X + \mathrm{MHSA}(X))$

[0227] $X'' = \mathrm{LN}(X' + \mathrm{MHGA}(X', Y'))$

[0228] $X''' = \mathrm{LN}(X'' + \mathrm{FFN}(X''))$

[0229] $Z' = [Y_{input}, X''']$

[0230] $Z_{output} = \mathrm{LN}(Z' + \mathrm{MHSA}(Z'))$

[0231] As can be seen from the above formulas, the multimodal skip connection attention (MSCA) module consists of a multi-head self-attention unit (MHSA), a multi-head guided self-attention unit (MHGA), and two fully connected layers (FFN).

[0232] Here MHSA is a multi-head self-attention unit, FFN is a 2-layer fully connected layer, LN is a layer normalization operation, $X_{input}$ and $Y_{input}$ are the input vectors of the formulas, $Z_{input}$ is formed by concatenating $X_{input}$ and $Y_{input}$, $X'$, $X''$, $X'''$, $Y'$, and $Z'$ are temporary vectors in the formulas, and $Z_{output}$ is the output vector of the formulas;

[0233] The formula for MHSA is as follows:

[0234]

[0235] $Q = Z W_Q$

[0236] $K = Z W_K$

[0237] $V = Z W_V$

[0238] where $W_Q$, $W_K$, and $W_V$ are the linear transformation weight matrices, Z is the input vector of the formula, and Q, K, and V are all obtained by linear transformation of Z.

[0239] The formula for FFN is as follows:

[0240] $\mathrm{FFN}(Z') = W'_{FFN}\,\mathrm{Dropout}(\mathrm{GELU}(W''_{FFN}\,Z'))$

[0241] where $W'_{FFN}$ and $W''_{FFN}$ are the linear transformation weight matrices, Dropout is the random deactivation layer, GELU is the activation function, and $Z'$ is the input vector of the formula.

[0242] The formula for MHGA is as follows:

[0243]

[0244] Q = X'W Q

[0245] K = Y'W K

[0246] V = Y'WV

[0247] in: Let d be the linear transformation weight matrix, X' and Y' be the input vectors in the formula, and d q Q is the dimension of Y, which is obtained by linear transformation of Y, while K and V are obtained by linear transformation of X.

[0248] The formula for stacked multimodal skip connection attention (MSCA) modules is as follows:

[0249] $U^{(n)} = \mathrm{MSCA}(U^{(n-1)})$

[0250] Where $U^{(0)} = Z_{input}$; that is, the input of the first MSCA module is the text-visual fusion input feature $Z_{input}$, and the output of the last MSCA module is the text-visual fusion output feature $Z_{output}$.
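The data flow of one MSCA module and of the stacked layer can be sketched as follows. This is a wiring illustration under simplifying assumptions, not the patented implementation: the attention stand-in `attend` has no learned projections and a single head, `ffn` is replaced by an element-wise tanh, and layer normalization has no learned scale or shift. All function names are hypothetical.

```python
import math

def layer_norm(rows, eps=1e-6):
    """Normalize each row to zero mean and unit variance."""
    out = []
    for r in rows:
        mean = sum(r) / len(r)
        var = sum((v - mean) ** 2 for v in r) / len(r)
        out.append([(v - mean) / math.sqrt(var + eps) for v in r])
    return out

def add(a, b):
    """Element-wise sum of two equally shaped matrices (the skip connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def attend(queries, context):
    """Parameter-free dot-product attention; stands in for MHSA and MHGA."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(d) for c in context]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * c[j] for e, c in zip(exps, context)) for j in range(d)])
    return out

def ffn(rows):
    """Stand-in for the two-layer FFN, kept trivial to focus on the wiring."""
    return [[math.tanh(v) for v in r] for r in rows]

def msca(z, n_text):
    """One MSCA block on Z = [Y, X]; the first n_text rows are text tokens."""
    y_in, x_in = z[:n_text], z[n_text:]
    y1 = layer_norm(add(y_in, attend(y_in, y_in)))   # Y'   = LN(Y + MHSA(Y))
    x1 = layer_norm(add(x_in, attend(x_in, x_in)))   # X'   = LN(X + MHSA(X))
    x2 = layer_norm(add(x1, attend(x1, y1)))         # X''  = LN(X' + MHGA(X', Y'))
    x3 = layer_norm(add(x2, ffn(x2)))                # X''' = LN(X'' + FFN(X''))
    z1 = y_in + x3                                   # Z'   = [Y_input, X''']
    return layer_norm(add(z1, attend(z1, z1)))       # Z_output = LN(Z' + MHSA(Z'))

def stacked_msca(z, n_text, depth=4):
    """U(n) = MSCA(U(n-1)) with U(0) = Z_input."""
    u = z
    for _ in range(depth):
        u = msca(u, n_text)
    return u

z_input = [[0.1, 0.2, 0.3, 0.4], [0.5, -0.1, 0.0, 0.2],                        # 2 text tokens
           [1.0, 0.0, -1.0, 0.5], [0.3, 0.3, -0.2, 0.9], [0.0, 1.0, 0.5, -0.5]]  # 3 image regions
z_output = stacked_msca(z_input, n_text=2)
```

The output keeps the shape of the input, which is what allows the modules to be stacked; the default depth of 4 follows the number of MSCA modules given in Example 1.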

[0251] Step (6) Construct the attention ablation network, as follows:

[0252] Step (6.1) Construct a text feature attention ablation network:

[0253] The text output feature $Y_{output}$ of the previous layer is passed through the two linear layers, Linear1 and Linear2, of the text feature attention ablation network to calculate the attention weights $W_Y$ of all text features. The specific formula is as follows:

[0254] $W_Y = \mathrm{softmax}(\mathrm{Linear1}(\mathrm{Linear2}(Y_{output})))$

[0255] $= \mathrm{softmax}(\mathrm{GELU}(Y_{output} W'_Y) W''_Y)$

[0256] Where $W'_Y$ and $W''_Y$ are the linear transformation weight matrices of the two linear layers, and GELU is the activation function;

[0257] The text attention weights $W_Y$ and the text output features $Y_{output}$ are combined by weighted summation to obtain the comprehensive text features $Y_{final}$. The specific formula is as follows:

[0258]

[0259] Step (6.2) Construct a visual feature attention ablation network:

[0260] The visual output feature $X_{output}$ of the previous layer is passed through the two linear layers, Linear1 and Linear2, of the visual feature attention ablation network to calculate the attention weights $W_X$ of all visual features. The specific formula is as follows:

[0261] $W_X = \mathrm{softmax}(\mathrm{Linear1}(\mathrm{Linear2}(X_{output})))$

[0262] $= \mathrm{softmax}(\mathrm{GELU}(X_{output} W'_X) W''_X)$

[0263] Where $W'_X$ and $W''_X$ are the linear transformation weight matrices of the two linear layers, and GELU is the activation function;

[0264] The visual attention weights $W_X$ and the visual output features $X_{output}$ are combined by weighted summation to obtain the comprehensive visual features $X_{final}$. The specific formula is as follows:

[0265]

[0266] Step (6.3) Construct a text-visual fusion feature attention ablation network:

[0267] The text-visual fusion feature vector $Z_{output}$ is passed through the two linear layers, Linear1 and Linear2, of the text-visual fusion feature attention ablation network to calculate the attention weights $W_Z$ of all text-visual fusion features. The specific formula is as follows:

[0268] $W_Z = \mathrm{softmax}(\mathrm{Linear1}(\mathrm{Linear2}(Z_{output})))$

[0269] $= \mathrm{softmax}(\mathrm{GELU}(Z_{output} W'_Z) W''_Z)$

[0270] Where $W'_Z$ and $W''_Z$ are the linear transformation weight matrices of the two linear layers, and GELU is the activation function;

[0271] The text-visual fusion feature attention weights $W_Z$ and the text-visual fusion output features $Z_{output}$ are combined by weighted summation to obtain the comprehensive text-visual fusion features $Z_{final}$. The specific formula is as follows:

[0272]
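The three attention ablation networks in step (6) share one pattern: score every position with two linear layers, normalize the scores with softmax, and take the weighted sum of the features. The sketch below illustrates that pattern in plain Python. It assumes that the second layer produces one scalar score per position and that the softmax runs over positions; the source does not state these shapes, and the function and parameter names are hypothetical.

```python
import math

def gelu(x):
    """Exact GELU activation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def softmax(values):
    """Numerically stable softmax over a list of scores."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(features, w_inner, w_outer):
    """Collapse n feature rows into one vector by attention-weighted summation.

    features: n rows of dimension d
    w_inner:  d x h matrix (first projection, followed by GELU)
    w_outer:  length-h vector (second projection, one score per row; an assumption)
    """
    scores = []
    for row in features:
        hidden = [gelu(sum(r * w_inner[i][j] for i, r in enumerate(row)))
                  for j in range(len(w_inner[0]))]
        scores.append(sum(h * w for h, w in zip(hidden, w_outer)))
    weights = softmax(scores)                    # attention weights over the n positions
    d = len(features[0])
    pooled = [sum(w * row[j] for w, row in zip(weights, features)) for j in range(d)]
    return pooled, weights

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_inner = [[1.0, 0.0], [0.0, 1.0]]
w_outer = [1.0, 1.0]
pooled, weights = attention_pool(features, w_inner, w_outer)
```

With these toy weights the third row receives the largest attention weight, and `pooled` is a single vector with the same dimension as each input row.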

[0273] Step (7) Construct a gated fusion network for late-stage fusion, as follows:

[0274] As shown in Figure 5, in the gated fusion network, the comprehensive text features $Y_{final}$ and the comprehensive visual features $X_{final}$ from the upstream network are concatenated to obtain $[Y_{final}, X_{final}]$, which is then fed into a linear layer to generate a gating vector, denoted $V_t$. The specific formula is as follows:

[0275] $V_t = \mathrm{Linear}([Y_{final}, X_{final}])$

[0276] $= \sigma(W_{yx}[Y_{final}, X_{final}] + b_{yx})$

[0277] Where $W_{yx}$ is the linear transformation weight matrix of the linear layer, $b_{yx}$ is the bias coefficient, and $\sigma(\cdot)$ is the sigmoid nonlinear activation function;

[0278] The gating vector $V_t$ is multiplied by the comprehensive visual features $X_{final}$ to obtain a non-text offset vector $H_1$, in which the text modality is primary and the visual modality is auxiliary, as detailed below:

[0279] $H_1 = V_t X_{final}$

[0280] To prevent the magnitude of the non-text offset vector from being too large compared to the original word vectors, a scaling factor $\gamma$ is introduced to keep the magnitude of $H_1$ within a suitable range. The specific formula is as follows:

[0281]

[0282] The comprehensive text features $Y_{final}$ and the non-text offset vector $H_1$ are combined by weighted summation to obtain the output vector $F_t$. The details are as follows:

[0283] $F_t = Y_{final} + \gamma H_1$

[0284] $F_t$ and the comprehensive text-visual fusion features $Z_{final}$ are concatenated to obtain $[F_t, Z_{final}]$, which is then fed into a linear layer to generate a gating vector, denoted $V_m$. The specific formula is as follows:

[0285] $V_m = \mathrm{Linear}([F_t, Z_{final}])$

[0286] $= \sigma(W_{fz}[F_t, Z_{final}] + b_{fz})$

[0287] Where $W_{fz}$ is the linear transformation weight matrix of the linear layer, $b_{fz}$ is the bias coefficient, and $\sigma(\cdot)$ is the sigmoid nonlinear activation function;

[0288] The gating vector $V_m$ is multiplied by $F_t$ to obtain a non-text offset vector $H$, as follows:

[0289] $H = V_m F_t$

[0290] Similarly, a scaling constraint factor γ is introduced to keep the magnitude of H within a suitable range, as shown in the following formula:

[0291]

[0292] $F_t$ and $H$ are combined by weighted summation to obtain the final output feature $F_{final}$. The details are as follows:

[0293] $F_{final} = F_t + \gamma H$.
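The two gating stages of step (7) can be sketched in plain Python as below. Two assumptions are made because the source does not settle them: the gate is applied by element-wise multiplication, and the scaling factor is passed in as a fixed constant `gamma`, since the formulas in paragraphs [0281] and [0291] are not reproduced in this text. The function names and the toy weights are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate(a, b, weight, bias):
    """sigmoid(W [a, b] + bias): one gate value per output dimension."""
    joined = a + b  # list concatenation plays the role of [a, b]
    return [sigmoid(sum(w * v for w, v in zip(row, joined)) + c)
            for row, c in zip(weight, bias)]

def gated_fusion(y, x, z, w_yx, b_yx, w_fz, b_fz, gamma=0.5):
    """Two-stage gated fusion of text (y), visual (x) and fused (z) features."""
    v_t = gate(y, x, w_yx, b_yx)                        # V_t = sigma(W_yx [Y, X] + b_yx)
    h1 = [g * xi for g, xi in zip(v_t, x)]              # H_1 = V_t * X_final
    f_t = [yi + gamma * h for yi, h in zip(y, h1)]      # F_t = Y_final + gamma * H_1
    v_m = gate(f_t, z, w_fz, b_fz)                      # V_m = sigma(W_fz [F_t, Z] + b_fz)
    h = [g * fi for g, fi in zip(v_m, f_t)]             # H = V_m * F_t
    return [fi + gamma * hi for fi, hi in zip(f_t, h)]  # F_final = F_t + gamma * H

zero_w = [[0.0] * 4 for _ in range(2)]   # zero weights make every gate value 0.5
zero_b = [0.0, 0.0]
f_final = gated_fusion([1.0, 2.0], [2.0, 4.0], [0.0, 0.0],
                       zero_w, zero_b, zero_w, zero_b, gamma=0.5)
```

With all-zero weights every gate equals 0.5, so the arithmetic is easy to follow by hand: `f_t` is `[1.5, 3.0]` and `f_final` is `[1.875, 3.75]`.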

[0294] Step (8) Predict the answer, as follows:

[0295] Layer normalization and a linear layer are applied to the final output feature $F_{final}$ to obtain the final output $F'_{final}$. Specifically as follows:

[0296] $F'_{final} = \mathrm{LN}(\mathrm{Linear}(F_{final}))$

[0297] $= \mathrm{LN}(F_{final} W_{final})$

[0298] Where $W_{final}$ is the linear transformation weight matrix of the linear layer, and LN is layer normalization.

[0299] The output $F'_{final}$ is compared with the score distribution of the ground-truth answers, and the word corresponding to the index of the largest value in $F'_{final}$ is output as the predicted answer. The loss function used is binary cross-entropy (BCE), and the formula is as follows:

[0300]
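The loss formula in paragraph [0300] is not reproduced in this text. The sketch below uses the standard binary cross-entropy averaged over the candidate answers, which is an assumption, and also assumes that a sigmoid maps each output score to a probability. The answer vocabulary, scores, and function names are invented for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_loss(logits, targets):
    """Mean binary cross-entropy between sigmoid(logits) and soft target scores."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for s, t in zip(logits, targets):
        p = sigmoid(s)
        total -= t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps)
    return total / len(logits)

def predict(logits, answers):
    """Return the candidate answer at the index of the largest score."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return answers[best]

answers = ["yes", "no", "2"]      # toy vocabulary; Example 1 uses 3129 candidates
targets = [1.0, 0.0, 0.3]         # soft ground-truth scores per candidate
loss_good = bce_loss([2.0, -1.0, 0.5], targets)
loss_bad = bce_loss([-2.0, 1.0, 0.5], targets)
answer = predict([2.0, -1.0, 0.5], answers)
```

Scores that agree with the targets give the lower loss, and the predicted answer is the candidate with the highest score.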

[0301] Example 1

[0302] As shown in Figures 1 to 6, the visual question answering method based on stacked attention and gating fusion provided by the present invention comprises the following steps.

[0303] Step (1) Split the dataset, as follows:

[0304] The dataset used is the VQA-v2 dataset, which comes from the MS-COCO dataset. The dataset is divided into three subsets: training set, validation set, and test set. The training set contains 82,783 images and 443,757 questions, the validation set contains 40,504 images and 214,354 questions, and the test set contains 81,434 images and 447,793 questions.

[0305] Step (2) Construct the textual features of the question, as follows:

[0306] Word embedding is performed using GloVe word vectors to obtain word vector features. The text input features $Y_{input}$ of the question are then obtained using a Long Short-Term Memory (LSTM) network.

[0307] Step (3) Construct the visual features of the image, as follows:

[0308] Features of the target regions in the image are extracted using the open-source model Faster R-CNN and are converted into the visual input features $X_{input}$ through a linear layer.

[0309] Step (4) involves early fusion of textual and visual features, as detailed below:

[0310] The text input features $Y_{input}$ and the visual input features $X_{input}$ are concatenated as follows:

[0311] $Z_{input} = [Y_{input}, X_{input}]$

[0312] Results obtained:

[0313]

[0314] Step (5) Construct a stacked attention network, as follows:

[0315] The text deep stacked attention layer consists of six self-attention (SA) modules stacked together;

[0316] The visual deep stacked attention layer consists of six guided self-attention (SGA) modules stacked together;

[0317] The visual-text hybrid deep stacked attention layer consists of four multimodal skip connection attention (MSCA) modules stacked together.

[0318] Step (6) Construct the attention ablation network, as follows:

[0319] The text output features $Y_{output}$ are passed through the text feature attention ablation network to obtain the text attention weights $W_Y$. The text attention weights $W_Y$ and the text output features $Y_{output}$ are combined by weighted summation to obtain the comprehensive text features $Y_{final}$.

[0320] The visual output features $X_{output}$ are passed through the visual feature attention ablation network to obtain the visual attention weights $W_X$. The visual attention weights $W_X$ and the visual output features $X_{output}$ are combined by weighted summation to obtain the comprehensive visual features $X_{final}$.

[0321] The text-visual fusion output features $Z_{output}$ are passed through the text-visual fusion feature attention ablation network to obtain the text-visual fusion feature attention weights $W_Z$. The weights $W_Z$ and the text-visual fusion output features $Z_{output}$ are combined by weighted summation to obtain the comprehensive text-visual fusion features $Z_{final}$.

[0322] Step (7) Construct a gated fusion network for late-stage fusion, as follows:

[0323] In the gated fusion network, the comprehensive text features $Y_{final}$ and the comprehensive visual features $X_{final}$ from the upstream network are concatenated to obtain $[Y_{final}, X_{final}]$, from which $F_t$ is obtained through the gated fusion network. $F_t$ is then concatenated with the comprehensive text-visual fusion features $Z_{final}$ from the upstream network to obtain $[F_t, Z_{final}]$, from which the final output features $F_{final}$ are obtained through the gated fusion network.

[0324] Step (8) Predict the answer, as follows:

[0325] Layer normalization and a linear layer are applied to the final output feature $F_{final}$ to obtain the final output $F'_{final}$. The candidate answer set size is 3129, and the word corresponding to the index of the largest value in $F'_{final}$ is used as the predicted answer. The loss function is binary cross-entropy (BCE), and the optimizer is Adam.

[0326] The innovation of this invention mainly includes the following two points:

[0327] (1) A multimodal skip connection attention (MSCA) module is used to construct the visual-text hybrid deep stacked attention layer. The module consists of multi-head self-attention units (MHSA), a multi-head guided self-attention unit (MHGA), and two fully connected layers (FFN). The skip connections reduce the number of parameters in the stacked attention network while improving accuracy, which significantly improves computational efficiency.

[0328] (2) A gated fusion network is proposed to achieve effective fusion between modalities. The gated fusion network adaptively selects the weights of visual features and text features, and adaptively selects single-stream interaction features and dual-stream interaction features.

[0329] Model experiments and characterization of experimental results:

[0330] 1. Dataset

[0331] This invention was tested on two publicly available visual question answering datasets: VQA-v2 and GQA. VQA-v2 is a mainstream dataset for visual question answering tasks. The images in VQA-v2 come from the MS-COCO dataset, and the question-answer pairs are manually annotated, with three corresponding questions for each image. All questions are categorized into three classes: "Yes / No," "Number," and "Other." The dataset is divided into three subsets: a training set (80k images and 444k questions), a validation set (40k images and 214k questions), and a test set (80k images and 448k questions). The test set is further divided into two subsets: a test-development set and a test-standard set. The test results report the accuracy for each of the three categories ("Yes / No," "Number," and "Other") and an overall accuracy value. To verify the robustness of this invention, experiments were also conducted on the GQA dataset. The GQA dataset covers a series of tasks including real-world image reasoning, scene understanding, and compositional question answering, and consists of 113,000 images and 22 million different questions. It measures a range of reasoning skills, such as object attribute recognition, transitive relationship tracking, spatial reasoning, and logical reasoning comparison. The dataset comes with a new set of metrics that evaluate not only overall accuracy but also consistency, validity, plausibility, and distribution. Consistency measures the agreement between answers to different questions; validity and plausibility measure whether the answers fall within the question's scope; and distribution measures the overall degree of match between the distribution of true answers and the model's predicted distribution.

[0332] 2. Experimental Environment

[0333] This invention uses the open-source deep learning framework PyTorch and employs GPUs to accelerate the experiments. The experimental server was configured with two Nvidia GeForce RTX 3090 graphics cards, totaling 48GB of VRAM. The Adam optimizer was used for training, with a maximum of 13 epochs and a batch size of 64. The learning rate was set to min(0.000025T, 0.0001), where T is the epoch number starting from 1. The learning rate was decayed by a factor of 0.2 before the 11th and 13th epochs.
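The warm-up and decay schedule described above can be written out as a small function. This sketch assumes that T counts epochs and that each decay multiplies the current learning rate by 0.2; the function and parameter names are hypothetical.

```python
def learning_rate(epoch, base=2.5e-5, cap=1e-4, decay=0.2, decay_epochs=(11, 13)):
    """Learning rate used during the given 1-based epoch."""
    lr = min(base * epoch, cap)      # linear warm-up, capped at 1e-4
    for boundary in decay_epochs:
        if epoch >= boundary:        # one multiplicative decay per boundary passed
            lr *= decay
    return lr

schedule = [learning_rate(t) for t in range(1, 14)]
```

Under these assumptions the rate rises from 2.5e-5 to 1e-4 over the first four epochs, stays at 1e-4 through epoch 10, drops to 2e-5 for epochs 11 and 12, and to 4e-6 for epoch 13.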

[0334] 3. Analysis of Experimental Results

[0335]

[0336] Table 1. Performance on the VQA-v2 test-development set

[0337] Table 1 shows a performance comparison between various advanced models and our invention. It can be seen that our invention significantly outperforms the other models. The overall accuracy on the VQA-v2 test-development set reaches 71.10%. Compared to the MCAN model, the champion of the 2019 VQA-v2 Visual Question Answering Challenge, our model shows a 0.53% improvement. The MCAN model also employs the idea of stacked attention, feeding features into the stacked attention in a two-stream manner. Our invention, based on the MCAN model, adds multimodal skip connection attention and feeds features into the stacked attention in a three-stream manner. Regarding feature fusion, the MCAN model uses simple feature addition, while our invention proposes a gated fusion network to fuse features from both visual and textual modalities. Experimental results demonstrate that the stacked attention and gated fusion in our invention are significant for improving prediction accuracy.

[0338]

[0339]

[0340] Table 2. Performance on the GQA dataset

[0341] As shown in Table 2, to verify the robustness of the model, performance evaluation tests were also conducted on the GQA dataset. It can be seen that the present invention also performs well on the GQA dataset, especially in terms of overall accuracy and consistency metrics, which are significantly better than other models.

[0342] In summary, this invention achieves a significant improvement in accuracy on visual question answering tasks by stacking attention and gating fusion, and the model is more robust, showing strong performance on different datasets.

Claims

1. A visual question answering method based on stacked attention and gated fusion, characterized in that it includes the following steps: Step (1): split the dataset; Step (2): construct the text features of the question; Step (3): construct the visual features of the image; Step (4): perform early fusion of the text features and visual features; Step (5): construct a stacked attention network; Step (6): construct the attention ablation network; Step (7): construct a gated fusion network for late fusion; Step (8): predict the answer. The stacked attention network constructed in step (5) includes a text deep stacked attention layer, a visual deep stacked attention layer, and a visual-text hybrid deep stacked attention layer; (5-1) the text deep stacked attention layer is composed of stacked self-attention modules; (5-2) the visual deep stacked attention layer is composed of stacked guided self-attention modules; (5-3) the visual-text hybrid deep stacked attention layer is composed of stacked multimodal skip connection attention modules. The multimodal skip connection attention module is represented by the following formulas: $Z_{input} = [Y_{input}, X_{input}]$; $Y' = \mathrm{LN}(Y_{input} + \mathrm{MHSA}(Y_{input}))$; $X' = \mathrm{LN}(X_{input} + \mathrm{MHSA}(X_{input}))$; $X'' = \mathrm{LN}(X' + \mathrm{MHGA}(X', Y'))$; $X''' = \mathrm{LN}(X'' + \mathrm{FFN}(X''))$; $Z' = [Y_{input}, X''']$; $Z_{output} = \mathrm{LN}(Z' + \mathrm{MHSA}(Z'))$. The multimodal skip connection attention module consists of multi-head self-attention units, a multi-head guided self-attention unit, and two fully connected layers. Where: MHSA is a multi-head self-attention unit, MHGA is a multi-head guided self-attention unit, FFN is a 2-layer fully connected layer, and LN is a layer normalization operation; $X_{input}$ and $Y_{input}$ are the input vectors of the formula; $Z_{input}$ is formed by concatenating $X_{input}$ and $Y_{input}$; $X'$, $X''$, $X'''$, $Y'$, and $Z'$ are temporary vectors in the formula; and $Z_{output}$ is the output vector of the formula. The formula for MHSA is as follows: $Q = ZW_Q$, $K = ZW_K$, $V = ZW_V$, where $W_Q$, $W_K$, and $W_V$ are linear transformation weight matrices, $Z$ is the input vector of the formula, and $Q$, $K$, and $V$ are all obtained by linear transformation of $Z$. The formula for MHGA is as follows: $Q = X'W_Q$, $K = Y'W_K$, $V = Y'W_V$, where $W_Q$, $W_K$, and $W_V$ are linear transformation weight matrices, $X'$ and $Y'$ are the input vectors of the formula, $d$ is the dimension of $Q$, $Q$ is obtained by linear transformation of $X'$, and $K$ and $V$ are obtained by linear transformation of $Y'$. The formula for stacked multimodal skip connection attention modules is as follows: $U^{(n)} = \mathrm{MSCA}(U^{(n-1)})$, where $U^{(0)} = Z_{input}$; that is, the input of the first MSCA module is the text-visual fusion input feature $Z_{input}$, and the output of the last MSCA module is the text-visual fusion output feature $Z_{output}$.

2. The visual question answering method based on stacked attention and gated fusion according to claim 1, characterized in that, The specific steps for partitioning the dataset in step (1) are as follows: The dataset is the VQA-v2 dataset, which comes from the MS-COCO dataset. The dataset is divided into three subsets: training set, validation set, and test set, with the data volume of the three subsets accounting for 40%, 20%, and 40%, respectively.

3. The visual question answering method based on stacked attention and gated fusion according to claim 2, characterized in that the text features of the question in step (2) are constructed as follows: for the input text, the length k of each question is limited to 14 words. Word embedding is performed using GloVe word vectors, converting each word into a word vector, so that each sentence yields word vector features of dimension 300. The text input features $Y_{input}$ of the question, of dimension 512, are then obtained through a single-layer long short-term memory network [formula not reproduced].

4. The visual question answering method based on stacked attention and gated fusion according to claim 3, characterized in that the visual features of the image in step (3) are constructed as follows: for the input image, the open-source object detection model Faster R-CNN is used to extract objects from the image, and a series of candidate boxes is selected to obtain the image target region features, where m is the number of candidate boxes and 2048 is the dimension of the visual features of the input image. A linear layer then processes the image target region features so that the dimension of the visual input features is consistent with that of the text input features, yielding the visual input features $X_{input}$ [formula not reproduced].

5. The visual question answering method based on stacked attention and gated fusion according to claim 4, characterized in that the early fusion of text features and visual features in step (4) is as follows: the text-visual fusion input features $Z_{input}$ are formed by concatenating the text input features $Y_{input}$ and the visual input features $X_{input}$, as shown below: $Z_{input} = [Y_{input}, X_{input}]$.

6. The visual question answering method based on stacked attention and gated fusion according to claim 5, characterized in that: in step (5-1), a self-attention module is defined [formula not reproduced], where MHSA is a multi-head self-attention unit, FFN is a 2-layer fully connected layer, LN is a layer normalization operation, and the module has an input vector, a temporary vector, and an output vector. The formula for FFN is as follows: $\mathrm{FFN}(Z') = W'_{FFN}(\mathrm{Dropout}(\mathrm{GELU}(W''_{FFN} Z')))$, where $W'_{FFN}$ and $W''_{FFN}$ are linear transformation weight matrices, Dropout is a random deactivation layer, GELU is an activation function, and $Z'$ is the input vector of the formula. The self-attention modules are stacked [formula not reproduced] such that the input of the first SA module is the text input feature $Y_{input}$ and the output of the last SA module is the text output feature $Y_{output}$. In step (5-2), a guided self-attention module is defined [formula not reproduced], with an input vector, temporary vectors, and an output vector. The guided self-attention modules are stacked [formula not reproduced] such that the input of the first SGA module is the visual input features $X_{input}$ together with the text output features $Y_{output}$ of the text deep stacked attention layer, and the output of the last SGA module is the visual output feature $X_{output}$.

7. The visual question answering method based on stacked attention and gated fusion according to claim 6, characterized in that the attention ablation network in step (6) is constructed as follows: (6-1) constructing a text feature attention ablation network: the text output features $Y_{output}$ of the previous layer are passed through the two linear layers, Linear1 and Linear2, of the text feature attention ablation network to calculate the attention weights $W_Y$ of all text features, as follows: $W_Y = \mathrm{softmax}(\mathrm{Linear1}(\mathrm{Linear2}(Y_{output}))) = \mathrm{softmax}(\mathrm{GELU}(Y_{output} W'_Y) W''_Y)$, where $W'_Y$ and $W''_Y$ are the linear transformation weight matrices of the two linear layers, and GELU is an activation function; the text attention weights $W_Y$ and the text output features $Y_{output}$ are combined by weighted summation to obtain the comprehensive text features $Y_{final}$ [formula not reproduced]; (6-2) constructing a visual feature attention ablation network: the visual output features $X_{output}$ of the previous layer are passed through the two linear layers, Linear1 and Linear2, of the visual feature attention ablation network to calculate the attention weights $W_X$ of all visual features, as follows: $W_X = \mathrm{softmax}(\mathrm{Linear1}(\mathrm{Linear2}(X_{output}))) = \mathrm{softmax}(\mathrm{GELU}(X_{output} W'_X) W''_X)$, where $W'_X$ and $W''_X$ are the linear transformation weight matrices of the two linear layers, and GELU is an activation function; the visual attention weights $W_X$ and the visual output features $X_{output}$ are combined by weighted summation to obtain the comprehensive visual features $X_{final}$ [formula not reproduced]; (6-3) constructing a text-visual fusion feature attention ablation network: the text-visual fusion feature vector $Z_{output}$ is passed through the two linear layers, Linear1 and Linear2, of the text-visual fusion feature attention ablation network to calculate the attention weights $W_Z$ of all text-visual fusion features, as follows: $W_Z = \mathrm{softmax}(\mathrm{Linear1}(\mathrm{Linear2}(Z_{output}))) = \mathrm{softmax}(\mathrm{GELU}(Z_{output} W'_Z) W''_Z)$, where $W'_Z$ and $W''_Z$ are the linear transformation weight matrices of the two linear layers, and GELU is an activation function; the text-visual fusion feature attention weights $W_Z$ and the text-visual fusion output features $Z_{output}$ are combined by weighted summation to obtain the comprehensive text-visual fusion features $Z_{final}$ [formula not reproduced].

8. The visual question answering method based on stacked attention and gated fusion according to claim 7, characterized in that the gated fusion network for late fusion in step (7) is constructed as follows: in the gated fusion network, the comprehensive text features $Y_{final}$ and the comprehensive visual features $X_{final}$ from the upstream network are concatenated to obtain $[Y_{final}, X_{final}]$, which is then input into a linear layer to generate the gating vector, denoted $V_t$, as follows: $V_t = \mathrm{Linear}([Y_{final}, X_{final}]) = \sigma(W_{yx}[Y_{final}, X_{final}] + b_{yx})$, where $W_{yx}$ is the linear transformation weight matrix of the linear layer, $b_{yx}$ is a bias coefficient, and $\sigma(\cdot)$ is the sigmoid nonlinear activation function; the gating vector $V_t$ is multiplied by the comprehensive visual features $X_{final}$ to obtain a non-text offset vector $H_1$, as follows: $H_1 = V_t X_{final}$; a scaling restriction factor $\gamma_1$ is introduced [formula not reproduced]; the comprehensive text features $Y_{final}$ and the non-text offset vector $H_1$ are combined by weighted summation to obtain an output vector $F_t$, as follows: $F_t = Y_{final} + \gamma_1 H_1$; $F_t$ and the comprehensive text-visual fusion features $Z_{final}$ are concatenated to obtain $[F_t, Z_{final}]$, which is then input into a linear layer to generate the gating vector, denoted $V_m$, as follows: $V_m = \mathrm{Linear}([F_t, Z_{final}]) = \sigma(W_{fz}[F_t, Z_{final}] + b_{fz})$, where $W_{fz}$ is the linear transformation weight matrix of the linear layer, $b_{fz}$ is a bias coefficient, and $\sigma(\cdot)$ is the sigmoid nonlinear activation function; the gating vector $V_m$ is multiplied by $F_t$ to obtain the offset vector $H$, as follows: $H = V_m F_t$; similarly, a scaling restriction factor $\gamma_2$ is introduced so that the magnitude of $H$ is within a suitable range [formula not reproduced]; $F_t$ and $H$ are combined by weighted summation to obtain the final output features $F_{final}$, as follows: $F_{final} = F_t + \gamma_2 H$.

9. The visual question answering method based on stacked attention and gated fusion according to claim 8, characterized in that the answer in step (8) is predicted as follows: layer normalization and a linear layer are applied to the final output features $F_{final}$ to obtain the final output $F'_{final}$, as follows: $F'_{final} = \mathrm{LN}(\mathrm{Linear}(F_{final})) = \mathrm{LN}(F_{final} W_{final})$, where $W_{final}$ is the linear transformation weight matrix of the linear layer and LN is layer normalization; the output $F'_{final}$ is compared with the score distribution of the ground-truth answers, and the word corresponding to the index of the largest value in $F'_{final}$ is output as the predicted answer. The loss function used is binary cross-entropy [formula not reproduced].