A single-level gated multimodal fusion method with two-level cross-attention
A single-level gated multimodal fusion method with two-level cross-attention addresses the poor modality fusion and reasoning hallucination of VQA models, improving the accuracy of answer prediction and reducing computational resource requirements so that the model is suitable for edge devices.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- JIANGNAN UNIV
- Filing Date
- 2025-05-26
- Publication Date
- 2026-06-26
AI Technical Summary
Existing VQA models fuse modalities poorly and are prone to reasoning hallucinations, which lowers the accuracy of answer prediction.
A single-level gated multimodal fusion method with two-level cross-attention is adopted. Through an embedding layer, a text encoder, an image encoder, a multimodal feature fusion module, and a decoder, text and visual information are fused using two cross-attention steps and a single gating mechanism.
It improves intermodal interaction, reduces model hallucinations, increases VQA answer accuracy, and reduces computational resource requirements, making the model suitable for edge devices.
Figure CN120654812B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence technology, and in particular to a single-level gated multimodal fusion method with two-level cross-attention.
Background Technology
[0002] Visual Question Answering (VQA) is a key task that combines traditional natural language processing (NLP) tasks with computer vision. Through the interaction between natural language questions and image content, it aims to obtain and understand the necessary information to solve the problem, thereby generating a reasonable and accurate natural language answer. VQA strives to reach or even surpass human capabilities in solving image understanding problems, enabling autonomous decision-making based on visual information or providing the knowledge needed to solve problems. VQA involves not only the fusion of multimodal data but also requires models to possess complex visual understanding, language reasoning, and common-sense reasoning abilities, making it an important research topic in the field of artificial intelligence with broad application value across multiple domains. For example, in assisting visually impaired individuals, VQA technology can provide visual information about the surrounding environment to help users understand their surroundings and respond promptly to user questions, helping them make better judgments and decisions. In the field of autonomous driving, VQA technology can provide vehicles with the ability to perceive complex traffic scenarios, allowing users to more easily obtain information about the vehicle and its surroundings. In medical image analysis, VQA technology can assist doctors in quickly obtaining key information from medical images, providing details that might be overlooked, and offering more reliable assurance for doctors' disease diagnosis. VQA greatly enhances the information interaction capabilities of computer systems.
[0003] While VQA technology has made significant progress in recent years, it still faces numerous challenges. First, the open-ended nature of free-form problems necessitates models possessing extensive world knowledge and reasoning capabilities to handle diverse issues. Second, there are still shortcomings in the interaction and fusion of visual and linguistic modalities, which play a near-decisive role in human cognition, primarily manifested in the difficulty of establishing correct semantic understanding relationships between information from different modalities. Furthermore, incomplete dataset types and sizes can lead to poor performance and insufficient generalization ability of learning models on specific types of problems. Therefore, improving the robustness, reasoning ability, and fairness of VQA models remains a key focus of current research.
[0004] The emergence of large language models such as GPT-3 and LLaMA has enabled significant breakthroughs for VQA in handling open-ended questions and few-shot learning, and these models perform strongly across many natural language processing tasks. Some works apply instruction fine-tuning to open-source language models, further improving performance on downstream tasks. In reasoning and question answering tasks, these models take the question as input and produce the answer in a single inference step. Because the intermediate reasoning process is not made explicit, they typically perform poorly on tasks requiring complex reasoning. To address this, researchers have proposed the chain-of-thought (CoT) technique, which mimics the human thinking process: the large model is prompted step by step from the question to generate intermediate reasoning and then arrive at the final answer. This makes fuller use of the knowledge stored in large models and enables them to solve more complex reasoning tasks. In particular, the DeepSeek-R1 model, using the chain-of-thought technique with only a small model, achieved reasoning performance comparable to the OpenAI o1 model, demonstrating the crucial role of chain-of-thought in enhancing a model's reasoning capabilities.
[0005] The chain-of-thought technique gives models stronger reasoning abilities without additional training, and the quality of the intermediate reasoning it generates affects the accuracy of the final answer. Chain-of-thought models typically take only the linguistic modality as input. However, when humans handle complex reasoning and question-answering tasks, they not only extract information from the linguistic modality of the question but also combine it with visual image data for comprehensive analysis and reasoning, leading to more accurate answers.
[0006] To enable language models to achieve this, one approach involves summarizing image data into a short text description using a summarizing model. This description is then concatenated with the language modality data and encoded as input to a larger language model, allowing the model to interact with both modalities simultaneously. However, this approach essentially still enables interaction between the language model and the text modality. Its interaction with the visual modality is primarily achieved by applying the results of the image summarizing model. The model's visual understanding is influenced by the summarizing model's interpretation of the image, both in terms of perspective and accuracy. This significantly reduces the model's interactivity with the visual modality, thus impacting its prediction results.
[0007] Another approach lets the language model interact directly with visual features extracted from images, enhancing the model's interactivity with the visual modality. Linguistic and visual modality features are projected onto the same encoding space by a projective connection layer, and the result is used as input to the language model's decoder to obtain the predicted output. As a trainable layer, the projective connection layer enables the model to interact with both visual and linguistic features simultaneously, enhancing its multimodal understanding. The structural design of the projective connection layer determines how the model interacts with the different modalities and directly affects the final output. Existing technologies use a linear projection method: features from different modalities are connected through a linear projection layer and mapped to a space whose token length is acceptable as input to the language model. However, because this connection is so simple, it often prevents the model from effectively interacting with and understanding multimodal features.
[0008] In existing technologies, Enigma-CoT proposes multi-hop gated cross-attention for modality fusion based on the T5 structure and demonstrates good performance in complex reasoning tasks using the chain-of-thought technique. The model surpasses the evaluation results of BLIP-2 on the ScienceQA dataset with fewer than 250M parameters. However, because every cross-attention step is followed by a gating mechanism, and cross-attention itself already controls information flow, important information is excessively suppressed after fusion, reducing the modality fusion effect. Multimodal-CoT proposes a model that uses a single Transformer structure to connect the language and visual modalities, while applying the chain-of-thought technique to the reasoning process of the T5 model. It demonstrates excellent performance in question-answering reasoning and achieved state-of-the-art results on the ScienceQA scientific question-answering reasoning dataset at the time. However, its single Transformer structure excessively suppresses intermodal interaction information, and the hallucination problem still exists when generating reasoning content.
Summary of the Invention
[0009] Therefore, the technical problem to be solved by the present invention is the poor modality fusion of the prior art, which easily leads to reasoning hallucination and reduces the accuracy of answer prediction.
[0010] To address the aforementioned technical problems, this invention provides a single-level gated multimodal fusion method with two-level cross-attention, comprising:
[0011] Construct a multimodal fusion model, including an embedding layer, a text encoder, an image encoder, a multimodal feature fusion module, and a decoder;
[0012] The process by which a multimodal fusion model generates predicted text from input text and input image includes:
[0013] The input text is processed through an embedding layer and a text encoder to obtain the text vector $H_l$; the input image is processed by an image encoder to obtain the image vector $H_v$;
[0014] The text vector $H_l$ and the image vector $H_v$ are input into the modal feature fusion module. The text vector $H_l$ undergoes one single-head self-attention encoding to obtain the vector $A_l$; the vector $A_l$ and the image vector $H_v$ undergo one single-head cross-attention encoding to obtain the vector $A_{v,0}$; the vector $A_{v,0}$ and the image vector $H_v$ undergo another single-head cross-attention encoding to obtain the cross vector $A_{v,l}$; the text vector $H_l$ and the cross vector $A_{v,l}$ are concatenated, and the gating vector is predicted through a linear transformation and the Sigmoid activation function; the text vector $H_l$ and the cross vector $A_{v,l}$ are then fused in proportion according to the gating vector to obtain the fused vector;
[0015] The fused vector is then processed by a decoder to obtain the predicted text.
[0016] Preferably, the inference process using a multimodal fusion model includes:
[0017] The question text and image are input into the multimodal fusion model of the reasoning stage to obtain the intermediate reasoning text; the intermediate reasoning text and the question text are concatenated and then input, together with the image, into the multimodal fusion model of the inference stage to obtain the answer text.
[0018] Preferably, the process of training the multimodal fusion model includes:
[0019] First, using the reasoning data from the ScienceQA dataset as the supervision signal and taking question text and images as input, the text encoder, multimodal feature fusion module, and decoder of the multimodal fusion model of the reasoning stage are trained to output intermediate reasoning text. Then, using the answers from the ScienceQA dataset as the supervision signal and taking intermediate reasoning text, question text, and images as input, the text encoder, multimodal feature fusion module, and decoder of the multimodal fusion model of the inference stage are trained to output answer text.
[0020] Preferably, the text vector $H_l$ undergoes one single-head self-attention encoding to obtain the vector $A_l$, according to the formula:
[0021] $A_l = \mathrm{SHA1}(H_l, H_l, H_l)$
[0022] Where SHA1 represents the single-head self-attention encoding, $H_l$ represents the text vector, and $A_l$ represents the output vector of the single-head self-attention encoding;
[0023] The vector $A_l$ and the image vector $H_v$ undergo one single-head cross-attention encoding to obtain the vector $A_{v,0}$, according to the formula:
[0024] $A_{v,0} = \mathrm{SHA2}(A_l, H_v, H_v)$
[0025] Where SHA2 represents the first single-head cross-attention encoding, $H_v$ represents the image vector, and $A_{v,0}$ represents the output vector of the first single-head cross-attention encoding;
[0026] The vector $A_{v,0}$ and the image vector $H_v$ undergo another single-head cross-attention encoding to obtain the cross vector $A_{v,l}$, according to the formula:
[0027] $A_{v,l} = \mathrm{SHA3}(A_{v,0}, H_v, H_v)$
[0028] Where SHA3 represents the second single-head cross-attention encoding, and $A_{v,l}$ represents the output vector of the second single-head cross-attention encoding;
[0029] The text vector $H_l$ and the cross vector $A_{v,l}$ are concatenated, and the gating vector is predicted through a linear transformation and the Sigmoid activation function, according to the formula:
[0030] $\lambda = \mathrm{Sigmoid}(\mathrm{Linear}([H_l; A_{v,l}])),\ \lambda \in [0,1]$
[0031] Where $\lambda$ represents the gating vector, $[\cdot\,;\cdot]$ represents concatenation, $\mathrm{Linear}(\cdot)$ represents the linear transformation, and $\mathrm{Sigmoid}(\cdot)$ represents the Sigmoid activation function;
[0032] The text vector $H_l$ and the cross vector $A_{v,l}$ are fused in proportion according to the gating vector to obtain the fused vector, according to the formula:
[0033] $H_{out} = \lambda \times A_{v,l} + (1 - \lambda) \times H_l$
[0034] Where $H_{out}$ represents the fused vector.
[0035] Preferably, the image encoder is a ViT encoder.
[0036] Preferably, the image encoder is followed by a linear projection layer to map the image vectors to the same dimensional space as the text vectors.
[0037] Preferably, before the input text passes through the embedding layer, a tokenizer splits the input text into multiple sub-words and maps each sub-word to an ID.
[0038] Preferably, the tokenizer splits the input text into multiple sub-words based on the Unigram language model.
[0039] Preferably, the text encoder is a T5 encoder, which includes multiple T5 blocks, each of which includes a self-attention layer and a feedforward layer connected in sequence.
[0040] Preferably, the decoder is a T5 decoder, which includes multiple T5 blocks, each of which includes a self-attention layer, a cross-attention layer and a feedforward layer connected in sequence.
[0041] Compared with the prior art, the above-described technical solution of the present invention has the following advantages:
[0042] This invention presents a single-level gated multimodal fusion method with two-level cross-attention, which uses only two cross-attention steps for the interaction between textual and visual modal information and then fuses the language and visual representations through a single gating mechanism. This simplifies existing modal fusion mechanisms while preserving important information, thereby improving the effectiveness of intermodal interaction, reducing model hallucinations, and increasing the accuracy of VQA answers. Furthermore, this invention reduces the model's parameter count and the computational resources required for prediction, achieving a balance between low parameter count and high performance. This enables the multimodal fusion model to be deployed on edge devices, demonstrating good versatility and practicality.
Attached Figure Description
[0043] To make the content of this invention easier to understand, the invention will be further described in detail below with reference to specific embodiments and accompanying drawings, wherein:
[0044] Figure 1 is a structural diagram of the multimodal fusion model of the present invention;
[0045] Figure 2 is a structural comparison of the multimodal fusion method of the present invention and an existing multimodal fusion method, wherein Figure 2(a) shows the structure of the multimodal fusion method of Enigma-CoT and Figure 2(b) shows the structure of the multimodal fusion method of the present invention;
[0046] Figure 3 is a structural diagram of the modal feature fusion module;
[0047] Figure 4 is a classification statistics chart of the ScienceQA dataset;
[0048] Figure 5 is a heatmap of the confusion matrix of the prediction results of the model of this invention, plotted over the answer options of the ScienceQA test set;
[0049] Figure 6 is a line graph comparing the prediction accuracy of two chain-of-thought models with similar parameter counts on the ScienceQA dataset.
Detailed Implementation
[0050] The present invention will be further described below with reference to the accompanying drawings and specific embodiments, so that those skilled in the art can better understand and implement the present invention. However, the embodiments described are not intended to limit the present invention.
[0051] In the application of artificial intelligence and deep learning, multimodal learning is gradually becoming a key technology for improving model understanding and generation capabilities. The effective fusion of multimodal information plays a crucial role in enabling models to understand multimodal information and generate appropriate reasoning content. In text and image synthesis tasks, a well-designed framework structure can fully utilize the complementary information of the two modalities, thereby improving the model's reasoning performance.
[0052] To mitigate reasoning hallucination in reasoning question-answering tasks and enable more effective interaction between modalities, this invention proposes a Multimodal Fusion Model Combined Two-Level Cross Attention Modal with Single-Level Gated (M-TCM), which includes an embedding layer, a text encoder, an image encoder, a multimodal feature fusion module (Two-Level Cross Attention Modal with Single-Level Gated, TCM), and a decoder.
[0053] Embodiment 1 of this invention proposes a single-level gated multimodal fusion method with two-level cross-attention. This method utilizes a multimodal fusion model to generate predicted text from input text and input image. The process includes:
[0054] S1: The input text is processed through an embedding layer and a text encoder to obtain the text vector $H_l$.
[0055] The data involved in this invention mainly includes text modality problems and corresponding image modality information, which need to be effectively processed to adapt to the input space of the model and enable the model to effectively understand the input information.
[0056] Before the input text passes through the embedding layer, a tokenizer splits it into multiple sub-words and maps each sub-word to an ID. This embodiment uses a tokenizer with sub-word level encoding based on the Unigram language model. Unlike traditional segmentation based on spaces, sub-word segmentation operates directly on the original text and handles unseen words better: complex words are automatically split into multiple sub-words, which reduces the model's vocabulary size while preserving semantic information. The vocabulary contains 32,100 entries, each mapped to a unique ID, and special tokens such as <pad> (ID: 0, padding token), <unk> (ID: 2, unknown word), and <extra_id_0> (ID: 32099, mask token) carry special meanings. When the input text is shorter than the set maximum length, it is padded with <pad> tokens to meet the model's input requirements.
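The tokenization step described above can be illustrated with a minimal sketch of Unigram-style segmentation. Everything here is a toy assumption made for illustration: the piece table, its log-probabilities, and all IDs other than <pad> = 0 and <unk> = 2 are invented, and a real tokenizer operates on whole sentences with a learned vocabulary of 32,100 entries rather than on a single word.

```python
import math

PAD_ID, UNK_ID = 0, 2          # IDs taken from the text above
UNK_LOGP = -20.0               # assumed penalty for an unknown character

# Hypothetical toy pieces: piece -> (ID, log-probability). Values are made up.
PIECES = {
    "un": (10, -3.0),
    "seen": (11, -3.5),
    "see": (12, -3.2),
    "n": (13, -5.0),
    "s": (14, -5.0),
    "e": (15, -5.0),
    "u": (16, -5.0),
}


def unigram_segment(word):
    """Viterbi search for the most probable split of `word` into pieces."""
    n = len(word)
    best = [-math.inf] * (n + 1)   # best[i]: best log-prob of word[:i]
    back = [None] * (n + 1)        # back[i]: (start, piece_id) of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in PIECES:
                piece_id, logp = PIECES[piece]
            elif end - start == 1:
                piece_id, logp = UNK_ID, UNK_LOGP   # unknown single character
            else:
                continue
            score = best[start] + logp
            if score > best[end]:
                best[end] = score
                back[end] = (start, piece_id)
    ids = []
    i = n
    while i > 0:                   # walk the back-pointers from the end
        start, piece_id = back[i]
        ids.append(piece_id)
        i = start
    return ids[::-1]


def encode(word, max_len):
    """Segment, truncate to max_len, and right-pad with <pad>."""
    ids = unigram_segment(word)[:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


print(encode("unseen", 5))   # "un" + "seen", then padding
print(encode("unx", 4))      # "x" is not in the toy table and maps to <unk>
```

The dynamic program keeps, for every prefix, the highest-scoring segmentation, which is how a Unigram model can split an unseen word into known sub-words instead of discarding it.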
[0057] The embedding layer converts the IDs output by the tokenizer into one-dimensional vector representations.
[0058] The text encoder is a T5 encoder that extracts text features from the vector representations output by the embedding layer, yielding the text vector $H_l$. The T5 encoder consists of multiple T5 blocks, each comprising a self-attention layer and a feedforward layer connected in sequence, which learn the contextual information of the sequence and produce the hidden-layer text self-attention vector.
[0059] The decoder also uses a T5 decoder, which includes multiple T5 blocks. Each T5 block includes a self-attention layer, a cross-attention layer, and a feedforward layer connected in sequence.
[0060] The vectors decoded by the decoder will be mapped to the vocabulary through a linear layer, and after the ID-to-vocabulary mapping, they will be converted into the corresponding text output.
[0061] The T5 architecture unifies all natural language processing tasks (such as machine translation and article summarization) into text-to-text conversion problems, employing a standard Transformer encoder-decoder architecture. The encoder is responsible for understanding the input text and extracting high-level semantic features; the decoder is responsible for generating the target text, progressively outputting the final result through an autoregressive generation approach. This architecture demonstrates excellent performance across diverse tasks, exhibiting strong generalization capabilities. It also allows for transfer learning across multiple tasks, avoiding the need to design different model architectures for different tasks. This simplifies the development process for natural language processing tasks, requiring only pre-training and fine-tuning on the architecture for specific tasks without requiring further architectural adjustments.
[0062] S2: The input image is processed by an image encoder to obtain the image vector $H_v$.
[0063] For feature extraction from the image modality, the most commonly used encoders are DETR, CLIP, and ViT. Since the ViT encoder is considered the best-performing encoder in Multimodal-CoT and is widely used across tasks, this embodiment also uses a ViT encoder as the image encoder, processing all images into high-dimensional features.
[0064] In order to enable modal fusion with text features, a linear projection layer is connected after the image encoder to map the image vector to the same dimensional space as the text vector, so as to meet the multimodal fusion requirements of the multimodal feature fusion module.
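As a rough illustration of the projection step, the sketch below maps image features to the text feature dimension with a single linear layer. The feature sizes (145 patches, 1024 image dimensions, 768 text dimensions) and the random weights are assumptions chosen only to make the shapes concrete; the source does not state them.

```python
import numpy as np

rng = np.random.default_rng(0)


def project_image_features(h_v_raw, weight, bias):
    """Linear projection: (num_patches, d_image) -> (num_patches, d_text)."""
    return h_v_raw @ weight + bias


# Assumed sizes, for illustration only.
num_patches, d_image, d_text = 145, 1024, 768

h_v_raw = rng.standard_normal((num_patches, d_image))   # stand-in for encoder output
weight = rng.standard_normal((d_image, d_text)) * 0.02  # trainable in a real model
bias = np.zeros(d_text)

h_v = project_image_features(h_v_raw, weight, bias)
print(h_v.shape)
```

After this step the image vector has the same feature dimension as the text vector, which is what the attention and gating operations of the fusion module require.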
[0065] T5 splits processing between an encoder that reads the input text and a decoder that generates the output text, so visual features added to the multimodal fusion model of this invention must be fully and effectively fused with the text information. A suitable position is after encoding: a modal feature fusion mechanism is introduced there to fuse the encoded image and text features. This intrudes less on the semantic information already learned by the trained T5 model, lets the modalities learn from each other to obtain an effective semantic representation, and makes the result convenient for the decoder to decode.
[0066] S3: The text vector $H_l$ and the image vector $H_v$ are input into the modal feature fusion module to obtain the fused vector.
[0067] In question-answering tasks, adding visual information gives the model more context while it interprets the question, so it generates higher-quality reasoning content, with a significantly lower proportion of hallucination errors in the generated reasoning and a more accurate final predicted answer.
[0068] Figure 2(a) shows the structure of the Enigma-CoT multimodal fusion method. Enigma-CoT mimics the human process of repeatedly thinking about a problem and proposes a multi-hop cross-attention mechanism to fuse linguistic and visual modal information, thereby improving the quality of the reasoning the model generates. This mechanism performs cross-attention encoding on the input text vector and image vector, and then uses a gating mechanism to fuse the encoded vector with the text vector. The fused vector is then repeatedly passed through this cross-fusion process with the image vector so that the image content is fully understood and fused. In the gating mechanism, a gating vector is predicted to control the proportions of the cross-attention output vector and the input vector in the fused vector. However, because the cross-attention mechanism already controls information flow, following each cross-attention step with a gating mechanism suppresses the expression of important information. Therefore, this invention proposes a simplified module combining two-level cross-attention with single-level gating as the modal feature fusion module to fuse information from different modalities. Figure 2(b) shows the structure of the multimodal fusion method of the present invention. The modal feature fusion module proposed in this invention removes the excessive suppression of important information, simplifies the network structure, and retains only the gating mechanism of the last stage to ensure faster and more stable convergence of the network.
[0069] Figure 3 is a structural diagram of the modal feature fusion module. Specifically, the text vector $H_l$ undergoes one single-head self-attention encoding to obtain the vector $A_l$, according to the formula:
[0070] $A_l = \mathrm{SHA1}(H_l, H_l, H_l)$
[0071] Where SHA1 represents the single-head self-attention encoding, $H_l$ represents the text vector, and $A_l$ represents the output vector of the single-head self-attention encoding.
[0072] The vector $A_l$ and the image vector $H_v$ undergo one single-head cross-attention encoding to obtain the vector $A_{v,0}$, according to the formula:
[0073] $A_{v,0} = \mathrm{SHA2}(A_l, H_v, H_v)$
[0074] Where SHA2 represents the first single-head cross-attention encoding, $H_v$ represents the image vector, and $A_{v,0}$ represents the output vector of the first single-head cross-attention encoding;
[0075] The vector $A_{v,0}$ and the image vector $H_v$ undergo another single-head cross-attention encoding to obtain the cross vector $A_{v,l}$, according to the formula:
[0076] $A_{v,l} = \mathrm{SHA3}(A_{v,0}, H_v, H_v)$
[0077] Where SHA3 represents the second single-head cross-attention encoding, and $A_{v,l}$ represents the output vector of the second single-head cross-attention encoding;
[0078] The text vector $H_l$ and the cross vector $A_{v,l}$ are concatenated, and the gating weights are predicted using a linear transformation and the Sigmoid activation function, according to the formula:
[0079] $\lambda = \mathrm{Sigmoid}(\mathrm{Linear}([H_l; A_{v,l}])),\ \lambda \in [0,1]$
[0080] Where $\lambda$ represents the gating weights, $[\cdot\,;\cdot]$ represents concatenation, $\mathrm{Linear}(\cdot)$ represents the linear transformation, and $\mathrm{Sigmoid}(\cdot)$ represents the Sigmoid activation function;
[0081] The text vector $H_l$ and the cross vector $A_{v,l}$ are fused in proportion according to the gating weights to obtain the fused vector, according to the formula:
[0082] $H_{out} = \lambda \times A_{v,l} + (1 - \lambda) \times H_l$
[0083] Where $H_{out}$ represents the fused vector.
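The fusion module defined by the formulas above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the patented implementation: the scaled dot-product form of the attention, the absence of an output projection, the element-wise gate of width d_model, the random initialization, and the class names `SingleHeadAttention` and `TCMFusion` are all assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class SingleHeadAttention:
    """Single-head scaled dot-product attention with Q/K/V projections (assumed form)."""

    def __init__(self, d_model, rng):
        scale = 1.0 / np.sqrt(d_model)
        self.w_q = rng.standard_normal((d_model, d_model)) * scale
        self.w_k = rng.standard_normal((d_model, d_model)) * scale
        self.w_v = rng.standard_normal((d_model, d_model)) * scale

    def __call__(self, query, key, value):
        q, k, v = query @ self.w_q, key @ self.w_k, value @ self.w_v
        scores = q @ k.T / np.sqrt(q.shape[-1])   # (len_query, len_key)
        return softmax(scores, axis=-1) @ v       # (len_query, d_model)


class TCMFusion:
    """Two cross-attention steps followed by one gate."""

    def __init__(self, d_model, seed=0):
        rng = np.random.default_rng(seed)
        self.sha1 = SingleHeadAttention(d_model, rng)   # self-attention on text
        self.sha2 = SingleHeadAttention(d_model, rng)   # first cross-attention
        self.sha3 = SingleHeadAttention(d_model, rng)   # second cross-attention
        self.w_gate = rng.standard_normal((2 * d_model, d_model)) / np.sqrt(2 * d_model)
        self.b_gate = np.zeros(d_model)

    def __call__(self, h_l, h_v):
        a_l = self.sha1(h_l, h_l, h_l)        # A_l     = SHA1(H_l, H_l, H_l)
        a_v0 = self.sha2(a_l, h_v, h_v)       # A_{v,0} = SHA2(A_l, H_v, H_v)
        a_vl = self.sha3(a_v0, h_v, h_v)      # A_{v,l} = SHA3(A_{v,0}, H_v, H_v)
        gate_in = np.concatenate([h_l, a_vl], axis=-1)
        lam = sigmoid(gate_in @ self.w_gate + self.b_gate)
        h_out = lam * a_vl + (1.0 - lam) * h_l
        return h_out, lam


rng = np.random.default_rng(1)
h_l = rng.standard_normal((6, 16))   # 6 text tokens, d_model = 16
h_v = rng.standard_normal((9, 16))   # 9 image patches, already projected to d_model
h_out, lam = TCMFusion(d_model=16)(h_l, h_v)
print(h_out.shape, float(lam.min()), float(lam.max()))
```

Because the text side always supplies the queries, the fused output keeps the text sequence length, so it can be passed to the decoder in place of the text encoder output. The gate is applied once, after both cross-attention steps, which is the structural difference from a design that gates after every hop.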
[0084] S4: The fused vector is processed by the decoder to obtain the predicted text.
[0085] In reasoning question-answering tasks using chain-of-thought (CoT), the M-TCM model can be trained with different supervision signals to generate reasoning content that meets the requirements. To improve the model's reasoning performance with the CoT technique, the reasoning data from the ScienceQA dataset are first used as the supervision signal, with question text and images as input, to train the text encoder, multimodal feature fusion module, and decoder of the multimodal fusion model of the reasoning stage. This allows the model to generate the intermediate steps for solving a problem from the question and image information and to output intermediate reasoning text. Then, the answers from the ScienceQA dataset are used as the supervision signal, with intermediate reasoning text, question text, and images as input, to train the text encoder, multimodal feature fusion module, and decoder of the multimodal fusion model of the inference stage, which outputs the answer text. By repeating this process, the multimodal fusion model can be applied to multi-step reasoning, generating more comprehensive answers.
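The two supervision signals can be made concrete with a small sketch of how one annotated example might be turned into a training pair for each stage. The record layout, the field names, the sample content, and the choice to place the question before the rationale are hypothetical; the source states only which inputs and targets each stage uses, and does not say whether annotated or generated rationales feed the second stage.

```python
from dataclasses import dataclass


@dataclass
class ScienceQAExample:
    question: str    # question text, including the answer options
    image: str       # placeholder handle for the image (e.g. a file name)
    rationale: str   # annotated reasoning: supervision signal for stage 1
    answer: str      # annotated answer: supervision signal for stage 2


def reasoning_stage_pair(ex):
    """Stage 1: (question, image) -> intermediate reasoning text."""
    return {"text": ex.question, "image": ex.image, "target": ex.rationale}


def inference_stage_pair(ex, rationale):
    """Stage 2: (question + reasoning text, image) -> answer text."""
    # Assumption: question first, rationale second, separated by a newline.
    return {"text": ex.question + "\n" + rationale, "image": ex.image, "target": ex.answer}


# Made-up example, for illustration only.
ex = ScienceQAExample(
    question="Which material is harder? Options: (A) rubber (B) steel",
    image="q1.png",
    rationale="Steel keeps its shape when pressed, so it is harder.",
    answer="The answer is (B).",
)
stage1 = reasoning_stage_pair(ex)
stage2 = inference_stage_pair(ex, ex.rationale)
print(stage1["target"])
print(stage2["text"])
```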
[0086] The reasoning process using a multimodal fusion model includes: inputting the question text and image into the multimodal fusion model in the reasoning stage to obtain intermediate reasoning text; concatenating the intermediate reasoning text and the question text, and inputting them together with the image into the multimodal fusion model in the inference stage to obtain the answer text.
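The two-step procedure above amounts to calling two models in sequence. The sketch below shows only the control flow: the two callables are trivial stand-ins rather than trained models, and the concatenation order and separator are assumptions.

```python
from typing import Callable, Tuple

# A "model" here is any callable (text, image) -> generated text.
Model = Callable[[str, object], str]


def answer_question(reasoning_model: Model, inference_model: Model,
                    question: str, image: object) -> Tuple[str, str]:
    # Step 1: question + image -> intermediate reasoning text.
    rationale = reasoning_model(question, image)
    # Step 2: question + reasoning text (concatenated) + image -> answer text.
    answer = inference_model(question + "\n" + rationale, image)
    return rationale, answer


# Stand-in models with canned behaviour, for illustration only.
def fake_reasoning_model(text, image):
    return "Opposite poles face each other, so they attract."


def fake_inference_model(text, image):
    return "The answer is (A)." if "so they attract" in text else "The answer is (B)."


question = "Will these magnets attract or repel? Options: (A) attract (B) repel"
rationale, answer = answer_question(fake_reasoning_model, fake_inference_model,
                                    question, image=None)
print(rationale)
print(answer)
```

The image is passed to both steps, so the answer stage still sees the visual input directly instead of relying only on the generated reasoning text.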
[0087] Specifically, consider the i-th question $P_i$ and its $n$ answer options $\{a_{i,1}, a_{i,2}, \dots, a_{i,n}\}$; the model is required to select the correct option $a_{i,label}$. In the reasoning generation process, the goal is to generate the reasoning text $Y_i$ from $P_i$ and $\{a_{i,1}, a_{i,2}, \dots, a_{i,n}\}$ through the multimodal fusion model. Let the language-modality and visual-modality data of question $P_i$ be $X_{i,l}$ and $X_{i,v}$; then the generation probability of the target reasoning text $Y_i$ is expressed as:
[0088]
[0089] Where $\theta_r$ denotes the learnable parameters of the multimodal fusion model of the reasoning stage, and the length appearing in the formula is that of the generated target reasoning text $Y_i$.
[0090] In the answer generation process, the goal is to infer the final answer $A_i$ from $P_i$, $\{a_{i,1}, a_{i,2}, \dots, a_{i,n}\}$, and the intermediate reasoning $Y_i$ generated in the previous stage. The language modality $X_{i,l}$ of the question and the intermediate reasoning $Y_i$ are concatenated to obtain $X_{i,ly}$, which is then input into the multimodal fusion model together with the visual-modality information $X_{i,v}$ to obtain the predicted answer $A_i$. The generation probability of $A_i$ is expressed as:
[0091]
[0092] Where $\theta_a$ denotes the learnable parameters of the answer inference module, and the length appearing in the formula is that of the generated target answer text $A_i$.
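Under the usual autoregressive reading of these definitions, the two generation probabilities can be sketched as follows. This is an assumed form, not a quotation: the token index $t$ and the length symbols $N$ (reasoning text) and $M$ (answer text) are introduced here for illustration only.

```latex
p(Y_i \mid X_{i,l}, X_{i,v})
  = \prod_{t=1}^{N} p_{\theta_r}\!\left(Y_{i,t} \mid X_{i,l}, X_{i,v}, Y_{i,<t}\right)

p(A_i \mid X_{i,ly}, X_{i,v})
  = \prod_{t=1}^{M} p_{\theta_a}\!\left(A_{i,t} \mid X_{i,ly}, X_{i,v}, A_{i,<t}\right)
```

Each factor is the probability of the next token given the inputs and the tokens generated so far, which matches the autoregressive decoding of the T5 decoder described earlier.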
[0093] The multimodal fusion model constructed in this invention fuses the preprocessed multimodal data before decoding and predicting the output. First, during data preprocessing, the text modal information is embedded into word vectors through the embedding layer and then self-attention encoded using the T5 encoder, while a ViT encoder extracts features from the image modal information. Next, in the multimodal fusion part, the extracted image features differ in dimension from the text vectors, which would prevent the multimodal feature fusion module from fusing them, so a linear projection layer maps the image vectors to the same dimensional space as the text vectors. The features of the two modalities are then cross-fused by the multimodal feature fusion module to ensure sufficient interaction and understanding between the modalities, yielding feature vectors that conform to the decoder's input dimension. Finally, in the decoding and prediction part, each T5 block of the T5 decoder has a structure similar to that of an encoder T5 block, except that a cross-attention layer is added after each self-attention layer to attend to the encoder's output. After decoding through multiple T5 blocks, the final model output is obtained. It is worth noting that the dotted lines in Figure 1 indicate that the model can use the generated reasoning content as input to generate the next step of reasoning content or to derive the answer to the question, thus realizing the chain-of-thought reasoning process.
[0094] In summary, the single-level gated multimodal fusion method with two-level cross-attention described in this invention uses only two cross-attention steps for the interaction between textual and visual modal information, and then fuses the language and visual representations through a single gating mechanism. This simplifies existing modal fusion mechanisms while retaining important information, thereby improving intermodal interaction, reducing model hallucinations, and increasing the accuracy of VQA answers. Furthermore, this invention reduces the model's parameter count and the computational resources required for prediction, achieving a balance between low parameter count and high performance. This allows the multimodal fusion model to be deployed on edge devices, demonstrating good versatility and practicality.
[0095] To verify the effectiveness of the method under limited computing resources, this embodiment only used a Flan-Alpaca base model version with fewer than 300M parameters for training, instead of the larger 700M version. All experiments in this embodiment were conducted on a single NVIDIA RTX 4090 24G GPU.
[0096] I. Dataset Selection
[0097] The ScienceQA dataset is a large-scale science question-and-answer dataset designed specifically for multimodal machine learning tasks, covering multiple-choice questions. The questions in this dataset are primarily sourced from primary and secondary school science courses on the IXL resource learning platform. Potential formatting errors, sensitive information, or academic inaccuracies were manually removed, and core concepts were extracted and presented in a structured manner to ensure a complete logical chain of "premise-intermediate steps-conclusion." The annotated content underwent cross-review by educational experts to check scientific accuracy, logical rigor, and linguistic standardization. Human subjects were recruited to answer the questions to verify the ease of understanding of the annotated explanations. When a majority of respondents were confused by an explanation, it was revised to ensure the scientific validity and educational value of the questions. The ScienceQA dataset contains 21,208 question examples, of which 10,332 examples (48.7%) include image information, 10,220 examples (48.2%) provide textual contextual support, and 6,532 examples (30.8%) have both image and textual contextual information. This dataset is designed with multimodal learning in mind, enabling the model to perform comprehensive understanding and inference at multiple levels, including vision, language, and knowledge reasoning.
[0098] Furthermore, the ScienceQA dataset covers three major subject areas: natural science, social science, and language science, subdivided into 26 topics and 127 categories, involving 379 subject-related skills. This broad knowledge coverage makes the dataset suitable not only for general visual question answering (VQA) tasks, but also provides an important research foundation for intelligent tutoring systems in education, cross-modal reasoning tasks, and knowledge-driven question answering systems. To illustrate the question distribution more intuitively, the coverage of questions in the dataset is counted by grade and by subject, as shown in Figure 4, where Figure 4(a) is a pie chart of the dataset categorized by grade and Figure 4(b) is a bar chart of the dataset categorized by subject. It can be seen that the questions are concentrated mainly in the middle grades, 2 to 8. Within the subject categories, the total numbers of examples with and without images are almost equal, and natural science questions dominate.
[0099] In the experiment, this embodiment strictly followed the data partitioning strategy provided by the official documentation, dividing the entire dataset into 12,726 training examples, 4,241 validation examples, and 4,241 test examples, with a partitioning ratio of approximately 3:1:1. It is worth noting that a key feature of the ScienceQA dataset is that most questions come with detailed solutions and supporting knowledge, enabling the model not only to learn how to select the correct answer but also to enhance its reasoning and interpretability during training. This characteristic helps improve the model's interpretability and provides valuable reference for further research.
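The counts quoted above for the dataset and its official split can be checked with a few lines of arithmetic. This is only a consistency check on the numbers stated in the text; the variable and function names are illustrative.

```python
# Counts quoted in the text for the ScienceQA dataset.
TOTAL = 21208
with_image, with_text_context, with_both = 10332, 10220, 6532
splits = {"train": 12726, "validation": 4241, "test": 4241}


def percent(part, whole):
    """Share of `whole` as a percentage, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


shares = {
    "image": percent(with_image, TOTAL),
    "text_context": percent(with_text_context, TOTAL),
    "both": percent(with_both, TOTAL),
}
split_total = sum(splits.values())
# Size of each split relative to the test split, i.e. the "3:1:1" ratio.
split_ratio = {name: round(count / splits["test"], 2)
               for name, count in splits.items()}

print(shares)
print(split_total)
print(split_ratio)
```

The three splits add up to the full dataset, and the training split is three times the size of each of the other two to within rounding.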
[0100] The ScienceQA dataset, with its rich multimodal information, broad subject coverage, and detailed annotations, has become an important benchmark dataset for multimodal artificial intelligence research, science education support systems, and knowledge reasoning tasks.
[0101] II. Selection of Evaluation Indicators
[0102] The ScienceQA dataset is a benchmark dataset specifically designed to evaluate a model's ability to answer questions across multiple disciplines, and it consists of multiple-choice questions. Since the answer to each question is definite and unique, this embodiment uses accuracy, BLEU score, Rouge-L score, and F1-Score as evaluation metrics. To better analyze the model's performance across different domains, the ScienceQA dataset further subdivides the evaluation into the following eight categories:
[0103] (1) Natural Science (NAT): mainly covers issues in the fields of physics, chemistry, biology and other natural sciences.
[0104] (2) Social Science (SOC): This includes issues in social science fields such as psychology, sociology, and economics.
[0105] (3) Language Science (LAN): This involves language-related issues such as linguistics and semantics.
[0106] (4) Includes contextual hints (TXT): These types of questions provide additional text information as hints in the question stem.
[0107] (5) Includes image (IMG): The question includes image information, and the answer needs to be based on the image content.
[0108] (6) No contextual hints or images (NO): The question contains only the basic question stem and no additional contextual hints or images.
[0109] (7) Low complexity problems (G1-6): Relatively simple problems suitable for students in grades 1-6 of primary school.
[0110] (8) High complexity problems (G7-12): More complex problems suitable for students in grades 7-12 of middle school.
[0111] For each category, this embodiment calculates the model's prediction accuracy within that category, thus comprehensively evaluating the model's performance across different domains and difficulty levels. Finally, this embodiment uses the Average metric as the overall prediction accuracy evaluation indicator, which is calculated as follows:
[0112] $$\mathrm{Accuracy}_k = \frac{\mathrm{Correct\ Number}_k}{\mathrm{Total\ Number}_k}$$
[0113] Where $k$ denotes one of the eight categories listed above, $\mathrm{Correct\ Number}_k$ is the number of questions in category $k$ that the model predicts correctly, $\mathrm{Total\ Number}_k$ is the total number of questions in category $k$, and $\mathrm{Accuracy}_k$ is the prediction accuracy for category $k$. A higher accuracy for a category indicates better inference performance of the model in that category and more high-quality inference content that points to the correct answer.
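As an illustration of the per-category accuracy, the sketch below counts correct and total predictions for each category label. It assumes that a question can carry several labels at once (for example a subject, a context type, and a grade band) and that the Average is taken over all questions; both are assumptions about the evaluation protocol, and the records are made-up toy data.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """records: iterable of (categories, predicted, gold) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for categories, predicted, gold in records:
        for k in categories:
            total[k] += 1
            correct[k] += int(predicted == gold)
    # Accuracy_k = Correct Number_k / Total Number_k, reported in percent.
    return {k: 100.0 * correct[k] / total[k] for k in total}


def average_accuracy(records):
    """Overall accuracy over all questions (assumed meaning of 'Average')."""
    records = list(records)
    hits = sum(predicted == gold for _, predicted, gold in records)
    return 100.0 * hits / len(records)


# Toy predictions, not results from the patent.
toy_records = [
    (("NAT", "IMG", "G1-6"), "A", "A"),
    (("NAT", "TXT", "G7-12"), "B", "C"),
    (("SOC", "IMG", "G1-6"), "B", "B"),
    (("LAN", "NO", "G7-12"), "A", "A"),
]
per_category = accuracy_by_category(toy_records)
overall = average_accuracy(toy_records)
print(per_category)
print(overall)
```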
[0114] This approach allows for a more comprehensive measurement of the model's overall performance on the ScienceQA dataset and provides direction for further model optimization.
[0115] BLEU is a metric commonly used to evaluate the quality of machine translation. It measures the degree of n-gram matching between machine-generated text and reference text, as shown in the following formula:
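[0116] $$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$
[0117] $$\mathrm{BP} = \begin{cases} 1, & c > r \\ e^{1 - r/c}, & c \le r \end{cases}$$
[0118] Where $p_n$ is the n-gram precision, $w_n$ is the weight of the n-gram term, $N$ is the total number of n-gram terms, BLEU is the final score, BP is the length penalty term, $c$ is the length of the generated text, and $r$ is the length of the reference text.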
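The BLEU definition above can be sketched for a single candidate and a single reference as follows. This is a simplified illustration that assumes uniform weights $w_n = 1/N$, clipped n-gram counts, and no smoothing; the scores reported in this document were not necessarily computed with exactly these choices, and `ngrams` and `bleu` are names introduced here.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with uniform weights and no smoothing."""
    c, r = len(candidate), len(reference)
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clipped matches: each candidate n-gram counts at most as often
        # as it occurs in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, a zero precision gives BLEU = 0
        log_sum += (1.0 / max_n) * math.log(overlap / total)
    bp = 1.0 if c > r else math.exp(1.0 - r / c)  # length penalty term BP
    return bp * math.exp(log_sum)


reference = "the cat sat down".split()
candidate = "the cat sat".split()
print(bleu(reference, reference))           # identical texts
print(bleu(candidate, reference, max_n=1))  # BLEU1 with a length penalty
```

With `max_n=1` the function corresponds to BLEU1 and with `max_n=4` to BLEU4. In the second call every unigram matches, so the score is determined by the length penalty alone.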
[0119] The Rouge metric is used to evaluate the degree of overlap between generated text and one or more reference texts in natural language generation tasks. It has several variations, including 1-gram, 2-gram, and longest common subsequence (LCS) measurements. The LCS measurement, or Rouge-L, measures the structural consistency of the entire sentence and is unaffected by sentence length, making it suitable for complex natural language generation tasks. Its calculation formula is shown below:
[0120]
[0121] Where c is the length of the generated text, r is the length of the reference text, and LCS is the length of the longest common subsequence.
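A common way to compute Rouge-L from $c$, $r$, and the LCS length is to form an LCS-based precision and recall and combine them into an F-measure. The sketch below follows that common definition; the weighting parameter `beta` is an assumption, since it is not defined in this document, and `beta = 1` gives the plain harmonic mean.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """LCS-based F-measure; beta weights recall against precision."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)  # LCS / c
    recall = lcs / len(reference)     # LCS / r
    return ((1 + beta ** 2) * precision * recall
            / (recall + beta ** 2 * precision))


generated = "the cat sat on the mat".split()
target = "the cat lay on the mat".split()
print(lcs_length(generated, target))
print(rouge_l(generated, target))
```

Because the subsequence only has to preserve word order and not adjacency, a single substituted word lowers the score without breaking the match on either side of it.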
[0122] F1-Score is a performance evaluation metric commonly used for binary or multi-class classification problems, especially suitable for cases of class imbalance. It is the harmonic mean of precision and recall, and its calculation formula is shown below.
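[0123] $$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$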
[0124] Where Precision represents precision and Recall represents recall.
[0126] Furthermore, to initialize the weights of the M-TCM model, this embodiment used the Flan-Alpaca-Base model (223M) and trained it further. The initial learning rate was set to 8e-5. The training batch sizes were 6 in the reasoning generation phase and 16 in the answer generation phase, and the evaluation batch sizes were 32 and 64, respectively. The maximum output sequence length was set to 512 in the reasoning generation phase and 64 in the answer generation phase. The entire training process lasted for 20 epochs.
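The hyperparameters listed above can be collected into one configuration object per phase. The sketch below is only a way of organizing the stated values: `StageConfig` and its field names are hypothetical, and it assumes that both phases share the same learning rate and number of epochs, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical container for the settings of one training phase."""
    train_batch_size: int
    eval_batch_size: int
    max_output_length: int
    learning_rate: float = 8e-5  # assumed to be shared by both phases
    epochs: int = 20             # assumed to be shared by both phases


REASONING_STAGE = StageConfig(train_batch_size=6, eval_batch_size=32,
                              max_output_length=512)
ANSWER_STAGE = StageConfig(train_batch_size=16, eval_batch_size=64,
                           max_output_length=64)

print(REASONING_STAGE)
print(ANSWER_STAGE)
```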
[0127] III. Analysis of Experimental Results
[0128] The quality of the intermediate reasoning content generated by the model and the accuracy of the final predicted answers are shown in Tables 1 and 2, using the evaluation metrics BLEU1, BLEU4, ROUGE-L, and accuracy. The BLEU1 and ROUGE-L scores are 92.32% and 94.04%, respectively, indicating that the overall sentence structure of the generated text is relatively consistent with that of the reference text and that the two share a large number of identical words. The BLEU4 score is 79.51%, indicating relatively accurate and coherent phrase-level matching. This demonstrates that the model performs well in generating intermediate reasoning content, and the average accuracy of the final prediction also reaches 86.39%.
[0129] Table 1. Scores of BLEU1 (%), BLEU4 (%), and ROUGE-L (%) for intermediate reasoning.
[0130]
[0131] Table 2 shows the prediction results (accuracy %) on the ScienceQA test set.
[0132]
[0133] The ScienceQA dataset consists of single-label multiple-choice questions with up to five options: A, B, C, D, and E. Each question has one definitive answer. A confusion matrix heatmap was plotted from the option statistics of the ScienceQA test set to evaluate the model's predictive performance, as shown in Figure 5.
[0134] The heatmap in Figure 5 shows that the colors are concentrated mainly on the diagonal, indicating that the model's predictions are relatively accurate. However, the samples are unevenly distributed and concentrated in the upper left corner, which means that the numbers of options per question are not uniform: most questions have only 2 or 3 options, which can structurally distort the F1-Score statistics. Therefore, this embodiment calculates the F1-Score separately for each number of options to better measure the model's predictive performance. The grouped results and the final weighted average are shown in Table 3. After weighted averaging, the macro, micro, and weighted scores are almost identical, indicating that the results are relatively balanced without significant bias. The accuracy is highest on two-option questions, reaching 89.1%. The results on four-option questions are slightly lower than those on two-option questions but still show good stability. The sample size for five-option questions is small and cannot reflect the model's prediction results well. The final weighted average reflects the overall predictive performance of the model well.
[0135] Table 3 shows the F1-Score statistics grouped by the number of options.
[0136]
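The grouped F1 statistics can be illustrated as follows: questions are first grouped by their number of options, and macro, micro, and weighted F1 scores are then computed within each group. This is a minimal sketch on made-up toy data. The function names are introduced here, and the exact averaging used to produce Table 3 is not specified in this document.

```python
from collections import Counter


def f1_scores(gold, pred):
    """Macro, micro, and support-weighted F1 for single-label predictions."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1

    def f1(t, false_pos, false_neg):
        # 2TP / (2TP + FP + FN) equals the harmonic mean of precision and recall.
        denom = 2 * t + false_pos + false_neg
        return 2 * t / denom if denom else 0.0

    per_class = {label: f1(tp[label], fp[label], fn[label]) for label in labels}
    support = Counter(gold)
    return {
        "macro": sum(per_class.values()) / len(labels),
        "micro": f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        "weighted": sum(per_class[l] * support[l] for l in labels) / len(gold),
    }


def grouped_f1(examples):
    """examples: list of (number_of_options, gold_label, predicted_label)."""
    groups = {}
    for n_options, g, p in examples:
        golds, preds = groups.setdefault(n_options, ([], []))
        golds.append(g)
        preds.append(p)
    return {n: f1_scores(golds, preds) for n, (golds, preds) in groups.items()}


# Toy data, not results from the patent.
toy = [(2, "A", "A"), (2, "B", "B"), (2, "A", "B"), (2, "B", "B"),
       (3, "C", "C"), (3, "A", "A"), (3, "B", "C")]
scores = grouped_f1(toy)
print(scores[2])
print(scores[3])
```

For single-label questions the micro F1 of a group equals its accuracy, which is one reason the macro, micro, and weighted scores can be compared directly.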
[0137] IV. Experimental Comparison and Analysis
[0138] In the experiment, this embodiment compared the performance of the M-TCM model proposed in this invention with a series of benchmark models. First, the modal alignment methods of several major models were compared, classified according to how the text and image modalities are encoded and how the modalities are fused, as shown in Table 4. The model of this invention has the fewest trainable parameters among almost all of the modal fusion methods compared and still achieves good performance in complex inference tasks.
[0139] Table 4 Comparison of Multimodal Alignment Methods
[0140]
[0141]
[0142] In the table, `Embedding` indicates that the input is embedded directly into the model as word vectors, `Linear` indicates a matrix projection applied after embedding, `&` indicates that the two operations are concatenated, and `Cross` and `Self` represent the cross-attention and self-attention mechanisms, respectively. NFNet is a variant of ResNet. Note that since Patch-TRM does not specify its exact number of parameters, this embodiment uses the Bert-Small model provided in the official code to calculate its trainable parameter count; for the other models, the smallest model used in their respective papers is taken.
[0143] Subsequently, this embodiment compares some major modality fusion benchmark models and mainstream large language models such as GPT3.5 and GPT-4 with the M-TCM model proposed in this invention on the ScienceQA dataset, and the results are summarized in Table 5.
[0144] Table 5 compares the prediction results with different multimodal fusion models (accuracy %).
[0145]
[0146] The experimental results in Table 5 show that the M-TCM multimodal fusion model of this invention outperforms other mainstream fusion models, handling multimodal tasks well while keeping a low parameter count. A detailed comparative analysis follows.
[0147] Because the OpenFlamingo model was not instruction-tuned, it evidently did not fully understand the question before answering, and its accuracy is only 39.27%. BLIP-2, built on the Flan-T5-XXL model, was fine-tuned with the encoder and decoder frozen so that only the projection layer was trained. Because it was exposed to different instructions during training, it adapts better to question-answering scenarios and even slightly surpasses GPT-3.5, achieving an accuracy of 74.17%.
[0148] Pre-trained visual fusion models ViLT, Patch-TRM, and VisualBERT are primarily used in classification tasks. Their main characteristic is that they can only produce simple classification outputs when predicting answers, selecting the highest-scoring option as the answer by assigning a score to each choice. ViLT uses a relatively lightweight modal fusion method, while the Patch-TRM model uses a more complex image pyramid structure to fuse text and images at different levels. The VisualBERT model adopts a BERT-based bidirectional Transformer structure, inheriting some of BERT's language capabilities, and achieves a slightly higher accuracy than the former two. These three visual fusion models have a relatively small number of parameters (approximately 100M), limiting their expressive power, and their prediction accuracy is only around 61%. In contrast, the M-TCM model, through a single-level gating mechanism with two-level cross-attention, achieves a prediction accuracy of 86.39%, significantly surpassing these visual question-answering models.
[0149] LaVIN-13B is a model fine-tuned from LLaMA-13B. By adding an adapter during training, it adapts quickly to text and image instructions and outperforms the pre-trained visual models, with a prediction accuracy of 77.54%. Qwen-VL and LLaVA-1.5 are both trained on large-scale image-text pairs for visual reasoning tasks. Their 7B-parameter versions achieve 67.1% and 71.6% accuracy, respectively, on the image-only prediction tasks, a clear advantage over the other pre-trained visual models. LLaVA-Mini employs modal pre-fusion and visual token compression to improve prediction performance at lower computational cost. It achieves 83.1% accuracy in the IMG category, second only to the Enigma-CoT model, indicating a significant advantage in multimodal fusion. DIEM segments the question and the image separately, uses CLIP to match the text and image modal information, and finally integrates the visual information with the answers to the sub-questions to generate the final result. This method uses the existing CLIP model to handle the correlation of multimodal information, and its final accuracy is only 68.88%, significantly lower than that of the other large language models. This embodiment also compares the M-TCM model with open-source and commercial large models such as GPT-3.5 and GPT-4. As reference models, GPT-3.5, GPT-3.5 w/ CoT, and GPT-4 w/ CoT achieve accuracies of 73.97%, 75.17%, and 83.99%, respectively. With the chain-of-thought technique, the prediction accuracy of GPT-3.5 improves by 1.2 percentage points, and GPT-4 w/ CoT, owing to the stronger knowledge of its base model, is a further 8.82 percentage points higher than GPT-3.5 w/ CoT. The prediction results of these models are all lower than those of the model in this invention, and their parameter counts are very large, which hinders lightweight deployment.
[0150] In comparison, the M-TCM model, while maintaining a low parameter count, outperformed models such as GPT-3.5 and GPT-4 w/ CoT, achieving a prediction accuracy of 86.39%.
[0151] The Hot-Base model employs a two-stage multimodal inference framework based on hypergraph-of-thought, constructing a textual and a visual hypergraph-of-thought separately and letting them interact through cross-modal co-attention graph learning. This yields a significant performance improvement over traditional modal fusion models, with an accuracy of 81.42%. The Multimodal-CoT and Enigma-CoT models use the chain-of-thought technique. The Multimodal-CoT results in Table 5 were obtained by retraining the model for 20 epochs on an RTX 4090 GPU with the parameters provided in its paper. For Enigma-CoT, since the training source code is not publicly available, the experimental results from the original paper are used. As can be seen from Table 5, the M-TCM model outperforms the Multimodal-CoT model in every category. In overall average accuracy, the M-TCM model improves on Multimodal-CoT by 1.55 percentage points. Furthermore, in the categories with textual context and with images, Multimodal-CoT achieves accuracies of only 83.04% and 80.66%, respectively, while M-TCM reaches 85.34% and 82.35%, improvements of 2.3 and 1.69 percentage points, respectively. These results demonstrate that the modal fusion method and inference selection strategy proposed in this invention have significant advantages in visual question answering tasks.
[0152] To compare more intuitively the fusion effects of the Multimodal-CoT and Enigma-CoT models, which have similar parameter counts and also use the chain-of-thought technique, with the M-TCM multimodal fusion model proposed in this invention, this embodiment plots their results from Table 5 as the line graph in Figure 6. As the line graph shows, the M-TCM model outperforms the Multimodal-CoT model across all metrics, with the most significant improvements in the LAN (language science) and TXT (text context) categories. While the Enigma-CoT model holds an advantage in several metrics, it achieves very low results in the SOC (social science) category, possibly because it does not handle the characteristics of the questions in that category effectively. Its overall average performance remains lower than that of the M-TCM model, indicating that the TCM fusion mechanism has been effectively optimized and improves the fusion effect.
[0153] In summary, by comparing the performance of the M-TCM model with other benchmark models, it can be seen that the M-TCM outperforms other models in multimodal data fusion and inference tasks, achieving significant performance improvements across multiple classifications, particularly excelling in handling multimodal problems. Its ability to surpass large language models like GPT-4 with a moderate model size (256M) demonstrates the significant potential of the M-TCM model in multimodal inference tasks and provides valuable reference and guidance for future research.
[0154] This embodiment further conducted ablation experiments to explore the effectiveness of the module proposed in this invention under different settings. Specifically, for the modality fusion mechanism, the ablation experiments were set as follows: ① using a single Transformer fusion mechanism, ② using a multi-hop modality fusion mechanism, and ③ using a TCM modality fusion mechanism. In the first experiment, this embodiment set the modality fusion mechanism to a single Transformer, without considering the improved method proposed in this invention. In the second experiment, to highlight the improvement effect of this invention on the multi-hop cross-attention fusion method proposed in Enigma-CoT, this embodiment set the modality fusion mechanism to a multi-hop cross-attention mechanism for comparison. In the third experiment, the multimodal feature fusion module proposed in this invention was used as the modality fusion mechanism. By comparing these three different modality fusion mechanisms, the advantages of TCM in multimodal tasks can be revealed. The experimental results are shown in Table 6.
[0155] Table 6. Experimental results of different modality fusion mechanisms (accuracy %)
[0156]
[0157] Table 6 presents the experimental results of the different modality fusion mechanisms. Comparing the Single Transformer mechanism, the multi-hop cross-attention mechanism, and the TCM fusion mechanism shows clearly that the multi-hop cross-attention mechanism slightly improves the final result over the Single Transformer, and that the TCM mechanism, which improves upon the multi-hop cross-attention mechanism, raises the model's accuracy further. With the Single Transformer module, the model relies solely on a traditional modality fusion method for inference. Its prediction accuracy is 83.04% in the text context (TXT) category and 80.66% in the image (IMG) category, with an overall average prediction accuracy of 84.84%. This demonstrates the limitations of traditional modality fusion methods when dealing with complex multimodal data. With the multi-hop cross-attention mechanism, the prediction accuracy in the TXT category is 84.21%, an improvement of 1.17 percentage points over the Single Transformer mechanism. In the IMG category, the prediction accuracy is 80.07%, a decrease of 0.59 percentage points compared with the Single Transformer. The overall average prediction accuracy is 85.05%, an improvement of 0.21 percentage points. This indicates that although the multi-hop cross-attention mechanism improves overall performance, it declines in the IMG category, which involves multimodal fusion, suggesting that the mechanism does not handle the flow of image information effectively.
[0158] However, introducing the TCM module into the model resulted in significant performance improvements. In the Text Modal Classification (TXT) task, the model's prediction accuracy increased to 85.34%, a 2.3 percentage point improvement compared to the Single Transformer module and a 1.13 percentage point improvement compared to the Multi-hop module. More significantly, in the Image Modal Classification (IMG) task, the model's accuracy improved to 82.35%, a 1.69 percentage point improvement compared to the Single Transformer's 80.66% and a 2.28 percentage point improvement compared to the Multi-hop's 80.07%. The overall average prediction accuracy of the TCM modal fusion mechanism reached 86.39%, a 1.55 percentage point improvement compared to the Single Transformer's 84.84% and a 1.34 percentage point improvement compared to the Multi-hop's 85.05%. This result demonstrates that the TCM module improves both IMG classification and overall average prediction accuracy after refining the multi-hop cross-attention mechanism. It solves the problem of information flow suppression in the multi-hop mechanism and fully proves the effectiveness of the TCM module in multimodal data fusion. It can better capture and integrate relevant information between different modalities, thereby improving the model's prediction performance.
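The percentage-point differences quoted for the ablation can be reproduced directly from the accuracies stated in the text. The snippet below uses only the TXT, IMG, and average values given above; the dictionary layout and the `gain` helper are illustrative.

```python
# Accuracies (%) quoted in the text for the three fusion mechanisms.
results = {
    "Single Transformer": {"TXT": 83.04, "IMG": 80.66, "Avg": 84.84},
    "Multi-hop": {"TXT": 84.21, "IMG": 80.07, "Avg": 85.05},
    "TCM": {"TXT": 85.34, "IMG": 82.35, "Avg": 86.39},
}


def gain(model, baseline):
    """Difference in percentage points between two mechanisms."""
    return {metric: round(results[model][metric] - results[baseline][metric], 2)
            for metric in results[model]}


tcm_vs_single = gain("TCM", "Single Transformer")
tcm_vs_multihop = gain("TCM", "Multi-hop")
multihop_vs_single = gain("Multi-hop", "Single Transformer")
print(tcm_vs_single)
print(tcm_vs_multihop)
print(multihop_vs_single)
```

The negative IMG entry in the last comparison is the drop that the TCM mechanism recovers.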
[0159] These ablation experiments verified that the TCM module can effectively improve the model's ability to process multimodal data, thereby improving the overall prediction accuracy of the model.
[0160] This invention proposes a multimodal fusion model and introduces a more efficient modal fusion method to fuse image and text modal information, providing a new approach for models to understand multimodal data. Overall, the M-TCM model achieves a prediction accuracy of 86.39% on the ScienceQA dataset using only 256M parameters, striking an effective balance between parameter count and performance, and demonstrating better adaptability in environments with limited computational power.
[0161] Based on the single-level gated multimodal fusion method with two-level cross-attention described in Embodiment 1, Embodiment 2 of the present invention proposes a single-level gated multimodal fusion device with two-level cross-attention, comprising:
[0162] The model building module is used to build a multimodal fusion model, including an embedding layer, a text encoder, an image encoder, a multimodal feature fusion module, and a decoder;
[0163] The prediction module, which utilizes a multimodal fusion model to generate predicted text from input text and input image, includes the following steps:
[0164] The input text is processed through the embedding layer and the text encoder to obtain the text vector $H_l$; the input image is processed by the image encoder to obtain the image vector $H_v$;
[0165] The text vector $H_l$ and the image vector $H_v$ are input into the multimodal feature fusion module, where the text vector $H_l$ undergoes one single-head self-attention encoding to obtain the vector $A_l$; the vector $A_l$ and the image vector $H_v$ undergo one single-head cross-attention encoding to obtain the vector $A_{v,0}$; the vector $A_{v,0}$ and the image vector $H_v$ undergo another single-head cross-attention encoding to obtain the cross vector $A_{v,l}$; the text vector $H_l$ and the cross vector $A_{v,l}$ are concatenated, and the gating weights are predicted through a linear transformation and the Sigmoid activation function; the text vector $H_l$ and the cross vector $A_{v,l}$ are then fused in proportion according to the gating weights to obtain a fusion vector;
[0166] The fused vector is then processed by a decoder to obtain the predicted text.
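The fusion steps listed above (one self-attention, two cross-attentions, then a sigmoid gate) can be sketched in plain Python on small matrices. This is a minimal sketch under stated assumptions, not the patented implementation: the attention here is scaled dot-product attention without learned query, key, or value projections, the gate is a single linear layer over the concatenated vectors, and the names `single_head_attention`, `tcm_fusion`, `gate_w`, and `gate_b` are introduced for illustration only.

```python
import math
import random


def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix (lists of rows)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(q, k, v):
    """Scaled dot-product attention (assumed form of the SHA operator)."""
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, [list(col) for col in zip(*k)])  # q @ k^T
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tcm_fusion(h_l, h_v, gate_w, gate_b):
    """h_l: n x d text vectors, h_v: m x d image vectors (already projected)."""
    a_l = single_head_attention(h_l, h_l, h_l)    # self-attention on text
    a_v0 = single_head_attention(a_l, h_v, h_v)   # first cross-attention
    a_v = single_head_attention(a_v0, h_v, h_v)   # second cross-attention
    concat = [t + c for t, c in zip(h_l, a_v)]    # n x 2d concatenation
    pre = matmul(concat, gate_w)                  # linear layer, 2d -> d
    lam = [[sigmoid(p + b) for p, b in zip(row, gate_b)] for row in pre]
    # Gated mix: lambda * cross vector + (1 - lambda) * text vector.
    h_out = [[g * c + (1.0 - g) * t for g, c, t in zip(g_row, c_row, t_row)]
             for g_row, c_row, t_row in zip(lam, a_v, h_l)]
    return h_out, lam


random.seed(0)
n, m, d = 3, 5, 4  # toy sizes, not the dimensions used in the patent
h_l = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
h_v = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
gate_w = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(2 * d)]
gate_b = [0.0] * d
h_out, lam = tcm_fusion(h_l, h_v, gate_w, gate_b)
print(len(h_out), len(h_out[0]))
```

The output keeps the shape of the text vectors, so it can be passed to a decoder that expects the text encoder's output dimension. Each gate value lies strictly between 0 and 1 and sets, per element, how much of the cross vector replaces the text vector.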
[0167] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0168] This application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0169] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0170] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0171] Obviously, the above embodiments are merely illustrative examples for clear explanation and are not intended to limit the implementation. Those skilled in the art will recognize that other variations or modifications can be made based on the above description. It is neither necessary nor possible to exhaustively list all possible implementations here. However, obvious variations or modifications derived therefrom are still within the scope of protection of this invention.
Claims
1. A single-level gated multimodal fusion method with two-level cross-attention, characterized in that it comprises: constructing a multimodal fusion model comprising an embedding layer, a text encoder, an image encoder, a multimodal feature fusion module, and a decoder; wherein the process by which the multimodal fusion model generates predicted text from input text and an input image comprises: processing the input text through the embedding layer and the text encoder to obtain a text vector $H_l$; processing the input image through the image encoder to obtain an image vector $H_v$; inputting the text vector $H_l$ and the image vector $H_v$ into the multimodal feature fusion module, where the text vector $H_l$ undergoes one single-head self-attention encoding to obtain a vector $A_l$; performing one single-head cross-attention encoding on the vector $A_l$ and the image vector $H_v$ to obtain a vector $A_{v,0}$; performing another single-head cross-attention encoding on the vector $A_{v,0}$ and the image vector $H_v$ to obtain a cross vector $A_{v,l}$; concatenating the text vector $H_l$ and the cross vector $A_{v,l}$ and predicting a gating vector through a linear transformation and a Sigmoid activation function; fusing the text vector $H_l$ and the cross vector $A_{v,l}$ in proportion according to the gating vector to obtain a fused vector; and processing the fused vector through the decoder to obtain the predicted text.
2. The single-level gated multimodal fusion method with two-level cross-attention as described in claim 1, characterized in that the process of reasoning using the multimodal fusion model comprises: inputting the question text and the image into the multimodal fusion model of the reasoning stage to obtain intermediate reasoning text; and concatenating the intermediate reasoning text with the question text and inputting the result, together with the image, into the multimodal fusion model of the inference stage to obtain the answer text.
3. The single-level gated multimodal fusion method with two-level cross-attention as described in claim 2, characterized in that the process of training the multimodal fusion model comprises: first, using the reasoning data in the ScienceQA dataset as the supervision signal and taking the question text and the image as input, training the text encoder, the multimodal feature fusion module, and the decoder of the multimodal fusion model of the reasoning stage to output the intermediate reasoning text; then, using the answers in the ScienceQA dataset as the supervision signal and taking the intermediate reasoning text, the question text, and the image as input, training the text encoder, the multimodal feature fusion module, and the decoder of the multimodal fusion model of the inference stage to output the answer text.
4. The single-level gated multimodal fusion method with two-level cross-attention as described in claim 1, characterized in that:
the text vector $H_l$ undergoes one single-head self-attention encoding to obtain the vector $A_l$ according to the formula:
$$A_l = \mathrm{SHA1}(H_l, H_l, H_l)$$
where SHA1 denotes the single-head self-attention encoding, $H_l$ denotes the text vector, and $A_l$ denotes the output vector of the single-head self-attention encoding;
the vector $A_l$ and the image vector $H_v$ undergo one single-head cross-attention encoding to obtain the vector $A_{v,0}$ according to the formula:
$$A_{v,0} = \mathrm{SHA2}(A_l, H_v, H_v)$$
where SHA2 denotes the first single-head cross-attention encoding, $H_v$ denotes the image vector, and $A_{v,0}$ denotes the output vector of the first single-head cross-attention encoding;
the vector $A_{v,0}$ and the image vector $H_v$ undergo a further single-head cross-attention encoding to obtain the cross vector $A_{v,l}$ according to the formula:
$$A_{v,l} = \mathrm{SHA3}(A_{v,0}, H_v, H_v)$$
where SHA3 denotes the second single-head cross-attention encoding and $A_{v,l}$ denotes the output vector of the second single-head cross-attention encoding;
the text vector $H_l$ and the cross vector $A_{v,l}$ are concatenated, and the gating vector is predicted by a linear transformation followed by the Sigmoid activation function according to the formula:
$$\lambda = \mathrm{Sigmoid}(\mathrm{Linear}([H_l; A_{v,l}])), \quad \lambda \in [0, 1]$$
where $\lambda$ denotes the gating vector, $[\cdot\,;\cdot]$ denotes concatenation, $\mathrm{Linear}(\cdot)$ denotes the linear transformation, and $\mathrm{Sigmoid}(\cdot)$ denotes the Sigmoid activation function;
the text vector $H_l$ and the cross vector $A_{v,l}$ are fused in proportion according to the gating vector to obtain the fused vector according to the formula:
$$H_{out} = \lambda \times A_{v,l} + (1 - \lambda) \times H_l$$
where $H_{out}$ denotes the fused vector.
5. The single-level gated multimodal fusion method with two-level cross-attention as described in claim 1, characterized in that the image encoder is a ViT encoder.
6. The single-level gated multimodal fusion method with two-level cross-attention as described in claim 5, characterized in that the image encoder is followed by a linear projection layer, which maps the image vectors into the same dimensional space as the text vectors.
7. The single-level gated multimodal fusion method with two-level cross-attention as described in claim 1, characterized in that, before the input text passes through the embedding layer, a tokenizer splits the input text into multiple subwords and maps each subword to an ID.
8. The single-level gated multimodal fusion method with two-level cross-attention as described in claim 7, characterized in that the tokenizer splits the input text into multiple subwords based on the Unigram language model.
9. The single-level gated multimodal fusion method with two-level cross-attention as described in claim 1, characterized in that the text encoder is a T5 encoder comprising multiple T5 blocks, each T5 block comprising a self-attention layer and a feedforward layer connected in sequence.
10. The single-level gated multimodal fusion method with two-level cross-attention as described in claim 1, characterized in that the decoder is a T5 decoder comprising multiple T5 blocks, each T5 block comprising a self-attention layer, a cross-attention layer, and a feedforward layer connected in sequence.
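
The fusion step recited in claims 1 and 4 can be illustrated with a minimal NumPy sketch. It is one possible reading, not the patented implementation. The claims do not specify several details, so the following are assumptions made only for illustration: each single-head attention is scaled dot-product attention with its own query, key, and value projection matrices; the three arguments of SHA1, SHA2, and SHA3 are treated as query, key, and value in that order; the gate has the same width as the text vector and is applied elementwise; and the image vector has already been projected to the text dimension as in claim 6. All function and parameter names (`single_head_attention`, `gated_fusion`, `init_params`, `w_gate`, `b_gate`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def single_head_attention(q, k, v, w_q, w_k, w_v):
    # Assumed form: scaled dot-product attention with a single head.
    Q, K, V = q @ w_q, k @ w_k, v @ w_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

def gated_fusion(h_l, h_v, params):
    # A_l = SHA1(H_l, H_l, H_l): self-attention over the text tokens.
    a_l = single_head_attention(h_l, h_l, h_l, *params["sha1"])
    # A_{v,0} = SHA2(A_l, H_v, H_v): first cross-attention, text attends to image.
    a_v0 = single_head_attention(a_l, h_v, h_v, *params["sha2"])
    # A_{v,l} = SHA3(A_{v,0}, H_v, H_v): second cross-attention.
    a_vl = single_head_attention(a_v0, h_v, h_v, *params["sha3"])
    # lambda = Sigmoid(Linear([H_l; A_{v,l}])), one gate value per feature (assumption).
    concat = np.concatenate([h_l, a_vl], axis=-1)
    lam = sigmoid(concat @ params["w_gate"] + params["b_gate"])
    # H_out = lambda * A_{v,l} + (1 - lambda) * H_l
    h_out = lam * a_vl + (1.0 - lam) * h_l
    return h_out, lam

def init_params(d, rng):
    # Random weights for illustration only; a real model would learn these.
    def proj():
        return tuple(rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
    return {
        "sha1": proj(),
        "sha2": proj(),
        "sha3": proj(),
        "w_gate": rng.normal(scale=(2 * d) ** -0.5, size=(2 * d, d)),
        "b_gate": np.zeros(d),
    }

rng = np.random.default_rng(0)
d = 8
h_l = rng.normal(size=(5, d))  # 5 text tokens (toy data)
h_v = rng.normal(size=(3, d))  # 3 image patches (toy data)
params = init_params(d, rng)
h_out, lam = gated_fusion(h_l, h_v, params)
print(h_out.shape)
```

Because the queries in both cross-attention steps originate from the text side, the fused vector keeps the text sequence length, so under these assumptions it can be passed to the decoder in place of the text encoder output.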