Hallucination mitigation method and device for multi-modal large model, electronic equipment and medium

By processing text and image data and analyzing image token attention vectors in a multimodal large language model, the method mitigates hallucination, improving the reliability and accuracy of the model and making it applicable to fields such as healthcare, autonomous driving, and robotics.

CN119128061B · Active · Publication Date: 2026-06-12 · TSINGHUA UNIVERSITY


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: TSINGHUA UNIVERSITY
Filing Date: 2024-09-19
Publication Date: 2026-06-12

Application Information

Patent Timeline

19 Sep 2024: Application
12 Jun 2026: Publication

CN119128061B

IPC: G06F16/334; G06F16/583

CPC: G06F16/3344; G06F16/5846

AI Tagging

Application Domain

Digital data information retrieval; Special data processing applications


AI Technical Summary

⚠Technical Problem

Existing multimodal large language models suffer from hallucination when generating open-ended responses, leading to erroneous results. Existing mitigation methods increase the complexity and cost of model development and deployment while ignoring the interaction between input and output, which degrades model performance.

⚗Method used

By processing the text and image data, text tokens and image tokens are obtained, and after alignment, an image token attention vector is constructed. The Jensen-Shannon divergence is used to measure the difference between decoding layers to determine the target layer, and an early exit mechanism is used to predict the probability distribution of the next token through a linear projection layer.

🎯Benefits of technology

It effectively mitigates and manages hallucination in multimodal large language models, improves their reliability and accuracy in practical applications, avoids high cost and complexity, and preserves the model's efficiency.

✦ Generated by Eureka AI based on patent content.

[Figure: CN119128061B_ABST]

Patent Text Reader

Abstract

This application relates to a method, apparatus, electronic device, and medium for alleviating hallucinations in multimodal large language models. The method includes: processing input text-image data to obtain text tokens and image tokens, aligning the text tokens and image tokens to obtain an input sequence of text-image data, and constructing an image token attention vector for each position in the input sequence; based on the image token attention vectors, using Jensen-Shannon divergence to measure the difference in image understanding between different decoding layers, and determining the target layer with the largest distance to the image token attention vector of the last layer based on the measurement result; utilizing a pre-defined early exit mechanism, transforming the hidden state of the target layer into a probability distribution through a linear projection layer to predict the probability distribution of the next token. This solves the problems of existing techniques for alleviating hallucinations in multimodal large language models being overly complex, costly, and affecting model performance, thus improving their reliability and accuracy in practical applications.


Description

Technical Field

[0001] This application relates to the field of large language model technology, and in particular to a hallucination mitigation method, apparatus, electronic device, and medium for multimodal large models.

Background Technology

[0002] In recent years, Multimodal Large Language Models (MLLMs) have made significant progress in processing language together with inputs from other modalities to generate open-ended answers. The latest MLLMs perform exceptionally well in various visual tasks, such as object detection, image captioning, and visual question answering. Despite this success, these models commonly suffer from a serious problem: they generate results that are grammatically coherent but factually wrong, a phenomenon known as "hallucination". This is particularly problematic in fields such as healthcare, autonomous driving, and robotics, where erroneous results can have disastrous consequences. Effectively mitigating and managing hallucination in MLLMs, so as to improve their reliability and accuracy in practical applications, has therefore become a pressing technical challenge.

[0003] In related technologies, most methods address hallucination by increasing data or training costs, such as building new manually labeled datasets, introducing external knowledge bases, or performing additional training or fine-tuning; other methods attempt to reduce hallucination during inference by leveraging the model's implicit knowledge, with little or no increase in training or data costs.

[0004] However, the above methods require a large amount of high-quality labeled data and additional training costs, which significantly increases the complexity and cost of model development and deployment. On the other hand, they tend to ignore the interaction between input and output, thus affecting the overall performance of the model, which urgently needs to be addressed.

Summary of the Invention

[0005] This application provides a method, apparatus, electronic device, and medium for mitigating hallucinations in multimodal large language models, in order to solve the problems that existing technologies for mitigating hallucinations in multimodal large language models are too complex, costly, and affect model performance. It effectively reduces and manages hallucinations in multimodal large language models, and improves their reliability and accuracy in practical applications.

[0006] To achieve the above objectives, the first aspect of this application proposes a method for alleviating hallucinations in a multimodal large model, comprising the following steps:

[0007] Obtain the input image and text data;

[0008] The image and text data are processed to obtain text tokens and image tokens, and the text tokens and image tokens are aligned to obtain the input sequence of the image and text data. An image token attention vector is constructed for each position in the input sequence.

[0009] Based on the image token attention vector at each position, the difference in image understanding between different decoding layers is measured by Jensen-Shannon divergence, and the target layer with the largest distance from the image token attention vector of the last layer is determined according to the measurement result;

[0010] By utilizing a pre-defined early exit mechanism, the hidden state of the target layer is transformed into a probability distribution through a linear projection layer in order to predict the probability distribution of the next token.

[0011] According to one embodiment of this application, processing the image-text data to obtain text tokens and image tokens includes:

[0012] Processing the text input of the image-text data with a preset large language model to obtain the text tokens;

[0013] Converting the image input of the image-text data into visual embeddings through a preset encoder, and mapping the visual embeddings to the image tokens through a vision-language alignment connector.

[0014] According to one embodiment of this application, constructing the image token attention vector for each position in the input sequence includes:

[0015] The attention weight distribution for the image token is represented by the maximum weight in multi-head attention;

[0016] The maximum weight in the multi-head attention is normalized using the softmax function to obtain the image token attention vector for each position in the input sequence.

[0017] According to one embodiment of this application, the image token attention vector at each position is:

[0018] $\mathrm{iTaV}_t^n = \mathrm{softmax}\left(\{a_{t,i}^n\}_{i=I_s}^{I_e}\right)$

[0019] where $a_{t,i}^n$ is the maximum weight at position $i$, $I_s$ to $I_e$ is the range of positions of the image tokens in the input sequence, $n$ is the decoder layer, and $t$ is each position.

[0020] According to one embodiment of this application, before transforming the hidden state of the target layer into a probability distribution through a linear projection layer using a preset early exit mechanism, the method further includes:

[0021] Adjust the attention weights of the target layer.

[0022] The hallucination mitigation method for multimodal large language models proposed in this application involves processing the input text-image data to obtain text tokens and image tokens, aligning the text tokens and image tokens to obtain the input sequence of text-image data, and constructing an image token attention vector for each position in the input sequence. Based on the image token attention vector, the Jensen-Shannon divergence is used to measure the difference in image understanding between different decoding layers, and the target layer with the largest distance from the image token attention vector of the last layer is determined according to the measurement result. A preset early exit mechanism is used to transform the hidden state of the target layer into a probability distribution through a linear projection layer to predict the probability distribution of the next token. This solves the problems of existing techniques for mitigating hallucination phenomena in multimodal large language models being overly complex, costly, and affecting model performance, effectively reducing and managing hallucination phenomena in multimodal large language models, and improving their reliability and accuracy in practical applications.

[0023] To achieve the above objectives, a second aspect of this application provides a hallucination mitigation apparatus for a multimodal large model, comprising:

[0024] The acquisition module is used to acquire the input image-text data;

[0025] The construction module is used to process the image-text data to obtain text tokens and image tokens, align the text tokens and image tokens to obtain the input sequence of the image-text data, and construct the image token attention vector at each position in the input sequence;

[0026] The determination module is used to measure the difference in image understanding between different decoding layers based on the image token attention vector at each position using Jensen-Shannon divergence, and determine the target layer with the largest distance from the image token attention vector of the last layer based on the measurement result;

[0027] The prediction module is used to use a preset early exit mechanism to transform the hidden state of the target layer into a probability distribution through a linear projection layer in order to predict the probability distribution of the next token.

[0028] According to one embodiment of this application, the construction module is specifically used for:

[0029] Processing the text input of the image-text data with a preset large language model to obtain the text tokens;

[0030] Converting the image input of the image-text data into visual embeddings through a preset encoder, and mapping the visual embeddings to the image tokens through a vision-language alignment connector.

[0031] According to one embodiment of this application, the construction module is specifically used for:

[0032] The attention weight distribution for the image token is represented by the maximum weight in multi-head attention;

[0033] The maximum weight in the multi-head attention is normalized using the softmax function to obtain the image token attention vector for each position in the input sequence.

[0034] According to one embodiment of this application, the image token attention vector at each position is:

[0035] $\mathrm{iTaV}_t^n = \mathrm{softmax}\left(\{a_{t,i}^n\}_{i=I_s}^{I_e}\right)$

[0036] where $a_{t,i}^n$ is the maximum weight at position $i$, $I_s$ to $I_e$ is the range of positions of the image tokens in the input sequence, $n$ is the decoder layer, and $t$ is each position.

[0037] According to one embodiment of this application, before transforming the hidden state of the target layer into a probability distribution through a linear projection layer using a preset early exit mechanism, the prediction module is further configured to:

[0038] Adjust the attention weights of the target layer.

[0039] The hallucination mitigation device for multimodal large language models proposed in this application processes the input text and image data to obtain text tokens and image tokens, aligns the text tokens and image tokens to obtain the input sequence of text and image data, and constructs an image token attention vector for each position in the input sequence. Based on the image token attention vector, the Jensen-Shannon divergence is used to measure the difference in image understanding between different decoding layers, and the target layer with the largest distance from the image token attention vector of the last layer is determined according to the measurement result. Using a preset early exit mechanism, the hidden state of the target layer is transformed into a probability distribution through a linear projection layer to predict the probability distribution of the next token. This solves the problems of existing techniques for mitigating hallucination phenomena in multimodal large language models being overly complex, costly, and affecting model performance, effectively reducing and managing hallucination phenomena in multimodal large language models, and improving their reliability and accuracy in practical applications.

[0040] To achieve the above objectives, a third aspect of this application provides an electronic device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the hallucination mitigation method for a multimodal large model as described in the above embodiments.

[0041] To achieve the above objectives, a fourth aspect of this application provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the hallucination mitigation method for a multimodal large model as described in the above embodiments.

[0042] To achieve the above objectives, a fifth aspect of this application provides a computer program product comprising a computer program that, when executed by a processor, implements the hallucination mitigation method for a multimodal large model as described in the above embodiments.

[0043] Additional aspects and advantages of this application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of this application.

Description of the Drawings

[0044] The above and / or additional aspects and advantages of this application will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, wherein:

[0045] Figure 1 is a flowchart of a hallucination mitigation method for a multimodal large model provided according to an embodiment of this application;

[0046] Figure 2 is a flowchart of an image token attention-enhanced decoding process according to an embodiment of this application;

[0047] Figure 3 is a block diagram of a hallucination mitigation apparatus for a multimodal large model provided according to an embodiment of this application;

[0048] Figure 4 is a schematic diagram of the structure of an electronic device provided according to an embodiment of this application.

Detailed Implementation

[0049] The embodiments of this application are described in detail below. Examples of the embodiments are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and intended to explain this application, and should not be construed as limiting this application.

[0050] The following description, with reference to the accompanying drawings, outlines a method, apparatus, electronic device, and medium for alleviating hallucinations in a multimodal large model according to embodiments of this application.

[0051] Figure 1 is a flowchart of a hallucination mitigation method for a multimodal large model according to an embodiment of this application.

[0052] Before introducing the hallucination mitigation method for multimodal large models proposed in the embodiments of this application, the relevant technical background is introduced first.

[0053] Understandably, the term "hallucination" was first defined for language models, referring to output that appears grammatically or semantically correct but is actually inconsistent with reality or irrelevant to the context. Hallucinations are widespread in large language models and severely limit their practical application in real-world scenarios such as medical diagnosis, autonomous driving, and robotics.

[0054] To address hallucination in large language models, most methods increase data or training costs, for example by constructing new manually labeled datasets, introducing external knowledge bases, or performing additional training or fine-tuning. Specific methods include training post-hoc hallucination correctors to detect and correct hallucinated outputs; using external knowledge bases for retrieval augmentation to reduce hallucination; and employing Reinforcement Learning from Human Feedback (RLHF) to align model outputs with human feedback. Other methods attempt to reduce hallucination during inference by leveraging the model's implicit knowledge, at little or no additional training or data cost. For example, contrastively decoding the output probability distributions of different layers reduces hallucination in large language models while increasing inference time by no more than 10%.

[0055] For multimodal large language models, academic research on hallucination focuses mainly on whether the model's output is faithful to the image content. As with hallucination mitigation for large language models, most methods introduce additional data and training costs. Only a few methods attempt to solve the problem in the inference stage without such costs. For example, the VCD (Visual Contrastive Decoding) method borrows the idea of contrastive decoding: it inputs the original image and a noisy image into the multimodal large model to obtain two different output probability distributions, and then performs contrastive decoding on them. The OPERA method (a novel decoding method) is based on the finding that hallucinated output is often accompanied by attention convergence, and proposes over-trust penalty and retrospection-allocation strategies. Although these methods introduce no additional training or data costs, they significantly increase inference latency.

[0056] It can be seen that the relevant technologies have the following drawbacks: (1) They rely on high-quality labeled data and additional training costs. That is, existing methods for mitigating hallucination often require a large amount of high-quality labeled data and additional training costs. These methods include building manually labeled datasets, integrating external knowledge bases, or performing additional training and fine-tuning, all of which increase the complexity and cost of model development and deployment; (2) They ignore the interaction between input and output. That is, although some methods attempt to use the internal representation of large models to mitigate hallucination during the inference phase, these methods usually focus on the separate processing of input and output, ignoring the interaction between them. This may lead to the inability to fully capture and utilize the complex relationship between input and output when processing multimodal inputs, thereby affecting the overall performance of the model; (3) The inference latency is significantly increased. That is, although some hallucination mitigation methods implemented during the inference phase do not require additional training or data costs, they often significantly increase the inference latency. This is unacceptable for scenarios with high real-time requirements in practical applications, such as autonomous driving and robotics, because the increase in inference latency may lead to a decrease in system response speed, affecting its practical application effect and safety.

[0057] Based on the aforementioned problems, this application proposes a method for mitigating hallucinations in multimodal large language models. First, the input image-text data is processed to obtain text tokens and image tokens; the text tokens and image tokens are aligned to obtain the input sequence of the image-text data, and an image token attention vector is constructed for each position in the input sequence. Second, based on the image token attention vectors, the Jensen-Shannon divergence is used to measure the difference in image understanding between different decoding layers, and the target layer with the largest distance to the image token attention vector of the last layer is determined according to the measurement result. Finally, using a preset early exit mechanism, the hidden state of the target layer is transformed into a probability distribution through a linear projection layer to predict the probability distribution of the next token. This solves the problems of existing techniques for mitigating hallucinations in multimodal large language models being overly complex, costly, and affecting model performance, effectively reducing and managing hallucinations in multimodal large language models, and improving their reliability and accuracy in practical applications.

[0058] For example, as shown in Figure 1, the hallucination mitigation method for a multimodal large model includes the following steps:

[0059] In step S101, the input image-text data is obtained.

[0060] Here, image-text data refers to data comprising both images and text. Multimodal large language models aim to process and generate data from different modalities (such as images and text) to provide users with a rich and natural interactive experience. The specific image-text data input can be determined based on the user's actual needs.

[0061] In step S102, the image and text data are processed to obtain text tokens and image tokens, and the text tokens and image tokens are aligned to obtain the input sequence of the image and text data, and an image token attention vector is constructed for each position in the input sequence.

[0062] In other words, as shown in Figure 2, after the input image-text data is determined, the multimodal large language model aligns the token spaces of the two modalities. That is, the image and text data are processed separately to yield the corresponding image tokens and text tokens. For images and text to be processed by the same model, a mechanism (i.e., a vision-language alignment connector or cross-modal converter) is needed to align the representations of the two modalities into the same semantic space (i.e., the same vector space), thus obtaining the input sequence of the image-text data (which can be an alternating sequence of text and image tokens, or a parallel sequence containing text and image tokens). Let $L$ denote the length of the input token sequence, and assume that the start and end positions of the image tokens in the input sequence are $I_s$ and $I_e$, respectively; then $1 \le I_s < I_e \le L$. Based on the positions of the image tokens in the input sequence, the multi-head attention mechanism and the softmax function can be used to construct the image token attention vector for each position in the sequence, thereby enhancing the model's understanding and use of image information, so that it can focus on relevant image content while processing text.

[0063] The following details how to process text and image data to obtain text tokens and image tokens.

[0064] As one possible implementation, in some embodiments, text and image data are processed to obtain text tokens and image tokens, including: processing the text input of the text and image data using a preset large language model to obtain text tokens; converting the image input of the text and image data into a visual embedding through a preset encoder, and mapping the visual embedding into an image token through a visual-language alignment connector.

[0065] Text tokens refer to the representation obtained by segmenting text into a series of discrete symbols or tags. These tokens can be words, characters, or other language elements, depending on the model used and the task requirements. For example, in natural language processing, text tokenization is the process of converting text into a format that the model can understand, such as segmenting a sentence into words or phrases, with each word or phrase being converted into one or more tokens. Image tokens refer to the representation obtained by decomposing an image into a series of discrete symbols or tags. These tags can represent specific regions, objects, or features in the image. For example, in computer vision tasks, image tokenization divides an image into multiple small blocks, with each block being treated as a token.
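The block-based image tokenization mentioned above can be illustrated with a minimal sketch. It assumes a grayscale image stored as a nested list and square, non-overlapping blocks; the function name `split_into_patches` and the block size are illustrative choices, not taken from this application.

```python
from typing import List


def split_into_patches(image: List[List[float]], patch: int) -> List[List[float]]:
    """Split a 2-D image into non-overlapping patch x patch blocks.

    Each block is flattened row by row. Rows or columns that do not fill a
    whole block are dropped (an assumption made for this sketch).
    """
    rows, cols = len(image), len(image[0])
    patches = []
    for r in range(0, rows - patch + 1, patch):
        for c in range(0, cols - patch + 1, patch):
            block = [image[r + dr][c + dc] for dr in range(patch) for dc in range(patch)]
            patches.append(block)
    return patches


# A 4x4 image whose pixel value encodes its position (row * 4 + col).
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
patches = split_into_patches(image, patch=2)  # four blocks of four pixels each
```

In a real visual encoder each block would then be embedded into a feature vector; the sketch stops at the splitting step.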

[0066] Specifically, for text input, the tokenizer of a preset large language model (LLM), such as BERT (Bidirectional Encoder Representations from Transformers, a pre-trained language model) or GPT (Generative Pre-trained Transformer, a deep learning model trained on internet data), can be used to process the text input in the image-text data, decomposing it into a series of text tokens. For image input, a visual encoder (such as a CNN (Convolutional Neural Network) or a Vision Transformer) can be used to encode the image input into a series of visual embeddings (i.e., image feature vectors). These visual embeddings are mapped through a vision-language alignment connector to a sequence of image tokens whose dimension matches that of the text tokens (its length may differ from that of the text token sequence and requires further processing for alignment).
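The mapping and alignment step can be sketched as follows. This is a minimal illustration under stated assumptions: the connector is modeled as a single linear map, and the image tokens are placed between a text prefix and a text suffix. The application does not fix either choice, and the names `project` and `build_input_sequence` are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def project(embedding: Vector, weight: Matrix) -> Vector:
    """Map one visual embedding to the text token dimension (weight: d_text x d_vision)."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in weight]


def build_input_sequence(
    prefix: List[Vector], visual: List[Vector], suffix: List[Vector], weight: Matrix
) -> Tuple[List[Vector], int, int]:
    """Return the aligned sequence and the 1-based inclusive range (i_s, i_e) of image tokens."""
    image_tokens = [project(e, weight) for e in visual]
    sequence = prefix + image_tokens + suffix
    i_s = len(prefix) + 1
    i_e = len(prefix) + len(image_tokens)
    return sequence, i_s, i_e


prefix = [[0.1, 0.2]]                          # one text token before the image
suffix = [[0.3, 0.4], [0.5, 0.6]]              # two text tokens after the image
visual = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]    # two 3-dimensional visual embeddings
weight = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]    # hypothetical connector weights (3 -> 2)
sequence, i_s, i_e = build_input_sequence(prefix, visual, suffix, weight)
```

The returned pair `(i_s, i_e)` corresponds to the start and end positions of the image tokens that later steps use to slice the attention weights.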

[0067] As one possible implementation, in some embodiments, constructing an image token attention vector for each position in the input sequence includes: representing the attention weight distribution of the image tokens by the maximum weight in multi-head attention; and normalizing the maximum weight in multi-head attention using a softmax function to obtain the image token attention vector for each position in the input sequence.

[0068] Multi-Head Attention (MHA) is a core component of the Transformer model. It allows the model to process information in parallel across different representation subspaces. When processing image-text data, applying MHA to image tokens allows for the computation of dependencies between them. Image token attention vectors represent the important regions that the model focuses on during image processing. These vectors not only contain information about specific regions in the image but also reflect the model's level of "attention" to these regions—that is, the model's assessment of the importance of these regions for understanding the entire image or completing a specific task (such as object detection or image classification).

[0069] Specifically, the image token embedding is divided into multiple "heads". Each "head" will independently pass through a self-attention layer. In each self-attention layer, the image token will calculate attention scores with all other tokens in its sequence (including itself). For each "head", an attention weight matrix can be obtained, where each element represents the attention weight of one image token to another image token.
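The per-head attention weights described above can be sketched with standard scaled dot-product attention. The sketch assumes one query vector per head and a shared list of key vectors per head; the names `head_attention_weights` and `multi_head_attention_weights` are illustrative, and the toy vectors are made up.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def head_attention_weights(query: Vector, keys: List[Vector]) -> List[float]:
    """Attention weights of one query over all keys for a single head."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


def multi_head_attention_weights(
    queries: List[Vector], keys_per_head: List[List[Vector]]
) -> List[List[float]]:
    """One row of weights per head; every row sums to 1."""
    return [head_attention_weights(q, keys) for q, keys in zip(queries, keys_per_head)]


queries = [[1.0, 0.0], [0.0, 1.0]]  # one query vector per head (two heads)
keys_per_head = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
]
weights = multi_head_attention_weights(queries, keys_per_head)
```

Each row of `weights` is the distribution of one head over the key positions, which is the quantity the following paragraphs reduce to a single maximum weight per position.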

[0070] For the $n$-th layer, the attention weights are:

[0071] $A_t^n \in \mathbb{R}^{H \times l_t}$

[0072] where $H$ is the number of attention heads and $l_t$ is the length of the sequence of input and output tokens; $A_t^n$ contains the attention weights of the query at position $t$ for each key token (from the 1st to the $l_t$-th position), satisfying $\sum_{i=1}^{l_t} A_t^n[h, i] = 1$ for every head $h$.

[0073] Considering that the maximum weight in multi-head attention usually represents a higher level of confidence, this embodiment selects the maximum weight for each position:

[0074] $a_{t,i}^n = \max_{1 \le h \le H} A_t^n[h, i]$

[0075] When the position range of the image tokens in the input sequence is $I_s$ to $I_e$, the maximum weights in the multi-head attention at these positions are selected and normalized with the softmax function. For the $n$-th decoder layer, the image token attention vector (iTaV) at position $t$ is:

[0076] $\mathrm{iTaV}_t^n = \mathrm{softmax}\left(\{a_{t,i}^n\}_{i=I_s}^{I_e}\right)$

[0077] where $a_{t,i}^n$ is the maximum weight at position $i$, $I_s$ to $I_e$ is the range of positions of the image tokens in the input sequence, $n$ is the decoder layer, and $t$ is each position.
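The construction of the image token attention vector can be sketched as below: take the maximum weight across heads at each image token position, then normalize with softmax. The function name `image_token_attention_vector`, the 1-based inclusive range convention, and the toy weights are assumptions made for illustration.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def image_token_attention_vector(
    head_weights: List[List[float]], i_s: int, i_e: int
) -> List[float]:
    """Build the iTaV for one layer and one decoding position.

    head_weights holds one row of attention weights per head; i_s and i_e are
    the 1-based, inclusive start and end positions of the image tokens.
    """
    max_weights = [max(head[i] for head in head_weights) for i in range(i_s - 1, i_e)]
    return softmax(max_weights)


# Two heads attending over four positions; positions 2 and 3 are image tokens.
head_weights = [
    [0.10, 0.60, 0.20, 0.10],
    [0.25, 0.25, 0.40, 0.10],
]
itav = image_token_attention_vector(head_weights, i_s=2, i_e=3)
```

Here the per-position maxima are 0.60 and 0.40, so the first image token receives the larger share of the normalized vector.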

[0078] In step S103, based on the image token attention vector at each location, the difference in image understanding between different decoding layers is measured by Jensen-Shannon divergence, and the target layer with the largest distance from the image token attention vector of the last layer is determined according to the measurement result.

[0079] The Jensen-Shannon divergence (JSD) is an index that measures the similarity between two probability distributions and is often used to compare the differences between two probability distributions.

[0080] Specifically, for each decoding layer, the image token attention vector for each location is extracted. For each pair of adjacent decoding layers (or selected layer pairs), the Jensen-Shannon divergence between their respective sets of image token attention vectors is calculated. Since there is an image token attention vector for each location, the Jensen-Shannon divergence for all location vector pairs can be calculated. Then, the average or other aggregation methods are used to obtain a single divergence value to measure the difference in image understanding between different decoding layers (i.e., the distance between the corresponding iTaVs).
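For reference, the Jensen-Shannon divergence of two distributions $P$ and $Q$ is $\mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2}\mathrm{KL}(P \,\|\, M) + \tfrac{1}{2}\mathrm{KL}(Q \,\|\, M)$ with $M = \tfrac{1}{2}(P + Q)$. The sketch below computes it with natural logarithms and averages it over positions; averaging is only one of the aggregation choices the text allows, and the helper names are illustrative.

```python
import math
from typing import List


def kl_divergence(p: List[float], q: List[float]) -> float:
    """Kullback-Leibler divergence KL(p || q); terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: List[float], q: List[float]) -> float:
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


def mean_js_divergence(vectors_a: List[List[float]], vectors_b: List[List[float]]) -> float:
    """Average the divergence over paired positions to get one value per layer pair."""
    pairs = list(zip(vectors_a, vectors_b))
    return sum(js_divergence(p, q) for p, q in pairs) / len(pairs)


identical = js_divergence([0.5, 0.5], [0.5, 0.5])  # no difference between layers
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])   # maximal difference, equals ln 2
```

Because the mixture is strictly positive wherever either input is positive, the divergence stays finite even when one vector assigns zero weight to a position.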

[0081]

[0082] Furthermore, iterate through all decoding layers (except the last layer), and for each layer, calculate its Jensen-Shannon divergence with the last layer. Based on the metric, determine the target layer that has the largest distance to the image token attention vector of the last layer.

[0083] Understandably, LLMs typically consist of multiple stacked layers, each processing its input representation and passing it to the next layer. As the input advances through these layers, the LLM's internal representation progressively improves; that is, the model gradually extracts and integrates more contextual information, semantic features, and patterns. Intuitively, therefore, the representation of the last decoding layer (i.e., $h^N$) is richer and more comprehensive than that of an intermediate layer (i.e., $h^m$, with $m$ from 1 to $N-1$). Meanwhile, hallucinated output from MLLMs often occurs when the attention weights on image tokens are reduced. Therefore, embodiments of this application propose extracting and amplifying the inter-layer improvement in image understanding within MLLMs to alleviate hallucination.

[0084] Specifically, the input sequence $h^0$ can be fed directly into an LLM to generate a response. That is, the LLM decodes the input sequence $h^0$ through multiple decoding layers, each composed of a multi-head attention mechanism and a multilayer perceptron (MLP). Assume the LLM consists of N layers, with $h^0$ as the hidden-state input of the first layer; for each position t, we have:

[0085]

[0086] Following this pattern, we can obtain

[0087] In each decoding step t, select the target layer M with the largest distance between the iTaV of the current decoding layer and the iTaV of the last layer:

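[0088] $M = \arg\max_{m \in \mathcal{J}} \mathrm{JSD}\left(\mathrm{iTaV}_t^N \,\|\, \mathrm{iTaV}_t^m\right)$

[0089] where $\mathcal{J}$ is a subset of $\{1, \dots, N-1\}$. Compared with the other layers, the hidden state of the target layer M (i.e., $h_t^M$) maximizes the improvement in image attention.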
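The target-layer selection at one decoding step can be sketched as follows. This is an illustration under stated assumptions rather than the claimed implementation: `select_target_layer` and `candidate_layers` are hypothetical names, layers are zero-indexed with the last list entry standing for the last decoding layer, and each iTaV is a plain list of probabilities.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (nats) between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def select_target_layer(itavs, candidate_layers):
    """Return the candidate layer whose iTaV is farthest from the last layer's.

    itavs[n] is the iTaV of layer n at the current decoding step;
    itavs[-1] belongs to the last decoding layer.
    """
    last = itavs[-1]
    return max(candidate_layers, key=lambda m: js_divergence(itavs[m], last))

# Toy iTaVs over two image tokens for a three-layer model (made-up values).
itavs = [[0.9, 0.1], [0.6, 0.4], [0.5, 0.5]]
target = select_target_layer(itavs, candidate_layers=[0, 1])
```

In this toy case layer 0 is selected, because its attention over the two image tokens differs most from that of the last layer.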

[0090] In step S104, a preset early exit mechanism is used to transform the hidden state of the target layer into a probability distribution through a linear projection layer in order to predict the probability distribution of the next token.

[0091] The LLM maps $h_t^N$ through a linear projection layer $\phi(\cdot)$ to a $|\mathcal{V}|$-dimensional space, where $\mathcal{V}$ denotes the vocabulary. Finally, the softmax function is used to transform the projection into a probability distribution for predicting the next token, as follows:

[0092] $p_N = \mathrm{softmax}\left(\phi(h_t^N)\right)$

[0093] Furthermore, embodiments of this application can also utilize a preset early exit mechanism (which allows the model to output results when it reaches a certain intermediate layer, without having to wait for the last layer, and can thereby accelerate the inference process) to transform the hidden state $h_t^M$ of the target layer M into a probability distribution through the linear projection layer:

[0094] $p_M = \mathrm{softmax}\left(\phi(h_t^M)\right)$
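The projection step can be illustrated with a small, dependency-free sketch in which the same linear projection is applied to the last-layer hidden state and to the early-exit hidden state of the target layer. The matrix layout (one row per vocabulary entry), the function names, and all numeric values are assumptions for illustration only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def project_to_distribution(hidden, projection):
    """Map a hidden state to next-token probabilities.

    projection has one row per vocabulary entry and one column per
    hidden dimension (an assumed layout for this sketch).
    """
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in projection]
    return softmax(logits)

# Toy setup: hidden size 2, vocabulary size 3 (all values made up).
projection = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_last = [2.0, 0.5]    # hidden state of the last layer N
h_target = [0.5, 0.5]  # hidden state of the early-exit target layer M
p_n = project_to_distribution(h_last, projection)
p_m = project_to_distribution(h_target, projection)
```

Reusing one projection for both layers is what makes the early exit cheap: no additional parameters are needed to read a distribution out of an intermediate layer.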

[0095] Therefore, for a given input, this application can obtain $p_N$ and $p_M$. Subsequently, iTaD obtains the final predicted token probability distribution $p_{iTaD}$ of the model based on the parameter λ (λ = -∞ is optimal). The decoding process is then carried out step by step at each time step t:

[0096] $p_{iTaD} = \lambda p_N + (1-\lambda) p_M$;

[0097] Note that the $p_{iTaD}$ obtained in this way ensures that, when the original output probability $p_N$ pays less attention to the image tokens, its improvement in image attention relative to the hidden layer M is amplified, thereby implicitly enhancing attention to the image tokens and thus alleviating the hallucination problem.
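The combination formula in [0096] can be written directly as code. The sketch below implements the formula exactly as stated; the function name and the example values are hypothetical, and no particular value of λ is implied.

```python
def combine_distributions(p_n, p_m, lam):
    """Element-wise lam * p_N + (1 - lam) * p_M, as written in [0096].

    The entries always sum to 1 when both inputs do. For lam outside
    [0, 1], individual entries may fall outside [0, 1]; this sketch
    does not correct for that.
    """
    if len(p_n) != len(p_m):
        raise ValueError("distributions must cover the same vocabulary")
    return [lam * a + (1.0 - lam) * b for a, b in zip(p_n, p_m)]

# Made-up two-token distributions from the last layer and the target layer.
p_itad = combine_distributions([0.7, 0.3], [0.5, 0.5], lam=0.5)
```

Here `p_itad` is approximately `[0.6, 0.4]`, the midpoint of the two input distributions.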

[0098] Furthermore, in some embodiments, before using a preset early exit mechanism to transform the hidden state of the target layer into a probability distribution through a linear projection layer, the method further includes: adjusting the attention weights of the target layer.

[0099] Understandably, before using the preset early exit mechanism to transform the hidden state of the target layer into a probability distribution through the linear projection layer, the attention weights of the target layer can be adjusted according to preset strategies or conditions, such as renormalization. By adjusting the weights, the attention distribution of the target layer to important information in the input sequence can be optimized, so as to better capture key features.
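Renormalization, the one adjustment named above, can be sketched as follows. The plain `renormalize` helper only rescales a weight row so that it sums to 1. The second helper, which first scales a span of positions by a factor, is a hypothetical example of a "preset strategy" and is not taken from this application.

```python
def renormalize(weights):
    """Rescale non-negative attention weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    return [w / total for w in weights]

def rescale_span_and_renormalize(weights, start, end, factor):
    """Multiply weights[start..end] (inclusive) by factor, then renormalize.

    Hypothetical strategy for illustration: start, end, and factor are
    not parameters defined by the source text.
    """
    scaled = [w * factor if start <= i <= end else w
              for i, w in enumerate(weights)]
    return renormalize(scaled)

# Doubling the first two positions of a made-up weight row.
adjusted = rescale_span_and_renormalize([0.25, 0.25, 0.5], 0, 1, 2.0)
```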

[0100] In summary, the image token attention-enhanced decoding proposed in this application has the following advantages:

[0101] (1) Enhanced image understanding: By defining the image token attention vector (iTaV) and the image token attention amplifier (iTaMP), this application can extract and maximize the improvement in image understanding between model layers, effectively enhancing the model's ability to process image information.

[0102] (2) Mitigating hallucination: This application applies the image token attention amplifier to the output token distribution, which implicitly enhances the model's attention to image tokens, thereby significantly mitigating hallucination in multimodal large language models.

[0103] (3) Universality and ease of use: As a plug-and-play method, this application is applicable to different multimodal large language models, datasets and benchmarks, and has wide applicability. At the same time, this application does not require significant modifications to the original model, nor any additional datasets or additional training. It alleviates the hallucination problem while maintaining low inference costs, meeting the requirements of scenarios with scarce training resources and high-efficiency inference.

[0104] The hallucination mitigation method for multimodal large language models proposed in this application involves processing the input text-image data to obtain text tokens and image tokens, aligning the text tokens and image tokens to obtain the input sequence of text-image data, and constructing an image token attention vector for each position in the input sequence. Based on the image token attention vector, the Jensen-Shannon divergence is used to measure the difference in image understanding between different decoding layers, and the target layer with the largest distance from the image token attention vector of the last layer is determined according to the measurement result. A preset early exit mechanism is used to transform the hidden state of the target layer into a probability distribution through a linear projection layer to predict the probability distribution of the next token. This solves the problems of existing techniques for mitigating hallucination phenomena in multimodal large language models being overly complex, costly, and affecting model performance, effectively reducing and managing hallucination phenomena in multimodal large language models, and improving their reliability and accuracy in practical applications.

[0105] Next, referring to the accompanying drawings, a hallucination mitigation device for a multimodal large model according to an embodiment of this application is described.

[0106] Figure 3 is a block diagram of a hallucination mitigation device for a multimodal large model according to an embodiment of this application.

[0107] As shown in Figure 3, the hallucination mitigation device 10 for a multimodal large model includes: an acquisition module 100, a construction module 200, a determination module 300, and a prediction module 400.

[0108] The acquisition module 100 is used to acquire the input image and text data;

[0109] The construction module 200 is used to process the image and text data to obtain text tokens and image tokens, align the text tokens and image tokens to obtain the input sequence of the image and text data, and construct the image token attention vector for each position in the input sequence;

[0110] The determination module 300 is used to measure the difference in image understanding between different decoding layers based on the image token attention vector at each location using Jensen-Shannon divergence, and determine the target layer with the largest distance to the image token attention vector of the last layer based on the measurement result;

[0111] The prediction module 400 is used to use a preset early exit mechanism to transform the hidden state of the target layer into a probability distribution through a linear projection layer in order to predict the probability distribution of the next token.
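One way to picture how the four modules hand data to one another is the skeleton below. It is a hypothetical wiring, not the device's actual structure: the class name, the callable signatures, and the idea of passing a single context object between modules are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class HallucinationMitigationDevice:
    """Hypothetical composition of modules 100, 200, 300, and 400."""
    acquire: Callable[[Any], Any]       # acquisition module 100
    construct: Callable[[Any], Any]     # construction module 200
    determine: Callable[[Any], int]     # determination module 300
    predict: Callable[[Any, int], Any]  # prediction module 400

    def run(self, raw_input):
        data = self.acquire(raw_input)          # image and text data
        context = self.construct(data)          # input sequence and iTaVs
        target_layer = self.determine(context)  # layer chosen via JSD
        return self.predict(context, target_layer)

# Stub modules that only demonstrate the call order.
device = HallucinationMitigationDevice(
    acquire=lambda raw: raw.strip(),
    construct=lambda data: {"data": data},
    determine=lambda context: 3,
    predict=lambda context, layer: (context["data"], layer),
)
result = device.run("  image + question  ")
```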

[0112] Furthermore, in some embodiments, the construction module 200 is specifically used for:

[0113] Text tokens are obtained by processing the text input of the image and text data using a preset large language model;

[0114] The image input of the image and text data is transformed into a visual embedding through a preset encoder, and the visual embedding is mapped into image tokens through a visual-language alignment connector.

[0115] Furthermore, in some embodiments, the construction module 200 is specifically used for:

[0116] The distribution of attention weights for image tokens is represented by the maximum weight in multi-head attention;

[0117] The softmax function is used to normalize the maximum weight in multi-head attention to obtain the image token attention vector for each position in the input sequence.

[0118] Furthermore, in some embodiments, the image token attention vector for each position is:

[0119] $\mathrm{iTaV}_t^n = \mathrm{softmax}\left(\left[a_{t,I_s}^n, \dots, a_{t,I_e}^n\right]\right)$

[0120] where $a_{t,i}^n$ is the maximum weight in the multi-head attention at each position, $I_s$ to $I_e$ is the range of positions of the image tokens in the input sequence, n is the decoder layer, and t is each position.
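The construction of the image token attention vector for one layer and one position can be sketched as below: take the maximum weight across attention heads for each image-token position, then normalize with softmax. The nested-list layout `attn_heads[h][j]`, the function names, and the assumption that the image-token range is inclusive at both ends are illustrative choices, not details confirmed by the text.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def image_token_attention_vector(attn_heads, img_start, img_end):
    """Build the iTaV for one layer at one position.

    attn_heads[h][j] is head h's attention weight from the current
    position to input position j; image tokens occupy positions
    img_start..img_end (assumed inclusive).
    """
    max_weights = [max(head[j] for head in attn_heads)
                   for j in range(img_start, img_end + 1)]
    return softmax(max_weights)

# Two heads over a three-token input whose last two tokens are image tokens.
attn = [[0.1, 0.2, 0.7],
        [0.3, 0.3, 0.4]]
itav = image_token_attention_vector(attn, img_start=1, img_end=2)
```

In this example the per-token maxima are 0.3 and 0.7, so the resulting vector places more weight on the second image token.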

[0121] Furthermore, in some embodiments, before using a preset early exit mechanism to transform the hidden state of the target layer into a probability distribution through a linear projection layer, the prediction module 400 is further configured to:

[0122] Adjust the attention weights of the target layer.

[0123] It should be noted that the foregoing explanation of the embodiment of the hallucination relief method for multimodal large models also applies to the hallucination relief device for multimodal large models in this embodiment, and will not be repeated here.

[0124] The hallucination mitigation device for multimodal large language models proposed in this application processes the input text and image data to obtain text tokens and image tokens, aligns the text tokens and image tokens to obtain the input sequence of text and image data, and constructs an image token attention vector for each position in the input sequence. Based on the image token attention vector, the Jensen-Shannon divergence is used to measure the difference in image understanding between different decoding layers, and the target layer with the largest distance from the image token attention vector of the last layer is determined according to the measurement result. Using a preset early exit mechanism, the hidden state of the target layer is transformed into a probability distribution through a linear projection layer to predict the probability distribution of the next token. This solves the problems of existing techniques for mitigating hallucination phenomena in multimodal large language models being overly complex, costly, and affecting model performance, effectively reducing and managing hallucination phenomena in multimodal large language models, and improving their reliability and accuracy in practical applications.

[0125] Figure 4 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. The electronic device may include:

[0126] A memory 401, a processor 402, and a computer program stored on the memory 401 and executable on the processor 402.

[0127] When the processor 402 executes the program, it implements the hallucination relief method for the multimodal large model provided in the above embodiments.

[0128] Furthermore, the electronic device also includes:

[0129] A communication interface 403, used for communication between the memory 401 and the processor 402.

[0130] The memory 401 is used to store computer programs that can run on the processor 402.

[0131] The memory 401 may include high-speed RAM (Random Access Memory) memory, and may also include non-volatile memory, such as at least one disk storage.

[0132] If the memory 401, processor 402, and communication interface 403 are implemented independently, then the communication interface 403, memory 401, and processor 402 can be interconnected via a bus to complete communication between them. The bus can be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, or an EISA (Extended Industry Standard Architecture) bus, etc. The bus can be divided into address bus, data bus, control bus, etc. For ease of representation, the bus is shown in Figure 4 as a single thick line, but this does not mean that there is only one bus or one type of bus.

[0133] Optionally, in a specific implementation, if the memory 401, processor 402, and communication interface 403 are integrated on a single chip, then the memory 401, processor 402, and communication interface 403 can communicate with each other through an internal interface.

[0134] Processor 402 may be a CPU (Central Processing Unit), an ASIC (Application Specific Integrated Circuit), or one or more integrated circuits configured to implement embodiments of this application.

[0135] This application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described hallucination mitigation method for a multimodal large model.

[0136] This application also proposes a computer program product, including a computer program that, when executed by a processor, implements the above-described hallucination mitigation method for a multimodal large model.

[0137] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of that feature. In the description of this application, "multiple" means at least two, such as two, three, etc., unless otherwise explicitly specified.

[0138] In the description of this specification, the references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., refer to specific features, structures, materials, or characteristics described in connection with that embodiment or example, which are included in at least one embodiment or example of this application. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification, as well as the features of different embodiments or examples.

[0139] Although embodiments of this application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting this application. Those skilled in the art can make changes, modifications, substitutions and variations to the above embodiments within the scope of this application.

Claims

1. A method for alleviating hallucinations in a multimodal large model, characterized in that it includes the following steps:
Step S101: Obtain the input image and text data;
Step S102: Process the image and text data to obtain text tokens and image tokens, align the text tokens and image tokens to obtain the input sequence of the image and text data, and construct the image token attention vector for each position in the input sequence;
Step S103: Based on the image token attention vector at each position, measure the difference in image understanding between different decoding layers by Jensen-Shannon divergence, and determine, according to the measurement result, the target layer with the largest distance from the image token attention vector of the last layer;
Step S104: Using a preset early exit mechanism, transform the hidden state of the target layer into a probability distribution through a linear projection layer in order to predict the probability distribution of the next token;
wherein, in step S102, the attention weight distribution of the image tokens is represented by the maximum weight in the multi-head attention, and the maximum weight in the multi-head attention is normalized using the softmax function to obtain the image token attention vector at each position in the input sequence;
step S104 further includes: using the large language model (LLM) to map $h_t^N$ through the linear projection layer $\phi(\cdot)$ to a $|\mathcal{V}|$-dimensional space, and using the softmax function to transform the projection into a probability distribution for predicting the next token: $p_N = \mathrm{softmax}\left(\phi(h_t^N)\right)$, where $h_t^N$ is the representation of the last decoding layer and $\mathcal{V}$ is the vocabulary;
in step S104, the preset early exit mechanism refers to allowing the model to output results upon reaching the target layer, without waiting for the last layer; the hidden state of the target layer is transformed into a probability distribution through the linear projection layer: $p_M = \mathrm{softmax}\left(\phi(h_t^M)\right)$, where M is the target layer and $h_t^M$ is the hidden state of the target layer;
the final predicted token probability distribution of the model is: $p_{iTaD} = \lambda p_N + (1-\lambda) p_M$, where $p_{iTaD}$ is the final predicted token probability distribution of the model.

2. The method according to claim 1, characterized in that processing the image and text data to obtain text tokens and image tokens includes: obtaining the text tokens by processing the text input of the image and text data using a preset large language model; and converting the image input of the image and text data into a visual embedding through a preset encoder, and mapping the visual embedding to the image tokens through a visual-language alignment connector.

3. The method according to claim 1, characterized in that the image token attention vector for each position is: $\mathrm{iTaV}_t^n = \mathrm{softmax}\left(\left[a_{t,I_s}^n, \dots, a_{t,I_e}^n\right]\right)$; where $a_{t,i}^n$ is the maximum weight at each position, $I_s$ to $I_e$ is the position range of the image tokens in the input sequence, n is the decoder layer, and t is each of the aforementioned positions.

4. The method according to claim 1, characterized in that, before using a preset early exit mechanism to transform the hidden state of the target layer into a probability distribution through a linear projection layer, the method further includes: adjusting the attention weights of the target layer.

5. A hallucination mitigation device for a multimodal large model, characterized in that the device is applied to the hallucination mitigation method for a multimodal large model as described in any one of claims 1-4, the device comprising: an acquisition module, used to acquire the input image and text data; a construction module, used to process the image and text data to obtain text tokens and image tokens, align the text tokens and image tokens to obtain the input sequence of the image and text data, and construct the image token attention vector at each position in the input sequence; a determination module, used to measure the difference in image understanding between different decoding layers based on the image token attention vector at each position using Jensen-Shannon divergence, and determine the target layer with the largest distance from the image token attention vector of the last layer based on the measurement result; and a prediction module, used to transform, using a preset early exit mechanism, the hidden state of the target layer into a probability distribution through a linear projection layer in order to predict the probability distribution of the next token.

6. The device according to claim 5, characterized in that the construction module is specifically used for: obtaining the text tokens by processing the text input of the image and text data using a preset large language model; and converting the image input of the image and text data into a visual embedding through a preset encoder, and mapping the visual embedding to the image tokens through a visual-language alignment connector.

7. An electronic device, characterized in that it comprises: a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor executing the program to implement the hallucination mitigation method for a multimodal large model as described in any one of claims 1-4.

8. A computer-readable storage medium having a computer program stored thereon, characterized in that the program is executed by a processor to implement the hallucination mitigation method for a multimodal large model as described in any one of claims 1-4.

9. A computer program product, characterized in that it includes a computer program which, when executed by a processor, implements the hallucination mitigation method for a multimodal large model as described in any one of claims 1-4.