A multi-modal large model hallucination detection method and related device
By acquiring the tokens and their four-dimensional features of a multimodal large language model, and combining them with generation probability, language entropy, and visual attention features, a lightweight nonlinear fusion detector is trained. This solves the fine-grained monitoring problem of MLLM hallucination detection, achieves real-time and stable hallucination detection, and reduces computational overhead and deployment costs.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHANGAN UNIV
- Filing Date
- 2026-04-07
- Publication Date
- 2026-06-12
Figure CN122196976A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of trustworthy large model technology, and specifically relates to a multimodal large model hallucination detection method and related device.
Background Technology
[0002] As multimodal large language models (MLLMs) such as LLaVA, Qwen-VL, and InternVL demonstrate powerful capabilities in tasks like visual question answering, image captioning, and cross-modal reasoning, the factual consistency of their generated content is becoming an increasingly prominent issue. When visual evidence is insufficient or the image is blurry, MLLMs often generate "hallucinated" content that seems plausible but is inconsistent with the image, such as fabricating non-existent objects, incorrectly describing colors or quantities, or even fabricating event logic. These hallucinations not only damage the model's credibility but can also lead to serious consequences in high-risk scenarios such as healthcare, autonomous driving, and security. Therefore, fine-grained, real-time, and interpretable monitoring of MLLM hallucinations has become a key challenge in the field of trustworthy artificial intelligence.
[0003] Currently, methods for detecting and mitigating MLLM hallucinations mainly work at the sentence or entity level, and often rely on external validation tools or post-processing corrections, making it difficult to meet the needs of real-time inference and fine-grained intervention. Typical methods fall into the following categories: The first category is post-hoc detection methods that verify knowledge with external tools. For example, Deng et al. proposed the CGD method, which calls CLIP during the generation process to calculate the similarity between candidate text and images, thereby suppressing hallucination generation online. Although this type of method is intuitive, it has obvious drawbacks: (1) it has high computational overhead, requiring multiple additional models to be run, which makes it difficult to deploy in online services; (2) it has coarse granularity, able only to determine whether a whole sentence or entity is a hallucination and unable to locate specific tokens; (3) it relies on external tools and generalizes poorly, easily leading to misjudgments when the verification model and the main model have inconsistent visual understanding.
[0004] The second category is heuristic analysis methods based on attention or activation values. Some studies have attempted to analyze the distribution of attention to visual tokens during MLLM decoding; if attention is too scattered or concentrated in irrelevant areas, it is speculated that hallucinations may exist. However, these methods typically use only a single signal and do not effectively integrate with the uncertainty of language generation, resulting in limited discriminative ability. More importantly, existing work mostly remains at the stage of qualitative observation or threshold segmentation, lacking a systematic token-level supervision signal and a learnable fusion mechanism, making it difficult to form a stable and reliable detector.
[0005] The third category comprises unsupervised detection methods based on contrastive learning or self-supervision. For example, Jiang et al. constructed a visual-language contrastive learning mechanism that uses interfering text as negative samples and compares it with real images and text to achieve weakly supervised detection or suppression of hallucinations. However, such methods often require carefully designed data augmentation strategies or fine-tuned models, and lack high-quality negative sample construction standards in MLLM scenarios, resulting in high engineering complexity. Furthermore, these methods typically still perform discrimination on a sentence-by-sentence basis, failing to support token-level hallucination recognition and detection.
[0006] In summary, the existing technology has the following core problems: 1. The lack of token-level fine-grained monitoring makes it impossible to accurately locate hallucinated tokens, which limits the design of subsequent refined intervention strategies; 2. Over-reliance on external validation tools or complex dataset construction leads to high system latency and deployment difficulties; 3. Feature utilization is limited to a single signal, failing to effectively combine multi-dimensional internal signals such as language generation uncertainty and visual attention dynamics, which limits detection performance.
Summary of the Invention
[0007] This invention provides a method and related equipment for detecting multimodal large model hallucinations, which solves the problems of insufficient detection accuracy, high system latency and difficult deployment of existing methods for detecting and mitigating MLLM hallucinations.
[0008] To achieve the above objectives, the present invention provides the following technical solution: A multimodal large model hallucination detection method, including: obtaining the tokens output by the multimodal large language model and their four-dimensional features, wherein the tokens and their four-dimensional features are obtained based on the text and image samples input to the multimodal large language model; and inputting the tokens and their four-dimensional features into the pre-trained hallucination detection model to output a hallucination risk score. The pre-trained hallucination detection model is trained as follows: character-level hallucination annotation is performed on text and image samples of the large language model; teacher-forcing inference is performed on the annotated text and image samples to obtain the logits sequence and cross-layer attention weights corresponding to each generated token; four-dimensional token-level hallucination discrimination features are extracted based on the logits sequence and the cross-layer attention weights corresponding to each generated token, the features comprising generation probability, language entropy, visual attention ratio, and visual attention entropy; a token-level label is obtained for each token based on the character-level hallucination annotation; and a lightweight nonlinear fusion detector is trained based on the token-level labels and the four-dimensional token-level hallucination discrimination features to obtain the hallucination detection model.
[0009] Preferably, the method for performing teacher-forcing inference on the annotated text and image samples to obtain the logits sequence and cross-layer attention weights corresponding to each generated token is as follows: Given a sample consisting of an image $I$, a text prompt $Q$, a reference answer $R$, and character-level hallucination span annotations $S = \{(s_k, e_k)\}_{k=1}^{K}$, where $s_k$ and $e_k$ indicate the start and end character positions of the $k$-th hallucination annotation in the original string, construct the complete input sequence $X = [Q; R]$; the input sequence $X$ and the image $I$ are input into the multimodal large language model and subjected to teacher-forcing inference. During inference, the logits sequence $\{z_t\}_{t=1}^{T}$ of the tokens is obtained in real time, where $z_t \in \mathbb{R}^{|V|}$ is the unnormalized score that the model assigns to each vocabulary token at prediction step $t$; the cross-layer attention weights $\{A^{(l)}\}_{l=1}^{L}$ are also obtained, where $A^{(l)} \in \mathbb{R}^{H \times N \times N}$ is the attention matrix of the $l$-th layer, $H$ is the number of attention heads, and $N$ is the total number of tokens.
[0010] Preferably, the extraction of four-dimensional token-level hallucination discrimination features based on the logits sequence and the cross-layer attention weights corresponding to each generated token is specifically as follows: Calculate the generation probability: $p_t = \frac{\exp(z_t[x_t])}{\sum_{v \in V} \exp(z_t[v])}$
[0011] where $z_t$ is the logits vector output by the multimodal large language model at step $t$, and $x_t$ is the index of the token at that step; calculate the language entropy according to the Shannon entropy definition: $H_t^{\text{lang}} = -\sum_{v \in V} p_t(v) \log p_t(v)$
[0012] where $V$ represents the model vocabulary set, $v$ represents any candidate token in the vocabulary, and $p_t(v)$ is the softmax probability of $v$ at step $t$; calculate the visual attention ratio: $r_t = \frac{1}{L} \sum_{l=1}^{L} \frac{\sum_{j \in \mathcal{I}} \bar{a}^{(l)}_{t,j}}{\sum_{j=1}^{N} \bar{a}^{(l)}_{t,j}}$; where $\mathcal{I}$ is the set of image token positions, $N_v = |\mathcal{I}|$ is the number of image tokens, $N$ is the total length of the entire token sequence, $L$ is the number of decoder layers of the large model, and $\bar{a}^{(l)}_{t,j}$ is the average attention that layer $l$ gives to the $j$-th token when predicting the $t$-th token; calculate the visual attention entropy according to the Shannon entropy definition: $H_t^{\text{vis}} = -\sum_{j \in \mathcal{I}} \tilde{a}_{t,j} \log \tilde{a}_{t,j}$
[0013] where $\tilde{a}_{t,j}$ represents the normalized distribution of visual attention.
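[0014] Preferably, the steps for obtaining the token-level label for each token based on the character-level hallucination annotation are as follows: Establish a mapping from each token to its character range $[c_t^{s}, c_t^{e})$ in the original string; if there exists an annotated hallucination span $[s_k, e_k)$ such that: $[c_t^{s}, c_t^{e}) \cap [s_k, e_k) \neq \emptyset$
[0015] then the token is marked as a hallucination, that is, $y_t = 1$; otherwise $y_t = 0$.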
[0016] Preferably, the lightweight nonlinear fusion detector is a gradient boosting tree model.
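[0017] Preferably, the steps for training a lightweight nonlinear fusion detector based on token-level labels and four-dimensional token-level hallucination discrimination features to obtain a hallucination detection model are as follows: The four-dimensional feature vectors $f_t$ (generation probability, language entropy, visual attention ratio, and visual attention entropy) and the token labels $y_t$ are used to construct the training set $D = \{(f_t, y_t)\}$, which is input to the LightGBM classifier; the weighted binary cross-entropy loss is optimized to complete the training: $\mathcal{L} = -\sum_{t} w_t \left[ y_t \log \hat{y}_t + (1 - y_t) \log (1 - \hat{y}_t) \right] + \lambda \Omega$
[0018] where $\hat{y}_t$ is the output of the LightGBM model, $w_t$ is the weight of sample $t$, $\Omega$ is the built-in regularization term, and $\lambda$ is the regularization coefficient; after training, the model outputs a hallucination risk score $\hat{y}_t \in [0, 1]$.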
[0019] A multimodal large model hallucination detection system, comprising: a feature acquisition module, used to obtain the tokens output by the multimodal large language model and their four-dimensional features, wherein the tokens and their four-dimensional features are obtained based on the text and image samples input to the multimodal large language model; and a hallucination detection module, used to input the tokens and their four-dimensional features into the pre-trained hallucination detection model and output a hallucination risk score. The pre-trained hallucination detection model is trained as follows: character-level hallucination annotation is performed on text and image samples of the large language model; teacher-forcing inference is performed on the annotated text and image samples to obtain the logits sequence and cross-layer attention weights corresponding to each generated token; four-dimensional token-level hallucination discrimination features are extracted based on the logits sequence and the cross-layer attention weights corresponding to each generated token, the features comprising generation probability, language entropy, visual attention ratio, and visual attention entropy; a token-level label is obtained for each token based on the character-level hallucination annotation; and a lightweight nonlinear fusion detector is trained based on the token-level labels and the four-dimensional token-level hallucination discrimination features to obtain the hallucination detection model.
[0020] A computer device includes a memory, a processor, and a computer program stored in the memory, wherein the processor executes the computer program to implement the steps of the above-described multimodal large model hallucination detection method.
[0021] A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the aforementioned multimodal large model hallucination detection method.
[0022] A computer program product includes a computer program that, when executed by a processor, implements the steps of the aforementioned multimodal large model hallucination detection method.
[0023] Compared with existing technologies, this invention has the following advantages: This invention provides a multimodal large model hallucination detection method. By combining character-level hallucination annotation and teacher-forcing inference with four-dimensional token-level hallucination discrimination features to construct a detector, it achieves token-level fine-grained monitoring of hallucinated content. This allows hallucinated tokens to be located precisely, supporting subsequent refined intervention strategies and overcoming the limitation of existing methods that can only detect at the sentence or entity level. The solution does not rely on external validation tools. It extracts features from the model's internal logits sequence and cross-layer attention weights, which reduces computational overhead and avoids the poor generalization and high latency associated with external tools. This makes it easier to deploy in online service scenarios. Furthermore, by integrating multi-dimensional features such as generation probability and language entropy, and combining the uncertainty of language generation with dynamic visual attention signals, it improves the stability and reliability of hallucination detection compared to single-feature heuristic methods, supporting the reliable application of multimodal large language models in high-risk scenarios.
[0024] Furthermore, a quantitative assessment of token-level hallucination risk is achieved. This solution not only overcomes the two core obstacles of annotation alignment and feature validity, but also builds a deployable, interpretable, and scalable token-level hallucination detection capability within the MLLM.
[0025] Furthermore, this application demonstrates excellent engineering practicality and deployment friendliness. The entire monitoring process relies solely on a single forward inference output from MLLM, eliminating the need to call external large models, fine-tune the main model, or employ complex post-processing or additional annotation tools, significantly reducing system latency and deployment costs. Simultaneously, the extracted four types of features have low dimensionality and low computational overhead, while the fusion model has a small number of parameters and fast inference speed, allowing for seamless embedding into existing MLLM inference service pipelines to achieve real-time token-level risk scoring.
[0026] Furthermore, this application achieves token-level risk quantification while maintaining low overhead by mining signals within the model and designing a lightweight fusion mechanism, effectively addressing this long-standing technical requirement.
Attached Figure Description
[0027] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the embodiments will be briefly described below. It should be understood that the following drawings only show some embodiments of the present invention and should not be regarded as a limitation on the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.
[0028] Figure 1 is a flowchart of a multimodal large model hallucination detection method according to an embodiment of the present invention; Figure 2 is a flowchart illustrating the training process of the hallucination detection model in an embodiment of the present invention; Figure 3 is a block diagram of a multimodal large model hallucination detection system according to an embodiment of the present invention; Figure 4 is a flowchart illustrating the training process of the hallucination detection model in an embodiment of the present invention.
Detailed Implementation
[0029] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. The components of the embodiments of the present invention described and shown in the accompanying drawings can generally be arranged and designed in various different configurations.
[0030] Therefore, the following detailed description of the embodiments of the invention provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but merely to illustrate selected embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort are within the scope of protection of the invention.
[0031] It should be noted that similar labels and letters in the following figures indicate similar items. Therefore, once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures.
[0032] The inventors discovered during their research that the token encoding process of an MLLM is inherently misaligned with manually labeled character-level or sentence-level hallucination spans, so a robust mapping mechanism must be designed. In addition, the error signals of token-level hallucinations are weak and easily masked by language fluency. Lightweight, hallucination-related features therefore have to be selected from the large volume of forward propagation signals without introducing additional computational burden, balancing model performance against inference efficiency.
[0033] Solving this problem has significant practical implications: On the one hand, token-level hallucination detection can provide fine-grained and accurate signals for the construction of trustworthy large models, improving the credibility and security of MLLMs. On the other hand, this method relies solely on the model's own forward propagation output, requiring no external tools or fine-tuning of the main model, so it has clear engineering value and is widely applicable to online service scenarios for large vision-language models. Therefore, it is of practical significance to propose a lightweight and efficient MLLM hallucination detection method that is based on the fusion of multiple internal signals and supports token-level real-time monitoring.
[0034] To enable those skilled in the art to better understand the technical solution of the present invention, the present invention will be further described in detail below with reference to the accompanying drawings.
[0035] As shown in Figure 1, this embodiment of the invention provides a multimodal large model hallucination detection method, including: S1: Obtain the tokens output by the multimodal large language model and their four-dimensional features, wherein the tokens and their four-dimensional features are obtained based on the text and image samples input to the multimodal large language model; S2: Input the tokens and their four-dimensional features into the pre-trained hallucination detection model and output a hallucination risk score. The pre-trained hallucination detection model is trained as follows: character-level hallucination annotation is performed on text and image samples of the large language model; teacher-forcing inference is performed on the annotated text and image samples to obtain the logits sequence and cross-layer attention weights corresponding to each generated token; four-dimensional token-level hallucination discrimination features are extracted based on the logits sequence and the cross-layer attention weights corresponding to each generated token, the features comprising generation probability, language entropy, visual attention ratio, and visual attention entropy; a token-level label is obtained for each token based on the character-level hallucination annotation; and a lightweight nonlinear fusion detector is trained based on the token-level labels and the four-dimensional token-level hallucination discrimination features to obtain the hallucination detection model.
[0036] As shown in Figure 2, the training method for the pre-trained hallucination detection model is as follows: Step S101: Using the MLLM inference sample set annotated with character-level hallucination spans, perform teacher-forcing inference on each sample to obtain the logits sequence and cross-layer attention weights.
[0037] A set of inference samples containing character-level hallucination annotations is selected, and forward computation is performed in teacher-forcing mode to obtain the complete logits sequence and the multi-layer attention weight matrices. This process extracts all internal signals in a single forward pass, without additional backpropagation or auxiliary model calls, ensuring data consistency and computational efficiency.
[0038] The method for performing teacher-forcing inference on the annotated text and image samples to obtain the logits sequence and cross-layer attention weights corresponding to each generated token is as follows: Given a sample consisting of an image $I$, a text prompt $Q$, a reference answer $R$, and character-level hallucination span annotations $S = \{(s_k, e_k)\}_{k=1}^{K}$, where $s_k$ and $e_k$ indicate the start and end character positions of the $k$-th hallucination annotation in the original string, construct the complete input sequence $X = [Q; R]$; the input sequence $X$ and the image $I$ are input into the multimodal large language model and subjected to teacher-forcing inference. During inference, the logits sequence $\{z_t\}_{t=1}^{T}$ of the tokens is obtained in real time, where $z_t \in \mathbb{R}^{|V|}$ is the unnormalized score that the model assigns to each vocabulary token at prediction step $t$; the cross-layer attention weights $\{A^{(l)}\}_{l=1}^{L}$ are also obtained, where $A^{(l)} \in \mathbb{R}^{H \times N \times N}$ is the attention matrix of the $l$-th layer, $H$ is the number of attention heads, and $N$ is the total number of tokens.
[0039] Step S102: Extract four-dimensional token-level hallucination discrimination features based on the logits sequence and cross-layer attention weights corresponding to each generated token. The four-dimensional token-level hallucination discrimination features include generation probability, language entropy, visual attention ratio, and visual attention entropy.
[0040] The generation probability reflects the model's confidence in the current token; the language entropy measures the uncertainty of the distribution over the vocabulary; the visual attention ratio characterizes the proportion of attention the model pays to the image tokens when predicting the current token; and the visual attention entropy measures how concentrated the visual attention distribution is. All four features can be calculated in real time during the model's forward inference, with low computational cost and no need to modify the model structure.
[0041] The four-dimensional token-level hallucination discrimination features are extracted based on the logits sequence and the cross-layer attention weights corresponding to each generated token, specifically as follows: Calculate the generation probability: $p_t = \frac{\exp(z_t[x_t])}{\sum_{v \in V} \exp(z_t[v])}$
[0042] where $z_t$ is the logits vector output by the multimodal large language model at step $t$, and $x_t$ is the index of the token at that step; calculate the language entropy according to the Shannon entropy definition: $H_t^{\text{lang}} = -\sum_{v \in V} p_t(v) \log p_t(v)$
[0043] where $V$ represents the model vocabulary set, $v$ represents any candidate token in the vocabulary, and $p_t(v)$ is the softmax probability of $v$ at step $t$; calculate the visual attention ratio: $r_t = \frac{1}{L} \sum_{l=1}^{L} \frac{\sum_{j \in \mathcal{I}} \bar{a}^{(l)}_{t,j}}{\sum_{j=1}^{N} \bar{a}^{(l)}_{t,j}}$; where $\mathcal{I}$ is the set of image token positions, $N_v = |\mathcal{I}|$ is the number of image tokens, $N$ is the total length of the entire token sequence, $L$ is the number of decoder layers of the large model, and $\bar{a}^{(l)}_{t,j}$ is the average attention that layer $l$ gives to the $j$-th token when predicting the $t$-th token; calculate the visual attention entropy according to the Shannon entropy definition: $H_t^{\text{vis}} = -\sum_{j \in \mathcal{I}} \tilde{a}_{t,j} \log \tilde{a}_{t,j}$
[0044] where $\tilde{a}_{t,j}$ represents the normalized distribution of visual attention.
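As an illustration only, the four features can be computed for a single token with nothing beyond the standard library. The sketch below is not taken from the patent: the function name `token_features`, the assumption that each layer's attention row has already been averaged over heads, and the choice to average the image-token attention over layers before normalizing it for the entropy are all assumptions made for this example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(dist):
    """Shannon entropy (natural log); zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

def token_features(logits, token_id, attn_layers, image_positions):
    """Return [generation probability, language entropy,
    visual attention ratio, visual attention entropy] for one token.

    logits          -- unnormalized scores over the vocabulary at this step
    token_id        -- vocabulary index of the token at this step
    attn_layers     -- one head-averaged attention row per decoder layer
                       (assumed precomputed), each over all sequence positions
    image_positions -- sequence positions of the image tokens (assumed known)
    """
    probs = softmax(logits)
    gen_prob = probs[token_id]
    lang_entropy = entropy(probs)

    ratios = []
    visual_mass = [0.0] * len(image_positions)
    for row in attn_layers:
        total = sum(row)
        img = [row[j] for j in image_positions]
        # Share of this layer's attention mass that lands on image tokens.
        ratios.append(sum(img) / total if total > 0 else 0.0)
        visual_mass = [a + b for a, b in zip(visual_mass, img)]
    vis_ratio = sum(ratios) / len(ratios)

    # Renormalize the accumulated image-token attention so it sums to 1.
    mass = sum(visual_mass)
    vis_dist = [m / mass for m in visual_mass] if mass > 0 else []
    vis_entropy = entropy(vis_dist)
    return [gen_prob, lang_entropy, vis_ratio, vis_entropy]
```

A low generation probability combined with a low visual attention ratio is the kind of joint pattern the fusion detector is meant to pick up; neither value alone is treated as a decision here.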
[0045] Step S103: Obtain the token-level label for each token based on the character-level hallucination annotation.
[0046] Because token boundaries in an MLLM are inherently misaligned with the original character positions, directly using character-level spans would lead to inaccurate supervision. Therefore, the complete input sequence is decoded into the original string, and the start and end positions of each token within the string are recorded. When the character range of a token overlaps with a hallucination span, the token is marked as a hallucination, thus achieving a precise mapping from character-level annotations to token-level labels.
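[0047] Based on the character-level hallucination annotation, the specific steps to obtain the token-level label for each token are as follows: Establish a mapping from each token to its character range $[c_t^{s}, c_t^{e})$ in the original string; if there exists an annotated hallucination span $[s_k, e_k)$ such that: $[c_t^{s}, c_t^{e}) \cap [s_k, e_k) \neq \emptyset$
[0048] then the token is marked as a hallucination, that is, $y_t = 1$; otherwise $y_t = 0$.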
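The overlap rule described above can be sketched in a few lines. This is a minimal illustration, assuming that the decoded token strings concatenate exactly to the original string and that spans are half-open `[start, end)` character ranges; the helper names `token_char_ranges` and `label_tokens` are hypothetical and do not come from the patent.

```python
def token_char_ranges(token_strings):
    """Map each token to its [start, end) character range, assuming the
    decoded token strings concatenate exactly to the original string."""
    ranges, pos = [], 0
    for tok in token_strings:
        ranges.append((pos, pos + len(tok)))
        pos += len(tok)
    return ranges

def label_tokens(token_strings, hallucination_spans):
    """Label a token 1 if its character range overlaps any annotated
    [start, end) hallucination span, otherwise 0."""
    labels = []
    for c_start, c_end in token_char_ranges(token_strings):
        overlaps = any(c_start < e and s < c_end
                       for s, e in hallucination_spans)
        labels.append(1 if overlaps else 0)
    return labels

tokens = ["A", " red", " car", " on", " the", " road"]
# In "A red car on the road", characters 2..4 ("red") are annotated.
print(label_tokens(tokens, [(2, 5)]))  # prints [0, 1, 0, 0, 0, 0]
```

Because any overlap is enough, a span that cuts through the middle of a token still labels the whole token, which matches the conservative mapping described in the text.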
[0049] Step S104: Train a lightweight nonlinear fusion detector based on token-level labels and four-dimensional token-level hallucination discrimination features to obtain a hallucination detection model.
[0050] This invention employs the LightGBM gradient boosting tree as the detector and token-level hallucination labels as the supervision signal, optimizing the weighted binary cross-entropy loss to form an interpretable, low-overhead fusion model. The LightGBM model has a lightweight structure, high training efficiency, and the ability to handle class imbalance, making it suitable for real-time detection scenarios.
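[0051] The specific steps for training a lightweight nonlinear fusion detector based on token-level labels and four-dimensional token-level hallucination discrimination features to obtain a hallucination detection model are as follows: The four-dimensional feature vectors $f_t$ (generation probability, language entropy, visual attention ratio, and visual attention entropy) and the token labels $y_t$ are used to construct the training set $D = \{(f_t, y_t)\}$, which is input to the LightGBM classifier; the weighted binary cross-entropy loss is optimized to complete the training: $\mathcal{L} = -\sum_{t} w_t \left[ y_t \log \hat{y}_t + (1 - y_t) \log (1 - \hat{y}_t) \right] + \lambda \Omega$
[0052] where $\hat{y}_t$ is the output of the LightGBM model, $w_t$ is the weight of sample $t$, $\Omega$ is the built-in regularization term, and $\lambda$ is the regularization coefficient; after training, the model can output a hallucination risk score $\hat{y}_t \in [0, 1]$.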
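To make the role of the weighting concrete, the following sketch evaluates a weighted binary cross-entropy on token labels and risk scores. It is an illustration under stated assumptions, not the patent's training code: the inverse-frequency weighting in `class_weights` is an assumed scheme, the regularization term is omitted, both classes are assumed to be present, and the loss is averaged rather than summed.

```python
import math

def class_weights(labels):
    """Weight each class inversely to its frequency (an assumed scheme;
    requires at least one positive and one negative label)."""
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    return {1: n / (2.0 * n_pos), 0: n / (2.0 * n_neg)}

def weighted_bce(labels, scores, weights, eps=1e-12):
    """Mean weighted binary cross-entropy between 0/1 labels and
    risk scores in (0, 1); the regularization term is not included."""
    total = 0.0
    for y, p in zip(labels, scores):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -weights[y] * (y * math.log(p)
                                + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

labels = [1, 0, 0, 0]           # hallucinated tokens are the rare class
weights = class_weights(labels)
uninformed = weighted_bce(labels, [0.5, 0.5, 0.5, 0.5], weights)
informed = weighted_bce(labels, [0.9, 0.1, 0.1, 0.1], weights)
print(informed < uninformed)    # prints True
```

With this weighting the single positive token contributes as much to the loss as the three negatives combined, which is the usual motivation for weighting when hallucinated tokens are rare. In LightGBM itself, parameters such as `scale_pos_weight` serve a similar purpose.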
[0053] Step S105: Use the trained detector to determine the hallucination risk of tokens in the MLLM inference process in real time.
[0054] By embedding the trained detector into the MLLM generation process, real-time risk assessment is performed on each generated token, achieving millisecond-level hallucination detection and risk labeling. This can be used for manual review, automatic filtering, or hallucination evaluation benchmark construction, generating fine-grained, highly consistent datasets for validating and evaluating the performance of other hallucination detection or mitigation methods.
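One way the per-token risk scores could be turned into reviewable output is to merge consecutive high-risk tokens into character spans. The sketch below is a hypothetical post-processing step, not something the patent specifies: the function name `flag_spans` and the 0.5 threshold are assumptions, and token character ranges are assumed to be contiguous and in order.

```python
def flag_spans(char_ranges, risk_scores, threshold=0.5):
    """Merge runs of consecutive tokens whose risk score meets the
    threshold into [start, end) character spans.

    char_ranges -- (start, end) character range of each token, in order
    risk_scores -- detector output for each token, in the same order
    threshold   -- assumed decision threshold (not given in the source)
    """
    spans = []
    current = None  # span being extended, as [start, end]
    for (start, end), score in zip(char_ranges, risk_scores):
        if score >= threshold:
            if current is None:
                current = [start, end]
            else:
                current[1] = end  # extend the open span
        elif current is not None:
            spans.append(tuple(current))
            current = None
    if current is not None:
        spans.append(tuple(current))
    return spans

ranges = [(0, 1), (1, 5), (5, 9), (9, 12)]
print(flag_spans(ranges, [0.1, 0.8, 0.7, 0.2]))  # prints [(1, 9)]
```

Spans in this form can be compared directly against character-level annotations, for example with span-based metrics.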
[0055] To demonstrate the effectiveness of this invention, hallucination detection is performed on potential hallucination tokens in the MLLM output, and the results are compared with methods that use large enterprise-grade models for fine-grained hallucination detection. Systematic experiments are conducted on the MHALO dataset, using F1M and F1IoU as the core evaluation metrics. F1M is calculated as follows: for each ground-truth hallucination span, the maximum overlap ratio between that span and all predicted spans is taken as the recall; for each predicted span, the maximum overlap ratio between that span and all ground-truth spans is taken as the precision; finally, the F1 scores are averaged over all spans to obtain F1M. F1IoU, in contrast, performs optimal matching based on intersection over union (IoU): an IoU matrix is constructed between the predicted spans and the ground-truth spans, and a pair is considered a valid match only when its IoU reaches a threshold $\tau$. The Hungarian algorithm is then used to find a maximum matching. Finally, with $M$ matched pairs, $N_{\text{pred}}$ predicted spans, and $N_{\text{gt}}$ ground-truth spans, the precision $\text{Prec} = M / N_{\text{pred}}$ and the recall $\text{Rec} = M / N_{\text{gt}}$ are computed, and the F1 score is: $F1_{\text{IoU}} = \frac{2 \cdot \text{Prec} \cdot \text{Rec}}{\text{Prec} + \text{Rec}}$
[0056] The experimental results are shown in Table 1. The token-level hallucination detection method based on visual attention and generation uncertainty proposed in this invention can effectively identify local factual errors in the MLLM generation process, and outperforms current mainstream multimodal large language models on both the F1M and F1IoU metrics. This demonstrates that the feature system and lightweight fusion mechanism designed in this invention can accurately detect token-level hallucinations without relying on external large models, and have good application potential.
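The IoU-based matching can be sketched as follows. Two points here are assumptions rather than details from the source: the threshold value of 0.5 is a placeholder, and the Hungarian algorithm is replaced by augmenting-path bipartite matching (Kuhn's algorithm), which gives the maximum number of one-to-one matches when pairs below the threshold are discarded before matching.

```python
def iou(a, b):
    """Intersection over union of two [start, end) character spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def max_matching(pred, gold, tau):
    """Size of a maximum one-to-one matching over pairs with IoU >= tau,
    found with augmenting paths (Kuhn's algorithm)."""
    edges = [[j for j, g in enumerate(gold) if iou(p, g) >= tau]
             for p in pred]
    match_of_gold = [-1] * len(gold)

    def try_assign(i, seen):
        for j in edges[i]:
            if j in seen:
                continue
            seen.add(j)
            # Take a free gold span, or re-route its current partner.
            if match_of_gold[j] == -1 or try_assign(match_of_gold[j], seen):
                match_of_gold[j] = i
                return True
        return False

    return sum(1 for i in range(len(pred)) if try_assign(i, set()))

def f1_iou(pred, gold, tau=0.5):
    """F1 from matched-pair counts; tau=0.5 is an assumed threshold."""
    if not pred or not gold:
        return 0.0
    m = max_matching(pred, gold, tau)
    precision = m / len(pred)
    recall = m / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

pred = [(0, 4), (10, 14)]
gold = [(0, 4), (20, 25), (30, 35)]
print(round(f1_iou(pred, gold), 3))  # prints 0.4
```

In the usage example one of two predictions matches one of three ground-truth spans, so precision is 1/2, recall is 1/3, and the F1 score is 0.4.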
[0057] Table 1 Comparison of Fine-Grained Hallucination Detection Performance
[0058] This invention also provides a multimodal large model hallucination detection system, comprising: a feature acquisition module, used to obtain the tokens output by the multimodal large language model and their four-dimensional features, wherein the tokens and their four-dimensional features are obtained based on the text and image samples input to the multimodal large language model; and a hallucination detection module, used to input the tokens and their four-dimensional features into the pre-trained hallucination detection model and output a hallucination risk score. The pre-trained hallucination detection model is trained as follows: character-level hallucination annotation is performed on text and image samples of the large language model; teacher-forcing inference is performed on the annotated text and image samples to obtain the logits sequence and cross-layer attention weights corresponding to each generated token; four-dimensional token-level hallucination discrimination features are extracted based on the logits sequence and the cross-layer attention weights corresponding to each generated token, the features comprising generation probability, language entropy, visual attention ratio, and visual attention entropy; a token-level label is obtained for each token based on the character-level hallucination annotation; and a lightweight nonlinear fusion detector is trained based on the token-level labels and the four-dimensional token-level hallucination discrimination features to obtain the hallucination detection model.
[0059] As shown in Figure 3, another embodiment of the present invention also provides a multimodal large model hallucination detection system, comprising: a feature acquisition module, used to obtain the tokens output by the multimodal large language model and their four-dimensional features, wherein the tokens and their four-dimensional features are obtained based on the text and image samples input to the multimodal large language model; and a hallucination detection module, used to input the tokens and their four-dimensional features into the pre-trained hallucination detection model and output a hallucination risk score. The pre-trained hallucination detection model is trained as follows: character-level hallucination annotation is performed on text and image samples of the large language model; teacher-forcing inference is performed on the annotated text and image samples to obtain the logits sequence and cross-layer attention weights corresponding to each generated token; four-dimensional token-level hallucination discrimination features are extracted based on the logits sequence and the cross-layer attention weights corresponding to each generated token, the features comprising generation probability, language entropy, visual attention ratio, and visual attention entropy; a token-level label is obtained for each token based on the character-level hallucination annotation; and a lightweight nonlinear fusion detector is trained based on the token-level labels and the four-dimensional token-level hallucination discrimination features to obtain the hallucination detection model.
[0060] As shown in Figure 4, another embodiment of the present invention also provides a hallucination detection model training system, including a feature extraction module, a label alignment module, a detector training module, and a hallucination detection module. The feature extraction module is used to perform character-level hallucination annotation on text and image samples of the multimodal large language model, perform teacher-forcing inference on the annotated text and image samples, obtain the logits sequence and cross-layer attention weights corresponding to each generated token, and extract four-dimensional token-level hallucination discrimination features based on the logits sequence and cross-layer attention weights corresponding to each generated token. The four-dimensional token-level hallucination discrimination features include generation probability, language entropy, visual attention ratio, and visual attention entropy.
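The four features named above can be illustrated with a minimal sketch in plain Python. It is not the applicant's implementation: the function name `token_features`, the input layout (one head-averaged attention row per decoder layer), and the choice to normalize the layer-summed image attention before taking its entropy are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of unnormalized scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def shannon_entropy(dist):
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

def token_features(logits, token_id, layer_attn, image_positions):
    """Return [generation probability, language entropy,
    visual attention ratio, visual attention entropy] for one token.

    logits          : unnormalized vocabulary scores at this step
    token_id        : vocabulary index of the token fed under teacher forcing
    layer_attn      : non-empty list with one row per decoder layer; each row is
                      the (assumed head-averaged) attention from the current
                      token to every position in the sequence
    image_positions : indices of the image tokens in the sequence (assumed known)
    """
    probs = softmax(logits)
    gen_prob = probs[token_id]
    lang_entropy = shannon_entropy(probs)

    ratios = []
    visual_mass = [0.0] * len(image_positions)
    for row in layer_attn:
        total = sum(row)
        vis = [row[i] for i in image_positions]
        # Share of this layer's attention that lands on image tokens.
        ratios.append(sum(vis) / total if total > 0 else 0.0)
        for k, a in enumerate(vis):
            visual_mass[k] += a
    vis_ratio = sum(ratios) / len(ratios)  # average over layers

    # Assumption: normalize the layer-summed image attention to a distribution.
    mass = sum(visual_mass)
    vis_dist = [a / mass for a in visual_mass] if mass > 0 else []
    vis_entropy = shannon_entropy(vis_dist)
    return [gen_prob, lang_entropy, vis_ratio, vis_entropy]
```

In practice the logits and attention rows would come from the multimodal model's forward pass; the sketch only shows the arithmetic applied to them.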
[0061] The label alignment module is used to obtain a token-level label for each token based on the character-level hallucination annotations.
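The alignment step can be sketched as an interval-overlap check. The sketch assumes half-open character offsets `(start, end)` for both tokens and annotated spans, and assumes the tokenizer can report each token's character offsets; the function name `token_labels` is hypothetical.

```python
def token_labels(token_offsets, hallucination_spans):
    """Map character-level hallucination spans to token-level labels.

    token_offsets       : list of (start, end) half-open character offsets,
                          one pair per token of the reference answer
    hallucination_spans : list of (start, end) half-open character spans
                          annotated as hallucinated
    Returns a list of 0/1 labels, 1 when the token overlaps any span.
    """
    labels = []
    for tok_start, tok_end in token_offsets:
        overlaps = any(
            tok_start < span_end and span_start < tok_end
            for span_start, span_end in hallucination_spans
        )
        labels.append(1 if overlaps else 0)
    return labels
```

For the answer "a red cat" with token offsets `[(0, 1), (2, 5), (6, 9)]` and the annotated span `(2, 5)` covering "red", only the middle token is labeled 1. A token that is only partly covered by a span is also labeled 1 under this rule.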
[0062] The detector training module is used to train a lightweight nonlinear fusion detector based on token-level labels and four-dimensional token-level hallucination discrimination features to obtain a hallucination detection model.
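To make the training step concrete without depending on any external library, the following sketch trains a very small gradient-boosted ensemble of decision stumps on the four features with a class-weighted logistic loss. It is a stand-in for illustration, not the gradient boosting tree library named in the claims; the regularization term is omitted, and all names (`train_boosted_stumps`, `risk_score`, `pos_weight`) and hyperparameter values are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def weighted_bce(labels, probs, pos_weight):
    """Mean class-weighted binary cross-entropy (positives weighted by pos_weight)."""
    eps = 1e-12
    total = 0.0
    for y, p in zip(labels, probs):
        w = pos_weight if y == 1 else 1.0
        total -= w * (y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps))
    return total / len(labels)

def fit_stump(X, residuals, weights):
    """Find the single (feature, threshold) split whose two leaf values,
    each the weighted mean of the residuals in the leaf, give the lowest
    weighted squared error. Returns None when no valid split exists."""
    def leaf(idx):
        return sum(weights[i] * residuals[i] for i in idx) / sum(weights[i] for i in idx)

    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            left = [i for i, row in enumerate(X) if row[j] <= thr]
            right = [i for i, row in enumerate(X) if row[j] > thr]
            if not left or not right:
                continue
            lv, rv = leaf(left), leaf(right)
            err = sum(weights[i] * (residuals[i] - lv) ** 2 for i in left)
            err += sum(weights[i] * (residuals[i] - rv) ** 2 for i in right)
            if best is None or err < best[0]:
                best = (err, j, thr, lv, rv)
    return None if best is None else best[1:]

def train_boosted_stumps(X, y, rounds=20, lr=0.5, pos_weight=1.0):
    """First-order gradient boosting on the class-weighted logistic loss."""
    weights = [pos_weight if label == 1 else 1.0 for label in y]
    scores = [0.0] * len(y)
    stumps = []
    for _ in range(rounds):
        # Per-sample residual (y - p); the class weight enters through the stump fit.
        residuals = [label - sigmoid(s) for label, s in zip(y, scores)]
        stump = fit_stump(X, residuals, weights)
        if stump is None:
            break
        j, thr, lv, rv = stump
        stumps.append((j, thr, lr * lv, lr * rv))
        scores = [s + (lr * lv if row[j] <= thr else lr * rv) for s, row in zip(scores, X)]
    return stumps

def risk_score(stumps, features):
    """Hallucination risk in (0, 1) for one four-dimensional feature vector."""
    raw = sum(lv if features[j] <= thr else rv for j, thr, lv, rv in stumps)
    return sigmoid(raw)
```

A production setup would more likely hand the same feature matrix and labels to an off-the-shelf gradient boosting library and let it handle regularization and class weighting; the sketch only shows how the weighted loss drives the fit.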
[0063] The hallucination detection module, based on a trained detector, performs real-time hallucination risk scoring on any token generated during MLLM inference, enabling token-level hallucination monitoring and early warning.
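The monitoring and early-warning behavior described above can be sketched as a generator that scores each token as it arrives. The names `monitor_tokens` and `score_fn`, the shape of the stream, and the default warning threshold of 0.5 are assumptions for illustration; the source does not specify a threshold.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

def monitor_tokens(
    stream: Iterable[Tuple[str, List[float]]],
    score_fn: Callable[[List[float]], float],
    threshold: float = 0.5,  # assumed warning threshold
) -> Iterator[Tuple[str, float, bool]]:
    """Yield (token, risk, warn) for each generated token.

    stream   : pairs of (token text, four-dimensional feature vector),
               produced one at a time during decoding
    score_fn : trained detector mapping a feature vector to a risk in [0, 1]
    """
    for token, features in stream:
        risk = score_fn(features)
        yield token, risk, risk >= threshold


# Usage with a toy detector that treats low generation probability as risky.
toy_detector = lambda f: 1.0 - f[0]
tokens = [("a", [0.9, 0.2, 0.6, 1.0]), ("dog", [0.2, 1.5, 0.1, 2.0])]
flags = [warn for _, _, warn in monitor_tokens(tokens, toy_detector)]
```

Because the generator consumes the stream lazily, a warning can be raised as soon as a risky token is produced rather than after the full answer is generated.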
[0064] A computer device is provided according to an embodiment of the present invention. This computer device includes a processor, a memory, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the steps in the various method embodiments described above. Alternatively, when the processor executes the computer program, it implements the functions of each module / unit in the various device embodiments described above.
[0065] The computer program can be divided into one or more modules / units, which are stored in the memory and executed by the processor to complete the present invention.
[0066] The computer device may be a desktop computer, laptop, handheld computer, or cloud server, etc. The computer device may include, but is not limited to, a processor and memory.
[0067] The processor may be a central processing unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
[0068] The memory can be used to store the computer program and / or module, and the processor implements various functions of the computer device by running or executing the computer program and / or module stored in the memory, and by calling the data stored in the memory.
[0069] If the modules / units integrated into the computer device are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments can also be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. The computer-readable medium can include: any entity or device capable of carrying the computer program code, recording media, USB flash drives, portable hard drives, magnetic disks, optical disks, computer memory, read-only memory, random access memory, electrical carrier signals, telecommunication signals, and software distribution media, etc. It should be noted that the content included in the computer-readable medium can be appropriately added or removed according to the requirements of legislation and patent practice in the jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.
[0070] The foregoing has shown and described the basic principles, main features, and advantages of the present invention. It will be apparent to those skilled in the art that the invention is not limited to the details of the exemplary embodiments described above, and that the invention can be implemented in other specific forms without departing from its spirit or essential characteristics. Therefore, the embodiments should be considered illustrative and non-limiting in all respects, and the scope of the invention is defined by the appended claims rather than the foregoing description. Thus, all variations falling within the meaning and scope of equivalents of the claims are intended to be included within the scope of the invention. No reference numerals in the claims should be construed as limiting the scope of the claims.
[0071] Furthermore, it should be understood that although this specification describes embodiments, not every embodiment contains only one independent technical solution. This narrative style is merely for clarity. Those skilled in the art should consider the specification as a whole, and the technical solutions in each embodiment can be appropriately combined to form other embodiments that can be understood by those skilled in the art. The above content is only for illustrating the technical concept of the present invention and should not be construed as limiting the scope of protection of the present invention. Any modifications made based on the technical concept proposed in this invention shall fall within the scope of protection of the claims of this invention.
Claims
1. A multimodal large model hallucination detection method, characterized in that it comprises: acquiring the tokens output by the multimodal large language model and their four-dimensional features, wherein the tokens and their four-dimensional features are obtained based on the text and image samples input to the multimodal large language model; and inputting the tokens and their four-dimensional features into the pre-trained hallucination detection model to output a hallucination risk score. The pre-trained hallucination detection model is trained as follows: character-level hallucination annotation is performed on text and image samples of the multimodal large language model; teacher-forcing inference is then performed on the annotated text and image samples to obtain the logits sequence and cross-layer attention weights corresponding to each generated token; four-dimensional token-level hallucination discrimination features are extracted based on the logits sequence and the cross-layer attention weights corresponding to each generated token, the features comprising generation probability, language entropy, visual attention ratio, and visual attention entropy; a token-level label is obtained for each token based on the character-level hallucination annotation; and a lightweight nonlinear fusion detector is trained based on the token-level labels and the four-dimensional token-level hallucination discrimination features to obtain the hallucination detection model.
2. The multimodal large model hallucination detection method according to claim 1, characterized in that the teacher-forcing inference performed on the text and image samples after character-level hallucination annotation, to obtain the logits sequence and cross-layer attention weights corresponding to each generated token, is specifically as follows: a sample is given consisting of an image $I$, a text prompt $P$, a reference answer $R$, and character-level hallucination span annotations $S = \{(s_k, e_k)\}$, where $s_k$ and $e_k$ indicate the start and end character positions of the $k$-th hallucination annotation in the original string; the complete input sequence is constructed as $X = [P; R]$; the input sequence $X$ and the image $I$ are input into the multimodal large language model and teacher-forcing inference is performed; during inference, the logits sequence $Z = \{z_t\}$ of each token is obtained in real time, where $z_t \in \mathbb{R}^{|\mathcal{V}|}$ is the unnormalized score that the model predicts for each vocabulary token at step $t$; and the cross-layer attention weights $\{A^{(l)}\}_{l=1}^{L}$ are obtained, where $A^{(l)} \in \mathbb{R}^{N_h \times T \times T}$ is the attention matrix of the $l$-th layer, $N_h$ is the number of attention heads, and $T$ is the total number of tokens.
3. The multimodal large model hallucination detection method according to claim 1, characterized in that the four-dimensional token-level hallucination discrimination features are extracted based on the logits sequence and the cross-layer attention weights corresponding to each generated token, specifically as follows: the generation probability is calculated as $p_t = \mathrm{softmax}(z_t)[x_t]$, where $z_t$ is the logits vector output by the multimodal large language model at step $t$, $x_t$ is the token at that step, and $t$ is the position index of the token; the language entropy is calculated according to Shannon's definition of entropy as $H^{\mathrm{lang}}_t = -\sum_{v \in \mathcal{V}} p_t(v) \log p_t(v)$, where $p_t(v) = \mathrm{softmax}(z_t)[v]$, $\mathcal{V}$ represents the model vocabulary set, and $v$ represents any candidate token in the vocabulary; the visual attention ratio is calculated as $r_t = \frac{1}{L} \sum_{l=1}^{L} \frac{\sum_{i=1}^{N_{\mathrm{img}}} \bar{a}^{(l)}_{t,i}}{\sum_{j=1}^{T} \bar{a}^{(l)}_{t,j}}$, where $N_{\mathrm{img}}$ is the number of image tokens, $T$ is the total length of the entire token sequence, $L$ is the number of decoder layers of the large model, and $\bar{a}^{(l)}_{t,j}$ is the average attention that the $l$-th layer assigns to token $j$ when predicting the $t$-th token; and the visual attention entropy is calculated according to the Shannon entropy definition as $H^{\mathrm{vis}}_t = -\sum_{i=1}^{N_{\mathrm{img}}} \tilde{a}_{t,i} \log \tilde{a}_{t,i}$, where $\tilde{a}_{t,i}$ represents the normalized distribution of visual attention.
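4. The multimodal large model hallucination detection method according to claim 1, characterized in that the token-level label for each token is obtained based on the character-level hallucination annotation, specifically as follows: a mapping is established from each token to its character position in the original string, $t \mapsto [c^{\mathrm{start}}_t, c^{\mathrm{end}}_t)$; if there exists a hallucination span $(s_k, e_k) \in S$ such that $[c^{\mathrm{start}}_t, c^{\mathrm{end}}_t) \cap [s_k, e_k) \neq \varnothing$, the token is labeled as hallucinated, that is, $y_t = 1$; otherwise $y_t = 0$.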
5. The multimodal large model hallucination detection method according to claim 1, characterized in that the lightweight nonlinear fusion detector is a gradient boosting tree model.
6. The multimodal large model hallucination detection method according to claim 5, characterized in that the lightweight nonlinear fusion detector is trained based on the token-level labels and the four-dimensional token-level hallucination discrimination features to obtain the hallucination detection model, specifically as follows: the four-dimensional feature vectors $f_t = [p_t, H^{\mathrm{lang}}_t, r_t, H^{\mathrm{vis}}_t]$ and the token labels $y_t$ are used to construct the training set $D = \{(f_t, y_t)\}$, which is input to the LightGBM classifier; the weighted binary cross-entropy loss is optimized to complete the training: $\mathcal{L} = -\sum_{t} w_t \left[ y_t \log \hat{y}_t + (1 - y_t) \log (1 - \hat{y}_t) \right] + \lambda \Omega$, where $\hat{y}_t$ is the output of the LightGBM model, $w_t$ is the weight applied to sample $t$, $\Omega$ is the built-in regularization term, and $\lambda$ is the regularization coefficient; after training, the model outputs a hallucination risk score $\hat{y}_t \in [0, 1]$.
7. A multimodal large model hallucination detection system, characterized in that it comprises: a feature acquisition module, configured to acquire the tokens output by the multimodal large language model and their four-dimensional features, wherein the tokens and their four-dimensional features are obtained based on the text and image samples input to the multimodal large language model; and a hallucination detection module, configured to input the tokens and their four-dimensional features into the pre-trained hallucination detection model and output a hallucination risk score. The pre-trained hallucination detection model is trained as follows: character-level hallucination annotation is performed on text and image samples of the multimodal large language model; teacher-forcing inference is then performed on the annotated text and image samples to obtain the logits sequence and cross-layer attention weights corresponding to each generated token; four-dimensional token-level hallucination discrimination features are extracted based on the logits sequence and the cross-layer attention weights corresponding to each generated token, the features comprising generation probability, language entropy, visual attention ratio, and visual attention entropy; a token-level label is obtained for each token based on the character-level hallucination annotation; and a lightweight nonlinear fusion detector is trained based on the token-level labels and the four-dimensional token-level hallucination discrimination features to obtain the hallucination detection model.
8. A computer device, comprising a memory, a processor, and a computer program stored in the memory, characterized in that the processor executes the computer program to implement the steps of the multimodal large model hallucination detection method according to any one of claims 1-6.
9. A computer-readable storage medium having a computer program stored thereon, characterized in that, when executed by a processor, the computer program implements the steps of the multimodal large model hallucination detection method according to any one of claims 1-6.
10. A computer program product, comprising a computer program, characterized in that, when executed by a processor, the computer program implements the steps of the multimodal large model hallucination detection method according to any one of claims 1-6.