Hallucination diagnosis method of large language model based on attention intervention

By collecting attention layer state data in a large language model and constructing a hallucination diagnosis model, the problems of lag and poor reliability in hallucination diagnosis of large language models in high reliability scenarios are solved. Real-time and accurate hallucination risk assessment and early warning are achieved, and it is applicable to large language models of various Transformer architectures.

CN122221031A · Pending · Publication Date: 2026-06-16 · BEIJING LINX SOFTWARE CORP

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING LINX SOFTWARE CORP
Filing Date
2026-03-27
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Existing large language models suffer from problems such as lag in hallucination diagnosis, poor reliability, poor accuracy, poor stability, and insufficient versatility in high-reliability scenarios such as medicine, law, and finance.

Method used

By collecting state data of the attention layer during the generation of lexical units in a large language model, using hook functions for feature engineering, a hallucination diagnosis model is constructed, outputting hallucination risk scores and type prediction results, and generating warning information and diagnostic reports.

Benefits of technology

It realizes real-time hallucination diagnosis in the process of large language model outputting lexical units, improves the real-time performance, reliability and accuracy of hallucination diagnosis, is applicable to large language models of various Transformer architectures, has universality, and does not depend on specific tasks or external knowledge bases.



Abstract

The present application relates to a hallucination diagnosis method for large language models based on attention intervention, and belongs to the technical field of computers. The method comprises the following steps: while the large language model generates lexical units, state data of the attention layers is collected through hook functions registered on the forward computation functions of the attention layers corresponding to different hallucination types, the state data including the attention score vector of an attention head and the output activation vector of the feedforward network; according to the different hallucination types, the state data collected by each hook function is subjected to corresponding feature engineering processing to obtain an input vector for each hallucination type; the input vectors for the different hallucination types are input into a pre-trained hallucination diagnosis model, which outputs a hallucination risk score and a hallucination type prediction result; the abnormal training samples used to train the hallucination diagnosis model are hallucination samples obtained by attention intervention on the large language model. The present application improves the real-time performance, reliability, accuracy, stability and universality of hallucination diagnosis.

Description

Technical Field

[0001] This invention relates to the field of computer technology, and in particular to a method for diagnosing hallucinations in a large language model based on attention intervention.

Background Technology

[0002] Large Language Models (LLMs), based on the Transformer architecture, have demonstrated exceptional capabilities in tasks such as text generation, dialogue systems, and code generation. However, their pervasive "hallucination" problem, in which the generated content is inconsistent with the real world or with the given information in terms of facts, logic, or context, has become a key obstacle restricting their application in high-reliability scenarios such as healthcare, law, and finance.

[0003] Current diagnostic methods for large language model hallucinations mainly fall into the following categories:

1. Posterior validation: After the model generates the complete text, an external knowledge base (such as a search engine or knowledge graph) or a separate validation model is called to perform fact-checking. This method suffers from high response latency and, because it relies on the completeness of external resources, poor generalizability.

[0004] 2. Uncertainty Measurement Method: This method assesses the confidence of generated content by analyzing the probability distribution, entropy value, or consistency of multiple samplings of output lexical units. However, large language models often output hallucinatory content with high confidence, leading to a high false negative rate and insufficient reliability and accuracy.

[0005] 3. Prompt Engineering and Constrained Generation Method: This method guides the model to "self-check" by designing input prompts or adding generation constraints. However, its effectiveness depends heavily on the design of the prompts, and its stability is poor.

Summary of the Invention

[0006] Based on the above analysis, the embodiments of the present invention aim to provide a hallucination diagnosis method for large language models based on attention intervention, to solve the problems of diagnostic lag, poor reliability, poor accuracy, poor stability and insufficient universality in existing hallucination diagnosis methods for large language models.

[0007] On one hand, embodiments of the present invention provide a method for diagnosing hallucinations in a large language model based on attention intervention. This method includes: during the generation of lexical units in the large language model to be diagnosed, collecting state data of corresponding attention layers through hook functions registered in one or more preset attention layers corresponding to different hallucination types; wherein the state data includes an attention score vector of a preset attention head and an output activation vector of the feedforward network; according to different hallucination types, performing corresponding feature engineering processing on the state data collected by each hook function to obtain first input vectors corresponding to different hallucination types; inputting the first input vectors corresponding to different hallucination types into a pre-trained hallucination diagnosis model, and outputting the hallucination risk score and hallucination type prediction result of the large language model to be diagnosed; wherein the abnormal training samples during the training of the hallucination diagnosis model are samples exhibiting hallucinations obtained through attention intervention.

[0008] Based on a further improvement of the above method, the feature engineering process includes: calculating attention features pre-selected from weight distribution entropy, attention Gini coefficient, maximum attention value, key entity attention decay ratio, attention distribution kurtosis, attention distribution skewness, proportion of the top 3 attention positions, and attention focus shift variance based on the attention score vector; calculating activation features pre-selected from activation vector L2 norm ratio, activation sparsity, activation kurtosis, activation skewness, principal component 1 score, principal component 2 score, abnormal dimension activation ratio, and cross-head activation correlation after dimensionality reduction processing of the output activation vector; wherein the number of attention features is greater than the number of activation features; and performing vector concatenation processing on the attention features and activation features in sequence.

[0009] Based on a further improvement of the above method, after outputting the hallucination risk score and hallucination type prediction result of the large language model to be diagnosed, the method further includes: deriving a risk trend based on the hallucination risk scores of the current word and multiple words in the preceding and following context; and generating and sending a warning message to a preset terminal when the hallucination risk scores of multiple consecutive words including the current word are all greater than a preset risk score threshold and / or the hallucination risk scores show an increase in hallucination risk.

[0010] Based on a further improvement of the above method, the state data also includes the input representation and output representation of the attention layer, and the logits received by the language model head before the lexical unit is output; after outputting the hallucination risk score and hallucination type prediction result of the large language model to be diagnosed, the method further includes: obtaining the predicted hallucination type based on the hallucination type prediction result, and determining the hook function corresponding to the hallucination type and the attention layer on which the hook function is registered; performing anomaly analysis and localization by analyzing at least one of the attention score vector, output activation vector, input representation, output representation, and logits of the corresponding attention layer collected by the hook function corresponding to the hallucination type, and generating a diagnostic report.

[0011] Based on a further improvement of the above method, the method for obtaining abnormal training samples includes: obtaining, from a corpus, a first input sample for inputting into the large language model; inputting the first input sample into the large language model, and determining a target word based on the first input sample and the type of hallucination to be induced; wherein, when the intervention triggering time of the hallucination type occurs, the hook function corresponding to the hallucination type intervenes in the attention score vector according to a preset intervention strategy before collecting the attention score vector; wherein the intervention triggering time includes the instant at which the attention score vector used to generate the target word is produced; in response to the large language model hallucinating when generating the target word, obtaining the state data collected by the hook functions corresponding to each hallucination type during generation of the target word, and performing, according to the different hallucination types, corresponding feature engineering processing on the state data collected by each hook function to obtain second input vectors corresponding to the different hallucination types; constructing first input data for training based on the second input vectors corresponding to each hallucination type; constructing first label data based on the hallucination situation of the large language model with respect to the generation result of the target word; and constructing abnormal training samples for training the hallucination diagnosis model based on the first input data and the first label data; wherein the first label data includes first label information indicating that a hallucination occurred and second label information indicating the hallucination type.

[0012] A further improvement to the above method is that the intervention strategy includes adding a random noise vector to the attention score vector.

[0013] Based on a further improvement of the above method, before intervening in the attention score vector according to the preset intervention strategy, the method further includes: controlling the intervention intensity by adjusting the intervention parameters of the intervention strategy.
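The noise-based intervention and its adjustable intensity can be illustrated with a minimal sketch. It is not the patented implementation: it assumes the attention score vector is an already normalized weight vector, that the noise is Gaussian, and that the result is clipped and renormalized. The function name `intervene` and the parameter `intensity` are hypothetical.

```python
import random


def intervene(attn_weights, intensity, rng):
    """Perturb a normalized attention vector with Gaussian noise.

    `intensity` is the noise standard deviation (a hypothetical
    intervention parameter); larger values distort attention more.
    """
    noisy = [max(w + rng.gauss(0.0, intensity), 0.0) for w in attn_weights]
    total = sum(noisy)
    if total == 0.0:
        # Every entry was clipped to zero; fall back to the original vector.
        return list(attn_weights)
    # Renormalize so the result is still a probability distribution
    # (an assumption; the source does not say whether this is done).
    return [w / total for w in noisy]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
clean = [0.70, 0.20, 0.05, 0.05]
mild = intervene(clean, intensity=0.01, rng=rng)
strong = intervene(clean, intensity=0.50, rng=rng)
```

Raising `intensity` is one plausible way to control how strongly the model is pushed toward a hallucination; an intensity of zero leaves the attention vector unchanged.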

[0014] Based on a further improvement of the above method, the method further includes: obtaining a second input sample from a corpus for inputting into the large language model; inputting the second input sample into the large language model; during the generation of lexical units by the large language model, in response to the absence of hallucinations in the generation of the lexical units by the large language model, obtaining the state data collected by the hook function corresponding to each hallucination type during the generation of the lexical units, and performing corresponding feature engineering processing on the state data collected by each hook function according to the different hallucination types to obtain third input vectors corresponding to different hallucination types; constructing second input data for training based on the third input vectors corresponding to each hallucination type, constructing second label data based on the hallucination situation of the generation results of the lexical units by the large language model, and constructing normal training samples for training the hallucination diagnosis model based on the second input data and the second label data; wherein, the second label data includes third marker information indicating that hallucinations did not occur.

[0015] Based on a further improvement of the above method, after the abnormal training samples and the normal training samples are constructed, the method further includes: training a pre-selected lightweight model based on the normal training samples and the abnormal training samples to obtain the hallucination diagnosis model; wherein, the hallucination diagnosis model includes a regression branch and a classification branch, and during training, for the first input data, an output label is set for the regression branch based on the first labeling information, and an output label is set for the classification branch based on the second labeling information; for the second input data, an output label is set for the regression branch and the classification branch based on the third labeling information.

[0016] Based on a further improvement of the above method, the loss function during training of the hallucination diagnosis model is the weighted sum of the regression loss of the regression branch and the classification loss of the classification branch; wherein, the regression loss is the mean square error between the predicted risk score and the true label, and the classification loss is the cross-entropy between the predicted hallucination type probability and the true hallucination type label.
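Written out, the loss described above could take the following form. The symbols are assumptions introduced for illustration: N is the number of training samples in a batch, K the number of hallucination types, s_i and \hat{s}_i the true and predicted risk scores, y_{i,k} the one-hot type label, \hat{p}_{i,k} the predicted type probability, and \lambda_{reg}, \lambda_{cls} the weights, whose values the source does not give.

```latex
\mathcal{L} = \lambda_{\mathrm{reg}} \, \mathcal{L}_{\mathrm{reg}} + \lambda_{\mathrm{cls}} \, \mathcal{L}_{\mathrm{cls}},
\qquad
\mathcal{L}_{\mathrm{reg}} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{s}_i - s_i \right)^2,
\qquad
\mathcal{L}_{\mathrm{cls}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \log \hat{p}_{i,k}
```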

[0017] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the large language model hallucination diagnosis method based on attention intervention as described above.

[0018] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the large language model hallucination diagnosis method based on attention intervention as described above.

[0019] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the large language model hallucination diagnosis method based on attention intervention as described above.

[0020] The present invention provides a hallucination diagnosis method for large language models based on attention intervention. This method pre-trains a hallucination diagnosis model by intervening in the parameters of the large language model during its operation to generate abnormal training samples. During the actual generation of lexical units by the large language model, attention score vectors and output activation vectors are collected through registered hook functions. The first input vector, obtained through feature engineering, is then input into the hallucination diagnosis model, outputting a hallucination risk score and a hallucination type prediction result. This achieves online hallucination diagnosis during the output of lexical units by the large language model, improving the real-time performance, reliability, accuracy, and stability of the hallucination diagnosis. Furthermore, it is applicable to the deployment of large language models with various Transformer architectures, does not depend on specific tasks or external knowledge bases, and possesses versatility.

[0021] In this invention, the above-described technical solutions can be combined with each other to achieve more preferred combinations. Other features and advantages of this invention will be set forth in the following description, and some advantages may become apparent from the description or be learned by practicing the invention. The objects and other advantages of this invention can be realized and obtained from what is particularly pointed out in the description and drawings.

Attached Figure Description

[0022] The accompanying drawings are for illustrative purposes only and are not intended to limit the invention. Throughout the drawings, the same reference numerals denote the same parts. Figure 1 is a flowchart of the hallucination diagnosis method for large language models based on attention intervention provided by the present invention.

[0023] Figure 2 is a schematic diagram of the physical structure of an electronic device.

Detailed Implementation

[0024] Preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings, which form part of this application and are used together with the embodiments of the present invention to illustrate the principles of the present invention, but are not intended to limit the scope of the present invention.

[0025] Figure 1 is a flowchart of the hallucination diagnosis method for large language models based on attention intervention provided by this invention. As shown in Figure 1, the method includes: Step S1: During the generation of lexical units by the large language model, the state data of the corresponding attention layer is collected by hook functions registered in one or more preset attention layers corresponding to different hallucination types; wherein the state data includes the attention score vector of the preset attention head and the output activation vector of the feedforward network.

[0026] While the large language model generates lexical units, for example in response to input text, the present invention assesses hallucination risk from the internal state data it collects. Risk is assessed per lexical unit, for each lexical unit about to be output, so that hallucination risk can be determined in real time before the large language model outputs the complete result.

[0027] The hallucination diagnosis method provided by this invention can monitor various types of hallucinations. Hallucination types can include factual contradictions, attribute mismatches, logical conflicts, irrelevant generation, and temporal errors. For each type of hallucination, hook functions are pre-registered in the forward computation functions of one or more attention layers. These hook functions are used to collect state data from the corresponding registered attention layers during the generation of lexical units in a large language model.

[0028] Depending on the type of hallucination and the specific large language model, the attention layer registered by the hook function can be different. The attention score vector and the output activation vector of the feedforward network are important parameters of the large language model. The state data collected by the hook function includes the attention score vector of the pre-defined attention head and the output activation vector of the feedforward network. For different types of hallucinations, the attention heads corresponding to the collected attention score vectors can also be different.

[0029] For example, for the "attribute mismatch" hallucination type, the intermediate attention layers of the decoder are usually responsible for semantic integration and relational reasoning, which is a key step affecting attribute generation. Therefore, hook functions can be registered in an intermediate attention layer of the decoder. If the large language model's decoder has 24 attention layers, the hook function can be registered at the 12th layer, and the attention score vector of the 5th attention head can be collected. For the "temporal error" hallucination type, the hook function can be registered in the attention layer responsible for temporal correlation processing. It will be understood that, depending on the selected large language model, the attention layer in which the hook function is registered for each hallucination type can be determined in advance through theoretical analysis and experiments.
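The registration pattern in this example can be sketched without any deep learning framework. Everything below is a toy stand-in: `ToyAttentionLayer`, `HOOK_PLAN`, and `make_collector` are hypothetical names, and the layer simply replays fixed attention vectors. In PyTorch the corresponding mechanism is `torch.nn.Module.register_forward_hook`, whose hooks receive the module, its input, and its output.

```python
from collections import defaultdict

# Hypothetical plan: hallucination type -> (layer number, head number).
# The 12/5 pair mirrors the example in the text; other entries would be
# chosen per model through analysis and experiments.
HOOK_PLAN = {"attribute_mismatch": (12, 5)}


class ToyAttentionLayer:
    """Stand-in for one attention layer (not a real framework class)."""

    def __init__(self, number, weights_by_head):
        self.number = number
        self.weights_by_head = weights_by_head  # head number -> attention vector
        self._hooks = []

    def register_forward_hook(self, hook):
        self._hooks.append(hook)

    def forward(self):
        output = self.weights_by_head
        for hook in self._hooks:  # hooks observe the output, they do not change it
            hook(self, output)
        return output


def make_collector(h_type, head, store):
    """Build a hook that records one head's attention vector for `h_type`."""
    def hook(layer, output):
        store[h_type].append(
            {"layer": layer.number, "head": head, "attn": list(output[head])}
        )
    return hook


layers = {12: ToyAttentionLayer(12, {5: [0.6, 0.3, 0.1]})}
collected = defaultdict(list)
for h_type, (layer_no, head_no) in HOOK_PLAN.items():
    layers[layer_no].register_forward_hook(make_collector(h_type, head_no, collected))

layers[12].forward()  # one generation step triggers the hook
```

Keeping the plan as data makes it straightforward to register a different layer and head for each hallucination type and each model.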

[0030] Step S2: Based on the different types of hallucinations, perform corresponding feature engineering processing on the state data collected by each hook function to obtain the first input vector corresponding to different types of hallucinations.

[0031] The attention score vector of the preset attention head and the output activation vector of the feedforward network, collected by the hook function, are used for subsequent analysis. For example, relevant attention features can be obtained from the attention score vector, and relevant activation features can be obtained from the output activation vector. Similarly, due to different hallucination types, the attention features and activation features to be analyzed can be different. Therefore, the feature engineering processes for different hallucination types can be different. Based on different hallucination types, relevant attention features and activation features are obtained through corresponding feature engineering processes, and a first input vector corresponding to each hallucination type is obtained based on the attention features and activation features.

[0032] Step S3: Input the first input vector corresponding to different hallucination types into the pre-trained hallucination diagnosis model, and output the hallucination risk score and hallucination type prediction result; wherein, the abnormal training samples during the training of the hallucination diagnosis model are samples of hallucinations obtained through attention intervention.

[0033] The first input vector corresponding to different hallucination types is fed into a pre-trained hallucination diagnosis model. The hallucination diagnosis model outputs a hallucination risk score and a hallucination type prediction result. The hallucination risk score can be a value between 0 and 1, representing the degree of risk of the hallucination. The hallucination type prediction result can include the predicted probability of different hallucination types occurring.

[0034] The hallucination diagnosis model is pre-trained using normal and abnormal training samples. Normal training samples are samples in which the large language model did not hallucinate, while abnormal training samples are samples in which it did. Since naturally occurring hallucination cases are very few, and building a hallucination diagnosis model requires a large amount of training data, abnormal training samples exhibiting hallucinations need to be constructed before training the model.

[0035] The hallucinations produced by large language models can be attributed to limitations in their attention mechanisms. When attention is insufficiently allocated or inaccurately focused, the output can easily deviate from reality, leading to hallucinations. This invention intervenes in the attention of large language models, guiding them to produce hallucinations in a programmed and controllable manner. This allows for the exploration of their underlying mechanisms, the accumulation of "pathological data," and the construction of a hallucination diagnostic model.

[0036] The present invention provides a hallucination diagnosis method for large language models based on attention intervention. This method pre-trains a hallucination diagnosis model by intervening in the parameters of the large language model during its operation to generate abnormal training samples. During the actual generation of lexical units by the large language model, attention score vectors and output activation vectors are collected through registered hook functions. The first input vector, obtained through feature engineering, is then input into the hallucination diagnosis model, outputting a hallucination risk score and a hallucination type prediction result. This achieves online hallucination diagnosis during the output of lexical units by the large language model, improving the real-time performance, reliability, accuracy, and stability of the hallucination diagnosis. Furthermore, it is applicable to the deployment of large language models with various Transformer architectures, does not depend on specific tasks or external knowledge bases, and possesses versatility.

[0037] According to the present invention, a method for diagnosing hallucinations in a large language model based on attention intervention includes feature engineering processing, which comprises: calculating attention features pre-selected from weight distribution entropy, attention Gini coefficient, maximum attention value, key entity attention decay ratio, attention distribution kurtosis, attention distribution skewness, proportion of the top 3 attention positions, and attention focus shift variance based on the attention score vector; calculating activation features pre-selected from activation vector L2 norm ratio, activation sparsity, activation kurtosis, activation skewness, principal component 1 score, principal component 2 score, abnormal dimension activation ratio, and cross-head activation correlation after dimensionality reduction processing of the output activation vector; wherein the proportion of attention features is greater than the proportion of activation features; and performing vector concatenation processing on the attention features and activation features in sequence.

[0038] Different hallucination types can use different feature engineering methods. For example, the calculated attention features and activation features can be different. Attention features and activation features that are highly correlated with the corresponding hallucination type can be selected from a pre-defined range and pre-set to facilitate the execution of subsequent feature engineering.

[0039] Attention features can be pre-selected from weight distribution entropy, attention Gini coefficient, maximum attention value, key entity attention decay ratio, attention distribution kurtosis, attention distribution skewness, percentage of the top 3 attention positions, and attention focus shift variance. Specifically, weight distribution entropy measures the disorder of attention distribution; a higher value indicates more dispersed attention. The attention Gini coefficient measures the concentration of attention; a lower value (closer to 0) indicates a more uniform distribution. The maximum attention value is the maximum value of the weight vector, reflecting the strongest attention intensity. The key entity attention decay ratio represents the ratio of the attention weight at the target entity position to the historical normal baseline. Attention distribution kurtosis reflects the sharpness of the attention distribution pattern. Attention distribution skewness reflects the symmetry of the attention distribution; a positive value indicates that the weight is concentrated in a few positions. The percentage of the top 3 attention positions is the sum of the top three attention weights, reflecting the concentration of attention. Attention focus shift variance is the Jensen-Shannon divergence between the current step's attention distribution and the previous step's distribution, measuring focus jumps.
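Several of these attention features can be computed directly from a normalized attention vector. The sketch below covers entropy, the Gini coefficient, the top-3 share, the key entity decay ratio, and the Jensen-Shannon divergence between consecutive steps. It assumes natural logarithms and a non-negative input that sums to 1; the function names and the `baseline` argument are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy; higher means more dispersed attention."""
    return -sum(x * math.log(x) for x in p if x > 0)


def gini(p):
    """Gini coefficient; 0 for a uniform distribution."""
    xs = sorted(p)
    n, total = len(xs), sum(xs)
    if total == 0:
        return 0.0
    ranked = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * ranked / (n * total) - (n + 1) / n


def top_k_share(p, k=3):
    """Sum of the k largest attention weights."""
    return sum(sorted(p, reverse=True)[:k])


def key_entity_decay_ratio(p, entity_pos, baseline):
    """Weight at the target entity position relative to a normal baseline."""
    return p[entity_pos] / baseline


def js_divergence(p, q):
    """Jensen-Shannon divergence between two attention distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


prev_step = [0.25, 0.25, 0.25, 0.25]
curr_step = [0.70, 0.20, 0.05, 0.05]
attention_features = [
    entropy(curr_step),
    gini(curr_step),
    max(curr_step),
    top_k_share(curr_step),
    js_divergence(curr_step, prev_step),
]
```

With natural logarithms the Jensen-Shannon divergence lies between 0 (identical distributions) and ln 2 (disjoint distributions), which gives the focus-shift feature a fixed range.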

[0040] Activation features can be pre-selected from the following: activation vector L2 norm ratio, activation sparsity, activation kurtosis, activation skewness, principal component 1 score, principal component 2 score, abnormal dimension activation ratio, and cross-head activation correlation. Specifically, the activation vector L2 norm ratio represents the ratio of the current activation vector norm to the canonical norm of the layer output; activation sparsity is the proportion of activation values close to zero, reflecting the sparsity of the representation; activation kurtosis measures the tail weight of the activation value distribution, with high values indicating the presence of extreme activations; activation skewness measures the asymmetry of the activation value distribution; principal component 1 score represents the projection of the activation vector onto the first principal component, capturing the direction of maximum variation; principal component 2 score represents the projection of the activation vector onto the second principal component; abnormal dimension activation ratio represents the ratio of the average activation on the known "hallucinogenic" dimension subset to the global average activation; and cross-head activation correlation represents the average Pearson correlation between the output activations of different attention heads in the current layer.
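A few of the activation features can be sketched in plain Python as follows. The sketch leaves out the principal component scores and the abnormal dimension ratio, which need a fitted projection and a known dimension subset. The near-zero tolerance `eps`, the `reference_norm` argument, and the use of excess kurtosis are assumptions, not values from the source.

```python
import math
from itertools import combinations


def l2_norm_ratio(v, reference_norm):
    """Current L2 norm divided by a reference norm for the layer."""
    return math.sqrt(sum(x * x for x in v)) / reference_norm


def sparsity(v, eps=1e-3):
    """Fraction of activations whose magnitude is below `eps`."""
    return sum(1 for x in v if abs(x) < eps) / len(v)


def _central_moment(v, k):
    mu = sum(v) / len(v)
    return sum((x - mu) ** k for x in v) / len(v)


def skewness(v):
    m2 = _central_moment(v, 2)
    if m2 == 0:
        return 0.0
    return _central_moment(v, 3) / m2 ** 1.5


def kurtosis(v):
    """Excess kurtosis (0 for a normal distribution)."""
    m2 = _central_moment(v, 2)
    if m2 == 0:
        return 0.0
    return _central_moment(v, 4) / m2 ** 2 - 3.0


def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)


def mean_cross_head_correlation(heads):
    """Average Pearson correlation over all pairs of head outputs."""
    pairs = list(combinations(heads, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)
```

Each function returns a single float, so the results can be appended to the attention features in a fixed order to form the input vector.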

[0041] For different types of hallucinations, after obtaining attention features and activation features, the attention features and activation features can be concatenated into vectors according to a pre-set order to obtain the result of feature engineering.

[0042] The hallucination diagnosis method based on attention intervention provided by this invention improves the accuracy and relevance of the input features of the hallucination diagnosis model by reasonably selecting and calculating attention features and activation features during feature engineering processing, thereby further improving the accuracy and reliability of hallucination diagnosis.

[0043] According to the present invention, a method for diagnosing hallucinations based on attention intervention in a large language model, after outputting the hallucination risk score and hallucination type prediction result of the large language model to be diagnosed, the method further includes: deriving a risk trend based on the hallucination risk scores of the current word and multiple words in the preceding and following context; and generating and sending a warning message to a preset terminal when the hallucination risk scores of multiple consecutive words including the current word are all greater than a preset risk score threshold and / or the hallucination risk scores indicate an increase in hallucination risk.

[0044] When diagnosing hallucinations, it is important to consider not only whether the classification branch predicts a given hallucination type, but also the hallucination risk score output by the regression branch. A risk score threshold can be set for the hallucination risk score, such as T=0.75, to determine the level of risk.

[0045] For example, suppose the hallucination diagnosis model outputs a hallucination risk score of 0.05 (far below the threshold of 0.75) and type probabilities of [0.01, 0.02, 0.01, 0.95, 0.01]. The highest type probability belongs to the "irrelevant generation" class, but because the risk score is low, the overall likelihood of a hallucination is low.

[0046] The risk trend is derived based on the hallucination risk scores of the current word and multiple words in the preceding and following context. For example, if the hallucination risk score increases continuously in the generation order of the words, it indicates an increase in hallucination risk. If the hallucination risk scores of multiple consecutive words, including the current word, are all greater than a preset risk score threshold and / or the hallucination risk scores indicate an increase in hallucination risk, a warning message is generated and sent to a preset terminal. For example, if the current word's risk score is greater than T and N=2 consecutive words have high risk, a high-level warning is triggered. The preset terminal can be a user terminal or an upstream system, to which a warning signal can be sent indicating that the output content may contain factual errors. The warning message could be something like, "The current generation result has a hallucination risk; please verify."
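The warning rule can be sketched as a small function over the per-word risk scores. T=0.75 and N=2 come from the text. The three-word trend window, the function name `warning_level`, and the "rising" level are assumptions, and the sketch uses only words generated so far, since following context is not yet available during streaming.

```python
def warning_level(scores, threshold=0.75, run_length=2, trend_window=3):
    """Classify the current warning state.

    `scores` holds per-word risk scores in generation order, with the
    current word last. Returns "high", "rising", or None.
    """
    recent = scores[-run_length:]
    sustained = len(recent) == run_length and all(s > threshold for s in recent)

    window = scores[-trend_window:]
    rising = len(window) == trend_window and all(
        later > earlier for earlier, later in zip(window, window[1:])
    )

    if sustained:
        return "high"    # N consecutive words above the threshold
    if rising:
        return "rising"  # scores strictly increasing over the window
    return None
```

For instance, `warning_level([0.1, 0.8, 0.9])` returns `"high"` because the last two scores both exceed 0.75, while `warning_level([0.1, 0.2, 0.3])` returns `"rising"`.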

[0047] The hallucination diagnosis method based on a large language model with attention intervention provided by this invention determines the severity of the risk according to the hallucination risk score and risk trend, and triggers corresponding early warning processing, which facilitates the timely detection of hallucination risks.

[0048] According to the present invention, a method for diagnosing hallucinations in a large language model based on attention intervention is provided. The state data further includes the input representation and output representation of the attention layer, and the logits received by the language model head of the attention layer before the lexical unit is output. After outputting the hallucination risk score and hallucination type prediction result of the large language model to be diagnosed, the method further includes: obtaining the predicted hallucination type based on the hallucination type prediction result, and determining the hook function corresponding to the hallucination type and the attention layer on which the hook function is registered; and performing anomaly analysis and localization by analyzing at least one of the attention score vector, output activation vector, input representation, output representation, and logits of the corresponding attention layer collected by the hook function corresponding to the hallucination type, and generating a diagnostic report.

[0049] The state data collected by the hook function can also include the input representation and output representation of the attention layer, as well as the logits received by the language model head of the attention layer before the lexical unit is output.

[0050] After outputting the hallucination risk score and hallucination type prediction results, the predicted hallucination type is obtained based on the hallucination type prediction results. The hook function corresponding to the hallucination type and the attention layer on which the hook function is registered are determined. Anomaly analysis and localization are performed by analyzing at least one of the attention score vector, output activation vector, input representation, output representation, and logits collected by the hook function corresponding to the hallucination type. A diagnostic report is then generated, such as an automatically generated interpretable diagnostic report containing information such as anomaly layer localization, attention pattern analysis, and hallucination type inference.

[0051] For example, suppose the input is "Einstein's profession is ____", the hallucination diagnosis model determines that there is a hallucination risk when the fill-in-the-blank word is output, and the predicted hallucination type is "attribute mismatch". The attention layer on which the hook function for the "attribute mismatch" hallucination type is registered, such as layer 12, is then identified. If analysis shows that the average weight distribution entropy of this attention layer is abnormally high (0.65), it indicates that the attention focus has shifted from the "Einstein" entity to multiple irrelevant contextual words. This makes it possible to pinpoint the main abnormal signal causing the hallucination as originating from the attention mechanism of layer 12. Furthermore, analysis of the output activation vectors reveals a significant deviation in the feedforward network activation vectors on principal components 3 and 7, a pattern highly similar to the "occupational attribute mismatch" pathological samples in the training database.
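The "average weight distribution entropy" used in this example could be computed along the following lines. This is a sketch under stated assumptions: the text does not say how the entropy is normalized or averaged, so the division by log(n) (which maps the value into [0, 1]) and the averaging over heads are assumptions, as are the function names.

```python
import math
from typing import List


def normalized_entropy(weights: List[float]) -> float:
    """Shannon entropy of one attention weight distribution, scaled to [0, 1].

    Dividing by log(n) is an assumed normalization: 0 means all attention on
    one key, 1 means attention spread uniformly over all n keys.
    """
    n = len(weights)
    if n < 2:
        return 0.0
    entropy = -sum(w * math.log(w) for w in weights if w > 0.0)
    return entropy / math.log(n)


def mean_head_entropy(layer_weights: List[List[float]]) -> float:
    """Average the normalized entropy over all heads of one attention layer."""
    return sum(normalized_entropy(w) for w in layer_weights) / len(layer_weights)


if __name__ == "__main__":
    focused = [0.97, 0.01, 0.01, 0.01]   # attention concentrated on one key
    diffuse = [0.25, 0.25, 0.25, 0.25]   # attention spread over every key
    print(normalized_entropy(focused), normalized_entropy(diffuse))
    print(mean_head_entropy([focused, diffuse]))
```

A higher value means attention is spread over more keys, which is the pattern the example associates with attention drifting away from the "Einstein" entity.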

[0052] The hallucination diagnosis method based on attention intervention provided by this invention achieves interpretability of hallucination diagnosis by performing anomaly analysis and localization on the relevant data collected by the hook function registered with the hallucination type when the hallucination diagnosis model determines that a hallucination has occurred.

[0053] According to the present invention, a method for diagnosing hallucinations in a large language model based on attention intervention is provided. The method further includes: obtaining a first input sample from a corpus for inputting into the large language model; inputting the first input sample into the large language model, and determining a target word based on the first input sample and the type of hallucination to be induced; wherein, when the intervention triggering time of the hallucination type occurs, the hook function corresponding to the hallucination type intervenes in the attention score vector according to a preset intervention strategy before collecting the attention score vector; wherein, the intervention triggering time includes the instant when the attention score vector for generating the target word is generated; in response to the hallucination occurring in the large language model regarding the generation of the target word, the method obtains the target word... During the generation of lexical units, the state data collected by the hook functions corresponding to each hallucination type is processed according to the different hallucination types, and the state data collected by each hook function is subjected to corresponding feature engineering processing to obtain second input vectors corresponding to different hallucination types. First input data for training is constructed based on the second input vectors corresponding to each hallucination type. First label data is constructed based on the hallucination situation of the large language model with respect to the generation results of the target lexical unit. Abnormal training samples for training the hallucination diagnosis model are constructed based on the first input data and the first label data. The first label data includes first marker information of hallucination occurrence and second marker information of hallucination type.

[0054] This embodiment describes the process of constructing abnormal training samples. The aim is to induce a specific type of hallucination in the model through proactive and controllable attention intervention, while simultaneously collecting its complete internal computational state to construct labeled abnormal training samples. The abnormal training samples include first input data and first label data used to train the hallucination diagnosis model. It is understood that the large language model used to generate the training samples for the hallucination diagnosis model is the same as the large language model to be monitored subsequently. The pre-embedding of hook functions is consistent when generating training samples using the large language model and during subsequent online applications. However, it should be noted that during online diagnosis using the large language model, the hook functions are read-only hooks, meaning they only collect data and do not perform attention intervention.
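The difference between the intervening hooks used for sample generation and the read-only hooks used online can be illustrated with a toy stand-in. The `ToyAttentionHead` class and both hook factories below are hypothetical and written only for this sketch; a real deployment would rely on the hook mechanism of the model framework in use (for example, forward hooks in PyTorch).

```python
import math
from typing import Callable, List, Optional

# A hook receives the pre-Softmax scores; it may return replacement scores
# (intervention) or None (read-only collection).
Hook = Callable[[List[float]], Optional[List[float]]]


class ToyAttentionHead:
    """Hypothetical single attention head that exposes its pre-Softmax scores."""

    def __init__(self) -> None:
        self._hooks: List[Hook] = []

    def register_hook(self, hook: Hook) -> None:
        self._hooks.append(hook)

    def forward(self, scores: List[float]) -> List[float]:
        for hook in self._hooks:
            replaced = hook(scores)
            if replaced is not None:
                scores = replaced
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]


def make_read_only_hook(log: List[List[float]]) -> Hook:
    """Online diagnosis: record the scores and leave them untouched."""
    def hook(scores: List[float]) -> Optional[List[float]]:
        log.append(list(scores))
        return None
    return hook


def make_intervening_hook(intervene: Callable[[List[float]], List[float]],
                          log: List[List[float]]) -> Hook:
    """Sample generation: modify the scores first, then record the result."""
    def hook(scores: List[float]) -> Optional[List[float]]:
        changed = intervene(scores)
        log.append(list(changed))
        return changed
    return hook
```

Both hooks are attached at the same point, so the collected features have the same layout during sample generation and during online diagnosis; only the intervening variant changes what the model computes.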

[0055] The first input sample is obtained from a corpus for input into the large language model. This can be achieved by extracting a large number of text fragments from diverse corpora (such as Wikipedia, news articles, and book summaries) to form a basic context pool. Examples include: "Mozart's works are ____", and "Tokyo is the capital of ____". During training, the first input sample can be drawn from this basic context pool.

[0056] The first input sample is fed into the large language model, and the target word is determined based on the first input sample and the type of hallucination to be induced. Depending on the type of hallucination to be induced, the corresponding hook function intervenes in the attention score vector according to a preset intervention strategy when the intervention trigger time for that type of hallucination occurs, before collecting the attention score vector. The intervention trigger time includes the instant when the attention score vector used to generate the target word is generated.

[0057] For different types of hallucination, a method for determining the target lexical unit based on the first input sample can be preset. For example, for the "attribute mismatch" hallucination type, the target lexical unit can be determined as an entity attribute word. If the first input sample is "Einstein's profession is ____", then the word that fills in the blank can be determined as the target lexical unit. Therefore, for the "attribute mismatch" hallucination type, attention intervention is triggered the instant the attention score vector used to generate the prediction of Einstein's profession is generated. For the "temporal mismatch" hallucination type, the target lexical unit can be determined as a lexical unit representing temporal relationships, and attention intervention is triggered the instant the attention score vector used to generate that lexical unit is generated.

[0058] For each type of hallucination, the intervention strategy for affecting attention can be different. Furthermore, multiple intervention strategies can be pre-set for each hallucination type, and a scheduler can be used to select from these pre-set strategies. For example, intervention strategies for affecting attention could include random perturbation, head occlusion, etc. Each intervention strategy can be configured with relevant intervention parameters, such as those used to control the intensity of the intervention.
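A scheduler that picks among preset strategies for a hallucination type could look like the sketch below. Everything here is illustrative: the registry contents, the function names, and in particular the interpretation of "head occlusion" as scaling a head's scores toward zero (so that its attention becomes uniform after Softmax) are assumptions, since the text does not define these details.

```python
import random
from typing import Callable, Dict, List, Tuple

Scores = List[float]
Strategy = Callable[[Scores, float, random.Random], Scores]


def random_perturbation(scores: Scores, strength: float, rng: random.Random) -> Scores:
    """Add zero-mean Gaussian noise; `strength` is the standard deviation."""
    return [s + rng.gauss(0.0, strength) for s in scores]


def head_occlusion(scores: Scores, strength: float, rng: random.Random) -> Scores:
    """Assumed reading of 'head occlusion': shrink the scores toward zero.

    strength=1.0 zeroes every score, so Softmax yields uniform attention.
    """
    return [(1.0 - strength) * s for s in scores]


# Hypothetical registry: preset strategies per hallucination type.
STRATEGIES: Dict[str, Dict[str, Strategy]] = {
    "attribute mismatch": {
        "random_perturbation": random_perturbation,
        "head_occlusion": head_occlusion,
    },
    "temporal mismatch": {
        "random_perturbation": random_perturbation,
    },
}


def schedule(hallucination_type: str, rng: random.Random) -> Tuple[str, Strategy]:
    """Pick one preset strategy for the given hallucination type at random."""
    name = rng.choice(sorted(STRATEGIES[hallucination_type]))
    return name, STRATEGIES[hallucination_type][name]


if __name__ == "__main__":
    rng = random.Random(0)
    name, strategy = schedule("attribute mismatch", rng)
    print(name, strategy([4.0, 3.5, 0.2], 0.5, rng))
```

Each strategy takes a single `strength` argument here so that the scheduler can treat them interchangeably; real strategies may need richer parameter sets.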

[0059] The attention score vector is collected after intervention with a pre-defined intervention strategy. Therefore, the attention score vector collected at this time is the one after attention intervention. After attention interference, the original attention pattern that focuses on the relevant context becomes blurred or chaotic, thereby increasing the possibility that the model will select irrelevant or incorrect words.

[0060] The large language model continues to predict and generate target words based on the intervened attention score vector. If it is determined that the large language model exhibits hallucinations in the generation of target words, anomalous training samples can be constructed. Specifically, during the generation of target words, state data collected by hook functions corresponding to various hallucination types is obtained. Based on the different hallucination types, the state data collected by each hook function undergoes corresponding feature engineering processing to obtain second input vectors corresponding to different hallucination types. The union of these second input vectors can be used as the first input data for training. First label data is constructed based on the hallucination status of the generated target words. Anomalous training samples are then constructed based on the first input data and the first label data. The hallucination status of the target words includes whether a hallucination occurs and the type of hallucination when it occurs. Since a hallucination is generated at this time, the first label data includes first marker information of the hallucination occurrence and second marker information of the hallucination type.

[0061] The present invention provides a large language model hallucination diagnosis method based on attention intervention, which induces hallucinations in the large language model through attention intervention and thereby enables the construction of abnormal training samples for training the hallucination diagnosis model.

[0062] According to the present invention, a large language model-based hallucination diagnosis method based on attention intervention is provided, wherein the intervention strategy includes adding a random noise vector to the attention score vector.

[0063] Intervention strategies could include adding a random noise vector to the attention score vector, which is a form of random perturbation. Because the mechanism of hallucinations generated by large language models is complex, it is not easy to accurately determine the specific internal state at the time of hallucination. Therefore, this invention can employ non-targeted intervention strategies to improve the adaptability and reliability of hallucination diagnosis using large language models.

[0064] Specifically, the attention score vector is intervened after the target attention head (the head from which attention score vectors are collected) calculates the attention score vector (denoted as the original score vector s_original) between the query vector of the target word and the key vectors of all preceding words, but before the Softmax function is executed. Since the attention score vector is transformed into an attention weight vector by the Softmax function, intervening in the attention score vector also intervenes in the attention weight vector.

[0065] When intervening in the attention score vector by adding a random noise vector, a random noise vector `noise` of the same dimension as `s_original` is generated, with its elements sampled from a Gaussian distribution with a mean of 0 and a standard deviation of σ. The noise is then superimposed on the original score: `s_perturbed = s_original + noise`. The perturbed score vector `s_perturbed` replaces the original `s_original` and is input into the Softmax function to calculate the final attention weights.
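The perturbation step described above can be sketched in a few lines. The names `s_original`, `noise`, and `s_perturbed` follow the text; the example score values, the seed, and the helper function names are assumptions made for illustration.

```python
import math
import random
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable Softmax over one score vector."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def perturb_scores(s_original: List[float], sigma: float,
                   rng: random.Random) -> List[float]:
    """Return s_perturbed = s_original + noise, with noise ~ N(0, sigma^2)."""
    noise = [rng.gauss(0.0, sigma) for _ in s_original]
    return [s + n for s, n in zip(s_original, noise)]


if __name__ == "__main__":
    rng = random.Random(0)                 # fixed seed, only for repeatability
    s_original = [4.0, 3.5, 0.2, 0.1]      # illustrative pre-Softmax scores
    s_perturbed = perturb_scores(s_original, sigma=2.0, rng=rng)
    print("original weights: ", softmax(s_original))
    print("perturbed weights:", softmax(s_perturbed))
```

With σ = 0 the perturbed scores equal the original ones, and larger σ values let the noise dominate the original score differences, which is how σ acts as an intensity control.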

[0066] Take the first input sample "Einstein's profession was ____" as an example. Before intervention, keys highly related to "Einstein" and "physics" receive higher scores in the original score vector. The addition of random noise disrupts this structure, potentially causing some keys that originally had low scores and corresponded to unrelated concepts (such as "cooking" and "food") to receive relatively higher scores. After Softmax, the attention of the large language model to the relevant context is diluted, and the attention distribution becomes smoother and more unpredictable.

[0067] The large language model performs subsequent generation calculations based on an attention distribution that has been randomly perturbed. Because it can no longer clearly focus on semantic features strongly related to "Einstein," the probability assigned to correct professions (such as "physicist" or "scientist") decreases significantly, while the probability of various unrelated professions (such as "chef," "painter," or "driver") increases relatively. Ultimately, the model may output incorrect statements such as "Einstein's profession was a chef."

[0068] The hallucination diagnosis method based on attention intervention provided by this invention improves the applicability and reliability of the hallucination diagnosis model by adding a random noise vector to the attention score vector for attention intervention.

[0069] According to the present invention, a large language model-based hallucination diagnosis method based on attention intervention is provided, wherein before intervening in the attention score vector according to a preset intervention strategy, the method further includes: controlling the intervention intensity by adjusting the intervention parameters of the intervention strategy.

[0070] Intervention parameters of an intervention strategy can include parameters used to control the intensity of the intervention. Before intervening in the attention score vector according to a pre-defined intervention strategy, the intensity of the intervention can be controlled by adjusting the intervention parameters of the intervention strategy. For example, for an intervention strategy that adds a random noise vector to the attention score vector, the severity of the interference can be adjusted by adjusting the standard deviation σ of the random noise vector.

[0071] The large language model hallucination diagnosis method based on attention intervention provided by this invention improves the flexibility of attention intervention by controlling the intervention intensity through adjusting the intervention parameters of the intervention strategy.

[0072] According to the present invention, a method for diagnosing hallucinations in a large language model based on attention intervention is provided. The method further includes: obtaining a second input sample from a corpus for inputting into the large language model; inputting the second input sample into the large language model; during the generation of lexical units by the large language model, in response to the absence of hallucinations in the generation of the lexical units by the large language model, obtaining the state data collected by the hook function corresponding to each hallucination type during the generation of the lexical units, and performing corresponding feature engineering processing on the state data collected by each hook function according to the different hallucination types to obtain third input vectors corresponding to different hallucination types; constructing second input data for training based on the third input vectors corresponding to each hallucination type, constructing second label data based on the hallucination situation of the generation results of the lexical units by the large language model, and constructing normal training samples for training the hallucination diagnosis model based on the second input data and the second label data; wherein, the second label data includes third marker information indicating that hallucinations have not occurred.

[0073] This embodiment describes the process of constructing normal training samples. Normal training samples are those obtained when the large language model does not exhibit hallucinations. Since no hallucination is to be induced, the construction of normal training samples does not require attention intervention on the large language model. It only requires processing the input data normally using the large language model and constructing normal training samples after confirming that no hallucinations have occurred in the output lexical units.

[0074] Specifically, a second input sample is obtained from the corpus and input into the large language model. During the generation of lexical units by the large language model, if it is determined that no hallucination occurs during lexical unit generation, the state data collected by the hook functions corresponding to each hallucination type is obtained. Based on the different hallucination types, the state data collected by each hook function undergoes corresponding feature engineering processing to obtain third input vectors corresponding to different hallucination types. The union of the third input vectors corresponding to different hallucination types can be used as the second input data for training. Second label data is constructed based on the hallucination situation during lexical unit generation, and normal training samples are constructed based on the second input data and the second label data. The hallucination situation of a lexical unit includes whether a hallucination occurs and the type of hallucination when it occurs. Since no hallucination occurs in this case, the second label data includes third marker information indicating that no hallucination occurred.

[0075] The present invention provides a large language model hallucination diagnosis method based on attention intervention, which realizes the construction of normal training samples for training the hallucination diagnosis model.

[0076] According to the present invention, a method for diagnosing hallucinations in a large language model based on attention intervention is provided. After constructing the abnormal training samples and the normal training samples, the method further includes: training a pre-selected lightweight model based on the normal training samples and the abnormal training samples to obtain the hallucination diagnosis model; wherein the hallucination diagnosis model includes a regression branch and a classification branch, and during training, for the first input data, an output label is set for the regression branch based on the first marker information, and an output label is set for the classification branch based on the second marker information; for the second input data, output labels are set for the regression branch and the classification branch based on the third marker information.

[0077] The hallucination diagnostic model is a lightweight model whose goal is to learn the mapping from internal state features to hallucinations. Employing multi-task learning allows for the simultaneous prediction of risk scores (severity) and hallucination types, improving the model's generalization ability.

[0078] After constructing abnormal training samples and normal training samples, a pre-selected lightweight model is trained using both samples to obtain a hallucination diagnosis model. The pre-selected lightweight model can be a multilayer perceptron (MLP).

[0079] When training the hallucination diagnosis model, a lightweight multilayer perceptron is pre-constructed. For example, an MLP with two hidden layers is constructed, with the structure as follows: Input layer (60) → Hidden layer 1 (128, ReLU) → Hidden layer 2 (64, ReLU) → Output layer. Input layer (60) indicates that the input features have 60 dimensions; Hidden layer 1 (128) indicates that the layer has 128 neurons, and Hidden layer 2 (64) indicates that the layer has 64 neurons.

[0080] Output layer: The output layer has two branches: (1) Regression branch: Outputs a hallucination risk score in the range [0, 1].

[0081] (2) Classification branch: Output an n-dimensional vector, corresponding to the probability of n predefined hallucination types.
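The two-branch structure described above can be sketched as a forward pass in plain Python. This is illustrative only: the layer sizes 60, 128, and 64 come from the text, while the sigmoid on the regression branch, the Softmax on the classification branch, the weight initialization, and n = 5 types are assumptions. A practical implementation would normally use a deep learning framework.

```python
import math
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Small uniform random weights and zero biases (assumed initialization)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x: List[float], layer: Layer) -> List[float]:
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(values: List[float]) -> List[float]:
    return [max(0.0, v) for v in values]


def softmax(values: List[float]) -> List[float]:
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


class DiagnosisMLP:
    """Input (60) -> hidden (128, ReLU) -> hidden (64, ReLU) -> two branches."""

    def __init__(self, n_in: int = 60, n_types: int = 5, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.hidden1 = init_layer(n_in, 128, rng)
        self.hidden2 = init_layer(128, 64, rng)
        self.risk_head = init_layer(64, 1, rng)        # regression branch
        self.type_head = init_layer(64, n_types, rng)  # classification branch

    def forward(self, x: List[float]) -> Tuple[float, List[float]]:
        h = relu(linear(relu(linear(x, self.hidden1)), self.hidden2))
        risk_logit = linear(h, self.risk_head)[0]
        risk = 1.0 / (1.0 + math.exp(-risk_logit))     # sigmoid keeps risk in [0, 1]
        type_probs = softmax(linear(h, self.type_head))
        return risk, type_probs


if __name__ == "__main__":
    model = DiagnosisMLP()
    risk, type_probs = model.forward([0.1] * 60)
    print(risk, type_probs)
```

Sharing the two hidden layers between both branches is what makes this a multi-task model: the risk score and the type prediction are read from the same 64-dimensional representation.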

[0082] During training, for the first input data where hallucinations occur, the output label of the regression branch is set to 1 based on the first labeling information, and the output label of the classification branch is set based on the second labeling information. Specifically, for the output vector of the classification branch, the element corresponding to the hallucination type when generating the target word based on the first input data is set to 1, and the elements corresponding to other hallucination types are set to 0. For the second input data where no hallucinations occur, the output labels of the regression and classification branches are set based on the third labeling information. For example, the output label of the regression branch is set to 0, and in the output vector of the classification branch, the elements corresponding to each hallucination type are all set to 0.
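The labeling rule above can be written as a small helper. The function name `make_labels` and the use of `None` to denote a normal (non-hallucinated) sample are conventions assumed for this sketch.

```python
from typing import List, Optional, Tuple


def make_labels(type_index: Optional[int], n_types: int) -> Tuple[float, List[float]]:
    """Build (regression label, classification label) for one training sample.

    type_index is the index of the induced hallucination type for an abnormal
    sample, or None for a normal sample in which no hallucination occurred.
    """
    if type_index is None:
        # Normal sample: risk label 0 and an all-zero type vector.
        return 0.0, [0.0] * n_types
    if not 0 <= type_index < n_types:
        raise ValueError("type_index out of range")
    one_hot = [0.0] * n_types
    one_hot[type_index] = 1.0
    # Abnormal sample: risk label 1 and a one-hot type vector.
    return 1.0, one_hot


if __name__ == "__main__":
    print(make_labels(3, 5))
    print(make_labels(None, 5))
```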

[0083] The hallucination diagnosis method based on attention intervention provided by this invention, after constructing abnormal training samples and normal training samples, trains a pre-selected lightweight model based on the normal training samples and abnormal training samples. During training, output labels are set for the regression branch used to predict the probability of hallucination risk and the classification branch used to predict the type of hallucination, respectively, to obtain the hallucination diagnosis model, thereby improving the generalization ability of the model.

[0084] According to the present invention, a large language model-based hallucination diagnosis method based on attention intervention is provided, wherein the loss function during the training of the hallucination diagnosis model is a weighted sum of the regression loss of the regression branch and the classification loss of the classification branch; wherein the regression loss is the mean squared error between the predicted risk score and the true label, and the classification loss is the cross-entropy between the predicted hallucination type probability and the true hallucination type label.

[0085] The loss function during training of the hallucination diagnosis model is a weighted sum of the regression loss of the regression branch and the classification loss of the classification branch, where: Regression loss: the mean squared error (MSE) between the predicted risk score and the true label (1 for hallucination, 0 for normal).

[0086] Classification loss: Cross-entropy (CE) between the predicted hallucination type probability and the actual hallucination type label.

[0087] Total loss: L_total = α * MSE + β * CE. Where α and β are the weights of the regression loss and classification loss, respectively. α can be 0.7, and β can be 0.3.
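As a worked illustration of the total loss, the following sketch evaluates L_total = α * MSE + β * CE for a single sample, using the weights α = 0.7 and β = 0.3 given above. The function names and the small `eps` guard against log(0) are assumptions; over a batch, the per-sample values would typically be averaged.

```python
import math
from typing import List


def squared_error(risk_pred: float, risk_true: float) -> float:
    """Per-sample squared error; averaging over a batch gives the MSE."""
    return (risk_pred - risk_true) ** 2


def cross_entropy(type_probs: List[float], type_true: List[float],
                  eps: float = 1e-12) -> float:
    """Cross-entropy between predicted type probabilities and the type label."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(type_probs, type_true))


def total_loss(risk_pred: float, risk_true: float,
               type_probs: List[float], type_true: List[float],
               alpha: float = 0.7, beta: float = 0.3) -> float:
    """L_total = alpha * MSE + beta * CE for one training sample."""
    return (alpha * squared_error(risk_pred, risk_true)
            + beta * cross_entropy(type_probs, type_true))


if __name__ == "__main__":
    # Abnormal sample: hallucination of type index 3, predicted with risk 0.8.
    print(total_loss(0.8, 1.0, [0.05, 0.05, 0.05, 0.80, 0.05], [0, 0, 0, 1, 0]))
    # Normal sample: all-zero type label, so only the regression term remains.
    print(total_loss(0.05, 0.0, [0.2, 0.2, 0.2, 0.2, 0.2], [0, 0, 0, 0, 0]))
```

Note that under this formulation a normal sample with an all-zero type label contributes nothing to the cross-entropy term, so it trains only the regression branch.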

[0088] The hallucination diagnosis method based on attention intervention provided by this invention improves the accuracy of hallucination diagnosis by setting the loss function during the training of the hallucination diagnosis model as a weighted sum of the regression loss of the regression branch and the classification loss of the classification branch.

[0089] To ensure the effectiveness and reliability of the hallucination diagnosis model, this invention establishes a complete verification system and conducts tests on multiple benchmarks.

[0090] 1. Validation dataset construction: (1) Internal pathology test set: 10% of the data in the self-built "pathology database" was held out from training and used to test the ability of the diagnostic probe to detect hallucinations of "known etiology".

[0091] (2) External public hallucination benchmarks: Public hallucination evaluation datasets such as TruthfulQA and HaluEval are used to evaluate the generalization performance of the system in naturally occurring hallucination scenarios.

[0092] (3) Manually constructed adversarial examples: Experts were invited to construct a batch of complex, mixed-type hallucination statements to test the robustness of the system.

[0093] 2. Evaluation indicators: (1) Detection performance: accuracy, precision, recall, and F1 score.

[0094] (2) Regression performance: Root mean square error (RMSE) between the predicted risk score and the manually labeled severity level, and Pearson correlation coefficient.

[0095] (3) Efficiency metrics: additional inference latency (ms / token) and memory usage increment of diagnostic probes.

[0096] (4) Explainability: Through user surveys, assess how helpful the diagnostic report is to the model developers / domain experts in understanding the causes of hallucinations.
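The detection and regression indicators listed above follow their standard definitions and can be computed as in the sketch below. The function names and the toy inputs are assumptions; labels use 1 for hallucination and 0 for normal.

```python
import math
from typing import List, Tuple


def detection_metrics(y_true: List[int], y_pred: List[int]) -> Tuple[float, float, float, float]:
    """Return (accuracy, precision, recall, F1) for binary hallucination labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


def rmse(pred: List[float], true: List[float]) -> float:
    """Root mean square error between predicted and labeled scores."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    print(detection_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
    print(rmse([0.9, 0.2], [1.0, 0.0]), pearson([0.1, 0.5, 0.9], [0.2, 0.4, 1.0]))
```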

[0097] 3. Specific verification examples and results: Experimental setup: The hallucination diagnosis model provided in this invention was deployed on a LLaMA-7B model. The hallucination diagnosis model was an MLP with the structure 60 → 128 → 64 → [1+5], where [1+5] indicates that the output layer consists of one neuron (for the regression task, outputting a risk score) and five neurons (for the classification task, outputting the probabilities of five hallucination types).

[0098] (1) Results on the internal pathology test set: Hallucination detection accuracy: 94.2%.

[0099] Macro-averaged F1 score for hallucination type classification: 0.91.

[0100] The correlation coefficient between the predicted risk score and the actual intervention intensity was 0.89.

[0101] (2) Results on the TruthfulQA dataset: The system performs real-time diagnostics as the model generates each answer. Using human annotation as the gold standard, this system achieves an F1 score of 0.79 in word-level hallucination detection, significantly higher than methods based on output probability entropy (F1=0.52).

[0102] Case Study: For the question "What was Einstein's main contribution?", the model generated the correct answer "Relativity". The hallucination diagnosis model consistently output a low-risk score (<0.1). For the leading question "Is Einstein famous for inventing the telephone?", the model incorrectly started generating "Yes, he...". When generating "Yes", the probe risk score rose to 0.68; when generating "Telephone", the risk score spiked to 0.92 and triggered an alert, accurately capturing the factual hallucination.

[0103] 4. Efficiency overhead: Diagnostic probe inference latency: 0.8 ms / token on average.

[0104] Overall text generation speed decreased by approximately 5%. This demonstrates that the system meets real-time requirements and that overhead is manageable.

[0105] 5. Interpretability assessment: Ten NLP researchers were invited to rate 100 system-generated diagnostic reports. 85% of the reports were considered to "clearly identify the internal layers and components where the anomaly occurred," and 90% of the reports were considered to be "helpful or very helpful in understanding the cause of the particular hallucination."

[0106] Advantages of the present invention compared with the prior art: 1. Compared with post-hoc output verification methods: (1) Real-time breakthrough: This invention enables in-process or even pre-process diagnosis without waiting for the generation to be completed, significantly reducing response delay.

[0107] (2) Not dependent on external knowledge base: The analysis is based entirely on the internal state of the model, avoiding dependence on the completeness of external resources.

[0108] (3) High interpretability: It provides specific attributions such as internal attention abnormalities and activation deviations, rather than simple right and wrong judgments.

[0109] 2. Compared with generation uncertainty measurement methods: (1) Overcoming the underreporting of high-confidence hallucinations: Internal state analysis identifies hallucinations that are semantically abnormal even when the output probability distribution is concentrated.

[0110] (2) Multi-dimensional feature fusion: It integrates multi-level information such as attention and activation values to improve the robustness and accuracy of diagnosis.

[0111] (3) Typed output: It can not only determine whether it is a hallucination, but also distinguish the potential categories of hallucinations, supporting targeted processing.

[0112] 3. Compared with prompt engineering and constrained generation methods: (1) Fundamental diagnosis: It starts from the internal mechanism of the model, rather than relying on the superficial guidance of external prompts.

[0113] (2) Stable and reliable: The diagnostic results are based on internal state features, are less affected by the style of the input text, and have high stability.

[0114] (3) Strong adaptability: The hallucination diagnosis model can be retrained as the large language model is updated or the data distribution changes, and it is highly adaptable.

[0115] 4. Compared with methods based on external models or knowledge graphs: (1) Low latency and low cost: The hallucination diagnosis model is lightweight and has fast inference speed, without the need to call additional large models or knowledge services.

[0116] (2) Privacy and data security: All diagnostic processes are completed within the hallucination diagnostic model and do not involve external data transmission.

[0117] (3) Universal architecture: Applicable to various Transformer architecture models, without the need to refactor the external verification process for different models.

[0118] Figure 2 is a schematic diagram of the physical structure of an electronic device. As shown in Figure 2, the electronic device may include a processor 210, a communications interface 220, a memory 230, and a communication bus 240, wherein the processor 210, the communications interface 220, and the memory 230 communicate with each other via the communication bus 240. The processor 210 can call logical instructions in the memory 230 to execute the large language model hallucination diagnosis method based on attention intervention provided in the above embodiments.

[0119] Furthermore, the logical instructions in the aforementioned memory 230 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0120] On the other hand, the present invention also provides a computer program product, the computer program product including a computer program that can be stored on a non-transitory computer-readable storage medium, and when the computer program is executed by a processor, the computer is able to execute the large language model hallucination diagnosis method based on attention intervention provided in the above embodiments.

[0121] In another aspect, the present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the large language model hallucination diagnosis method based on attention intervention provided in the above embodiments.

[0122] Those skilled in the art will understand that all or part of the processes of the methods described in the above embodiments can be implemented by a computer program instructing related hardware, and the program can be stored in a computer-readable storage medium. The computer-readable storage medium may be a disk, optical disk, read-only memory, or random access memory, etc.

[0123] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0124] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0125] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

[0126] The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any changes or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in the present invention should be included within the scope of protection of the present invention.

Claims

1. A large language model hallucination diagnosis method based on attention intervention, characterized in that the method includes: during the generation of lexical units by the large language model to be diagnosed, collecting state data of the corresponding attention layers through hook functions registered in one or more preset attention layers corresponding to different hallucination types, wherein the state data includes the attention score vector of a preset attention head and the output activation vector of the feedforward network; according to the hallucination type, performing corresponding feature engineering processing on the state data collected by each hook function to obtain first input vectors corresponding to the different hallucination types; and inputting the first input vectors corresponding to the different hallucination types into a pre-trained hallucination diagnosis model to output a hallucination risk score and a hallucination type prediction result for the large language model to be diagnosed, wherein the abnormal training samples used in training the hallucination diagnosis model are hallucination samples obtained through attention intervention on the large language model.
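
As a non-limiting illustration of the hook-based collection step in claim 1, a minimal sketch in Python is given below. It is not the claimed implementation: `ToyAttentionLayer`, `make_collector`, and the layer name `layer_12` are hypothetical stand-ins, and only attention weights are recorded (the feedforward activation vector is omitted for brevity). In a framework such as PyTorch, the same pattern would typically rely on `torch.nn.Module.register_forward_hook`.

```python
import math
from typing import Callable, Dict, List, Sequence

# A hook receives the layer name and the attention weights it observed.
Hook = Callable[[str, List[float]], None]


class ToyAttentionLayer:
    """Hypothetical stand-in for one attention layer (not a real model part)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._hooks: List[Hook] = []

    def register_forward_hook(self, hook: Hook) -> None:
        self._hooks.append(hook)

    def forward(self, raw_scores: Sequence[float]) -> List[float]:
        # Numerically stable softmax over the raw attention scores.
        peak = max(raw_scores)
        exps = [math.exp(s - peak) for s in raw_scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        for hook in self._hooks:
            hook(self.name, weights)  # hooks only observe in this sketch
        return weights


def make_collector(store: Dict[str, List[List[float]]]) -> Hook:
    """Build a hook that appends a copy of each observed vector to `store`."""

    def collect(layer_name: str, weights: List[float]) -> None:
        store.setdefault(layer_name, []).append(list(weights))

    return collect


collected: Dict[str, List[List[float]]] = {}
layer = ToyAttentionLayer("layer_12")
layer.register_forward_hook(make_collector(collected))
layer.forward([0.0, 0.0])        # one forward call per generated lexical unit
layer.forward([1.0, 0.0, -1.0])
```

Each forward call stands for the generation of one lexical unit, so `collected` holds one attention vector per unit for the hooked layer.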

2. The large language model hallucination diagnosis method based on attention intervention according to claim 1, characterized in that the feature engineering processing includes: based on the attention score vector, calculating attention features pre-selected from the weight distribution entropy, attention Gini coefficient, maximum attention value, key entity attention decay ratio, attention distribution kurtosis, attention distribution skewness, proportion of the top 3 attention positions, and attention focus shift variance; after performing dimensionality reduction on the output activation vector through principal component analysis, calculating activation features selected from the activation vector L2 norm ratio, activation sparsity, activation kurtosis, activation skewness, principal component 1 score, principal component 2 score, outlier activation ratio, and cross-head activation correlation, wherein the proportion of attention features is greater than the proportion of activation features; and sequentially concatenating the attention features and the activation features into a vector.
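
By way of example only, four of the attention features named in claim 2 (weight distribution entropy, Gini coefficient, maximum attention value, and proportion of the top 3 positions) can be sketched in Python as follows. The function name `attention_features` and the dictionary keys are hypothetical, and the sketch assumes the attention score vector is non-negative (for instance, post-softmax weights); the claims do not fix these details.

```python
import math
from typing import Dict, Sequence


def attention_features(weights: Sequence[float]) -> Dict[str, float]:
    """Compute a few claim-2 style statistics for one attention distribution.

    `weights` is assumed non-negative with a positive sum; it is renormalized
    here so the statistics do not depend on the overall scale.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    p = [w / total for w in weights]
    n = len(p)

    # Shannon entropy in nats; zero-weight positions contribute nothing.
    entropy = -sum(x * math.log(x) for x in p if x > 0)

    # Gini coefficient from the ascending-sorted values x_(1..n), sum(x) = 1:
    # G = 2 * sum_i(i * x_(i)) / n - (n + 1) / n
    ordered = sorted(p)
    gini = 2 * sum((i + 1) * x for i, x in enumerate(ordered)) / n - (n + 1) / n

    # Share of attention mass held by the three largest positions.
    top3_share = sum(sorted(p, reverse=True)[:3])

    return {
        "entropy": entropy,
        "gini": gini,
        "max": max(p),
        "top3_share": top3_share,
    }
```

A uniform distribution yields maximal entropy and a Gini coefficient of zero, while a one-hot distribution yields zero entropy and a Gini coefficient of (n - 1) / n, which is why these two statistics are natural indicators of how concentrated the attention is.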

3. The large language model hallucination diagnosis method based on attention intervention according to claim 1, characterized in that, after outputting the hallucination risk score and the hallucination type prediction result of the large language model to be diagnosed, the method further includes: deriving a risk trend based on the hallucination risk scores of the current word and of multiple words in its preceding and following context; and if the hallucination risk scores of multiple consecutive words including the current word are all greater than a preset risk score threshold and/or the hallucination risk scores indicate an increase in hallucination risk, generating warning information and sending it to a preset terminal.
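
As a non-limiting sketch of the warning condition in claim 3, the Python function below checks a trailing window of risk scores. Several choices here are assumptions not stated in the claims: the function name `should_warn`, the default threshold of 0.7, the window of 3 words, the use of a least-squares slope as the "risk trend", and the restriction to preceding context only (as would apply during streaming generation).

```python
from typing import Sequence


def should_warn(
    scores: Sequence[float],
    threshold: float = 0.7,   # hypothetical preset risk score threshold
    window: int = 3,          # hypothetical number of consecutive words
    min_slope: float = 0.05,  # hypothetical minimum upward trend per word
) -> bool:
    """Return True when the last `window` scores all exceed `threshold`
    and/or their least-squares slope is at least `min_slope`."""
    if window < 2 or len(scores) < window:
        return False
    recent = list(scores[-window:])

    all_high = all(s > threshold for s in recent)

    # Least-squares slope of score against position 0..window-1.
    mean_x = (window - 1) / 2
    mean_y = sum(recent) / window
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(recent))
    den = sum((i - mean_x) ** 2 for i in range(window))
    rising = num / den >= min_slope

    return all_high or rising
```

For instance, `should_warn([0.1, 0.3, 0.5])` is true because the scores rise steadily even though none exceeds the threshold, whereas `should_warn([0.5, 0.4, 0.3])` is false.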

4. The large language model hallucination diagnosis method based on attention intervention according to claim 1, characterized in that the state data further includes the input representation and output representation of the attention layer, and the logits received by the language model head before the lexical unit is output; and after outputting the hallucination risk score and the hallucination type prediction result of the large language model to be diagnosed, the method further includes: obtaining the predicted hallucination type from the hallucination type prediction result, and determining the hook function corresponding to that hallucination type and the attention layer in which the hook function is registered; and performing anomaly analysis and localization by analyzing at least one of the attention score vector, output activation vector, input representation, output representation, and logits of the corresponding attention layer collected by the hook function corresponding to that hallucination type, and generating a diagnostic report.

5. The large language model hallucination diagnosis method based on attention intervention according to claim 1, characterized in that the method for obtaining the abnormal training samples includes: obtaining, from a corpus, a first input sample to be input into the large language model; inputting the first input sample into the large language model, and determining a target word based on the first input sample and the hallucination type to be induced, wherein, when the intervention trigger time of the hallucination type arrives, the hook function corresponding to the hallucination type intervenes in the attention score vector according to a preset intervention strategy before collecting the attention score vector, and the intervention trigger time includes the moment at which the attention score vector used to generate the target word is produced; in response to the target word generated by the large language model exhibiting hallucination, acquiring the state data collected by the hook functions corresponding to each hallucination type during the generation of the target word; according to the hallucination type, performing corresponding feature engineering processing on the state data collected by each hook function to obtain second input vectors corresponding to the different hallucination types; constructing first input data for training based on the second input vectors corresponding to each hallucination type; constructing first label data based on the hallucination status of the result generated by the large language model for the target word; and constructing the abnormal training samples for training the hallucination diagnosis model based on the first input data and the first label data, wherein the first label data includes first marker information indicating that hallucination occurred and second marker information indicating the hallucination type.

6. The large language model hallucination diagnosis method based on attention intervention according to claim 5, characterized in that the intervention strategy includes adding a random noise vector to the attention score vector.

7. The large language model hallucination diagnosis method based on attention intervention according to claim 5, characterized in that, before intervening in the attention score vector according to the preset intervention strategy, the method further includes: controlling the intervention intensity by adjusting the intervention parameters of the intervention strategy.
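
By way of illustration only, the noise-based intervention of claim 6 and the intensity control of claim 7 might be sketched in Python as below. The names `perturb_scores` and `intensity` are hypothetical, and the sketch assumes that the noise is zero-mean Gaussian, that `intensity` is its standard deviation, and that the scores are perturbed before softmax normalization; the claims only require a random noise vector and adjustable intervention parameters.

```python
import math
import random
from typing import List, Optional, Sequence


def perturb_scores(
    scores: Sequence[float],
    intensity: float,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Add zero-mean Gaussian noise with standard deviation `intensity`.

    `intensity` plays the role of the adjustable intervention parameter:
    0 leaves the scores unchanged, larger values distort them more.
    """
    if intensity < 0:
        raise ValueError("intensity must be non-negative")
    if intensity == 0:
        return list(scores)
    rng = rng if rng is not None else random.Random()
    return [s + rng.gauss(0.0, intensity) for s in scores]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax, used to renormalize perturbed scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


clean = [2.0, 0.5, 0.1]
noisy = perturb_scores(clean, intensity=1.5, rng=random.Random(0))
weights_after = softmax(noisy)  # still a valid attention distribution
```

Passing a seeded `random.Random` makes an induced sample reproducible, which can help when tracing a particular abnormal training sample back to the intervention that produced it.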

8. The large language model hallucination diagnosis method based on attention intervention according to claim 5, characterized in that the method further includes: obtaining, from the corpus, a second input sample to be input into the large language model; inputting the second input sample into the large language model; during the generation of lexical units by the large language model, in response to no hallucination occurring in the generated lexical units, acquiring the state data collected by the hook functions corresponding to each hallucination type during the generation of the lexical units; according to the hallucination type, performing corresponding feature engineering processing on the state data collected by each hook function to obtain third input vectors corresponding to the different hallucination types; constructing second input data for training based on the third input vectors corresponding to each hallucination type; constructing second label data based on the hallucination status of the lexical units generated by the large language model; and constructing normal training samples for training the hallucination diagnosis model based on the second input data and the second label data, wherein the second label data includes third marker information indicating that no hallucination occurred.

9. The large language model hallucination diagnosis method based on attention intervention according to claim 8, characterized in that, after the abnormal training samples and the normal training samples are constructed, the method further includes: training a pre-selected lightweight model using the normal training samples and the abnormal training samples to obtain the hallucination diagnosis model, wherein the hallucination diagnosis model includes a regression branch and a classification branch; during training, for the first input data, the output label of the regression branch is set according to the first marker information, and the output label of the classification branch is set according to the second marker information; and for the second input data, the output labels of the regression branch and the classification branch are set according to the third marker information.

10. The large language model hallucination diagnosis method based on attention intervention according to claim 9, characterized in that the loss function during training of the hallucination diagnosis model is the weighted sum of the regression loss of the regression branch and the classification loss of the classification branch, wherein the regression loss is the mean squared error between the predicted risk score and the true label, and the classification loss is the cross-entropy between the predicted hallucination type probability and the true hallucination type label.
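
As a non-limiting illustration of the loss described in claim 10, the weighted sum of mean squared error and cross-entropy can be sketched in plain Python as follows. The function name `combined_loss`, the weight parameters `w_reg` and `w_cls` (with default values of 1.0), and the clipping constant `eps` are assumptions introduced for this sketch; the claims do not specify the weights.

```python
import math
from typing import Sequence


def combined_loss(
    pred_scores: Sequence[float],           # regression branch outputs
    true_scores: Sequence[float],           # true risk labels
    type_probs: Sequence[Sequence[float]],  # classification branch probabilities
    true_types: Sequence[int],              # true hallucination type indices
    w_reg: float = 1.0,                     # hypothetical regression weight
    w_cls: float = 1.0,                     # hypothetical classification weight
    eps: float = 1e-12,                     # avoids log(0)
) -> float:
    """Weighted sum of batch-mean MSE and batch-mean cross-entropy."""
    n = len(pred_scores)
    if n == 0 or not (len(true_scores) == len(type_probs) == len(true_types) == n):
        raise ValueError("all inputs must be non-empty and of equal length")

    # Regression loss: mean squared error between predicted and true scores.
    mse = sum((p - t) ** 2 for p, t in zip(pred_scores, true_scores)) / n

    # Classification loss: negative log-probability of the true type.
    ce = -sum(
        math.log(max(probs[k], eps)) for probs, k in zip(type_probs, true_types)
    ) / n

    return w_reg * mse + w_cls * ce
```

Raising `w_cls` relative to `w_reg` would push training toward getting the hallucination type right at the expense of score calibration, and vice versa.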