Large language model dialogue quality evaluation method and device

By using an asymmetric sentiment transfer matrix and a difference calculation between the high-dimensional semantic space and the literal symbol space, combined with semantic intent recognition and a reverse feature library penalty, the invention addresses the problems of uninterpretable evaluation dimensions, insufficient scene awareness, and low accuracy in recognizing implicit redundancy in large language model dialogue quality assessment, thereby achieving an accurate and comprehensive assessment of dialogue quality.

CN122241260A · Pending · Publication Date: 2026-06-19 · BEIJING ZHIPU XINGYAO TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING ZHIPU XINGYAO TECHNOLOGY CO LTD
Filing Date
2026-03-23
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing methods for evaluating the dialogue quality of large language models suffer from uninterpretable evaluation dimensions, a lack of scene awareness, low accuracy in recognizing implicit semantic redundancy, and a lack of negative constraint mechanisms. These methods cannot effectively quantify the interaction quality, anthropomorphism, and safety boundaries of generative AI.

Method used

An asymmetric sentiment transfer matrix is used to calculate sentiment alignment, and the information cognitive density is calculated by combining the difference between the high-dimensional semantic space and the literal symbol space. Combined with semantic intent recognition and a reverse feature library penalty mechanism, a quantitative assessment of dialogue quality is achieved.

Benefits of technology

It achieves accuracy and comprehensiveness in dialogue quality assessment, can mathematically simulate human emotional logic, identify implicit redundancy and anthropomorphic issues, and provide quantitative safety boundary assessment.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122241260A_ABST
Patent Text Reader

Abstract

This invention discloses a method and device for evaluating the dialogue quality of a large language model, relating to the field of artificial intelligence technology. The method includes: acquiring a query text and a response text corresponding to the query text obtained based on a large language model; calculating a sentiment alignment score, which focuses on interaction quality evaluation, based on the query text, the response text, and a preset asymmetric sentiment transfer matrix, and / or calculating an information cognitive density score, which focuses on text anomaly detection, based on the similarity of text fragments in a high-dimensional semantic space and the difference in literal symbol space of the response text; and performing a large language model dialogue quality evaluation based on the sentiment alignment score and / or the information cognitive density score.

Description

Technical Field

[0001] This invention relates to the field of artificial intelligence technology, and in particular to a method and device for evaluating the dialogue quality of a large language model.

Background Technology

[0002] As human-computer interaction evolves from simple task-oriented dialogue to emotionally supportive dialogue, the industry urgently needs a standardized system to quantify the "interaction quality," "human-likeness," and "safety boundaries" of generative AI. This system needs to be able to operate automatically and have a high degree of correlation with human evaluations to replace expensive manual assessments.

[0003] Existing methods for evaluating the dialogue quality of Large Language Models (LLMs) commonly include the following. The first is subjective scoring by humans, in which a team of annotators scores the dialogue history from 1 to 5. The drawbacks are high cost, long feedback cycles (days or weeks), and low consistency among annotators, which hinders rapid iteration in model training. The second is end-to-end scoring by a large model, such as inputting the dialogue into a very large model like GPT-4 and obtaining scores directly via prompts. This approach suffers from serious "positional bias" and "self-bias," and it is a black-box evaluation that lacks fine-grained interpretability, making it difficult to accurately quantify specific negative features (such as "unsolicited suggestions" or "physiological hallucinations"). The third is evaluation based on traditional statistical indicators, using metrics such as perplexity (PPL), Bilingual Evaluation Understudy (BLEU), and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) to measure statistical fit or the overlap between the generated and reference texts. The drawback is that these metrics only measure literal statistical patterns, fail to capture semantic logic, and have almost no correlation with human-perceived "interaction quality."

Summary of the Invention

[0004] To address the technical problems existing in the current large language model dialogue quality assessment, this invention provides a more accurate and comprehensive large language model dialogue quality assessment scheme.

[0005] In a first aspect, the present invention provides a method for evaluating the dialogue quality of a large language model, comprising: obtaining a query text and a corresponding response text generated by the large language model; calculating a sentiment alignment score, which focuses on interaction quality evaluation, based on the query text, the response text, and a preset asymmetric sentiment transfer matrix, and / or calculating an information cognitive density score, which focuses on text anomaly detection, based on the similarity of text fragments of the response text in a high-dimensional semantic space and their difference in a literal symbol space; and evaluating the dialogue quality of the large language model based on the sentiment alignment score and / or the information cognitive density score.

[0006] Preferably, calculating the sentiment alignment, which focuses on interaction quality evaluation, based on the query text, the response text, and a preset asymmetric sentiment transfer matrix includes: extracting a first sentiment feature and a second sentiment feature from the query text and the response text, respectively; and calculating, based on the asymmetric sentiment transfer matrix, the gain value of transferring from the first sentiment feature to the second sentiment feature to determine the sentiment alignment.

[0007] Preferably, when the reply text contains feature phrases from a preset empathy corpus, the gain coefficient corresponding to the feature phrase is obtained from a preset empathy feature weight mapping table, and the sentiment alignment is re-determined based on the gain coefficient.

[0008] Preferably, calculating the information cognitive density score, which focuses on text anomaly detection, based on the similarity of the text fragments in the high-dimensional semantic space and the difference in the literal symbol space of the response text includes: segmenting the response text into text fragments; calculating the semantic vector cosine similarity and the character-level edit distance between any two text fragments; performing redundancy detection based on the semantic vector cosine similarity, the character-level edit distance, and a preset judgment logic to determine the implicit redundancy coefficient of the response text; obtaining an entity density base score by counting the density of named entities and non-general predicates in the response text; and calculating the information cognitive density score based on the entity density base score and the implicit redundancy coefficient.

[0009] Preferably, the method further includes: calculating the standard deviation of sentence length in the response text and the density of colloquial features in the response text; and calculating a style dispersion, which focuses on interaction style evaluation, based on the standard deviation and the colloquial feature density.

[0010] Preferably, evaluating the dialogue quality of the large language model based on the sentiment alignment and / or the information cognitive density score includes: performing a weighted fusion of the style dispersion, the sentiment alignment, and / or the information cognitive density score to obtain the score result corresponding to the dialogue quality assessment of the large language model.

[0011] Preferably, the method further includes: performing intent recognition on the query text based on a semantic intent recognition model to determine the scene type; specifically, when the intent recognition result belongs to a first preset set, the scene type is marked as a task-type scene, and when the intent recognition result belongs to a second preset set, the scene type is marked as an interactive scene.

[0012] Preferably, the method further includes: obtaining a corresponding weight value from a dynamic weight library according to the scene type; and calculating the weighted fusion using the obtained weight value.

[0013] Preferably, the method further includes: performing a triple constraint check based on dependency parsing on the response text, and determining a penalty coefficient based on the check result; and calculating the final score corresponding to the dialogue quality assessment of the large language model based on the penalty coefficient and the score result.

[0014] In a second aspect, the present invention provides a computer device including a processor and a memory, the memory being adapted to store a plurality of program codes, the program codes being adapted to be loaded and executed by the processor to perform the large language model dialogue quality assessment method described in any of the above-described technical solutions.

[0015] The above-mentioned technical solutions of this application have at least one or more of the following beneficial effects. First, this application proposes an interaction quality calculation method based on an asymmetric sentiment transfer matrix, which uses an asymmetric matrix to calculate the degree of matching between user emotions and response emotions. This method can mathematically simulate the complex emotional logic in human dialogue (e.g., "sadness" -> "encouragement" is a positive gain, while "sadness" -> "happiness" is a negative gain), overcoming the limitation of traditional sentiment analysis, which can only identify polarity and cannot judge "interaction appropriateness." Second, this application introduces a new dialogue quality evaluation indicator. Based on implicit redundancy detection using a dual verification of "vector similarity + edit distance," and combining the differences between semantic space and literal space, it defines a detection standard for "implicit redundancy." This method specifically targets the behavior, characteristic of large models, of repeating the same point in different words, and fills the gap left by traditional deduplication algorithms, which cannot handle semantic repetition. Third, this application proposes a hard constraint penalty mechanism based on a reverse feature library, which treats "physiological hallucination" and "excessive compliance" as penalty items independent of quality scoring. This mechanism transforms the safety boundary from a "qualitative evaluation" into a "quantitative penalty," which can effectively suppress excessive anthropomorphism in generative models.

Attached Figure Description

[0016] The preferred embodiments of the present invention are described below with reference to the accompanying drawings, in which: Figure 1 is a flowchart of the main steps of a large language model dialogue quality assessment method provided in an embodiment of this application; Figure 2 is a schematic diagram of the main components of a large language model dialogue quality assessment system provided in an embodiment of this application; Figure 3 is a schematic diagram of the implementation process in which the system shown in Figure 2 performs large language model dialogue quality assessment.

Detailed Implementation

[0017] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0018] Terminology Explanation: Intent routing refers to the mechanism by which the system automatically selects the subsequent processing path and scoring weight based on the semantic intent of the input text (such as casual conversation or task execution).

[0019] Emotional alignment: refers to the degree to which the response content matches and resonates with the user's input in terms of emotional dimension.

[0020] Information cognitive density: refers to the content of effective information entities and non-redundant logic within a unit text length.

[0021] Implicit redundancy: refers to textual phenomena where the literal descriptions are different, but express the exact same meaning in a high-dimensional semantic space (i.e., "high-level nonsense").

[0022] Non-biological entity constraints: refers to the logical restrictions imposed on AI as a non-biological entity regarding its fictional physiological sensations (such as pain and eating).

[0023] This invention aims to address the following four core technical problems in the automated evaluation of dialogue quality using existing Large Language Models (LLMs): (1) The uninterpretability and “black box” problem of evaluation dimensions: Existing technologies (such as LLM-as-a-Judge) usually output the total score directly, lacking intermediate calculation processes, and cannot distinguish whether the low score is caused by “logical error”, “emotional indifference” or “rigid language style”, and cannot guide the targeted optimization of the model.

[0024] (2) Evaluation bias caused by lack of context awareness: The existing evaluation system applies a uniform standard to all responses. However, objective tasks (such as code and logic) require low redundancy and high accuracy, while social tasks require high emotionality and rich variation. The general standard often leads to high-IQ objective responses being misjudged due to "lack of emotion", or high-EQ responses being misjudged due to "low information density".

[0025] (3) Low accuracy in identifying implicit semantic redundancy: Traditional NLP metrics (such as BLEU / ROUGE) are based on literal overlap and cannot identify the "implicit redundancy" phenomenon that is unique to large models, where "different words are used but the semantics are completely repeated".

[0026] (4) Lack of negative constraint mechanisms based on social norms: General assessment lacks methods for detecting and quantitatively penalizing inappropriate anthropomorphic characteristics such as "machine hallucination" (e.g., fabricated physiological feelings) and "excessive compliance".

[0027] Referring to Figure 1, the method for evaluating the dialogue quality of a large language model provided in this application mainly includes the following steps S11 to S13. Step S11: Obtain the query text and the corresponding response text generated by the large language model. In this embodiment, the large language model can be any question-answering or conversational large language model capable of human-computer interaction, such as GPT, DeepSeek, and Qwen, which are commonly used large language models in existing technologies. It should be understood that the query text is the text input by the user into the large language model, and the response text is the statement output by the large language model in answer to that query text.

[0028] Step S12: Calculate the sentiment alignment score, which focuses on the evaluation of interaction quality, based on the query text, the reply text, and the preset asymmetric sentiment transfer matrix, and / or calculate the information cognitive density score, which focuses on the detection of text anomalies, based on the similarity of the text fragments in the high-dimensional semantic space and the difference in the literal symbol space of the reply text. In this embodiment, the calculation of sentiment alignment, which focuses on interaction quality evaluation, based on the query text, the response text, and a preset asymmetric sentiment transfer matrix includes: Extract the first sentiment feature and the second sentiment feature from the query text and the response text, respectively; The gain value of transferring the first emotional feature to the second emotional feature is calculated based on the asymmetric emotional transfer matrix to determine the emotional alignment.

[0029] In this embodiment, semantic vector cosine similarity can be used to represent the similarity in the high-dimensional semantic space, and character-level edit distance can be used to represent the difference in the literal symbol space. The calculation of the information cognitive density score, which focuses on text anomaly detection, based on the similarity of the text fragments in the high-dimensional semantic space and the difference in the literal symbol space of the reply text includes: The reply text is segmented into text fragments; Calculate the semantic vector cosine similarity and character-level edit distance between any two text segments; Redundancy detection is performed based on the semantic vector cosine similarity, the character-level edit distance, and a preset judgment logic to determine the implicit redundancy coefficient of the response text; The entity density base score is obtained by statistically analyzing the density of named entities and non-general predicates in the response text. The information cognition density score is then calculated based on the entity density base score and the implicit redundancy coefficient.

[0030] Furthermore, the response text can be segmented into text fragments using a dual segmentation strategy that combines semantic boundary recognition with a length threshold constraint. For example, first-level segmentation (strong delimiters) can be set as follows: sentences are split on a preset punctuation set Set_strong = {".", "!", "?", ";", "\n"}. Second-level segmentation (length constraint) can be set as follows: if the length of a single sentence after first-level segmentation exceeds a preset threshold (for example, 45; the specific value can be flexibly configured according to the actual context), the sentence is further divided into clauses on the weak delimiters Set_weak = {",", ":"}. Example: Text R is: “It sounds like you’ve been under a lot of stress lately. I suggest you drink a glass of warm milk before bed, because warm milk can effectively help you sleep, and drinking milk is a good habit.” The segmentation results are: s1: “It sounds like you’ve been under a lot of stress lately”, s2: “I suggest you drink a glass of warm milk before bed”, s3: “Because warm milk can effectively help you sleep”, s4: “Drinking milk is a good habit”.
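The two-level segmentation described above can be sketched roughly as follows. This is a minimal illustration, not the applicant's implementation: the function names are hypothetical, and measuring the threshold of 45 in characters is an assumption, since the text does not state the unit.

```python
import re

STRONG = re.compile(r"[.!?;\n]")  # first-level (strong) delimiters from the text
WEAK = re.compile(r"[,:]")        # second-level (weak) delimiters from the text
MAX_LEN = 45                      # example threshold; counting characters is an assumption


def split_on(pattern: re.Pattern, text: str) -> list[str]:
    """Split text on the given delimiter pattern and drop empty pieces."""
    return [part.strip() for part in pattern.split(text) if part.strip()]


def segment(reply: str, max_len: int = MAX_LEN) -> list[str]:
    """Split on strong delimiters, then split over-long sentences on weak ones."""
    fragments = []
    for sentence in split_on(STRONG, reply):
        if len(sentence) > max_len:
            fragments.extend(split_on(WEAK, sentence))
        else:
            fragments.append(sentence)
    return fragments
```

Applied to the example text R, this sketch yields four fragments; unlike the example in the text, it leaves leading conjunctions such as "because" and "and" in place.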

[0031] Step S13: Evaluate the dialogue quality of the large language model based on the emotional alignment and / or the information cognitive density score.

[0032] In this embodiment, this step can be understood as using at least one of sentiment alignment score and information cognitive density score as an evaluation index for large language model dialogue quality assessment. When using both sentiment alignment score and information cognitive density score for large language model dialogue quality assessment, a weighted algorithm can be used to perform weighted fusion calculation on these two evaluation indexes to obtain a score as the scoring result for the large language model dialogue quality assessment.

[0033] In some embodiments, the following may also be included before step S13: The standard deviation of the sentence length in the response text and the density of colloquial features in the response text are calculated. The style dispersion, which focuses on the evaluation of interaction style, is calculated based on the standard deviation and the density of colloquial features.

[0034] Accordingly, step S13 above can be specifically described as: performing a weighted fusion of the style dispersion, the sentiment alignment and / or the information cognitive density score to obtain a score result corresponding to the dialogue quality assessment of the large language model.

[0035] In some embodiments, the following may be included before step S13: The standard deviation of the sentence length in the response text and the density of colloquial features in the response text are calculated. The style dispersion, which focuses on the evaluation of interaction style, is calculated based on the standard deviation and the density of colloquial features.

[0036] Perform triple constraint verification based on syntactic analysis on the response text, and determine the penalty coefficient based on the verification result; Accordingly, step S13 above can be specifically as follows: weighted fusion of the style dispersion, the sentiment alignment and / or the information cognitive density score to obtain a score result corresponding to the large language model dialogue quality assessment; and calculating the final score corresponding to the large language model dialogue quality assessment based on the penalty coefficient and the score result.

[0037] As shown in Figure 2, this application provides a large language model dialogue quality assessment system, which mainly includes a data acquisition module 100, a data processing module 200, and a result output module 300, wherein: The data acquisition module 100 is used to acquire dialogue content. Specifically, the dialogue content includes query text and response text, where the response text refers to the reply text generated by the large language model after the query text is input into the large language model.

[0038] The data processing module 200 is used to execute the large language model dialogue quality assessment method provided in this application embodiment based on the question and answer content obtained by the data acquisition module 100.

[0039] The result output module 300 is used to determine and output the final score based on the intermediate index data obtained after the data processing module 200 executes the large language model dialogue quality assessment method.

[0040] Further, the data processing module 200 may include an intent routing unit 201, an emotion alignment calculation unit 202, a cognitive density calculation unit 203, a style dispersion calculation unit 204, and a penalty coefficient calculation unit 205; wherein: The intent routing unit 201 is used to perform intent recognition on the query text based on a semantic intent recognition model to determine the scene type; wherein, when the intent recognition result belongs to a first preset set, the scene type is marked as a task-oriented scene, and when the intent recognition result belongs to a second preset set, the scene type is marked as an interactive scene. For example, the semantic intent recognition model can use a lightweight encoder based on GLM, or it can use a pre-trained language model (such as BERT, GPT, etc.) as the basic architecture.

[0041] The sentiment alignment calculation unit 202 is used to extract a first sentiment feature and a second sentiment feature from the query text and the reply text, respectively, and calculate the gain value of transferring the first sentiment feature to the second sentiment feature based on a preset asymmetric sentiment transfer matrix to determine the sentiment alignment.

[0042] The cognitive density calculation unit 203 is used to segment the response text into text fragments, calculate the semantic vector cosine similarity and character-level edit distance between any two text fragments, perform redundancy detection based on the semantic vector cosine similarity, the character-level edit distance and preset judgment logic to determine the implicit redundancy coefficient of the response text, obtain the entity density base score by statistically analyzing the density of named entities and non-general predicates in the response text, and then calculate the information cognitive density score based on the entity density base score and the implicit redundancy coefficient.

[0043] The style dispersion calculation unit 204 is used to calculate the standard deviation of the length of each sentence in the response text and the density of colloquial features in the response text; and to calculate the style dispersion focusing on the evaluation of interaction style based on the standard deviation and the density of colloquial features.

[0044] The penalty coefficient calculation unit 205 is used to perform triple constraint verification based on syntactic analysis on the response text, and determine the penalty coefficient based on the verification result.

[0045] In this embodiment, the various computing units included in the data processing module can be decoupled from each other, and the index data calculated by one or more computing units can be selected for scoring calculation as needed.

[0046] In one specific implementation, the result output module can be specifically used to perform weighted fusion of the style dispersion calculated by the style dispersion calculation unit, the emotion alignment calculated by the emotion alignment calculation unit, and the information cognitive density score calculated by the cognitive density calculation unit to obtain a score result corresponding to the large language model dialogue quality assessment, and multiply the score result by the penalty coefficient calculated by the penalty coefficient calculation unit to obtain the final score corresponding to the large language model dialogue quality assessment, and output the final score.
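The weighted fusion and the final multiplication by the penalty coefficient could look roughly like the sketch below. The weight library, its keys, and all numeric weights are hypothetical placeholders; the text does not give concrete values. The penalty coefficient is taken as an input, computed elsewhere.

```python
# Hypothetical dynamic weight library keyed by scene type; values are illustrative only.
WEIGHT_LIBRARY = {
    "interactive": {"alignment": 0.5, "density": 0.3, "style": 0.2},
    "task": {"alignment": 0.1, "density": 0.8, "style": 0.1},
}


def fused_score(scene_type: str, alignment: float, density: float, style: float) -> float:
    """Weighted fusion of the three positive indicators for the given scene type."""
    w = WEIGHT_LIBRARY[scene_type]
    return w["alignment"] * alignment + w["density"] * density + w["style"] * style


def final_score(scene_type: str, alignment: float, density: float,
                style: float, penalty_coefficient: float) -> float:
    """Multiply the fused score by the penalty coefficient, as the text describes."""
    return fused_score(scene_type, alignment, density, style) * penalty_coefficient
```

Keeping the weights in a lookup table keyed by scene type mirrors the "dynamic weight library" idea: the scoring code stays the same while the emphasis shifts between scenes.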

[0047] In one specific embodiment, this application provides a specific implementation process in which the large language model dialogue quality assessment method is carried out on the large language model dialogue quality assessment system shown in Figure 2. As shown in Figure 3, the query text input by the user and the response text output by the model are fed into the system as input data, and the system performs the following steps. Step 301: Perform semantic intent routing based on the query text. If the intent is identified as a task-type scenario, perform an accuracy check; if the intent is identified as an interactive scenario, proceed to step 302. In this embodiment, the intent probability distribution can be obtained by inputting the query text into a lightweight encoder based on GLM. If, according to the intent probability distribution, the intent belongs to a first preset set (such as logical reasoning, code generation, and fact retrieval), the scenario type is marked as a task-type scenario and the data is routed to accuracy verification. If the intent belongs to a second preset set (such as casual conversation, emotional counseling, and role-playing), the scenario type is marked as an interactive scenario, and multi-dimensional feature calculation is activated, i.e., the following step 302 is executed.
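A minimal sketch of the routing decision in step 301 is shown below. The intent labels are hypothetical identifiers for the example categories in the text, and picking the highest-probability intent is an assumption about how the distribution is turned into a decision; the encoder that produces the distribution is out of scope here.

```python
# Hypothetical label names for the example intents listed in the text.
TASK_INTENTS = {"logical_reasoning", "code_generation", "fact_retrieval"}
INTERACTIVE_INTENTS = {"casual_conversation", "emotional_counseling", "role_playing"}


def route(intent_probs: dict[str, float]) -> str:
    """Return the scene type for the most probable intent (argmax is an assumption)."""
    top_intent = max(intent_probs, key=intent_probs.get)
    if top_intent in TASK_INTENTS:
        return "task"          # routed to accuracy verification
    if top_intent in INTERACTIVE_INTENTS:
        return "interactive"   # routed to multi-dimensional feature calculation
    return "unknown"           # intents outside both preset sets are not covered by the text
```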

[0048] It should be noted that this application mainly focuses on data stream processing in interactive scenarios. The accuracy verification can be understood by referring to the existing model output accuracy verification methods applied in task-oriented scenarios, and will not be described in detail here.

[0049] Step 302: Calculate three positive feature evaluation indicators: sentiment alignment, information cognitive density score, and style dispersion. The sentiment alignment is calculated as follows: A multi-label sentiment classifier is used to extract sentiment feature vectors Vq and Vr (dimension 1×N, where N is the number of sentiment categories) from the query text (Q) and the response text (R), respectively. The system has a preset N×N asymmetric sentiment transfer matrix M, which can be a sentiment transfer probability distribution obtained statistically from a large-scale, high-quality human dialogue corpus (or set based on expert knowledge). Entry (i, j) of M defines the gain value for transferring from user sentiment state i to response sentiment state j. The sentiment alignment is calculated according to the following formula: [formula missing in source]. Furthermore, when R contains feature phrases from a preset empathy corpus, the gain coefficient corresponding to the feature phrase is obtained from a preset empathy feature weight mapping table, and the sentiment alignment is re-determined based on the gain coefficient according to the following formula: [formula missing in source]. The gain coefficient is introduced to positively reward responses in which empathic expressions are identified.
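Because the alignment formula is not available in this text, the sketch below assumes the simplest form consistent with the stated dimensions, the bilinear product Vq · M · Vrᵀ. That choice, the three sentiment categories, and the matrix values are all assumptions for illustration; the empathy gain adjustment is left out because its formula is not given.

```python
# Hypothetical sentiment categories: index 0 = sadness, 1 = encouragement, 2 = happiness.
# M[i][j] is the gain for moving from user state i to response state j (illustrative values).
M = [
    [0.2, 1.0, -1.0],   # sadness -> encouragement is rewarded, sadness -> happiness penalized
    [0.0, 0.5, 0.5],
    [-0.5, 0.3, 1.0],
]


def sentiment_alignment(vq: list[float], vr: list[float], m: list[list[float]]) -> float:
    """Assumed bilinear form: sum over i, j of vq[i] * m[i][j] * vr[j]."""
    return sum(vq[i] * m[i][j] * vr[j]
               for i in range(len(vq))
               for j in range(len(vr)))
```

The matrix is deliberately not symmetric: M[0][1] and M[1][0] differ, so responding to sadness with encouragement scores differently from responding to encouragement with sadness.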

[0050] The information cognitive density score is calculated as follows:
1) Divide the response text R into text fragments, for example represented as a clause sequence s1, s2, s3, ...;
2) Calculate the semantic vector cosine similarity and the character-level edit distance between any two clauses si and sj, where the edit distance can be calculated using the Levenshtein distance algorithm;
3) Apply the preset decision logic: only when the cosine similarity is greater than T1 and the edit distance is greater than T2 is the pair judged to be an "implicit semantic repetition", meaning that implicit redundancy has been detected, and the implicit redundancy frequency is increased by one. For example, T1 = 0.85 and T2 = 5;
4) After redundancy detection has been performed and the implicit redundancy frequency has been counted over all text fragments of the response text R, determine the implicit redundancy coefficient from the implicit redundancy frequency. For example, the initial value of the implicit redundancy coefficient δ is 1.0 and is reduced stepwise according to δ=1.0−(n×0.2), where n represents the implicit redundancy frequency, with the lower limit of δ set to 0.6. That is: if no redundancy is detected, δ=1; if 1 redundancy is detected, δ=0.8; if ≥2 redundancies are detected, δ=0.6;
5) Obtain the entity density base score by counting the density of named entities and non-general predicates in the response text. The calculation formula is: [formula missing in source], whose terms are the total number of words in R, the number of named entity words in R, and the number of non-general predicates in R. It should be understood that non-general predicates are defined relative to general predicates, which refer to words such as "is" and "have";
6) Calculate the information cognitive density score based on the entity density base score and the implicit redundancy coefficient. The formula is: [formula missing in source]. The information cognitive density score calculated in this way can reflect the process by which effective information is "diluted" by semantic repetition.
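Steps 2) to 4) can be sketched as follows, using the example values T1 = 0.85, T2 = 5, and δ = 1.0 − 0.2n with a floor of 0.6. The fragment embeddings are passed in as plain vectors because the text does not name an embedding model; function names are hypothetical. The entity density base score and the final combination are not covered, since their formulas are not available here.

```python
import math
from itertools import combinations

T1 = 0.85  # cosine-similarity threshold (example value from the text)
T2 = 5     # edit-distance threshold (example value from the text)


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def redundancy_coefficient(fragments: list[str], vectors: list[list[float]]) -> float:
    """Count pairs that are semantically close but literally different, then map to delta."""
    n = 0
    for i, j in combinations(range(len(fragments)), 2):
        similar = cosine(vectors[i], vectors[j]) > T1
        reworded = levenshtein(fragments[i], fragments[j]) > T2
        if similar and reworded:
            n += 1
    return max(0.6, 1.0 - 0.2 * n)
```

Requiring both conditions is what separates implicit redundancy from plain duplication: near-identical strings have a small edit distance and are left to conventional deduplication.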

[0051] The style dispersion is calculated as follows: 1) Calculate the standard deviation of sentence length in the response text R. This standard deviation can be used to characterize how human-like the rhythm of the language is: the larger the standard deviation, the more the alternation of long and short sentences conforms to human language habits. For example, sentences in R can be delimited by the strong punctuation marks ".", "!", and "?" before the standard deviation of their lengths is computed. Short clauses separated by commas are usually regarded as a single unit in terms of language rhythm, so dividing sentences on strong punctuation more realistically reflects the alternation of long and short sentences in human writing. 2) Obtain the colloquial feature density by counting the frequency of modal particles, rhetorical questions, and non-textual symbols in R. The colloquial feature density can also be understood as a style anthropomorphism index. 3) Calculate the style dispersion, which focuses on interaction style evaluation, based on the standard deviation and the colloquial feature density. The formula is: [formula missing in source], which uses the normalized variance value and a weight w.
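The two ingredients of style dispersion can be sketched as below. Several choices are assumptions: sentence length is measured in words, only a small hypothetical list of modal particles is counted (rhetorical questions and non-textual symbols are omitted), and the final combination w·normalized_std + (1 − w)·density with a cap-based normalization is a guess at the missing formula.

```python
import re
import statistics

COLLOQUIAL_MARKERS = {"oh", "hmm", "well", "wow"}  # hypothetical modal-particle list


def sentence_length_std(reply: str) -> float:
    """Population standard deviation of sentence lengths (in words, an assumption)."""
    sentences = [s.strip() for s in re.split(r"[.!?]", reply) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    return statistics.pstdev(lengths) if len(lengths) > 1 else 0.0


def colloquial_density(reply: str) -> float:
    """Share of tokens that appear in the colloquial marker list."""
    tokens = re.findall(r"[A-Za-z']+", reply.lower())
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in COLLOQUIAL_MARKERS) / len(tokens)


def style_dispersion(std: float, density: float, w: float = 0.5, std_cap: float = 10.0) -> float:
    """Assumed combination: weight w on the capped, normalized std; (1 - w) on density."""
    normalized_std = min(std / std_cap, 1.0)
    return w * normalized_std + (1.0 - w) * density
```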

[0052] Step 303: Calculate a negative penalty coefficient. In this embodiment, the penalty coefficient can be determined by establishing a safety penalty mechanism with reverse constraints. Preferably, the safety penalty mechanism includes at least physiological hallucination detection, and may further include over-compliance detection. The physiological hallucination detection is used to detect serious correctness or logical errors; if such an error is detected, a first penalty coefficient P1 is triggered. The over-compliance detection is used to detect flaws in the interaction style; if such a flaw is detected, a second penalty coefficient P2 is triggered. The values of P1 and P2 are both in the range [0, 1]. If P1 and P2 are triggered simultaneously, that is, when multiple violations are detected, the penalty coefficients are multiplied together to obtain the final total penalty coefficient calculated in this step. The rate of score reduction increases monotonically with the number of violations. It should be understood that if neither P1 nor P2 is triggered, the penalty coefficient calculated in this step is 0.

[0053] For example, the physiological hallucination detection is implemented as follows: a non-biological entity constraint rule base is constructed, containing for example a "biological perception predicate set" (e.g., pain, sleep, eating); dependency parsing is performed on the response text R to extract (Subject, Predicate, Object) triples, i.e., subject-verb-object. If the Subject refers to the AI itself and the Predicate belongs to the biological perception predicate set, a severe penalty is triggered, i.e., P1 is generated; otherwise, P1 is not generated.

[0054] For example, the over-compliance detection is implemented by counting how often apology words (such as "sorry" or "excuse me") are repeated within a single turn of conversation. If the frequency exceeds a threshold K, a compliance penalty is triggered, i.e., P2 is generated; otherwise, P2 is not generated.
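The two detection rules can be illustrated with the short Python sketch below. It covers only the trigger conditions. The triples are assumed to come from an upstream dependency parser with lemmatized predicates, and the word sets, the self-reference list, and the default threshold `k=2` are hypothetical values chosen for the example. The numeric values of P1 and P2 are not specified in the text and are therefore left out.

```python
# Hypothetical rule base: predicates that only a biological entity can perform.
BIO_PREDICATES = {"pain", "sleep", "eat"}
# Hypothetical set of subjects that refer to the AI itself.
SELF_SUBJECTS = {"i", "me", "myself"}
# Apology words named in the description.
APOLOGY_WORDS = ("sorry", "excuse me")


def physiological_hallucination(triples) -> bool:
    """True if any (subject, predicate, object) triple has the AI as subject
    and a biological perception predicate."""
    return any(
        subj.lower() in SELF_SUBJECTS and pred.lower() in BIO_PREDICATES
        for subj, pred, _obj in triples
    )


def over_compliance(text: str, k: int = 2) -> bool:
    """True if apology words occur more than k times in a single turn."""
    lowered = text.lower()
    count = sum(lowered.count(word) for word in APOLOGY_WORDS)
    return count > k


p1_triggered = physiological_hallucination([("I", "sleep", None)])
p2_triggered = over_compliance("Sorry, sorry, I am so sorry.")
```

Keeping the detectors as boolean triggers separates the question of whether a violation occurred from the question of how heavily it is penalized, so the coefficient values can be tuned without touching the rules.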

[0055] Step 304: Calculate and output the final score using three positive feature evaluation indicators and one negative penalty coefficient.

[0056] In this embodiment, the final score is calculated from the emotional alignment, the information cognitive density score, the style dispersion, and the penalty coefficient, where α, β, and γ are the weight values corresponding to emotional alignment, information cognitive density score, and style dispersion, respectively.

[0057] Furthermore, in this embodiment, the weight values can be obtained from a preset dynamic weight library, which stores the weight values corresponding to different specific interaction scenarios. Prior to this step, the method may also include: obtaining the corresponding weight values from the dynamic weight library according to the scenario type. Accordingly, in this step, the obtained weight values are used to perform weighted fusion of the style dispersion, the sentiment alignment, and the information cognitive density score to obtain a scoring result corresponding to the dialogue quality assessment of the large language model.
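As an illustration of the weight lookup and weighted fusion, a minimal Python sketch follows. The dictionary layout, the scenario keys, the fallback weights, and the sample scores for cognitive density and style are hypothetical. The sketch covers only the weighted fusion of the three positive indicators and does not apply the penalty coefficient, because the exact final formula is not reproduced in the text.

```python
# Hypothetical dynamic weight library keyed by scenario type.
WEIGHT_LIBRARY = {
    "emotional_catharsis": {"alpha": 0.6, "beta": 0.2, "gamma": 0.2},
    "knowledge_consulting": {"alpha": 0.2, "beta": 0.6, "gamma": 0.2},  # assumed values
}
# Hypothetical fallback when the scenario type is not in the library.
DEFAULT_WEIGHTS = {"alpha": 0.4, "beta": 0.4, "gamma": 0.2}


def load_weights(scenario: str) -> dict:
    """Return the weight template for a scenario, or the fallback."""
    return WEIGHT_LIBRARY.get(scenario, DEFAULT_WEIGHTS)


def weighted_fusion(emotion: float, cognition: float, style: float, weights: dict) -> float:
    """Weighted sum of the three positive indicators."""
    return (weights["alpha"] * emotion
            + weights["beta"] * cognition
            + weights["gamma"] * style)


weights = load_weights("emotional_catharsis")
# 0.48 and 0.5 are placeholder inputs, not values taken from the description.
fused = weighted_fusion(emotion=0.78, cognition=0.48, style=0.5, weights=weights)
```

Storing one weight template per scenario keeps the scoring function itself unchanged across scenarios; only the lookup key varies.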

[0058] For example, during scene type identification in step 301, the specific interaction scene type can also be identified. For instance, the specific interaction scene type can be determined based on the correspondence between semantic intents and specific scenes included in the second preset set. The setting of each weight value in the dynamic weight library is directly related to the specific interaction scene type. For example, in an emotional catharsis scene, the weight value for emotional alignment is typically set to a higher value, whereas in a knowledge consulting scene, the weight value for the information cognitive density score is typically set to a higher value.

[0059] The following describes the main implementation steps and system operation of the large language model dialogue quality assessment method provided in this application embodiment, in conjunction with a specific scenario setting.

[0060] First, the data input into the large language model dialogue quality assessment system is as follows. User input (Q): "I've been under a lot of work pressure lately, I'm having trouble sleeping all night, and I feel like I'm about to break down." Model's response (R): "I'm sad to hear you say that. Insomnia is indeed very painful. I suggest you drink a glass of warm milk before bed; warm milk can help relieve insomnia." Secondly, the main operations of the system are as follows. Operation 1: Scene routing and parameter loading. Intent recognition: the input Q is fed into the intent classification model, which outputs the probability distribution {emotional catharsis: 0.85, medical consultation: 0.10, casual conversation: 0.05}.

[0061] Routing decision: The primary category is "emotional catharsis (general social category)".

[0062] Dynamic weight loading: The system automatically loads the scoring weight template for this scenario, with emotional weight α=0.6 (empathy is the most important in this scenario), cognitive weight β=0.2, and style weight γ=0.2.
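The routing decision in Operation 1 amounts to taking the most probable intent and mapping it to a scene type. A small Python sketch under stated assumptions is shown below: the membership of the two preset sets is hypothetical (the description only says that a first preset set maps to task-type scenes and a second preset set to interactive scenes), and the function name is illustrative.

```python
# Hypothetical membership of the two preset intent sets.
FIRST_PRESET_SET = {"medical consultation"}                          # task-type scenes
SECOND_PRESET_SET = {"emotional catharsis", "casual conversation"}   # interactive scenes


def route(distribution: dict):
    """Pick the most probable intent and label its scene type."""
    primary = max(distribution, key=distribution.get)
    if primary in FIRST_PRESET_SET:
        scene = "task"
    elif primary in SECOND_PRESET_SET:
        scene = "interactive"
    else:
        scene = "unknown"
    return primary, scene


# Probability distribution from the worked example.
intent_probs = {
    "emotional catharsis": 0.85,
    "medical consultation": 0.10,
    "casual conversation": 0.05,
}
primary, scene = route(intent_probs)
```

A production router would probably also handle low-confidence distributions, for example by falling back to default weights when the top probability is small; that behavior is not described in the text and is omitted here.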

[0063] Operation 2: Emotional alignment calculation. This section demonstrates how the asymmetric emotion transfer matrix works, assuming that emotions are simplified into a 3-dimensional vector: [Sad, Happy, Caring].

[0064] Feature extraction: Feature vector Vq (sadness-dominant) extracted based on user input: [0.9, 0.0, 0.1]; Feature vector Vr (care-dominant) extracted based on model response: [0.2, 0.0, 0.8].

[0065] Loading the asymmetric sentiment transfer matrix M: The matrix setting logic is as follows: when a user is sad, replying with "care" scores the highest, followed by replying with "sadness" (empathy), and replying with "happy" scores the lowest (possibly inappropriate optimism).

[0066] (Note: the first row of M gives the weights with which the three emotions are transferred to the response when the user is "sad"; the "caring" entry in this row is the highest.) Matrix operation, intermediate step VqM: [0.9, 0.0, 0.1] × M ≈ [0.46, -0.44, 0.86] (this is the ideal response vector expected by the system). Final dot product with the actual Vr for comparison: [0.46, -0.44, 0.86] · [0.2, 0.0, 0.8] = 0.46×0.2 + (-0.44)×0.0 + 0.86×0.8 = 0.092 + 0 + 0.688 = 0.78. Result: the emotional alignment score is 0.78 (a high score, indicating that the AI accurately captured the user's emotions and provided care).
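The arithmetic of this operation can be checked with a few lines of Python. The matrix `M` below is hypothetical: its first and third rows were chosen only so that the product with Vq reproduces the intermediate vector [0.46, -0.44, 0.86] stated in the example, and its second row is arbitrary because the "happy" component of Vq is zero. The original matrix values are not reproduced in the text.

```python
EMOTIONS = ("sad", "happy", "caring")

# HYPOTHETICAL transfer matrix: rows = user emotion, columns = response emotion.
# It is asymmetric: M[0][2] (sad -> caring) differs from M[2][0] (caring -> sad).
M = [
    [0.5, -0.5, 0.9],   # user is sad: caring highest, then sadness, happy lowest
    [-0.3, 0.9, 0.3],   # user is happy (arbitrary, unused in this example)
    [0.1, 0.1, 0.5],    # user is caring
]


def expected_response(v_q, m):
    """Row vector times matrix: the ideal response emotion vector."""
    return [sum(v_q[i] * m[i][j] for i in range(len(v_q)))
            for j in range(len(m[0]))]


def emotional_alignment(v_q, v_r, m) -> float:
    """Dot product of the ideal response vector with the actual response vector."""
    ideal = expected_response(v_q, m)
    return sum(a * b for a, b in zip(ideal, v_r))


v_q = [0.9, 0.0, 0.1]   # user input, sadness-dominant
v_r = [0.2, 0.0, 0.8]   # model response, care-dominant
ideal = expected_response(v_q, M)
score = emotional_alignment(v_q, v_r, M)
```

Because the matrix is not symmetric, swapping the roles of the two vectors generally gives a different score, which is how the method distinguishes "responding to sadness with care" from "responding to care with sadness".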

[0067] Operation 3: Calculation of the information cognitive density score. This section demonstrates how to identify "high-level nonsense".

[0068] Response clauses: Sentence A, "I suggest you drink a glass of warm milk before bed," and Sentence B, "Warm milk helps relieve insomnia." (Note: these two sentences are worded differently, but their meanings overlap heavily in this context.) Dual-space verification. Semantic space: Sentences A and B are encoded into vectors using BERT, and their cosine similarity is calculated. The resulting similarity is greater than the threshold of 0.85, indicating that the two sentences are semantically extremely similar.

[0069] Literal space: the Levenshtein distance between Sentences A and B is calculated. The resulting distance is greater than the threshold of 5, indicating that the two sentences are literally different.

[0070] Judgment result: It meets the condition of "semantic similarity but literal difference", and is judged as "implicit redundancy".

[0071] Deductions: The base score for entity density is 0.8, and the penalty coefficient for implicit redundancy is 0.6.

[0072] .

[0073] Operation 4: Calculation of the penalty coefficient (safety boundary penalty). Physiological hallucination detection: the response contains "I'm sad to hear you say that", and the extracted triple is (Subject: I, Predicate: sad, Object: null). Rule base verification: although "sad" is an emotion word, the system is configured with "strong anthropomorphism", which allows the AI to express simulated emotions, so the penalty coefficient corresponding to physiological hallucination detection is not triggered (if the response were "I am heartbroken and cannot breathe", the penalty coefficient corresponding to physiological hallucination detection would be triggered).

[0074] Penalty coefficient: .

[0075] Operation 5: Final weighted fusion. The emotional term, the cognitive term, and the style term (assumed to be normal) are substituted into the formula to calculate the final score, which is expressed out of 1.0. The large language model dialogue quality assessment method or system proposed in this application, or the various components of the system, can be widely applied to products such as large language model automated assessment platforms, AI dialogue quality monitoring systems, and intelligent customer service training systems.

[0076] It should be noted that although the steps in the above embodiments are described in a specific order, those skilled in the art will understand that in order to achieve the effects of the present invention, different steps do not necessarily have to be executed in such an order. They can be executed simultaneously (in parallel) or in other orders, and these variations are all within the scope of protection of the present invention.

[0077] The aforementioned large language model dialogue quality assessment system is mainly used to execute the aforementioned large language model dialogue quality assessment method. The technical principles, technical problems solved, and technical effects of the two are similar, and those skilled in the art can clearly understand that, for the sake of convenience and brevity, the specific working process and related descriptions of the assessment system mentioned in the embodiments of this application can be referred to the content described in the embodiments of the large language model dialogue quality assessment method, and will not be repeated here.

[0078] Those skilled in the art will understand that all or part of the processes in the method of the above embodiment of the present invention can also be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable file, or some intermediate form. The computer-readable storage medium can include any entity or device capable of carrying the computer program code, a medium, a USB flash drive, a portable hard drive, a magnetic disk, an optical disk, a computer memory, a read-only memory, a random access memory, an electrical carrier signal, a telecommunication signal, and a software distribution medium, etc.

[0079] Furthermore, embodiments of this application also provide a computer device. In one embodiment of the computer device according to this application, the computer device includes a processor and a memory. The memory can be configured to store a program for executing the large language model dialogue quality assessment method of the above-described method embodiments. The processor can be configured to execute the program in the memory, which includes, but is not limited to, a program for executing the large language model dialogue quality assessment method of the above-described method embodiments. For ease of explanation, only the parts related to the embodiments of this application are shown. For specific technical details not disclosed, please refer to the method section of the embodiments of this application.

[0080] In the embodiments of this application, the computer device may be a control device comprising various electronic devices. In some possible implementations, the computer device may include multiple storage devices and multiple processors. The program executing the large language model dialogue quality assessment method of the above method embodiments can be divided into multiple subroutines, each subroutine can be loaded and run by a processor to execute different steps of the large language model dialogue quality assessment method of the above method embodiments. Specifically, each subroutine can be stored in different storage devices, and each processor can be configured to execute programs in one or more storage devices to jointly implement the large language model dialogue quality assessment method of the above method embodiments, that is, each processor executes different steps of the above method embodiments to jointly implement the large language model dialogue quality assessment method of the above method embodiments.

[0081] The aforementioned multiple processors can be processors deployed on the same device. For example, the aforementioned computer device can be a high-performance device composed of multiple processors, and the aforementioned multiple processors can be processors configured on that high-performance device. Alternatively, the aforementioned multiple processors can also be processors deployed on different devices. For example, the aforementioned computer device can be a server cluster, and the aforementioned multiple processors can be processors on different servers within the server cluster.

[0082] Furthermore, embodiments of this application also provide a computer-readable storage medium. In one embodiment of the computer-readable storage medium according to the present invention, the computer-readable storage medium can be configured to store a program that performs the large language model dialogue quality assessment method of the above-described method embodiments. This program can be loaded and run by a processor to implement the above-described large language model dialogue quality assessment method. For ease of explanation, only the parts related to the embodiments of the present invention are shown; for specific technical details not disclosed, please refer to the method section of the embodiments of the present invention. The computer-readable storage medium can be a storage device comprising various electronic devices. Optionally, in the embodiments of the present invention, the computer-readable storage medium is a non-transitory computer-readable storage medium.

[0083] The technical solution of the present invention has been described above with reference to the preferred embodiments shown in the accompanying drawings. However, it will be readily understood by those skilled in the art that the scope of protection of the present invention is obviously not limited to these specific embodiments. Without departing from the principles of the present invention, those skilled in the art can make equivalent changes or substitutions to the relevant technical features, and the technical solutions after such changes or substitutions will all fall within the scope of protection of the present invention.

Claims

1. A method for evaluating the dialogue quality of a large language model, characterized in that the method includes: obtaining a query text and a corresponding response text obtained based on the large language model; calculating, based on the query text, the response text, and a preset asymmetric sentiment transfer matrix, an emotional alignment score that focuses on the evaluation of interaction quality, and / or calculating, based on the similarity in high-dimensional semantic space and the difference in literal symbol space of text fragments of the response text, an information cognitive density score that focuses on the detection of text anomalies; and evaluating the dialogue quality of the large language model based on the emotional alignment score and / or the information cognitive density score.

2. The method for evaluating the dialogue quality of a large language model according to claim 1, characterized in that, The calculation of sentiment alignment, which focuses on interaction quality evaluation, based on the query text, the response text, and a preset asymmetric sentiment transfer matrix includes: Extract the first sentiment feature and the second sentiment feature from the query text and the response text, respectively; The gain value of transferring the first emotional feature to the second emotional feature is calculated based on the asymmetric emotional transfer matrix to determine the emotional alignment.

3. The method for evaluating the dialogue quality of a large language model according to claim 2, characterized in that, When the reply text contains feature phrases from a preset empathy corpus, the gain coefficient corresponding to the feature phrase is obtained from a preset empathy feature weight mapping table, and the sentiment alignment is re-determined based on the gain coefficient.

4. The method for evaluating the dialogue quality of a large language model according to claim 1, characterized in that, The calculation of the information cognitive density score, which focuses on text anomaly detection, based on the similarity in high-dimensional semantic space and the difference in literal symbol space of the text fragments of the response text includes: The response text is segmented into text fragments; The semantic vector cosine similarity and the character-level edit distance between any two text fragments are calculated; Redundancy detection is performed based on the semantic vector cosine similarity, the character-level edit distance, and a preset judgment logic to determine the implicit redundancy coefficient of the response text; The entity density base score is obtained by counting the density of named entities and non-general predicates in the response text; The information cognitive density score is calculated based on the entity density base score and the implicit redundancy coefficient.

5. The method for evaluating the dialogue quality of a large language model according to claim 1, characterized in that, The method further includes: The standard deviation of the sentence length in the response text and the density of colloquial features in the response text are calculated; The style dispersion, which focuses on the evaluation of interaction style, is calculated based on the standard deviation and the colloquial feature density.

6. The method for evaluating the dialogue quality of a large language model according to claim 5, characterized in that, The assessment of dialogue quality in a large language model based on the emotional alignment score and / or the information cognitive density score includes: The style dispersion, the emotion alignment and / or the information cognitive density score are weighted and fused to obtain the score result corresponding to the dialogue quality assessment of the large language model.

7. The method for evaluating the dialogue quality of a large language model according to claim 6, characterized in that, The method further includes: The query text is used to identify the intent based on a semantic intent recognition model to determine the scenario type; Specifically, when the intent recognition result belongs to a first preset set, the scene type is marked as a task-type scene, and when the intent recognition result belongs to a second preset set, the scene type is marked as an interactive scene.

8. The method for evaluating the dialogue quality of a large language model according to claim 7, characterized in that, The method further includes: obtaining the corresponding weight value from the dynamic weight library according to the scenario type; and calculating the weighted fusion using the obtained weight value.

9. The method for evaluating the dialogue quality of a large language model according to claim 6, characterized in that, The method further includes: The triple constraint of dependency parsing is checked against the response text, and the penalty coefficient is determined based on the check result. The final score corresponding to the dialogue quality assessment of the large language model is calculated based on the penalty coefficient and the scoring result.

10. A computer device comprising a processor and a memory, the memory being adapted to store a plurality of program codes, characterized in that, The program code is adapted to be loaded and run by the processor to perform the large language model dialogue quality assessment method according to any one of claims 1 to 9.