Agent behavior consistency dynamic evaluation method, system, device and storage medium
By constructing an adversarial evaluation dataset and multi-turn dialogue state machine testing, and combining weighted scoring of multiple review models and a meta-review model, the noise suppression and robustness issues in agent evaluation are solved, achieving consistent evaluation of agent behavior and improving the accuracy and reliability of evaluation results.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING UNIV OF POSTS & TELECOMM
- Filing Date
- 2026-03-26
- Publication Date
- 2026-06-30
AI Technical Summary
Existing intelligent agent evaluation methods lack noise suppression capabilities and have poor system robustness, resulting in large variance and low confidence in evaluation results, making it impossible to effectively evaluate in complex semantic scenarios.
By constructing an adversarial evaluation dataset, using a multi-turn dialogue state machine to test the agent, combining the scores of multiple review models and a meta-review model for weighted comprehensive scoring, eliminating invalid scores, and using a structured evidence chain and evaluation priority matrix to adjust the model weights, a consistent evaluation of the agent's behavior is achieved.
It improves the accuracy and reliability of intelligent agent evaluation results, eliminates invalid scoring interference, achieves highly reliable evaluation in complex scenarios, and adapts to the evaluation needs of large language models in complex interaction scenarios.
Smart Images

Figure CN122309364A_ABST
Abstract
Description
Technical Field
[0001] This application belongs to the field of intelligent agent technology, and more specifically, relates to a method, system, device, and storage medium for dynamic evaluation of the consistency of intelligent agent behavior. Background Technology
[0002] With the increasing maturity of large language model technology, intelligent agents are gradually penetrating from single-task scenarios to complex interactive scenarios. While the explosion of large models has endowed intelligent agents with powerful interactive capabilities, it has also exposed their shortcomings in handling complex tasks, such as logical breaks or behavioral drift. Therefore, the focus of intelligent agent evaluation is shifting from basic capabilities to the assessment of stability and consistency. Currently, most mainstream evaluation methods adopt the form of "fixed test set + multi-model voting." Specifically, the agent is fed with pre-cleaned static corpus, and its responses are input into multiple review models for scoring. Finally, the arithmetic mean or majority voting method is used to obtain the final evaluation result. However, the above-mentioned existing technologies have problems such as the lack of noise suppression capabilities in the scoring mechanism and poor system robustness, which leads to large variance and low confidence in the final evaluation results. Summary of the Invention
[0003] The purpose of this application is to provide a method, system, device, and storage medium for dynamic evaluation of agent behavior consistency, in order to solve the problems of lack of noise suppression capability and poor system robustness in the scoring mechanism, and improve the accuracy and reliability of agent evaluation results.
[0004] A first aspect of this application provides a method for dynamic evaluation of agent behavior consistency, including: Obtain the adversarial evaluation dataset; Based on the adversarial evaluation dataset, the agent is tested through a multi-turn dialogue state machine to obtain interaction behavior data; the interaction behavior data consists of complete dialogue data composed of the response text data generated by the agent in response to the adversarial evaluation dataset and the corresponding context fragments. The interaction behavior data is scored by multiple review models to obtain multi-source review score data, and the score divergence degree is calculated based on the multi-source review score data. If the score disagreement is greater than the empirical threshold, the structured evidence chain output by each review model is obtained. Based on each structured evidence chain, the disagreement logic attribution result is obtained through the meta-review model. Based on the disagreement logic attribution result and the evaluation priority matrix, the weight of each review model is adjusted. Based on the adjusted weight of each review model, the multi-source review score data is weighted to obtain the weighted comprehensive score. The structured evidence chain output by each review model includes the review score of the review model, the triggering decision rules, and the context fragments referenced from the interaction behavior data; the meta-review model is an arbitration reasoning engine obtained by structurally reconstructing the input-output paradigm of the large language model; the evaluation priority matrix includes the importance priority of multiple evaluation dimensions, and each review model scores the interaction behavior data based on the evaluation dimensions.
[0005] A second aspect of this application provides a dynamic evaluation system for agent behavior consistency, comprising: The adversarial data construction module is used to obtain adversarial evaluation datasets; The interaction execution module is used to test the agent through a multi-turn dialogue state machine based on the adversarial evaluation dataset to obtain interaction behavior data. The interaction behavior data consists of complete dialogue data composed of the response text data generated by the agent in response to the adversarial evaluation dataset and the corresponding context fragments. The dynamic scoring module is used to score interactive behavior data through multiple review models to obtain multi-source review scoring data, and to calculate the scoring divergence degree based on the multi-source review scoring data. The meta-arbitration review module is used to obtain the structured evidence chain output by each review model if the score disagreement degree is greater than the empirical threshold of disagreement degree. Based on each structured evidence chain, the disagreement logic attribution result is obtained through the meta-review model. Based on the disagreement logic attribution result and the evaluation priority matrix, the weight of each review model is adjusted. Based on the adjusted weight of each review model, the multi-source review score data is weighted to obtain the weighted comprehensive score. The structured evidence chain output by each review model includes the review score of the review model, the triggering decision rules, and the context fragments referenced from the interaction behavior data; the meta-review model is an arbitration reasoning engine obtained by structurally reconstructing the input-output paradigm of the large language model; the evaluation priority matrix includes the importance priority of multiple evaluation dimensions, and each review model scores the interaction behavior data based on the evaluation dimensions.
[0006] A third aspect of this application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and running on the processor, wherein the processor executes the computer program to implement the steps of the above-described dynamic evaluation method for agent behavior consistency.
[0007] A fourth aspect of this application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the above-described dynamic evaluation method for agent behavior consistency.
[0008] The beneficial effects of the intelligent agent behavior consistency dynamic evaluation method, system, device, and storage medium provided in this application embodiment are as follows: This application's embodiments accurately quantify scoring discrepancies between review models by calculating the distributional divergence of multi-source review scores. Unlike traditional arithmetic averages or voting methods that treat discrepancies indiscriminately, this application's embodiments, when the scoring divergence exceeds a threshold, rely on a meta-review model to logically attribute the root causes of discrepancies and adjust the review model weights in conjunction with the evaluation priority matrix, rather than simply making numerical compromises. This can eliminate interference from invalid and illusory scores, fundamentally solving the problems of large variance and low confidence in traditional evaluation results.
[0009] Meanwhile, the review model outputs a structured chain of evidence including scores, judgment rules, and contextual references. The meta-review model, as a dedicated arbitration engine for structured reconstruction, makes the entire evaluation and arbitration process traceable and verifiable, avoiding the drawbacks of traditional black-box scoring and making the evaluation conclusions more reliable. It is adapted to the high-standard evaluation needs after the large language model intelligent agent penetrates into complex interaction scenarios. Attached Figure Description
[0010] To more clearly illustrate the technical solutions in the embodiments of this application, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0011] Figure 1 A flowchart illustrating a dynamic evaluation method for agent behavior consistency provided in an embodiment of this application; Figure 2 This is a schematic diagram of a family of scenarios for dynamic evaluation of agent behavior consistency provided in an embodiment of this application; Figure 3 A flowchart illustrating the interactive execution process of an agent provided in an embodiment of this application; Figure 4 This is a flowchart of the dynamic scoring process for agent behavior consistency provided in one embodiment of this application; Figure 5 This is a structural block diagram of a dynamic evaluation system for agent behavior consistency provided in an embodiment of this application; Figure 6 This is a schematic block diagram of an electronic device provided in an embodiment of this application. Detailed Implementation
[0012] In the following description, specific details such as particular system architectures and techniques are set forth for illustrative purposes and not for limitation, in order to provide a thorough understanding of the embodiments of this application. However, those skilled in the art will understand that this application may also be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods have been omitted so as not to obscure the description of this application with unnecessary detail.
[0013] To facilitate understanding, the technical concept of this application will first be explained.
[0014] The shortcomings and deficiencies of existing intelligent agent personality assessment technologies: 1. The evaluation samples lack dynamic adversarial nature and have insufficient coverage: Existing test sets are mostly static texts, lacking stress tests that target the behavioral boundaries of agents. Agents are prone to "overfitting" or memorizing fixed samples, making it impossible to detect their real behavioral drift in "long-tail" scenarios such as extreme emotional stress, logical traps, or information overload.
[0015] 2. The scoring mechanism lacks noise suppression capabilities and has poor system robustness: Existing "arithmetic averaging" methods for scoring multiple review models assume that all review models have the same confidence level. However, in complex semantic scenarios, different review models may have significant discrepancies in their understanding of the same behavior (i.e., high information entropy states). In this case, simple averaging cannot eliminate invalid or illusory scores, resulting in large variance and low confidence in the final agent evaluation results.
[0016] 3. Lack of benchmark calibration mechanism, making cross-model incomparability: Different versions of the review model have inherent systematic biases; for example, some review models tend to give high scores. The current technology lacks a unified "anchor" calibration process, making it impossible to objectively compare the scoring results generated by different batches and different review models under the same metric.
[0017] In view of all the shortcomings of the prior art, the following are the technical problems to be solved by this application: 1. Addressing the technical problem of "lack of dynamic adversarial nature and insufficient coverage in evaluation samples": This application aims to provide a method for constructing evaluation data that can automatically explore the boundaries of agent behavior. Specifically, this application introduces a parameterized adversarial mutation mechanism to address the problem that traditional static scripts struggle to cover extreme scenarios such as high emotional stress, logical traps, and information overload, thereby preventing agents from overfitting to fixed samples and achieving comprehensive robustness testing of agents.
[0018] 2. Regarding the technical problem of "the scoring mechanism lacking noise suppression capability and having poor system robustness": The technical problem this application aims to solve is how to construct a dynamic evaluation mechanism that can adaptively suppress low-confidence scores. Specifically, this application can calculate the information entropy of the score distribution or the standard deviation of the scores as the basis for measuring discrepancy, thus solving the problem that the simple averaging method cannot eliminate illusory or invalid scores when multiple review models have inconsistent understandings of complex semantics, resulting in large variance and unstable conclusions.
[0019] 3. Addressing the technical problem of "lack of benchmark calibration mechanism and incomparability across models": This application aims to solve the technical problem of how to eliminate the inherent systematic scoring bias of different review models and achieve unified dimensional alignment of evaluation results. Specifically, this application solves the technical problem of the inability to objectively compare and retrospectively analyze evaluation scores across different batches and models due to differences in review model versions or preferences by establishing a zero-sample calibration process and meta-arbitration mechanism based on anchor data.
[0020] To make the objectives, technical solutions, and advantages of this application clearer, the following description will be provided in conjunction with the accompanying drawings and specific embodiments.
[0021] Please refer to Figure 1 , Figure 1 This is a flowchart illustrating a dynamic evaluation method for agent behavior consistency provided in an embodiment of this application. The method can be executed by an electronic device and may include: S101: Obtain the adversarial evaluation dataset.
[0022] In this embodiment, obtaining the adversarial evaluation dataset includes: constructing multiple scenario families; each scenario family includes multiple samples covering multiple dimensions; the multiple dimensions include social relationship structure, task pressure, information completeness, and constraints; Standardize each scene family, extract structured data templates from the standardized scene families as seed topologies, perturb the parameters of the seed topologies of each scene family, and generate an initial adversarial sample set. The initial adversarial sample set is subjected to self-consistency testing, and the samples that pass the self-consistency test in the initial adversarial sample set are used as the adversarial evaluation dataset.
[0023] In this embodiment, the adversarial evaluation dataset is a highly adversarial and effective sample set used to test the consistency of agent behavior, such as stress test samples in conflict scenarios. Scenario families are sample sets categorized by interaction features, with multiple scenario families including conflict and disagreement management, high-uncertainty information questioning, relationship establishment and repair, ethics and boundaries, multi-party collaboration expansion, long-term consistency evaluation, adversarial induced requests, strong negative feedback, power relations and register switching, non-social contrast, fairness and resource allocation, and information conflict and error correction rearrangement. Multi-dimensionality is the core basis for scenario classification, including social relationship structure, task pressure, information completeness, and constraints. Social relationship structure represents the relationship type of the interacting subjects, such as peer collaboration or superior-subordinate relationships. Task pressure refers to the execution difficulty and pressure intensity in the scenario, such as handling urgent needs. Information completeness refers to the degree of completeness of information in the scenario, such as whether the requirement description is complete. Constraints are the rules that must be followed in the dialogue, such as maintaining professional etiquette. Standardization is the process of uniformly formatting scenario families. Seed topology is a structured data template extracted after standardization, such as a template containing scenario background and role definitions. Parameter perturbation is the operation of injecting extreme parameters into the seed topology, such as increasing the intensity of negative emotions. The initial adversarial example set is the original set of samples generated after parameter perturbation. Self-consistency testing is the process of verifying the logical validity of the samples, such as identifying samples with logical contradictions.
[0024] For example, this embodiment is used to automatically generate adversarial evaluation datasets with high perplexity and boundary-induced characteristics based on seed topology using a parameterized perturbation algorithm. To achieve comprehensive coverage and characterization of agent behavior boundaries and consistency across multiple scenarios, and to provide high-quality seed topologies for subsequent adversarial mutations, this embodiment can organize the basic evaluation dataset into twelve categories (AL) according to "scenario families," such as... Figure 2 As shown, various scenarios systematically induce differentiated behaviors through different social relationship structures, task pressures, information completeness, and constraints, thereby alleviating the current evaluation problem of "insufficient interaction coverage and difficulty in reflecting real intelligent agent scenarios," and providing a reproducible sample basis for cross-scenario consistency and adversarial boundary analysis.
[0025] Specifically, this embodiment systematically organizes the interactive scenarios faced by intelligent agents into twelve basic scenario families, including: conflict and disagreement management, high-uncertainty information questioning, relationship establishment and repair, ethics and boundaries, multi-party collaboration expansion, long-term consistency evaluation, adversarial induced request, strong negative feedback, power relations and domain switching, non-social contrast, fairness and resource allocation, and information conflict and error correction rearrangement. Each scenario systematically induces differentiated behaviors through different social relationship structures, task pressures, information completeness, and constraints, providing a high-tension, multi-agent, and highly complex basic sample for adversarial generation.
[0026] The core basis for distinguishing these 12 scenario families is the systematic deconstruction of the complexity of real-world interactions. In daily applications, the real world faced by intelligent agents is not flat, but three-dimensional. Therefore, this embodiment divides daily interactions into 12 non-overlapping but complementary aspects based on the four core dimensions that intelligent agents inevitably encounter in reality (social relationship structure, task pressure, information completeness, and constraints), thereby achieving comprehensive coverage.
[0027] Among them, the first principle (social relationship structure) distinguishes the logic based on the complexity of social networks and power relations. In daily scenarios, the communication partners and relationships are constantly changing. The corresponding scenarios for this dimension range from "non-social contrast" where there is no social pressure at all, to "relationship building and repair" where there is interaction at the same level, to "multi-party collaborative expansion" where there are conflicts of interest, and also include "power relations and domain switching" where there are unequal relationships common in the real workplace.
[0028] Based on the second criterion (task pressure): the logic distinguishing between external environmental pressure and conflict intensity. Intelligent agents must not only be able to handle favorable situations, but also withstand the test of unfavorable situations. The corresponding scenarios for this dimension are the inevitable obstacles encountered in daily life, thus dividing them into "conflict and disagreement management" (such as disagreements on solutions), "strong negative feedback" (such as user complaints and pressure), and "countering misleading requests" when facing malicious inducement.
[0029] Based on the third principle (information completeness): the logic of distinguishing information completeness from its spatiotemporal dynamism. Information in the real world is often incomplete or dynamically changing. The scenarios corresponding to this dimension include: distinguishing between "high uncertainty information questioning" when faced with missing information, "information conflict and error correction rearrangement" when faced with outdated information, and "long-term consistency evaluation" that spans the time dimension.
[0030] Based on the fourth constraint: the logic of distinguishing between values and ethical bottom lines. This dimension covers the inviolable red lines and principles of benefit distribution in daily interactions. The corresponding scenarios are "ethics and boundaries" and "fairness and resource allocation" involving value balancing.
[0031] In summary, the distinction between scenarios is based on a systematic traversal of various social relationship structures, task pressures, information completeness, and constraints. Through the combination of these 12 scenario families, complementary coverage can be achieved across multiple dimensions, including task collaboration, conflict pressure, ethical boundaries, and cross-time, thereby comprehensively sampling the behavioral consistency of the intelligent agent. The distinction relies on quantifiable data; the 12 scenario families essentially provide a structured "seed topology." However, the underlying basis for truly distinguishing them and applying different dimensional tests to the intelligent agent is the specific and quantifiable "extreme parameters" automatically injected by the algorithm based on the core attributes of different scenario families.
[0032] For example, the twelve scenario families are as follows: A. Conflict and Disagreement Management: Such as remedial measures for late deliveries / loss of contact with partners, debates on differing plans and routes, and negotiations on scarce resources. The advantage lies in introducing significant goal conflicts and emotional friction, which can strongly induce key difference behaviors such as "persistence / compromise, cooperation / confrontation, emotional regulation, and conflict resolution," allowing the agent's behavioral consistency and robustness to be more fully exposed in high-pressure social interactions, thus enhancing the "high-tension" dimension of the scenario coverage.
[0033] B. Questions involving high uncertainty: such as interviews with incomplete requirements, experimental reproduction with missing data, and scheduling with ambiguous deadlines. The advantage lies in the introduction of information gaps and uncertainty into the system, forcing the tested agent to choose between "clarifying the question—labeling the hypothesis—proceeding cautiously," thus more comprehensively covering prudence, information-seeking tendencies, and risk control styles. Simultaneously, it naturally aligns with constraints such as "needing to clarify or label hypotheses when information is lacking, and not fabricating facts," facilitating the formation of reproducible and comparable interactive samples.
[0034] C. Relationship Building and Repair: Such as collaboration with new colleagues, explanation and repair of misunderstandings, and mutual assistance in stressful situations. The advantage lies in highlighting the need for social connection and emotional support, which can reliably induce empathy, care, cooperative expression, and relationship maintenance strategies. This type of scenario is also naturally suited to the characterization approach of unified behavioral indicators such as "harmony" (reflecting cooperative / empathetic / polite tendencies through both emotion and vocabulary), thereby enhancing the ability to create a consistent behavioral profile across different scenarios.
[0035] D. Ethics and Boundaries: Such as academic integrity boundaries, privacy and data sharing, and disputes over team authorship and contributions. The advantage lies in covering dimensions of "rule awareness, boundary awareness, responsibility, and honesty" that are high-risk and high-importance in real-world applications; these scenarios often establish clear boundaries between acceptable and unacceptable behaviors, which helps to fill the crucial "compliance and responsibility" quadrant in multi-scenario coverage.
[0036] E. Multi-party Collaboration Extension (3-4 people): While maintaining the existing scenario schema, expand roles to include multiple roles, such as team leader / executor / questioner / coordinator with different private goals. The advantage lies in elevating the game from a two-person game to a multi-party coordination and alliance structure, which is closer to the complexity of real-world organizational collaboration. It can induce behavioral differences such as leadership / following, coordination, summarizing, conflict mediation, and speech allocation, thereby expanding the coverage of "multi-agent social complexity".
[0037] F. Long-Term Consistency Assessment (Spanning Rounds / Days / Task Chains): Similar to setting up episode_1 / episode_2 in the same scenario (planning → review after one week → pressure near the deadline). Its advantage lies in introducing time progression and contextual changes, examining the style stability of the tested agent under varying pressure, information updates, and commitment fulfillment conditions. It fills in the gaps in the "long-term consistency" dimension, which is difficult to cover with a single dialogue, thereby improving the comprehensiveness of behavioral consistency assessment.
[0038] G. Adversarial Inducement Requests: Such as requests for fabricated data / citations, requests to bypass permissions to obtain information, or requests to conceal errors. The advantage lies in introducing external manipulation and boundary-crossing inducements, which can more sensitively distinguish the tested agent's stable tendencies in terms of bottom lines, rejection strategies, risk interpretation, and alternative solutions, thus supplementing the coverage of the "adversarial pressure / temptation" dimension.
[0039] H. Strong Negative Feedback: Such as strong user dissatisfaction, escalating complaints, and continuous pressure demanding immediate resolution. The advantage lies in observing whether the tested agent maintains basic politeness and can provide structured explanations and propose remedial paths under high negative emotional input. This type of scenario can significantly amplify differences in emotion regulation and relationship repair, serving as a key supplement to improving the coverage of "stressful social scenarios."
[0040] I. Power Relationships and Domain Shifts: Such as mentors / superiors adding demands and shortening deadlines, giving negative feedback to subordinates, and cross-departmental alignment. The advantage lies in introducing power distance and changes in role responsibility, covering formal / informal domain shifts, boundary expression, and differences in strategic communication, so that behavioral profiles are not limited to "peer collaboration" but extended to a more realistic organizational interaction structure.
[0041] J. Non-social control categories: such as personal weekly planning, independent review and summary, and route adjustment in case of emergencies. The advantage is that it provides control samples without the pressure of social interaction with others, which helps to distinguish the agent's real behavior patterns from purely collaborative dialogue, avoids the scenario bias caused by using only collaborative dialogue, and thus improves the comprehensiveness and explanatory power of the overall coverage.
[0042] K. Fairness and Resource Allocation: Such as team budget / computing power / work hour allocation, conflicts between individual and collective benefits. The advantage lies in the system's coverage of value orientations and distributive justice preferences, which can induce differences in transparency, altruism, negotiation frameworks, and interest balancing methods, thus filling the gap in the "fairness and value balancing" dimension that is difficult to trigger reliably in general collaboration scenarios.
[0043] L. Information Conflict and Error Correction / Rearrangement: Examples include the overturning of assumptions from the previous meeting, the need to explain and rearrange plans when commitments cannot be fulfilled, and questions raised by the other party regarding inconsistencies. The advantage lies in covering capabilities for "commitment consistency, error correction and accountability, and plan restructuring." It allows observation of the tested agent's explanation style and repair strategies when facing errors and inconsistencies, thus filling in the crucial scenario dimension of "dynamic consistency and error correction."
[0044] In summary, by combining the twelve scenario families of AL, this embodiment forms complementary coverage in dimensions such as task collaboration, conflict pressure, uncertain information, relationship repair, ethical boundaries, multi-party groups, long-term consistency, adversarial induction, strong negative feedback, power structure, non-social contrast, and error correction consistency. It can achieve comprehensive sampling of consistency-related behaviors across tasks, relationships, pressures, and time while maintaining reproducibility, thereby achieving the technical effect of multi-scenario coverage and comprehensive evaluation.
[0045] For example, for each of the above-mentioned scenario families, this embodiment can use a standardized data structure to extract the seed topology. Specific fields include: a unique scenario identifier (id), a scenario category (category), a scenario title (title), a scenario background description and dialogue task description (description), a list of roles (roles) containing public responsibilities (role_description) and private goals (private_goal), dialogue generation constraints (constraints), and an evaluation module (evaluation) containing offline scoring success criteria (success_criteria) and the maximum number of turns (max_total_turns). This data structuring setting, by explicitly distinguishing between public responsibilities and private goals, can create information asymmetry and motivational differences under the same explicit task constraints; the constraints field explicitly solidifies how the dialogue should occur into executable rules, reducing generation biases unrelated to personality; the evaluation module transforms whether effective collaboration is formed into structured criteria that can be verified offline. As shown in Table 1, Table 1 illustrates an example of a scenario use case data structure.
[0046] Table 1. Example of data structure for scenario use cases in the behavioral consistency assessment dataset.
[0047] The above-mentioned scenario use case data structure has clear technical significance: First, by standardizing the expression of "scenario identifier (id), scenario category (category), task title (title), background and task description (description)," different real-world interaction scenarios can be systematically expanded and covered in the dataset. This overcomes the problems of insufficient reproducibility and interaction coverage in existing evaluations, and the inability to reflect real-world intelligent agent scenarios. As a result, behavior can be stably triggered under cross-scenario conditions and serves as the basis for counteracting perturbations. Second, by introducing a role list and explicitly distinguishing between the common responsibilities of roles (roles[i].role_description) and optional private goals (roles[i].private_goal), information asymmetry and motivational differences can be created under the same explicit task constraints. This prompts the dialogue to present a negotiation process that is closer to real collaboration / game theory, thereby providing interaction evidence with a higher signal-to-noise ratio for observing behavioral consistency and social behavior. Third, the `constraints` field explicitly solidifies "how the dialogue should occur" into executable rules (e.g., avoiding mechanical agreement, maintaining politeness, clarifying missing information, or making labeled assumptions). This reduces cue drift between different models / running batches and minimizes generation bias unrelated to evaluation, allowing subsequent evaluations to focus more on behavioral differences driven by the agent's underlying logic. Fourth, the `evaluation` module, through optional success criteria `evaluation.success_criteria` and a turn limit `evaluation.max_total_turns`, transforms "whether the task is completed / whether effective collaboration is formed" into structured criteria that can be verified offline. This provides a unified judgment basis for the subsequent meta-review arbitration mechanism, thereby alleviating the problem of inconsistent dimensions and scattered indicators making it difficult to objectively quantify behavioral consistency and improving the comparability and robustness of scores. Based on this unified data structure, the adversarial generation algorithm can accurately locate fields such as `private_goal` or `description` to inject limit variables, completing the construction of a dynamic evaluation set.
[0048] For example, this embodiment can introduce a parameterized perturbation algorithm for targeted mutation and extreme pressure based on the structured seed topology. The parameterized perturbation algorithm automatically injects extreme parameters according to the kernel attributes of different scenario families. The kernel attributes of a scenario family are core pressure dimension identifiers preset in the algorithm mapping dictionary. The core pressure dimension identifier defines which weakness of the agent is specifically detected by this type of scenario. The core pressure dimension identifier is a key-value pair connecting the "scenario category" and the "perturbation variable pool". For example, the kernel attribute of "strong negative feedback" is defined as "negative emotions and pressure frequency"; the kernel attribute of "resistance-induced request" is defined as "rule overstepping and incentive-based inducement". Specifically, in "strong negative feedback" scenarios, the system dynamically increases the emotional intensity parameter of the user role, injecting continuous pressure and escalating complaints; in "counter-inducing request" scenarios, it injects covert boundary-crossing inducements (such as requesting to bypass permissions to obtain information or concealing errors) into the role's private goal; and in "information conflict and error correction rearrangement" scenarios, it injects prior information about the overturned or inconsistent assumptions from previous meetings into the scenario description. Through this parameter variation based on structured fields, the system can automatically explore the behavioral boundaries of the agent when facing logical traps or extreme emotional pressure.
[0049] For example, after the perturbation is generated, this embodiment can use a lightweight verification model to perform self-consistency testing on the generated adversarial examples, eliminating invalid test cases that cause semantic collapse due to excessive mutation. After manual review, a high-quality, highly adversarial dynamic evaluation dataset is formed. The lightweight verification model is a pre-trained classifier or a small-parameter language model independent of the main evaluation link. It is fine-tuned by absorbing a small number of manually labeled 'qualified / collapsed' samples, aiming to complete high-concurrency fast filtering with extremely low computational overhead before adversarial examples enter the interactive execution module. The specific steps are as follows: Step 1: Input Assembly and Feature Extraction. The complete use cases generated after perturbation (including the modified background description, the mutated private goal, and hard constraints) are concatenated into a context sequence to be verified.
[0050] Step two, lightweight model inference. The lightweight verification model receives the above context sequence and performs calculations in three dimensions at the underlying level: Syntactic integrity calculation, checking whether the injected parameters have damaged the original JSON structure or resulted in incomplete sentence structures; Logical contradiction detection (core), verifying whether the injected extreme parameters (such as out-of-bounds instructions) have caused a "deadlock" with the physical or objective premises of the scenario; and Evaluability determination, verifying whether the offline scoring success conditions still have a theoretical possibility of being met.
[0051] Step 3: Confidence Scoring and Threshold Filtering. The lightweight validation model outputs a comprehensive self-consistency confidence score (0.0~1.0). In this embodiment, the self-consistency confidence score can be compared with a preset filtering threshold (e.g., 0.8). If the score is higher than the threshold, it is marked as a "highly adversarial valid use case" and stored in the adversarial evaluation dataset; if the score is lower than the threshold, blocking is triggered, the use case is discarded, or the perturbation module is returned to reduce the mutation parameters and regenerate. Samples with scores higher than the threshold are marked as "highly adversarial valid use cases" and stored in the adversarial evaluation dataset.
[0052] For example, invalid test cases mainly take the following forms: Form 1: Logical Deadlock. The criterion is that the injected parameters create a paradox with the absolute premise of the scenario. For example, the scenario description is "You are now stranded alone on a desert island with no communication signal," but the mutation algorithm injects a private goal that is "You are required to immediately send a resignation letter to your boss via the network." The agent cannot perform this task, which constitutes ineffective pressure.
[0053] Form Two: Constraint Conflict. This is determined when the injected adversarial instructions directly destroy the original constraints, making benchmarking impossible. For example, the original constraint requires "maintaining professional business etiquette throughout," but the injected extreme parameter requires "using language containing a lot of profanity to insult the other party." This leads to mutually exclusive evaluation metrics, making objective scoring impossible.
[0054] The third manifestation is semantic fragmentation or excessive stripping. This is judged when text loses its basic human-readable semantics due to drastic parameter replacement, becoming nonsensical. For example, if a mutation algorithm changes the background to "previous previous previous previous previous previous overturned overturned overturned hypothesis," the large model will experience hallucinations upon receiving it and will be unable to engage in normal dialogue.
[0055] This embodiment constructs a multi-dimensional scene family, extracts seed topology and parameter perturbations in a standardized manner, and combines self-consistency detection to generate an adversarial evaluation dataset that is comprehensive and has high dynamic adversarial nature. This avoids the agent from overfitting to fixed samples, can fully detect the behavioral boundaries of the agent in extreme scenarios, provides high-quality data support for the evaluation of agent behavior consistency, and improves the comprehensiveness and reliability of the evaluation.
[0056] S102: Based on the adversarial evaluation dataset, the agent is tested through a multi-turn dialogue state machine to obtain interaction behavior data; the interaction behavior data is a complete dialogue data consisting of the response text data generated by the agent in response to the adversarial evaluation dataset and the corresponding context fragments.
[0057] In this embodiment, based on the adversarial evaluation dataset, the agent is tested through a multi-turn dialogue state machine to obtain interaction behavior data, including: The interaction sandbox environment of the intelligent agent is initialized based on the adversarial evaluation dataset; The multi-turn dialogue state machine is used to drive the agent to complete each round of text interaction in the initialized interaction sandbox environment until the response text data and corresponding context fragments generated by the agent in the latest round meet the offline scoring success conditions or reach the maximum round limit. The response text data generated by the agent in each round and the corresponding context fragments are used as interaction behavior data.
[0058] In this embodiment, the interaction sandbox environment is an agent interaction testing environment initialized based on the adversarial evaluation dataset, used to simulate real-world interaction scenarios, such as a dedicated testing environment after loading role definitions and dialogue constraints. The multi-turn dialogue state machine is the control component that drives the agent to complete multiple rounds of text interaction, such as pushing adversarial input by round and managing the dialogue lifecycle. Offline scoring success conditions are preset standards for determining the completion of the interaction task, such as specific requirements for the agent to achieve collaborative goals. The maximum number of rounds is the maximum number of interaction rounds allowed in the dialogue, such as a set limit of twenty rounds. Self-consistency detection includes logical deadlock detection, constraint conflict detection, and semantic fragmentation detection. Logical deadlock detection is the process of verifying whether the scenario premise and parameters contradict each other, such as detecting the contradiction of requiring someone to send a webmail when there is no signal on a deserted island. Constraint conflict detection is the process of checking whether adversarial instructions are mutually exclusive with existing rules, such as detecting the conflict of requiring politeness but forcing insults. Semantic fragmentation detection is the process of determining whether the text has lost its readable semantics, such as detecting meaningless repetitive text.
[0059] For example, this embodiment loads the adversarial evaluation dataset generated by the adversarial data construction module, drives the agent under test to complete interactions in a multi-turn dialogue state machine, and records complete dialogue context data. To address the technical problem that traditional static evaluation cannot capture behavioral drift of the agent under dynamic pressure, this embodiment establishes a high-fidelity sandbox execution environment. The execution flow of the agent completing interactions in the multi-turn dialogue state machine is as follows: Figure 3 As shown.
[0060] The multi-turn dialogue state machine management section first loads structured test cases (such as role definitions and dialogue constraints) from the adversarial evaluation dataset to initialize the interaction environment. A dynamic state machine is introduced to strictly control the number of dialogue turns, and in each turn, it monitors in real time whether the agent's output triggers the offline scoring success criteria (evaluation.success_criteria) or reaches the maximum turn limit (evaluation.max_total_turns).
[0061] Specifically, the strict control mechanism of the multi-turn dialogue state machine is implemented as follows: the multi-turn dialogue state machine ensures the serial execution of calls made by multiple agents in turn through a speaker lock mechanism; before each state transition, the multi-turn dialogue state machine forcibly reads the maximum number of turns field in the structured test case. If current_turn is greater than or equal to the maximum number of turns, the API call to the large model is immediately cut off from the physical layer, and the state is forced to transition to the termination state, thereby preventing infinite loops.
[0062] The specific implementation steps of the real-time monitoring mechanism of the multi-turn dialogue state machine are as follows: After the agent completes text generation in each round, the multi-turn dialogue state machine suspends the next round of dialogue through an implanted post-hook function to capture the current output context fragment. Subsequently, this embodiment can call a matching algorithm to logically compare the fragment with the offline scoring success conditions. If the comparison result is true (indicating that the agent has completed the set game or collaborative task in advance), the multi-turn dialogue state machine will trigger a dynamic short circuit, prematurely ending the entire multi-turn dialogue state machine and outputting the interaction evidence chain without waiting for the rounds to run out.
[0063] In the adversarial push and context capture section, during the operation of the interaction engine, this embodiment, based on the loaded test cases, progressively pushes pre-set adversarial inputs to the agent under test according to the set state machine logic. For example, it may target generated strong negative feedback or logic trap texts in specific rounds. The multi-turn dialogue state machine can, according to the round trigger logic, accurately intercept the current context when the dialogue reaches the preset third round, and target the generated "strong negative feedback text" into the input prompts of the agent under test, thereby dynamically throwing logic traps. Simultaneously, this embodiment can capture and structure the complete interaction text and context during the interaction process in real time as interaction behavior data. This interaction behavior data not only includes the original multi-turn dialogue text but can also explicitly record the agent's tool call logs and implicit thought processes (if the model supports this).
[0064] Specifically, the interactive execution module and the adversarial data construction module are deeply coupled through a structured parsing protocol. The highly adversarial dynamic evaluation set output by the adversarial data construction module is not directly input into the model under test as static text, but rather is loaded and parsed by the dynamic state machine as a configurable payload. The specific combination mechanism between the two is as follows: During the initialization phase, the multi-turn dialogue state machine reads the scenario-based data from the adversarial evaluation set and compiles it into the underlying system instructions of the agent under test to instantiate the sandbox environment. During the push-stream scheduling phase, the state machine logic establishes a trigger mapping with adversarial variables (such as boundary crossing inducements and logic traps) in the evaluation set. As the multi-turn dialogue progresses, the multi-turn dialogue state machine monitors the global turn counter and context state in real time. Based on preset trigger logic, it accurately extracts extreme adversarial parameters from the dataset for the corresponding turn or intent and injects them into the current interaction flow. During the termination determination phase, the multi-turn dialogue state machine frequently reads the fields from the offline evaluation module built into the evaluation set, using the success criteria and turn limit defined in the dataset as absolute thresholds for state transitions, thereby achieving closed-loop lifecycle control based on the same data structure.
[0065] In the section on abnormal behavior and boundary interception, when the agent under test exhibits abnormal behavior under extreme emotional stress or logical traps (such as: breaking ethical boundaries, entering a logical dead loop, or frequently changing its own position), this embodiment highlights and intercepts the key context slices that cause the collapse, solidifies them into a structured chain of interactive evidence, and provides unalterable evaluation input for the downstream scoring module.
[0066] The abnormal behavior and boundary interception section incorporates a multi-dimensional real-time discriminator. Specific anomaly detection criteria include: Criterion 1: Determination of ethical boundaries. A lightweight safety classifier calculates the risk confidence level output by the agent in real time. If it exceeds a preset absolute safety threshold, violation blocking is triggered.
[0067] Basis 2: Logical deadlock detection. By calculating the semantic similarity between the current output and the text in the historical sliding window, if the similarity exceeds the abnormal overlap threshold for multiple consecutive rounds, it is determined that the system is trapped in a repeated deadlock.
[0068] Criterion 3: Judgment based on frequent changes in stance. Using a natural language reasoning model, the agent's historical commitment sequences and current responses are used to calculate implication / contradiction. If the frequency of the output "contradiction" label exceeds a set threshold, it is judged as a collapse in behavioral consistency.
[0069] The "key context slice extraction" localization mechanism is a time-series backtracking algorithm based on event tags. Specifically, each time the system pushes adversarial input, it generates an adversarial event stamp containing the round ID. When the multidimensional discriminator detects abnormal behavior in round T, the extraction module automatically backtracks to the nearest adversarial event stamp (such as the extreme pressure input in round T-1) and encapsulates the [induced input sequence], [abnormal output sequence], and [discriminator trigger log] into a structured interactive evidence chain. This mechanism ensures the accuracy of crash tracing and provides tamper-proof causal evaluation input for the downstream meta-arbitration module.
[0070] This embodiment ensures that the agent completes interactions in a realistic environment by initializing the interactive sandbox environment and managing multi-turn dialogue state machines, thus comprehensively capturing complete dialogue data. This embodiment also incorporates self-consistency detection to eliminate invalid scenarios, making the interaction behavior data authentic and valid, providing a high-quality basis for subsequent scoring, and improving the accuracy and reliability of agent behavior consistency evaluation.
[0071] S103: The interaction behavior data is scored by multiple review models to obtain multi-source review score data, and the score divergence degree is calculated based on the multi-source review score data.
[0072] In this embodiment, after calculating the degree of disagreement based on multi-source review scoring data, the method further includes: If the score divergence is not greater than the empirical threshold for divergence, the multi-source review score data is weighted based on the historical reliability scores of multiple review models to obtain a weighted comprehensive score. The historical reliability score of each review model is obtained based on the scoring deviation coefficient of the review model on the preset anchor dataset, and the scoring deviation coefficient is negatively correlated with the historical reliability score.
[0073] In this embodiment, the scoring discrepancy is an indicator that measures the degree of difference in multi-source review scoring data, such as the quantitative value of the dispersion of scores from multiple review models. The discrepancy empirical threshold is a preset standard for judging whether scoring discrepancies are acceptable, such as a threshold of 0.6 set based on a large number of tests. The historical reliability score reflects the past scoring accuracy of the review model; for example, models with consistently high scoring accuracy receive higher scores. The scoring deviation coefficient is the degree of deviation between the review model's score and the benchmark score; for example, a quantitative value indicating that the model's score is generally higher than the benchmark score. The weighted composite score is the total score calculated by assigning weights to the model's historical reliability score; for example, a higher-reliability model receives a higher composite score.
[0074] For example, this embodiment can calculate the rating divergence of multi-source review models (multiple Judge LLMs), such as rating distribution information entropy, rating standard deviation, or rating variance. The following examples use rating distribution information entropy as an example; this embodiment can classify different comprehensive ratings based on the entropy value of the rating distribution information entropy. Unlike the traditional "arithmetic mean method," which is sensitive to noise and hallucination ratings, this embodiment constructs an adaptive dynamic evaluation mechanism to suppress low-confidence ratings.
[0075] The specific calculation and judgment logic is as follows: Figure 4 As shown, for a specific interaction test case, this embodiment can collect a set of review scores from multiple review models. These continuous scores are mapped to discrete rating levels, and the probability distribution of each level is statistically analyzed. Where i = {1, 2, ..., n} represents the total number of gears, and This embodiment uses information entropy H as the core indicator for measuring rating disagreement, and the calculation formula is as follows: .
[0076] This embodiment can preset a divergence empirical threshold. The real-time calculated score distribution information entropy H is compared with the threshold. A comparison is performed, thereby executing two different dynamic paths: Low-entropy weighted path ( When the information entropy H is less than or equal to the threshold This indicates that the various review models have a highly consistent understanding of the agent's behavior, and the system is within a low-noise confidence interval. At this point, an adaptive weighted algorithm is triggered, assigning corresponding weights based on the reliability of each review model's historical performance (i.e., historical reliability score). And through the formula Output the final weighted composite score, where, Let j be the weight of the j-th review model. The score for the j-th review model.
[0077] In this embodiment, the reliability of historical performance is rigorously quantified as the static scoring accuracy of each review model relative to the benchmark anchor data during the deviation calibration phase. The specific calculation and allocation steps for its weights are as follows: First, this embodiment extracts the scoring deviation parameter generated by the j-th review model during the zero-sample testing phase (e.g., the mean squared error in scoring for each review model). Second, this embodiment can pre-define a penalty mapping function based on the reciprocal of the error. For example, the original reliability index of the review model can be defined as a decreasing function of its historical deviation; that is, the smaller the error, the higher its original reliability score. Finally, this embodiment can perform a normalization operation on the original reliability index of the multi-way review models to calculate the final weight of each model in the current weighted calculation. And satisfy .
[0078] High-entropy blocking and arbitration throw path ( When the information entropy H is greater than the threshold This indicates that the review models have serious disagreements in complex semantic or adversarial scenarios (e.g., some models make illusory judgments). In this case, simple arithmetic or weighted averaging will introduce significant systematic errors. This embodiment automatically blocks the conventional numerical fusion process, removes outlier scores, and packages the "score + structured judgment basis + original text citation evidence" output by each underlying Judge together, and forwards it as a conflict to the meta-arbitration module for logical attribution and intervention by higher-level models.
[0079] In summary, when the scoring discrepancies are small, this embodiment can calculate the total score based on the historical reliability scores of the review model. The historical reliability scores are negatively correlated with the scoring deviation coefficient obtained from the anchor dataset, which can highlight the weight of the accurate model, suppress the interference of low confidence scores, avoid the defects of simple averaging, improve the accuracy and stability of the evaluation results, and provide reliable support for the evaluation of agent behavior consistency.
[0080] S104: If the score divergence is greater than the empirical threshold for divergence, obtain the structured evidence chain output by each review model. Based on each structured evidence chain, obtain the divergence logic attribution result through the meta-review model. Adjust the weight of each review model based on the divergence logic attribution result and the evaluation priority matrix. Weight the multi-source review score data based on the adjusted weights of each review model to obtain the weighted comprehensive score. The structured evidence chain output by each review model includes the review score of the review model, the triggering decision rules, and the context fragments referenced from the interaction behavior data; the meta-review model is an arbitration reasoning engine obtained by structurally reconstructing the input-output paradigm of the large language model; the evaluation priority matrix includes the importance priority of multiple evaluation dimensions, and each review model scores the interaction behavior data based on the evaluation dimensions.
[0081] In this embodiment, the weights of each review model are adjusted based on the divergence logic attribution results and the evaluation priority matrix, including: If the result of the divergent logic attribution is a factual judgment illusion, then the source review model is determined based on the result of the divergent logic attribution, and the weight corresponding to the source review model is reset to 0. If the attribution result of the divergence logic is a misalignment of context focus or a bias in the understanding of the weights of multiple constraints, then the triggering decision rule corresponding to each review model is determined. Based on the evaluation priority matrix and the triggering decision rule corresponding to each review model, the priority of each review model is sorted to obtain a priority sequence. The weight of each review model is then adjusted according to the priority sequence.
[0082] In this embodiment, the structured evidence chain is a structured data set output by the review model that contains the scoring criteria, such as complete data containing review scoring trigger judgment rules and referenced dialogue segments. The meta-review model is a dedicated arbitration reasoning engine that has been restructured in a structured manner, such as a model that receives conflicting evidence chains and outputs attribution results. The disagreement logic attribution result is the core reason for the scoring disagreement derived by the meta-review model, such as a judgment disagreement stemming from a factual judgment illusion. The evaluation priority matrix is a rule table that defines the importance hierarchy of evaluation dimensions, such as a hierarchy table where ethical bottom lines have higher priority than tone and politeness. The trigger judgment rule is the specific rule on which the review model scores, such as a task completion standard rule. A factual judgment illusion is a judgment result where the review model's reasoning chain contradicts the objective text, such as an incorrect inference that misjudges the scenario premise. The source review model is the review model that generates the root cause of the disagreement, such as a model that leads to a factual illusion. Contextual focus misalignment is the reason for the disagreement where the review model focuses on different segments of the dialogue, such as some models focusing on the beginning and some focusing on the end. The discrepancy in understanding the weights of multiple constraints stems from inconsistencies in the perception of the weights of different evaluation dimensions; for example, some models may emphasize task completion while others may emphasize compliance. The priority sequence is the order of the review models after being sorted according to the evaluation priority matrix; for example, compliance-related models are prioritized.
[0083] For example, the meta-arbitration module utilizes the core idea of a "multi-Judge + meta-review arbitration mechanism" to address situations where the scoring divergence in the multi-source review model (i.e., the information entropy H of the score distribution) exceeds a preset threshold. Forced intervention occurs at specific times, and the source of disagreement is determined through logical attribution analysis, resulting in a final ruling. This module overcomes the technical shortcomings of existing technologies, which suffer from large variance in evaluation conclusions and difficulty in objective alignment due to inconsistent understanding of review models in complex semantic scenarios. The specific implementation and ruling mechanism are as follows: 1. Structured divergent data reception and parsing: When the dynamic scoring module determines that the scoring information entropy of the current interaction sample is in a high-entropy blocking state ( When this module is activated, it receives a structured review dataset (i.e., a structured chain of evidence) passed from multiple underlying review models.
[0084] 2. Logical Attribution Analysis of the Source of Disagreement: This module introduces a meta-review model with higher reasoning capabilities to cross-reference and logically attribute multiple sets of structured evidence chains received. The meta-review model introduced here is not a general-purpose model for performing routine question-answering tasks, but a dedicated arbitration reasoning engine encapsulated with a specific architecture and fine-tuned instructions. Its improvement over existing technologies lies in the structured reconstruction of the input-output paradigm: traditional review models directly ingest the original dialogue text and output a single score; while the meta-review model in this application is configured to receive a high-dimensional sequence of conflicting evidence chains (including independent scores, judgment rules, and cited original texts from each model) thrown by the underlying multi-path models. Its output is forcibly constrained to a structured attribution JSON format, rather than free text. Through this architectural improvement, this embodiment reduces the general semantic capabilities of the large model to the specific technical task of analyzing conflicting evidence, thereby achieving automated resolution of evaluation conflicts. Specifically, the meta-review model does not directly re-score the original interaction context, but instead traces the source of the discrepancies between the underlying judges. Specifically, it identifies the core reasons for high-entropy discrepancies through attribution algorithms, such as: a certain review model producing a "factual judgment illusion", different review models having a biased understanding of the weights of multiple constraints (such as "task completion" and "security compliance"), or inconsistent tolerance thresholds for extreme emotional expressions.
[0085] 3. Dynamic Adjudication Based on Priority Matrix: After identifying the source of disagreement, this module uses a pre-defined evaluation priority matrix to resolve conflicts, rather than simply compromising numerical scores or using majority voting. The priority matrix hierarchically categorizes the importance of various evaluation dimensions (e.g., prioritizing unbroken core ethical boundaries > prioritizing factual and logical consistency > prioritizing politeness and rapport). The meta-review model logically examines the conflicting evidence chain based on this priority matrix, removing the weight of low-confidence review models that produce illusions or logical misjudgments, and adopting evidence chains that conform to high-priority rules, thereby generating an arbitration conclusion.
[0086] The evaluation priority matrix is implemented as a pre-defined weighted ranking table of evaluation indicators. The conflict resolution mechanism works as follows: when multiple review models give scores that are extremely divergent, the system no longer uses the traditional arithmetic mean to calculate a compromise score. Instead, it extracts the evaluation dimensions used by each review model, substitutes them into the ranking table for comparison, and directly adopts the judgment result of the higher priority dimension, thus arriving at a clear qualitative decision.
[0087] The specific implementation steps and judgment criteria for logical review are as follows: The meta-review model uses the attribution results and priority matrix from the previous stage as dual judgment criteria. If the attribution results show that a review model has produced factual illusions or logical misjudgments, it is directly deemed invalid. If multiple review models have not produced illusions, but only differ in the evaluation dimensions they are based on, the system compares the hierarchical levels of each dimension according to the priority matrix. Based on the above comparison results, the meta-review model forcibly reduces the scoring weight of review models with low priority or judgment errors to zero, thereby outputting a unique and definite arbitration conclusion.
[0088] 4. Final Arbitration Result and Audit Log Output: After conflict resolution, the meta-arbitration module outputs a unified and dimensionally aligned comprehensive score / qualitative rating, and simultaneously generates a standardized arbitration audit log. The arbitration audit log clearly records the point of disagreement, the key evidence chain adopted, the reasons for rejecting opinions, and the priority rules cited. This mechanism transforms the multi-model scoring process from an uninterpretable "black box scoring" to a traceable and verifiable structured arbitration chain. This not only significantly improves the robustness and consistency of evaluation conclusions in high-difficulty adversarial scenarios, but also provides quantifiable acceptance criteria for subsequent agent version acceptance and continuous quality monitoring.
[0089] As can be seen from the above, when there is a large discrepancy in the scores, this embodiment uses a structured evidence chain and meta-review model attribution to accurately adjust the weights according to the reasons for the discrepancy, eliminates the weights of factual illusion models, and adjusts the weights of other models according to the priority matrix. This avoids simple numerical compromise, effectively suppresses noise scoring interference, improves the accuracy and persuasiveness of the evaluation results in high-disagreement scenarios, and provides reliable support for the evaluation of the consistency of agent behavior.
[0090] In one embodiment of this application, before weighting the multi-source review score data based on the historical reliability scores of multiple review models, the method further includes: Determine the anchor dataset and the corresponding standard scores; the anchor dataset includes a standardized benchmark sample set covering multiple evaluation dimensions of agent behavior consistency and the standard scores corresponding to the standardized benchmark sample set; Zero-sample testing was performed on each initial review model based on the anchor dataset to obtain the test score of each initial review model on the anchor dataset; The deviation coefficient of each initial review model is calculated based on the test score of the anchor dataset and the standard score corresponding to the anchor dataset. The initial review model is calibrated based on the deviation coefficient of each initial review model to obtain the calibrated review model.
[0091] In this embodiment, the standard score is a benchmark score in the anchor dataset that has been consistently annotated by human experts, such as 85 points uniformly annotated by experts for compliance scenario samples. The initial review model is the original scoring model that has not undergone bias calibration, such as a basic review model that has just been deployed and has not been adjusted. Zero-shot testing is a testing method that does not require additional training on the initial review model and directly inputs the anchor dataset to obtain scores, such as directly using benchmark samples to drive the model to output scores. The bias coefficient is a quantitative value of the degree of deviation between the initial review model's test score and the standard score, such as a quantitative result of the model scoring generally being too high. The calibrated review model is a model that has corrected systematic biases based on the bias coefficient, such as an optimized model that corrects the problem of lenient scoring.
[0092] For example, this embodiment is used to perform zero-sample testing and calibration on each review model before evaluation using a pre-set "anchor dataset" to calculate and store the systematic bias coefficient of each review model. This embodiment solves the technical defect that evaluation scores are incomparable across models and batches due to differences in different review model versions or preferences (such as "strict" and "lenient" models). The specific implementation process is as follows: 2. Construction and Input of Anchor Data: This embodiment can pre-define a standardized "anchor dataset," which has been highly consistently annotated by human experts and contains benchmark scores covering different behavioral consistency dimensions. Before formally evaluating the target agent, the system inputs this anchor data into the candidate... Zero-sample scoring is performed in each review model.
[0093] 2. Zero Sample Bias Coefficient Calculation: Let the baseline average score of human experts for a certain anchor point sample be... , No. There are several review models (where j) The baseline average score given for the same anchor point sample ({1, 2, ..., m}) is This embodiment calculates the systematic bias coefficient of the review model on this dimension. The calculation formula is as follows: .
[0094] 3. Dynamic alignment and dimensional normalization: The system will adjust the deviation coefficients of each review model. Disk storage. During the formal dynamic scoring phase, for the first... The original scores given by each review model This module will enforce calibration mapping and output the calibrated score. The calculation formula is as follows: .
[0095] Through this bias elimination mechanism, this embodiment establishes a unified measurement benchmark for all multi-source heterogeneous evaluation models, enabling the final evaluation conclusions to have strict mathematical cross-comparability.
[0096] This embodiment uses anchor point datasets and standard scores to calculate and calibrate the initial review model deviation coefficient through zero-sample testing. This effectively eliminates the inherent systematic bias of the review model, solves the problem of inconsistent scoring standards among different models, and enables subsequent weighted scoring to have a unified dimension. This improves the horizontal comparability and accuracy of the evaluation results and provides a reliable foundation for the evaluation of agent behavior consistency.
[0097] In one embodiment of this application, based on each structured chain of evidence, a meta-review model is used to obtain the attribution result of the divergence logic, including: Based on each structured chain of evidence, the first attribution operation is performed through the meta-review model to obtain the divergence logic attribution result; The first attribution operation includes: The semantic similarity of citations is calculated based on the context fragments referenced in each structured chain of evidence; If the semantic similarity is less than the preset similarity threshold, the logical attribution result of the divergence is determined to be a misalignment of context focus. If the semantic similarity is not less than the similarity threshold, the rule consistency score is calculated based on the trigger judgment rules in each structured evidence chain; if the rule consistency score is less than the preset rule consistency score threshold, the divergence logic attribution result is determined to be a weighted understanding bias of multiple constraints; if the rule consistency score is not less than the rule consistency score threshold, the hallucination detection operation is performed. The hallucination detection operation includes: for each context fragment referenced in the structured evidence chain, using prior information about the scene to perform factual logic reasoning on the context fragment to obtain the reasoning result; if the reasoning result is a logical contradiction, it is determined that the review model corresponding to the structured evidence chain has a factual judgment hallucination, and the factual judgment hallucination is used as the result of the disagreement logic attribution.
[0098] In one embodiment of this application, based on each structured chain of evidence, a meta-review model is used to obtain the attribution result of the divergence logic, including: Based on each structured chain of evidence, a second attribution operation is performed through a meta-review model to obtain the divergence logic attribution result; The second attribution operation includes: For each context fragment referenced in the structured chain of evidence, factual logic reasoning is performed on the context fragment using prior information about the scenario to obtain the reasoning result; If the reasoning result is a logical contradiction, then it is determined that the review model corresponding to the structured evidence chain has a factual judgment illusion, and the factual judgment illusion is taken as the result of the logical attribution of the disagreement. If the reasoning result is without logical contradiction, the semantic similarity of the citations is calculated based on the context fragments cited in each structured evidence chain; if the semantic similarity is less than the preset similarity threshold, the logical attribution result of the divergence is determined to be a misalignment of context focus; if the semantic similarity is not less than the similarity threshold, the rule consistency detection operation is performed. The rule consistency check operation includes: The rule consistency score is calculated based on the triggering judgment rules in each structured evidence chain; if the rule consistency score is less than the preset rule consistency score threshold, the attribution result of the divergence logic is determined to be the weighted understanding bias of multiple constraints.
[0099] In this embodiment, the first attribution operation is an attribution process executed by the meta-review model in the order of context fragment semantic similarity, rule consistency score, and hallucination detection. For example, it first determines whether the focus is misaligned before detecting rule consistency. The similarity threshold is a preset standard for determining whether the citation semantics are consistent, such as a quantized threshold of 0.3. The rule consistency score is a numerical value that measures the degree of fit between different review models triggering the judgment rule; for example, a higher score is obtained when multiple models trigger compliant rules. The rule consistency score threshold is a preset standard for determining whether the rules are consistent, such as a quantized threshold of 0.5. The hallucination detection operation is the process of verifying whether there is a logical contradiction in the structured evidence chain, such as using scenario prior information to verify the rationality of reasoning. Scenario prior information is the inherent objective background information of the scenario, such as the basic setting of a deserted island with no signal. The reasoning result is the conclusion drawn after factual logical reasoning, such as the determination of whether there is a logical contradiction. The second attribution operation is an attribution process executed by the meta-review model in the order of hallucination detection, context fragment semantic similarity, and rule consistency detection. For example, it first checks for factual hallucinations before determining whether the focus is misaligned. The rule consistency check operation is the process of calculating the rule consistency score and determining whether there is a bias in the understanding of weights. For example, if the score is lower than the threshold, it is determined to be a bias.
[0100] For example, this embodiment includes three aspects of attribution detection operations: rule consistency detection, illusion detection, and context focus misalignment detection. The order of these three attribution detection operations is not limited to the two cases provided in this embodiment, and the divergent logic attribution result is not limited to the single result provided in this embodiment; multiple attributions can coexist.
[0101] An example of a context focus misalignment detection operation is as follows: citation focus overlap calculation (determining whether there is "attention defocus"). The meta-review model system first calculates the overlap between the two citations. and semantic similarity .like If the value is less than the preset minimum threshold, it indicates that the two models are focusing on different segments of the dialogue. The algorithm determines that the core reason for the discrepancy is "misalignment of contextual focus".
[0102] An example of a rule consistency check operation is as follows: constraint rule mapping comparison (determining whether there is a "weight interpretation bias"). The meta-review model can compare the decision rules triggered by two review models. If the two decision rules are different, for example, model A triggers the "task completion" rule while model B triggers the "security compliance" rule, the core reason for the algorithm's discrepancy is "weight interpretation bias of multiple constraint conditions".
[0103] For example, an example of the hallucination detection operation is as follows: The meta-review model performs factual reasoning logic, logically reasoning between the citation Q and the prior information of the scene. If a logical contradiction is found between the reasoning chain of a certain model and the objective text, the algorithm outputs the attribution result, determining that the model has produced a "factual judgment hallucination".
[0104] This embodiment provides a variety of progressive attribution operations, which verify semantic similarity, rule consistency and factual logic in different orders, comprehensively covering various causes of disagreement, accurately locating contextual focus misalignment, misunderstanding of multiple constraint weights and factual judgment illusions, providing a reliable basis for subsequent weight adjustments, improving the accuracy of attribution results in high-disagreement scenarios and ensuring the robustness of evaluation conclusions.
[0105] Corresponding to the dynamic evaluation method for agent behavior consistency in the above embodiment, Figure 5 This is a structural block diagram of a dynamic evaluation system for agent behavior consistency provided in one embodiment of this application. For ease of explanation, only the parts relevant to the embodiment of this application are shown. References Figure 5 The intelligent agent behavior consistency dynamic evaluation system 20 includes: an adversarial data construction module 21, an interactive execution module 22, a dynamic scoring module 23, and a meta-arbitration review module 24.
[0106] Among them, the adversarial data construction module 21 is used to obtain the adversarial evaluation dataset; The interaction execution module 22 is used to test the agent through a multi-turn dialogue state machine based on the adversarial evaluation dataset to obtain interaction behavior data; the interaction behavior data is a complete dialogue data consisting of the response text data generated by the agent in response to the adversarial evaluation dataset and the corresponding context fragments. The dynamic scoring module 23 is used to score the interactive behavior data through multiple review models to obtain multi-source review scoring data, and to calculate the scoring divergence degree based on the multi-source review scoring data. The meta-arbitration review module 24 is used to obtain the structured evidence chain output by each review model if the score disagreement degree is greater than the empirical threshold of disagreement degree. Based on each structured evidence chain, the disagreement logic attribution result is obtained through the meta-review model. Based on the disagreement logic attribution result and the evaluation priority matrix, the weight of each review model is adjusted. Based on the adjusted weight of each review model, the multi-source review score data is weighted to obtain a weighted comprehensive score. The structured evidence chain output by each review model includes the review score of the review model, the triggering decision rules, and the context fragments referenced from the interaction behavior data; the meta-review model is an arbitration reasoning engine obtained by structurally reconstructing the input-output paradigm of the large language model; the evaluation priority matrix includes the importance priority of multiple evaluation dimensions, and each review model scores the interaction behavior data based on the evaluation dimensions.
[0107] In one embodiment of this application, the intelligent agent behavior consistency dynamic evaluation system 20 further includes: an undisputed review module, used to: after calculating the score divergence degree based on the multi-source review score data, if the score divergence degree is not greater than the divergence degree empirical threshold, then weight the multi-source review score data based on the historical reliability scores of multiple review models to obtain a weighted comprehensive score. The historical reliability score of each review model is obtained based on the scoring deviation coefficient of the review model on the preset anchor dataset, and the scoring deviation coefficient is negatively correlated with the historical reliability score.
[0108] In one embodiment of this application, the adversarial data construction module 21 is specifically used for: Construct multiple scenario families; each scenario family includes multiple samples covering multiple dimensions; the multiple dimensions include social relationship structure, task pressure, information completeness, and constraints; Standardize each scene family, extract structured data templates from the standardized scene families as seed topologies, perturb the parameters of the seed topologies of each scene family, and generate an initial adversarial sample set. The initial adversarial sample set is subjected to self-consistency testing, and the samples that pass the self-consistency test in the initial adversarial sample set are used as the adversarial evaluation dataset.
[0109] In one embodiment of this application, the interactive execution module 22 is specifically used for: The interaction sandbox environment of the intelligent agent is initialized based on the adversarial evaluation dataset; The multi-turn dialogue state machine is used to drive the agent to complete each round of text interaction in the initialized interaction sandbox environment until the response text data and corresponding context fragments generated by the agent in the latest round meet the offline scoring success conditions or reach the maximum round limit. The response text data generated by the agent in each round and the corresponding context fragments are used as interaction behavior data.
[0110] In one embodiment of this application, the agent behavior consistency dynamic evaluation system 20 further includes: a deviation calibration module, used to: determine the anchor dataset and the standard score corresponding to the anchor dataset before weighting the multi-source review score data based on the historical reliability scores of multiple review models; the anchor dataset includes a standardized benchmark sample set covering multiple evaluation dimensions of agent behavior consistency and the standard score corresponding to the standardized benchmark sample set. Zero-sample testing was performed on each initial review model based on the anchor dataset to obtain the test score of each initial review model on the anchor dataset; The deviation coefficient of each initial review model is calculated based on the test score of the anchor dataset and the standard score corresponding to the anchor dataset. The initial review model is calibrated based on the deviation coefficient of each initial review model to obtain the calibrated review model.
[0111] In one embodiment of this application, the meta-arbitration review module 24 is specifically used for: Based on each structured chain of evidence, the first attribution operation is performed through the meta-review model to obtain the divergence logic attribution result; The first attribution operation includes: The semantic similarity of citations is calculated based on the context fragments referenced in each structured chain of evidence; If the semantic similarity is less than the preset similarity threshold, the logical attribution result of the divergence is determined to be a misalignment of context focus. If the semantic similarity is not less than the similarity threshold, the rule consistency score is calculated based on the trigger judgment rules in each structured evidence chain; if the rule consistency score is less than the preset rule consistency score threshold, the divergence logic attribution result is determined to be a weighted understanding bias of multiple constraints; if the rule consistency score is not less than the rule consistency score threshold, the hallucination detection operation is performed. The hallucination detection operation includes: for each context fragment referenced in the structured evidence chain, using prior information about the scene to perform factual logic reasoning on the context fragment to obtain the reasoning result; if the reasoning result is a logical contradiction, it is determined that the review model corresponding to the structured evidence chain has a factual judgment hallucination, and the factual judgment hallucination is used as the result of the disagreement logic attribution.
[0112] In one embodiment of this application, the meta-arbitration review module 24 is further used for: If the result of the divergent logic attribution is a factual judgment illusion, then the source review model is determined based on the result of the divergent logic attribution, and the weight corresponding to the source review model is reset to 0. If the attribution result of the divergence logic is a misalignment of context focus or a bias in the understanding of the weights of multiple constraints, then the triggering decision rule corresponding to each review model is determined. Based on the evaluation priority matrix and the triggering decision rule corresponding to each review model, the priority of each review model is sorted to obtain a priority sequence. The weight of each review model is then adjusted according to the priority sequence.
[0113] See Figure 6 , Figure 6 This is a schematic block diagram of an electronic device provided according to an embodiment of this application. Figure 6 The electronic device 300 in this embodiment may include one or more processors 301, one or more input devices 302, one or more output devices 303, and one or more memories 304. The processors 301, input devices 302, output devices 303, and memories 304 communicate with each other via a communication bus 305. The memories 304 store computer programs, including program instructions. The processors 301 execute the program instructions stored in the memories 304. Specifically, the processors 301 are configured to invoke the program instructions to perform the functions of each module / unit in the above system embodiments, for example... Figure 5 The functions of the adversarial data construction module 21, interactive execution module 22, dynamic scoring module 23, and meta-arbitration review module 24 are shown.
[0114] It should be understood that, in the embodiments of this application, the processor 301 may be a central processing unit (CPU), or it may be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor may be a microprocessor or any conventional processor.
[0115] Input device 302 may include a touchpad, a fingerprint sensor (for collecting the user's fingerprint information and fingerprint orientation information), a microphone, etc., and output device 303 may include a display (LCD, etc.), a speaker, etc.
[0116] The memory 304 may include read-only memory and random access memory, and provides instructions and data to the processor 301. A portion of the memory 304 may also include non-volatile random access memory. For example, the memory 304 may also store information about an intelligent agent.
[0117] In specific implementations, the processor 301, input device 302, and output device 303 described in the embodiments of this application can execute the implementation method described in the dynamic evaluation method for consistency of intelligent agent behavior provided in the embodiments of this application, or they can execute the implementation method of the electronic device described in the embodiments of this application, which will not be repeated here.
[0118] In another embodiment of this application, a computer-readable storage medium is provided. This computer-readable storage medium stores a computer program, which includes program instructions. When executed by a processor, the program instructions implement all or part of the processes in the methods described above. Alternatively, the computer program can instruct related hardware to complete the process. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. The computer-readable medium can include any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a portable hard drive, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium, etc.
[0119] The computer-readable storage medium can be an internal storage unit of the electronic device in any of the foregoing embodiments, such as a hard disk or memory of the electronic device. The computer-readable storage medium can also be an external storage device of the electronic device, such as a plug-in hard disk, smart media card (SMC), secure digital card (SD), flash card, etc., provided on the electronic device. Furthermore, the computer-readable storage medium can include both internal and external storage units of the electronic device. The computer-readable storage medium is used to store computer programs and other programs and data required by the electronic device. The computer-readable storage medium can also be used to temporarily store data that has been output or will be output.
[0120] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.
[0121] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working process of the electronic devices and units described above can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.
[0122] In the several embodiments provided in this application, it should be understood that the disclosed electronic devices and methods can be implemented in other ways. For example, the system embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple modules or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces or units, or they may be electrical, mechanical, or other forms of connection.
[0123] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of the embodiments of this application, depending on actual needs.
[0124] Furthermore, the functional modules in the various embodiments of this application can be integrated into one processing module, or each module can exist physically separately, or two or more modules can be integrated into one module. The integrated modules described above can be implemented in hardware or as software functional modules.
[0125] The above are merely specific embodiments of this application, but the scope of protection of this application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in this application, and these modifications or substitutions should all be covered within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A method for dynamic evaluation of agent behavior consistency, characterized in that, include: Obtain the adversarial evaluation dataset; Based on the aforementioned adversarial evaluation dataset, the agent is tested through a multi-turn dialogue state machine to obtain interaction behavior data; The interactive behavior data is a complete dialogue data consisting of response text data generated by the agent for the adversarial evaluation dataset and corresponding context fragments. The interaction behavior data is scored by multiple review models to obtain multi-source review score data, and the score divergence degree is calculated based on the multi-source review score data. If the score divergence is greater than the divergence empirical threshold, then the structured evidence chain output by each review model is obtained. Based on each structured evidence chain, the divergence logic attribution result is obtained through the meta-review model. Based on the divergence logic attribution result and the evaluation priority matrix, the weight of each review model is adjusted. Based on the adjusted weights of each review model, the multi-source review score data is weighted to obtain a weighted comprehensive score. The structured evidence chain output by each review model includes the review score of the review model, the triggering judgment rule, and the context fragments referenced from the interaction behavior data; the meta-review model is an arbitration reasoning engine obtained by structurally reconstructing the input-output paradigm of the large language model; the evaluation priority matrix includes the importance priority of multiple evaluation dimensions, and each review model scores the interaction behavior data based on the evaluation dimensions.
2. The method for dynamic evaluation of agent behavior consistency as described in claim 1, characterized in that, After calculating the score divergence based on the multi-source review scoring data, the method further includes: If the score divergence is not greater than the empirical threshold for divergence, the multi-source review score data is weighted based on the historical reliability scores of the multiple review models to obtain a weighted comprehensive score. The historical reliability score of each review model is obtained based on the scoring deviation coefficient of the review model on a preset anchor dataset, and the scoring deviation coefficient is negatively correlated with the historical reliability score.
3. The method for dynamic evaluation of agent behavior consistency as described in claim 1, characterized in that, The acquisition of the adversarial evaluation dataset includes: Construct multiple scenario families; each scenario family includes multiple samples covering multiple dimensions; the multiple dimensions include social relationship structure, task pressure, information completeness, and constraints; Standardize each scene family, extract structured data templates from the standardized scene families as seed topologies, perturb the parameters of the seed topologies of each scene family, and generate an initial adversarial sample set. The initial adversarial sample set is subjected to self-consistency testing, and the samples that pass the self-consistency test in the initial adversarial sample set are used as the adversarial evaluation dataset.
4. The method for dynamic evaluation of agent behavior consistency as described in claim 1, characterized in that, The interaction behavior data obtained by testing the agent through a multi-turn dialogue state machine based on the adversarial evaluation dataset includes: The interaction sandbox environment of the intelligent agent is initialized based on the aforementioned adversarial evaluation dataset; The multi-turn dialogue state machine is used to drive the agent to complete each round of text interaction in the initialized interaction sandbox environment until the response text data and corresponding context fragments generated by the agent in the latest round meet the offline scoring success conditions or reach the maximum round limit. The response text data generated by the agent in each round and the corresponding context fragments are used as the interaction behavior data.
5. The method for dynamic evaluation of agent behavior consistency as described in claim 1, characterized in that, Before weighting the multi-source review score data based on the historical reliability scores of the multiple review models, the method further includes: Determine the anchor dataset and the standard score corresponding to the anchor dataset; the anchor dataset includes a standardized benchmark sample set covering multiple evaluation dimensions of agent behavior consistency and the standard score corresponding to the standardized benchmark sample set; Based on the anchor dataset, a zero-sample test is performed on each initial review model to obtain the test score of each initial review model on the anchor dataset; The deviation coefficient of each initial review model is calculated based on the test score of the anchor dataset and the standard score corresponding to the anchor dataset. The initial review model is calibrated based on the deviation coefficient of each initial review model to obtain the calibrated review model.
6. The method for dynamic evaluation of agent behavior consistency as described in claim 1, characterized in that, The process of obtaining the attribution results of the divergence logic based on each of the structured evidence chains through the meta-review model includes: Based on each of the structured evidence chains, the first attribution operation is performed through the meta-review model to obtain the divergence logic attribution result; The first attribution operation includes: The semantic similarity of citations is calculated based on the context fragments referenced in each of the structured chains of evidence. If the semantic similarity is less than a preset similarity threshold, then the divergence logic attribution result is determined to be a context focus misalignment; If the semantic similarity is not less than the similarity threshold, then the rule consistency score is calculated based on the trigger judgment rule in each of the structured evidence chains; if the rule consistency score is less than the preset rule consistency score threshold, then the divergence logic attribution result is determined to be a weighted understanding bias of multiple constraints; if the rule consistency score is not less than the rule consistency score threshold, then the hallucination detection operation is performed. The hallucination detection operation includes: for each context fragment referenced in the structured evidence chain, performing factual logic reasoning on the context fragment using prior scene information to obtain a reasoning result; if the reasoning result is a logical contradiction, then determining that the review model corresponding to the structured evidence chain has a factual judgment hallucination, and using the factual judgment hallucination as the attribution result of the divergence logic.
7. The method for dynamic evaluation of agent behavior consistency as described in claim 6, characterized in that, The adjustment of the weights for each review model based on the divergence logic attribution results and the evaluation priority matrix includes: If the attribution result of the divergence logic is a factual judgment illusion, then the source review model is determined based on the attribution result of the divergence logic, and the weight corresponding to the source review model is reset to 0. If the attribution result of the divergence logic is a misalignment of context focus or a bias in the understanding of the weights of multiple constraints, then the triggering judgment rule corresponding to each review model is determined, and each review model is prioritized according to the evaluation priority matrix and the triggering judgment rule corresponding to each review model to obtain a priority sequence. The weight of each review model is then adjusted according to the priority sequence.
8. A dynamic evaluation system for the consistency of agent behavior, characterized in that, include: The adversarial data construction module is used to obtain adversarial evaluation datasets; The interaction execution module is used to test the agent through a multi-turn dialogue state machine based on the adversarial evaluation dataset to obtain interaction behavior data; the interaction behavior data is a complete dialogue data consisting of the response text data generated by the agent in response to the adversarial evaluation dataset and the corresponding context fragments. The dynamic scoring module is used to score the interactive behavior data using multiple review models to obtain multi-source review scoring data, and to calculate the scoring divergence degree based on the multi-source review scoring data. The meta-arbitration review module is used to obtain the structured evidence chain output by each review model if the score disagreement degree is greater than the disagreement degree empirical threshold; obtain the disagreement logic attribution result through the meta-review model based on each structured evidence chain; adjust the weight of each review model based on the disagreement logic attribution result and the evaluation priority matrix; and weight the multi-source review score data based on the adjusted weight of each review model to obtain a weighted comprehensive score. The structured evidence chain output by each review model includes the review score of the review model, the triggering judgment rule, and the context fragments referenced from the interaction behavior data; the meta-review model is an arbitration reasoning engine obtained by structurally reconstructing the input-output paradigm of the large language model; the evaluation priority matrix includes the importance priority of multiple evaluation dimensions, and each review model scores the interaction behavior data based on the evaluation dimensions.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and running on the processor, characterized in that, When the processor executes the computer program, it implements the steps of the method as described in any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, characterized in that, When the computer program is executed by a processor, it implements the steps of the method as described in any one of claims 1 to 7.