AI agent output credibility evaluation and source tracing system

The AI agent output credibility assessment and source tracing system addresses two problems: agent output results are insufficiently evaluated, and the correspondence between conclusions and process information is unstable. It aligns conclusion units with the reasoning process and with the credibility assessment, making the assessment more targeted and traceable.

CN122240513A · Pending · Publication Date: 2026-06-19 · JIANGSU RUNYILIAN INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
JIANGSU RUNYILIAN INFORMATION TECH CO LTD
Filing Date
2026-05-22
Publication Date
2026-06-19

Abstract

This invention relates to the field of artificial intelligence and information processing technology, and discloses an AI agent output credibility assessment and source tracing system, comprising: a task aggregation module for forming a unified assessment object set; a type determination module for determining task type labels and task-aware credibility contracts; a result decomposition module for obtaining a conclusion unit set and an inference chain snapshot node set; an alignment and coverage module for determining alignment relationships and node coverage; a multi-dimensional assessment module for determining consistency scores, evidence coverage scores, event chain consistency scores, and a set of questionable segments; a credibility fusion module for determining uncertainty scores, source quality scores, basic credibility scores, conclusion-level credibility, and result-level credibility; and a source tracing and verification module for constructing a hierarchical source tracing tree, generating source tracing record packages, and determining a list of key verification nodes. This invention enables the quantifiable determination of the credibility of the agent's output results and the location-based verification of its formation process.

Description

Technical Field

[0001] This invention belongs to the field of artificial intelligence and information processing technology, specifically relating to a system for evaluating the credibility of AI agent output results and tracing its origins.

Background Technology

[0002] As intelligent processing systems designed for complex tasks proliferate, intelligent agents are no longer limited to single-round text generation. Instead, they combine task request information, invoke external tools, retrieve external evidence, and gradually generate output results based on the execution process. In practical applications, systems often need to provide not only conclusions but also explanations of which intermediate processing steps led to those conclusions, which evidence was cited, and whether they are consistent with business event records. In other words, simply focusing on the final output is no longer sufficient to meet the requirements of reliability and process traceability in complex scenarios.

[0003] Current practice follows two main approaches: one scores the overall output of the agent directly, and the other retains execution logs or retrieval records for manual review. Both have shortcomings. The former typically provides only an overall verification result, making it difficult to pinpoint which specific conclusion is problematic. The latter preserves process information but lacks a unified organization of task request information, tool call records, retrieval evidence sets, and business event sets, making it difficult to establish a stable correspondence between conclusion content and process nodes, evidence items, and business event records. Furthermore, applying the same evaluation criteria to different tasks such as fact query, reasoning analysis, and creative generation easily leads to inconsistent judgment standards.

Summary of the Invention

[0004] This invention provides a system for evaluating the credibility of AI agent output results and for tracing the source of information, thus solving the technical problems in the background art.

[0005] This invention provides a system for evaluating the credibility of AI agent output results and tracing its origins, including:

[0006] The task aggregation module is used to obtain the task request information, agent output results, execution process logs, tool call records, retrieval evidence set and business event set corresponding to the task to be evaluated, and to associate and organize them to obtain a unified set of evaluation objects;

[0007] The type determination module is used to determine the task type based on the task request information, obtain the task type label, and determine the task-aware credibility contract based on the task type label;

[0008] The result decomposition module is used to segment the agent output results to obtain a set of conclusion units, and to obtain a set of inference chain snapshot nodes based on the execution process log, tool call records, and retrieval evidence set;

[0009] The alignment and coverage module is used to determine alignment relationships and node coverage based on the set of conclusion units, the set of inference chain snapshot nodes, and the task-aware credibility contract.

[0010] The multidimensional evaluation module is used to determine the consistency score, evidence coverage score, event chain consistency score, and set of questionable fragments based on the conclusion unit set, alignment relationship, retrieval evidence set, and business event set.

[0011] The credibility fusion module is used to determine the uncertainty score, source quality score, basic credibility score, conclusion-level credibility, and result-level credibility based on the consistency score, evidence coverage score, event chain consistency score, node coverage, and retrieval evidence set.

[0012] The source tracing and verification module is used to construct a hierarchical source tracing tree based on conclusion-level credibility, result-level credibility, alignment relationship, inference chain snapshot node set, retrieval evidence set, business event set, and questionable fragment set, generate source tracing record package, and determine the list of key verification nodes.

[0013] The beneficial effects of this invention are as follows. Based on task request information, agent output results, execution process logs, tool call records, retrieval evidence sets, and business event sets, this invention establishes a unified evaluation object set. Within the same task boundary, it completes task type determination, conclusion unit segmentation, inference chain snapshot node construction, alignment relationship determination, multi-dimensional credibility calculation, and hierarchical source tracing. Compared with methods that only provide an overall score, this invention refines the credibility results down to specific conclusion units and establishes a correspondence between those units and the inference process, evidence items, and business event records, so that low-credibility content has a clear location path. In addition, a task-aware credibility contract is introduced for each task type, keeping the evaluation criteria consistent with task attributes. Furthermore, by combining node coverage, evidence coverage scores, and event chain consistency scores, the judgment of a result relies not only on the text output itself but also on process support and business closed-loop conditions. This makes the evaluation more targeted and facilitates subsequent review and tracing.

Attached Figure Description

[0014] Figure 1 is a schematic module diagram of the AI agent output credibility assessment and source tracing system of the present invention.

Detailed Implementation

[0015] The subject matter described herein will now be discussed with reference to exemplary embodiments. It should be understood that these embodiments are discussed only to enable those skilled in the art to better understand and implement the subject matter described herein, and changes may be made to the function and arrangement of the elements discussed without departing from the scope of this specification. Various processes or components may be omitted, substituted, or added as needed in the examples. Furthermore, features described in some examples may be combined in other examples.

[0016] It should be noted that, unless otherwise defined, the technical or scientific terms used in one or more embodiments of the present invention should have the ordinary meaning understood by one of ordinary skill in the art to which this invention pertains. The terms "first," "second," and similar terms used in one or more embodiments of the present invention do not indicate any order, quantity, or importance, but are merely used to distinguish different components. Terms such as "comprising" or "including" mean that the element or object preceding the word encompasses the elements or objects listed after the word and their equivalents, without excluding other elements or objects. Terms such as "connected" or "linked" are not limited to physical or mechanical connections, but can include electrical connections, whether direct or indirect. Terms such as "upper," "lower," "left," and "right" are used only to indicate relative positional relationships; when the absolute position of the described object changes, the relative positional relationship may also change accordingly.

[0017] As shown in Figure 1, the AI agent output credibility assessment and source tracing system includes:

[0018] The task aggregation module is used to obtain the task request information, agent output results, execution process logs, tool call records, retrieval evidence set and business event set corresponding to the task to be evaluated, and to associate and organize them to obtain a unified set of evaluation objects;

[0019] The type determination module is used to determine the task type based on the task request information, obtain the task type label, and determine the task-aware credibility contract based on the task type label;

[0020] The result decomposition module is used to segment the agent output results to obtain a set of conclusion units, and to obtain a set of inference chain snapshot nodes based on the execution process log, tool call records, and retrieval evidence set;

[0021] The alignment and coverage module is used to determine alignment relationships and node coverage based on the set of conclusion units, the set of inference chain snapshot nodes, and the task-aware credibility contract.

[0022] The multidimensional evaluation module is used to determine the consistency score, evidence coverage score, event chain consistency score, and set of questionable fragments based on the conclusion unit set, alignment relationship, retrieval evidence set, and business event set.

[0023] The credibility fusion module is used to determine the uncertainty score, source quality score, basic credibility score, conclusion-level credibility, and result-level credibility based on the consistency score, evidence coverage score, event chain consistency score, node coverage, and retrieval evidence set.

[0024] The source tracing and verification module is used to construct a hierarchical source tracing tree based on conclusion-level credibility, result-level credibility, alignment relationship, inference chain snapshot node set, retrieval evidence set, business event set, and questionable fragment set, generate source tracing record package, and determine the list of key verification nodes.

[0025] In one embodiment of the present invention, the system first acquires the task request information, agent output results, execution process logs, tool call records, retrieval evidence set, and business event set corresponding to the task to be evaluated. It then associates and organizes this data to obtain a unified evaluation object set. The agent output results are the results the AI agent generates from the task request information after inference processing, tool calls, and evidence retrieval. The execution process logs are the step-level processing records formed by the AI agent during task execution. The unified evaluation object set is a structured data set formed around a single task to be evaluated; it stores not only the original records but also the hierarchical relationships, temporal order, and associations between them. As a result, subsequent task type determination, conclusion unit segmentation, credibility calculation, and source tracing all rest on the same data foundation.

[0026] Specifically, in step 11, the system writes corresponding record type identifiers to each record in the task request information, agent output results, execution process logs, tool call records, retrieval evidence set, and business event set, and writes the same unified task identifier to each record. The record type identifier is a field used to distinguish record categories, and the unified task identifier is a unique marker used to indicate the affiliation of the same task to be evaluated.

[0027] It should be noted that data from different sources often have different formats and names in their raw state. If a record type identifier and a unified task identifier are not written first, the record attribution may become mixed later. In other words, even if task request information, agent output results, and tool call records come from different modules, as long as they belong to the same task to be evaluated, they should all be written with the same unified task identifier.
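Step 11 can be illustrated with a minimal Python sketch. The record type identifier values, the `TaggedRecord` class, and the `tag_records` function are hypothetical names chosen for illustration; the text names the six record categories but does not prescribe concrete identifiers or a storage layout.

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical record type identifiers, one per record category named in the text.
RECORD_TYPES = (
    "task_request", "agent_output", "execution_log",
    "tool_call", "retrieval_evidence", "business_event",
)


@dataclass
class TaggedRecord:
    task_id: str             # unified task identifier shared by all records of one task
    record_type: str         # record type identifier
    payload: dict[str, Any]  # original record content, left untouched


def tag_records(task_id: str, sources: dict[str, list[dict]]) -> list[TaggedRecord]:
    """Write the record type identifier and the shared task identifier to every record."""
    tagged = []
    for record_type, records in sources.items():
        if record_type not in RECORD_TYPES:
            raise ValueError(f"unknown record type: {record_type}")
        for payload in records:
            tagged.append(TaggedRecord(task_id, record_type, payload))
    return tagged
```

Records that originate in different modules end up carrying the same `task_id`, which is the property paragraph [0027] relies on to keep record attribution from becoming mixed.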

[0028] In step 12, after the unified task identifier has been written, the system extracts the original time information of each record, converts it into a unified time format, and sorts the records by unified time. When two records have the same unified time, their order is determined by the record type identifier and the original record order. The result is the sequence index of each record. The unified time format refers to the standardized time field, and the sequence index refers to the position of the record on the unified timeline.

[0029] Through the above processing, all types of records are organized onto the same timeline. If a tool call record and an execution process log have the same time value, their positions are determined based on the record type identifier and the original record order, thus ensuring the temporal consistency during subsequent inference process reconstruction.
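The ordering in step 12 can be sketched as follows. This is only an illustration under stated assumptions: the original time information is assumed to be an ISO 8601 string, UTC is assumed as the unified time format, values without a time zone are assumed to be UTC, and the `TYPE_PRIORITY` ranking is invented, since the text says ties are resolved by record type identifier and original order without giving a ranking.

```python
from datetime import datetime, timezone

# Assumed tie-break ranking between record types (not specified in the text).
TYPE_PRIORITY = {
    "task_request": 0, "execution_log": 1, "tool_call": 2,
    "retrieval_evidence": 3, "agent_output": 4, "business_event": 5,
}


def to_unified_time(raw: str) -> datetime:
    """Parse an ISO 8601 string and normalise it to UTC (naive values are assumed UTC)."""
    parsed = datetime.fromisoformat(raw)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def assign_sequence_indexes(records: list[dict]) -> list[dict]:
    """Sort by unified time, then by record type, then by original position."""
    keyed = [
        (to_unified_time(r["time"]), TYPE_PRIORITY[r["record_type"]], pos, r)
        for pos, r in enumerate(records)
    ]
    # Only the first three elements are compared, so record dicts are never compared.
    keyed.sort(key=lambda item: item[:3])
    return [
        {**r, "unified_time": t.isoformat(), "sequence_index": i}
        for i, (t, _, _, r) in enumerate(keyed)
    ]
```

Including the original position in the sort key makes the order fully deterministic, so reconstructing the timeline twice from the same records gives the same sequence indexes.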

[0030] In step 13, the system establishes an association mapping based on four kinds of relationships: the reference relationships between records; the generation relationships linking tool call records to execution process logs and agent output results; the supporting relationships linking the retrieval evidence set to execution process logs and agent output results; and the triggering relationships linking agent output results, tool call records, and execution process logs to the business event set. The system then writes each record carrying the unified task identifier, together with its sequence index and the association mapping, into the unified evaluation object set. A reference relationship is one in which a record directly references the content or identifier of another record; a generation relationship is the source relationship by which a tool call result enters subsequent processing or output content; a supporting relationship is the basis relationship between an evidence item and conclusion content; and a triggering relationship is the linkage between output content or an intermediate processing action and a business event.

[0031] For example, if a piece of retrieval evidence is used to form a judgment about the cause of an anomaly, and that judgment triggers the generation of a work order record, then a supporting relationship exists between the retrieval evidence and the judgment, and a triggering relationship exists between the judgment and the work order record. In this way, the unified evaluation object set preserves not only the records themselves but also relationship paths that can be followed from one record to the next.
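One way to picture the association mapping of step 13 is as a set of labelled, directed edges between record identifiers. The sketch below is an assumption about representation, not the structure used by the invention; the class name `AssociationMap`, its methods, and the record identifiers in the usage example are hypothetical.

```python
from collections import defaultdict

# The four relationship kinds named in the text.
RELATIONS = {"reference", "generation", "support", "trigger"}


class AssociationMap:
    """Directed, labelled edges between record identifiers."""

    def __init__(self) -> None:
        self._edges: dict[str, list[tuple[str, str]]] = defaultdict(list)

    def add(self, source: str, relation: str, target: str) -> None:
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self._edges[source].append((relation, target))

    def paths_from(self, start: str) -> list[list[str]]:
        """Return every relationship path from `start` to a record with no further edge."""
        paths = []
        stack = [(start, [start])]
        while stack:
            node, path = stack.pop()
            # Skip targets already on the path so that cycles cannot loop forever.
            outgoing = [(rel, tgt) for rel, tgt in self._edges.get(node, []) if tgt not in path]
            if not outgoing:
                paths.append(path)
            for rel, tgt in outgoing:
                stack.append((tgt, path + [f"--{rel}-->", tgt]))
        return paths
```

With the work order example from paragraph [0031], adding `("evidence-1", "support", "judgment-1")` and `("judgment-1", "trigger", "work-order-1")` lets `paths_from("evidence-1")` return the single path that runs from the evidence through the judgment to the work order.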

[0032] Through the above steps, the system organizes the previously scattered task request information, agent output results, execution process logs, tool call records, retrieval evidence sets, and business event sets into a unified evaluation object set. This provides a unified data entry point for subsequent credibility assessment and preserves a clear relationship chain for subsequent source tracing, ensuring a stable correspondence between conclusions, reasoning processes, evidence sources, and business events.

[0033] In one embodiment of the present invention, after obtaining the unified evaluation object set, the system determines the task type based on the task request information to obtain a task type label, and determines the task-aware credibility contract based on that label. The task type label is the classification result for the processing category to which the current task to be evaluated belongs. The task-aware credibility contract is the set of evaluation constraint information corresponding to that classification result, including evaluation dimension weights, a necessary evidence type set, a necessary inference node type set, a node coverage threshold, and a closed-loop verification threshold. As a result, the matching range between the conclusion unit set and the inference chain snapshot node set, the evidence verification range, and the credibility fusion method are all determined under a single task type.

[0034] Specifically, in step 21, the system extracts task request information from a unified evaluation object set and performs semantic analysis and classification processing on the task request information. The task types include fact query, reasoning analysis, and creative generation. Fact query refers to a task type centered on objective fact confirmation, attribute retrieval, and status judgment; reasoning analysis refers to a task type centered on cause analysis, relationship judgment, process induction, and solution judgment; and creative generation refers to a task type centered on text generation, content expansion, and expression reconstruction.

[0035] During processing, the system determines the matching degree between the task request information and each of fact query, reasoning analysis, and creative generation. The matching degree is the extent to which the task request information agrees with the corresponding task type in semantic content, task objective, and form of expression. In other words, the system does not classify on a single keyword; it combines the factual items, analytical items, and generative items contained in the task request information to produce a matching result for each task type. For example, when the task request information mainly asks whether an event has occurred or what attributes an object has, the system raises the matching degree for fact query; when it asks for a comparison of the influence relationships among multiple factors, the system raises the matching degree for reasoning analysis.

[0036] In step 22, the system compares the matching degrees for fact query, reasoning analysis, and creative generation and takes the task type with the highest matching degree as the task type label. The task type label is thus the single task type determined for the current task request information.

[0037] Through the above processing, task request information is converged to a single task type label, instead of retaining multiple parallel labels simultaneously. This ensures a single reading path when extracting parameters from the preset task type contract template, preventing the cross-application of evaluation rules for different task types. For example, if the same task is determined to be a fact query, the subsequent evaluation process will emphasize evidence correspondence and factual support; if it is determined to be a reasoning analysis, the subsequent evaluation process will emphasize reasoning node coverage and event chain verification.
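The selection in step 22 amounts to taking the maximum over three matching degrees. The sketch below assumes the matching degrees have already been produced by the semantic analysis of step 21, which is not reproduced here; the identifier strings and the function name `select_task_type` are hypothetical, and the tie-breaking rule is an assumption because the text does not say what happens when two matching degrees are equal.

```python
# Hypothetical identifiers for the three task types named in the text.
TASK_TYPES = ("fact_query", "reasoning_analysis", "creative_generation")


def select_task_type(matching: dict[str, float]) -> str:
    """Return the single task type label with the highest matching degree."""
    missing = [t for t in TASK_TYPES if t not in matching]
    if missing:
        raise ValueError(f"missing matching degree for: {missing}")
    # max() returns the first maximal item, so ties fall back to TASK_TYPES order
    # (an assumed rule; the text does not define tie handling).
    return max(TASK_TYPES, key=lambda t: matching[t])
```

Returning exactly one label, rather than a ranked list, mirrors the single reading path described in paragraph [0037].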

[0038] In step 23, the system extracts the corresponding evaluation dimension weights, necessary evidence type sets, necessary inference node type sets, node coverage thresholds, and closed-loop verification thresholds from the preset task type contract template based on the task type label, forming a task-aware credibility contract. The task type label and the task-aware credibility contract are then written into a unified evaluation object set. The evaluation dimension weights refer to a set of weight parameters used to constrain the participation degree of different credibility dimensions; the necessary evidence type set refers to the range of evidence categories that must be present under the current task type; the necessary inference node type set refers to the range of node types that should appear in the inference chain under the current task type; the node coverage threshold refers to the minimum requirement that the conclusion unit must meet when covering the necessary inference node type set; and the closed-loop verification threshold refers to the minimum requirement that must be met when performing a closed-loop comparison between business event records and inference results.

[0039] It should be noted that the preset task type contract template is not a universal, fixed template, but rather a set of templates established for different task types. After determining the task type label in step 22, the system extracts parameters from the corresponding template and assembles them into a task-aware credibility contract. Through the above processing, the unified evaluation object set not only saves the classification results of task request information, but also simultaneously saves the evaluation boundaries corresponding to the classification results. This ensures that subsequent alignment relationship determination, node coverage calculation, and multi-dimensional score fusion maintain consistent constraints, and also keeps the credibility evaluation paths under different task types distinct.
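The per-type template lookup of step 23 might be sketched as below. Every numeric value, dimension name, and evidence type in `CONTRACT_TEMPLATES` is a placeholder invented for illustration; the text lists the five kinds of parameters a contract holds but gives no values. The node type sets for fact query and reasoning analysis follow the emphasis the description places on retrieval, reasoning, and synthesis actions, while the set for creative generation is a pure assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CredibilityContract:
    task_type: str
    dimension_weights: dict            # evaluation dimension weights
    necessary_evidence_types: frozenset
    necessary_node_types: frozenset
    node_coverage_threshold: float
    closed_loop_threshold: float       # closed-loop verification threshold


# One template per task type; all values are illustrative placeholders.
CONTRACT_TEMPLATES = {
    "fact_query": {
        "dimension_weights": {"consistency": 0.3, "evidence_coverage": 0.5, "event_chain": 0.2},
        "necessary_evidence_types": frozenset({"retrieved_document"}),
        "necessary_node_types": frozenset({"retrieval", "synthesis"}),
        "node_coverage_threshold": 0.8,
        "closed_loop_threshold": 0.5,
    },
    "reasoning_analysis": {
        "dimension_weights": {"consistency": 0.3, "evidence_coverage": 0.3, "event_chain": 0.4},
        "necessary_evidence_types": frozenset({"retrieved_document", "tool_result"}),
        "necessary_node_types": frozenset({"reasoning", "retrieval", "synthesis"}),
        "node_coverage_threshold": 0.9,
        "closed_loop_threshold": 0.7,
    },
    "creative_generation": {
        "dimension_weights": {"consistency": 0.6, "evidence_coverage": 0.2, "event_chain": 0.2},
        "necessary_evidence_types": frozenset(),
        "necessary_node_types": frozenset({"synthesis"}),
        "node_coverage_threshold": 0.5,
        "closed_loop_threshold": 0.3,
    },
}


def build_contract(task_type_label: str) -> CredibilityContract:
    """Assemble the task-aware credibility contract from the template for one label."""
    try:
        template = CONTRACT_TEMPLATES[task_type_label]
    except KeyError:
        raise ValueError(f"no contract template for: {task_type_label}") from None
    return CredibilityContract(task_type=task_type_label, **template)
```

Because the lookup is keyed by the single label from step 22, the parameters of two task types cannot be mixed within one contract.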

[0040] After the above steps, the system has converted the task request information into a task type label, formed the corresponding task-aware credibility contract, and written both back to the unified evaluation object set. The unified evaluation object set is therefore no longer just a compilation of the original records; it also contains the task classification result and the evaluation constraint information. Subsequent operations, such as conclusion unit segmentation, inference chain snapshot node filtering, and the fusion of consistency scores, evidence coverage scores, and event chain consistency scores, can invoke this task-aware credibility contract directly, so that the entire credibility assessment and source tracing process proceeds within the same task semantic boundary.

[0041] In one embodiment of the present invention, after determining the task type and writing the task type label into the unified evaluation object set, the system segments the agent output results to obtain a set of conclusion units; at the same time, it obtains a set of inference chain snapshot nodes based on the execution process log, tool call records, and retrieval evidence set. The set of conclusion units is the set of independently verifiable semantic units formed by splitting the agent output results. The set of inference chain snapshot nodes is the set of process nodes formed by structurally representing the reasoning, tool call, retrieval, and synthesis actions in the AI agent's execution process. In this way, the result content represented by the agent output results and the formation path represented by the execution process are organized into two kinds of structured objects, around which subsequent alignment relationship determination, node coverage calculation, credibility assessment, and source tracing are carried out.

[0042] Specifically, in step 31, the system extracts the agent output results and the task type label from the unified evaluation object set and segments the agent output results according to the segmentation boundary corresponding to the task type label, obtaining a set of conclusion units arranged in their original order and determining the importance level of each conclusion unit. The fact assertion boundary, which corresponds to fact query, is the complete assertion boundary formed around objective facts, attribute states, or search results; the reasoning proposition boundary, which corresponds to reasoning analysis, is the complete proposition boundary formed around causal relationships, conditional relationships, or judgment chains; and the semantically complete expression boundary, which corresponds to creative generation, is the semantic boundary formed around a single complete expressive intention.

[0043] In other words, the system does not apply one uniform sentence segmentation method to all agent output results. Instead, it first determines the segmentation criterion from the task type label and then segments accordingly. If the task type label indicates fact query, the system segments the agent output results along the fact assertion boundary; if it indicates reasoning analysis, along the reasoning proposition boundary; and if it indicates creative generation, along the semantically complete expression boundary. This keeps the boundaries of the conclusion unit set stable and brings the conclusion units formed under each task type closer to the needs of subsequent verification.

[0044] The importance level refers to the relative weight of the conclusion unit in the overall agent output, representing the degree of participation of different conclusion units in the subsequent aggregation of overall credibility. For example, when the agent output includes both core judgments and supplementary explanations, the system can assign a higher importance level to the conclusion unit corresponding to the core judgment and a lower importance level to the conclusion unit corresponding to the supplementary explanation.

[0045] In step 32, the system extracts the execution process logs, tool call records, and retrieval evidence set from the unified evaluation object set. It then converts reasoning actions, tool call actions, retrieval actions, and synthesis actions into inference chain snapshot nodes in execution order, and writes to each inference chain snapshot node its step type, node input, node output, associated evidence, tool call information, timestamp, and anomaly marker, yielding the set of inference chain snapshot nodes. A reasoning action is a processing action by which the agent forms a judgment, analysis, or interpretation during intermediate processing; a tool call action is a processing action by which the agent calls an external tool and receives the returned result; a retrieval action is a processing action by which the agent extracts evidence content from external information sources; and a synthesis action is a processing action by which the agent summarizes and organizes multiple intermediate results into a staged output.

[0046] The step type indicates the action category to which the inference chain snapshot node belongs; the node input indicates the processing content entering the node; the node output indicates the processing result generated by the node; the associated evidence indicates the evidence items associated with the node; the tool call information indicates the calling object, calling parameters, and calling result; the timestamp indicates the time the node occurred; and the anomaly marker indicates whether there are any anomalies during the execution of the node. Through the above processing, the execution process log, tool call record, and retrieved evidence set no longer exist in a scattered record form, but are uniformly organized into a set of inference chain snapshot nodes.

[0047] It should be noted that the execution order of the inference chain snapshot node set remains unchanged. Therefore, when establishing the correspondence between conclusion units and inference chain snapshot nodes, both content correspondence and order correspondence can be used simultaneously. For example, if a certain inference chain snapshot node completes evidence retrieval first, and a subsequent inference chain snapshot node forms a judgment based on that evidence, then this sequence will be completely preserved in the inference chain snapshot node set.
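The node structure of step 32 can be sketched as a small data class carrying the seven fields listed in paragraph [0046]. The input dictionary keys (`input`, `output`, `evidence`, `tool`, `time`, `error`) and the function name `build_snapshot_nodes` are hypothetical; the text does not define the raw layout of execution process logs or tool call records, so the conversion below only illustrates the idea of mapping ordered actions to ordered nodes.

```python
from dataclasses import dataclass, field
from typing import Optional

# The four action categories named in the text, used as step types.
STEP_TYPES = {"reasoning", "tool_call", "retrieval", "synthesis"}


@dataclass
class SnapshotNode:
    step_type: str
    node_input: str
    node_output: str
    associated_evidence: list[str] = field(default_factory=list)
    tool_call_info: Optional[dict] = None  # calling object, parameters, result
    timestamp: str = ""
    anomaly: bool = False                  # anomaly marker


def build_snapshot_nodes(actions: list[dict]) -> list[SnapshotNode]:
    """Convert actions, already in execution order, into nodes in the same order."""
    nodes = []
    for action in actions:
        if action["step_type"] not in STEP_TYPES:
            raise ValueError(f"unknown step type: {action['step_type']}")
        nodes.append(SnapshotNode(
            step_type=action["step_type"],
            node_input=action.get("input", ""),
            node_output=action.get("output", ""),
            associated_evidence=list(action.get("evidence", [])),
            tool_call_info=action.get("tool"),
            timestamp=action.get("time", ""),
            anomaly=bool(action.get("error")),
        ))
    return nodes
```

The list is never reordered, so a retrieval node followed by the reasoning node that uses its evidence keeps that sequence, as paragraph [0047] requires.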

[0048] In step 33, the system divides the importance level of each conclusion unit by the sum of the importance levels of all conclusion units to obtain the conclusion unit weight of each conclusion unit; it then applies irreversible encoding to the unified task identifier, step type, set of associated evidence identifiers, and node output summary, taken in a preset order, to obtain the node fingerprint of each inference chain snapshot node; finally, it writes the conclusion unit set, conclusion unit weights, inference chain snapshot node set, and node fingerprints into the unified evaluation object set. The conclusion unit weight is the weight of each conclusion unit in the aggregation of the overall output credibility. Because the weights are normalized by the total, they sum to 1, so that different conclusion units are converted on the same scale when the result-level credibility is calculated.

[0049] The node fingerprint refers to a unique identifier formed by performing irreversible encoding on the key fields of the inference chain snapshot node. These key fields include a unified task identifier, step type, a set of associated evidence identifiers, and a node output summary. Through this processing, different inference chain snapshot nodes can be stably distinguished, allowing for direct node location based on the node fingerprint during subsequent hierarchical tracing tree construction and tracing record package generation.
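Both calculations in step 33 are short enough to sketch directly. The weight of a conclusion unit is its importance level divided by the sum of all importance levels. For the node fingerprint, the text requires only an irreversible encoding of four fields in a preset order; the use of SHA-256, the field separator, and the sorting of evidence identifiers are assumptions made for this illustration, and the function names are hypothetical.

```python
import hashlib


def conclusion_unit_weights(importance: list[float]) -> list[float]:
    """Normalise importance levels so that the resulting weights sum to 1."""
    total = sum(importance)
    if total <= 0:
        raise ValueError("importance levels must have a positive sum")
    return [level / total for level in importance]


def node_fingerprint(task_id: str, step_type: str,
                     evidence_ids: list[str], output_summary: str) -> str:
    """Hash the key fields of a snapshot node in a fixed order.

    SHA-256 stands in for the unspecified irreversible encoding. Evidence
    identifiers are sorted so that the same set always yields the same digest,
    and the ASCII unit separator keeps adjacent fields from running together.
    """
    parts = [task_id, step_type, ",".join(sorted(evidence_ids)), output_summary]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()
```

For importance levels `[3, 1]` the weights are `[0.75, 0.25]`. Two nodes that differ in any key field, for example in step type, receive different fingerprints, which is what allows a node to be located by fingerprint during source tracing.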

[0050] Through the above steps, the system obtains both the set of conclusion units representing the result content and their weights, as well as the set of inference chain snapshot nodes representing the generation path and their node fingerprints, and writes these objects back to a unified evaluation object set. This provides both result-side and process-side objects for subsequent alignment relationship determination, and a stable node index foundation for subsequent source tracing, thus establishing the credibility evaluation of the agent's output and the formation path tracing on the same structured object system.

[0051] In one embodiment of the present invention, after forming a set of conclusion units, a set of inference chain snapshot nodes, and a task-aware credibility contract, the system further determines alignment relationships and node coverage based on these three elements. Here, alignment relationship refers to whether there is a corresponding relationship between the conclusion units and the inference chain snapshot nodes that can be used for subsequent verification and tracking; node coverage refers to the degree to which a conclusion unit covers the set of necessary inference node types required by the task-aware credibility contract. Through this process, the result-side structure and process-side structure obtained in the previous step are truly connected, and subsequent evidence coverage assessment, event chain consistency assessment, and conclusion-level credibility calculation can all be carried out on this connection.

[0052] Specifically, in step 41, the system extracts a set of conclusion units, a set of inference chain snapshot nodes, and a task-aware credibility contract from the unified evaluation object set, and extracts a set of necessary inference node types and a node coverage threshold from the task-aware credibility contract. The set of necessary inference node types refers to the range of node types that constitute an effective inference process under the current task type; the node coverage threshold refers to the minimum proportion that the conclusion units should achieve when covering the set of necessary inference node types.

[0053] It should be noted that the task-aware credibility contract is different for different task types, and therefore the set of necessary inference node types is also different. For example, for fact query tasks, the system pays more attention to whether the retrieval action and the synthesis action occur; for reasoning and analysis tasks, the system pays more attention to whether a complete chain is formed between the reasoning action, the retrieval action, and the synthesis action.

[0054] In step 42, the system determines whether the conclusion unit and the node output satisfy a semantic correspondence, whether the conclusion unit and the associated evidence satisfy an evidence citation correspondence, and whether the conclusion unit and the entity set in the inference chain snapshot node satisfy an equality or inclusion relationship. When at least one of the above three determinations is true, the conclusion unit and the inference chain snapshot node are determined to be aligned; otherwise, they are determined to be unaligned.

[0055] Here, semantic correspondence means that the core judgment expressed by the conclusion unit is semantically consistent with the processing result represented by the node output; evidence citation correspondence means that the evidence cited in the conclusion unit can be found in the associated evidence of the inference chain snapshot node; and the entity set refers to the set of object names, attribute names, event names, or state names extracted from the conclusion unit and from the inference chain snapshot node, respectively. When the two entity sets satisfy an equality or inclusion relationship, the conclusion unit and the inference chain snapshot node have a corresponding basis in object scope.

[0056] In other words, the system does not rely solely on semantic similarity to determine alignment; instead, it assesses alignment from three perspectives: content semantics, evidence source, and object structure. This approach ensures that even if a conclusion unit and a snapshot node in an inference chain are not entirely identical in their description, an alignment can still be established as long as they share the same evidentiary basis or their entity sets correspond. For example, a conclusion unit might state that equipment malfunctions are caused by temperature fluctuations, while a snapshot node in an inference chain might output that temperature changes are the primary source of the malfunction. If both cite the same retrieved evidence or their entity sets contain objects such as equipment, temperature, and the source of the malfunction, the system can determine that the conclusion unit and the snapshot node in the inference chain are aligned.
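The three-way alignment determination of step 42 can be sketched roughly as below. The class names, the token-overlap function standing in for semantic correspondence, and the 0.5 threshold are all assumptions for illustration; the description does not specify how semantic consistency is measured.

```python
from dataclasses import dataclass, field


@dataclass
class ConclusionUnit:
    text: str
    cited_evidence: set[str] = field(default_factory=set)
    entities: set[str] = field(default_factory=set)


@dataclass
class SnapshotNode:
    output: str
    associated_evidence: set[str] = field(default_factory=set)
    entities: set[str] = field(default_factory=set)


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase tokens; a crude stand-in for semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def is_aligned(unit: ConclusionUnit, node: SnapshotNode,
               semantic_threshold: float = 0.5) -> bool:
    """Aligned when at least one of the three determinations is true."""
    semantic = token_overlap(unit.text, node.output) >= semantic_threshold
    # Evidence cited by the conclusion unit is found in the node's associated evidence.
    citation = bool(unit.cited_evidence) and unit.cited_evidence <= node.associated_evidence
    # Entity sets are equal, or one contains the other; empty sets do not count.
    entity = bool(unit.entities) and bool(node.entities) and (
        unit.entities <= node.entities or node.entities <= unit.entities
    )
    return semantic or citation or entity
```

With the temperature example above, the two texts share few tokens, yet a shared evidence identifier is enough for `is_aligned` to return true.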

[0057] In step 43, for each conclusion unit, the system counts the number of inference chain snapshot nodes that have an alignment relationship with it and whose step type belongs to the set of required inference node types. The number of inference chain snapshot nodes is used as the numerator, and the number of step types in the set of required inference node types is used as the denominator. The ratio of the numerator to the denominator is used to obtain the node coverage rate. When the node coverage rate is not less than the node coverage threshold, the node coverage determination result is determined to meet the node coverage requirement. Otherwise, the node coverage determination result is determined to not meet the node coverage requirement. The alignment relationship, node coverage rate, and node coverage determination result are written into the unified evaluation object set.

[0058] The node coverage here is not a simple count of all aligned nodes, but rather based on two conditions. The first condition is that an alignment relationship has been established between the inference chain snapshot node and the current conclusion unit; the second condition is that the step type of the inference chain snapshot node belongs to the set of required inference node types. Only inference chain snapshot nodes that simultaneously meet both of these conditions are included in the node coverage.

[0059] For example, a task-aware credibility contract requires the current task to cover at least three types of nodes: reasoning actions, retrieval actions, and synthesis actions. If a conclusion unit only aligns with two of these types of nodes, the node coverage rate is determined by the ratio of the number of valid nodes in the two types to the total number of all required reasoning node types. If this ratio reaches the node coverage threshold, the system determines that the conclusion unit meets the node coverage requirement; otherwise, the system determines that the conclusion unit does not meet the node coverage requirement.
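A minimal sketch of the node coverage calculation in step 43 follows. The description counts qualifying nodes in the numerator; this sketch counts the distinct required step types among them, an assumption that keeps the ratio within [0, 1] when several aligned nodes share one type. The function name and the handling of an empty required set are also assumptions.

```python
def node_coverage(aligned_step_types: list[str],
                  required_types: set[str],
                  threshold: float) -> tuple[float, bool]:
    """Return (node coverage rate, whether the coverage requirement is met).

    aligned_step_types: step types of the snapshot nodes already aligned
    with the current conclusion unit.
    """
    if not required_types:
        # Assumption: with no required types there is nothing to cover.
        return 0.0, False
    # Only aligned nodes whose step type is required contribute.
    covered = {t for t in aligned_step_types if t in required_types}
    rate = len(covered) / len(required_types)
    return rate, rate >= threshold
```

For a contract requiring reasoning, retrieval, and synthesis actions, a conclusion unit aligned only with reasoning and retrieval nodes obtains a rate of two-thirds, which meets a threshold of 0.6 but not one of 0.7.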

[0060] Through the above steps, the system advances the relationship between the conclusion unit set and the inference chain snapshot node set from a parallel storage state to a computable, decidable, and retrievable corresponding structure. On the one hand, the conclusion unit is no longer just a text fragment in the agent's output, but has acquired a clear alignment relationship with the inference chain snapshot nodes; on the other hand, the inference chain snapshot nodes are no longer just execution process records, but are further incorporated into the node coverage calculation scope, thus participating in subsequent credibility assessments. This ensures that subsequent evidence coverage scores and event chain consistency scores revolve around specific conclusion units, and that the determination of conclusion-level credibility is based on the completed node coverage determination, maintaining consistency throughout the entire processing chain.

[0061] In one embodiment of the present invention, after determining the alignment relationship between the conclusion unit set and the inference chain snapshot node set, the system further determines the consistency score, evidence coverage score, event chain consistency score, and questionable fragment set based on the conclusion unit set, alignment relationship, retrieval evidence set, and business event set. Here, the consistency score refers to the degree to which the same conclusion unit maintains content stability under repeated generation conditions; the evidence coverage score refers to the sufficiency of support provided by associated evidence items for the conclusion unit; the event chain consistency score refers to the degree of conformity between the business semantics involved in the conclusion unit and the actual records in the business event set; and the questionable fragment set refers to the set of fragments that failed to maintain consistency with the main result during repeated generation. Through the above processing, the system no longer provides a single credible judgment on the agent's output result, but extracts the state of the conclusion unit in three directions: generation stability, evidence support, and business closure, and writes them back to the unified evaluation object set for subsequent credibility fusion and verification location.

[0062] Specifically, in step 51, for each conclusion unit, the system first extracts associated evidence items and candidate business event records based on alignment relationships. Associated evidence items refer to evidence items that can be traced back to the current conclusion unit through alignment relationships; candidate business event records refer to event records that may have business connections with the current conclusion unit under the unified task identifier constraint. Subsequently, the system performs two rapid regenerations on each conclusion unit and calculates the semantic similarity between the two rapid regeneration results and the original conclusion unit. When both rapid regeneration results meet the preset rapid consistency condition, the average of the two semantic similarities is determined as the consistency score. The preset rapid consistency condition means that the semantic similarity between the rapid regeneration result and the original conclusion unit meets the system's pre-set judgment requirements. If the two rapid regeneration results do not simultaneously meet this condition, three additional regenerations are performed, and the five generation results are clustered. The number of main cluster results is counted, and the consistency score is determined by the ratio between the number of main cluster results and the number of five generation results.

[0063] At the same time, the system organizes the conclusion fragments that deviate from the main cluster into a set of questionable fragments. In other words, the system first judges whether the conclusion unit has stability by regenerating a small number of units. If the stability is insufficient, it further determines the range of fluctuation by expanding the generated results, thereby accurately assigning the unstable content to specific fragments.

[0064] For example, if a conclusion unit maintains the same judgment in the first two rapid regenerations, its consistency score is high; if it only clusters into the same semantic cluster in three of the subsequent five generations, its consistency score is reduced accordingly, and the fragments that do not enter the main cluster are retained in the set of questionable fragments.
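The two-stage regeneration procedure of step 51 might be sketched as below. The `regenerate` and `similarity` callables, both thresholds, and the greedy clustering rule are hypothetical; the description does not name a clustering method.

```python
from typing import Callable


def consistency_score(original: str,
                      regenerate: Callable[[], str],
                      similarity: Callable[[str, str], float],
                      quick_threshold: float = 0.8,
                      cluster_threshold: float = 0.8) -> tuple[float, list[str]]:
    """Return (consistency score, questionable fragments)."""
    # Stage 1: two rapid regenerations compared with the original unit.
    quick = [regenerate() for _ in range(2)]
    sims = [similarity(original, r) for r in quick]
    if all(s >= quick_threshold for s in sims):
        return sum(sims) / 2, []

    # Stage 2: three more regenerations, then cluster all five results.
    results = quick + [regenerate() for _ in range(3)]
    clusters: list[list[str]] = []
    for r in results:
        for c in clusters:
            # Greedy rule (assumed): join the first cluster whose
            # first member is similar enough.
            if similarity(c[0], r) >= cluster_threshold:
                c.append(r)
                break
        else:
            clusters.append([r])
    main = max(clusters, key=len)
    questionable = [r for c in clusters if c is not main for r in c]
    return len(main) / len(results), questionable
```

When three of the five results fall into the main cluster, the score is 0.6 and the two remaining results are returned as questionable fragments.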

[0065] In step 52, the system extracts the relevance and validity markers of related evidence items for each conclusion unit, and determines the evidence coverage score accordingly. The relevance refers to the degree of content fit between the related evidence item and the current conclusion unit; the validity marker refers to whether the related evidence item meets the criteria of being parseable, complete in content, and not conflicting with the current conclusion unit. If the content of a related evidence item can be parsed normally, the evidence content is complete, and it neither contradicts nor conflicts with the current conclusion unit, its validity marker is determined to be valid; otherwise, it is determined to be invalid. The system then multiplies the relevance of each related evidence item by its validity marker and sums the products to obtain the numerator, sums the relevance of each related evidence item to obtain the denominator, and takes the ratio of the numerator to the denominator as the evidence coverage score. The formula for the evidence coverage score can be expressed as:

$$E_i = \frac{\sum_{j=1}^{n_i} r_{ij}\, v_{ij}}{\sum_{j=1}^{n_i} r_{ij}}$$

where $E_i$ denotes the evidence coverage score of the $i$-th conclusion unit, which comprehensively reflects how sufficiently the related evidence items support the current conclusion unit; $n_i$ denotes the number of evidence items associated with the $i$-th conclusion unit; $r_{ij}$ denotes the relevance of the $j$-th related evidence item to the $i$-th conclusion unit; and $v_{ij}$ denotes the validity marker of the $j$-th related evidence item, which takes the value one when the item is valid and zero when it is invalid. When no related evidence items exist, or the sum of the relevance of all related evidence items is zero, the system sets the evidence coverage score to zero.
Through this process, the evidence coverage score considers both the relevance of each evidence item to the conclusion unit and the validity of the evidence item itself, avoiding judgment based solely on the quantity of evidence. For example, if a conclusion unit is associated with multiple pieces of evidence, but most of the evidence is incomplete or inconsistent with the statement of the conclusion unit, the evidence coverage score will not be inflated simply because of the large number of pieces of evidence.
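The evidence coverage score described in step 52 reduces to a relevance-weighted share of valid evidence. A small sketch, with an assumed function name and input layout:

```python
def evidence_coverage_score(items: list[tuple[float, bool]]) -> float:
    """items: (relevance, is_valid) for each related evidence item.

    Numerator: sum of relevance over valid items (validity counts as 1 or 0).
    Denominator: sum of relevance over all items.
    """
    denominator = sum(relevance for relevance, _ in items)
    if denominator == 0:
        # No related evidence, or all relevance values are zero.
        return 0.0
    numerator = sum(relevance for relevance, valid in items if valid)
    return numerator / denominator
```

For three items with relevance 0.9 (valid), 0.6 (invalid), and 0.5 (valid), the score is 1.4 / 2.0 = 0.7; adding more invalid items lowers the score rather than raising it.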

[0066] In step 53, the system extracts ordered event constraints for each conclusion unit involving task execution status, cause of anomaly, handling conclusion, or execution suggestion, and counts the number of business events matching the ordered event constraints to determine the event chain consistency score. The ordered event constraints refer to the event sequence constraints formed by the event objects, event sequence relationships, and event status requirements extracted from the conclusion unit. After obtaining the ordered event constraints, the system compares each candidate business event record, counts the number of business events matching the ordered event constraints, and uses this number as the numerator and the number of ordered event constraints as the denominator to obtain the event chain consistency score through the ratio of the two. When a conclusion unit does not involve task execution status, cause of anomaly, handling conclusion, or execution suggestion, the system records the event chain consistency score as zero. Thus, it is only necessary to perform a closed-loop comparison between the conclusion unit and the business event set when the conclusion unit itself contains business process semantics; if the conclusion unit is merely a general factual description, there is no need to introduce event chain verification.

[0067] For example, if a conclusion unit indicates that an exception has triggered a work order, completed processing, and formed a review result, the system will extract the corresponding event object, sequence, and status requirements, and then match them with the work order record, processing record, and review record in the business event set to obtain the event chain consistency score.
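A simplified sketch of the event chain consistency score in step 53 follows. Representing each ordered event constraint and each business event record as an `(event_type, status)` pair, and matching constraints against records in chronological order, are assumptions made for the example.

```python
def event_chain_consistency(constraints: list[tuple[str, str]],
                            records: list[tuple[str, str]]) -> float:
    """Share of ordered event constraints matched by business event records.

    constraints: required (event_type, status) pairs, in required order.
    records: observed (event_type, status) pairs, in chronological order.
    """
    if not constraints:
        # Conclusion unit carries no business-process semantics.
        return 0.0
    matched = 0
    position = 0  # records before this index are already consumed
    for constraint in constraints:
        for i in range(position, len(records)):
            if records[i] == constraint:
                matched += 1
                position = i + 1  # later constraints must match later records
                break
    return matched / len(constraints)
```

In the work order example, if the business event set contains the work order record and the processing record but no review record, two of the three constraints match and the score is two-thirds.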

[0068] Through the above steps, the system obtains consistency scores, evidence coverage scores, event chain consistency scores, and a set of questionable segments, and writes these results into a unified evaluation object set. This allows for the separate storage of the state of each conclusion unit in terms of generation stability, evidence support, and business closure, and also provides direct input for the subsequent calculation of basic credibility scores, conclusion-level credibility, and result-level credibility. Meanwhile, the set of questionable segments will continue to be used in subsequent hierarchical tracing and key verification, enabling the system to further refer back to specific conclusion segments for low-credibility parts, rather than remaining at the overall score level. Through this processing, the credibility evaluation of the agent's output results is no longer based on single text judgments, but is established on the continuous correspondence between conclusion units, evidence items, and business event records.

[0069] In one embodiment of the present invention, after obtaining the consistency score, evidence coverage score, event chain consistency score, and node coverage rate, the system further determines the uncertainty score, source quality score, basic credibility score, conclusion-level credibility, and result-level credibility based on the above results and the retrieved evidence set. Here, the uncertainty score represents the degree of fluctuation in the output content of the conclusion unit; the source quality score represents the reliability of the retrieved evidence supporting the conclusion unit at the source level; the basic credibility score represents the intermediate result after multiple evaluation dimensions are integrated under the same weighting system; the conclusion-level credibility represents the credibility of a single conclusion unit combined with node coverage; and the result-level credibility represents the comprehensive credibility of all conclusion units at the overall output level. Through the above processing, the system further consolidates the multi-dimensional verification results obtained in the previous step into two credibility levels: the conclusion level and the result level, enabling subsequent source tracing and key verification to be established on a unified scoring framework.

[0070] Specifically, in step 61, the system extracts the average information entropy of the corresponding output segment of each conclusion unit, and determines the uncertainty score by the ratio between the average information entropy and the system's preset maximum information entropy. The average information entropy refers to the overall degree of uncertainty exhibited by the output segment of the conclusion unit during its generation process, while the system's preset maximum information entropy is the upper limit benchmark used for normalization processing. In other words, when the output content of a conclusion unit is more dispersed and unstable during generation, its average information entropy is higher, and the corresponding uncertainty score is also higher; conversely, when the output content of a conclusion unit is more concentrated and stable, its uncertainty score is lower.
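The normalization in step 61 can be illustrated as below. How the average information entropy is obtained is not stated in the description; this sketch assumes per-token probability distributions are available from the generating model, uses Shannon entropy in nats, and clips the ratio at one. All names are hypothetical.

```python
import math


def average_entropy(token_distributions: list[list[float]]) -> float:
    """Mean Shannon entropy (nats) over per-token probability distributions."""
    if not token_distributions:
        return 0.0
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0)
        for dist in token_distributions
    ]
    return sum(entropies) / len(entropies)


def uncertainty_score(token_distributions: list[list[float]],
                      max_entropy: float) -> float:
    """Average entropy divided by the preset maximum entropy, clipped to 1 (assumed)."""
    if max_entropy <= 0:
        raise ValueError("max_entropy must be positive")
    return min(average_entropy(token_distributions) / max_entropy, 1.0)
```

A segment whose tokens were each drawn from a uniform distribution over four candidates reaches the maximum when `max_entropy` is `math.log(4)`, while a fully deterministic segment scores zero.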

[0071] In step 62, for each conclusion unit, the system extracts the relevance and source quality level of the corresponding retrieved evidence items from the retrieved evidence set, and determines the source quality score accordingly. The relevance refers to the degree of content fit between the retrieved evidence item and the current conclusion unit, and the source quality level refers to the level of credibility of the source of the retrieved evidence item. The system multiplies the relevance of each retrieved evidence item by its source quality level, sums the products, and divides the sum by the total relevance of all retrieved evidence items to obtain the source quality score of the current conclusion unit.

[0072] It's important to note that this isn't a simple averaging of source quality levels, but rather considers the actual relevance between the retrieved evidence items and the conclusion unit. If a conclusion unit has no corresponding retrieved evidence item, or the sum of the relevance of all retrieved evidence items is zero, the system records the source quality score as zero. This processing means that evidence items with reliable sources and closer relevance to the conclusion content have a greater impact on the source quality score; evidence items with high source quality but weak relevance to the current conclusion unit will not be overemphasized.
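Read as a relevance-weighted average, the source quality score could be sketched as follows. Expressing the source quality level as a number in [0, 1] and the function name are assumptions.

```python
def source_quality_score(items: list[tuple[float, float]]) -> float:
    """items: (relevance, source_quality_level) per retrieved evidence item.

    source_quality_level is assumed to be normalized to [0, 1].
    """
    total_relevance = sum(relevance for relevance, _ in items)
    if total_relevance == 0:
        # No retrieved evidence, or all relevance values are zero.
        return 0.0
    weighted = sum(relevance * quality for relevance, quality in items)
    return weighted / total_relevance
```

With items (0.9, 1.0) and (0.1, 0.2), the score is 0.92: the highly relevant, reliable item dominates, and the weakly relevant item contributes little regardless of its quality level.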

[0073] In step 63, the system extracts the evaluation dimension weights corresponding to the consistency score, evidence coverage score, event chain consistency score, uncertainty score, and source quality score from the task-aware credibility contract, and determines the basic credibility score by combining the scores of each dimension. The task-aware credibility contract here is the evaluation constraint information formed earlier based on the task type label; therefore, the weight configurations differ between task types. The system multiplies the weight of each evaluation dimension by its corresponding score and sums the products. Before being included in the sum, the uncertainty score is converted to one minus the uncertainty score; that is, the lower the degree of uncertainty, the higher its contribution to the basic credibility score, and the higher the degree of uncertainty, the lower its contribution.

[0074] After completing the above processing, the system obtains a basic credibility score. Subsequently, the system combines this basic credibility score with the node coverage rate to determine the conclusion-level credibility. The node coverage rate here is derived from the calculation results of the alignment relationship and the set of necessary inference node types in the previous step. Therefore, conclusion-level credibility does not only consider the score but also whether the conclusion unit has sufficient process support. If a conclusion unit performs well in multiple evaluation dimensions but its node coverage rate is low, the final conclusion-level credibility will still be constrained. Through this processing, the system integrates the comprehensive result of multi-dimensional scores with the inference process coverage in the same judgment chain.
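The weighted fusion of step 63 and its combination with node coverage might look like the sketch below. The dictionary keys are hypothetical, and multiplying the basic credibility score by the node coverage rate is an assumption: the description says only that the two are combined and that low coverage constrains the result.

```python
def basic_credibility(scores: dict[str, float],
                      weights: dict[str, float]) -> float:
    """Weighted sum of dimension scores, with uncertainty inverted first."""
    adjusted = dict(scores)
    # Lower uncertainty should contribute more, so use 1 - uncertainty.
    adjusted["uncertainty"] = 1.0 - scores["uncertainty"]
    return sum(weights[name] * adjusted[name] for name in weights)


def conclusion_credibility(basic: float, node_coverage_rate: float) -> float:
    """Assumed combination rule: scale the basic score by node coverage."""
    return basic * node_coverage_rate
```

Under this assumed rule, a conclusion unit with a basic credibility score of 0.84 but a node coverage rate of 0.5 ends with a conclusion-level credibility of 0.42, reflecting insufficient process support.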

[0075] In step 64, the system normalizes the importance level of each conclusion unit to the sum of the importance levels of all conclusion units to obtain the weight of each conclusion unit. Then, it aggregates the weights of each conclusion unit with their corresponding conclusion-level credibility to obtain the result-level credibility. The result-level credibility is not a simple average of all conclusion-level credibility, but rather a weighted average based on the importance of each conclusion unit in the overall output. That is, conclusion units with a greater impact on the overall judgment have a higher participation rate in the result-level credibility; conclusion units with a smaller impact on the overall judgment have a relatively lower participation rate. Afterward, the system writes the uncertainty score, source quality score, basic credibility score, conclusion-level credibility, and result-level credibility into a unified evaluation object set. In this way, the unified evaluation object set retains both the procedural data formed in the previous steps and the hierarchical credibility results formed in this step. These results can be directly used when constructing a hierarchical traceability tree, generating traceability record packages, and determining a list of key verification nodes.
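The importance-weighted aggregation of step 64 can be sketched as below; the function name, the input layout, and the zero result for an all-zero importance total are assumptions.

```python
def result_credibility(units: list[tuple[float, float]]) -> float:
    """units: (importance, conclusion_level_credibility) per conclusion unit."""
    total_importance = sum(importance for importance, _ in units)
    if total_importance == 0:
        # Assumption: nothing to aggregate when no unit carries importance.
        return 0.0
    # Normalize each importance into a weight, then take the weighted sum.
    return sum(
        (importance / total_importance) * credibility
        for importance, credibility in units
    )
```

Two conclusion units with importance 3 and 1 and conclusion-level credibility 0.9 and 0.5 give a result-level credibility of 0.8, whereas a simple average would give 0.7.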

[0076] After the above steps, the system further consolidates the consistency score, evidence coverage score, event chain consistency score, node coverage rate, and retrieved evidence set into uncertainty score, source quality score, basic credibility score, conclusion-level credibility, and result-level credibility. This not only clearly represents the local credibility state of individual conclusion units but also organizes the comprehensive credibility state of the agent's overall output, ensuring that the credibility assessment results retain both hierarchy and data continuity between steps.

[0077] In one embodiment of the present invention, after obtaining the conclusion-level credibility, result-level credibility, alignment relationship, inference chain snapshot node set, retrieval evidence set, business event set, and questionable fragment set, the system further constructs a hierarchical tracing tree, generates a tracing record package, and determines a list of key verification nodes. Here, the hierarchical tracing tree refers to a tree-like relational structure that expands layer by layer along the inference chain snapshot nodes, evidence items, and business event records, starting from the conclusion unit; the tracing record package refers to a set of structured tracing results formed around the current task to be evaluated; and the list of key verification nodes refers to the conclusion units and their associated object sets that need to be prioritized for review in the current task. Through the above processing, the credibility results, process nodes, evidence relationships, and business event relationships formed in the aforementioned steps are further consolidated into a structured result that can be directly traced and located.

[0078] Specifically, in step 71, the system extracts conclusion-level credibility, result-level credibility, alignment relationship, inference chain snapshot node set, retrieval evidence set, business event set, questionable fragment set, and event chain consistency score from the unified evaluation object set, establishing a four-layer node boundary consisting of conclusion units, inference chain snapshot nodes, evidence items, and business event records. The four-layer node boundary refers to the boundary structure within the current task scope, uniformly dividing the objects participating in source tracing according to hierarchical relationships. The conclusion unit is located in the result layer, the inference chain snapshot nodes are located in the process layer, the evidence items are located in the basis layer, and the business event records are located in the business closed-loop layer.

[0079] It's important to note that the system doesn't regenerate new objects at this stage; instead, it reorganizes the objects created in previous steps into a hierarchical structure. This process ensures that each conclusion unit has a clearly defined downstream scope. Subsequent steps, such as binding inference chain snapshot nodes or extracting evidence items and business event records, all unfold within the same hierarchical framework, thus avoiding confusion and overlap between different objects.

[0080] In step 72, the system binds each conclusion unit to its corresponding inference chain snapshot node based on alignment relationships. Then, based on the associated evidence and associated business events of the inference chain snapshot nodes, it constructs a hierarchical tracing tree from the conclusion unit to the inference chain snapshot node, evidence items, and business event records. Here, binding refers to establishing a one-to-one or one-to-many correspondence between conclusion units that already satisfy the alignment relationship and the inference chain snapshot nodes; associated evidence refers to the evidentiary basis written into the inference chain snapshot nodes; and associated business events refer to the business event records that can be referenced back by the conclusion unit or the inference chain snapshot node.

[0081] In other words, the hierarchical source tree is not derived by inferring from the credibility of the result level, but rather by unfolding layer by layer along alignment relationships, evidence associations, and business event associations. For example, if a conclusion unit corresponds to two inference chain snapshot nodes, one of which is connected to the retrieved evidence item, and the other is connected to the work order flow record, the system will connect the conclusion unit, the two inference chain snapshot nodes, the corresponding evidence item, and the corresponding business event record layer by layer to form a complete backtracking path. Through this process, subsequent verification does not require retracing the entire log; instead, it can directly locate the specific process node and specific business record along the tree structure.
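The layer-by-layer expansion of step 72 can be illustrated with plain dictionaries. The mapping names and the nested-dictionary layout are assumptions; nodes are keyed by their fingerprints.

```python
def build_tracing_tree(unit_id: str,
                       alignment: dict[str, list[str]],
                       node_evidence: dict[str, list[str]],
                       node_events: dict[str, list[str]]) -> dict:
    """Expand one conclusion unit into nodes, evidence items, and event records.

    alignment: conclusion unit id -> fingerprints of aligned snapshot nodes.
    node_evidence / node_events: fingerprint -> associated identifiers.
    """
    return {
        "conclusion_unit": unit_id,           # result layer
        "nodes": [
            {
                "fingerprint": fp,             # process layer
                "evidence_items": node_evidence.get(fp, []),   # basis layer
                "business_events": node_events.get(fp, []),    # business closed-loop layer
            }
            for fp in alignment.get(unit_id, [])
        ],
    }
```

For a conclusion unit aligned with two nodes, one carrying a retrieved evidence item and the other a work order flow record, the resulting tree has two branches, each leading directly to the relevant evidence or business record.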

[0082] In step 73, the system extracts a unified task identifier, task type label, result-level credibility, conclusion-level credibility list, timestamp, node fingerprint index, and counter-evidence event list based on the hierarchical tracing tree, generating a tracing record package. The node fingerprint index refers to the unique tracking marker of the inference chain snapshot node in the current task; the counter-evidence event list refers to the set of business events that are inconsistent with the current conclusion or insufficient to support the current conclusion chain. Through the above processing, the structural relationships in the hierarchical tracing tree are further organized into standardized record results that are easy to store, retrieve, and call.

[0083] The traceability record package here does not simply copy the hierarchical traceability tree, but rather writes the task-level identifier, result-level credibility, conclusion-level credibility list, and node index results into the same record object. This preserves the hierarchical relationship in the tree structure and provides a stable data entry point for subsequent key verification node list output and business review processes.

[0084] In step 74, the system compares the conclusion-level credibility of each conclusion unit with the conclusion-level credibility threshold, and compares the event chain consistency score of each conclusion unit with the event chain consistency threshold. Conclusion units whose conclusion-level credibility is lower than the conclusion-level credibility threshold, or whose event chain consistency score is lower than the event chain consistency threshold, are written into the key verification node list together with their corresponding inference chain snapshot nodes, evidence items, business event records, and sets of questionable fragments. The hierarchical tracing tree, tracing record package, and key verification node list are then written into the unified evaluation object set. Here, the conclusion-level credibility threshold refers to the credibility value below which a conclusion unit enters the key verification scope; the event chain consistency threshold refers to the minimum requirement for the degree of conformity of the business closed loop.

[0085] Through this process, the system does not select verification targets solely based on a single credibility result, but considers both conclusion-level credibility and event chain consistency score simultaneously. If a conclusion unit has a low conclusion-level credibility, it will still be added to the list of key verification nodes even if its event chain consistency score is not abnormal; conversely, if a conclusion unit has a high conclusion-level credibility, but its business event chain is not closed or its records are inconsistent, it will also be included in the scope of key verification. This approach can identify both conclusions with insufficient scores and conclusions with abnormal business chains.
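The dual-threshold selection of step 74 can be sketched as a simple filter. Treating "below the threshold" as the trigger for both comparisons is an assumption consistent with the examples above; the function name and input layout are hypothetical.

```python
def key_verification_units(units: dict[str, tuple[float, float]],
                           credibility_threshold: float,
                           chain_threshold: float) -> list[str]:
    """units: unit id -> (conclusion_level_credibility, event_chain_consistency).

    A unit is selected when either value falls below its threshold.
    """
    return [
        unit_id
        for unit_id, (credibility, chain_score) in units.items()
        if credibility < credibility_threshold or chain_score < chain_threshold
    ]
```

A unit with low credibility and a normal event chain is selected, as is a unit with high credibility and an unclosed event chain; only units passing both comparisons are left out.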

[0086] In one embodiment of the present invention, the generation of the counter-evidence event list includes the following processing. In step 81, the system extracts the ordered event constraints corresponding to each conclusion unit in the key verification node list, as well as the business event records corresponding to that conclusion unit. The ordered event constraints refer to the set of constraints formed by the event objects, event sequence relationships, and event state requirements extracted from the semantics of the conclusion unit.

[0087] In step 82, the system compares each business event record according to the event object, event sequence, and event status requirements in the ordered event constraints. It then filters out business event records that lack corresponding events, violate event sequence, or fail to meet event status requirements, identifying them as counter-evidence events. Here, counter-evidence events do not refer to all abnormal business events, but rather to business event records that are inconsistent with the semantic requirements of the current conclusion unit, thus negating or weakening that conclusion. For example, if a conclusion unit indicates that a work order has been processed and approved, but the business event records only contain records of work order creation and processing, without any records of approval, then the relevant missing event will be identified as a counter-evidence event.

[0088] In step 83, the system associates each counter-evidence event with its corresponding conclusion unit, corresponding inference chain snapshot node, and corresponding node fingerprint index to obtain a list of counter-evidence events, which is then written into the tracing record package. Through the above processing, the tracing record package not only retains the positive supporting relationships but also retains the business event paths that provide reverse verification of the conclusion content.
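The comparison in step 82 can be sketched as follows, again assuming that constraints and records are `(event_type, status)` pairs in order, and that the first record of each event type is the one checked. The finding labels are hypothetical; associating each finding with the conclusion unit and node fingerprint index, as in step 83, is left out for brevity.

```python
def counter_evidence(constraints: list[tuple[str, str]],
                     records: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return (event_type, reason) for each constraint the records fail."""
    findings: list[tuple[str, str]] = []
    last_index = -1  # latest record position matched so far
    for event_type, required_status in constraints:
        indices = [i for i, (t, _) in enumerate(records) if t == event_type]
        if not indices:
            findings.append((event_type, "missing"))
            continue
        i = indices[0]
        if records[i][1] != required_status:
            findings.append((event_type, "status_mismatch"))
        if i < last_index:
            # The event occurred before an event it was required to follow.
            findings.append((event_type, "order_violation"))
        last_index = max(last_index, i)
    return findings
```

In the work order example, records containing only creation and processing yield a single finding, `("review", "missing")`, which would then be attached to the conclusion unit and written into the tracing record package.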

[0089] After the above steps, the system further organizes the conclusion-level credibility, result-level credibility, alignment relationship, inference chain snapshot node set, retrieval evidence set, business event set, and questionable fragment set into a hierarchical tracing tree, tracing record package, and key verification node list, and forms a counter-evidence event list based on the key verification nodes. This not only assigns the credibility results of the agent's output to specific conclusion units, specific inference chain snapshot nodes, specific evidence items, and specific business event records, but also allows for the separate extraction of business events that conflict with or do not match the current conclusion, facilitating subsequent review and tracking.

[0090] It should be noted that the interval and threshold sizes are set for ease of comparison. The size of each threshold depends on the amount of sample data and on the base number that those skilled in the art set for each group of sample data, provided that the proportional relationship between the parameter and the quantized value is not affected. Furthermore, the above formulas are all dimensionless calculations; they were obtained by software simulation over a large amount of collected data so as to approximate real-world conditions as closely as possible. The preset parameters in the formulas are set by those skilled in the art according to the actual situation.

[0091] The embodiments of the present invention have been described above, but the present invention is not limited to the specific embodiments described. These embodiments are merely illustrative and not restrictive. Those skilled in the art can devise many other forms under the guidance of the present embodiments, all of which fall within the protection scope of the present invention.

Claims

1. An AI agent output result credibility evaluation and provenance tracking system, characterized in that it comprises:
a task aggregation module, used to obtain the task request information, agent output results, execution process logs, tool call records, retrieval evidence set, and business event set corresponding to the task to be evaluated, and to associate and organize them to obtain a unified evaluation object set;
a type determination module, used to determine the task type based on the task request information to obtain a task type label, and to determine a task-aware credibility contract based on the task type label;
a result decomposition module, used to segment the agent output results to obtain a conclusion unit set, and to obtain an inference chain snapshot node set based on the execution process logs, tool call records, and retrieval evidence set;
an alignment and coverage module, used to determine alignment relationships and node coverage based on the conclusion unit set, the inference chain snapshot node set, and the task-aware credibility contract;
a multi-dimensional evaluation module, used to determine a consistency score, an evidence coverage score, an event chain consistency score, and a questionable fragment set based on the conclusion unit set, the alignment relationships, the retrieval evidence set, and the business event set;
a credibility fusion module, used to determine an uncertainty score, a source quality score, a basic credibility score, a conclusion-level credibility, and a result-level credibility based on the consistency score, the evidence coverage score, the event chain consistency score, the node coverage rate, and the retrieval evidence set;
a source tracing and verification module, used to construct a hierarchical tracing tree based on the conclusion-level credibility, the result-level credibility, the alignment relationships, the inference chain snapshot node set, the retrieval evidence set, the business event set, and the questionable fragment set, to generate a tracing record package, and to determine a key verification node list.

2. The AI agent output result credibility evaluation and provenance tracking system of claim 1, wherein obtaining the task request information, agent output results, execution process logs, tool call records, retrieval evidence set, and business event set corresponding to the task to be evaluated, and associating and organizing them to obtain the unified evaluation object set, comprises:
Step 11: Write the corresponding record type identifier to each record in the task request information, agent output results, execution process logs, tool call records, retrieval evidence set, and business event set, and write the same unified task identifier to each record;
Step 12: After the unified task identifier has been written, extract the original time information of each record, convert it into a unified time format, sort the records by the unified time, and determine the ordinal index of each record, using the record type identifier and the original record order where the unified times are the same;
Step 13: Establish an association mapping based on the reference relationships between records, the generation relationship between tool call records and the execution process logs and agent output results, the support relationship between the retrieval evidence set and the execution process logs and agent output results, and the triggering relationship between the agent output results, tool call records, and execution process logs and the business event set; and write each record carrying the unified task identifier, together with its ordinal index and the association mapping, into the unified evaluation object set.
3. The AI agent output result credibility evaluation and provenance tracking system of claim 1, wherein determining the task type based on the task request information to obtain the task type label, and determining the task-aware credibility contract based on the task type label, comprises:
Step 21: Extract the task request information from the unified evaluation object set, and perform semantic analysis and classification on the task request information, wherein the task types include fact query, reasoning analysis, and creative generation; determine the matching degree between the task request information and each of fact query, reasoning analysis, and creative generation;
Step 22: Based on the matching degrees for fact query, reasoning analysis, and creative generation, determine the task type with the highest matching degree as the task type label;
Step 23: Based on the task type label, extract the corresponding evaluation dimension weights, necessary evidence type set, necessary inference node type set, node coverage threshold, and closed-loop verification threshold from the preset task type contract template to form the task-aware credibility contract, and write the task type label and the task-aware credibility contract into the unified evaluation object set.

4. The AI agent output result credibility evaluation and provenance tracking system of claim 1, wherein segmenting the agent output results to obtain the conclusion unit set, and obtaining the inference chain snapshot node set based on the execution process logs, tool call records, and retrieval evidence set, comprises:
Step 31: Extract the agent output results and the task type label from the unified evaluation object set; when the task type label is fact query, segment according to fact assertion boundaries; when the task type label is reasoning analysis, segment according to reasoning proposition boundaries; when the task type label is creative generation, segment according to the boundaries of semantically complete expressions; obtain a conclusion unit set arranged in the original order, and determine the importance level of each conclusion unit;
Step 32: Extract the execution process logs, tool call records, and retrieval evidence set from the unified evaluation object set; convert the reasoning actions, tool call actions, retrieval actions, and comprehensive actions into inference chain snapshot nodes in execution order; write the step type, node input, node output, associated evidence, tool call information, timestamp, and anomaly mark to each inference chain snapshot node to obtain the inference chain snapshot node set;
Step 33: Divide the importance level of each conclusion unit by the sum of the importance levels of all conclusion units to obtain the conclusion unit weight of each conclusion unit; combine the unified task identifier, step type, set of associated evidence identifiers, and node output summary in a preset order and apply irreversible encoding to obtain the node fingerprint of each inference chain snapshot node; write the conclusion unit set, conclusion unit weights, inference chain snapshot node set, and node fingerprints into the unified evaluation object set.
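By way of non-limiting illustration, the weight normalization and node fingerprint of Step 33 can be sketched in Python. The claim requires only an irreversible encoding; the use of SHA-256, the field separator, the sorting of evidence identifiers, and the function names `conclusion_unit_weights` and `node_fingerprint` are assumptions made for this sketch.

```python
import hashlib
from typing import Iterable, List


def conclusion_unit_weights(importance_levels: List[float]) -> List[float]:
    """Divide each importance level by the sum of all importance levels."""
    total = sum(importance_levels)
    if total <= 0:
        # Behaviour for a non-positive sum is not specified; rejecting it is an assumption.
        raise ValueError("importance levels must sum to a positive value")
    return [level / total for level in importance_levels]


def node_fingerprint(task_id: str, step_type: str,
                     evidence_ids: Iterable[str], output_summary: str) -> str:
    """Combine the fields in a fixed order and hash them (SHA-256 is an assumed choice)."""
    # Sorting the evidence identifiers makes the fingerprint independent of listing order.
    fields = [task_id, step_type, ",".join(sorted(evidence_ids)), output_summary]
    payload = "\x1f".join(fields).encode("utf-8")  # unit separator between fields
    return hashlib.sha256(payload).hexdigest()


weights = conclusion_unit_weights([3, 1])
fingerprint = node_fingerprint("task-001", "retrieval", ["ev2", "ev1"], "found approval log")
```

Because the fields are joined in a fixed order, two nodes receive the same fingerprint only when their task identifier, step type, evidence identifiers, and output summary all coincide.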
5. The AI agent output result credibility evaluation and provenance tracking system of claim 1, wherein determining the alignment relationships and node coverage based on the conclusion unit set, the inference chain snapshot node set, and the task-aware credibility contract comprises:
Step 41: Extract the conclusion unit set, the inference chain snapshot node set, and the task-aware credibility contract from the unified evaluation object set, and extract the necessary inference node type set and the node coverage threshold from the task-aware credibility contract;
Step 42: For each conclusion unit and each inference chain snapshot node, determine whether the conclusion unit and the node output satisfy a semantic correspondence relationship, determine whether the conclusion unit and the associated evidence satisfy an evidence citation correspondence relationship, and determine whether the entity set of the conclusion unit and the entity set in the inference chain snapshot node satisfy an equality relationship or an inclusion relationship; when any of the semantic correspondence relationship, the evidence citation correspondence relationship, the entity set equality relationship, or the entity set inclusion relationship holds, determine that the conclusion unit and the inference chain snapshot node are aligned; otherwise, determine that they are not aligned;
Step 43: For each conclusion unit, count the inference chain snapshot nodes that are aligned with it and whose step type belongs to the necessary inference node type set; use this count as the numerator and the number of step types in the necessary inference node type set as the denominator, and take the ratio of the numerator to the denominator as the node coverage rate; when the node coverage rate is not less than the node coverage threshold, determine that the node coverage determination result meets the node coverage requirement; otherwise, determine that the node coverage determination result does not meet the node coverage requirement; write the alignment relationships, node coverage rate, and node coverage determination result into the unified evaluation object set.

6. The AI agent output result credibility evaluation and provenance tracking system of claim 1, wherein determining the consistency score, evidence coverage score, event chain consistency score, and questionable fragment set based on the conclusion unit set, the alignment relationships, the retrieval evidence set, and the business event set comprises:
Step 51: For each conclusion unit, extract the related evidence items and candidate business event records based on the alignment relationships; perform two fast regenerations for each conclusion unit; if both fast regeneration results meet the preset fast consistency condition, determine the average of the two semantic similarities as the consistency score; otherwise, perform three additional regenerations, determine the ratio of the number of results in the main cluster to the total of five generated results as the consistency score, and determine the conclusion fragments that deviate from the main cluster as the questionable fragment set;
Step 52: For each conclusion unit, extract the relevance and validity marker of each related evidence item; use the sum of the products of the relevance and validity marker of each related evidence item as the numerator and the sum of the relevances of the related evidence items as the denominator, and determine the ratio of the numerator to the denominator as the evidence coverage score; when there are no related evidence items, or the sum of the relevances is zero, record the evidence coverage score as zero;
Step 53: For each conclusion unit involving task execution status, cause of anomaly, handling conclusion, or execution recommendation, extract the ordered event constraints, count the business events that match the ordered event constraints, use this count as the numerator and the number of ordered event constraints as the denominator, and determine the ratio of the numerator to the denominator as the event chain consistency score; when the conclusion unit does not involve task execution status, cause of anomaly, handling conclusion, or execution recommendation, record the event chain consistency score as zero; write the consistency score, evidence coverage score, event chain consistency score, and questionable fragment set into the unified evaluation object set.
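By way of non-limiting illustration, the ratios recited in Step 43, Step 52, and Step 53 can be sketched in Python. The function names, the step type strings, and the 0.8 threshold are hypothetical; returning zero for an empty necessary inference node type set or an empty constraint list is an assumption, since those cases are not addressed above.

```python
from typing import Iterable, List, Set, Tuple


def node_coverage(aligned_step_types: Iterable[str], required_types: Set[str]) -> float:
    """Step 43: aligned nodes of a required type / number of required step types."""
    if not required_types:
        return 0.0  # assumption: an empty required set is not covered by the text
    hits = sum(1 for t in aligned_step_types if t in required_types)
    return hits / len(required_types)


def evidence_coverage(items: List[Tuple[float, int]]) -> float:
    """Step 52: sum(relevance * validity) / sum(relevance), or zero if undefined."""
    denominator = sum(relevance for relevance, _ in items)
    if denominator == 0:
        return 0.0  # no related evidence items, or all relevances are zero
    return sum(relevance * valid for relevance, valid in items) / denominator


def event_chain_consistency(matched: int, constraint_count: int,
                            applicable: bool = True) -> float:
    """Step 53: matched business events / ordered event constraints."""
    if not applicable or constraint_count == 0:
        return 0.0
    return matched / constraint_count


coverage = node_coverage(["retrieval", "reasoning", "tool_call"], {"retrieval", "reasoning"})
meets_requirement = coverage >= 0.8  # 0.8 is a hypothetical node coverage threshold
ev_score = evidence_coverage([(0.8, 1), (0.2, 0)])  # (relevance, validity marker)
chain_score = event_chain_consistency(matched=2, constraint_count=3)
```

Note that the node coverage rate as recited counts nodes rather than distinct step types, so in this literal reading several aligned nodes of the same required type each add to the numerator.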

7. The AI agent output result credibility evaluation and provenance tracking system of claim 1, wherein determining the uncertainty score, source quality score, basic credibility score, conclusion-level credibility, and result-level credibility based on the consistency score, evidence coverage score, event chain consistency score, node coverage rate, and retrieval evidence set comprises:
Step 61: For each conclusion unit, extract the average information entropy of the output segment corresponding to the conclusion unit, and determine the ratio of the average information entropy to the system's preset maximum information entropy as the uncertainty score;
Step 62: For each conclusion unit, extract the relevance and source quality level of each corresponding retrieval evidence item in the retrieval evidence set; use the sum of the products of the relevance and source quality level of each retrieval evidence item as the numerator and the sum of the relevances of the retrieval evidence items as the denominator, and determine the ratio of the numerator to the denominator as the source quality score; when there is no corresponding retrieval evidence item, or the sum of the relevances is zero, record the source quality score as zero;
Step 63: Extract from the task-aware credibility contract the evaluation dimension weights corresponding to the consistency score, evidence coverage score, event chain consistency score, uncertainty score, and source quality score; multiply each evaluation dimension weight by its corresponding score, with the uncertainty score first converted to one minus the uncertainty score, and sum the products to obtain the basic credibility score; determine the product of the basic credibility score and the node coverage rate as the conclusion-level credibility;
Step 64: Divide the importance level of each conclusion unit by the sum of the importance levels of all conclusion units to obtain the weight of each conclusion unit; sum the products of the weight of each conclusion unit and the corresponding conclusion-level credibility to obtain the result-level credibility; write the uncertainty score, source quality score, basic credibility score, conclusion-level credibility, and result-level credibility into the unified evaluation object set.

8. The AI agent output result credibility evaluation and provenance tracking system of claim 1, wherein constructing the hierarchical tracing tree, generating the tracing record package, and determining the key verification node list based on the conclusion-level credibility, result-level credibility, alignment relationships, inference chain snapshot node set, retrieval evidence set, business event set, and questionable fragment set comprises:
Step 71: Extract the conclusion-level credibility, result-level credibility, alignment relationships, inference chain snapshot node set, retrieval evidence set, business event set, questionable fragment set, and event chain consistency score from the unified evaluation object set, and establish a four-layer node boundary consisting of conclusion unit, inference chain snapshot node, evidence item, and business event record;
Step 72: Based on the alignment relationships, bind each conclusion unit to its corresponding inference chain snapshot nodes, and, based on the associated evidence and associated business events of the inference chain snapshot nodes, construct a hierarchical tracing tree running from conclusion unit to inference chain snapshot node, evidence item, and business event record;
Step 73: Based on the hierarchical tracing tree, extract the unified task identifier, task type label, result-level credibility, conclusion-level credibility list, timestamp, node fingerprint index, and counter-evidence event list to generate the tracing record package;
Step 74: Compare the conclusion-level credibility of each conclusion unit with the conclusion-level credibility threshold, and compare the event chain consistency score of each conclusion unit with the event chain consistency threshold; write each conclusion unit that meets either condition, together with its corresponding inference chain snapshot nodes, evidence items, business event records, and questionable fragment set, into the key verification node list; and write the hierarchical tracing tree, tracing record package, and key verification node list into the unified evaluation object set.

9. The AI agent output result credibility evaluation and provenance tracking system of claim 8, wherein generating the counter-evidence event list comprises:
Step 81: For each conclusion unit in the key verification node list, extract the ordered event constraints corresponding to the conclusion unit and the business event records corresponding to the conclusion unit;
Step 82: Compare the business event records item by item against the event object, event sequence, and event status requirements in the ordered event constraints, and select and identify as counter-evidence events the business event records that are missing corresponding events, violate the event sequence, or do not meet the event status requirements;
Step 83: Associate each counter-evidence event with its corresponding conclusion unit, corresponding inference chain snapshot node, and corresponding node fingerprint index to obtain the counter-evidence event list, and write it into the tracing record package.
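By way of non-limiting illustration, the fusion recited in Steps 61 to 64 of claim 7 can be sketched in Python. The dictionary keys, the equal weights of 0.2, the example scores, and the assumption that source quality levels lie between zero and one are hypothetical; in the system the weights would be taken from the task-aware credibility contract.

```python
from typing import Dict, List, Tuple


def uncertainty_score(avg_entropy: float, max_entropy: float) -> float:
    """Step 61: average information entropy / preset maximum information entropy."""
    if max_entropy <= 0:
        raise ValueError("maximum information entropy must be positive")
    return avg_entropy / max_entropy


def source_quality(items: List[Tuple[float, float]]) -> float:
    """Step 62: sum(relevance * quality level) / sum(relevance), or zero if undefined."""
    denominator = sum(relevance for relevance, _ in items)
    if denominator == 0:
        return 0.0
    return sum(relevance * quality for relevance, quality in items) / denominator


def basic_credibility(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Step 63: weighted sum of the scores, using (1 - uncertainty) for uncertainty."""
    adjusted = dict(scores)
    adjusted["uncertainty"] = 1.0 - scores["uncertainty"]
    return sum(weights[name] * adjusted[name] for name in weights)


def result_credibility(importance: List[float], conclusion_creds: List[float]) -> float:
    """Step 64: importance-weighted sum of the conclusion-level credibilities."""
    total = sum(importance)
    return sum((level / total) * cred for level, cred in zip(importance, conclusion_creds))


scores = {
    "consistency": 1.0,
    "evidence_coverage": 0.5,
    "event_chain": 1.0,
    "uncertainty": uncertainty_score(avg_entropy=1.0, max_entropy=2.0),
    "source_quality": source_quality([(0.5, 1.0), (0.5, 0.0)]),
}
weights = {name: 0.2 for name in scores}  # hypothetical equal dimension weights

basic = basic_credibility(scores, weights)
conclusion_level = basic * 0.5  # Step 63: multiply by a node coverage rate of 0.5
result_level = result_credibility([3, 1], [conclusion_level, 0.75])
```

Under these assumed inputs the basic credibility score is about 0.7, the conclusion-level credibility about 0.35, and the result-level credibility about 0.45, which shows how a low node coverage rate scales down an otherwise well-supported conclusion unit.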