Protocol-guided reinforcement learning based structured fact verification method and device

A structured fact-checking method based on protocol-guided reinforcement learning standardizes the reasoning trajectory of large language models, addressing the insufficient transparency and reasoning hallucinations of existing technologies and achieving highly accurate fact-checking.

CN122242518A · Pending · Publication Date: 2026-06-19 · INST OF AUTOMATION CHINESE ACAD OF SCI

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: INST OF AUTOMATION CHINESE ACAD OF SCI
Filing Date: 2026-03-23
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing fact-checking methods lack transparency, are prone to reasoning hallucinations and over-decomposition, and fail to balance atomic decomposition against semantic integrity.

Method used

A structured fact-checking method based on protocol-guided reinforcement learning is adopted. The reasoning trajectory is standardized through a structured atomic operation protocol, a counterfactual decomposition reward and a gating indicator are defined, and the model parameters are fine-tuned with the group relative policy optimization algorithm so that the model follows the standardized reasoning process.

Benefits of technology

The reasoning process of the policy model becomes transparent, reasoning hallucinations and over-decomposition are avoided, and the accuracy and fidelity of the fact-checking results improve.

Patent Text Reader

Abstract

This invention provides a structured fact-checking method and apparatus based on protocol-guided reinforcement learning. The method includes: acquiring multiple inference trajectories obtained by sampling the same input sample multiple times with a policy model; determining multiple rewards for each inference trajectory, the multiple rewards including at least an outcome reward, a format reward, and a counterfactual decomposition reward; determining a gating indicator based on the format reward of the current training step, the gating indicator controlling whether the counterfactual decomposition reward is included in the calculation of the total reward signal; if the gating indicator indicates that inclusion is allowed, calculating the total reward signal based on the multiple rewards; and fine-tuning the parameters of the policy model using a group relative policy optimization algorithm and the total reward signal of each inference trajectory to obtain a trained policy model. The statement to be checked and its corresponding reference document are input into the trained policy model to obtain the fact-checking result. The method addresses the lack of transparency, susceptibility to reasoning hallucinations, and over-decomposition of existing fact-checking methods.

Description

Technical Field

[0001] This invention relates to the fields of natural language processing and large language model technology, and in particular to a structured fact-checking method and apparatus based on protocol-guided reinforcement learning.

Background Technology

[0002] Large language models excel at knowledge-intensive tasks such as question answering and text summarization, but they are prone to hallucination, that is, generating content that deviates from the reference documents or contains factual errors. Fact-checking has therefore become a crucial safety mechanism for ensuring model reliability. Using large models such as GPT-4o as fact checkers is effective, but processing long contexts and calling the model repeatedly for multi-statement responses is extremely costly, which limits large-scale deployment. Existing dedicated small fact-checking models mainly follow two paradigms. The first is a black-box model that only performs label classification; it lacks transparency, cannot provide auditable evidence, and cannot pinpoint the cause of a local failure. The second is a student model obtained by reasoning distillation; it often learns only the surface form of complex reasoning without rigorous underlying logic, and easily produces fluent but fabricated reasoning (reasoning hallucination) that misleads users. Furthermore, both types of model lack effective guidance when decomposing statements, which creates a decomposition dilemma. Balancing atomic decomposition against semantic integrity while avoiding over-decomposition is a pressing problem for current technology.

Summary of the Invention

[0003] This invention provides a structured fact-checking method and apparatus based on protocol-guided reinforcement learning, which addresses the lack of transparency, susceptibility to reasoning hallucinations, and over-decomposition of existing fact-checking methods.

[0004] This invention provides a structured fact-checking method based on protocol-guided reinforcement learning, comprising:
inputting the statement to be verified and the corresponding reference document into the trained policy model to obtain the fact-checking result;
wherein the trained policy model is obtained through the following reinforcement learning steps:
obtaining multiple inference trajectories by sampling the same input sample multiple times with a policy model, wherein the policy model follows the atomic operation steps defined by a structured atomic operation protocol when generating inference trajectories;
determining multiple rewards for each inference trajectory, the multiple rewards including at least an outcome reward, a format reward, and a counterfactual decomposition reward, wherein the outcome reward evaluates the consistency between the structured adjudication result output by the inference trajectory and the true label, the format reward evaluates whether the inference trajectory conforms to the format requirements specified by the structured atomic operation protocol, and the counterfactual decomposition reward evaluates the effectiveness of the decomposition operation in the inference trajectory;
determining a gating indicator based on the format reward of the current training step, the gating indicator controlling whether the counterfactual decomposition reward is included in the calculation of the total reward signal;
if the gating indicator indicates that inclusion is allowed, calculating the total reward signal based on the multiple rewards, and if the gating indicator indicates that inclusion is prohibited, calculating the total reward signal based on the rewards other than the counterfactual decomposition reward;
fine-tuning the parameters of the policy model using a group relative policy optimization algorithm and the total reward signal of each inference trajectory to obtain the trained policy model, wherein the group relative policy optimization algorithm optimizes the policy model based on the relative performance of the multiple inference trajectories within the group.

[0005] In one embodiment, the method further includes constructing a composite reward system that provides the multiple rewards. The outcome reward is determined based on the consistency between the structured adjudication result output by the inference trajectory and the true label. The format reward is determined based on whether the inference trajectory conforms to the format requirements specified by the structured atomic operation protocol; these requirements include whether the inference trajectory correctly uses the allowed labels, whether it follows the topological order of decomposition, verification, and synthesis, and whether the conclusion is parsable. The counterfactual decomposition reward is determined based on the validity of the decomposition operation in the inference trajectory.

[0006] In one embodiment, the structured atomic operation protocol includes three phases. In the decomposition phase, the policy model decomposes the input claim into multiple independently verifiable sub-claims, yielding a set of sub-claims. In the verification phase, the policy model compares each sub-claim in the set with the given reference document and generates, for each sub-claim, a verification result that includes evidence citations and logical analysis. In the synthesis phase, the policy model aggregates the verification results of all sub-claims and outputs a structured adjudication result.
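The three phases can be pictured as a small record type. The following is a minimal sketch, not the patent's implementation: the class and field names are hypothetical, and the aggregation rule in `synthesize` (any contradicted sub-claim contradicts the whole claim) is an assumption, since the patent does not state how verification results are aggregated. The verdict strings "support" and "contradictory" follow the examples used later in the description.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SubClaimCheck:
    """Verification-phase result for one sub-claim (hypothetical structure)."""
    sub_claim: str   # independently verifiable sub-claim from the decomposition phase
    evidence: str    # citation taken from the reference document
    analysis: str    # short logical analysis linking evidence to the sub-claim
    verdict: str     # "support" or "contradictory"


@dataclass
class Trajectory:
    """One inference trajectory: decomposition, verification, synthesis."""
    claim: str
    checks: List[SubClaimCheck] = field(default_factory=list)

    def synthesize(self) -> str:
        # Assumed aggregation rule: one contradicted sub-claim is enough
        # to contradict the whole claim.
        if not self.checks:
            raise ValueError("decomposition produced no sub-claims")
        if any(c.verdict == "contradictory" for c in self.checks):
            return "contradictory"
        return "support"
```

Keeping evidence and analysis next to each sub-claim is what makes the final adjudication auditable: a reader can see which sub-claim caused a "contradictory" result.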

[0007] In one embodiment, the counterfactual decomposition reward is calculated as follows. The accuracy of the inference trajectory generated by following the structured atomic operation protocol is compared with the accuracy of a holistic-verification counterfactual baseline that assumes no decomposition. If the protocol-following trajectory corrects an error made by the baseline, a positive reward value is assigned; if it introduces a new error, a negative penalty value is assigned.
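A minimal sketch of this comparison is given below. The function name, the magnitudes `bonus` and `penalty`, and the neutral value of 0.0 when both verdicts agree are assumptions; the patent only specifies a positive value when decomposition corrects a baseline error and a negative value when it introduces a new error.

```python
def counterfactual_decomposition_reward(
    decomposed_correct: bool,
    baseline_correct: bool,
    bonus: float = 1.0,      # assumed magnitude of the positive reward
    penalty: float = -1.0,   # assumed magnitude of the negative penalty
) -> float:
    """Compare a protocol-following trajectory with the no-decomposition baseline."""
    if decomposed_correct and not baseline_correct:
        return bonus       # decomposition fixed an error made by the baseline
    if baseline_correct and not decomposed_correct:
        return penalty     # decomposition introduced a new error
    return 0.0             # assumed: no credit or blame when both agree
```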

[0008] In one embodiment, determining the gating indicator based on the format reward of the current training step includes: calculating the average format reward over all inference trajectories in the current training step; if the average format reward exceeds a preset threshold, setting the gating indicator to the allowed state; and if the average format reward is lower than the preset threshold, setting the gating indicator to the prohibited state.
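As a rough sketch, the gate reduces to one comparison per training step. The function name and the default threshold of 0.9 are assumptions. This paragraph leaves the case of exact equality open; the sketch treats it as allowed, following the "reaches or exceeds" wording used in the detailed description.

```python
from statistics import fmean
from typing import Sequence


def gate_is_open(format_rewards: Sequence[float], threshold: float = 0.9) -> bool:
    """Return True (allowed) when the step's mean format reward meets the threshold.

    `threshold` is an assumed value; the patent only calls it a preset threshold.
    """
    return fmean(format_rewards) >= threshold
```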

[0009] In one embodiment, fine-tuning the parameters of the policy model using a group relative policy optimization algorithm and the total reward signal of each inference trajectory to obtain a trained policy model includes: calculating the mean and standard deviation of the total reward signals of all inference trajectories within the group, to obtain the within-group average reward and the within-group standard deviation; subtracting the within-group average reward from the total reward signal of each inference trajectory and dividing the difference by the within-group standard deviation, to obtain the corresponding advantage function value; and updating the parameters of the policy model based on the advantage function values within each group and a KL divergence penalty term.
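The within-group normalization can be sketched in a few lines. The function name is hypothetical, and two details are assumptions not stated in the patent: the use of the population standard deviation, and the small constant `eps` that avoids division by zero when every trajectory in the group receives the same total reward.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_relative_advantages(total_rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each trajectory's total reward against its own group."""
    mean = fmean(total_rewards)     # within-group average reward
    std = pstdev(total_rewards)     # within-group standard deviation (population form, assumed)
    return [(r - mean) / (std + eps) for r in total_rewards]
```

Trajectories that score above the group average get a positive advantage and those below get a negative one, so no separate value network is needed to produce a baseline.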

[0010] The present invention also provides a structured fact-checking device based on protocol-guided reinforcement learning, comprising the following modules:
a trajectory generation module, used to obtain multiple inference trajectories produced by the policy model sampling the same input sample multiple times, wherein the policy model follows the atomic operation steps defined by the structured atomic operation protocol when generating inference trajectories;
a reward determination module, used to determine multiple rewards for each inference trajectory, the multiple rewards including at least an outcome reward, a format reward, and a counterfactual decomposition reward, wherein the outcome reward evaluates the consistency between the structured adjudication result output by the inference trajectory and the true label, the format reward evaluates whether the inference trajectory conforms to the format requirements specified by the structured atomic operation protocol, and the counterfactual decomposition reward evaluates the effectiveness of the decomposition operation in the inference trajectory;
a gating determination module, used to determine a gating indicator based on the format reward of the current training step, the gating indicator controlling whether the counterfactual decomposition reward is included in the calculation of the total reward signal;
a signal calculation module, used to calculate the total reward signal based on the multiple rewards if the gating indicator indicates that inclusion is allowed, and to calculate the total reward signal based on the rewards other than the counterfactual decomposition reward if the gating indicator indicates that inclusion is prohibited;
a model fine-tuning module, used to fine-tune the parameters of the policy model using the group relative policy optimization algorithm and the total reward signal of each inference trajectory to obtain the trained policy model, wherein the group relative policy optimization algorithm optimizes the policy model based on the relative performance of the multiple inference trajectories within the group;
a fact-checking module, used to input the statement to be checked and the corresponding reference document into the trained policy model to obtain the fact-checking result.

[0011] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the structured fact-checking method based on protocol-guided reinforcement learning as described in any of the preceding claims.

[0012] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the structured fact-checking method based on protocol-guided reinforcement learning as described in any of the preceding claims.

[0013] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the structured fact-checking method based on protocol-guided reinforcement learning as described in any of the preceding claims.

[0014] This invention provides a structured fact-checking method and apparatus based on protocol-guided reinforcement learning. By introducing a structured atomic operation protocol, it forces the policy model to generate inference trajectories according to the atomic operation steps during training, which makes the policy model's reasoning process fully visible. Each inference trajectory contains a complete reasoning chain, so users can trace the basis of the judgment from the input claim to the final fact-checking result, which overcomes the inability of traditional black-box models to provide auditable evidence. To address reasoning hallucination, this invention sets a counterfactual decomposition reward that evaluates the effectiveness of the decomposition operations in the inference trajectory, so that the policy model reasons from real evidence dependencies rather than merely imitating the form of reasoning, which suppresses fluent but fabricated reasoning. To address over-decomposition, this invention determines a gating indicator based on the format reward of the current training step, and the gating indicator controls whether the counterfactual decomposition reward is included in the calculation of the total reward signal. In the early stages of training, before the policy model has mastered the basic format specifications, the gating indicator keeps the counterfactual decomposition reward out of the calculation, so the policy model can focus on learning the correct protocol format. Once the format reward reaches the threshold, the gating indicator lets the counterfactual decomposition reward take part in optimization, guiding the policy model toward a balance between atomic decomposition and semantic integrity and avoiding the excessive fragmentation and semantic drift caused by unguided decomposition. Furthermore, this invention fine-tunes the parameters of the policy model with a group relative policy optimization algorithm. This algorithm computes the optimization direction from the relative performance of the multiple inference trajectories that correspond to the same input sample within a group, which enables the policy model to converge stably during reinforcement learning training and ultimately yields fact-checking results with higher accuracy and fidelity.

Attached Figure Description

[0015] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0016] Figure 1 is the first flowchart of the structured fact-checking method based on protocol-guided reinforcement learning provided by the present invention.

[0017] Figure 2 is the second flowchart of the structured fact-checking method based on protocol-guided reinforcement learning provided by the present invention.

[0018] Figure 3 is a schematic structural diagram of the structured fact-checking device based on protocol-guided reinforcement learning provided by the present invention.

[0019] Figure 4 is a schematic structural diagram of the electronic device provided by the present invention.

Detailed Implementation

[0020] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.

[0021] The structured fact-checking method and apparatus based on protocol-guided reinforcement learning provided by this invention are described below with reference to Figures 1 to 4.

[0022] This invention proposes a structured fact-checking method based on protocol-guided reinforcement learning. The method guides the policy model to learn rigorous reasoning logic autonomously through a novel counterfactual decomposition reward mechanism. Specifically, a structured atomic operation protocol is first established, which requires the policy model to follow the decomposition, verification, and synthesis steps when generating inference trajectories. On this basis, a composite reward system is constructed that includes an outcome reward, a format reward, and a counterfactual decomposition reward. The counterfactual decomposition reward dynamically compares the accuracy of inference trajectories generated by following the structured atomic operation protocol with the accuracy of a holistic-verification counterfactual baseline that assumes no decomposition. In addition, a curriculum-stabilized gating mechanism is introduced: a gating indicator is determined from the average format reward and adaptively controls whether the counterfactual decomposition reward is included in the calculation of the total reward signal. Finally, during reinforcement learning training, the parameters of the policy model are fine-tuned using a group relative policy optimization algorithm and the total reward signal of each inference trajectory, yielding a trained policy model that, in the inference phase, outputs fact-checking results by following the structured atomic operation protocol.

[0023] Figure 1 is the first flowchart of the structured fact-checking method based on protocol-guided reinforcement learning provided by this invention. As shown in Figure 1, the method includes the following. First, the policy model is trained through the following reinforcement learning steps (S110-S150): S110. Obtain multiple inference trajectories by having the policy model sample the same input sample multiple times, wherein the policy model follows the atomic operation steps defined by the structured atomic operation protocol when generating inference trajectories.

[0024] Input samples include input statements and reference documents. Input statements refer to the text content to be verified, typically containing one or more factual statements that need to be validated. These statements can be simple factual descriptions or complex multi-hop statements containing multiple logical relationships, such as "Mercosur is an organization with 9 member countries and 19 associated observers, headquartered in Montevideo." During the training phase, input statements serve as training samples, with corresponding real labels used to calculate rewards. During the inference phase, input statements serve as tasks to be processed, and the trained policy model outputs its fact-checking results. For ease of distinction, this invention refers to the text content to be verified during the training phase as input statements, and the text content to be verified during the inference phase as statements to be verified.

[0025] An inference trajectory is a record of the complete inference process generated by a policy model when processing input claims, following the atomic operation steps defined by the structured atomic operation protocol. This record includes the set of sub-claims generated during the decomposition phase, the evidence citations and logical analysis generated for each sub-claim during the verification phase, and the structured decision result output during the synthesis phase, reflecting the complete inference chain of the policy model from receiving the claims to outputting the final conclusion. During training, the policy model must follow the atomic operation steps defined by the structured atomic operation protocol when generating inference trajectories.

[0026] The structured atomic operation protocol refers to a set of mandatory reasoning rules defined in this invention, which are used to regulate the atomic operation steps that the strategy model must follow when generating reasoning trajectories, thereby ensuring the standardization and interpretability of the reasoning process and making each step of the model's operation clear and traceable.

[0027] Sampling the same input sample multiple times to obtain multiple inference trajectories refers to the process during reinforcement learning training where the policy model generates multiple different inference results for the same input claim, with each result constituting a complete inference trajectory. The specific process is as follows: First, the same input sample is fed into the policy model. When generating inference trajectories, the model does not always output deterministic results. Instead, it introduces randomness through sampling mechanisms, such as adjusting the probability distribution using temperature parameters during decoding, or randomly selecting the next token from the probability distribution output by the model, thereby generating diverse inference trajectories.

[0028] Secondly, for each sampling, the policy model follows a structured atomic operation protocol to generate a complete inference trajectory. Each trajectory includes a set of sub-claims generated in the decomposition phase, evidence citations and logical analysis generated for each sub-claim in the verification phase, and a structured ruling output in the synthesis phase. Due to the randomness of the sampling process, trajectories generated by different samplings may differ in the way sub-claims are divided, the selection of evidence, the expression of logical analysis, and even the final ruling. For example, for the same input sample "Mercosur is an organization with 9 member states and 19 associated observers, headquartered in Montevideo," one sampling may decompose it into two sub-claims, while another sampling may decompose it into three more detailed sub-claims; one sampling may cite a piece of evidence in the verification phase, while another sampling may cite another piece of related evidence; one sampling may output a "contradictory" ruling, while another sampling may output "support" due to a reasoning error.

[0029] Finally, after sampling the same input sample multiple times, a set of inference trajectories is obtained. These trajectories form a group for subsequent group-relative policy optimization. The multiple trajectories within a group collectively reflect the inference diversity of the policy model on that input statement. By comparing the relative performance of these trajectories within the group, the merits of each trajectory can be evaluated more accurately, and a stable optimization direction can be provided for model parameter updates.
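The randomness that makes the trajectories in a group differ comes from sampled decoding. The following is a minimal sketch of temperature sampling for a single next-token choice; the function name and the plain-list representation of logits are assumptions, and a real policy model would apply this per token through its own decoding loop.

```python
import math
import random
from typing import Sequence


def sample_token(logits: Sequence[float], temperature: float = 1.0, rng=random) -> int:
    """Draw one token index from softmax(logits / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

A higher temperature flattens the distribution and yields more varied trajectories within a group, while a lower temperature concentrates probability on the most likely token.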

[0030] S120, Determine multiple rewards for each reasoning trajectory.

[0031] The multiple rewards are multi-dimensional scoring signals used to evaluate the quality of the inference trajectories generated by the policy model during reinforcement learning training. Each inference trajectory corresponds to a set of reward values that reflect its quality from different perspectives. The multiple rewards include at least an outcome reward, a format reward, and a counterfactual decomposition reward. The outcome reward assesses whether the structured adjudication result output by the inference trajectory is consistent with the true label. The format reward assesses whether the inference trajectory conforms to the format requirements specified in the structured atomic operation protocol, including whether it is organized according to the three stages of decomposition, verification, and synthesis, whether it uses the labels allowed by the protocol, and whether the conclusion is parsable. The counterfactual decomposition reward assesses the effectiveness of the decomposition operations in the inference trajectory and is determined by comparing the accuracy of the trajectory with that of the holistic-verification counterfactual baseline that assumes no decomposition. Together, these rewards constitute a comprehensive evaluation of the inference trajectory and provide the basis for the subsequent calculation of the total reward signal and the update of the model parameters.

[0032] For each inference trajectory, an outcome reward is calculated. The outcome reward evaluates the consistency between the structured adjudication result output by the inference trajectory and the true label. Specifically, the final structured adjudication result output by the inference trajectory is compared with the pre-annotated true label; if they match, an outcome reward of 1 is assigned; if they do not match, an outcome reward of 0 is assigned. The purpose of the outcome reward is to guide the policy model to output accurate fact-checking conclusions, ensuring the correctness of the final adjudication result.

[0033] For each inference trajectory, a format reward is calculated. The format reward is used to evaluate whether the inference trajectory conforms to the format requirements specified in the Structured Atomic Operations Protocol. Specifically, it is determined by verifying whether the trajectory is organized according to the three stages of decomposition, verification, and synthesis, whether it uses the labels allowed by the protocol, and whether the conclusion is parsable. Trajectories that fully conform to the format requirements receive a higher format reward value, while trajectories with missing or incorrect format receive a lower format reward value.
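The three format checks can be sketched with regular expressions. Everything concrete here is an assumption for illustration: the tag names `decompose`, `verify` and `synthesize`, the XML-like syntax, and the scoring rule (fraction of checks passed). The patent does not disclose the actual label syntax and only states that conforming trajectories score higher than non-conforming ones.

```python
import re

ALLOWED_TAGS = ("decompose", "verify", "synthesize")  # hypothetical tag names, in stage order
VERDICTS = {"support", "contradictory"}               # labels used in the patent's examples


def format_reward(text: str) -> float:
    """Score a trajectory by the fraction of format checks it passes (assumed rule)."""
    tags = re.findall(r"<(\w+)>", text)  # opening tags only
    # Check 1: at least one tag is present and every tag is allowed.
    uses_allowed = bool(tags) and all(t in ALLOWED_TAGS for t in tags)
    # Check 2: all three stages appear, in decomposition -> verification -> synthesis order.
    order = [ALLOWED_TAGS.index(t) for t in tags] if uses_allowed else []
    in_order = uses_allowed and order == sorted(order) and set(tags) == set(ALLOWED_TAGS)
    # Check 3: the conclusion can be parsed into a known verdict.
    match = re.search(r"<synthesize>\s*(\w+)\s*</synthesize>", text)
    parsable = match is not None and match.group(1).lower() in VERDICTS
    return (uses_allowed + in_order + parsable) / 3
```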

[0034] For each inference trajectory, a counterfactual decomposition reward is calculated. This reward evaluates the effectiveness of the decomposition operation within the inference trajectory and is determined by comparing the trajectory's accuracy with the accuracy of the overall counterfactual baseline (assuming no decomposition). A positive counterfactual decomposition reward is awarded if the decomposition operation improves the trajectory's accuracy above the counterfactual baseline; a negative reward is awarded if the decomposition operation lowers the trajectory's accuracy below the counterfactual baseline.

[0035] S130. Determine a gating indicator based on the format reward of the current training step. The gating indicator controls whether the counterfactual decomposition reward is included in the calculation of the total reward signal. This invention introduces a curriculum-stabilized gating mechanism, which borrows the idea of curriculum learning so that the model learns in order of increasing difficulty: it first learns the basic format requirements of the structured protocol and, once it has mastered the format specifications, goes on to learn the more complex decomposition logic. The mechanism controls when the counterfactual decomposition reward takes effect through the gating indicator, ensuring that the decomposition logic is optimized only after the model has mastered the basic structured protocol, thus avoiding the training instability or divergence caused by introducing the complex decomposition reward too early in training.

[0036] The gating indicator is a state variable that controls whether the counterfactual decomposition reward is included in the calculation of the total reward signal. It is determined dynamically from the format reward of the current training step and has two states: allowed and prohibited. When the gating indicator is in the allowed state, the counterfactual decomposition reward participates in the calculation of the total reward signal, and the policy model is optimized for both format requirements and decomposition logic. When the gating indicator is in the prohibited state, the counterfactual decomposition reward does not participate in the calculation of the total reward signal, and the policy model is optimized only on the format reward and the other rewards. Dynamically adjusting the gating indicator ensures that the model begins to optimize the decomposition logic only after mastering the basic structured protocol, which gives curriculum-style stable control of training.

[0037] The gating indicator can be determined in several ways. One approach is to calculate the average format reward of all inference trajectories in the current training step; if this average reaches or exceeds a preset threshold, the gating indicator is set to the allowed state; otherwise, it is set to the prohibited state. Alternatively, the gating indicator can be set to the allowed state, and kept there, only after the average format reward has remained above the preset threshold for N consecutive training steps; as soon as the average format reward falls below the preset threshold, the gating indicator is reset to the prohibited state. The latter mechanism ensures that decomposition logic optimization is introduced only after the model has stably mastered the format requirements, avoiding premature activation caused by a single fluctuation.
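The N-consecutive-steps variant needs a little state between training steps. The following is a minimal sketch under stated assumptions: the class name, the `patience` parameter (standing in for N), and the default values are hypothetical.

```python
class FormatGate:
    """Opens after `patience` consecutive steps at or above the threshold."""

    def __init__(self, threshold: float = 0.9, patience: int = 3):
        self.threshold = threshold  # assumed preset threshold
        self.patience = patience    # assumed value of N
        self.streak = 0             # consecutive qualifying steps so far
        self.open = False           # False = prohibited, True = allowed

    def update(self, mean_format_reward: float) -> bool:
        if mean_format_reward >= self.threshold:
            self.streak += 1
            if self.streak >= self.patience:
                self.open = True
        else:
            # A single step below the threshold resets the gate.
            self.streak = 0
            self.open = False
        return self.open
```

Calling `update` once per training step with that step's average format reward returns whether the counterfactual decomposition reward should be included for that step.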

[0038] S140. If the gating indicator indicates that inclusion is allowed, a total reward signal is calculated based on the multiple rewards; if the gating indicator indicates that inclusion is prohibited, a total reward signal is calculated based on rewards other than counterfactual decomposition rewards.

[0039] When the gating indicator is in the allowed state, the counterfactual decomposition reward is combined with the other rewards to obtain the total reward signal. The combination can be a weighted sum, in which each type of reward is assigned a weight coefficient and multiplied by that weight before summation, so that the counterfactual decomposition reward enters the total with its own weight. Alternatively, a multiplicative combination can be used, in which the counterfactual decomposition reward is multiplied with the other rewards, or such products are further combined, to obtain the total reward signal.
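The weighted-sum option, together with the gating behaviour from step S140, can be sketched as follows. The function name and the weight values are assumptions; the patent does not disclose specific coefficients. The multiplicative option is not sketched because the text does not specify its exact form.

```python
def total_reward(
    outcome: float,
    fmt: float,
    counterfactual: float,
    gate_open: bool,
    w_outcome: float = 1.0,         # assumed weight
    w_format: float = 0.5,          # assumed weight
    w_counterfactual: float = 0.5,  # assumed weight
) -> float:
    """Weighted sum of rewards; the counterfactual term is added only when the gate is open."""
    total = w_outcome * outcome + w_format * fmt
    if gate_open:
        total += w_counterfactual * counterfactual
    return total
```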

[0040] S150. The policy model is fine-tuned using the group relative policy optimization algorithm and the total reward signal of each inference trajectory to obtain the trained policy model; wherein, the group relative policy optimization algorithm optimizes the policy model based on the relative performance of the multiple inference trajectories within the group.

[0041] Group Relative Policy Optimization (GRPO) is a reinforcement learning optimization method based on within-group comparisons. This algorithm generates multiple inference trajectories for the same input sample and calculates the optimization direction based on the relative performance of these trajectories within the group. Unlike traditional reinforcement learning algorithms, it does not require training independent value function networks, thus reducing computational overhead. Furthermore, by directly calculating the optimization signal through within-group comparisons, it avoids training fluctuations caused by inaccurate value function estimation, enabling the policy model to converge stably and improving training efficiency, ultimately resulting in a fully trained policy model.

[0042] After obtaining the trained policy model, fact-checking is performed based on the model: S160. Input the statement to be verified and the corresponding reference document into the trained policy model to obtain the fact-checking results; For example, given the statement to be verified, "Mercosur is an organization with 9 member states and 19 associate observers, headquartered in Montevideo," inputting it and the corresponding reference document into a trained policy model, the model outputs "contradictory" as the fact-checking result.

[0043] This invention provides a structured fact-checking method based on protocol-guided reinforcement learning. By introducing a structured atomic operation protocol, it forces the policy model to generate inference trajectories according to atomic operation steps during training, making the policy model's inference process fully visible. Each inference trajectory contains a complete inference chain, allowing users to trace the basis of judgment from input claims to the final fact-checking result, avoiding the deficiency of traditional black-box models that cannot provide auditable evidence. To address the problem of inference illusion, this invention sets a counterfactual decomposition reward to evaluate the effectiveness of decomposition operations in the inference trajectory, ensuring that the policy model infers based on genuine evidence dependencies rather than simply mimicking inference forms, effectively suppressing the illusion of fluent but fabricated inference. To address the problem of over-decomposition, this invention determines a gating metric based on the format reward of the current training step, controlling whether the counterfactual decomposition reward is included in the calculation of the total reward signal. In the early stages of training, before the policy model has mastered the basic format specifications, the gating metric prohibits the counterfactual decomposition reward from participating in the calculation, allowing the policy model to focus on learning the correct protocol format. Once the format reward reaches a threshold, the gating metric allows the counterfactual decomposition reward to participate in optimization, guiding the policy model to achieve a balance between atomic decomposition and semantic integrity, avoiding excessive fragmentation and semantic drift caused by unguided decomposition. Furthermore, this invention employs a group-relative policy optimization algorithm to fine-tune the parameters of the policy model. 
This algorithm calculates the optimization direction based on the relative performance of multiple inference trajectories corresponding to the same input sample within a group, enabling the policy model to converge stably during reinforcement learning training, ultimately resulting in fact-checking results with higher accuracy and fidelity.

[0044] In one embodiment, in this invention, atomic operation steps specifically refer to the three core stages constituting a complete reasoning process: the decomposition stage, the verification stage, and the synthesis stage. Each stage represents an independent and necessary operation unit, and the three stages are combined sequentially to form a complete reasoning trajectory, ensuring the standardization and interpretability of the reasoning process. The structured atomic operation protocol includes: in the decomposition stage, the policy model decomposes the input claim into multiple independently verifiable sub-claims, yielding a set of sub-claims, so as to isolate ambiguity and complexity; in the verification stage, the policy model compares each sub-claim in the set with the given reference document and generates, for each sub-claim, a verification result that includes evidence citations and logical analysis; in the synthesis stage, the policy model aggregates the verification results of all sub-claims and outputs a structured adjudication result.

[0045] Specifically, the policy model refers to the large language model to be trained, which can be various open-source or closed-source language models based on the Transformer architecture, such as the LLaMA series, Qwen series, and ChatGLM series. During reinforcement learning training, this model acts as an agent, receiving input statements and generating inference trajectories. In the inference phase, the trained policy model receives statements to be verified and outputs fact-checking results. The parameters of the policy model are fine-tuned using a group-relative policy optimization algorithm and the total reward signal.

[0046] During the decomposition phase, the policy model generates independently verifiable sub-claims based on the semantic content of the input claim. In one optional implementation, decomposition can be performed as follows: the policy model receives the input claim, identifies multiple factual statements contained within it through semantic understanding, and breaks down the input claim into several atomic sub-claims. These operations can all be performed by a large language model: the model performs semantic parsing of the input claim through self-attention mechanisms and contextual understanding capabilities, internally identifies entities in the claim through named entity recognition, identifies logical relationships in the claim through dependency parsing, and identifies core factual elements through semantic role labeling. Based on these internal representations, the model breaks down the input claim into multiple atomic sub-claims; this process is entirely completed during the model's forward propagation, without the need for external tools. For example, for the input claim "Mercosur is an organization with 9 member states and 19 associated observers, headquartered in Montevideo," the policy model decomposes it into two sub-claims: "Headquartered in Montevideo" and "The organization has 9 member states and 19 observers." After decomposition, a set of sub-claims is output.

[0047] During the verification phase, the policy model verifies each sub-claim against the reference document and outputs a verification conclusion that includes evidence citations and logical analysis. In an optional implementation, verification can be performed as follows: the policy model compares each sub-claim with the given reference document. These operations can all be performed by a large language model: the model first locates the evidence text related to the sub-claim through an internal retrieval mechanism or attention mechanism, semantically aligns the sub-claim with the relevant paragraphs of the reference document, and determines consistency by calculating semantic similarity or through logical reasoning. This process uses the model's pre-trained knowledge and contextual understanding to complete evidence citation and logical analysis in a single forward propagation, and outputs a verification conclusion for each sub-claim: supported, contradictory, or cannot be attributed.

[0048] During the synthesis phase, the policy model summarizes the verification conclusions of all sub-claims. These operations can all be performed by the large language model: the model collects the verification conclusions of all sub-claims, performs a comprehensive judgment through internal aggregation logic, and merges the sub-conclusions into a final structured decision result according to preset aggregation rules. In an optional implementation, aggregation can be performed according to the following rules: if all sub-claims are verified as supported, the final output is "supported"; if at least one sub-claim is verified as contradictory, the final output is "contradictory"; if some sub-claims cannot be attributed and none is contradictory, the output is "cannot be attributed". The final conclusion output by the synthesis phase is called the structured decision result.
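The optional aggregation rules above can be written down directly. The following is a minimal Python sketch; the label strings and the function name `aggregate` are illustrative assumptions, and in the method itself the aggregation is carried out by the language model rather than by external code.

```python
SUPPORTED = "supported"
CONTRADICTORY = "contradictory"
NOT_ATTRIBUTABLE = "cannot be attributed"
LABELS = {SUPPORTED, CONTRADICTORY, NOT_ATTRIBUTABLE}


def aggregate(sub_verdicts: list[str]) -> str:
    """Merge per-sub-claim verdicts into one final verdict."""
    if not sub_verdicts:
        raise ValueError("at least one sub-claim verdict is required")
    unknown = set(sub_verdicts) - LABELS
    if unknown:
        raise ValueError(f"unknown verdict labels: {sorted(unknown)}")
    # A single contradiction decides the outcome.
    if CONTRADICTORY in sub_verdicts:
        return CONTRADICTORY
    # No contradiction, but some sub-claim lacks evidence.
    if NOT_ATTRIBUTABLE in sub_verdicts:
        return NOT_ATTRIBUTABLE
    # Every sub-claim is supported.
    return SUPPORTED
```

For the Mercosur example, `aggregate(["supported", "contradictory"])` yields `"contradictory"`, because the contradiction rule takes precedence.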

[0049] It should be noted that the structured adjudication results and the fact-checking results during the inference phase refer to the same thing: the final verification conclusion output by the model, which includes three possibilities: support, contradiction, or no attribution. The difference lies in their usage scenarios. When describing the final conclusion output by the model during training, it is called the structured adjudication result; when describing the model's output on the input claims during the inference phase, it is called the fact-checking result. Both are essentially the same, representing the model's final judgment on the truth or falsity of the claims being verified.

[0050] This invention decomposes the fact-checking process into three stages: decomposition, verification, and synthesis, through a structured atomic operation protocol. The decomposition stage breaks down complex claims into independently verifiable sub-claims, solving the problem of single verification struggling to handle multi-hop logic. The verification stage requires the policy model to provide evidence citations and logical analysis, ensuring the interpretability of the reasoning process. The synthesis stage aggregates the results of each sub-claim and outputs the final decision, guaranteeing the completeness of the conclusion. This protocol makes the reasoning process of the policy model completely transparent, allowing users to trace the basis of each judgment.

[0051] In one embodiment, the method further includes: Construct a composite reward system, which is used to provide the multiple rewards.

[0052] Specifically, the outcome reward is determined based on the consistency between the structured decision result output by the inference trajectory and the true label. The specific implementation is as follows: the final structured decision result output by the inference trajectory is compared with the pre-labeled true label; if they match, an outcome reward value of 1 is assigned; otherwise, an outcome reward value of 0 is assigned.

[0053] The format reward is determined based on whether the inference trajectory conforms to the format requirements specified in the Structured Atomic Operations Protocol. These format requirements include whether the inference trajectory correctly uses allowed tags, whether it conforms to the topological ordering of decomposition, verification, and synthesis, and whether the conclusion is parsable. Specifically, the implementation is as follows: First, verify whether the inference trajectory correctly uses the protocol-allowed tags, such as whether the "decomposition" tag is used in the decomposition stage, the "verification" tag in the verification stage, and the "synthesis" tag in the synthesis stage. Second, verify whether the inference trajectory conforms to the topological ordering of decomposition, verification, and synthesis, ensuring that the three stages appear in sequence without any omissions. Finally, verify whether the conclusion of the inference trajectory is parsable, i.e., whether a clear structured decision result can be extracted from the trajectory. If the inference trajectory simultaneously meets all the above format requirements, a format reward value of 1 is assigned; if any format requirement is not met, a format reward value of 0 is assigned.
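The three format checks can be illustrated with a small rule-based checker. This is a sketch under stated assumptions: the text names the tags "decomposition", "verification", and "synthesis" but not their syntax, so the angle-bracket form `<decomposition>...</decomposition>`, the verdict strings, and the requirement that the synthesis block contain only the verdict are all hypothetical choices made for illustration.

```python
import re

STAGES = ("decomposition", "verification", "synthesis")
VERDICTS = ("supported", "contradictory", "cannot be attributed")  # assumed labels
TAG_RE = re.compile(r"</?[A-Za-z_]+>")  # assumed tag syntax


def format_reward(trajectory: str) -> int:
    """Return 1 if the trajectory meets all format requirements, else 0."""
    # Checks 1 and 2: only protocol tags appear, each stage exactly once,
    # in the order decomposition -> verification -> synthesis.
    expected = [tag for s in STAGES for tag in (f"<{s}>", f"</{s}>")]
    if TAG_RE.findall(trajectory) != expected:
        return 0
    # Check 3: the conclusion must be parsable into a single verdict.
    body = re.search(r"<synthesis>(.*?)</synthesis>", trajectory, flags=re.DOTALL)
    verdict = body.group(1).strip().lower()
    return 1 if verdict in VERDICTS else 0
```

Comparing the full tag sequence against the expected one covers the tag check and the ordering check at once, so a missing, repeated, or foreign tag all give a reward of 0.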

[0054] The counterfactual decomposition reward is determined based on the effectiveness of the decomposition operation in the inference trajectory. Specifically, it is implemented as follows: first, an overall-verification counterfactual baseline trajectory, which assumes that no decomposition is performed, is generated. This baseline trajectory is the trajectory the policy model would produce for the same input sample if the claim were not decomposed. It is obtained by retaining the input claim from the original inference trajectory and skipping the decomposition phase, so that the policy model goes on to generate only the verification and synthesis phases. The baseline trajectory compares the input claim as a whole with the reference document and directly outputs a structured decision result for the whole claim, which is then compared with the result of the inference trajectory containing the decomposition operation to evaluate the effectiveness of the decomposition operation.

[0055] The formula for calculating the total reward signal is:

$$R = R_{\text{out}} + \lambda_{f}\, R_{\text{fmt}} + g \cdot \lambda_{c}\, R_{\text{cf}}$$

where $R$ is the total reward signal; $R_{\text{out}}$ is the outcome reward; $R_{\text{fmt}}$ is the format reward; $R_{\text{cf}}$ is the counterfactual decomposition reward; $\lambda_{f}$ is the balance coefficient for format constraints, which adjusts the weight of the format reward in the total reward; $\lambda_{c}$ is the balance coefficient for counterfactual shaping, which adjusts the weight of the counterfactual decomposition reward in the total reward; and $g \in \{0, 1\}$ is the gating indicator used to gate the counterfactual decomposition reward (1 in the allowed state, 0 in the prohibited state).

[0056] When the gating indicator is in the prohibited state, the counterfactual decomposition reward is not included in the calculation. In this case, the formula for calculating the total reward signal is:

$$R = R_{\text{out}} + \lambda_{f}\, R_{\text{fmt}}$$

where $R_{\text{out}}$ is the outcome reward, $R_{\text{fmt}}$ is the format reward, and $\lambda_{f}$ is the balance coefficient for format constraints.

This invention provides multi-dimensional optimization signals for model training by constructing a composite reward system that includes outcome rewards, format rewards, and counterfactual decomposition rewards. Outcome rewards ensure the accuracy of the model's output, format rewards force the model to follow the structured protocol, and counterfactual decomposition rewards guide the model to learn effective decomposition strategies. The three rewards work together, enabling the model to keep improving its decomposition logic and the accuracy of the final result while maintaining format conformity.
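The gated weighted-sum combination can be sketched as follows. This is an illustrative Python sketch of the weighted-summation variant only; the function name `total_reward`, the parameter names, and the default coefficient values of 0.5 are assumptions, since the text does not state concrete weights.

```python
def total_reward(
    outcome: float,
    fmt: float,
    counterfactual: float,
    gate_open: bool,
    lambda_fmt: float = 0.5,  # assumed weight of the format reward
    lambda_cf: float = 0.5,   # assumed weight of the counterfactual reward
) -> float:
    """Combine the rewards of one inference trajectory into a single signal."""
    reward = outcome + lambda_fmt * fmt
    if gate_open:
        # Allowed state: the counterfactual decomposition reward participates.
        reward += lambda_cf * counterfactual
    return reward
```

For a trajectory with all three rewards equal to 1, this gives 2.0 when the gate is open and 1.5 when it is closed, so the only difference between the two states is the counterfactual term.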

[0057] In one embodiment, to prevent the policy model from performing invalid or destructive over-decomposition, the present invention generates an overall-verification counterfactual baseline that assumes no decomposition. The counterfactual decomposition reward is calculated as follows: the correctness of the inference trajectory generated by following the structured atomic operation protocol is compared with the correctness of the overall-verification counterfactual baseline. If the inference trajectory generated by following the protocol corrects an error made by the baseline, a positive reward value is assigned; if it introduces a new error, a negative penalty value is assigned.

[0058] Specifically, the correctness of the overall-verification counterfactual baseline, which assumes no decomposition, is compared with the correctness of the current inference trajectory containing the decomposition operation. If the decomposition trajectory is correct and the baseline trajectory is incorrect, the decomposition operation corrected an error that overall verification would make, and a positive reward value is assigned; if the decomposition trajectory is incorrect and the baseline trajectory is correct, the decomposition operation introduced a new error, and a negative penalty value is assigned; if both are correct or both are incorrect, the decomposition operation did not change the correctness of the verification result, and a zero reward is assigned.
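The three-way comparison can be sketched as a small function. This is a hedged Python illustration: the function and parameter names are hypothetical, and the magnitudes `bonus=1.0` and `penalty=-1.0` are assumptions (the text only says "positive" and "negative" here).

```python
def counterfactual_decomposition_reward(
    decomposed_verdict: str,
    baseline_verdict: str,
    true_label: str,
    bonus: float = 1.0,     # assumed positive reward value
    penalty: float = -1.0,  # assumed negative penalty value
) -> float:
    """Score the decomposition by comparing it with the no-decomposition baseline."""
    decomposed_ok = decomposed_verdict == true_label
    baseline_ok = baseline_verdict == true_label
    if decomposed_ok and not baseline_ok:
        return bonus    # decomposition fixed an error of overall verification
    if baseline_ok and not decomposed_ok:
        return penalty  # decomposition introduced a new error
    return 0.0          # correctness unchanged (both right or both wrong)
```

A design consequence of the zero case is that decomposition is neither rewarded nor punished when it makes no difference to correctness, so the signal concentrates on the samples where decomposition actually changes the outcome.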

[0059] This invention uses the counterfactual decomposition reward to evaluate the effectiveness of decomposition by comparison with the overall-verification baseline. A positive reward is given when decomposition corrects an error of overall verification, and a penalty is imposed when decomposition introduces a new error. This dynamic comparison mechanism addresses the excessive fragmentation and semantic drift caused by unguided decomposition, enabling the model to learn to balance atomic decomposition against semantic integrity and thus avoid the decomposition dilemma.

[0060] In one embodiment, determining the gating indicator based on the format reward of the current training step includes: calculating the average format reward of all inference trajectories in the current training step; if the average format reward exceeds a preset threshold, setting the gating indicator to the allowed state; and if the average format reward is lower than the preset threshold, setting the gating indicator to the prohibited state.

[0061] Specifically, in the current training step, the policy model samples multiple input claims. Each input claim is sampled multiple times to generate multiple inference trajectories, thus the current training step contains multiple inference trajectories from different input claims. For each inference trajectory, its format reward value is determined according to the calculation method of the format reward. The evaluation of the format reward includes: verifying whether the inference trajectory correctly uses the protocol-allowed labels, verifying whether the inference trajectory conforms to the module topology ordering of decomposition, verification, and synthesis, and verifying whether the conclusion of the inference trajectory is parsable. If the inference trajectory meets all of the above format requirements, it is assigned a format reward value of 1; if any format requirement is not met, it is assigned a format reward value of 0. The format reward values of all inference trajectories in the current training step are collected, and the arithmetic mean of these reward values is calculated to obtain the average format reward. The average format reward is output as the format reward of the current training step.

[0062] The average format reward is compared with a preset threshold. If the average format reward exceeds the preset threshold, the gating indicator is set to the allowed state; if it is lower than the preset threshold, the gating indicator is set to the prohibited state. This mechanism ensures that the decomposition logic is optimized only after the model has mastered the basic structural protocol, thus guaranteeing training stability.

[0063] The gating indicator controls when the counterfactual decomposition reward takes effect, based on the average format reward of the current training step. In the early stages of training, before the policy model has mastered the basic format specifications, the gating indicator is in the prohibited state and the policy model focuses on learning the format requirements. Once the format reward reaches the threshold, the gating indicator switches to the allowed state and the counterfactual decomposition reward begins to participate in optimization. This curriculum-style training mechanism ensures training stability and prevents the divergence that would result from introducing the complex decomposition reward before the policy model has mastered the basic protocol.

[0064] In one embodiment, the step of fine-tuning the parameters of the policy model using a group-relative policy optimization algorithm and the total reward signal of each inference trajectory to obtain a trained policy model includes: calculating the mean and standard deviation of the total reward signal over all inference trajectories within the group to obtain the within-group average reward and within-group standard deviation; subtracting the within-group average reward from the total reward signal of each inference trajectory and normalizing by dividing by the within-group standard deviation to obtain the corresponding advantage function value; and updating the parameters of the policy model based on the advantage function values within each group and a KL divergence penalty term.

[0065] Specifically, the mean and standard deviation of the total reward signal over all inference trajectories within the group are calculated to obtain the within-group average reward and within-group standard deviation. Then, the within-group average reward is subtracted from the total reward signal of each inference trajectory, and the difference is normalized by dividing by the within-group standard deviation to obtain the corresponding advantage function value. Finally, the parameters of the policy model are updated based on the advantage function values and the KL divergence penalty term, completing the reinforcement learning training of the policy model.

[0066] Specifically, the KL divergence penalty term originates from the distributional difference between the reference policy model and the current policy model. During reinforcement learning training, a reference policy model is maintained; this model is typically the policy model before parameter updates or a periodically saved historical version. For each input statement, the input statement is simultaneously input into both the current policy model and the reference policy model. The probabilities of the current policy model generating the inference trajectory and the reference policy model generating the inference trajectory are calculated separately. The logarithm of the ratio of these two probabilities is taken as the KL divergence value of the trajectory, which is the KL divergence penalty term. The larger the KL divergence value, the further the current policy model deviates from the reference policy model.

[0067] The advantage function is:

$$A_i = \frac{R_i - \mu_R}{\sigma_R + \epsilon}$$

where $A_i$ is the advantage function value of the $i$-th inference trajectory in the group; $R_i$ is the total reward signal of the $i$-th inference trajectory; $\mu_R$ is the mean of the total reward signal over all inference trajectories within the group; $\sigma_R$ is the standard deviation of the total reward signal over all inference trajectories within the group; and $\epsilon$ is a preset small constant used to prevent division by zero. The advantage function reflects the performance of each inference trajectory relative to the group average. Standard deviation normalization eliminates the influence of differences in reward scale between groups. A positive advantage function value indicates that the trajectory performs better than the group average; a negative advantage function value indicates that it performs worse than the group average.
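The within-group normalization can be illustrated with the Python standard library. This is a minimal sketch: the function name `group_advantages`, the value `eps=1e-6`, and the use of the population standard deviation (`statistics.pstdev`) rather than the sample standard deviation are assumptions, as the text does not specify them.

```python
import statistics


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize the total rewards of one group into advantage values."""
    mean = statistics.fmean(rewards)   # within-group average reward
    std = statistics.pstdev(rewards)   # within-group standard deviation (assumed population form)
    # eps keeps the division defined when every trajectory has the same reward.
    return [(r - mean) / (std + eps) for r in rewards]
```

For a group with total rewards `[2.0, 1.0, 1.0, 0.0]`, the first trajectory receives a positive advantage, the last a negative one, and the two average trajectories receive zero. When all rewards in a group are equal, every advantage is zero and the group contributes no preference between its trajectories.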

[0068] The specific process of parameter update is as follows: for each inference trajectory, a loss term is constructed. The loss term consists of two parts: the first part is the negative of the advantage function value, because the goal of reinforcement learning is to increase the advantage function value, which requires a negative sign in the loss; the second part is the KL divergence penalty term multiplied by a balance coefficient, which constrains the magnitude of model updates. It is expressed as:

$$\mathcal{L}_i = -A_i + \beta \cdot D_{\mathrm{KL}}^{(i)}$$

where $\beta$ is the preset balance coefficient, $A_i$ is the advantage function value of the $i$-th inference trajectory, and $D_{\mathrm{KL}}^{(i)}$ is the KL divergence penalty term of that trajectory. The total loss for the group is obtained by summing or averaging the loss terms of all inference trajectories within the group. This total loss is then minimized using gradient descent to update the policy model's parameters. After the update, the current policy model can be selected as the new reference policy model for KL divergence calculation in subsequent training steps.
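The two-part loss term can be sketched numerically as follows. This is an illustrative Python sketch on plain floats, with no automatic differentiation: the function names, the default `beta=0.04`, and the inputs `logp_current` and `logp_reference` (the log-probabilities that the current and reference policy models assign to the trajectory) are assumptions. The KL term is taken as the logarithm of the probability ratio, as described for the KL divergence penalty.

```python
def trajectory_loss(
    advantage: float,
    logp_current: float,
    logp_reference: float,
    beta: float = 0.04,  # assumed balance coefficient
) -> float:
    """Loss term of one trajectory: negative advantage plus weighted KL penalty."""
    # log(p_current / p_reference) computed as a difference of log-probabilities
    kl_term = logp_current - logp_reference
    return -advantage + beta * kl_term


def group_loss(
    advantages: list[float],
    logps_current: list[float],
    logps_reference: list[float],
    beta: float = 0.04,
) -> float:
    """Average the per-trajectory loss terms over the group."""
    losses = [
        trajectory_loss(a, c, r, beta)
        for a, c, r in zip(advantages, logps_current, logps_reference)
    ]
    return sum(losses) / len(losses)
```

This only shows how the two parts are combined. In a trainable implementation the log-probabilities would be differentiable tensors, and the advantage would typically weight a policy-probability term so that gradients reach the model parameters.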

[0069] This invention employs a group-based relative policy optimization algorithm to fine-tune the parameters of a policy model. This algorithm generates multiple inference trajectories from the same input sample and calculates the optimization direction based on the relative performance of these trajectories within the group. By replacing the training of independent value function networks in traditional reinforcement learning algorithms with intra-group comparisons, it avoids training fluctuations caused by inaccurate value function estimation, enabling the policy model to converge stably while reducing computational overhead and improving training efficiency. Specifically, it calculates the mean and standard deviation of the total reward signal for all inference trajectories within the group. The total reward signal of each inference trajectory is compared with the average reward within the group and normalized by dividing by the intra-group standard deviation to obtain the corresponding advantage function value. This value is then combined with a KL divergence penalty term to update the parameters of the policy model. This advantage function, normalized by standard deviation, eliminates the influence of differences in reward scale between different groups, allowing the model to optimize based on relative performance. This results in a stable update direction under finite sampling, significantly improving the accuracy of fact checking and the stability of the inference process.

[0070] The invention will now be illustrated with a complex multi-hop example. Assume a given statement: "Mercosur is an organization with 9 member states and 19 associate observers, headquartered in Montevideo." The given reference documents show that its headquarters are indeed in Montevideo, but the "9 member states and 19 associate observers" actually belong to another organization, the CPLP; Mercosur only has 4 full members.

[0071] 1. Failure path of existing methods: If models such as MiniCheck are used, they are easily affected by the bias of surface word overlap and directly output "support" (false positive error); if the ClearCheck model is used, it extracts evidence of headquarters location and performs local reasoning, but ignores the contradiction in the number of members, and also draws wrong conclusions.

[0072] 2. Protocol decomposition and verification in this invention: The policy model follows the structured atomic operation protocol and decomposes the claim into multiple sub-claims (i.e., sub-claims 1 to n in Figure 2), obtaining the set of sub-claims: "Headquarters located in Montevideo" and "The organization has 9 member states and 19 observers". During the verification phase, the policy model searches for supporting citations for both: the former receives a verification result of "supported", while for the latter the document explicitly states that these figures belong to the CPLP, so its verification result is "contradictory".

[0073] 3. Final synthesis and policy closed loop: The policy model aggregates the verification results of all the above sub-claims (i.e., verification results 1 to n in Figure 2); because one sub-claim is contradicted, the final output is "contradictory" (the correct conclusion). During the training phase, the overall-verification counterfactual baseline trajectory, which is generated by inputting the claim into the policy model as a whole and skipping the decomposition phase, easily overlooks local errors, much like existing small-scale verification models (i.e., black-box models relying solely on label classification, or student models based on inference distillation). The decomposition path of this method, however, successfully identifies the contradiction, so the counterfactual decomposition reward assigns a strong positive signal of +1. Through continuous iteration of group relative policy optimization, the policy model internalizes this rigorous ability to break down and analyze complex claims, rather than merely imitating a format.

[0074] To verify the effectiveness of the proposed protocol-guided reinforcement learning-based structured fact-checking method, extensive experiments were conducted on the CLEARFACTS comprehensive benchmark, which includes 14 datasets from different domains and tasks. The evaluation metric used was the macro-average F1 score.

[0075] First, the proposed protocol-guided reinforcement learning-based structured fact-checking method is comprehensively compared with existing general-purpose large language models (such as GPT-4o, Llama-3.1-405B) and existing dedicated fact-checking baseline models (such as ClearCheck, MiniCheck, and FactCG). ClaimVerify in Table 1 is an evaluation dataset included in the CLEARFACTS comprehensive benchmark, used to test the model's accuracy in judging the truthfulness of claims. HoVer, short for Hop Verification, is a dataset included in the CLEARFACTS comprehensive benchmark, specifically used to evaluate the model's fact-checking capabilities in multi-hop reasoning scenarios.

[0076] The experimental results are shown in Table 1, which presents the performance comparison data of different verification models on the CLEARFACTS benchmark. The results show that although this invention uses only an 8B parameter-scale underlying model, it achieves the best average performance (86.6 points) compared to dedicated verification models such as ClearCheck, MiniCheck, and FactCG on complex tasks including Retrieval Augmentation (RAG Truth), multi-hop inference (CoverBench, HoVer), and scientific claim verification (SciFact). This method not only significantly outperforms all existing dedicated verifiers but also surpasses, under the same conditions, state-of-the-art large models such as GPT-4o and Llama-3.1-405B, which have far more parameters. This demonstrates that this invention greatly improves the accuracy of the model through a rigorous structured protocol and reinforcement learning-internalized inference logic.

[0077] Table 1

[0078] To verify the necessity of the individual components of this invention (in particular reinforcement learning, the counterfactual decomposition reward, the format reward, and the curriculum stabilization mechanism), an ablation experiment was conducted.

[0079] The ablation experiment results are shown in Appendix Table 2. The results confirm that: 1) Removing reinforcement learning (using only prompts to make the model mimic the protocol) leads to a significant drop in performance (down to 83.3), proving that RL is the key to making the model truly internalize the "decomposition-verification-synthesis" logic.

[0080] 2) Removing the counterfactual decomposition reward causes severe performance degradation on datasets such as CoverBench that require long-range complex logic, which shows that this reward, acting as a "dynamic logic regulator", prevents the semantic drift and error accumulation caused by over-decomposition.

[0081] 3) The format reward and the curriculum stabilization mechanism are equally indispensable, as they ensure the structural robustness of the inference trajectory and the convergence of the training process.

[0082] Table 2

[0083] Existing fine-tuning and distillation methods make small models ineffective in complex multi-hop logic. The proposed protocol-guided reinforcement learning-based structured fact-checking method achieves an exceptionally high average Macro-F1 score of 86.6 on the CLEARFACTS benchmark by internalizing structured atomic operation protocols. As a model with only 8B parameters, this invention significantly outperforms dedicated fact-checking baselines such as ClearCheck and MiniCheck, and even surpasses, under similar conditions, models much larger than itself, such as GPT-4o (83.3 points) and Llama-3.1-405B. Experimental data demonstrate that the counterfactual reward mechanism of this invention effectively suppresses meaningless speculative outputs, significantly reduces invalid token consumption during inference, and greatly improves the local localization accuracy and inference fidelity of the fact-checking results.

[0084] The structured fact-checking device based on protocol-guided reinforcement learning provided by the present invention is described below. The device described below and the method described above correspond to each other and may be read with reference to one another.

[0085] As shown in Figure 3, the structured fact-checking device based on protocol-guided reinforcement learning provided by this invention includes the following modules:
The trajectory generation module 210 is used to acquire multiple inference trajectories obtained by the policy model sampling the same input sample multiple times; wherein the policy model follows the atomic operation steps defined by the structured atomic operation protocol when generating each inference trajectory.
The reward determination module 220 is used to determine multiple rewards for each inference trajectory; the multiple rewards include at least an outcome reward, a format reward, and a counterfactual decomposition reward. The outcome reward is used to evaluate the consistency between the structured adjudication result output by the inference trajectory and the true label. The format reward is used to evaluate whether the inference trajectory conforms to the format requirements specified by the structured atomic operation protocol. The counterfactual decomposition reward is used to evaluate the effectiveness of the decomposition operation in the inference trajectory.
The gating determination module 230 is used to determine a gating index based on the format reward of the current training step. The gating index is used to control whether the counterfactual decomposition reward is included in the calculation of the total reward signal.
The signal calculation module 240 is used to calculate the total reward signal based on the multiple rewards if the gating index indicates that inclusion is allowed, and to calculate the total reward signal based on the rewards other than the counterfactual decomposition reward if the gating index indicates that inclusion is prohibited.
The model fine-tuning module 250 is used to fine-tune the parameters of the policy model using the group relative policy optimization algorithm and the total reward signal of each inference trajectory to obtain the trained policy model; wherein the group relative policy optimization algorithm optimizes the policy model based on the relative performance of the multiple inference trajectories within the group.
The fact-checking module 260 is used to input the statement to be checked and the corresponding reference document into the trained policy model to obtain the fact-checking result.
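The behaviour of the signal calculation module can be illustrated with a minimal sketch. The text states only which rewards enter the total reward signal in each gate state; the weighted-sum form, the weight parameters, and the names `Rewards` and `total_reward` are assumptions made here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rewards:
    """Per-trajectory rewards (hypothetical container, not named in the source)."""
    outcome: float         # consistency of the adjudication result with the true label
    fmt: float             # conformance to the protocol's format requirements
    counterfactual: float  # effectiveness of the decomposition operation


def total_reward(r: Rewards, gate_open: bool,
                 w_outcome: float = 1.0, w_fmt: float = 1.0,
                 w_cf: float = 1.0) -> float:
    """Combine the rewards into one signal.

    Assumption: a weighted sum with unit default weights. The source only
    says that the counterfactual term is included when the gate allows it
    and excluded otherwise.
    """
    total = w_outcome * r.outcome + w_fmt * r.fmt
    if gate_open:
        # Gate allows inclusion: add the counterfactual decomposition reward.
        total += w_cf * r.counterfactual
    return total


if __name__ == "__main__":
    r = Rewards(outcome=1.0, fmt=1.0, counterfactual=-1.0)
    print(total_reward(r, gate_open=True))
    print(total_reward(r, gate_open=False))
```

With the gate closed, a trajectory is neither rewarded nor penalized for its decomposition, so early training is driven only by the outcome and format rewards.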

[0086] Figure 4 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Figure 4, the electronic device may include a processor 310, a communications interface 320, a memory 330, and a communication bus 340. The processor 310, the communications interface 320, and the memory 330 communicate with each other via the communication bus 340. The processor 310 can invoke logical instructions in the memory 330 to execute the structured fact-checking method based on protocol-guided reinforcement learning.

[0087] Furthermore, the logical instructions in the aforementioned memory 330 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, essentially, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0088] In another aspect, the present invention also provides a computer program product, which includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer is able to execute the structured fact-checking method based on protocol-guided reinforcement learning provided by the above methods.

[0089] In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, performs the structured fact-checking method based on protocol-guided reinforcement learning provided by the above methods.

[0090] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0091] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0092] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A structured fact-checking method based on protocol-guided reinforcement learning, characterized in that it comprises:
inputting the statement to be verified and the corresponding reference document into a trained policy model to obtain a fact-checking result;
wherein the trained policy model is obtained through the following reinforcement learning steps:
acquiring multiple inference trajectories obtained by sampling the same input sample multiple times using a policy model; wherein the policy model follows the atomic operation steps defined by a structured atomic operation protocol when generating each inference trajectory;
determining multiple rewards for each inference trajectory; the multiple rewards include at least an outcome reward, a format reward, and a counterfactual decomposition reward; the outcome reward is used to evaluate the consistency between the structured adjudication result output by the inference trajectory and the true label; the format reward is used to evaluate whether the inference trajectory conforms to the format requirements specified by the structured atomic operation protocol; the counterfactual decomposition reward is used to evaluate the effectiveness of the decomposition operation in the inference trajectory;
determining a gating index based on the format reward of the current training step, the gating index being used to control whether the counterfactual decomposition reward is included in the calculation of the total reward signal;
if the gating index indicates that inclusion is allowed, calculating the total reward signal based on the multiple rewards; if the gating index indicates that inclusion is prohibited, calculating the total reward signal based on the rewards other than the counterfactual decomposition reward;
fine-tuning the parameters of the policy model using a group relative policy optimization algorithm and the total reward signal of each inference trajectory to obtain the trained policy model; wherein the group relative policy optimization algorithm optimizes the policy model based on the relative performance of the multiple inference trajectories within the group.

2. The structured fact-checking method based on protocol-guided reinforcement learning according to claim 1, characterized in that the method further includes: constructing a composite reward system, which is used to provide the multiple rewards; wherein the outcome reward is determined based on the consistency between the structured adjudication result output by the inference trajectory and the true label; the format reward is determined based on whether the inference trajectory conforms to the format requirements specified by the structured atomic operation protocol, the format requirements including whether the inference trajectory correctly uses the allowed tags, whether it conforms to the topological order of decomposition, verification, and synthesis, and whether the conclusion is parsable; and the counterfactual decomposition reward is determined based on the effectiveness of the decomposition operation in the inference trajectory.
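The three format requirements named in this claim can be sketched as a simple checker. Everything concrete below is an assumption for illustration: the tag names (`decompose`, `verify`, `synthesize`), the verdict labels, the binary 0/1 scoring, and the simplification that each phase appears as exactly one block.

```python
import re

# Hypothetical tag names and verdict labels; the source does not list them.
PHASES = ("decompose", "verify", "synthesize")
VERDICTS = {"SUPPORTED", "NOT_SUPPORTED"}
TAG_RE = re.compile(r"<(/?)([a-z_]+)>")


def format_reward(trajectory: str) -> float:
    """Return 1.0 if all three format checks pass, else 0.0.

    Binary scoring is an assumption; the source does not give the scale.
    """
    tags = TAG_RE.findall(trajectory)  # list of (slash, name) pairs

    # Check 1: only allowed tags are used.
    if any(name not in PHASES for _, name in tags):
        return 0.0

    # Check 2: blocks follow the order decompose, verify, synthesize,
    # each opened and closed exactly once (a simplification).
    expected = [(slash, name) for name in PHASES for slash in ("", "/")]
    if tags != expected:
        return 0.0

    # Check 3: the conclusion in the synthesis block is parsable.
    match = re.search(r"<synthesize>(.*?)</synthesize>", trajectory, re.DOTALL)
    verdict = match.group(1).strip() if match else ""
    return 1.0 if verdict in VERDICTS else 0.0


if __name__ == "__main__":
    good = ("<decompose>s1; s2</decompose>"
            "<verify>s1: cited; s2: cited</verify>"
            "<synthesize>SUPPORTED</synthesize>")
    print(format_reward(good))
```

A rule-based check of this kind needs no learned reward model, which is one reason format compliance is a convenient signal for the gating step.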

3. The structured fact-checking method based on protocol-guided reinforcement learning according to claim 1, characterized in that the structured atomic operation protocol includes: in the decomposition phase, the policy model decomposes the input statement into multiple independently verifiable sub-statements, resulting in a set of sub-statements; in the verification phase, the policy model compares each sub-statement in the set with the given reference document and generates, for each sub-statement, a verification result that includes evidence citations and logical analysis; in the synthesis phase, the policy model aggregates the verification results of all sub-statements and outputs a structured adjudication result.

4. The structured fact-checking method based on protocol-guided reinforcement learning according to claim 1, characterized in that the counterfactual decomposition reward is calculated as follows: the correctness of the inference trajectory generated by following the structured atomic operation protocol is compared with the correctness of a counterfactual baseline that verifies the statement as a whole, assuming no decomposition; if the inference trajectory generated by following the structured atomic operation protocol corrects an error made by the whole-statement counterfactual baseline, a positive reward value is assigned; if the inference trajectory generated by following the structured atomic operation protocol introduces a new error, a negative penalty value is assigned.
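The comparison described in this claim can be sketched as a small function. The claim specifies only the sign of the reward in two cases; the magnitudes (`bonus`, `penalty`), the zero value when both predictions agree in correctness, and the function name are assumptions.

```python
def counterfactual_decomposition_reward(protocol_pred: str,
                                        baseline_pred: str,
                                        gold: str,
                                        bonus: float = 1.0,
                                        penalty: float = -1.0) -> float:
    """Reward decomposition only when it changes correctness.

    protocol_pred: verdict of the trajectory that follows the protocol.
    baseline_pred: verdict of the whole-statement check without decomposition.
    gold:          true label.
    The bonus and penalty magnitudes are hypothetical.
    """
    protocol_ok = protocol_pred == gold
    baseline_ok = baseline_pred == gold

    if protocol_ok and not baseline_ok:
        # Decomposition corrected an error made by the baseline.
        return bonus
    if baseline_ok and not protocol_ok:
        # Decomposition introduced a new error.
        return penalty
    # Both right or both wrong: assumed neutral (not specified in the source).
    return 0.0


if __name__ == "__main__":
    print(counterfactual_decomposition_reward("SUPPORTED", "NOT_SUPPORTED", "SUPPORTED"))
    print(counterfactual_decomposition_reward("NOT_SUPPORTED", "SUPPORTED", "SUPPORTED"))
```

Because the reward is zero whenever decomposition does not change the outcome, the model gains nothing from splitting a statement that the whole-statement check already handles correctly.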

5. The structured fact-checking method based on protocol-guided reinforcement learning according to claim 1, characterized in that determining the gating index based on the format reward of the current training step includes: calculating the average format reward of all inference trajectories in the current training step; if the average format reward exceeds a preset threshold, setting the gating index to an allowed state; if the average format reward is lower than the preset threshold, setting the gating index to a prohibited state.
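A minimal sketch of this gating rule follows. The threshold value of 0.9 and the function name are hypothetical, and the claim does not say what happens when the average equals the threshold; the sketch treats that case as prohibited.

```python
from statistics import fmean
from typing import Sequence


def gate_is_open(format_rewards: Sequence[float], threshold: float = 0.9) -> bool:
    """Return True when the counterfactual reward may enter the total signal.

    format_rewards: format reward of every trajectory in the current step.
    threshold:      preset threshold (0.9 is a hypothetical default).
    Assumption: an average exactly equal to the threshold keeps the gate closed.
    """
    if not format_rewards:
        raise ValueError("at least one trajectory is required")
    return fmean(format_rewards) > threshold


if __name__ == "__main__":
    print(gate_is_open([1.0, 1.0, 1.0, 0.0], threshold=0.5))
    print(gate_is_open([1.0, 0.0, 0.0, 0.0], threshold=0.5))
```

The gate is evaluated per training step over the whole batch of trajectories, so a single malformed trajectory does not by itself switch the counterfactual term off.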

6. The structured fact-checking method based on protocol-guided reinforcement learning according to claim 1, characterized in that the step of fine-tuning the parameters of the policy model using the group relative policy optimization algorithm and the total reward signal of each inference trajectory to obtain the trained policy model includes: calculating the mean and standard deviation of the total reward signals of all inference trajectories within the group to obtain the intra-group average reward and the intra-group standard deviation; comparing the total reward signal of each inference trajectory with the intra-group average reward, and normalizing the difference by dividing by the intra-group standard deviation to obtain the corresponding advantage function value; and updating the parameters of the policy model based on the advantage function values within each group and a KL divergence penalty term.
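The advantage computation in this claim can be written out directly. The sketch below covers only the normalization step; the policy update with the KL divergence penalty is not shown. The use of the population standard deviation and the small `eps` added to avoid division by zero are assumptions, as is the function name.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_relative_advantages(total_rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """Normalize each trajectory's total reward against its own group.

    advantage_i = (r_i - mean(r)) / (std(r) + eps)

    Assumptions: population standard deviation, and eps to keep the
    division defined when every trajectory in the group scores the same.
    """
    mean = fmean(total_rewards)
    std = pstdev(total_rewards)
    return [(r - mean) / (std + eps) for r in total_rewards]


if __name__ == "__main__":
    # Four trajectories sampled for the same input sample.
    print(group_relative_advantages([2.0, 0.0, 2.0, 0.0]))
```

Because advantages are measured relative to the group, a trajectory is reinforced only when it scores better than its siblings for the same input, and a group with identical rewards contributes no gradient signal.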

7. A structured fact-checking device based on protocol-guided reinforcement learning, characterized in that it comprises:
a trajectory generation module, used to acquire multiple inference trajectories obtained by the policy model sampling the same input sample multiple times; wherein the policy model follows the atomic operation steps defined by the structured atomic operation protocol when generating each inference trajectory;
a reward determination module, used to determine multiple rewards for each inference trajectory; the multiple rewards include at least an outcome reward, a format reward, and a counterfactual decomposition reward; the outcome reward is used to evaluate the consistency between the structured adjudication result output by the inference trajectory and the true label; the format reward is used to evaluate whether the inference trajectory conforms to the format requirements specified by the structured atomic operation protocol; the counterfactual decomposition reward is used to evaluate the effectiveness of the decomposition operation in the inference trajectory;
a gating determination module, used to determine a gating index based on the format reward of the current training step, the gating index being used to control whether the counterfactual decomposition reward is included in the calculation of the total reward signal;
a signal calculation module, used to calculate the total reward signal based on the multiple rewards if the gating index indicates that inclusion is allowed, and to calculate the total reward signal based on the rewards other than the counterfactual decomposition reward if the gating index indicates that inclusion is prohibited;
a model fine-tuning module, used to fine-tune the parameters of the policy model using the group relative policy optimization algorithm and the total reward signal of each inference trajectory to obtain the trained policy model; wherein the group relative policy optimization algorithm optimizes the policy model based on the relative performance of the multiple inference trajectories within the group;
a fact-checking module, used to input the statement to be checked and the corresponding reference document into the trained policy model to obtain the fact-checking result.

8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the structured fact-checking method based on protocol-guided reinforcement learning as described in any one of claims 1 to 6.

9. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the structured fact-checking method based on protocol-guided reinforcement learning as described in any one of claims 1 to 6.

10. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the structured fact-checking method based on protocol-guided reinforcement learning as described in any one of claims 1 to 6.