Adaptive multi-modal fact checking method and device with explainability

By adopting an adaptive multimodal fact-checking method, dynamically adjusting evidence retrieval and utilizing a multi-agent debate system, the problems of resource redundancy and insufficient interpretation in existing technologies are solved, achieving efficient, accurate, and transparent multimodal fact-checking.

CN122196253A · Pending · Publication Date: 2026-06-12 · BEIJING UNIV OF POSTS & TELECOMM

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: BEIJING UNIV OF POSTS & TELECOMM
Filing Date: 2026-03-12
Publication Date: 2026-06-12

AI Technical Summary

⚠Technical Problem

Existing multimodal fact-checking methods cannot adaptively adjust strategies during the evidence retrieval stage, resulting in redundant computational resources or missing evidence. Furthermore, they lack explicit modeling of cross-modal relationships and have insufficient interpretability of reasoning, making it difficult to achieve efficient, accurate, and transparent verification in complex information environments.

⚗Method used

An adaptive multimodal fact-checking method is adopted. By dynamically adjusting the evidence retrieval strategy, a heterogeneous graph is constructed and a multi-agent debate system is used to conduct multi-round interactive debates, generating interpretable natural language reports and explicitly modeling the high-order semantic associations between statements and evidence.

🎯Benefits of technology

It enables on-demand allocation of computing resources, captures fine-grained multi-hop inference paths, generates traceable natural language explanations, improves the accuracy and transparency of verification, and reduces the false positive rate.

✦ Generated by Eureka AI based on patent content.

Abstract

The application provides an adaptive multi-modal fact verification method and device with explainability, and relates to the technical field of electric digital data processing. The method comprises the following steps: dynamically adjusting an evidence retrieval strategy according to a statement text, retrieving a multi-modal evidence set containing text and visual evidence, constructing a heterogeneous graph based on the statement text and the evidence set, and updating node representation in a graph structure; inputting the updated heterogeneous graph into a multi-agent debate system containing counter-factual reasoning agents, generating a truth determination result and a natural language explanation report traceable to evidence nodes through multi-round interactive debate; wherein the counter-factual reasoning agents generate counter-factual reasoning points by making hypothetical modifications to the evidence. The application can solve the problems of rigid retrieval strategy, lack of cross-modal relationship modeling and uninterpretable reasoning process in existing text fact verification technology, and can realize the coordinated improvement of retrieval efficiency, reasoning accuracy and explanation transparency.

Description

Technical Field

[0001] This application relates to the field of electronic digital data processing technology, and in particular to an interpretable adaptive multimodal fact-checking method and apparatus.

Background Technology

[0002] With the booming development of the digital information ecosystem, the scale of content carried by online platforms is growing exponentially, and the joint expression of multimodal information (text, images, videos, etc.) has become the mainstream form of user-generated content. While this trend improves the efficiency of information dissemination, it also significantly exacerbates the risk of the spread of misinformation. To mitigate the adverse effects of misinformation on society and other levels, automated fact-checking systems have received widespread attention in recent years. Current fact-checking technologies typically rely on two core steps: retrieving candidate evidence related to the claim from external knowledge resources, and using the retrieved evidence to infer the veracity of the claim. With the popularization of multimodal content, multimodal fact-checking has emerged, which improves the accuracy of verification results by utilizing the complementary relationship between textual and visual evidence.

[0003] In existing technologies, multimodal fact-checking methods have made some progress in evidence retrieval strategies and reasoning interpretability. For example, Tool-MAD (Tool-assisted Multi-Agent Debate) achieves dynamic retrieval by incrementally querying evidence information through multiple rounds of debate; DEFAME (Dynamic Evidence Fetching with Adaptive Multi-agent Exploration) has task-driven retrieval depth planning capabilities, which can adjust the retrieval depth according to task requirements. Regarding interpretability, methods such as Tool-MAD and MAD-Sherlock (Multi-Agent Debate Sherlock) can output reasoning paths and generate interpretable detection results through multi-agent collaboration, thus improving the model's transparency to some extent.

[0004] However, despite the breakthroughs in dynamic retrieval and interpretable output achieved by the aforementioned methods, systemic shortcomings remain: First, in the evidence retrieval stage, while existing methods (such as Tool-MAD and DEFAME) possess some dynamic retrieval capabilities, they cannot adaptively adjust retrieval strategies based on the complexity of the claims. This results in simple claims still requiring a complete retrieval process, leading to redundant computational resources; while complex claims may miss key evidence due to insufficient retrieval depth, affecting the accuracy of the judgment. Second, in terms of cross-modal relationship modeling, existing methods mainly rely on simple feature splicing or fusion strategies, lacking explicit modeling of high-order semantic relationships between claims and evidence, and between pieces of evidence. This makes it difficult to capture complex multi-hop reasoning paths, limiting the model's ability to express fine-grained evidence interactions. Finally, regarding reasoning interpretability, existing interpretable methods (such as Tool-MAD and MAD-Sherlock) are mostly posterior interpretations, i.e., supplementing the reasoning path after generating the judgment result, lacking a unified modeling and controllable interpretation mechanism for the retrieval and reasoning stages. The explanations provided cannot be traced back to specific points of evidence, and it is difficult to actively test the robustness of the arguments during the debate, resulting in insufficient reliability when faced with conflicting evidence. These shortcomings collectively make it difficult for existing multimodal fact-checking methods to achieve efficient, accurate, and transparent end-to-end verification in real-world, complex information environments.

Summary of the Invention

[0005] In view of this, embodiments of this application provide an adaptive multimodal fact-checking method and apparatus with interpretability to eliminate or improve one or more defects existing in the prior art.

[0006] One aspect of this application provides an interpretable adaptive multimodal fact-checking method, comprising:
dynamically adjusting an evidence retrieval strategy based on a statement text, and retrieving a multimodal evidence set of the statement text based on the evidence retrieval strategy, wherein the modalities of the evidence in the multimodal evidence set include text and visual;
constructing a heterogeneous graph based on the statement text and the multimodal evidence set, and updating the node representations of the heterogeneous graph by graph structure learning, wherein the types of nodes in the heterogeneous graph include a statement node corresponding to the statement text and an evidence node corresponding to each piece of evidence, and the types of edges in the heterogeneous graph include first-type edges connecting the statement node and an evidence node, and second-type edges connecting different evidence nodes; and
inputting the heterogeneous graph with updated node representations into a multi-agent debate system containing a counterfactual reasoning agent, and generating, through multiple rounds of interactive debate among the agents, a veracity determination result for the statement text and a corresponding natural language interpretation report, wherein the natural language interpretation report is traceable to at least one of the evidence nodes in the heterogeneous graph, and the counterfactual reasoning agent actively generates counterfactual reasoning points during the multi-round interactive debate by hypothetically modifying the evidence corresponding to an evidence node.
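The graph structure described above can be illustrated with a minimal Python sketch. The class names, field names, and the `build_graph` helper are hypothetical and do not come from the application. In particular, connecting every pair of evidence nodes is an assumption made for brevity, since the text does not state how second-type edges are selected.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    kind: str      # "statement" or "evidence"
    modality: str  # "text" or "visual"
    content: str


@dataclass
class HeteroGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (src, dst, edge_type) triples

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, src, dst):
        # The edge type follows from the kinds of the two endpoints.
        kinds = {self.nodes[src].kind, self.nodes[dst].kind}
        if kinds == {"statement", "evidence"}:
            edge_type = "statement-evidence"  # first-type edge
        elif kinds == {"evidence"}:
            edge_type = "evidence-evidence"   # second-type edge
        else:
            raise ValueError("unsupported edge between two statement nodes")
        self.edges.append((src, dst, edge_type))
        return edge_type


def build_graph(statement, evidence):
    """evidence is a list of (modality, content) pairs."""
    graph = HeteroGraph()
    graph.add_node(Node("s0", "statement", "text", statement))
    for i, (modality, content) in enumerate(evidence):
        graph.add_node(Node(f"e{i}", "evidence", modality, content))
        graph.add_edge("s0", f"e{i}")
    # Assumption: every pair of evidence nodes is linked.
    evidence_ids = [n for n in graph.nodes if n != "s0"]
    for i, a in enumerate(evidence_ids):
        for b in evidence_ids[i + 1:]:
            graph.add_edge(a, b)
    return graph
```

A real system would presumably keep only evidence pairs that are semantically related, and would attach feature vectors to the nodes before graph structure learning.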

[0007] In some embodiments of this application, the multi-agent debate system further includes: an evidence-specific agent, a standard reasoning agent, an aggregation module, and a judging agent that participate in multi-round agent-to-agent interactive debate together with the counterfactual reasoning agent; Each of the evidence-specific intelligent agents corresponds to one of the pieces of evidence in the multimodal evidence set and is used to generate an evaluation result on the correlation between the evidence and the statement text; The aggregation module is used to summarize the evaluation results generated by each of the evidence-specific intelligent agents, and generate global evidence summary data and corresponding explanatory data based on the evaluation results; The standard reasoning agent is used to construct standard arguments that support or refute the statement text based on the global evidence summary data; The judging agent is used to evaluate whether each round of debate in the agent's interactive debate process has converged based on the global evidence summary data and the interpretation data, and triggers the generation of the authenticity judgment result of the statement text and the corresponding final natural language interpretation report when convergence is achieved.

[0008] In some embodiments of this application, the multi-agent debate system generates the veracity determination result and the corresponding natural language interpretation report for the statement text through the following steps:
First-round debate step: each evidence-specific agent generates an initial evaluation result based on the statement text and its uniquely corresponding evidence; the aggregation module summarizes all initial evaluation results and generates the global evidence summary data and corresponding explanation data for the first debate round.
Convergence judgment step: based on the global evidence summary data and corresponding explanation data of the current debate round, the judging agent evaluates whether the current debate round has converged or the preset maximum number of rounds has been reached. If so, the explanation data of the current debate round is output as the natural language interpretation report of the statement text, together with the corresponding veracity determination result of the statement text. If not, the subsequent debate step is executed to obtain the global evidence summary data and corresponding explanation data of the next debate round.
Subsequent debate step: the global evidence summary data from the previous debate round is input into the standard reasoning agent and the counterfactual reasoning agent, respectively. The standard reasoning agent generates the standard argument, and the counterfactual reasoning agent generates the counterfactual reasoning argument. The standard argument and the counterfactual reasoning argument are fused through an update function to obtain the updated interpretation state for the current debate round. The updated interpretation state serves as the input for the standard reasoning agent and the counterfactual reasoning agent in the next debate round, enabling them to generate the standard argument and counterfactual reasoning argument for the next round. Each evidence-specific agent re-evaluates its uniquely corresponding evidence based on the global evidence summary data from the previous debate round, the updated counterfactual reasoning argument for the current debate round, and the updated standard argument, thereby obtaining the evaluation result for the current debate round. The aggregation module summarizes all evaluation results of the current debate round and combines them with the updated standard argument and the updated counterfactual reasoning argument of the current debate round to generate the global evidence summary data and corresponding explanation data for the current debate round. The global evidence summary data and corresponding explanation data of the current debate round are then input into the judging agent to execute the convergence judgment step.
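The control flow of these debate steps can be sketched as follows. This is only an illustration under stated assumptions: `run_debate`, `toy_agents`, and the dictionary keys are hypothetical names, and the toy agents are deterministic stand-ins (each piece of evidence is reduced to an identifier and a fixed stance score) rather than the language-model agents the application describes.

```python
from statistics import mean


def run_debate(statement, evidence, agents, max_rounds=3):
    """Run the first-round, convergence, and subsequent debate steps."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    # First-round debate step: one evaluation per piece of evidence.
    evaluations = [agents["evidence"](statement, ev, None, None, None)
                   for ev in evidence]
    summary, explanation = agents["aggregate"](evaluations, None, None)
    state = None  # interpretation state, empty before the first update
    for round_no in range(1, max_rounds + 1):
        # Convergence judgment step.
        converged, verdict = agents["judge"](summary, explanation)
        if converged or round_no == max_rounds:
            return {"verdict": verdict, "report": explanation,
                    "rounds": round_no}
        # Subsequent debate step.
        standard = agents["standard"](summary, state)
        counter = agents["counterfactual"](summary, state)
        state = agents["update"](standard, counter)
        evaluations = [agents["evidence"](statement, ev, summary,
                                          standard, counter)
                       for ev in evidence]
        summary, explanation = agents["aggregate"](evaluations,
                                                   standard, counter)


def toy_agents(threshold=0.5):
    """Hypothetical deterministic agents; evidence is (id, stance)."""
    def evidence_agent(statement, ev, summary, standard, counter):
        ev_id, stance = ev
        return {"id": ev_id, "score": stance}

    def aggregate(evaluations, standard, counter):
        summary = mean(e["score"] for e in evaluations)
        ids = ", ".join(e["id"] for e in evaluations)
        # The explanation names evidence ids so it stays traceable.
        return summary, f"mean stance {summary:.2f} from evidence {ids}"

    def judge(summary, explanation):
        verdict = "supported" if summary > 0 else "refuted"
        return abs(summary) >= threshold, verdict

    return {
        "evidence": evidence_agent,
        "aggregate": aggregate,
        "judge": judge,
        "standard": lambda summary, state: f"standard argument at {summary:.2f}",
        "counterfactual": lambda summary, state: f"counterfactual argument at {summary:.2f}",
        "update": lambda standard, counter: (standard, counter),
    }
```

With strongly one-sided evidence the toy judge converges in the first round; with mixed evidence the loop runs until the maximum number of rounds and then returns the latest explanation, mirroring the two exit conditions of the convergence judgment step.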

[0009] In some embodiments of this application, the hypothetical modification includes at least one of: deleting factual information from the evidence, replacing entities in the evidence, and modifying visual elements in evidence of the visual modality.
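The three kinds of hypothetical modification can be illustrated with a small sketch. The evidence representation here (a dictionary with `modality`, `facts`, and `visual_elements` keys) and the function names are assumptions made for illustration only; the application does not specify how evidence is stored or how an image would actually be altered.

```python
import copy


def delete_fact(evidence, index):
    """Return a copy of the evidence with one factual statement removed."""
    modified = copy.deepcopy(evidence)
    del modified["facts"][index]
    return modified


def replace_entity(evidence, old, new):
    """Return a copy of the evidence with an entity name swapped."""
    modified = copy.deepcopy(evidence)
    modified["facts"] = [fact.replace(old, new) for fact in modified["facts"]]
    return modified


def modify_visual_element(evidence, element, new_value):
    """Return a copy of visual evidence with one described element changed."""
    if evidence["modality"] != "visual":
        raise ValueError("visual modifications apply only to visual evidence")
    modified = copy.deepcopy(evidence)
    modified["visual_elements"][element] = new_value
    return modified
```

Each function leaves the original evidence untouched, so a counterfactual reasoning agent could compare the conclusion drawn from the original evidence with the conclusion drawn from the modified copy.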

[0010] In some embodiments of this application, updating the node representation of the heterogeneous graph using graph structure learning includes:
determining, based on the predefined meta-paths in the heterogeneous graph, the meta-path neighbor set corresponding to each node;
sampling from the meta-path neighbor sets based on preset pruning rules to obtain the sampled meta-path neighbor sets, wherein the pruning rules include retaining no more than a preset number of edges for each node;
pruning the heterogeneous graph based on the sampled meta-path neighbor sets to obtain a subgraph; and
inputting the subgraph into a graph neural network, which aggregates the features of the neighboring nodes of each node in the subgraph, thereby updating the node representation of each node in the subgraph.
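A simplified sketch of this pipeline is given below, using only the Python standard library. All function names are hypothetical. Two simplifications are assumptions rather than details from the application: neighbors are pruned by uniform random sampling, and a plain mean over a node and its sampled neighbors stands in for the graph neural network layer.

```python
import random


def metapath_neighbors(adj, node_type, start, metapath):
    """Nodes reached from `start` by following the node types in `metapath`."""
    if node_type[start] != metapath[0]:
        raise ValueError("start node does not match the meta-path head type")
    frontier = {start}
    for wanted_type in metapath[1:]:
        frontier = {nxt for cur in frontier for nxt in adj.get(cur, ())
                    if node_type[nxt] == wanted_type}
    frontier.discard(start)  # a node is not its own meta-path neighbor
    return frontier


def sample_neighbors(neighbors, max_edges, rng):
    """Pruning rule: keep at most `max_edges` neighbors per node."""
    ordered = sorted(neighbors)
    if len(ordered) <= max_edges:
        return ordered
    return sorted(rng.sample(ordered, max_edges))


def aggregate(features, sampled):
    """Mean-aggregate each node with its sampled neighbors (GNN stand-in)."""
    updated = {}
    for node, neighbors in sampled.items():
        group = [features[node]] + [features[n] for n in neighbors]
        dim = len(features[node])
        updated[node] = [sum(vec[d] for vec in group) / len(group)
                         for d in range(dim)]
    return updated
```

For example, the meta-path (evidence, statement, evidence) links two pieces of evidence that are both attached to the same statement, even when no direct evidence-evidence edge exists between them.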

[0011] In some embodiments of this application, the step of dynamically adjusting the evidence retrieval strategy based on the statement text and retrieving a multimodal evidence set of the statement text based on the evidence retrieval strategy includes:
inputting the statement text into a pre-trained complexity classifier, which outputs a retrieval complexity label for the statement text, wherein the types of the retrieval complexity label include simple, medium, and complex, which represent the retrieval complexity of the statement text in ascending order;
if the retrieval complexity label is medium, executing a single-step retrieval strategy to obtain the multimodal evidence set of the statement text through a single evidence retrieval; and
if the retrieval complexity label is complex, executing a multi-step iterative retrieval strategy to obtain the multimodal evidence set of the statement text.
The multi-step iterative retrieval strategy includes performing a preset number of iterations of a retrieval step on the statement text. The retrieval step includes: based on the evidence set of the statement text in the current iteration and the intermediate conclusions accumulated in the previous iteration, the information gap that still needs to be verified is analyzed, and the retrieval requirement data for the current iteration is generated according to the information gap. The retriever performs evidence retrieval based on the retrieval requirement data to obtain new evidence for the current iteration. The new evidence for the current iteration is added to the evidence set to obtain the target evidence set for the current iteration.
It is then determined whether a target evidence set sufficient to verify the truth or falsehood of the statement has been formed: if yes, the target evidence set for the current iteration is used as the multimodal evidence set of the statement text; if no, the target evidence set for the current iteration and the updated understanding state are used as the input for the next iteration, and the iterative retrieval continues.
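The dispatch on the complexity label and the iterative loop can be sketched as below. The function name and the four callables (`classify`, `retrieve`, `find_gap`, `is_sufficient`) are hypothetical placeholders for the classifier, retriever, gap analysis, and sufficiency check. Returning an empty evidence set for a simple label is an assumption, inferred from the no-retrieval strategy mentioned elsewhere in the application; the accumulated intermediate conclusions and understanding state are omitted for brevity.

```python
def retrieve_adaptively(statement, classify, retrieve, find_gap,
                        is_sufficient, max_iterations=3):
    """Pick a retrieval strategy from the predicted complexity label."""
    label = classify(statement)
    if label == "simple":
        # Assumption: simple statements need no external retrieval.
        return []
    if label == "medium":
        # Single-step strategy: one retrieval call on the statement itself.
        return list(retrieve(statement))
    if label != "complex":
        raise ValueError(f"unknown complexity label: {label}")
    # Multi-step iterative strategy.
    evidence = []
    for _ in range(max_iterations):
        # Turn the remaining information gap into a retrieval query.
        query = find_gap(statement, evidence)
        new_items = [e for e in retrieve(query) if e not in evidence]
        evidence.extend(new_items)
        if is_sufficient(statement, evidence):
            break
    return evidence
```

The loop stops as soon as the sufficiency check passes, so a complex statement pays only for the iterations it actually needs, up to the preset maximum.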

[0012] In some embodiments of this application, before dynamically adjusting the evidence retrieval strategy according to the statement text and retrieving the multimodal evidence set of the statement text based on the evidence retrieval strategy, the method further includes:
obtaining a training set containing training statement texts;
for each training statement text in the training set, processing it with each type of retrieval strategy and recording whether the evidence retrieved by each type of retrieval strategy meets the preset answer accuracy requirement for that training statement text, wherein the types of retrieval strategies include, in order of increasing retrieval complexity, a no-retrieval strategy, the single-step retrieval strategy, and the multi-step iterative retrieval strategy;
if multiple retrieval strategies meet the preset answer accuracy requirement for the training statement text, using the complexity of the retrieval strategy with the lowest retrieval complexity as the complexity label of the training statement text;
if exactly one retrieval strategy meets the preset answer accuracy requirement for the training statement text, using the complexity of that retrieval strategy as the complexity label of the training statement text; and
if none of the retrieval strategies meets the preset answer accuracy requirement for the training statement text, outputting the training statement text to an annotation expert client device and receiving the complexity label of the training statement text returned by the annotation expert client device.
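The labeling rule for the classifier's training data can be expressed compactly: try the strategies from cheapest to most expensive and take the first that is accurate enough. In the sketch below, `assign_label`, `meets_accuracy`, and `ask_expert` are hypothetical names, and the mapping from the no-retrieval strategy to the simple label is an inference from the text rather than something it states outright.

```python
# Retrieval strategies in order of increasing retrieval complexity.
STRATEGIES = ("none", "single_step", "multi_step")

# Assumed mapping from the cheapest sufficient strategy to a label.
LABEL_FOR_STRATEGY = {
    "none": "simple",
    "single_step": "medium",
    "multi_step": "complex",
}


def assign_label(text, meets_accuracy, ask_expert):
    """Label a training statement by its cheapest sufficient strategy."""
    for strategy in STRATEGIES:
        if meets_accuracy(text, strategy):
            # The first hit is the lowest-complexity sufficient strategy,
            # which covers both the "multiple" and "exactly one" cases.
            return LABEL_FOR_STRATEGY[strategy]
    # No strategy was accurate enough: defer to a human annotator.
    return ask_expert(text)
```

Because the strategies are scanned in ascending order, the cases of one sufficient strategy and several sufficient strategies collapse into the same branch.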

[0013] In some embodiments of this application, the judging agent is also used to output the confidence score corresponding to the authenticity judgment result when the debate rounds converge. The confidence score is determined based on at least one of the following indicators: The degree of consistency among the evaluation results generated by each of the evidence-specific intelligent agents; The coverage of evidence associated with the statement text in the multimodal evidence set; The self-evaluation result of the judging agent on its own judgment.
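One way to turn the three indicators into a single confidence score is sketched below. The application only says that the score is based on at least one of these indicators; the specific formulas here (agreement as one minus the population standard deviation of stance scores in [-1, 1], coverage as the fraction of relevant evidence, and a weighted average of the three) are illustrative assumptions, as are all function names.

```python
from statistics import pstdev


def agreement(scores):
    """Consistency of per-evidence stance scores, each in [-1, 1]."""
    if len(scores) < 2:
        return 1.0
    # For values in [-1, 1] the population std. dev. is at most 1.
    return max(0.0, 1.0 - pstdev(scores))


def coverage(relevant_flags):
    """Fraction of retrieved evidence judged relevant to the statement."""
    if not relevant_flags:
        return 0.0
    return sum(1 for flag in relevant_flags if flag) / len(relevant_flags)


def confidence(agreement_score, coverage_score, self_assessment,
               weights=(1.0, 1.0, 1.0)):
    """Weighted average of the three indicators, each in [0, 1]."""
    values = (agreement_score, coverage_score, self_assessment)
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * v for w, v in zip(weights, values)) / total_weight
```

Setting a weight to zero drops the corresponding indicator, which matches the "at least one of" wording.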

[0014] Another aspect of this application provides an interpretable adaptive multimodal fact-checking system, comprising:
an MOE-based multimodal retrieval module, used to dynamically adjust the evidence retrieval strategy according to the statement text and retrieve the multimodal evidence set of the statement text based on the evidence retrieval strategy, wherein the modalities of the evidence in the multimodal evidence set include text and visual;
a heterogeneous graph structure learning module, used to construct a heterogeneous graph based on the statement text and the multimodal evidence set and update the node representations of the heterogeneous graph by graph structure learning, wherein the types of nodes in the heterogeneous graph include a statement node corresponding to the statement text and an evidence node corresponding to each piece of evidence, and the types of edges in the heterogeneous graph include first-type edges connecting the statement node and an evidence node, and second-type edges connecting different evidence nodes; and
a multi-agent debate module, used to input the heterogeneous graph with updated node representations into a multi-agent debate system containing a counterfactual reasoning agent and to generate, through multiple rounds of interactive debate among the agents, a veracity determination result for the statement text and a corresponding natural language interpretation report, wherein the natural language interpretation report is traceable to at least one of the evidence nodes in the heterogeneous graph, and the counterfactual reasoning agent actively generates counterfactual reasoning points during the multi-round interactive debate by hypothetically modifying the evidence corresponding to an evidence node.

[0015] A third aspect of this application provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the interpretable adaptive multimodal fact-checking method described above.

[0016] A fourth aspect of this application provides a computer-readable storage medium having a computer program stored thereon that, when executed by a processor, implements the described interpretable adaptive multimodal fact-checking method.

[0017] A fifth aspect of this application provides a computer program product comprising a computer program that, when executed by a processor, implements the described interpretable adaptive multimodal fact-checking method.

[0018] The interpretable adaptive multimodal fact-checking method provided in this application dynamically adjusts the evidence retrieval strategy based on the statement text, effectively avoiding computational redundancy in simple statements and evidence omission in complex statements, and achieving on-demand allocation of computing resources. By explicitly modeling the high-order semantic relationships between statement nodes and evidence nodes, as well as between evidence nodes, it can effectively capture fine-grained cross-modal multi-hop reasoning paths, providing structured information input for subsequent debates. By inputting the updated heterogeneous graph into a debate system containing counterfactual reasoning agents, and through multiple rounds of interactive debate and hypothetical modifications, it can not only generate natural language interpretation reports traceable to specific evidence nodes, but also test the robustness of arguments when evidence conflicts occur, ensuring the logical consistency of the conclusions. Therefore, this method can be deployed in content review platforms of news media, false information governance systems of social media platforms, and intelligence analysis processes of think tanks, significantly improving the efficiency of manual review, reducing the misjudgment rate, and providing traceable judgment criteria for regulatory decisions. Thus, in the context of the explosion of multimodal information, it provides core technical support for building a trustworthy, transparent, and efficient information ecosystem.

[0019] Additional advantages, objectives, and features of this application will be set forth in part in the description which follows, and will in part become apparent to those skilled in the art upon review of the following description, or may be learned by practice of the application. The objectives and other advantages of this application can be realized and obtained by means of the structures specifically pointed out in the specification and drawings.

[0020] Those skilled in the art will understand that the purposes and advantages that can be achieved with this application are not limited to those specifically described above, and that the above and other purposes that this application can achieve will be more clearly understood from the following detailed description.

Attached Figure Description

[0021] The accompanying drawings, which are included to provide a further understanding of this application and form part of this application, do not constitute a limitation thereof. The components in the drawings are not drawn to scale but are merely for illustrating the principles of this application. For ease of illustration and description of certain parts of this application, corresponding portions in the drawings may be enlarged, i.e., may appear larger relative to other components in an exemplary device actually manufactured according to this application. In the drawings: Figure 1 is a schematic diagram of a first flow of an interpretable adaptive multimodal fact-checking method according to an embodiment of this application.

[0022] Figure 2 is a schematic diagram of the architecture of a multi-agent debate system according to an embodiment of this application.

[0023] Figure 3 is a flowchart illustrating the process by which a multi-agent debate system generates a truthfulness determination result and a corresponding natural language interpretation report in one embodiment of this application.

[0024] Figure 4 is a schematic diagram of the execution architecture of a multi-agent debate system according to an embodiment of this application.

[0025] Figure 5 is a schematic diagram of a second flow of an interpretable adaptive multimodal fact-checking method according to an embodiment of this application.

[0026] Figure 6 is a schematic diagram of the structure of an interpretable adaptive multimodal fact-checking system according to an embodiment of this application.

[0027] Figure 7 is a schematic diagram of the framework of an interpretable adaptive multimodal fact-checking system according to an application example of this application.

Detailed Implementation

[0028] To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the embodiments and accompanying drawings. Here, the illustrative embodiments and their descriptions are used to explain this application, but are not intended to limit it.

[0029] It should also be noted that, in order to avoid obscuring this application with unnecessary details, only the structures and / or processing steps closely related to the solution according to this application are shown in the accompanying drawings, while other details that are not closely related to this application are omitted.

[0030] It should be emphasized that the term "including / comprises" as used herein refers to the presence of a feature, element, step, or component, but does not exclude the presence or addition of one or more other features, elements, steps, or components.

[0031] It should also be noted that, unless otherwise specified, the term "connection" in this article can refer not only to a direct connection, but also to an indirect connection involving an intermediary.

[0032] In the following description, embodiments of the present application will be illustrated with reference to the accompanying drawings. In the drawings, the same reference numerals represent the same or similar parts, or the same or similar steps.

[0033] First, it's important to note that fact-checking technology has been widely applied in various scenarios, including news release quality assurance and social media content regulation, to effectively alleviate the pressure of manual review and improve systemic governance capabilities. In fact-checking tasks, locating evidence relevant to a specific claim from a large pool of knowledge sources has a decisive impact on the accuracy of the final judgment. Previous mainstream research has largely focused on relying on structured knowledge or textual corpora as evidence. However, the forms of information expression on social media have changed significantly in recent years, with user-generated content employing a large number of multimodal joint expressions. Besides text, images, videos, and even audio materials can all carry factually relevant information and are potential sources of evidence that fact-checking models should utilize. This trend has spurred the development of multimodal fact-checking. Compared to traditional methods that rely on a single modality, multimodal methods can infer the plausibility of information from cross-modal complementary relationships, which is crucial for improving the confidence level of results.

[0034] While existing fact-checking research has made progress in evidence retrieval strategies and the interpretability of reasoning, relevant methods still have significant limitations. For example, in dynamic retrieval, Tool-MAD incrementally queries evidence information through multiple rounds of debate, while DEFAME possesses task-driven retrieval depth planning capabilities. Although these methods have a certain degree of adaptability, they cannot distinguish the differences in complexity between different claims, leading to resource redundancy in simple tasks and the potential omission of key evidence information in complex tasks. Regarding reasoning and interpretability, although methods such as Tool-MAD and MAD-Sherlock can output reasoning paths, existing research only provides posterior interpretations of the reasoning results, lacking a unified modeling and controllable interpretation mechanism for the retrieval and reasoning stages.

[0035] With the development of multimodal fact-checking research, existing findings have verified that visual information can effectively supplement the semantic deficiencies of textual evidence, thereby improving the accuracy of evidence retrieval and judgment in fact-checking tasks. At the technical design level, researchers have proposed various cross-modal fusion strategies to fully utilize knowledge information from heterogeneous sources such as text and images.

[0036] One typical approach employs a phased modeling framework, where fact-determining learning is performed separately on each independent modality, and then the prediction results or feature representations are integrated through a cross-modal fusion module. Representative methods include: first, using neural networks to extract features from textual or visual data; then, mapping multimodal features to a unified representation space to jointly determine whether a claim is true. These methods have a clear implementation path, a simple structure, and are easy to implement in real-world scenarios.

[0037] Other studies have adopted an integrated fusion approach to improve the model's ability to model complex semantic interactions. For example, they utilize parametrically efficient adapters to extend the model's capabilities and combine them with specially designed multi-type fusion modules to explicitly model the interaction relationships between statements, documents, and images. Such methods can improve the system's response quality to complex inputs.

[0038] Despite significant progress in multimodal fusion and reasoning capabilities, existing methods still have many limitations that seriously affect their applicability and stability in real-world fact-checking tasks.

[0039] First, current evidence retrieval paradigms typically employ a fixed strategy, executing the same process regardless of the complexity of the input claims. This leads to uneven performance of algorithms on data of varying difficulty: single-step retrieval systems struggle to support complex reasoning requiring cross-document integration, while iterative retrieval mechanisms requiring multi-stage processing are inefficient and resource-intensive when handling simple claims.

[0040] Secondly, in the assertion verification phase, mainstream methods typically input the retrieved multimodal evidence directly into a Large Language Model (LLM) or a fusion model for judgment. However, this process is susceptible to interference from two key factors. On the one hand, LLMs have an inherent tendency to fabricate content, potentially generating inferences that conflict with the facts when there is insufficient external support, thereby amplifying inference errors. On the other hand, the retrieved evidence itself often contains noisy or even directly contradictory viewpoints, which can easily lead to decision deviations.

[0041] Furthermore, multimodal verification systems generally lack interpretable reasoning capabilities. Existing models typically output only the final judgment, without providing the criteria for selecting supporting evidence, the cross-modal information correlation paths, or the reasoning chains, resulting in insufficient system transparency. Moreover, existing interpretability solutions are mostly designed for text-only tasks and struggle to cover the alignment explanation and conflict resolution that joint image-text verification requires.

[0042] In summary, to address the significant shortcomings of existing multimodal fact-checking models in areas such as adaptive evidence retrieval capabilities, cross-modal conflict handling, noise-resistant reasoning, and interpretability, this application provides an interpretable adaptive multimodal fact-checking method, an interpretable adaptive multimodal fact-checking system for executing the method, an electronic device, a computer-readable storage medium, and a computer program product. These solutions aim to overcome the limitations of traditional models that rely on fixed retrieval paths and single verification strategies, enabling intelligent retrieval configurations tailored to different claim difficulties, and providing reliable reasoning and natural language explanations during the verification phase.

[0043] First, existing fact-checking models mostly employ a single retrieval strategy in the evidence acquisition stage, failing to flexibly allocate computational resources based on the complexity of the claim. Simple claims can often be supported by sufficient evidence in a single retrieval, while complex claims may involve recursive reasoning across paragraphs, documents, or even modalities. Fixed strategies not only lead to insufficient information coverage for complex issues but also cause significant computational redundancy in simple scenarios. Therefore, this application proposes an adaptive evidence retrieval mechanism that dynamically adjusts the retrieval strategy according to the complexity of the claim.

[0044] Secondly, in the statement verification stage, while relying on fusion models to directly integrate textual and visual information can provide a final credibility judgment, it still faces two core obstacles: First, retrieved evidence often contains inconsistent viewpoints or even conflicting information, and existing models lack the ability to identify the reliability of evidence and resolve conflicts; second, current multimodal fact-checking models generally output uninterpretable results, failing to reveal the basis for judgment. To address this, this application designs a heterogeneous graph structure learning mechanism, enabling the model to model the high-order relational structure between statements, textual evidence, and visual cues, and to evaluate and filter evidence through a multi-agent debate strategy. This ensures the accuracy of the judgment while generating reliable natural language explanations, improving the transparency of the reasoning chain.

[0045] This application proposes a novel fact-checking framework, AMMFV (Adaptive Multi-modal Fact Verification). This model invokes different expert modules to dynamically handle the fine-grained complexity of input claims. To this end, a dedicated retrieval engine based on a Mixture-of-Experts (MoE) model is designed, routing claims to the optimal expert based on their complexity, thus improving the accuracy of evidence retrieval while maintaining model efficiency. To resolve knowledge conflicts and enhance interpretability, a multi-agent debate framework and counterfactual reasoning mechanism are integrated during the claim verification phase, enabling the model to dynamically evaluate evidence and generate corresponding explanations.

[0046] The following examples will provide a detailed description.

[0047] Based on this, embodiments of this application provide an interpretable adaptive multimodal fact-checking method that can be implemented by an interpretable adaptive multimodal fact-checking system. Referring to Figure 1, the interpretable adaptive multimodal fact-checking method specifically includes the following: Step 100: Dynamically adjust the evidence retrieval strategy according to the statement text, and retrieve the multimodal evidence set of the statement text based on the evidence retrieval strategy; wherein the modalities of the evidence in the multimodal evidence set include text and visual.

[0048] In one or more embodiments of this application, the statement text refers to a natural language sentence or paragraph whose authenticity is to be verified, such as "In May 2023, a 7.8 magnitude earthquake occurred in a certain location." Dynamically adjusting the evidence retrieval strategy refers to adaptively selecting different retrieval methods based on the characteristics (such as complexity) of the statement text, including no retrieval, single-step retrieval, or multi-step iterative retrieval. A multimodal evidence set refers to a collection of evidence containing both textual and visual modalities. For example, textual evidence could be a report from an authoritative news website, while visual evidence could be photos or video frames from the earthquake site. That the modalities of evidence include text and visual means that a piece of evidence may be textual (such as a press release) or visual, where visual evidence includes images and/or videos (such as photos or video clips).

[0049] Specifically, given a specific statement text $c$ (hereinafter referred to as the statement) from the set of statements, the fact-checking process requires retrieving relevant evidence from both textual and visual sources. Formally defined as: (1) Textual evidence set: $E_t = \{e_{t1}, e_{t2}, \ldots\}$, whose size is the total number of pieces of evidence in the text modality (i.e., textual evidence). Each $e_{ti}$ is a piece of textual evidence, including but not limited to news reports, investigative reports, or posts on social media platforms.

[0050] (2) Visual evidence set: $E_v = \{e_{v1}, e_{v2}, \ldots\}$, whose size is the total number of pieces of evidence in the visual modality (i.e., visual evidence). Each $e_{vj}$ is a piece of visual evidence, such as an image or a frame of video.

[0051] In step 100, the text of the statement to be verified can be obtained first, for example, the statement text is "In May 2023, a 7.8 magnitude earthquake occurred in a certain place". Then, the evidence retrieval strategy is dynamically adjusted according to the statement text. Specifically, a complexity classifier can be pre-configured (detailed in subsequent embodiments). This classifier outputs labels based on the complexity of the statement, such as: simple, medium, and complex. Then, the corresponding retrieval strategy can be selected according to the label: if the output label is simple, the judgment can be made directly based on the knowledge already possessed by the underlying LLMs. Therefore, a strategy without retrieval is adopted, and the subsequent steps 200 and 300 do not need to be executed. This application does not specifically limit the execution process under this strategy, and subsequent execution can be carried out according to the actual application situation. If the output label is medium, a single-step retrieval strategy is executed, that is, relevant evidence is obtained from external knowledge sources (such as news databases, search engines) through a single retrieval; if the output label is complex, a multi-step iterative retrieval strategy is executed, that is, evidence is gradually collected through multiple rounds of retrieval and reasoning interaction.

[0052] Assuming the complexity classifier determines the statement to be of medium complexity in this example, a single-step retrieval is performed: using the statement text as the query, a set of multimodal evidence is retrieved, including textual evidence (such as news reports "A 7.8-magnitude earthquake occurred in a certain place, and the casualties are yet to be verified") and visual evidence (such as earthquake scene pictures and video screenshots).
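As an illustrative sketch only, and not the implementation claimed in this application, the label-based strategy selection described for step 100 could be organized as follows. The names `Complexity`, `retrieve_evidence`, `classify`, `search`, `refine_query`, and `max_steps` are hypothetical. The complexity classifier and the retriever are passed in as placeholder callables because their internals are described in other embodiments.

```python
from enum import Enum
from typing import Callable, List


class Complexity(Enum):
    SIMPLE = "simple"
    MEDIUM = "medium"
    COMPLEX = "complex"


def retrieve_evidence(
    claim: str,
    classify: Callable[[str], Complexity],
    search: Callable[[str], List[str]],
    refine_query: Callable[[str, List[str]], str],
    max_steps: int = 3,
) -> List[str]:
    """Select a retrieval strategy from the complexity label of the claim."""
    label = classify(claim)
    if label is Complexity.SIMPLE:
        # No retrieval: the judgment relies on the underlying LLM's own knowledge.
        return []
    if label is Complexity.MEDIUM:
        # Single-step retrieval: one query against the external knowledge source.
        return search(claim)
    # Multi-step iterative retrieval: alternate retrieval and query refinement.
    evidence: List[str] = []
    query = claim
    for _ in range(max_steps):
        for item in search(query):
            if item not in evidence:
                evidence.append(item)
        query = refine_query(claim, evidence)
    return evidence
```

In this sketch an empty result signals the no-retrieval strategy, under which steps 200 and 300 would be skipped. The fixed `max_steps` cap is an assumption of the sketch; the text does not state how the iterative strategy terminates.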

[0053] Step 200: Construct a heterogeneous graph based on the statement text and the multimodal evidence set, and update the node representation of the heterogeneous graph using graph structure learning; wherein, the types of nodes in the heterogeneous graph include: the statement node corresponding to the statement text and the evidence node corresponding to each of the evidence; the types of edges in the heterogeneous graph include: first type edges connecting the statement node and the evidence node, and second type edges connecting different evidence nodes.

[0054] It is understood that a heterogeneous graph refers to a graph structure containing multiple types of nodes and multiple types of edges. In the heterogeneous graph of this application, the node types include the statement node and the evidence nodes corresponding to each piece of evidence. Evidence nodes include textual evidence nodes and visual evidence nodes. The first type of edge can also be written as a statement-evidence edge; the second type of edge can also be written as an evidence-evidence edge. The node representation refers to the feature vector corresponding to each node in the heterogeneous graph, for example, the embedding vector obtained by mapping text and images to the same vector space using the multimodal pre-training model CLIP (Contrastive Language-Image Pre-training). Specifically, the graph structure learning can use graph neural networks (such as graph convolutional networks, GCN, or graph attention networks, GAT) to update the features of nodes in the heterogeneous graph, enabling nodes to aggregate neighbor information.

[0055] Specifically, after retrieving textual and visual evidence, claim verification typically requires aggregating information from these heterogeneous pieces of evidence to form a final judgment. A straightforward strategy is to concatenate the claims and evidence and have them processed by LLMs, but this approach may overlook the complex relationships between fine-grained evidence. To address this issue, this application introduces a heterogeneous graph structure learning framework to explicitly model cross-evidence relevance.

[0056] Given a claim and the retrieved textual evidence set and visual evidence set, this application constructs a heterogeneous graph. This heterogeneous graph contains three types of nodes: the claim node, the textual evidence nodes corresponding to the pieces of textual evidence, and the visual evidence nodes corresponding to the pieces of visual evidence. Using the pre-trained CLIP model, the claim and each piece of evidence (which can be textual or visual) are embedded into a shared latent space. Subsequently, a metric learning method is applied to these embeddings to generate a claim-evidence semantic adjacency matrix, and a similar strategy is used to construct the evidence-evidence adjacency matrix. The claim-evidence and evidence-evidence adjacency matrices are then integrated into a unified graph structure, which captures the complex multi-hop relationships between the claim and the evidence, as well as among the pieces of evidence themselves.
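The graph construction above can be illustrated with a minimal sketch. It assumes the embeddings have already been produced (for example by CLIP) and substitutes plain cosine similarity with a fixed threshold for the metric learning step, which is a simplification and not the method of this application. The names `cosine`, `build_graph`, and `threshold` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_graph(claim: Vector, evidence: List[Vector], threshold: float = 0.5) -> List[List[float]]:
    """Return a weighted adjacency matrix; node 0 is the claim, nodes 1..n are evidence."""
    nodes = [claim] + list(evidence)
    size = len(nodes)
    adj = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1, size):
            sim = cosine(nodes[i], nodes[j])
            if sim >= threshold:
                # Row/column 0 holds claim-evidence edges (first type);
                # all other entries hold evidence-evidence edges (second type).
                adj[i][j] = adj[j][i] = sim
    return adj
```

Storing both edge types in one matrix mirrors the unified graph structure described above; a learned metric would replace `cosine` without changing the layout.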

[0057] Next, node representations can be updated using graph structure learning. Specifically, a heterogeneous graph can be input into a graph neural network, such as a graph attention network (GAT). The graph neural network updates the representation of each node by aggregating the features of its neighboring nodes. For example, for a claim node, its updated node representation is obtained by weighted summation of its own features and the features of all connected evidence nodes through an attention mechanism.
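The attention-weighted update of a claim node can be sketched as below. This is a simplified illustration that uses parameter-free dot-product attention; an actual graph attention network learns its projection and attention weights, which are omitted here. The function names `softmax` and `update_claim_node` are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a non-empty sequence of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def update_claim_node(claim: Sequence[float], neighbors: List[Sequence[float]]) -> List[float]:
    """Attention-weighted sum over the claim node itself and its connected evidence nodes."""
    candidates = [claim] + list(neighbors)
    # Attention score of each candidate: dot product with the claim features.
    scores = [sum(a * b for a, b in zip(claim, vec)) for vec in candidates]
    weights = softmax(scores)
    dim = len(claim)
    return [sum(w * vec[k] for w, vec in zip(weights, candidates)) for k in range(dim)]
```

Evidence nodes whose features align more closely with the claim receive larger weights and therefore contribute more to the updated representation.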

[0058] Step 300: Input the heterogeneous graph with updated node representations into a multi-agent debate system containing a counterfactual reasoning agent, so as to generate a veracity judgment result for the statement text and a corresponding natural language interpretation report through multiple rounds of agent interactive debate. The natural language interpretation report is used to trace back to at least one of the evidence nodes in the heterogeneous graph. The counterfactual reasoning agent is used to actively generate counterfactual reasoning points by hypothetically modifying the evidence corresponding to the evidence node during the multi-round agent interactive debate.

[0059] It should be noted that a counterfactual reasoning agent is a specialized agent designed to generate counterfactual arguments. It tests the truth or falsity of claims under different conditions by hypothetically modifying evidence (such as deleting key facts, replacing entities, or modifying visual elements). A multi-agent debate system is a system composed of multiple agents (such as evidence-specific agents, standard reasoning agents, and counterfactual reasoning agents) that reach consensus and generate results through multiple rounds of interactive debate. Multi-round agent interactive debate is a process of multiple rounds of information exchange and argument updates among multiple agents, with each round improving upon the results of the previous round until convergence. The truthfulness determination result is the final judgment on the truth or falsity of the statement, which can be "support," "refute," or "insufficient information." A natural language explanation report is a report describing the basis for judgment and the reasoning process in natural language, traceable to specific evidence nodes. Tracing back to at least one evidence node in the heterogeneous graph means that the content of the explanation report corresponds to a specific evidence node in the heterogeneous graph, for example, "Evidence node A shows... therefore, the conclusion is reached." Hypothetical modification refers to fictitious alterations to evidence, such as deleting key facts from text or replacing one object in an image with another. Counterfactual arguments are based on evidence that has undergone hypothetical modification and are used to test the robustness of a statement.

[0060] Specifically, the method provided in this application aims to generate a label judging the truthfulness of a claim based on multimodal evidence. The truthfulness label $y$ predicted by the model (that is, the veracity judgment result finally output for the statement text) comes from the following set of discrete truthfulness labels: $\mathcal{Y} = \{\text{Supported}, \text{Refuted}, \text{Not Enough Information}\}$, where Supported indicates that there is sufficient evidence to support the statement; Refuted indicates that the evidence provided is sufficient to refute the statement; and Not Enough Information indicates that the existing evidence is insufficient to make a definitive judgment. Therefore, the learning objective of this task is to learn a mapping function $f_\theta$ such that $y = f_\theta(c, E_t, E_v)$ with $y \in \mathcal{Y}$, where $\theta$ represents the parameters of the fact-checking model, $c$ is the statement text, $E_t$ is the textual evidence set, and $E_v$ is the visual evidence set.

[0061] In step 300, the heterogeneous graph with updated node representations is input into a multi-agent debate system, which includes at least one counterfactual reasoning agent. The debate system initiates multiple rounds of agent-based interactive debate. Taking the first round as an example, the system pre-defines multiple evidence-specific agents (one for each piece of evidence). Each evidence-specific agent generates a preliminary assessment of the relevance of the evidence to the statement based on its corresponding evidence node representation and statement node representation. The preliminary assessment results of all evidence-specific agents are input into an aggregation module, which integrates the judgments of all parties from a global perspective, identifies points of consistency and conflict, and generates a global evidence summary with explanations. The counterfactual reasoning agent adopts a hypothetical analysis stance, systematically exploring the logical consequences of hypothetically modifying key evidence (e.g., "If the key information in a piece of evidence changes, is the statement still valid?"). After multiple rounds of debate, the agents exchange viewpoints and update their assessments, and finally, the judging agent determines whether convergence has occurred. If convergence has occurred, the final result is generated.

[0062] For example, for a claim "an earthquake occurred in a certain location in May 2023," the evidence includes a news report and a picture. Evidence-specific agent A (corresponding to the news report) evaluates it as "supportive," while evidence-specific agent B (corresponding to the picture) evaluates it as "neutral." The aggregation module then generates a global summary. The standard reasoning agent constructs its argument based on this: "The explicit description in the news report constitutes primary supporting evidence; the picture, though blurry, does not contradict the report." The counterfactual reasoning agent proposes a hypothetical test: "If the date in the picture is not May, but another month, then the picture will change from 'neutral' to 'contradictory.'" After multiple rounds of interaction, the system may ultimately determine it as "supportive" and generate a natural language explanation report: "Based on the explicit description in the news report and the supporting information in the picture, the claim is supported; the counterfactual test shows that if the picture date is inconsistent, the conclusion will change, but the current evidence is consistent." The content in the explanation report is traceable to specific evidence nodes; for example, the report mentions "the news report (evidence node A) shows…," thus achieving traceability.
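The hypothetical test in this example can be sketched for textual evidence as follows. The sketch assumes an externally supplied evaluator and models a hypothetical modification as a simple substring replacement, which is far cruder than the modifications an LLM-based agent could propose. The names `counterfactual_test` and `evaluate` are hypothetical.

```python
from typing import Callable, Tuple


def counterfactual_test(
    claim: str,
    evidence_text: str,
    original: str,
    replacement: str,
    evaluate: Callable[[str, str], str],
) -> Tuple[str, str, bool]:
    """Evaluate evidence before and after a hypothetical modification.

    Returns (baseline label, label after modification, whether the label changed).
    """
    baseline = evaluate(claim, evidence_text)
    # Hypothetical modification: swap one detail of the evidence for another.
    modified = evaluate(claim, evidence_text.replace(original, replacement))
    return baseline, modified, baseline != modified
```

A changed label indicates that the conclusion depends on the modified detail, which is the kind of observation the explanation report in the example cites.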

[0063] As described above, the interpretable adaptive multimodal fact-checking method provided in this application dynamically adjusts the evidence retrieval strategy based on the statement text, effectively avoiding computational redundancy in simple statements and evidence omission in complex statements, and realizing on-demand allocation of computing resources. By explicitly modeling the high-order semantic relationships between statement nodes and evidence nodes, as well as between evidence nodes, it can effectively capture fine-grained cross-modal multi-hop reasoning paths, providing structured information input for subsequent debates. By inputting the updated heterogeneous graph into a debate system containing counterfactual reasoning agents, and through multiple rounds of interactive debate and hypothetical modifications, it can not only generate natural language explanation reports traceable to specific evidence nodes, but also test the robustness of the argument when evidence conflicts occur, ensuring the logical consistency of the conclusion. Therefore, this method can be deployed in content review platforms of news media, false information governance systems of social media platforms, and intelligence analysis processes of think tanks, significantly improving the efficiency of manual review, reducing the misjudgment rate, and providing traceable judgment basis for regulatory decisions. Thus, in the context of the explosion of multimodal information, it provides core technical support for building a trustworthy, transparent, and efficient information ecosystem.

[0064] To further address the lack of clear division of labor and collaboration mechanisms in multi-agent debate systems, and to make the debate process structured, manageable, and interpretable while resolving conflicts between different pieces of evidence, this application provides an interpretable adaptive multimodal fact-checking method. Referring to Figure 2, the multi-agent debate system in the interpretable adaptive multimodal fact-checking method specifically includes: the counterfactual reasoning agent, and the evidence-specific agents, standard reasoning agent, aggregation module, and judging agent that participate in the multi-round agent interactive debate together with the counterfactual reasoning agent. Each of the evidence-specific agents corresponds to one piece of evidence in the multimodal evidence set and is used to generate an evaluation result regarding the relevance of that evidence to the statement text. That is, an evidence-specific agent is responsible for independently evaluating the relevance of its own piece of evidence to the statement text, which avoids interference between different pieces of evidence. For example, each piece of evidence is assigned an agent based on a Large Language Model (LLM). The evaluation result refers to the judgment output by the evidence-specific agent regarding the relevance of its corresponding evidence to the statement, which can be a binary label (support / refute) or a numerical value with confidence.

[0065] Suppose that a multimodal evidence set $E = \{e_{t1}, e_{t2}, e_{v1}\}$ has been obtained through dynamic retrieval, where $e_{t1}$ is the textual evidence "A 7.8 magnitude earthquake occurred in a certain area; casualties are pending verification," $e_{t2}$ is the textual evidence "earthquake experts say the earthquake's magnitude was 7.8," and $e_{v1}$ is visual evidence (a photograph showing the scene of a building collapse). A separate evidence-specific agent is assigned to each piece of evidence: the first evidence-specific agent $A_{t1}$ corresponds to evidence $e_{t1}$, the second evidence-specific agent $A_{t2}$ corresponds to evidence $e_{t2}$, and the third evidence-specific agent $A_{v1}$ corresponds to evidence $e_{v1}$.

[0066] Each evidence-specific agent receives the statement text $c$ ("In May 2023, a 7.8 magnitude earthquake occurred in a certain location") and its corresponding evidence, independently assesses the relevance of the evidence to the statement, and generates an assessment result. The assessment result can be a category or a confidence score. For example: $A_{t1}$ assessment: the news report explicitly mentions the earthquake, therefore the assessment is "supportive"; $A_{t2}$ assessment: the expert opinion supports the earthquake's magnitude, therefore the assessment is "supportive"; $A_{v1}$ assessment: the image shows a collapsed building, consistent with an earthquake scene, but it cannot be confirmed whether it is from that earthquake, therefore the assessment is "neutral".

[0067] The aggregation module is used to summarize the evaluation results generated by each of the evidence-specific agents, and generate global evidence summary data and corresponding explanatory data based on the evaluation results. The global evidence summary data refers to comprehensive information generated by the aggregation module that reflects the overall situation of all evidence; it can be statistical quantities (such as the support / rebuttal ratio) or structured data. The explanatory data is a preliminary explanatory text corresponding to the global evidence summary, used for subsequent debate and the generation of the final explanatory report.

[0068] For example, the aggregation module receives "support" from $A_{t1}$, "support" from $A_{t2}$, and "neutral" from $A_{v1}$. The aggregation module can generate global evidence summary data using methods such as weighted voting, text summarization, or neural network fusion. For example, the statistical support ratio is 2/3, the rebuttal ratio is 0, and the neutral ratio is 1/3. Simultaneously, the aggregation module generates corresponding explanatory data based on these evaluation results, i.e., a piece of natural language text, such as "Most textual evidence supports the occurrence of the earthquake, but visual evidence cannot definitively confirm it."
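The statistical form of the global evidence summary in this example amounts to counting labels. A minimal sketch follows, assuming each evidence-specific agent emits one of three labels; the names `LABELS` and `summarize` are hypothetical, and weighted voting or neural fusion would replace the plain counts.

```python
from collections import Counter
from fractions import Fraction
from typing import Dict

LABELS = ("support", "refute", "neutral")


def summarize(evaluations: Dict[str, str]) -> Dict[str, Fraction]:
    """Map each label to the exact share of agents that returned it."""
    if not evaluations:
        raise ValueError("at least one evaluation is required")
    counts = Counter(evaluations.values())
    total = len(evaluations)
    return {label: Fraction(counts[label], total) for label in LABELS}


summary = summarize({"A_t1": "support", "A_t2": "support", "A_v1": "neutral"})
# summary["support"] == Fraction(2, 3); summary["refute"] == 0; summary["neutral"] == Fraction(1, 3)
```

Exact fractions are used so that the ratios match the 2/3, 0, and 1/3 values of the example without rounding.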

[0069] The standard reasoning agent is used to construct standard arguments supporting or refuting the statement text based on the global evidence summary data. The standard reasoning agent is an agent that performs deductive and abductive reasoning based on the global evidence summary, responsible for constructing arguments that conform to the logic of existing evidence. The standard arguments are the reasoning conclusions generated by the standard reasoning agent that support or refute the statement, and are typically expressed in natural language.

[0070] For example, based on the summary of "most textual evidence supports it, visual evidence is neutral", a standard reasoning agent might generate the argument: "The textual evidence consistently indicates that an earthquake occurred, and although the visual evidence does not directly confirm it, it is not contradictory, therefore the statement is likely to be true."

[0071] The judging agent is used to evaluate whether each round of debate in the agent-interactive debate process has converged based on the global evidence summary data and the interpretation data, and triggers the generation of the authenticity judgment result for the statement text and the corresponding final natural language interpretation report when convergence occurs. The judging agent is responsible for monitoring the multi-round debate process, judging whether convergence has occurred, and triggering the generation of the final result when convergence occurs. Convergence means that the output of each agent tends to stabilize during the debate process, or reaches a preset stopping condition, which can be set manually. The final natural language interpretation report is the natural language interpretation generated after the debate converges, which can be traced back to the specific evidence node. For example, the convergence condition can be: the evaluation results of all evidence-specific agents no longer change, the standard argument and the counterfactual argument tend to be consistent, or the preset maximum number of rounds is reached. If convergence occurs, the judging agent triggers the generation of the final output result, including the authenticity judgment result (e.g., "supports") and the corresponding final natural language interpretation report (e.g., "The textual evidence consistently supports the occurrence of the earthquake, and the visual evidence, although not explicit, is not contradictory; after counterfactual testing, there is no evidence to suggest that the statement is invalid").

[0072] As can be seen from the above description, the interpretable adaptive multimodal fact-checking method provided in this application embodiment achieves independent evaluation of evidence, integration of global information, standardization of logical reasoning, and automatic judgment of debate convergence by clearly defining the functions of the evidence-specific agents, the aggregation module, the standard reasoning agent, and the judging agent, thus providing a structured modular foundation for the subsequent multi-round debate process.

[0073] In addition, to further address the problem that the authenticity determination results lack credibility indicators and users cannot intuitively judge the reliability of the conclusions, the judging agent is also used to output the confidence score corresponding to the authenticity determination result when the debate rounds converge. The confidence score is determined based on at least one of the following indicators: (1) the degree of consistency among the evaluation results generated by each of the evidence-specific agents; (2) the coverage of evidence associated with the statement text in the multimodal evidence set; (3) the self-evaluation result of the judging agent on its own judgment.
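One possible way to combine the three indicators is a weighted average, sketched below. The text does not specify a combination rule, so the equal weights, the definition of consistency as the majority share, and the definition of coverage as the fraction of retrieved evidence judged relevant are all assumptions of this sketch. The names `consistency` and `confidence_score` are hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def consistency(evaluations: Sequence[str]) -> float:
    """Share of agents that agree with the most common label."""
    if not evaluations:
        return 0.0
    return Counter(evaluations).most_common(1)[0][1] / len(evaluations)


def confidence_score(
    evaluations: Sequence[str],
    relevant_count: int,
    total_count: int,
    self_eval: float,
    weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),
) -> float:
    """Weighted average of consistency, evidence coverage, and judge self-evaluation."""
    coverage = relevant_count / total_count if total_count else 0.0
    parts = (consistency(evaluations), coverage, self_eval)
    return sum(w * p for w, p in zip(weights, parts))
```

For instance, `confidence_score(["support", "support", "neutral"], 3, 3, 0.9)` averages a consistency of 2/3, a coverage of 1.0, and a self-evaluation of 0.9.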

[0074] By outputting a confidence score, the system provides users with a reference for the credibility of the conclusions, enhancing the system's usability and transparency, and enabling users to reasonably accept or further verify the judgment results based on the confidence level.

[0075] To further address the lack of specific interaction steps and iterative mechanisms in multi-agent debate, and to make the debate process operable and controllable, this application provides an interpretable adaptive multimodal fact-checking method. Referring to Figure 3, the multi-agent debate system generates a veracity determination result and a corresponding natural language interpretation report for the statement text through the following steps: Step 310: First round of debate: Each evidence-specific agent generates an initial evaluation result based on the statement text and the uniquely corresponding evidence; the aggregation module summarizes all initial evaluation results and generates global evidence summary data and corresponding explanation data for the first round of debate.

[0076] The first round of debate is the first phase of a multi-agent debate, where the evidence-specific agent independently generates the evaluation results, and the aggregation module generates the global summary and explanation for the first time. The initial evaluation results are those generated by the evidence-specific agent in the first round and have not yet been revised by subsequent debates. The global evidence summary data and corresponding explanation data from the first round of debate are also the output of the aggregation module at the end of the first round of debate, serving as the input basis for subsequent debates.

[0077] Step 320: Convergence judgment step: The judging agent evaluates whether the current debate round has converged or reached the preset maximum number of rounds based on the received global evidence summary data and corresponding explanation data for the current debate round. If yes, proceed to step 330; otherwise, proceed to steps 340 to 360 to obtain the global evidence summary data and corresponding explanation data for the next debate round. The preset maximum number of rounds is an upper limit set to avoid infinite loops, for example, 5 rounds.

[0078] Step 330: Output the explanation data of the current debate round as a natural language explanation report of the statement text, and output the corresponding authenticity judgment result of the statement text.

[0079] Step 340: Input the global evidence summary data from the previous debate round into the standard reasoning agent and the counterfactual reasoning agent respectively. The standard reasoning agent generates the standard argument, and the counterfactual reasoning agent generates the counterfactual reasoning argument. The standard argument and the counterfactual reasoning argument are fused through an update function to obtain the updated interpretation state for the current debate round. The updated interpretation state serves as the input for the standard reasoning agent and the counterfactual reasoning agent in the next debate round, so that the standard reasoning agent and the counterfactual reasoning agent generate the standard argument and the counterfactual reasoning argument for the next round based on the updated interpretation state.

[0080] Step 350: Each evidence-specific agent re-evaluates its uniquely corresponding evidence based on the global evidence summary data from the previous debate round, the updated counterfactual reasoning points for the current debate round, and the updated standard arguments, to obtain the evaluation result for the current debate round.

[0081] Step 360: The aggregation module summarizes all evaluation results of the current debate round and combines them with the updated standard arguments and updated counterfactual reasoning arguments of the current debate round to generate global evidence summary data and corresponding explanatory data for the current debate round; the global evidence summary data and corresponding explanatory data of the current debate round are input into the judging agent to execute the convergence judgment step. Then, the process returns to step 320.
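The control flow of steps 310 to 360 can be sketched as a loop. This is an illustration under stated assumptions rather than the claimed implementation: every agent is a placeholder callable, the interpretation state is a plain string, and convergence is tested only as "evaluations unchanged since the previous round or round cap reached". All function and parameter names are hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple

Evaluations = Dict[str, str]  # evidence id -> label from its evidence-specific agent


def run_debate(
    claim: str,
    evidence: Dict[str, str],
    evaluate: Callable[[str, str, str], str],
    aggregate: Callable[[Evaluations, str], str],
    standard_argue: Callable[[Evaluations, str], str],
    counterfactual_argue: Callable[[Evaluations, str], str],
    update: Callable[[str, str], str],
    judge: Callable[[Evaluations], str],
    max_rounds: int = 5,
) -> Tuple[str, str, int]:
    """Return (verdict, explanation, number of rounds used)."""
    state = ""  # interpretation state shared by the two reasoning agents
    # Step 310: first round, from the claim and each agent's own evidence only.
    evaluations = {eid: evaluate(claim, text, state) for eid, text in evidence.items()}
    explanation = aggregate(evaluations, state)
    previous: Optional[Evaluations] = None
    rounds = 1
    while True:
        # Step 320: stop when evaluations are stable or the round cap is reached.
        if evaluations == previous or rounds >= max_rounds:
            # Step 330: output the verdict and the current explanation.
            return judge(evaluations), explanation, rounds
        # Step 340: fuse the standard and counterfactual arguments.
        state = update(standard_argue(evaluations, state),
                       counterfactual_argue(evaluations, state))
        # Step 350: every evidence-specific agent re-evaluates its own evidence.
        previous = evaluations
        evaluations = {eid: evaluate(claim, text, state) for eid, text in evidence.items()}
        # Step 360: aggregate again, then return to step 320.
        explanation = aggregate(evaluations, state)
        rounds += 1
```

The round cap guarantees termination even if the agents never stabilize, matching the purpose of the preset maximum number of rounds in step 320.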

[0082] In one example, see Figure 4. The multi-agent debate system is built from independent LLM-based dialogue agents, each of which generates intermediate outputs based on its own evidence. These outputs are then aggregated across perspectives, followed by multiple rounds of debate among the reasoning agents and a final judgment with confidence calibration by a judging agent. Let the statement text to be verified be $c$ and the retrieved evidence set be $E = \{e_1, e_2, \dots, e_n\}$; in this example, $n \geq 4$. On this basis, the execution flow of the multi-agent debate system is as follows: (1) Evidence-specific agents: For each piece of evidence $e_i$, an agent $A_i$ is assigned as the evidence-specific agent to process the statement text and its corresponding evidence, so that there are ultimately $n$ evidence-specific agents $A_1, \dots, A_n$. Each evidence-specific agent then generates an intermediate evidence evaluation result, working independently and relying solely on the statement and its assigned evidence. This design ensures that each evidence-specific agent fully reviews all information in its evidence, reducing the possibility of overlooking details in long contexts, and guarantees that its evaluation is not affected by how often or where the evidence appears in the retrieval results.

[0083] In a single round, each agent $A_i$ generates its intermediate evaluation $a_i^{(t)}$, forming a set $\mathcal{A}^{(t)} = \{a_1^{(t)}, \dots, a_n^{(t)}\}$, where $t$ indicates the current round of the debate. The first round of the debate relies solely on the statement and the evidence, while subsequent rounds also provide each agent with a summary of the previous round.

[0084] (2) Aggregation module: In round $t$, the aggregator receives the results from all evidence-specific agents and generates an interpreted summary of the evidence. To mitigate positional bias, the system randomly orders the agents' responses before analysis. The aggregator evaluates each agent's judgment, detects inconsistencies, and synthesizes them into a summary of the existing evidence. It has a global perspective and can distinguish between valid conflicts (such as entity ambiguity) and factual contradictions caused by misinformation.

[0085] (3) Multi-round debate: To rigorously verify and refine the interpretation of the evidence, this application uses a structured debate mechanism between reasoning agents. The process starts from the summary generated by the aggregator and requires two agents with different roles. Counterfactual reasoning agent (Agent-C): This agent adopts a perturbation-based analytical stance. Its core function is to test the stability of a statement's truth value by systematically exploring hypothetical modifications to the supporting evidence, thereby assessing whether the statement still holds under different evidentiary conditions.

[0086] Standard reasoning agent (Agent-S): This agent constructs arguments strictly based on the provided evidence, using deductive and abductive logic. Its role is to construct arguments that affirm or challenge the statement based on direct logical reasoning from existing factual premises.

[0087] This design maintains a dialectical tension between exploratory, boundary-testing reasoning (performed by Agent-C) and conservative, evidence-based reasoning (performed by Agent-S).

[0088] The debate unfolds through a series of simulated natural language interactions. Let $t$ denote the debate iteration round. In the $t$-th iteration, each agent generates critical content based on the current interpretation state $s^{(t-1)}$ output by the previous iteration $t-1$ (initially, the output of the aggregator): $u_C^{(t)} = f_C\left(s^{(t-1)}; \theta_C\right)$, $u_S^{(t)} = f_S\left(s^{(t-1)}; \theta_S\right)$.

[0089] Here, $f_C$ and $f_S$ represent the functions by which the counterfactual reasoning agent generates its critical content $u_C^{(t)}$ (i.e., the counterfactual reasoning points) and the standard reasoning agent generates its critical content $u_S^{(t)}$ (i.e., the standard arguments); $\theta_C$ represents the learnable parameters of the counterfactual reasoning agent; and $\theta_S$ represents the learnable parameters of the standard reasoning agent. These critiques are exchanged, and each agent then further refines its argument based on the feedback from the other. This iterative process is represented as $s^{(t)} = \mathrm{Update}\left(s^{(t-1)}, u_C^{(t)}, u_S^{(t)}\right)$, where the goal of the update function $\mathrm{Update}$ is to converge to an explanation that is both grounded in the evidence and logically robust.

[0090] Meanwhile, the evidence-specific agents $A_i$ also participate in the above iterative process. In each iteration $t$, each agent receives the results and interpretations from the previous round of aggregation, and re-evaluates and updates its own assessment accordingly: $a_i^{(t)} = g_i\left(\hat{y}^{(t-1)}, s^{(t-1)}\right)$, where $a_i^{(t)}$ represents the updated evaluation result generated by the $i$-th evidence-specific agent in the $t$-th round of debate; $g_i$ denotes the evaluation function of the $i$-th evidence-specific agent; $\hat{y}^{(t-1)}$ indicates the authenticity determination result (or aggregated label) after the $(t-1)$-th round of debate; and $s^{(t-1)}$ represents the explanatory data (or evidence summary) after the $(t-1)$-th round of debate. This mechanism enables evidence-specific agents to defend, refute, or adjust their judgments within the context of continuous dialectical reasoning.

[0091] This two-layer iterative process, with a standard reasoning agent and an evidence-specific agent, facilitates convergence towards a consistent and evidence-supported output. Content lacking supporting evidence is filtered out, while reasonable ambiguity is retained as multiple valid perspectives.

[0092] (4) Judge Agent and Final Judgment: In order to arbitrate the debate and ensure the reliability of the understanding, a judge agent (Agent-J) is introduced. The judge agent performs meta-analysis and is responsible for evaluating the logical consistency, evidence support and argument validity of the final debate results of Agent-C and Agent-S, and makes a judgment in combination with the comprehensive evaluation of the evidence-specific agent.

[0093] The judging agent uses a consensus function: $\Phi\left(u_C^{(T)}, u_S^{(T)}, \mathcal{A}^{(T)}\right) \in \{\text{True}, \text{False}\}$, where $\Phi$ represents the consensus function used by the judging agent; $T$ denotes the final round of debate; $u_C^{(T)}$ represents the final argument generated by the counterfactual reasoning agent in the final round; $u_S^{(T)}$ represents the final argument generated by the standard reasoning agent in the final round; and $\mathcal{A}^{(T)}$ represents the set of intermediate evaluation results generated by the evidence-specific agents. True indicates that the debate has converged, and False indicates that it has not yet converged. A return value of True means that the agents' arguments have reached sufficient consensus and that further iteration would yield diminishing marginal returns, so the debate is terminated.

[0094] Subsequently, the judging agent synthesizes the key, validated viewpoints from all the arguments and passes its judgment to the rewriting module, which generates the final, refined explanation. This rewriting module integrates the insights optimized through dialectical reasoning into a coherent statement, clearly articulating the reasoning basis for the final ruling.

[0095] (5) Convergence and final output: The debate executes at most $T_{\max}$ rounds and follows an early stopping criterion: when all evidence-specific agents and the standard reasoning agent maintain the same evaluation results as in the previous round (indicating convergence), the debate terminates at that iteration $t$. The final answer is jointly determined by the judging agent and the rewriting module.

[0096] The system's final output is a triplet $(\hat{y}, R, \gamma)$, where: $\hat{y} \in \{\text{support}, \text{refute}, \text{insufficient information}\}$ is the result of determining the authenticity of the statement text, indicating respectively that the evidence supports the statement, refutes it, or is insufficient to decide; $R$ is a comprehensive explanatory report, namely the natural language interpretation report of the statement text, which details the underlying evidence, the counterfactual investigation process, and the consensus reached; and $\gamma$ is an optional confidence score, calculated from multi-dimensional indicators such as inter-agent consistency, evidence coverage, and the certainty of the judging agent's self-evaluation.
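The output triplet can be pictured as a small record, as in the sketch below. The text names the indicators behind the confidence score but not how they are combined, so the unweighted mean used here is purely an assumption for illustration, as are the names `Verdict` and `confidence_score`.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

LABELS = ("support", "refute", "insufficient information")


@dataclass(frozen=True)
class Verdict:
    label: str                          # authenticity determination result
    report: str                         # natural language explanation report
    confidence: Optional[float] = None  # optional confidence score


def confidence_score(agent_labels: Sequence[str], final_label: str,
                     cited_evidence: int, total_evidence: int,
                     judge_certainty: float) -> float:
    """Assumed combination: plain mean of three indicators in [0, 1]."""
    if not agent_labels or total_evidence <= 0:
        raise ValueError("need at least one agent label and one evidence item")
    # Inter-agent consistency: share of agents agreeing with the final label.
    consistency = sum(l == final_label for l in agent_labels) / len(agent_labels)
    # Evidence coverage: share of evidence items cited in the report.
    coverage = cited_evidence / total_evidence
    return (consistency + coverage + judge_certainty) / 3


score = confidence_score(["support", "support", "refute", "support"],
                         "support", cited_evidence=3, total_evidence=4,
                         judge_certainty=0.75)
verdict = Verdict(label="support", report="(explanation text)", confidence=score)
```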

[0097] In one or more embodiments of this application, to address the problem that hypothetical modifications in counterfactual reasoning are too abstract and lack specific operation types, the hypothetical modifications include at least one of: deleting factual information in the evidence, replacing entities in the evidence, and modifying visual elements in evidence of the visual modality.

[0098] The factual information refers to the objective facts contained in the evidence, such as time, place, people, and event descriptions. For example, "May 6, 2023" is a piece of factual information. The entities are specific referents in the evidence, such as names of people, places, organizations, times, and values. For example, "a certain place" and "level 7.8" are entities. The visual elements are identifiable components in an image or video, such as objects, people, text, scenes, date watermarks, and background details.
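The three modification types can be illustrated on a toy evidence record. This is only a sketch: the dictionary layout, the function names, and the watermark values are assumptions, and a real system would perturb images or video frames rather than a metadata field.

```python
from copy import deepcopy


def delete_fact(evidence, fact):
    """Remove a piece of factual information from the evidence text."""
    changed = deepcopy(evidence)
    changed["text"] = changed["text"].replace(fact, "")
    return changed


def replace_entity(evidence, old, new):
    """Swap one entity mention for another in the evidence text."""
    changed = deepcopy(evidence)
    changed["text"] = changed["text"].replace(old, new)
    return changed


def modify_visual(evidence, element, value):
    """Alter one visual element (here modelled as a metadata field)."""
    changed = deepcopy(evidence)
    changed["visual_elements"][element] = value
    return changed


evidence = {
    "text": "A level 7.8 earthquake struck a certain place on May 6, 2023.",
    "visual_elements": {"date_watermark": "2023-05-06"},  # hypothetical value
}
variants = [
    delete_fact(evidence, " on May 6, 2023"),
    replace_entity(evidence, "level 7.8", "level 5.0"),
    modify_visual(evidence, "date_watermark", "2021-01-01"),
]
```

Each variant would then be handed to the counterfactual reasoning agent to check whether the statement's verdict changes under that perturbation.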

[0099] (6) Objective function design: The consensus label obtained from the multi-agent debate process is denoted $\hat{y}$. The corresponding supervision target adopts the standard cross-entropy loss: $\mathcal{L} = -\sum_{k} y_k \log p_k$, where $y_k$ represents the true label and $p_k$ indicates the predicted probability of the $k$-th class. This loss measures the difference between the predicted probability distribution and the true distribution. The set of learnable parameters optimized with this loss includes the parameters of the graph structure learning module.
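For a concrete sense of the standard cross-entropy loss, the short sketch below evaluates it for a three-class prediction. The class order and the probabilities are made-up illustration values, and the function name is hypothetical.

```python
import math


def cross_entropy(true_one_hot, predicted_probs):
    """Standard cross-entropy; terms with a zero true label contribute nothing."""
    return -sum(y * math.log(p)
                for y, p in zip(true_one_hot, predicted_probs) if y > 0)


# Classes in the assumed order: support, refute, insufficient information.
loss = cross_entropy([1, 0, 0], [0.7, 0.2, 0.1])
confident_loss = cross_entropy([1, 0, 0], [0.99, 0.005, 0.005])
```

With a one-hot true label the loss reduces to the negative log-probability assigned to the correct class, so the more confident correct prediction yields the smaller loss.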

[0100] As can be seen from the above description, the interpretable adaptive multimodal fact-checking method provided in this application makes counterfactual reasoning operable by specifying three specific modification operations (deleting factual information, replacing entities, and modifying visual elements). It can specifically test the degree of dependence of claims on different types of evidence information, and improve the comprehensiveness and refinement of the robustness assessment of arguments.

[0101] To further address the lack of specific interaction steps and iterative mechanisms in multi-agent debate, and to make the debate process operable and controllable, this application provides an interpretable adaptive multimodal fact-checking method; see Figure 5. Step 200 of the interpretable adaptive multimodal fact-checking method specifically includes the following: Step 210: Construct a heterogeneous graph based on the statement text and the multimodal evidence set.

[0102] Step 220: Determine the meta-path neighbor set corresponding to each node according to the predefined meta-paths in the heterogeneous graph.

[0103] A meta-path is a path in a heterogeneous graph that connects nodes of different types and is used to define higher-order semantic relationships between nodes. For example, "declaration node → text evidence node → visual evidence node" is a meta-path. The meta-path neighbor set refers to the set of all neighboring nodes that can be reached from a given node via a predefined meta-path. This set reflects the associated nodes of a node under a specific semantic path.

[0104] Step 230: Sample each meta-path neighbor set based on preset pruning rules to obtain the sampled meta-path neighbor sets; wherein the pruning rules include: retaining at most a preset number of edges for each node.

[0105] Step 240: Prune the heterogeneous graph based on the sampled meta-path neighbor sets to obtain a subgraph.

[0106] Step 250: Input the subgraph into a graph neural network so that the graph neural network aggregates the features of the neighboring nodes corresponding to each node in the subgraph, and then updates the node representation of each node in the subgraph.

[0107] Specifically, after the heterogeneous graph is constructed, the claims and their associated evidence are integrated into a more compact graph structure. However, the increase in edges may introduce additional noise, so graph pruning is needed to remove irrelevant relationships while retaining valid information. This application introduces a neighbor concept based on meta-paths. Specifically, a meta-path $P$ is defined as a path of the form $v_1 \xrightarrow{R_1} v_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} v_{l+1}$ (abbreviated as $v_1 v_2 \cdots v_{l+1}$), where $v_1, \dots, v_{l+1}$ each represent a different node. This definition describes the composite relation $R = R_1 \circ R_2 \circ \cdots \circ R_l$ between nodes $v_1$ and $v_{l+1}$, where $\circ$ represents the relational composition operator; $R_1, \dots, R_l$ represent the basic relations that constitute the meta-path, such as the "declaration-evidence relationship" and the "evidence-evidence relationship"; and $l$ indicates the number of relations contained in the meta-path. In a heterogeneous graph, each node has a set of neighbors that can be reached via a given meta-path, and this set reflects diverse structural semantics. The meta-path-based neighbor is defined as follows: for the $x$-th node in the heterogeneous graph, its meta-path neighbor set $N_P(x)$ consists of all nodes that can be reached from the $x$-th node via meta-path $P$. Given a meta-path $P$ and a central node, the goal of graph pruning is to optimize the neighbor set $N_P(x)$ so that noise is filtered out. Therefore, a $t$-neighborhood subgraph $G_t$ is sampled from the enhanced heterogeneous graph $G$. The $t$-neighborhood subgraph retains the same nodes as $G$, while each node is limited to retaining at most $t$ edges from its initial neighbor set. The resulting fixed-size subgraph not only supports efficient parallel computation but also facilitates mini-batch training.
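Steps 220 to 250 can be sketched on a tiny graph as follows. This is an illustration under stated assumptions, not the claimed implementation: meta-paths are given as sequences of node types, pruning is done by uniform random sampling (the text only requires a per-node edge cap), and a single mean-pooling step stands in for the graph neural network. All function names and feature values are hypothetical.

```python
import random


def metapath_neighbors(adjacency, node_type, start, metapath):
    """Step 220: nodes reachable from `start` along the node types in `metapath`."""
    if node_type[start] != metapath[0]:
        return set()
    frontier = {start}
    for expected in metapath[1:]:
        frontier = {nbr for node in frontier for nbr in adjacency[node]
                    if node_type[nbr] == expected}
    return frontier


def prune(neighbors, max_edges, seed=0):
    """Steps 230-240: keep at most `max_edges` neighbors (assumed: uniform sampling)."""
    ordered = sorted(neighbors)
    if len(ordered) <= max_edges:
        return ordered
    return sorted(random.Random(seed).sample(ordered, max_edges))


def mean_aggregate(features, node, kept):
    """Step 250 stand-in: average the node's own features with its kept neighbors'."""
    vectors = [features[node]] + [features[n] for n in kept]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


node_type = {"c": "claim", "t1": "text", "t2": "text", "t3": "text", "v1": "image"}
adjacency = {"c": ["t1", "t2", "t3"], "t1": ["c", "v1"], "t2": ["c"],
             "t3": ["c", "v1"], "v1": ["t1", "t3"]}
features = {"c": [1.0, 0.0], "t1": [0.0, 1.0], "t2": [0.0, 1.0],
            "t3": [0.0, 1.0], "v1": [1.0, 1.0]}

text_nbrs = metapath_neighbors(adjacency, node_type, "c", ("claim", "text"))
image_nbrs = metapath_neighbors(adjacency, node_type, "c", ("claim", "text", "image"))
kept = prune(text_nbrs, max_edges=2)
updated_claim = mean_aggregate(features, "c", kept)
```

Capping every node at the same number of neighbors is what gives the fixed-size subgraph that makes batching straightforward.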

[0108] To further address the problem that existing retrieval methods cannot dynamically adjust retrieval depth based on claim complexity, which wastes resources on simple tasks and yields insufficient evidence for complex tasks, this application provides an interpretable adaptive multimodal fact-checking method; see Figure 5. Step 100 of the interpretable adaptive multimodal fact-checking method specifically includes the following: Step 110: Input the declaration text into a pre-trained complexity classifier and output the retrieval complexity label of the declaration text; wherein the types of the retrieval complexity label include simple, medium and complex, which represent the retrieval complexity of the declaration text in ascending order. Step 120: If the type of the retrieval complexity label is simple, determine the authenticity of the declaration text directly, without retrieval.

[0109] Step 130: If the type of the retrieval complexity label is medium, execute the single-step retrieval strategy to obtain the multimodal evidence set of the statement text through a single evidence retrieval; then execute step 200.

[0110] Step 140: If the type of the retrieval complexity label is complex, execute the multi-step iterative retrieval strategy to obtain the multimodal evidence set of the statement text; then execute step 200.

[0111] The multi-step iterative retrieval strategy includes: performing a preset number of iterative retrieval steps for the declaration text; the retrieval steps include: analyzing the information gaps that still need to be verified based on the evidence set of the declaration text in the current iteration and the intermediate conclusions accumulated in the previous iteration, and generating retrieval requirement data for the current iteration accordingly; the retrieval unit performs evidence retrieval based on the retrieval requirement data to obtain new evidence for the current iteration; the new evidence for the current iteration is added to the evidence set to obtain the target evidence set for the current iteration; it is determined whether an evidence set sufficient to verify the authenticity of the declaration has been formed: if yes, the target evidence set for the current iteration is used as the multimodal evidence set of the declaration text; if no, the target evidence set for the current iteration and the updated understanding state are used as the input for the next iteration, and iterative retrieval continues.
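The iterative retrieval loop described above might be sketched as follows. The callables `find_gap`, `retrieve`, and `is_sufficient` are hypothetical stand-ins for the gap analysis, the retrieval unit, and the sufficiency check, and the toy corpus is invented for illustration.

```python
def iterative_retrieval(claim, find_gap, retrieve, is_sufficient, max_steps=3):
    """Run up to `max_steps` retrieval rounds; return (evidence, conclusions, steps)."""
    evidence, conclusions, steps_used = [], [], 0
    for step in range(1, max_steps + 1):
        steps_used = step
        # Analyse what still needs verifying and turn it into a retrieval request.
        query = find_gap(claim, evidence, conclusions)
        # Retrieve and add only evidence that is new in this iteration.
        new_items = [e for e in retrieve(query) if e not in evidence]
        evidence.extend(new_items)
        conclusions.append(
            f"step {step}: searched for {query!r}, found {len(new_items)} item(s)")
        # Stop as soon as the evidence set is judged sufficient.
        if is_sufficient(claim, evidence):
            break
    return evidence, conclusions, steps_used


# Toy stand-ins (assumptions for illustration only).
corpus = {
    "opening year": ["text: the bridge opened in 2020"],
    "length": ["text: the bridge is 2 km long", "image: road sign reading 2 km"],
}
aspects = ["opening year", "length"]


def find_gap(claim, evidence, conclusions):
    searched = " ".join(conclusions)
    return next((a for a in aspects if a not in searched), claim)


evidence, conclusions, steps = iterative_retrieval(
    "The bridge opened in 2020 and is 2 km long",
    find_gap,
    retrieve=lambda query: corpus.get(query, []),
    is_sufficient=lambda claim, ev: len(ev) >= 3,
)
```

The preset step limit bounds the cost even when the sufficiency check never succeeds; in that case the loop returns whatever evidence has accumulated.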

[0112] Specifically, since the complexity of declarations objectively varies, this application designs corresponding operating strategies for declarations of different complexity levels: (1) Simple, straightforward claims (referred to as simple): Claims with a clear structure and a single meaning can be judged directly from the knowledge already held by the underlying LLM. The system therefore adopts a no-retrieval strategy, generating verification results directly from internal knowledge to maximize response speed. This path has significant advantages when processing general and common-sense information.

[0113] (2) Moderate-complexity claims (referred to as medium): For claims that cannot be answered directly, the framework introduces a single-step retrieval-augmented generation strategy. Specifically, a dedicated retrieval model retrieves the corresponding textual evidence and visual evidence according to their relevance to the statement.

[0114] (3) Complex claims (referred to as complex): When dealing with complex claims that require integrating multiple pieces of evidence and performing multi-step reasoning, single-step retrieval strategies face significant limitations. In this case, a reasoning LLM interacts with the retrieval system over multiple rounds. This iterative process allows the model to progressively deepen its understanding of the claim and continuously accumulate intermediate conclusions until a final answer is formed.

[0115] The effectiveness of the above strategy depends on automatically determining the complexity of the declaration. To this end, this application proposes a complexity classifier that acts as a router across the expert modules and is implemented by a fine-tuned small language model. Given a declaration text, the classifier predicts a complexity label: simple declarations requiring no retrieval, medium-complexity declarations requiring a single-step strategy, or complex declarations requiring a multi-step strategy.
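The routing behaviour can be sketched as a simple dispatch on the predicted label. Everything here is a hypothetical stand-in: in particular, `toy_classify` uses word count only so that the example runs, whereas the text describes a fine-tuned small language model.

```python
def route(claim, classify, answer_directly, single_step, multi_step,
          verify_with_evidence):
    """Dispatch a claim to a retrieval strategy based on its complexity label."""
    label = classify(claim)
    if label == "simple":
        # No retrieval: answer from the model's internal knowledge.
        return label, answer_directly(claim)
    if label == "medium":
        evidence = single_step(claim)
    elif label == "complex":
        evidence = multi_step(claim)
    else:
        raise ValueError(f"unknown complexity label: {label!r}")
    # Medium and complex claims continue to graph learning and debate.
    return label, verify_with_evidence(claim, evidence)


def toy_classify(claim):
    """Placeholder classifier (assumption): thresholds on word count."""
    n = len(claim.split())
    return "simple" if n <= 5 else "medium" if n <= 12 else "complex"


handlers = dict(
    classify=toy_classify,
    answer_directly=lambda claim: "direct answer",
    single_step=lambda claim: ["one retrieval pass"],
    multi_step=lambda claim: ["pass 1", "pass 2"],
    verify_with_evidence=lambda claim, ev: f"verified with {len(ev)} item(s)",
)
results = [
    route("Water boils at 100 C", **handlers),
    route("The new bridge opened to traffic in 2020", **handlers),
    route("The bridge opened in 2020, is 2 km long, and cost less "
          "than the old one", **handlers),
]
```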

[0116] As can be seen from the above description, the interpretable adaptive multimodal fact-checking method provided in this application embodiment achieves on-demand allocation of computing resources through adaptive retrieval guided by a complexity classifier: fast response to medium-complexity claims (single-step retrieval), and progressively deeper retrieval for complex claims (multi-step iterative retrieval), thereby improving retrieval efficiency and the completeness of evidence coverage.

[0117] Prior to step 100, the method further includes: Step 010: Obtain a training set containing training declaration texts. Step 020: For each training declaration text in the training set, process it using each type of retrieval strategy, and record whether each type of retrieval strategy meets the preset answer accuracy requirement for that training declaration text; wherein the types of retrieval strategies include, in order of increasing retrieval complexity: the no-retrieval strategy, the single-step retrieval strategy, and the multi-step iterative retrieval strategy. Step 030: If multiple retrieval strategies meet the preset answer accuracy requirement for the training declaration text, use the complexity of the strategy with the lowest retrieval complexity as the complexity label of the training declaration text. Step 040: If exactly one retrieval strategy meets the preset answer accuracy requirement for the training declaration text, use the complexity of that retrieval strategy as the complexity label of the training declaration text. Step 050: If none of the retrieval strategies meets the preset answer accuracy requirement for the training declaration text, output the training declaration text to the annotation expert client device and receive the complexity label of the training declaration text returned by the annotation expert client device.
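The labeling rule in steps 020 to 050 amounts to picking the cheapest strategy that succeeded and falling back to an expert otherwise. A minimal sketch, with hypothetical names and a dictionary of per-strategy outcomes as the assumed input format:

```python
# Labels ordered by increasing retrieval complexity:
# no retrieval, single-step retrieval, multi-step iterative retrieval.
STRATEGY_LABELS = ("simple", "medium", "complex")


def complexity_label(claim, succeeded, ask_expert):
    """Return the training label for one claim.

    `succeeded` maps each label to whether its strategy met the accuracy
    requirement; `ask_expert` is called only when every strategy failed.
    """
    for label in STRATEGY_LABELS:
        if succeeded.get(label, False):
            # Steps 030/040: the lowest-complexity successful strategy wins.
            return label
    # Step 050: no strategy succeeded, defer to a human annotator.
    return ask_expert(claim)


expert = lambda claim: "complex"  # stand-in for the annotation expert client
labels = [
    complexity_label("a", {"simple": True, "medium": True, "complex": True}, expert),
    complexity_label("b", {"simple": False, "medium": True, "complex": True}, expert),
    complexity_label("c", {"simple": False, "medium": False, "complex": True}, expert),
    complexity_label("d", {"simple": False, "medium": False, "complex": False}, expert),
]
```

Because the loop walks the strategies from cheapest to most expensive, the single-success case of step 040 needs no separate branch.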

[0118] Based on this, in one application example, the specific implementation process of the interpretable adaptive multimodal fact-checking method provided in this application may include: a data preparation stage, a model training stage, and a reasoning and interpretation generation stage.

[0119] 1) Data Preparation Stage. This application first inputs a statement, which is natural language text that needs to be verified. Subsequently, a set of candidate multimodal evidence is obtained from internal indexes or external information resources (such as web retrieval systems, knowledge bases, databases, etc.). This set may include: textual evidence (news reports, encyclopedia entries, forum posts, and research reports) and visual evidence (such as relevant images and video frames). The above data can be pre-constructed into a unified index structure to support efficient retrieval.

[0120] 2) Model Training Phase. The model training phase of this application mainly trains two core modules. First, a router for distinguishing statement complexity needs to be trained. The router's training data is constructed by automatically labeling queries based on the actual performance of the three strategies mentioned above. If a statement can be correctly answered by a non-retrieval strategy, it is labeled as a simple statement. In cases where multiple strategies succeed, the simpler strategy is preferred as the label; for example, when both single-step and multi-step strategies succeed, but the non-retrieval strategy fails, the statement is labeled as a medium-complexity statement rather than a complex statement. Second, this application needs to train a heterogeneous graph structure learning module, which is co-optimized with the subsequent inference process. For each training sample, a heterogeneous graph with statements and evidence as nodes is constructed, connecting statement-evidence and evidence-evidence relationships, and graph pruning is performed to reduce noise. Its output participates in the multi-agent verification process, and the final prediction result calculates the error using the cross-entropy loss function, and the gradient is backpropagated to optimize the parameters in the graph neural network.

[0121] 3) Reasoning and Explanation Generation Stage. In the reasoning stage, the complexity type of the input statement is determined by the trained router, and the corresponding retrieval strategy is selected: simple statements generate predictions directly, medium-complexity statements undergo a single-round retrieval, and complex statements undergo multiple rounds of retrieval and iterative reasoning. The retrieved textual and visual evidence are used to construct a heterogeneous graph and undergo pruning. Subsequently, the multi-agent reasoning module independently generates evaluations for each piece of evidence, and the aggregation module summarizes them to form a global summary. The standard reasoning agent and the counterfactual reasoning agent engage in multiple rounds of interactive debate, while the evidence-specific agent updates its judgments synchronously, gradually converging the output. Finally, the judging agent determines whether the debate has converged and, in conjunction with the rewriting module, generates the final truth label and natural language explanation, ensuring that the conclusion is both logically robust and traceable.

[0122] In summary, the embodiments and application examples of this application provide an adaptive, interpretable, and efficient multimodal fact-checking method, enabling accurate verification and transparent reasoning for claims of varying complexity. Its core technical features include: (1) Adaptive retrieval mechanism based on MoE: This application proposes an adaptive multimodal evidence retrieval architecture. Through a learnable complexity classifier, claims are automatically categorized into three levels: simple, medium, and complex, and corresponding retrieval strategies are activated accordingly. This mechanism can adaptively select "no retrieval," "single-step retrieval," or "multi-step iterative retrieval" based on the semantic complexity of the claim, significantly improving system efficiency while ensuring the sufficiency of evidence.

[0123] (2) Verification process of integrating heterogeneous graph reasoning and multi-agent debate: This application innovatively combines heterogeneous graph structure learning with multi-agent collaborative debate to construct a two-stage verification process. The first stage models the complex semantic relationships between claims and multimodal evidence, as well as within the evidence itself, using heterogeneous graphs. The second stage proposes a verification and debate framework composed of multiple types of agents, including evidence-specific agents, standard reasoning agents, counterfactual reasoning agents, aggregation modules, and judging agents. Each agent completes evidence comparison, contradiction identification, reasoning rebuttal, and consistency convergence through multiple rounds of interaction, thereby outputting a final credible conclusion.

[0124] (3) Supports a two-layer reasoning and adjudication process for interpretation generation: This application introduces an explanation generation process that integrates evidence-based reasoning, conflict identification, and consistency judgment, with a multi-agent debate module outputting transparent and traceable reasoning evidence. This structured explanation framework not only improves interpretability but also allows for direct engineering extensions.

[0125] Therefore, this application proposes an adaptive multimodal fact-checking method with interpretability, which significantly improves the accuracy and interpretability of fact-checking in complex evidence scenarios.

[0126] First, this application addresses the inability of existing fact-checking systems to flexibly allocate retrieval resources based on the complexity of claims by introducing a multimodal adaptive retrieval framework based on a mixture-of-experts (MoE) model. This application can dynamically select the no-retrieval, single-step retrieval, or multi-step iterative retrieval strategy according to the complexity of the claim, thereby achieving on-demand allocation of computing resources, improving overall retrieval efficiency, and reducing interference from redundant information.

[0127] Secondly, this application addresses the core problem that traditional multimodal fusion methods tend to overlook the connections between evidence and struggle to capture cross-modal multi-hop relationships by proposing a heterogeneous graph structure learning module. This module explicitly constructs cross-modal reasoning paths by modeling the relationship between declarations and evidence and evidence-evidence relationships. Furthermore, it employs a meta-path-based graph pruning strategy to retain key information nodes, effectively filtering noise and enhancing the model's ability to express fine-grained evidence interactions, thereby obtaining more accurate and reliable factual judgments.

[0128] Finally, to overcome the shortcomings of single-model reasoning, such as a lack of explanatory power and insufficient reliability in the face of conflicting evidence, this application constructs a multi-agent debate framework consisting of evidence-specific agents, a standard reasoning agent, a counterfactual reasoning agent, and a judging agent. This framework not only supports independent evaluation of evidence but also simulates multi-round dialogic scrutiny through a dialectical mechanism that pairs standard reasoning with counterfactual reasoning. This enables the system to identify logical conflicts, discover implicit chains of evidence, and gradually converge to a self-consistent conclusion, providing not only a judgment result but also an interpretable final output.

[0129] From a software perspective, this application also provides an interpretable adaptive multimodal fact-checking system for executing all or part of the aforementioned interpretable adaptive multimodal fact-checking method; see Figure 6. The interpretable adaptive multimodal fact-checking system specifically includes the following: The MoE-based multimodal retrieval module 10 is used to dynamically adjust the evidence retrieval strategy according to the statement text, and to retrieve the multimodal evidence set of the statement text based on the evidence retrieval strategy; wherein the modalities of the evidence in the multimodal evidence set include text and visual. The heterogeneous graph structure learning module 20 is used to construct a heterogeneous graph based on the declaration text and the multimodal evidence set, and to update the node representations of the heterogeneous graph by graph structure learning; wherein the types of nodes in the heterogeneous graph include the declaration node corresponding to the declaration text and the evidence node corresponding to each piece of evidence, and the types of edges in the heterogeneous graph include first-type edges connecting the declaration node and an evidence node, and second-type edges connecting different evidence nodes. The multi-agent debate module 30 is used to input the heterogeneous graph with updated node representations into a multi-agent debate system containing a counterfactual reasoning agent, so as to generate an authenticity judgment result for the statement text and a corresponding natural language interpretation report through multiple rounds of interactive debate among the agents. The natural language interpretation report is used to trace back to at least one of the evidence nodes in the heterogeneous graph. The counterfactual reasoning agent is used to actively generate counterfactual reasoning points by hypothetically modifying the evidence corresponding to an evidence node during the multi-round interactive debate.

[0130] The embodiments of the interpretable adaptive multimodal fact-checking system provided in this application can be used to execute the processing flow of the embodiments of the interpretable adaptive multimodal fact-checking method described above. Its functions are not repeated here; refer to the detailed description of the method embodiments above.

[0131] The interpretable adaptive multimodal fact-checking system described herein can perform the interpretable adaptive multimodal fact-checking portion of the process in either a server or a client device. The choice can be made based on the processing capabilities of the client device and the limitations of the user's usage scenario. This application does not impose any limitations in this regard. If all operations are performed in the client device, the client device may further include a processor for the specific processing of the interpretable adaptive multimodal fact-checking.

[0132] The aforementioned client device may have a communication module (i.e., a communication unit) that can communicate with a remote server to achieve data transmission with the server. The server may include a server on the task scheduling center side; in other implementation scenarios, it may also include a server on an intermediate platform, such as a server on a third-party server platform that has a communication link with the task scheduling center server. The server may include a single computer device, a server cluster consisting of multiple servers, or a distributed server structure.

[0133] The server and the client device can communicate using any suitable network protocol, including those not yet developed as of the date of this application. Such network protocols may include, for example, TCP/IP, UDP/IP, HTTP, and HTTPS. Furthermore, such network protocols may also include RPC (Remote Procedure Call) protocols and REST (Representational State Transfer) style interfaces used on top of the aforementioned protocols.

[0134] Referring to Figure 7, in one application example, the interpretable adaptive multimodal fact-checking system consists of three core modules: (1) a multimodal retrieval module based on a Mixture-of-Experts (MoE) model, which adaptively selects retrieval strategies according to the complexity of the claims; (2) a heterogeneous graph structure learning module for modeling and capturing cross-evidence relevance; and (3) a multi-agent debate framework for generating interpretations and resolving evidence conflicts. The three modules work together to achieve an efficient fact-checking process, from evidence retrieval and information fusion to the final output interpretation.

[0135] Specifically, for the heterogeneous graph structure learning module, this application allows the use of different graph neural network architectures to model the complex relationships between evidence. In addition to the currently used heterogeneous graph neural network based on meta-path sampling, graph attention networks (GAT), graph convolutional networks (GCN), or Transformer-based graph structure learning methods can also be used to capture higher-order, cross-modal, or non-explicit associations between evidence nodes.
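
To illustrate the attention-based alternative mentioned above, the following is a simplified sketch of attention-weighted neighbor aggregation. It is not a full GAT layer: it assumes plain dot-product scores with no learned weight matrices or nonlinearity, and the function name attention_aggregate is hypothetical.

```python
import math


def attention_aggregate(node_vec, neighbor_vecs):
    """Return the attention-weighted sum of neighbor vectors.

    Scores are dot products between the node and each neighbor (an
    assumption made for brevity); weights are a softmax over the scores.
    """
    if not neighbor_vecs:
        return list(node_vec)  # isolated node: keep its own representation
    scores = [sum(a * b for a, b in zip(node_vec, nv)) for nv in neighbor_vecs]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(node_vec)
    return [sum(w * nv[k] for w, nv in zip(weights, neighbor_vecs))
            for k in range(dim)]
```

Neighbors whose representation aligns more closely with the node receive larger weights, which is the property that lets attention-based variants emphasize the most relevant evidence.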

[0136] To further illustrate the effectiveness of the interpretable adaptive multimodal fact-checking method and system provided in the above embodiments of this application, this application also provides corresponding experimental and performance evaluation data, as detailed below. (1) Datasets. The MOCHEG (Multimodal Claim Verification and Explanation Generation) and COSMOS (Context-Oriented Semantic Multimodal Out-of-context Scene) datasets were used as experimental datasets. The COSMOS dataset was constructed through a self-supervised process, drawing its content from a large amount of online news, blogs, and social media. It contains approximately 200,000 images and 450,000 related texts, covering multiple fields such as climate change. The core task of this dataset is not to verify factual claims directly, but to identify common-sense inconsistencies between images and their text titles. To support model evaluation, it provides 1,700 manually annotated image-title triples, making it an important resource for exploring a model's ability to detect the implicit attribution errors and context misalignments that are common in real-world misinformation.

[0137] In contrast, the MOCHEG dataset is specifically designed for evidence-based fact-checking tasks. This dataset collects 15,601 truth-labeled statements from authoritative fact-checking bodies, along with 33,880 pieces of textual evidence and 12,112 pieces of image evidence. The dataset is divided into a training set (11,669 samples), a validation set (1,490 samples), and a test set (2,442 samples). In addition to truth labels, the dataset also provides natural language explanations, so that model evaluation can extend beyond classification accuracy to the quality and coherence of the reasoning process.

[0138] (2) Baseline models. To ensure a comprehensive evaluation, this application selects several classic and cutting-edge methods as comparative baselines. The baseline models cover three typical traditional multimodal fact-checking frameworks: Triple-Check, Ino, and MOCHEG. In addition, this application incorporates five recent methods based on Large Language Models (LLMs) and Large Vision-Language Models (LVLMs): LVLM4FV (an LVLM for fact verification), LLaVA (Large Language and Vision Assistant), LLaVA+PE (LLaVA with cue enhancement), LLaVA+ICL-1 (LLaVA with in-context learning), and CRMFC (cross-modal fact checking). For the explanation generation subtask, this application uses two standard text summarization models as benchmarks: the unsupervised LEAD-3 (which extracts the first three sentences as the summary) and the ORACLE model (which selects the set of sentences that maximizes the ROUGE score against the reference summary; ROUGE is a recall-oriented metric for summary evaluation).
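
The two summarization baselines can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code of this application: sentences are split with a simple regular expression, the oracle is a greedy approximation, and a simplified set-based unigram recall stands in for the official ROUGE score. All function names are hypothetical.

```python
import re


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def lead3(text):
    """LEAD-3: take the first three sentences as the summary."""
    return split_sentences(text)[:3]


def unigram_recall(candidate, reference):
    # Simplified stand-in for ROUGE-1 recall (no clipping, whitespace tokens).
    ref = reference.lower().split()
    if not ref:
        return 0.0
    cand = set(candidate.lower().split())
    return sum(1 for token in ref if token in cand) / len(ref)


def greedy_oracle(text, reference, max_sentences=3):
    """Greedily add the sentence that most improves recall against the reference."""
    chosen, best = [], 0.0
    remaining = split_sentences(text)
    while remaining and len(chosen) < max_sentences:
        score, pick = max(
            (unigram_recall(" ".join(chosen + [s]), reference), i)
            for i, s in enumerate(remaining)
        )
        if score <= best:  # no sentence improves the score any further
            break
        best = score
        chosen.append(remaining.pop(pick))
    return chosen
```

LEAD-3 needs no reference, whereas the oracle uses the reference summary and therefore serves as an upper-bound style comparison rather than a deployable model.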

[0139] (3) Evaluation metrics. Performance evaluation employs standardized metrics designed for each stage of the fact-checking process. For the claim verification (classification) stage, this application uses four metrics: Label Accuracy (LA), Precision (Pre), Recall (Rec), and F1 score. Label accuracy measures overall classification correctness; precision is the proportion of samples predicted as belonging to a class that actually belong to that class; recall is the proportion of samples actually belonging to a class that are correctly identified; and the F1 score, as the harmonic mean of precision and recall, provides a more robust overall evaluation under class imbalance.
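
The four classification metrics can be computed as in the sketch below. Macro averaging over classes is an assumption made here; this application does not state which averaging scheme was used. The function name classification_metrics is hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Return label accuracy and macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return {
        "LA": accuracy,
        "Pre": sum(precisions) / n,  # macro average (assumption)
        "Rec": sum(recalls) / n,
        "F1": sum(f1s) / n,
    }
```

With macro averaging, each class contributes equally regardless of its size, which is why the F1 score is less sensitive to class imbalance than label accuracy.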

[0140] For the explanation generation stage, standard text generation evaluation metrics were used: ROUGE-1 (unigram-based metric), ROUGE-2 (bigram-based metric), ROUGE-L (longest common subsequence-based metric), and BLEU (Bilingual Evaluation Understudy). ROUGE-1 and ROUGE-2 measure the lexical similarity between the generated explanation and the reference explanation based on unigram and bigram overlap, respectively. ROUGE-L, based on the Longest Common Subsequence (LCS), reflects the degree of matching between the generated and reference texts at the level of sentence structure and word order. BLEU measures n-gram precision with a brevity penalty for overly short outputs, thus emphasizing the accuracy of word sequences in the generated sentence. Together, these metrics evaluate the generated explanation in terms of fluency, relevance, and factual coverage.
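
The ROUGE variants described above can be illustrated with a minimal sketch. It assumes whitespace tokenization and reports recall only; standard ROUGE tooling also applies stemming and reports precision and F-measure, which are omitted here. BLEU is not shown. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n):
    """Clipped n-gram overlap divided by the number of reference n-grams."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    return sum((cand & ref).values()) / total  # '&' keeps the minimum counts


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_recall(candidate, reference):
    ref = reference.split()
    return lcs_length(candidate.split(), ref) / len(ref) if ref else 0.0
```

For the candidate "the cat sat on the mat" and the reference "the cat is on the mat", five of six reference unigrams and three of five reference bigrams are matched, and the longest common subsequence has five tokens.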

[0141] (4) Analysis of experimental results. Table 1 (Comparison of Experimental Results) presents the main experimental results of the proposed method compared to the baseline methods. Under all evidence settings, the proposed AMFV (Adaptive Multi-modal Fact Verification) model achieves state-of-the-art performance on both datasets. In the gold-evidence setting of COSMOS, AMFV achieves a label accuracy of 70.71% and an F1 score of 66.92%, significantly outperforming all baseline models. Similarly, in the gold-evidence setting of MOCHEG, it achieves a label accuracy of 66.11% and an F1 score of 64.36%. This result demonstrates the effectiveness of AMFV's adaptive multi-modal fusion mechanism in integrating visual and textual evidence for robust fact verification.

[0142] All models exhibit a consistent trend: performance in the gold-evidence setting is significantly higher than in the retrieval-evidence setting. For example, on the COSMOS dataset, the label accuracy of AMFV drops from 70.71% in the gold-evidence setting to 66.53% in the retrieval-evidence setting, a decrease of 4.18 percentage points. The same trend holds for all baseline models, highlighting the inherent noise and incompleteness of retrieved evidence and underscoring the key challenges of evidence acquisition in real-world scenarios.

[0143] In summary, AMFV achieves optimal performance by proposing a novel fine-grained modeling approach to characterize intramodal and intermodal interactions, thereby enabling more reliable and accurate fact verification under both retrieved-evidence and gold-evidence conditions.

[0144] This application also provides an electronic device, which may include a processor, a memory, a receiver, and a transmitter. The processor is used to execute the interpretable adaptive multimodal fact-checking method mentioned in the above embodiments. The processor and the memory can be connected via a bus or by other means; a bus connection is taken as an example here. The receiver can be connected to the processor and the memory via wired or wireless means.

[0145] The processor can be a central processing unit (CPU). The processor can also be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or combinations of the above types of chips.

[0146] Memory, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions / modules corresponding to the interpretable adaptive multimodal fact-checking method in the embodiments of this application. The processor executes various functional applications and data processing by running the non-transitory software programs, instructions, and modules stored in the memory, thereby implementing the interpretable adaptive multimodal fact-checking method in the above method embodiments.

[0147] The memory may include a program storage area and a data storage area. The program storage area may store the operating system and applications required for at least one function; the data storage area may store data created by the processor, etc. Furthermore, the memory may include high-speed random access memory and non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory may optionally include memory remotely located relative to the processor, which can be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.

[0148] One or more modules are stored in the memory and, when executed by the processor, carry out the interpretable adaptive multimodal fact-checking method in the above embodiments.

[0149] In some embodiments of this application, the user equipment may include a processor, a memory, and a transceiver unit. The transceiver unit may include a receiver and a transmitter. The processor, memory, receiver, and transmitter may be connected via a bus system. The memory is used to store computer instructions, and the processor is used to execute the computer instructions stored in the memory to control the transceiver unit to send and receive signals.

[0150] As one implementation method, the functions of the receiver and transmitter in this application can be implemented by transceiver circuits or dedicated transceiver chips, and the processor can be implemented by dedicated processing chips, processing circuits or general-purpose chips.

[0151] As another implementation approach, the server provided in this application embodiment can be implemented using a general-purpose computer. That is, the program code implementing the processor, receiver, and transmitter functions is stored in memory, and the general-purpose processor implements the processor, receiver, and transmitter functions by executing the code in memory.

[0152] This application also provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the steps of the aforementioned interpretable adaptive multimodal fact-checking method. The computer-readable storage medium may be a tangible storage medium, such as random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, floppy disks, hard disks, removable storage disks, CD-ROMs, or any other form of storage medium known in the art.

[0153] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the aforementioned interpretable adaptive multimodal fact-checking method.

[0154] Those skilled in the art will understand that the exemplary components, systems, and methods described in conjunction with the embodiments disclosed herein can be implemented in hardware, software, or a combination of both. Whether implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application. When implemented in hardware, it can be, for example, electronic circuits, application-specific integrated circuits (ASICs), appropriate firmware, plug-ins, function cards, etc. When implemented in software, the elements of this application are programs or code segments used to perform the required tasks. The programs or code segments can be stored on a machine-readable medium or transmitted over a transmission medium or communication link via data signals carried on a carrier wave.

[0155] It should be clarified that this application is not limited to the specific configurations and processes described above and shown in the figures. For the sake of brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method process of this application is not limited to the specific steps described and shown. Those skilled in the art can make various changes, modifications, and additions, or change the order of steps, after understanding the spirit of this application.

[0156] In this application, features described and / or illustrated for one embodiment may be used in the same or similar manner in one or more other embodiments, and / or combined with or in place of features of other embodiments.

[0157] The above description is merely a preferred embodiment of this application and is not intended to limit this application. Various modifications and variations can be made to the embodiments of this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the protection scope of this application.

Claims

1. An interpretable adaptive multimodal fact-checking method, characterized in that it includes:
The evidence retrieval strategy is dynamically adjusted based on the statement text, and a multimodal evidence set of the statement text is retrieved based on the evidence retrieval strategy; wherein the modalities of the evidence in the multimodal evidence set include text and visual;
A heterogeneous graph is constructed based on the statement text and the multimodal evidence set, and the node representations of the heterogeneous graph are updated by graph structure learning; wherein the types of nodes in the heterogeneous graph include the statement node corresponding to the statement text and the evidence node corresponding to each piece of evidence; the types of edges in the heterogeneous graph include first-type edges connecting the statement node and an evidence node, and second-type edges connecting different evidence nodes;
The heterogeneous graph with updated node representations is input into a multi-agent debate system containing a counterfactual reasoning agent. Through multiple rounds of agent interactive debate, a veracity determination result for the statement text and a corresponding natural language interpretation report are generated. The natural language interpretation report is used to trace back to at least one of the evidence nodes in the heterogeneous graph. The counterfactual reasoning agent is used to actively generate counterfactual reasoning points by hypothetically modifying the evidence corresponding to an evidence node during the multi-round agent interactive debate.

2. The interpretable adaptive multimodal fact-checking method according to claim 1, characterized in that the multi-agent debate system also includes evidence-specific agents, a standard reasoning agent, an aggregation module, and a judging agent, which participate in the multiple rounds of agent interactive debate together with the counterfactual reasoning agent;
Each of the evidence-specific agents corresponds to one piece of evidence in the multimodal evidence set and is used to generate an evaluation result on the correlation between that evidence and the statement text;
The aggregation module is used to summarize the evaluation results generated by each of the evidence-specific agents, and to generate global evidence summary data and corresponding explanation data based on the evaluation results;
The standard reasoning agent is used to construct standard arguments that support or refute the statement text based on the global evidence summary data;
The judging agent is used to evaluate, based on the global evidence summary data and the explanation data, whether each round of debate in the agent interactive debate process has converged, and to trigger the generation of the veracity determination result of the statement text and the corresponding final natural language interpretation report when convergence is achieved.

3. The interpretable adaptive multimodal fact-checking method according to claim 2, characterized in that the multi-agent debate system generates the veracity determination result and the corresponding natural language interpretation report for the statement text through the following steps:
First-round debate step: each evidence-specific agent generates an initial evaluation result based on the statement text and its uniquely corresponding evidence; the aggregation module summarizes all initial evaluation results and generates global evidence summary data and corresponding explanation data for the first debate round;
Convergence judgment step: the judging agent evaluates, based on the received global evidence summary data and corresponding explanation data of the current debate round, whether the current debate round has converged or the preset maximum number of rounds has been reached. If so, the explanation data of the current debate round is output as the natural language interpretation report of the statement text, and the corresponding veracity determination result of the statement text is also output. If not, the subsequent debate step is executed to obtain the global evidence summary data and corresponding explanation data of the next debate round;
The subsequent debate step includes: the global evidence summary data from the previous debate round is input into the standard reasoning agent and the counterfactual reasoning agent, respectively. The standard reasoning agent generates the standard argument, and the counterfactual reasoning agent generates the counterfactual reasoning argument. The standard argument and the counterfactual reasoning argument are fused through an update function to obtain the updated interpretation state for the current debate round. The updated interpretation state serves as the input for the standard reasoning agent and the counterfactual reasoning agent in the next debate round, enabling them to generate the standard argument and the counterfactual reasoning argument for the next round based on the updated interpretation state. Each evidence-specific agent re-evaluates its uniquely corresponding evidence based on the global evidence summary data from the previous debate round, the updated counterfactual reasoning arguments for the current debate round, and the updated standard arguments, thereby obtaining the evaluation result for the current debate round. The aggregation module summarizes all evaluation results of the current debate round and combines them with the updated standard arguments and the updated counterfactual reasoning arguments of the current debate round to generate global evidence summary data and corresponding explanation data for the current debate round. The global evidence summary data and corresponding explanation data of the current debate round are then input into the judging agent to execute the convergence judgment step.
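
The control flow of the debate steps in claim 3 can be sketched as a loop over pluggable agents. This is an illustration only: every agent is passed in as a plain callable, and the function name run_debate, its parameter names, and the data passed between agents are hypothetical simplifications rather than the implementation of this application.

```python
def run_debate(statement, evidence, evaluate, aggregate, standard_agent,
               counterfactual_agent, update, has_converged, max_rounds=3):
    """Return (summary, explanation, rounds_used) for a statement."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    # First-round debate step: one evaluation per piece of evidence.
    evaluations = [evaluate(statement, ev, None) for ev in evidence]
    summary, explanation = aggregate(evaluations, None)
    state = summary  # assumption: the first reasoning round starts from the summary
    round_no = 1
    while True:
        # Convergence judgment step, performed by the judging agent.
        if has_converged(summary, explanation) or round_no >= max_rounds:
            return summary, explanation, round_no
        # Subsequent debate step: standard and counterfactual arguments.
        std_arg = standard_agent(state)
        cf_arg = counterfactual_agent(state)
        state = update(std_arg, cf_arg)  # fused, updated interpretation state
        # Evidence-specific agents re-evaluate with the new arguments.
        context = (summary, std_arg, cf_arg)
        evaluations = [evaluate(statement, ev, context) for ev in evidence]
        summary, explanation = aggregate(evaluations, state)
        round_no += 1
```

Passing the agents as callables keeps the loop independent of any particular language model; the maximum round count guarantees termination even if the judging agent never reports convergence.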

4. The interpretable adaptive multimodal fact-checking method according to claim 1, characterized in that the hypothetical modifications include at least one of: deleting factual information from the evidence, replacing entities in the evidence, and modifying visual elements in evidence whose modality is visual.
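
For textual evidence, the first two kinds of hypothetical modification can be illustrated with a minimal sketch. It assumes the evidence is already split into sentences and that entity replacement is a plain string substitution; modification of visual elements is not covered. The function name counterfactual_variants is hypothetical.

```python
def counterfactual_variants(sentences, entity_swaps):
    """Build hypothetical variants of one piece of textual evidence.

    sentences: list of sentences making up the evidence.
    entity_swaps: mapping from an entity string to its replacement.
    Returns a list of (modification_type, modified_text) pairs.
    """
    variants = []
    # Deleting factual information: drop one sentence at a time.
    for i in range(len(sentences)):
        kept = sentences[:i] + sentences[i + 1:]
        variants.append(("delete", " ".join(kept)))
    # Replacing entities: substitute each listed entity that occurs.
    text = " ".join(sentences)
    for old, new in entity_swaps.items():
        if old in text:
            variants.append(("replace", text.replace(old, new)))
    return variants
```

Each variant can then be put to the debate as a "what if" question: if the verdict would change once a fact is removed or an entity is swapped, that fact or entity is material to the conclusion.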

5. The interpretable adaptive multimodal fact-checking method according to claim 1, characterized in that updating the node representations of the heterogeneous graph by graph structure learning includes:
Based on the predefined meta-paths in the heterogeneous graph, the meta-path neighbor set corresponding to each node is determined;
Based on preset pruning rules, each meta-path neighbor set is sampled to obtain a corresponding sampled meta-path neighbor set; wherein the pruning rules include: retaining no more than a preset number of edges for each node;
The heterogeneous graph is pruned based on the sampled meta-path neighbor sets to obtain a subgraph;
The subgraph is input into a graph neural network, which aggregates the features of the neighboring nodes corresponding to each node in the subgraph, thereby updating the node representation of each node in the subgraph.
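
The pruning and aggregation steps of claim 5 can be sketched as follows. Several assumptions are made for illustration: the meta-path neighbor sets are given as input with a relevance score per neighbor, sampling is replaced by keeping the highest-scoring neighbors, and the graph neural network is reduced to a single mean aggregation. The function names are hypothetical.

```python
def prune_neighbors(neighbors, max_edges):
    """Keep at most max_edges neighbors per node, highest score first.

    neighbors: {node: [(neighbor, score), ...]} meta-path neighbor sets.
    """
    return {
        node: sorted(cands, key=lambda pair: pair[1], reverse=True)[:max_edges]
        for node, cands in neighbors.items()
    }


def mean_aggregate(features, pruned):
    """Update each node as the mean of itself and its retained neighbors."""
    updated = {}
    for node, kept in pruned.items():
        vecs = [features[node]] + [features[nb] for nb, _ in kept]
        dim = len(features[node])
        updated[node] = [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]
    return updated
```

Capping the number of retained edges bounds the cost of each aggregation step and keeps low-relevance evidence from diluting a node's representation.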

6. The interpretable adaptive multimodal fact-checking method according to claim 1, characterized in that the step of dynamically adjusting the evidence retrieval strategy based on the statement text and retrieving a multimodal evidence set of the statement text based on the evidence retrieval strategy includes:
The statement text is input into a pre-trained complexity classifier, which outputs a retrieval complexity label for the statement text; wherein the types of the retrieval complexity label include simple, medium, and complex, which represent the retrieval complexity of the statement text in ascending order;
If the type of the retrieval complexity label is medium, a single-step retrieval strategy is executed to obtain the multimodal evidence set of the statement text through a single evidence retrieval;
If the type of the retrieval complexity label is complex, a multi-step iterative retrieval strategy is executed to obtain the multimodal evidence set of the statement text;
The multi-step iterative retrieval strategy includes: performing a preset number of iterations of retrieval steps on the statement text; the retrieval steps include:
Based on the evidence set of the statement text in the current iteration and the intermediate conclusions accumulated in the previous iteration, the information gap that still needs to be verified is analyzed, and retrieval requirement data for the current iteration is generated according to the information gap;
A retriever performs evidence retrieval based on the retrieval requirement data to obtain new evidence for the current iteration;
The new evidence for the current iteration is added to the evidence set to obtain the target evidence set for the current iteration;
It is determined whether a target evidence set sufficient to verify the truth or falsehood of the statement has been formed: if yes, the target evidence set for the current iteration is used as the multimodal evidence set of the statement text; if no, the target evidence set for the current iteration and the updated understanding state are used as the input for the next iteration, and the iterative retrieval continues.
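
The strategy selection and iterative retrieval of claim 6 can be sketched as a dispatcher. The classifier, retriever, gap analysis, and sufficiency check are passed in as callables; all names are hypothetical. Two assumptions are labeled in the code: the simple label is treated as "no retrieval", which claim 6 does not state, and the queries already issued stand in for the accumulated intermediate conclusions.

```python
def retrieve_evidence(statement, classify, retrieve, find_gap, is_sufficient,
                      max_iterations=3):
    """Return the evidence set for a statement under an adaptive strategy."""
    label = classify(statement)
    if label == "simple":
        return []  # assumption: no retrieval for simple statements
    if label == "medium":
        return retrieve(statement)  # single-step retrieval
    if label != "complex":
        raise ValueError("unknown complexity label: %r" % (label,))
    evidence, conclusions = [], []
    for _ in range(max_iterations):
        # Analyze what still needs verification and form a retrieval request.
        query = find_gap(statement, evidence, conclusions)
        evidence = evidence + retrieve(query)
        if is_sufficient(statement, evidence):
            break  # enough evidence to verify the statement
        conclusions.append(query)  # assumption: stand-in for intermediate conclusions
    return evidence
```

Simple and medium statements cost at most one retrieval call, while only complex statements pay for the iterative loop, which reflects the on-demand allocation of retrieval effort.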

7. The interpretable adaptive multimodal fact-checking method according to claim 6, characterized in that, before dynamically adjusting the evidence retrieval strategy based on the statement text and retrieving a multimodal evidence set of the statement text based on the evidence retrieval strategy, the method further includes:
A training set containing a plurality of training statement texts is obtained;
Each training statement text in the training set is processed with different types of retrieval strategies, and whether the evidence retrieved by each type of retrieval strategy meets the preset answer accuracy requirement for the corresponding training statement text is recorded; wherein the types of retrieval strategies include, in order of increasing retrieval complexity: a no-retrieval strategy, the single-step retrieval strategy, and the multi-step iterative retrieval strategy;
If multiple retrieval strategies meet the preset answer accuracy requirement for the training statement text, the complexity of the retrieval strategy with the lowest retrieval complexity among them is used as the complexity label of the training statement text;
If only one retrieval strategy meets the preset answer accuracy requirement for the training statement text, the complexity of that retrieval strategy is used as the complexity label of the training statement text;
If none of the retrieval strategies meets the preset answer accuracy requirement for the training statement text, the training statement text is output to the annotation expert client device, and the complexity label of the training statement text returned by the annotation expert client device is received.
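
The labeling rule of claim 7 reduces to picking the least complex strategy that passes, with a fallback to expert annotation. In the sketch below, the strategy keys, the mapping from strategy to label, and the function name assign_label are hypothetical; the mapping of the no-retrieval strategy to the simple label is an assumption inferred from the stated ordering.

```python
# Strategies in order of increasing retrieval complexity.
STRATEGIES = ["none", "single_step", "multi_step"]
# Assumed correspondence between strategy and complexity label.
LABELS = {"none": "simple", "single_step": "medium", "multi_step": "complex"}


def assign_label(passed, ask_expert):
    """Return the complexity label for one training statement.

    passed: {strategy: bool}, whether each strategy met the accuracy requirement.
    ask_expert: callable used when no strategy met the requirement.
    """
    for strategy in STRATEGIES:
        if passed.get(strategy):
            return LABELS[strategy]  # the cheapest passing strategy wins
    return ask_expert()
```

Because the strategies are scanned from cheapest to most expensive, the cases of one passing strategy and several passing strategies are handled by the same loop.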

8. The interpretable adaptive multimodal fact-checking method according to claim 2, characterized in that the judging agent is also used to output a confidence score corresponding to the veracity determination result when the debate rounds converge. The confidence score is determined based on at least one of the following indicators:
The degree of consistency among the evaluation results generated by each of the evidence-specific agents;
The coverage of evidence associated with the statement text in the multimodal evidence set;
The self-evaluation result of the judging agent on its own judgment.
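
One way to combine the three indicators of claim 8 is a weighted average, sketched below. The combination rule, the weights, the definition of consistency as the share of the majority verdict, and the function names are all assumptions made for illustration; this application does not specify how the indicators are combined.

```python
from collections import Counter


def agreement(verdicts):
    """Share of evidence-specific agents that hold the majority verdict."""
    if not verdicts:
        return 0.0
    return Counter(verdicts).most_common(1)[0][1] / len(verdicts)


def confidence(verdicts, covered, total, self_score, weights=(0.4, 0.3, 0.3)):
    """Weighted average of consistency, evidence coverage, and self-evaluation.

    covered / total: evidence items associated with the statement vs. all items.
    self_score: the judging agent's self-evaluation, assumed to lie in [0, 1].
    weights: hypothetical weights; any non-negative values with a positive sum work.
    """
    coverage = covered / total if total else 0.0
    parts = (agreement(verdicts), coverage, self_score)
    return sum(w * p for w, p in zip(weights, parts)) / sum(weights)
```

Since each indicator lies between 0 and 1 and the weights are normalized, the resulting score also lies between 0 and 1.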

9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the interpretable adaptive multimodal fact-checking method as described in any one of claims 1 to 8.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when executed by a processor, the computer program implements the interpretable adaptive multimodal fact-checking method as described in any one of claims 1 to 8.