Multi-modal learning diagnosis method and system based on multi-agent and triangular mutual authentication

By using a distributed multi-agent architecture and triangular mutual verification reasoning based on dynamic knowledge graphs, the problems of semantic fusion, interpretability, and temporal misalignment in multimodal learning diagnosis are solved, enabling efficient and accurate learner cognitive diagnosis and personalized intervention.

CN122241280A · Pending · Publication Date: 2026-06-19 · SHANDONG NORMAL UNIV


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: SHANDONG NORMAL UNIV
Filing Date: 2026-03-23
Publication Date: 2026-06-19

IPC: G06F18/2321; G06N5/022; G06N5/043; G06N3/092

AI Technical Summary

Technical Problem

Existing multimodal learning diagnostic technologies suffer from problems such as difficulty in semantic fusion, poor interpretability, contradiction between real-time response and resource consumption, and difficulty in handling timing misalignment. These issues result in low accuracy and reliability of diagnostic results, an inability to provide immediate intervention, and excessive system load.

Method Used

A distributed multi-agent architecture is adopted for local silent monitoring and asynchronous triggering. Visual diagnostic, language diagnostic and behavioral diagnostic agents process multimodal data respectively, and a global diagnostic process is initiated when an anomaly is detected. Combined with dynamic knowledge graph, triangulation reasoning is performed to generate interpretable diagnostic conclusions and implement adaptive hierarchical intervention.

Benefits of Technology

It effectively reduces network bandwidth and server load, improves diagnostic accuracy and interpretability, enables personalized and timely intervention, and enhances learning efficiency.

Abstract

This invention proposes a multimodal learning diagnostic method and system based on multiple agents and triangulation, relating to the fields of artificial intelligence and smart education. It addresses the problems of existing technologies, such as difficulty in semantic fusion and lack of interpretability, the contradiction between real-time performance and resource overhead, and difficulties in root cause tracing due to temporal misalignment. This method collects multimodal learning data from learners. Multiple analytical agents perform local silent monitoring and feature extraction on the multimodal learning data. When any analytical agent detects a local modality anomaly, it generates an asynchronous trigger signal and sends it to a central diagnostic agent. The central diagnostic agent receives and aggregates the data reported by each analytical agent, constructs a diagnostic context, maps it to a dynamic knowledge graph, executes triangulation reasoning, and generates a final diagnostic conclusion. The teaching agent executes an adaptive hierarchical intervention strategy, generating and outputting intervention content. This invention solves the problems of existing technologies and improves learning diagnostic capabilities.

Description

Technical Field

[0001] This invention belongs to the field of artificial intelligence and smart education technology, and in particular relates to a multimodal learning diagnostic method and system based on multiple agents and triangulation.

Background Technology

[0002] The statements in this section are merely background information related to the present invention and do not necessarily constitute prior art.

[0003] With the increasing application of intelligent learning guidance systems in scientific experiments, programming instruction, and complex skills training, traditional assessment methods based on click logs and answer results are no longer sufficient to meet the diagnostic needs of learners' deep cognitive states. Therefore, in order to more comprehensively capture the learning process, multimodal learning analysis and diagnostic technologies that integrate eye tracking, speech, and operational behavior have emerged and are gradually becoming the mainstream research in this field.

[0004] Currently, multimodal educational diagnostic techniques are mainly divided into three categories: The first category is early fusion methods based on feature vector concatenation, which simply concatenate feature vectors from different modalities and input them into classification models such as Support Vector Machines (SVM) or Random Forests. This type of method is simple to implement but struggles to handle temporal misalignment and semantic conflicts between modalities. The second category is late fusion methods based on end-to-end deep learning, which use convolutional neural networks (CNNs), recurrent neural networks (RNNs), or attention mechanisms to learn multimodal data end-to-end, such as two-stream networks and multimodal LSTMs. These methods can learn complex nonlinear mappings but suffer from black-box characteristics and lack interpretability. The third category is expert system methods based on rule bases, which use manually generated rules for reasoning and diagnosis. This method offers strong interpretability, but the rule base coverage is limited and maintenance costs are high.

[0005] Existing technologies therefore still have several shortcomings: (1) Deep semantic fusion of multi-source heterogeneous data is difficult and lacks interpretability. In complex learning tasks, visual gaze, verbal expression, and operational behavior are often asynchronous and semantically ambiguous. For example, a learner's gaze may precede the operation, while verbal expression may lag behind the thought process. Existing mainstream solutions usually rely on feature vector concatenation or end-to-end deep neural networks for black-box processing. Such early or late fusion strategies cannot effectively handle semantic conflicts between modalities. Specifically, when a learner's gaze is on the correct point but the operation is wrong, such as looking at the correct tool but clicking the wrong button, the system cannot accurately distinguish a cognitive misunderstanding from an operational error. When the learner's verbal intention is inconsistent with the actual behavior, the system has difficulty judging whether the cause is a conceptual misunderstanding or operational negligence. This ambiguity greatly reduces the accuracy and credibility of the diagnostic results. In addition, black-box models cannot show educators the specific chain of evidence behind a judgment that a learner has 'cognitive confusion' or a 'conceptual misunderstanding'. The diagnostic results therefore lack transparency, struggle to gain the trust of teachers and learners, and fail to provide effective guidance for subsequent teaching improvement.

[0006] (2) There is an irreconcilable contradiction between the system's real-time response capability and computational resource overhead. In order not to miss learners' abnormal moments, traditional systems often need to upload and centrally process high-frequency eye-tracking data (usually with a sampling rate of 60-120Hz), real-time speech streams (16kHz sampling), and operation logs in full and continuously. This all-time, full-data processing mode brings the following problems: 1) The network bandwidth pressure is enormous. Each learner generates hundreds of KB of eye-tracking data per second, and the network load surges when multiple people are using it concurrently. 2) The server-side computing load is too heavy. The central server needs to process multimodal data streams from multiple learners simultaneously, which can easily become a system bottleneck. 3) Data backlog leads to diagnostic delays. In high-concurrency scenarios, data queue backlog can cause diagnostic response delays of several seconds or even tens of seconds, missing the optimal intervention window. This delay often means that the system can only provide summative evaluations after learners have completed tasks, rather than providing immediate scaffolding interventions when errors occur, greatly reducing the practical value of intelligent tutoring systems.

[0007] (3) Difficulty in effectively handling temporal misalignment between modalities. In real-world learning scenarios, there is a natural time difference in the generation of data from different modalities. For example, learners may have already expressed their intentions through eye movements and speech before performing an operation, but existing systems typically slice and fuse data based on fixed time windows, making it difficult to flexibly capture such causal relationships. When the system detects an operational error, it has often lost the key visual cues and speech intentions from the seconds prior to the error, making it impossible to accurately trace the root cause of the error.

Summary of the Invention

[0008] To overcome the shortcomings of the existing technologies, this invention provides a multimodal learning diagnostic method and system based on multi-agent and triangulation. It uses distributed analytical agents for local silent monitoring and asynchronous triggering, and only initiates historical data backtracking and spatiotemporal alignment when an anomaly is detected. Then, it performs triangulation reasoning based on a dynamic knowledge graph to generate interpretable diagnostic conclusions and implement adaptive hierarchical intervention. This solves the problems of semantic fusion difficulties, poor interpretability, contradiction between real-time response and resource consumption, and difficulty in handling temporal misalignment in existing multimodal diagnostic technologies.

[0009] To achieve the above objectives, one or more embodiments of the present invention provide the following technical solutions:
The first aspect of this invention provides a multimodal learning diagnostic method based on multi-agent and triangulation verification, comprising:
collecting learners' multimodal learning data;
multiple analytical agents performing local silent monitoring and feature extraction on the corresponding multimodal learning data, and caching the extracted features in a local time sliding window;
based on the extracted features, when any analytical agent detects a local modal anomaly, generating an asynchronous trigger signal and sending it to the central diagnostic agent;
the central diagnostic agent sending evidence retrieval instructions to the other analytical agents based on the asynchronous trigger signal, receiving and aggregating the data reported by each analytical agent, and constructing a diagnostic context;
the central diagnostic agent mapping the diagnostic context to a dynamic knowledge graph and performing triangulation reasoning to generate the final diagnostic conclusion;
based on the final diagnostic conclusion and the learner's current cognitive state, the teaching agent executing an adaptive tiered intervention strategy and generating and outputting intervention content.

[0010] As one implementation method, multimodal learning data of learners is collected. This multimodal learning data includes at least eye-tracking data, voice interaction data, and operational behavior data. The specific process is as follows:
The learner's eye-tracking data is collected using an infrared eye tracker. The eye-tracking data includes at least the learner's fixation point coordinates and pupil diameter data;
A high-fidelity microphone array and a front-end noise reduction algorithm are used to acquire the learner's think-aloud speech, i.e., voice interaction data;
Operational behavior data is recorded by embedding API hooks in the teaching software. This data includes at least mouse clicks, hovering, dragging, and keyboard input commands.

[0011] As one implementation method, multiple analytical agents perform local silent monitoring and feature extraction on the corresponding multimodal learning data. These analytical agents are visual diagnostic agents, language diagnostic agents, and behavioral diagnostic agents. The specific process is as follows:
Based on eye-tracking data, the visual diagnostic agent uses a clustering algorithm to calculate the visual attention vector and monitor the gaze entropy value;
The language diagnostic agent performs speech recognition and natural language understanding on voice interaction data, and extracts key concept words;
The behavioral diagnostic agent uses a sequence pattern mining algorithm to analyze the operation sequence queue and detect inefficient behavioral patterns.

[0012] As one implementation method, based on eye-tracking data, the visual diagnostic agent uses a clustering algorithm to calculate the visual attention vector and monitor the gaze entropy value. The specific process is as follows:
Extract the eye-tracking coordinate sequence containing screen coordinates and timestamps from the set time sliding window;
Calculate the spatiotemporal distance between each pair of points in the eye-tracking coordinate sequence;
Based on the spatiotemporal distance, count the number of points in the ε-neighborhood of each point and determine the core points;
Starting from an unvisited core point, recursively group all density-reachable points within its ε-neighborhood into the same cluster, forming a gaze cluster;
Mark all points not assigned to any cluster as noise points and filter them out;
Calculate the centroid coordinates and fixation duration for each gaze cluster;
Construct a visual attention vector based on the centroid coordinates and fixation durations;
Based on the proportion of gaze duration in each region of interest, use a visual attention analysis algorithm to calculate the gaze entropy value.
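The final step above can be illustrated with a minimal sketch. The text only says that the gaze entropy is computed from the proportion of gaze duration per region of interest; using Shannon entropy over those proportions is an assumption made here, and the function name `gaze_entropy` and the AOI names are hypothetical.

```python
import math


def gaze_entropy(dwell_ms):
    """Shannon entropy (in bits) of the dwell-time proportions over AOIs.

    dwell_ms maps an AOI name to the total fixation duration spent in it.
    Assumption: the patent's "gaze entropy" is Shannon entropy; the source
    text does not give the formula.
    """
    total = sum(dwell_ms.values())
    if total <= 0:
        return 0.0
    entropy = 0.0
    for duration in dwell_ms.values():
        if duration > 0:
            p = duration / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical AOIs: attention concentrated on one region vs. spread evenly.
focused = {"canvas": 9000, "layer_panel": 500, "toolbar": 500}
scattered = {"canvas": 3400, "layer_panel": 3300, "toolbar": 3300}
print(gaze_entropy(focused) < gaze_entropy(scattered))
```

Under this assumption a low value indicates concentrated attention, while a value close to log2 of the number of AOIs indicates scattered gaze.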

[0013] As one implementation method, based on the extracted features, when any analytical agent detects a local modal anomaly, it generates an asynchronous trigger signal and sends it to the central diagnostic agent. The specific process is as follows:
When the gaze entropy value exceeds the preset entropy threshold, or the dynamic time warping distance between the gaze trajectory and the preset expert pattern exceeds the preset distance threshold, a visual anomaly is determined;
When a learner's expressed intention conflicts with the tool's logic or matches a preset misconception template, a language anomaly is determined;
When an operation sequence matches a preset inefficient behavior pattern, or its similarity to a preset expert path is lower than a preset path similarity threshold, a behavioral anomaly is determined;
Based on the anomaly determination, an asynchronous trigger signal is generated and sent to the central diagnostic agent.
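The visual anomaly rule can be sketched as follows. This is a textbook dynamic time warping (DTW) implementation with Euclidean point cost, not necessarily the variant used in the patent; the function names and the threshold values in the usage lines are hypothetical.

```python
import math


def dtw_distance(a, b):
    """Classic O(n*m) DTW distance between two 2-D point sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # repeat b[j-1]
                                    cost[i][j - 1],      # repeat a[i-1]
                                    cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def is_visual_anomaly(entropy, trajectory, expert_pattern,
                      entropy_threshold, distance_threshold):
    """Anomaly if either the entropy rule or the DTW rule fires."""
    if entropy > entropy_threshold:
        return True
    return dtw_distance(trajectory, expert_pattern) > distance_threshold


# Hypothetical thresholds: the trajectory is 5 px away from the expert pattern.
print(is_visual_anomaly(0.5, [(0, 0)], [(3, 4)], 1.0, 4.0))  # True
```

DTW tolerates differences in pace, so a learner who follows the expert path more slowly is not flagged merely for timing.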

[0014] As one implementation method, the central diagnostic agent sends evidence retrieval instructions to other analytical agents based on an asynchronous trigger signal, receives and aggregates the data reported by each analytical agent, and constructs a diagnostic context. The specific process is as follows:
The central diagnostic agent calculates the backtracking time window based on the asynchronous trigger signal;
The central diagnostic agent broadcasts a data backtracking instruction containing the backtracking time window to the other analytical agents, excluding the trigger source;
The analytical agents respond to the instruction, retrieve the historical feature data within the backtracking time window from their local sliding windows, and report it along with the trigger source features to the central diagnostic agent;
The central diagnostic agent performs timestamp alignment and spatial coordinate normalization on the received data to construct a diagnostic context under a unified spatiotemporal coordinate system.
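A minimal sketch of the backtracking and alignment steps is shown here. The `Feature` record, the 5-second lookback, and the normalization to the range 0 to 1 by screen size are all assumptions for illustration; the text does not specify how the backtracking window is calculated.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    t_ms: int        # timestamp in milliseconds
    modality: str    # "gaze", "speech" or "behavior"
    payload: dict    # modality-specific feature values


def backtrack_window(trigger_t_ms, lookback_ms=5000):
    """Backtracking window ending at the trigger (lookback is hypothetical)."""
    return trigger_t_ms - lookback_ms, trigger_t_ms


def retrieve(window_features, start_ms, end_ms):
    """What an analytical agent reports from its local sliding window."""
    return [f for f in window_features if start_ms <= f.t_ms <= end_ms]


def build_context(reports, start_ms, screen_w, screen_h):
    """Align timestamps to the window start and normalize screen coordinates."""
    context = []
    for f in reports:
        payload = dict(f.payload)
        if "x" in payload and "y" in payload:
            payload["x"] = payload["x"] / screen_w
            payload["y"] = payload["y"] / screen_h
        context.append(Feature(f.t_ms - start_ms, f.modality, payload))
    return sorted(context, key=lambda f: f.t_ms)


start, end = backtrack_window(10_000)
reports = [Feature(9000, "behavior", {"event": "click"}),
           Feature(6000, "gaze", {"x": 960, "y": 540})]
for item in build_context(reports, start, 1920, 1080):
    print(item)
```

Sorting on the aligned timestamps puts gaze, speech, and behavior evidence on one timeline, which is what lets the later reasoning see that a gaze event preceded a click.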

[0015] As one implementation method, the central diagnostic agent maps the diagnostic context to a dynamic knowledge graph and performs triangulation reasoning to generate a final diagnostic conclusion. The specific process is as follows:
Construct a dynamic knowledge graph, which includes a domain ontology subgraph, a reverse cognitive bias model subgraph, and a teaching strategy model subgraph;
The central diagnostic agent maps historical feature data in the diagnostic context into evidence nodes in the dynamic knowledge graph;
Retrieve candidate diagnostic hypotheses that have a path associated with the evidence nodes from the reverse cognitive bias model;
Calculate the explanatory power score for each candidate diagnostic hypothesis;
Select the candidate diagnostic hypothesis that has the highest explanatory power score and exceeds the preset confirmation threshold as the final diagnostic conclusion;
When multiple candidate diagnostic hypotheses with similar scores exceed the preset confirmation threshold, trigger an active detection mechanism that generates discriminative questions to obtain new evidence.
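The selection logic above can be sketched as follows. The explanatory power score is not defined in this passage, so the sketch assumes a simple coverage ratio (the share of observed evidence nodes that a hypothesis is linked to); the threshold, the tie margin, and all evidence and hypothesis names are hypothetical.

```python
def explanatory_power(linked_evidence, observed):
    """Assumed score: fraction of observed evidence the hypothesis explains."""
    if not observed:
        return 0.0
    return len(linked_evidence & observed) / len(observed)


def select_diagnosis(candidates, observed, confirm_threshold=0.6, tie_margin=0.1):
    """Return ("confirmed", name), ("probe", names) or ("undetermined", None)."""
    scored = sorted(((explanatory_power(ev, observed), name)
                     for name, ev in candidates.items()), reverse=True)
    confirmed = [(s, n) for s, n in scored if s > confirm_threshold]
    if not confirmed:
        return "undetermined", None
    best = confirmed[0][0]
    rivals = [n for s, n in confirmed if best - s < tie_margin]
    if len(rivals) > 1:
        # Similar scores: ask a discriminative question to gather new evidence.
        return "probe", sorted(rivals)
    return "confirmed", confirmed[0][1]


observed = {"gaze_on_correct_tool", "wrong_button_click", "verbal_intent_correct"}
candidates = {
    "operational_slip": {"gaze_on_correct_tool", "wrong_button_click",
                         "verbal_intent_correct"},
    "concept_misunderstanding": {"wrong_button_click"},
}
print(select_diagnosis(candidates, observed))
```

In this example only the first hypothesis is linked to evidence from all three modalities, which is the triangulation idea: a conclusion is accepted when gaze, speech, and behavior evidence agree on it.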

[0016] As one implementation method, the teaching agent executes an adaptive tiered intervention strategy based on the final diagnostic conclusion and the learner's current cognitive load index, generating and outputting intervention content. The specific process is as follows:
Set multiple progressive intervention levels;
Estimate the current cognitive load index based on pupil diameter variation data and operational responses;
Use the final diagnostic conclusion, the cumulative number of historical errors, and the cognitive load index as the state vector and input it into a pre-trained reinforcement learning model to obtain the optimal intervention level;
When the cognitive load index exceeds the preset high load threshold, select the direct teaching level.
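A minimal sketch of the level selection follows, assuming the pre-trained reinforcement learning model is a tabular Q-function keyed by a discretized state. Apart from the direct teaching level, the level names, the load buckets, and the thresholds are hypothetical.

```python
# Progressive levels; only "direct_teaching" is named in the text.
LEVELS = ["hint", "guided_question", "worked_step", "direct_teaching"]


def load_bucket(load, medium_from=0.4):
    """Coarse discretization of the cognitive load index (assumed 0..1)."""
    return "medium" if load >= medium_from else "low"


def select_level(q_table, diagnosis, error_count, load, high_load_threshold=0.8):
    """Pick the intervention level for the current state."""
    if load > high_load_threshold:
        # Rule from the text: high load overrides the learned policy.
        return "direct_teaching"
    state = (diagnosis, min(error_count, 3), load_bucket(load))
    q_values = q_table.get(state)
    if not q_values:
        return LEVELS[0]  # unseen state: start with the lightest level
    return max(LEVELS, key=lambda level: q_values.get(level, 0.0))


q_table = {("concept_misunderstanding", 2, "medium"):
           {"hint": 0.2, "worked_step": 0.7}}
print(select_level(q_table, "concept_misunderstanding", 2, 0.5))
print(select_level(q_table, "concept_misunderstanding", 2, 0.95))
```

Placing the high-load rule ahead of the table lookup keeps the override deterministic regardless of what the model has learned.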

[0017] As one implementation method, after the intervention is implemented, the system enters an inhibition and reset state, monitors the learner's behavioral correction, records closed-loop data, and optimizes the intervention strategy by updating the reinforcement learning model. The specific process is as follows:
Monitor learner behavior after the intervention and record closed-loop data;
Update the Q-value of the reinforcement learning model using the closed-loop data to optimize the intervention strategy;
Identify typical cognitive error patterns from new diagnostic cases and dynamically update the reverse cognitive bias model subgraph.
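The Q-value update can be sketched with the standard tabular Q-learning rule. The text only says that the Q-value is updated from closed-loop data, so the specific rule, the learning rate `alpha`, the discount `gamma`, and the reward values are assumptions.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: Q += alpha * (r + gamma * max Q' - Q)."""
    actions = q_table.setdefault(state, {})
    current = actions.get(action, 0.0)
    best_next = max(q_table.get(next_state, {}).values(), default=0.0)
    actions[action] = current + alpha * (reward + gamma * best_next - current)
    return actions[action]


# Hypothetical closed-loop record: the learner corrected the error after a hint.
q_table = {}
value = q_update(q_table, state="s_before", action="hint",
                 reward=1.0, next_state="s_after")
print(value)
```

A positive reward when the learner corrects the behavior raises the value of the chosen level for that state, so later selections in similar states favor it.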

[0018] A second aspect of the present invention provides a multimodal learning diagnostic system based on multi-agent and triangulation verification, comprising:
The multimodal perception module, used to collect learners' multimodal learning data;
The distributed analysis module, in which multiple analytical agents perform local silent monitoring and feature extraction on the corresponding multimodal learning data and cache the extracted features in a local time sliding window. Based on the extracted features, when any analytical agent detects a local modal anomaly, it generates an asynchronous trigger signal and sends it to the central diagnostic agent;
The cognitive fusion module, in which the central diagnostic agent sends evidence retrieval instructions to other analytical agents based on asynchronous trigger signals, receives and aggregates the data reported by each analytical agent, and constructs a diagnostic context. The central diagnostic agent maps the diagnostic context to a dynamic knowledge graph and performs triangulation reasoning to generate the final diagnostic conclusion;
The intervention module, in which the teaching agent executes adaptive tiered intervention strategies based on the final diagnostic conclusion and the learner's current cognitive state, and generates and outputs intervention content.

[0019] The above one or more technical solutions have the following beneficial effects: In this embodiment, a distributed multi-agent architecture is designed, including at least visual diagnostic agents, language diagnostic agents, and behavioral diagnostic agents. Employing silent monitoring and asynchronous triggering mechanisms, the system performs lightweight monitoring locally under normal conditions, without uploading data to the central node. This significantly reduces network bandwidth usage and the computational load on the central server. Compared to traditional full data upload methods, it reduces network bandwidth usage by approximately 80%-90% and server-side computational load by approximately 70%-85%. The global diagnostic process is only initiated when an anomaly is detected, achieving efficient edge-cloud collaboration.

[0020] In this embodiment, by using a historical data backtracking mechanism and maintaining a time sliding window locally in each agent, historical data within the key time window can be retrieved after an anomaly is detected. This allows for accurate capture of the causal chain of the error, effectively solving the problem of misaligned timing of multimodal data. The system can capture key contextual information before the occurrence of an abnormal event, and compared with the fixed time window method, the diagnostic accuracy is improved by about 15%-25%.

[0021] In this embodiment, a triangulation mechanism based on a dynamic knowledge graph searches for hypotheses that can simultaneously explain the evidence from all modalities, which effectively handles semantic conflicts between modalities. The diagnostic accuracy is improved by approximately 10%-20% compared to black-box deep learning methods, and a complete chain of evidence is provided, achieving high interpretability. Furthermore, the dynamic knowledge graph has adaptive growth capabilities: it automatically mines and updates the cognitive bias model from new diagnostic cases, so system performance continues to improve over time.

[0022] In this embodiment, an adaptive tiered intervention strategy is used to achieve precise and personalized intervention. Based on the diagnosed type of misunderstanding and the learner's cognitive load, the optimal intervention level is dynamically selected. This approach can correct errors in a timely manner while avoiding excessive intervention that could disrupt the learning flow. Compared with a fixed intervention method, the learning efficiency is improved by about 20%-30%.

[0023] Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Description of the Drawings

[0024] The accompanying drawings, which form part of this invention, are used to provide a further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute an improper limitation of the invention.

[0025] Figure 1 is a flowchart of the multimodal learning diagnostic method based on multi-agent and triangulation mutual verification in Embodiment 1;
Figure 2 is a schematic diagram of the framework of the multimodal learning diagnostic system based on multi-agent and triangulation mutual verification in Embodiment 2;
Figure 3 is a schematic diagram illustrating the logical reasoning of the triangulation mechanism in this embodiment;
Figure 4 is a schematic diagram of the visual diagnostic agent's silent monitoring and anomaly detection process in Embodiment 1.

Detailed Implementation

[0026] It should be noted that the following detailed descriptions are exemplary and intended to provide further illustration of the invention. Unless otherwise specified, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains.

[0027] It should be noted that the terminology used herein is for the purpose of describing particular implementations only and is not intended to limit the exemplary implementations of the present invention.

[0028] Where there is no conflict, the embodiments and features in the embodiments of the present invention can be combined with each other.

[0029] Embodiment 1: This embodiment discloses a multimodal learning diagnostic method based on multi-agent and triangulation mutual verification.

[0030] To illustrate this embodiment more clearly, the implementation of multimodal learning diagnosis based on multi-agent and triangulation verification is described below. The method includes:
S1. Collect learners' multimodal learning data;
S2. Multiple analytical agents perform local silent monitoring and feature extraction on the corresponding multimodal learning data, and cache the extracted features in a local time sliding window. Based on the extracted features, when any analytical agent detects a local modal anomaly, it generates an asynchronous trigger signal and sends it to the central diagnostic agent;
S3. The central diagnostic agent sends evidence retrieval instructions to other analytical agents based on asynchronous trigger signals, receives and aggregates the data reported by each analytical agent, and constructs a diagnostic context. The central diagnostic agent maps the diagnostic context to a dynamic knowledge graph and performs triangulation reasoning to generate the final diagnostic conclusion;
S4. The teaching agent executes an adaptive tiered intervention strategy based on the final diagnostic conclusion and the learner's current cognitive state, and generates and outputs intervention content.

[0031] S5. After the intervention is implemented, the system enters an inhibition and reset state. The learner's behavior is monitored and closed-loop data is recorded. The intervention strategy is optimized by updating the reinforcement learning model.

[0032] This embodiment applies to intelligent learning guidance systems, online education platforms, virtual learning environments, and other educational scenarios requiring real-time cognitive diagnosis. The system integrates multimodal data such as eye-tracking, voice interaction, and operational behavior to achieve accurate diagnosis and personalized intervention of learners' deep cognitive states.

[0033] As shown in Figure 2, a multi-agent learning and diagnostic system is designed based on the distributed cognitive theory architecture, logically divided into four layers: multimodal perception layer (L1), distributed analysis layer (L2), cognitive fusion layer (L3), and executive intervention layer (L4). The layers exchange data through a collaborative communication bus and share a centralized dynamic knowledge graph database.

[0034] As shown in Figure 1 and Figure 2, in step S1, multimodal learning data of learners is collected, wherein the multimodal learning data includes at least eye-tracking data, voice interaction data, and operational behavior data.

[0035] In the multimodal perception layer (L1), physical environment data, i.e., multimodal learning data, is collected and digitally mapped. The multimodal perception layer (L1) includes an eye-tracking acquisition unit, a speech acquisition unit, and an interaction log acquisition unit.

[0036] The process of collecting learners' multimodal learning data is as follows: (1) Use an infrared eye tracker to collect learner eye tracking data, wherein the eye tracking data includes at least the learner's gaze coordinates and pupil diameter data.

[0037] The eye-tracking acquisition unit is equipped with a high frame rate infrared eye tracker (sampling rate 60Hz or higher) to capture the learner's gaze coordinates and pupil diameter data in real time.

[0038] (2) High-fidelity microphone array and front-end noise reduction algorithm are used to obtain learners’ spoken thinking data, i.e., voice interaction data.

[0039] In the speech acquisition unit, a high-fidelity microphone array and front-end noise reduction algorithm are used to capture the learner's think-aloud speech. Specifically, the speech is acquired through a high-fidelity microphone array, and after front-end noise reduction, speech enhancement, and voice activity detection (VAD), acoustic features and semantic keywords are extracted, structured, and cached in a local sliding window for diagnostic use.

[0040] A microphone array is a data acquisition device consisting of multiple (usually 4-8) microphones arranged in a specific geometric shape (linear or circular).

[0041] (3) Record operation behavior data by embedding API hooks in teaching software. The operation behavior data includes at least mouse clicks, hovering, dragging and keyboard input commands.

[0042] In the interaction log collection unit, mouse clicks, hovering, dragging, and keyboard input commands are recorded in real time through API hooks embedded in the teaching software. Specifically, custom API hook code is embedded in the front-end interface layer or event handling layer of the teaching software and bound to the software's original event listening mechanism to ensure that all user interaction events can be captured. When an operation event occurs, the learner's mouse, keyboard, and interface interaction events are monitored in real time through API hooks, capturing key information such as event type, operation object, and timestamp, and caching it in a structured manner for local anomaly detection and historical backtracking.
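The hook mechanism can be illustrated with a small sketch in which a decorator wraps an existing event handler and records the event type, operation object, and timestamp before the original handler runs. The decorator name `hook`, the record fields, and the handler `on_click` are hypothetical; a real teaching application would bind to its own UI framework's event system.

```python
import time
from functools import wraps


def hook(event_type, log, clock=time.time):
    """Wrap a handler so that each call is logged before it executes."""
    def decorate(handler):
        @wraps(handler)
        def wrapper(target, *args, **kwargs):
            log.append({
                "type": event_type,            # e.g. click, hover, drag, key
                "target": target,              # the operation object
                "t_ms": int(clock() * 1000),   # timestamp in milliseconds
            })
            return handler(target, *args, **kwargs)
        return wrapper
    return decorate


event_log = []  # stands in for the local structured cache


@hook("click", event_log)
def on_click(target):
    return f"clicked {target}"


print(on_click("brush_tool"))
print(event_log[-1]["type"], event_log[-1]["target"])
```

Because the wrapper calls the original handler unchanged, the software's behavior is unaffected while every interaction is still captured.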

[0043] After the above steps, multimodal learning data, including eye-tracking data, voice interaction data, and operational behavior data, can be obtained, providing a data foundation for subsequent operations.

[0044] As shown in Figure 1 and Figure 2, in step S2, multiple analytical agents perform local silent monitoring and feature extraction on the corresponding multimodal learning data and cache the extracted features in a local time sliding window. Based on the extracted features, when any analytical agent detects a local modal anomaly, it generates an asynchronous trigger signal and sends it to the central diagnostic agent.

[0045] A distributed multi-agent architecture is designed, including a visual diagnostic agent, a language diagnostic agent, a behavior diagnostic agent, and a central diagnostic agent. Each analysis layer agent independently performs lightweight feature monitoring locally, and only exits its silent state and sends an asynchronous trigger signal to the central agent when a local modality anomaly is detected.

[0046] S201. Multiple analytical agents perform local silent monitoring and feature extraction on the corresponding multimodal learning data.

[0047] The distributed analysis layer (L2) includes multiple dedicated analysis agents working in parallel, namely visual diagnostic agents, language diagnostic agents, and behavioral diagnostic agents. Each agent is equipped with an independent computing unit to perform local silent monitoring and feature extraction. Each agent maintains a 10-second FIFO time sliding window locally to store preprocessed feature data and does not send any data to the central agent when no anomalies are detected.
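The 10-second FIFO window each agent maintains can be sketched with a double-ended queue. The class name `SlidingWindow` and its methods are hypothetical; only the 10-second span and the FIFO behavior come from the text.

```python
from collections import deque


class SlidingWindow:
    """Time-based FIFO buffer that keeps only the most recent span_ms of data."""

    def __init__(self, span_ms=10_000):
        self.span_ms = span_ms
        self._items = deque()

    def append(self, t_ms, feature):
        """Add a feature and evict everything older than the window span."""
        self._items.append((t_ms, feature))
        while self._items and self._items[0][0] < t_ms - self.span_ms:
            self._items.popleft()

    def between(self, start_ms, end_ms):
        """Features inside a backtracking window, oldest first."""
        return [f for t, f in self._items if start_ms <= t <= end_ms]

    def __len__(self):
        return len(self._items)


window = SlidingWindow()
for t in (0, 5_000, 10_000, 10_001):
    window.append(t, {"t": t})
print(len(window))  # the sample at t=0 has been evicted
```

Nothing leaves the agent while the buffer is only being filled, which matches the silent monitoring described above; `between` is the operation a backtracking request would use.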

[0048] Multiple analytical agents perform local silent monitoring and feature extraction on their respective multimodal learning data. The specific process is as follows: (1) Based on eye-tracking data, the visual diagnostic agent uses a clustering algorithm to calculate the visual attention vector and monitor the gaze entropy value.

[0049] As shown in Figure 4, the visual diagnostic agent incorporates the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm and AOI mapping logic to calculate visual attention vectors in real time and monitor gaze entropy values. Specifically, the visual diagnostic agent calculates the dwell time distribution of gaze points across the AOIs (such as the canvas, layer panel, and toolbar) in real time and constructs visual attention vectors from it. DBSCAN discovers clusters by identifying high-density regions in the data space, on the assumption that clusters are high-density regions separated by low-density regions. Unlike traditional algorithms such as K-means, DBSCAN does not require the number of clusters to be specified in advance, can discover clusters of arbitrary shape, and can effectively identify noise points.

[0050] Specifically, 1) Extract the eye-tracking coordinate sequence containing screen coordinates and timestamps from the set time sliding window.

[0051] Extract the eye-tracking coordinate sequence {(x_i, y_i, t_i)} from a sliding time window (set to 10 seconds), where x_i and y_i are screen coordinates (pixels) and t_i is a timestamp (milliseconds). A 10-second window can capture the complete cycle of most task operations.

[0052] 2) Calculate the spatiotemporal distance of each pair of eye-tracking coordinate sequence points.

[0053] For each pair of eye-tracking points (p, q), if the time difference between the eye-tracking points is within 200ms, then the Euclidean distance is calculated as the spatiotemporal distance. That is, when |t_p-t_q|≤200ms, the spatiotemporal distance formula is: d(p, q)=√[(x_p-x_q)²+(y_p-y_q)²]; Otherwise, the distance is set to infinity to prevent misclustering across scans, i.e., when |t_p-t_q|>200ms, d(p,q)=∞.

[0054] 3) Based on the spatiotemporal distance, calculate the number of points in the ε-neighborhood of each point and determine the core point.

[0055] For each point p, its ε-neighborhood N_ε(p) = {q | d(p,q) ≤ ε} is calculated. If |N_ε(p)| ≥ MinPts, then p is marked as a core point. Spatial indexing structures such as KD-trees can be used to reduce the cost of each neighborhood query to O(log n) on average.

[0056] 4) Starting from the unvisited core point, recursively group all density-reachable points in its ε-neighborhood into the same cluster to form a gaze cluster.

[0057] Starting from any unvisited core point p, recursively add all density-reachable points to the same cluster. Density reachability is defined as follows: if there exists a sequence of points p_1, p_2, ..., p_n such that p_1 = p, p_n = q, each of p_1, ..., p_{n-1} is a core point, and p_{i+1} ∈ N_ε(p_i), then q is density-reachable from p.

[0058] 5) Mark all points that are not assigned to any cluster as noise points and perform noise filtering.

[0059] All points not assigned to any cluster are marked as noise points. These points typically correspond to oversampling points during saccades or outliers caused by eye tracker calibration errors.

[0060] 6) Calculate the centroid coordinates and fixation duration for each fixation cluster.

[0061] The centroid is the average coordinate of all points within the cluster, and the duration is the number of points in the cluster multiplied by the sampling interval of 16.67 ms (60 Hz sampling), as shown in the formulas: (x̄_i, ȳ_i) = (Σx_j / |C_i|, Σy_j / |C_i|), T_i = |C_i| × (1000 ms / 60) ≈ |C_i| × 16.67 ms; where C_i is the i-th fixation cluster, the sums run over the points j in C_i, (x̄_i, ȳ_i) is the centroid, and T_i is the duration.
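
Steps 2) to 6) can be illustrated with the following sketch. It is written under several assumptions: the values `eps=30` pixels and `min_pts=4` used in the usage note are hypothetical (the text does not fix ε or MinPts), neighborhoods are found by brute force rather than with a KD-tree to keep the example short, and the function names are illustrative.

```python
import math


def st_dist(p, q, max_dt_ms=200):
    """Spatiotemporal distance between gaze samples p, q = (x, y, t_ms)."""
    if abs(p[2] - q[2]) > max_dt_ms:
        return math.inf  # too far apart in time to belong to the same fixation
    return math.hypot(p[0] - q[0], p[1] - q[1])


def dbscan(points, eps, min_pts):
    """Return one label per point: cluster id 0, 1, ... or -1 for noise."""
    n = len(points)
    # Brute-force eps-neighborhoods (each neighborhood includes the point itself).
    neighbours = [[j for j in range(n) if st_dist(points[i], points[j]) <= eps]
                  for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        frontier = list(neighbours[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                frontier.extend(neighbours[j])  # only core points expand the cluster
    return labels


def fixation_clusters(points, labels, sample_ms=1000 / 60):
    """Centroid and duration of each cluster; noise points (-1) are dropped."""
    clusters = {}
    for (x, y, _t), label in zip(points, labels):
        if label != -1:
            clusters.setdefault(label, []).append((x, y))
    result = []
    for label in sorted(clusters):
        pts = clusters[label]
        cx = sum(x for x, _ in pts) / len(pts)
        cy = sum(y for _, y in pts) / len(pts)
        result.append({"centroid": (cx, cy), "duration_ms": len(pts) * sample_ms})
    return result
```

For example, calling `dbscan(samples, eps=30, min_pts=4)` on a window of `(x, y, t)` samples and passing the labels to `fixation_clusters` yields one centroid and one duration per fixation, with isolated saccade samples excluded.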

[0062] 7) Construct a visual attention vector based on the centroid coordinates and fixation duration.

[0063] The visual attention analysis algorithm is based on the information entropy theory proposed by Claude Shannon, the father of information theory, in 1948. Information entropy is a core concept in information theory, used to quantify the uncertainty or randomness of information. The visual diagnostic agent constructs a visual attention vector based on the proportion of gaze duration at each AOI (Area of Interest), calculates the gaze entropy value using the information entropy formula, and judges the degree of attention distraction through a preset threshold to detect visual abnormalities. Specifically: First, determine the region of interest based on the centroid coordinates and fixation duration.

[0064] Based on the centroid coordinates of the gaze point obtained from DBSCAN clustering, the region of interest (AOI) to which it belongs is determined. For the Photoshop interface, four core AOIs are defined: Canvas, LayerPanel, Toolbar, and PropertyPanel. Each AOI is represented by a rectangular bounding box, and mapping can be completed by simply determining the coordinate range.

[0065] Secondly, the fixation durations of all objects belonging to the same region of interest are summed up to obtain the total fixation duration.

[0066] The duration of all fixations belonging to the same AOI is summed. For example, if a learner has three fixations in the canvas area with durations of 200ms, 150ms, and 300ms respectively, the total fixation duration on the canvas is 650ms. This step converts spatially distributed fixations into a temporally distributed attention metric.

[0067] Subsequently, the gaze duration percentage for each region of interest is calculated and normalized to obtain the visual attention vector.

[0068] Calculate the gaze duration percentage for each region of interest (AOI) to construct the attention vector V_attention, with the normalization formula as follows: w_i=T_i / ΣT_j; Where T_i is the gaze duration of AOI_i, and w_i is the percentage of gaze duration for each region of interest. The normalized attention vector satisfies the probability distribution requirement (the sum of all w_i is 1), and can be directly used for entropy calculation.

[0069] 8) Calculate the gaze entropy value using a visual attention analysis algorithm based on the proportion of gaze duration for each region of interest.

[0070] The gaze entropy value is calculated using a visual attention analysis algorithm, using the following formula: H = -Σ(w_i × log2(w_i)); Where H is the gaze entropy value, ranging from [0, log2(n)], and n is the number of AOIs. H=0 indicates that attention is completely focused on a single AOI; H=log2(n) indicates that attention is evenly distributed across all AOIs.
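
Steps 7) and 8) can be sketched as below. The AOI rectangles in `AOI_BOXES` are hypothetical pixel coordinates for a 1920×1080 layout and are not taken from the text; the function names are likewise illustrative. Terms with w_i = 0 are skipped, following the usual convention that 0 × log2(0) is taken as 0.

```python
import math

# Hypothetical AOI rectangles (left, top, right, bottom) in pixels, 1920x1080 layout.
AOI_BOXES = {
    "Toolbar": (0, 0, 60, 1080),
    "Canvas": (60, 0, 1560, 1080),
    "PropertyPanel": (1560, 0, 1920, 540),
    "LayerPanel": (1560, 540, 1920, 1080),
}


def map_to_aoi(x, y, boxes=AOI_BOXES):
    """Return the name of the AOI whose rectangle contains (x, y), or None."""
    for name, (left, top, right, bottom) in boxes.items():
        if left <= x < right and top <= y < bottom:
            return name
    return None


def attention_vector(fixations, boxes=AOI_BOXES):
    """fixations: list of ((cx, cy), duration_ms). Returns {aoi: w_i}, sum of w_i = 1."""
    totals = {name: 0.0 for name in boxes}
    for (cx, cy), duration in fixations:
        aoi = map_to_aoi(cx, cy, boxes)
        if aoi is not None:
            totals[aoi] += duration  # sum durations of fixations in the same AOI
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("no fixation falls inside any AOI")
    return {name: t / grand_total for name, t in totals.items()}


def gaze_entropy(weights):
    """Shannon entropy H = -sum(w_i * log2(w_i)); zero weights contribute nothing."""
    return -sum(w * math.log2(w) for w in weights if w > 0)
```

With three canvas fixations of 200 ms, 150 ms, and 300 ms and one 650 ms fixation on the layer panel, the vector has weight 0.5 on each of the two AOIs and the entropy is 1 bit, halfway between full focus (0) and an even spread over four AOIs (2).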

[0071] (2) The language diagnostic agent performs speech recognition and natural language understanding on the speech interaction data and extracts key concept words.

[0072] The language diagnostic agent integrates an ASR (Automatic Speech Recognition) engine and a BERT-based Natural Language Understanding (NLU) model, and uses entity mapping to map speech keywords to the domain ontology. Specifically, the language diagnostic agent converts speech into text using ASR; extracts key concept words (such as hidden, transparent, and erase) using the BERT model; maps these to the domain ontology; applies semantic relationship reasoning to detect concept conflicts and anomalies; and caches the results for diagnostic use.

[0073] In particular, concept conflict anomaly detection is achieved by judging whether there is a logical contradiction between keywords and tool-operation relationships in the domain ontology, or by matching them with preset erroneous concept templates.

[0074] (3) The behavioral diagnosis agent uses a sequence pattern mining algorithm to analyze the operation sequence queue and detect inefficient behavioral patterns.

[0075] The Generalized Sequential Pattern (GSP) algorithm is a classic sequence pattern mining algorithm, proposed in 1996 by Ramakrishnan Srikant and Rakesh Agrawal of IBM Almaden Research Center. This algorithm extends the Apriori algorithm to sequence data and can discover frequently occurring patterns in sequence databases.

[0076] 1) The behavioral diagnosis agent performs symbolic encoding on the operational behavior data.

[0077] The original operation events are mapped to predefined symbol sequences, forming a unified encoding system: L: Click the layer thumbnail (Layer); M: Click on the mask thumbnail; B: Select the Brush tool; E: Select the Eraser tool; D: Draw / erase operation; U: Undo.

[0078] Other: More operation symbols can be added based on task requirements.

[0079] Maintain a fixed-length FIFO operation queue locally (e.g., window_size=10) to continuously store the symbol encoding sequence of the last 10 operations for use in real-time pattern mining.

[0080] 2) Based on expert knowledge and teaching experience, a library of typical inefficient behavior patterns is predefined as a matching benchmark for anomaly detection.

[0081] The typical inefficient behavior pattern library is as follows: (U,U): consecutive undo, indicating hesitation in decision-making or operational errors; (D,U,D): draw-undo-draw repetition, indicating repeated attempts when dissatisfied with the result; (B,E,B): tool switching oscillation, indicating uncertainty about tool selection; (M,L,M,L): layer-mask toggle oscillation, indicating that the learner is unclear about which object is currently being edited.

[0082] The parameters are configured as follows: min_support=3 (the pattern must appear in at least 3 different operation sequence windows); window_size=10 (the sliding window length is 10 steps); max_gap=2 (at most 2 operations may be skipped between consecutive pattern elements); min_length=2 (the minimum pattern length is 2, since a single operation does not form a pattern).

[0083] S202. Based on the extracted features, when any analytical agent detects a local modal anomaly, it generates an asynchronous trigger signal and sends it to the central diagnostic agent.

[0084] (1) When the gaze entropy value exceeds the preset entropy value threshold or the dynamic time warping distance between the gaze trajectory and the preset expert mode exceeds the preset distance threshold, it is judged as a visual abnormality.

[0085] 1) As shown in Figure 4, when the gaze entropy value exceeds the preset entropy threshold, it is determined to be a visual abnormality.

[0086] The degree of attentional distraction is determined by a preset entropy threshold. Specifically, based on cognitive psychology research, a diagnostic threshold is designed as follows: If H < 0.5, then the student is highly focused and in a normal learning state. If 0.5 ≤ H < 1.5, then attention is moderately distracted, which is normal multitasking. If H ≥ 1.5, then attention is excessively scattered, which may indicate being lost or confused.

[0087] 2) When the dynamic time warping (DTW) distance between the gaze trajectory and the preset expert pattern exceeds the preset distance threshold, it is judged as a visual abnormality.

[0088] An abnormal signal is triggered when the line of sight deviates from the expert mode, i.e., when the DTW distance is too large.

[0089] First, build an expert model.

[0090] Domain experts perform standard learning tasks while their gaze transitions between the Areas of Interest (AOIs) (canvas, layers panel, toolbar, etc.) are recorded, forming an expert gaze trajectory sequence P_expert = [AOI_1, AOI_2, ..., AOI_n], which is pre-stored in the system.

[0091] Secondly, extract the learner's trajectory.

[0092] From the current 10-second sliding window, based on the DBSCAN clustering results, extract the learner's gaze region transition sequence P_current = [AOI_1', AOI_2', ..., AOI_m'].

[0093] Then, the minimum cumulative distance between the two sequences is calculated.

[0094] The DTW algorithm uses dynamic programming to find the optimal alignment path between P_current and P_expert, allowing for flexible scaling of the time axis, and calculates the minimum cumulative distance DTW(P_current, P_expert) between the two sequences. Unlike ordinary Euclidean distance, DTW can handle situations where the two sequences have different lengths and inconsistent velocities, making it suitable for time-series data such as eye-tracking trajectories.

[0095] Finally, a threshold determination is performed.

[0096] If DTW(P_current, P_expert) > θ_dtw (preset distance threshold, which can be calibrated experimentally, for example, set to 50 pixels per frame), then it is determined that the learner's gaze trajectory deviates significantly from the expert mode, triggering a visual abnormality signal.
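
The DTW comparison can be sketched with the standard dynamic-programming recurrence. One assumption should be stated clearly: the text does not define the local cost between two AOIs, so this example uses a 0/1 mismatch cost on AOI labels. Under that assumption the resulting distance counts mismatched alignments, and a threshold expressed in pixels (such as the 50 pixels per frame mentioned above) would require a coordinate-based cost function instead, which can be passed in through the `cost` parameter.

```python
def dtw_distance(seq_a, seq_b, cost=lambda a, b: 0.0 if a == b else 1.0):
    """Minimum cumulative alignment cost between two sequences (classic DTW).

    The default 0/1 label-mismatch cost is an illustrative assumption.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # d[i][j] = best cost of aligning the first i items of seq_a with the first j of seq_b
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(seq_a[i - 1], seq_b[j - 1])
            d[i][j] = c + min(d[i - 1][j],      # seq_a advances, seq_b element repeated
                              d[i][j - 1],      # seq_b advances, seq_a element repeated
                              d[i - 1][j - 1])  # both advance
    return d[n][m]
```

A learner who visits the same AOIs as the expert but lingers longer on some of them gets a distance of 0, which is the time-axis flexibility that motivates using DTW rather than an element-by-element comparison.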

[0097] That is, when a divergent gaze pattern or a deviation of the gaze trajectory from the expert pattern is detected, it is judged as a visual abnormality.

[0098] When an anomaly is detected, the silence state is lifted, and a visual anomaly trigger signal is generated and sent to the central diagnostic agent.

[0099] (2) When a learner’s expression intention is detected to conflict with the tool’s logic or to match a preset erroneous concept template, it is judged as a language abnormality.

[0100] Specifically, 1) Detect whether the learner's statement contradicts the tool's logic. If it does, it is judged as a language anomaly, such as claiming "use an eraser to hide the background", but there is no causal relationship between "eraser" and "hide" in the domain ontology.

[0101] 2) Check if it matches a preset misconception template. If it matches a preset misconception template, it is judged as a language anomaly, such as "a mask is to erase layers".
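
The two language checks can be illustrated with a toy example. Everything concrete here is hypothetical: the ontology fragment `TOOL_EFFECTS`, the single regular-expression template, and the assumption that the NLU stage has already extracted a `(tool, intent)` pair from the utterance. A real system would query the domain ontology in the knowledge graph instead of a hard-coded set.

```python
import re

# Hypothetical ontology fragment: (tool, effect) pairs that are causally valid.
TOOL_EFFECTS = {
    ("mask", "hide"),
    ("mask", "transparent"),
    ("eraser", "erase"),
    ("brush", "paint"),
}

# Hypothetical misconception templates, matched against the lower-cased utterance.
MISCONCEPTION_TEMPLATES = [
    re.compile(r"\bmask\b.*\berases?\b.*\blayers?\b"),
]


def detect_language_anomaly(utterance, tool, intent):
    """Return an anomaly identifier, or None if the utterance looks consistent.

    `tool` and `intent` are assumed to come from the upstream NLU step.
    """
    text = utterance.lower()
    for pattern in MISCONCEPTION_TEMPLATES:
        if pattern.search(text):
            return "misconception_template"
    if (tool, intent) not in TOOL_EFFECTS:
        return "tool_logic_conflict"  # no causal link between tool and intended effect
    return None
```

The two examples from the text map onto the two branches: "use an eraser to hide the background" fails the ontology lookup, while "a mask is to erase layers" matches the template.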

[0102] (3) When an operation sequence is detected to match a preset inefficient behavior pattern, or when the similarity with the preset expert path is lower than the preset path similarity threshold, it is determined to be an abnormal behavior.

[0103] When inefficient behavior patterns (such as repeated parameter oscillations or invalid repeated clicks) or operation sequences are detected with low similarity to expert paths, they are judged as abnormal behavior.

[0104] Specifically, 1) Expert path similarity comparison: Pre-store the standard operation sequence P_expert defined by domain experts, and use the edit distance algorithm to calculate the similarity Sim = 1 - EditDistance(P_current, P_expert) / max(len(P_current), len(P_expert)) between the operation sequence P_current in the current sliding window and P_expert. When Sim is lower than the preset threshold θ_sim (e.g., θ_sim=0.6), it is judged as an abnormal deviation from the expert path.

[0105] 2) Inefficient behavior pattern matching: A pre-built inefficient behavior pattern library (such as (U,U) continuous undo, (B,E,B) tool switching oscillation, etc.) is used to detect whether the current sequence contains a subsequence that matches the pattern library. If it matches and the number of occurrences exceeds the preset frequency threshold (such as ≥2 occurrences within 10 steps), it is determined to be an inefficient behavior anomaly of the corresponding type.

[0106] If either of the above two determination methods meets the condition, a behavior abnormality signal will be triggered, along with an abnormality type identifier and the current operation context.
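
The two behavioral checks can be combined in a short sketch. It assumes operations are already encoded as the single-letter symbols defined earlier, uses Levenshtein distance as the edit distance, and counts contiguous (possibly overlapping) occurrences of each pattern; the gap-tolerant matching implied by max_gap is left out for brevity. The function names and anomaly identifiers are illustrative.

```python
INEFFICIENT_PATTERNS = {
    ("U", "U"): "consecutive_undo",
    ("D", "U", "D"): "draw_undo_draw",
    ("B", "E", "B"): "tool_switch_oscillation",
    ("M", "L", "M", "L"): "layer_mask_toggle_oscillation",
}


def edit_distance(a, b):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def path_similarity(current, expert):
    """Sim = 1 - EditDistance / max(len(current), len(expert))."""
    longest = max(len(current), len(expert))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(current, expert) / longest


def count_occurrences(ops, pattern):
    """Number of contiguous, possibly overlapping, occurrences of pattern in ops."""
    k = len(pattern)
    return sum(1 for i in range(len(ops) - k + 1) if tuple(ops[i:i + k]) == tuple(pattern))


def detect_behavior_anomaly(ops, expert, theta_sim=0.6, min_count=2):
    """Return the list of anomaly identifiers raised for the current operation window."""
    ops = list(ops)
    anomalies = []
    if path_similarity(ops, expert) < theta_sim:
        anomalies.append("expert_path_deviation")
    for pattern, name in INEFFICIENT_PATTERNS.items():
        if count_occurrences(ops, pattern) >= min_count:
            anomalies.append(name)
    return anomalies
```

A non-empty return value corresponds to the condition under which the behavior abnormality signal is triggered; the identifiers can serve as the abnormality type carried in that signal.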

[0107] (4) Based on the anomaly determination, generate an asynchronous trigger signal and send it to the central diagnostic agent.

[0108] For example, when a learner uses a black brush to paint while the "image layer" is selected, the behavior diagnostic agent detects that the operation deviates from the task objective of "non-destructive hiding" and determines it as a "destructive editing anomaly". It then immediately generates an asynchronous trigger signal and sends it to the central diagnostic agent.

[0109] The asynchronous trigger signal includes: trigger source (such as behavioral agent), trigger timestamp (T_trigger), exception type (such as destructive operation), and feature vector (such as current layer state, tool type, operation coordinates, etc.).
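
One possible shape for the trigger signal is sketched below. The four fields follow the list above, but the class name, field names, and JSON serialization are assumptions made for illustration; the text does not specify a wire format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TriggerSignal:
    """Asynchronous trigger signal sent from an analytical agent to the central agent."""
    source: str        # trigger source, e.g. "behavior_agent"
    t_trigger_ms: int  # trigger timestamp T_trigger, in milliseconds
    anomaly_type: str  # exception type, e.g. "destructive_editing"
    features: dict = field(default_factory=dict)  # layer state, tool type, coordinates, ...

    def to_json(self):
        """Serialize the signal for transmission (JSON is an illustrative choice)."""
        return json.dumps(asdict(self), sort_keys=True)
```

Carrying the feature vector inside the signal is what allows the trigger source to skip the later backtracking request.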

[0110] Following the steps outlined above, the system performs lightweight monitoring locally under normal circumstances, without uploading data to the central node, significantly reducing network bandwidth usage and the computational load on the central server. The global diagnostic process is only initiated when an anomaly is detected, achieving efficient end-edge-cloud collaboration.

[0111] As shown in Figure 1 and Figure 2, in step S3, the central diagnostic agent sends evidence retrieval instructions to other analytical agents according to the asynchronous trigger signal, receives and aggregates the data reported by each analytical agent, and constructs a diagnostic context; the central diagnostic agent maps the diagnostic context to the dynamic knowledge graph and performs triangulation reasoning to generate the final diagnostic conclusion.

[0112] A central diagnostic agent is deployed in the cognitive fusion layer (L3) as the central control node of the system. The central diagnostic agent includes a state machine controller and a triangulation mechanism engine. The state machine controller maintains the system's operational state (silent monitoring, evidence retrieval, triangulation, intervention decision-making, etc.). The triangulation mechanism engine receives heterogeneous evidence from the distributed analysis layer (L2), executes multi-hop reasoning algorithms in the dynamic knowledge graph, and searches for a unique misconception hypothesis that can simultaneously explain the multimodal evidence.

[0113] S301. The central diagnostic agent sends evidence retrieval instructions to other analytical agents based on the asynchronous trigger signal, receives and aggregates the data reported by each analytical agent, and constructs a diagnostic context.

[0114] A historical data backtracking mechanism was designed. By maintaining a sliding time window locally in each agent, historical data within the key time window can be retrieved after an anomaly is detected, thereby accurately capturing the causal chain of the error. The specific process is as follows: (1) The central diagnostic agent calculates the backtracking time window based on the asynchronous trigger signal.

[0115] When an abnormal behavior is detected, an abnormal behavior trigger signal is generated. After receiving the trigger signal, the central diagnostic agent switches the system state to "consultation mode" and calculates the backtracking time window.

[0116] Specifically, the input is the trigger timestamp T_trigger carried in the trigger signal.

[0117] Calculation method: The backtracking window is defined as the time interval that extends backward from T_trigger by a fixed duration Δt, with T_trigger as the right endpoint, that is: Backtrack window = [T_trigger - Δt, T_trigger]; here, Δt is the preset backtracking duration, set to 10 seconds, which corresponds to the caching duration of the local sliding window of each agent, ensuring that complete historical data can be retrieved.

[0118] Actual execution: The central diagnostic agent writes the time interval [T_trigger - 10 s, T_trigger] as a parameter into the data backtracking instruction and broadcasts it to all agents other than the trigger source; each agent retrieves the historical feature data falling within this interval from its local FIFO queue and reports it.

[0119] Each analysis layer agent maintains a FIFO queue-style time sliding window (e.g., window size of 10-30 seconds) locally, continuously storing the most recently preprocessed feature data; when any agent detects an anomaly and sends a trigger signal, the central diagnostic agent calculates the backtracking window based on the trigger timestamp and broadcasts the data backtracking instruction to other analysis agents.

[0120] (2) The central diagnostic agent broadcasts a data backtracking instruction containing the backtracking time window to other analytical agents except the trigger source.

[0121] For example, going back 10 seconds, broadcasting "evidence retrieval instructions" to the visual agent and the language agent.

[0122] (3) The analytical agents respond to the instruction, retrieve the historical feature data within the backtracking time window from their local sliding windows, and report it, together with the trigger source features, to the central diagnostic agent.

[0123] When abnormal behavior is detected, the visual diagnostic agent and the language diagnostic agent respond to the instruction, retrieve historical feature data from the local sliding window for that time period, convert it into a structured observation report summary, and then upload it to the central agent. Specifically: The visual agent reported that "the gaze was not confirmed in the layer panel before the operation, and the gaze hotspots in the canvas area are in a disordered and diffuse state."

[0124] The language agent reports that the learner's speech included expressions such as "remove this background" or "make the background transparent".

[0125] The behavioral diagnostic agent acts as the trigger source, and its abnormal characteristics are reported along with the trigger signal, eliminating the need for further backtracking.
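
Steps (1) to (3) can be sketched as a single function. This is a simplified, in-process illustration: each agent's local window is modeled as a plain list of `(timestamp_ms, features)` pairs, and the broadcast is modeled as a loop, whereas a deployed system would send the instruction over a message channel. The function and parameter names are assumptions.

```python
def backtrack_window(t_trigger_ms, delta_ms=10_000):
    """Return [T_trigger - delta_t, T_trigger] as a (start, end) pair in milliseconds."""
    return (t_trigger_ms - delta_ms, t_trigger_ms)


def collect_evidence(trigger_source, t_trigger_ms, agent_buffers, delta_ms=10_000):
    """Ask every agent except the trigger source for its data inside the window.

    agent_buffers: {agent_name: [(timestamp_ms, features), ...]} stands in for
    the agents' local FIFO sliding windows (an illustrative simplification).
    """
    start, end = backtrack_window(t_trigger_ms, delta_ms)
    reports = {}
    for name, buffer in agent_buffers.items():
        if name == trigger_source:
            continue  # its features already arrived with the trigger signal
        reports[name] = [(t, f) for t, f in buffer if start <= t <= end]
    return reports
```

For a behavioral trigger at 20 s, the visual and language agents each return whatever they cached between 10 s and 20 s, and the behavioral agent is not queried.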

[0126] (4) The central diagnostic agent performs timestamp alignment and spatial coordinate normalization on the received data to construct a diagnostic context under a unified spatiotemporal coordinate system.

[0127] The central diagnostic agent receives data including historical feature data reported by the visual and language diagnostic agents, and abnormal features reported by the behavioral diagnostic agent.

[0128] The central intelligent agent performs spatiotemporal alignment processing on the collected multi-source heterogeneous data to construct a diagnostic context containing panoramic information before and after the abnormal triggering time. Specifically, 1) timestamp alignment: using the trigger timestamp of the asynchronous trigger signal as a reference, a linear interpolation method is used to uniformly resample eye-tracking feature data (60Hz), speech feature data (extracted after 16kHz sampling), and operation behavior logs (event-driven) to the same time axis at 50ms intervals, eliminating time misalignment caused by differences in sampling rates of various modalities.

[0129] 2) Spatial coordinate normalization: The coordinates of eye-tracking gaze point, mouse click coordinates, and predefined interface AOI area coordinates are uniformly mapped to the normalized screen coordinate system (coordinate values divided by screen width / height, with a value range of [0,1]), eliminating coordinate deviations caused by different terminal device screen resolutions and window layouts.

[0130] 3) Construct a diagnostic context. Multi-source data that has been spatiotemporally aligned are aggregated in a unified spatiotemporal coordinate system to form a diagnostic context with the trigger time as the right endpoint and containing complete context information before the trigger time, which is used for subsequent triangulation reasoning.
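
The alignment steps can be sketched for numeric feature streams as follows. Two choices here are assumptions rather than statements from the text: values requested outside the sampled range are held at the nearest edge sample, and normalized coordinates are clamped to [0,1]. Event-driven logs that carry symbols rather than numbers would need a different treatment, such as assigning each event to its nearest grid slot.

```python
def resample_linear(samples, start_ms, end_ms, step_ms=50):
    """Linearly interpolate time-sorted (t_ms, value) samples onto a regular grid."""
    if not samples:
        return []
    out = []
    k = 0  # index of the last sample with timestamp <= t
    t = start_ms
    while t <= end_ms:
        while k + 1 < len(samples) and samples[k + 1][0] <= t:
            k += 1
        t0, v0 = samples[k]
        if t <= t0 or k + 1 == len(samples):
            value = v0  # exact hit, or outside the sampled range: hold the edge value
        else:
            t1, v1 = samples[k + 1]
            value = v0 + (v1 - v0) * (t - t0) / (t1 - t0)
        out.append((t, value))
        t += step_ms
    return out


def normalize_point(x, y, screen_w, screen_h):
    """Map pixel coordinates to the normalized [0,1] x [0,1] screen coordinate system."""
    nx = min(max(x / screen_w, 0.0), 1.0)
    ny = min(max(y / screen_h, 0.0), 1.0)
    return (nx, ny)
```

After both steps, samples from every modality share the same 50 ms time grid and the same resolution-independent coordinates, so they can be joined row by row into the diagnostic context.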

[0131] Through the above steps, the historical data backtracking mechanism effectively solves the problem of temporal asynchrony in multimodal data. For example, when an operational error is detected, the system can backtrack to obtain the visual gaze trajectory and voice intent a few seconds before the operation, thereby accurately determining whether the error was caused by cognitive misunderstanding or operational error.

[0132] As shown in Figure 3, in S302, the central diagnostic agent maps the diagnostic context to the dynamic knowledge graph and performs triangulation reasoning to generate the final diagnostic conclusion.

[0133] (1) Construct a dynamic knowledge graph.

[0134] Based on graph databases (such as Neo4j) for storage, dynamic knowledge graphs include three core subgraphs: domain ontology subgraph, reverse cognitive bias model subgraph, and teaching strategy model subgraph.

[0135] The domain ontology subgraph is used to store subject-specific standard concepts and relationships, such as "black_mapped to_completely transparent".

[0136] The reverse cognitive bias model subgraph is used to store typical cognitive error nodes (such as "destructive erasure misunderstanding" and "mask black-and-white reversal misunderstanding") and their hierarchical topology (surface representation layer, middle conceptual error layer, and deep mental model layer).

[0137] The teaching strategy model subgraph is used to store tiered intervention rules for different types of misunderstandings.

[0138] (2) The central diagnostic agent maps historical feature data in the diagnostic context into evidence nodes in the dynamic knowledge graph.

[0139] The collected multimodal evidence is mapped to evidence nodes in the graph, that is, behavioral evidence (the original layer is currently selected and a black brush is being used), linguistic evidence (the verbal intent is "hide" or "make transparent"), and visual evidence (the gaze confirmation step on the layer panel is missing) are mapped to instance nodes in the graph.

[0140] Among them, visual evidence refers to visual attention features, namely visual attention vectors; linguistic evidence refers to linguistic semantic features, namely key concept words and domain ontology mapping; and behavioral evidence refers to behavioral sequence features, namely operation sequences and inefficient behavior patterns.

[0141] (3) Retrieve candidate diagnostic hypotheses that are associated with evidence nodes from the reverse cognitive bias model.

[0142] Retrieve candidate hypotheses (misunderstanding nodes) from the reverse cognitive bias model, including H1 (object confusion: mistakenly believing that a mask is being edited) and H2 (tool selection error: mistakenly believing that an eraser is being used).

[0143] (4) Calculate the explanatory power score for each candidate diagnostic hypothesis.

[0144] The explanatory power score for each candidate diagnostic hypothesis H is calculated using a scoring formula (not reproduced in this text) whose terms are defined as follows: w_m is the confidence weight of evidence modality m, ordered behavior > language > vision; δ is a binary function that equals 1 if there is a connection path from H to the evidence node E in the graph and 0 otherwise; |M(H)| is the number of modalities supporting the hypothesis; and λ is the Occam's razor penalty term.

[0145] For H1 (object confusion), this hypothesis explains why the learner "wanted to hide (correct intention)" but "painted black (behavior consistent with mask logic, but object incorrect)" and explains why the gaze "ignores the layer panel (causing failure to notice the original image was selected)". The three are logically consistent and have the highest score.

[0146] For H2 (mistaking the tool for an eraser), while the behavior can be explained, the hypothesis cannot explain why the learner specifically chose the brush and selected black (erasers usually do not require a color selection), hence the low score.

[0147] (5) Select the candidate diagnostic hypothesis with the highest explanatory power score that exceeds the preset confirmation threshold as the final diagnostic conclusion.

[0150] If the H1 (Operation Object Confusion) score is the highest and exceeds the preset confirmation threshold θ_conf, then H1 is selected as the final diagnostic conclusion: "The learner confused the layer mask with the original layer and incorrectly performed a hiding operation that should have been performed on the mask on the original layer."

[0151] (6) When there are multiple candidate diagnostic hypotheses with similar scores that exceed the preset confirmation threshold, the active detection mechanism is triggered to generate discriminative questions to obtain new evidence.

[0152] Specifically, 1) Triggering condition determination: If the scores of two or more candidate hypotheses exceed the preset confirmation threshold θ_conf (e.g., θ_conf=0.7), and the difference between the highest score and the second highest score is less than the preset similar threshold δ_sim (e.g., δ_sim=0.15), then it is determined to be a multi-hypothesis conflict state.

[0153] 2) Discrimination question generation: Based on the difference path between candidate hypotheses and existing evidence nodes in the dynamic knowledge graph, the corresponding discrimination question template is retrieved from the subgraph of the teaching strategy model, or the probing questions that can effectively distinguish different hypotheses are dynamically generated through the natural language generation module.

[0154] 3) New evidence collection: The generated probing questions are presented to learners through the teaching execution agent, and the learners' multimodal responses (voice answers understood by ASR+NLU, or selection by interface clicks, or text input) are collected as new evidence.

[0155] 4) Reason and iterate on new evidence, incorporate the new evidence nodes into the dynamic knowledge graph, re-execute triangulation reasoning, and update the explanatory power scores of each candidate hypothesis.

[0156] 5) Repeat steps 1) to 4) until the score of a unique candidate hypothesis exceeds θ_conf and the difference between it and the second highest score exceeds δ_sim, or the preset maximum number of probes N_max (e.g., 3 times) is reached, or the learner times out and does not respond; finally, select the hypothesis with the highest score as the diagnostic conclusion and mark its confidence level (high confidence / low confidence).
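
The control flow of steps 1) to 5) can be sketched as below. The scoring itself is not modeled: `probe` is a hypothetical callback that stands in for question generation, evidence collection, and re-scoring, and returns either an updated score dictionary or `None` on learner timeout. Treating a result as low confidence whenever the loop ends without a confirmed hypothesis is an assumption of this sketch.

```python
def assess(scores, theta_conf=0.7, delta_sim=0.15):
    """Classify the current scores as 'confirmed', 'conflict', or 'unconfirmed'."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    above = [h for h, s in ranked if s > theta_conf]
    if len(above) >= 2 and ranked[0][1] - ranked[1][1] < delta_sim:
        return "conflict", above  # several plausible hypotheses with similar scores
    if above:
        return "confirmed", [ranked[0][0]]
    return "unconfirmed", []


def diagnose(scores, probe, n_max=3, theta_conf=0.7, delta_sim=0.15):
    """Iterate active probing; return (best_hypothesis, confidence, probes_used)."""
    probes_used = 0
    state, candidates = assess(scores, theta_conf, delta_sim)
    while state == "conflict" and probes_used < n_max:
        updated = probe(candidates)  # hypothetical callback; None means timeout
        if updated is None:
            break
        scores = updated
        probes_used += 1
        state, candidates = assess(scores, theta_conf, delta_sim)
    best = max(scores, key=scores.get)
    confidence = "high" if state == "confirmed" else "low"
    return best, confidence, probes_used
```

The loop therefore ends in one of the three ways listed in step 5): the conflict is resolved, N_max probes have been used, or the learner does not respond.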

[0157] Through the above steps, the logical consistency of the diagnostic conclusion is ensured by triangulation. The system does not rely on a black box model, but instead uses graph reasoning to find hypotheses that can explain multimodal evidence simultaneously, effectively handling semantic conflicts between modalities. At the same time, the complete chain of evidence makes the diagnostic results highly interpretable.

[0158] As shown in Figure 1 and Figure 2, in step S4, the teaching agent executes an adaptive tiered intervention strategy based on the final diagnostic conclusion and the learner's current cognitive load index, and generates and outputs intervention content.

[0159] A multi-level scaffolding intervention strategy based on reinforcement learning was designed. The optimal intervention level is dynamically selected according to the diagnosed misunderstanding type (conceptual error or mental model defect) and the learner's cognitive load state to prevent over-intervention from interrupting the learning flow.

[0160] In the intervention layer (L4), an instruction execution agent is deployed and configured with a natural language generation (NLG) module and an interface control API, which transform the abstract intervention strategies issued by the cognitive fusion layer (L3) into specific text prompts, voice guidance, or interface highlighting signals.

[0161] Based on the final diagnostic conclusion and the learner's current cognitive load index, the instructional agent executes an adaptive tiered intervention strategy, generating and outputting intervention content. The specific process is as follows: (1) Set multiple progressive intervention levels.

[0162] Four progressive intervention levels are defined: F1 indirect cues (highlighting interface elements), F2 heuristic questioning (guiding self-reflection), F3 process modeling (demonstrating problem-solving steps), and F4 direct instruction (directly explaining principles).

[0163] (2) Estimate the current cognitive load index based on pupil diameter change data and operational response.

[0164] By combining changes in pupil diameter and operational reaction time, the cognitive load index (CLI) is estimated in real time. When the CLI exceeds the high load threshold, the system is forced to switch to a direct teaching mode with low cognitive consumption.

[0165] A ridge regression model integrating three features (pupil diameter change rate, reaction time ratio, and number of hesitations) predicts the cognitive load index CLI ∈ [0,1]. Ridge regression limits overfitting through L2 regularization and is suitable for multimodal feature fusion.

[0166] Specifically, 1) Feature extraction is performed, and three types of features are calculated in real time.

[0167] f1 (rate of change of pupil diameter) = (current pupil diameter − baseline pupil diameter) / baseline pupil diameter, with the baseline value being the average over the 10 seconds prior to the start of the task; f2 (response time ratio) = current average response time / historical average response time; f3 (number of hesitations) = number of undo operations within a unit time window (10 seconds) + number of pauses (a pause is defined as an interval of no operation exceeding 2 seconds).
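The three features can be computed as in the following sketch. The function names and the event representation (timestamps in seconds) are hypothetical. Counting a pause only between consecutive operations inside the window, and ignoring idle time at the window edges, is an assumption of this example.

```python
from typing import Sequence


def pupil_change_rate(current_diameter: float, baseline_samples: Sequence[float]) -> float:
    """f1: relative change against the mean of the pre-task baseline samples."""
    baseline = sum(baseline_samples) / len(baseline_samples)
    return (current_diameter - baseline) / baseline


def response_time_ratio(current_avg_rt: float, historical_avg_rt: float) -> float:
    """f2: current average response time over the historical average."""
    return current_avg_rt / historical_avg_rt


def hesitation_count(
    event_times: Sequence[float],
    undo_times: Sequence[float],
    window_start: float,
    window_len: float = 10.0,  # unit time window from the text
    pause_gap: float = 2.0,    # pause definition from the text
) -> int:
    """f3: undo operations plus pauses inside one time window."""
    window_end = window_start + window_len
    undos = sum(1 for t in undo_times if window_start <= t < window_end)
    in_window = sorted(t for t in event_times if window_start <= t < window_end)
    # A pause is a gap of more than pause_gap seconds between consecutive operations.
    pauses = sum(1 for a, b in zip(in_window, in_window[1:]) if b - a > pause_gap)
    return undos + pauses
```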

[0168] 2) Offline training model.

[0169] During the experimental phase, labeled data were collected, using NASA-TLX Subjective Cognitive Load Scale scores (normalized to [0,1]) as labels y and [f1, f2, f3] as input features to train a ridge regression model. The optimization objective is: min_w ‖y − Xw‖² + λ‖w‖², where λ is the L2 regularization coefficient, whose optimal value is selected through 5-fold cross-validation (usually λ ∈ [0.01, 10]), and w = [w1, w2, w3] is the weight vector obtained through training. The bias b is also obtained through training.

[0170] 3) Online prediction: The features extracted in real time are substituted into the trained model, CLI = w1·f1 + w2·f2 + w3·f3 + b, and the result is clipped to ensure CLI ∈ [0,1].
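A minimal, dependency-free sketch of the offline fit and the online prediction is given below. It uses the closed-form ridge solution on centered data, so the bias b is left unpenalized; that treatment of b, and the helper names `solve_linear`, `fit_ridge`, and `predict_cli`, are assumptions of this example. The 5-fold cross-validation over λ is not shown.

```python
from typing import List, Sequence, Tuple


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ridge(X: Sequence[Sequence[float]], y: Sequence[float], lam: float) -> Tuple[List[float], float]:
    """Return (w, b) minimizing ||y - Xw - b||^2 + lam * ||w||^2."""
    n, d = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(d)] for row in X]
    yc = [v - y_mean for v in y]
    # Normal equations on centered data: (Xc^T Xc + lam * I) w = Xc^T yc
    gram = [[sum(r[i] * r[j] for r in Xc) + (lam if i == j else 0.0) for j in range(d)]
            for i in range(d)]
    rhs = [sum(r[i] * t for r, t in zip(Xc, yc)) for i in range(d)]
    w = solve_linear(gram, rhs)
    b = y_mean - sum(wi * mi for wi, mi in zip(w, x_mean))
    return w, b


def predict_cli(features: Sequence[float], w: Sequence[float], b: float) -> float:
    """CLI = w1*f1 + w2*f2 + w3*f3 + b, clipped to [0, 1]."""
    raw = sum(wi * fi for wi, fi in zip(w, features)) + b
    return min(1.0, max(0.0, raw))
```

In practice `fit_ridge` would be called once per candidate λ inside the cross-validation loop, and `predict_cli` once per incoming feature vector [f1, f2, f3].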

[0171] (3) The final diagnosis, the cumulative number of historical errors and the cognitive load index are used as state vectors and input into the pre-trained reinforcement learning model to obtain the optimal intervention level.

[0172] The diagnosed misunderstanding type, the cumulative number of historical errors, and the current cognitive load are used as the state vector for reinforcement learning. The Q-learning algorithm is used to optimize the selection of intervention levels in each state, resulting in a trained reinforcement learning model. Q-learning is a model-free reinforcement learning algorithm used to learn the optimal intervention strategy.

[0173] A state space of 36 discrete states (4 types of misunderstandings × 3 levels of error frequency × 3 levels of cognitive load) and an action space of 4 actions (no intervention, prompting, guidance, and direct teaching) were constructed. The optimal intervention level is obtained from the trained reinforcement learning model.

[0174] (4) When the cognitive load index exceeds the preset high load threshold, select the direct teaching level.

[0175] The preset high load threshold θ_high = 0.7. When the cognitive load index (CLI) exceeds the high load threshold, i.e. CLI>θ_high, the system will forcibly switch to the direct teaching mode with low cognitive consumption.
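The selection logic of steps (3) and (4) can be sketched as a table lookup with a high-load override. The 4 × 3 × 3 state layout, the four actions, and θ_high = 0.7 come from the text; the index ordering in `state_index` and the bin edges in `load_bin` (0.3 and 0.7) are hypothetical choices made for this illustration.

```python
from typing import List

N_MISCONCEPTION, N_ERROR_FREQ, N_LOAD = 4, 3, 3
ACTIONS = ["no intervention", "prompting", "guidance", "direct teaching"]
THETA_HIGH = 0.7  # preset high load threshold from the text


def state_index(misconception: int, error_freq: int, load: int) -> int:
    """Map the three discrete state components to one of 36 table rows."""
    return (misconception * N_ERROR_FREQ + error_freq) * N_LOAD + load


def load_bin(cli: float) -> int:
    """Discretize CLI into low/medium/high; the 0.3 edge is an assumption."""
    if cli < 0.3:
        return 0
    return 1 if cli <= THETA_HIGH else 2


def select_intervention(q_table: List[List[float]], misconception: int,
                        error_freq: int, cli: float) -> str:
    """Greedy action from the Q-table, overridden under high cognitive load."""
    if cli > THETA_HIGH:
        return "direct teaching"
    row = q_table[state_index(misconception, error_freq, load_bin(cli))]
    best = max(range(len(ACTIONS)), key=row.__getitem__)
    return ACTIONS[best]
```

Because the override fires whenever CLI > θ_high, rows for the high-load bin are never read by `select_intervention` in this sketch; they remain in the table only so that the state space matches the 36 states described above.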

[0176] The central diagnostic agent queries the instructional strategy model based on the diagnosed misunderstanding type, such as "layer / mask object confusion." The system detects that this is the learner's first time making this error and that the current cognitive workload index (CLI) is at a moderate level, deciding to use an F2-level heuristic intervention.

[0177] The execution layer agent generates feedback: "Notice that you want to hide the background, but you are currently painting black on the original layer; please check: are you currently selecting 'Mask Thumbnail' or 'Image Thumbnail'?" As shown in Figures 1 and 2, in step S5, after the intervention is implemented, the system enters the inhibition-period reset state, monitors the learner's behavior correction and records closed-loop data, and optimizes the intervention strategy by updating the reinforcement learning model.

[0178] After the intervention is implemented, the system enters the inhibition-period reset state, raising the anomaly detection threshold for 3 seconds. If the learner corrects their behavior in subsequent operations (by clicking on the "mask thumbnail" and repeating the operation), the system records the successful "diagnosis-intervention-correction" closed-loop data, which is used to update the Q-value of the reinforcement learning model and continuously optimize the intervention strategy.

[0179] Specifically, (1) monitor learners’ behavior after intervention and record closed-loop data.

[0180] 1) After the intervention is executed, the system enters a 3-second inhibition period reset state, temporarily raising the abnormal detection threshold of each analysis agent to 1.5 times the normal value, ignoring similar trigger signals, and giving learners time to digest the feedback.

[0181] 2) After the inhibition period ends, the system monitors the learner's subsequent actions. If the learner completes an effective correction within 30 seconds (e.g., correctly clicks the "mask thumbnail" and performs the correct operation again without recurring the same error), the closed-loop data of this "diagnosis-intervention-correction" process is recorded, including the state vector s, intervention action a, reward value r=+1, and the next state s'.
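The temporary threshold increase during the inhibition period can be sketched as follows. The class name `AnomalyThreshold` and the use of explicit timestamps in seconds are hypothetical; the 3-second duration and the 1.5× factor are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class AnomalyThreshold:
    """Detection threshold of one analysis agent with an inhibition window."""
    base: float
    inhibit_until: float = float("-inf")

    def start_inhibition(self, now: float, duration: float = 3.0) -> None:
        # Called right after an intervention has been delivered.
        self.inhibit_until = now + duration

    def current(self, now: float, factor: float = 1.5) -> float:
        # Raised threshold while inhibited, normal value afterwards.
        return self.base * factor if now < self.inhibit_until else self.base
```

An agent would compare its anomaly score against `current(now)` so that similar triggers are less likely to fire while the learner is still digesting the feedback.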

[0182] (2) Use closed-loop data to update the Q value of the reinforcement learning model and optimize the intervention strategy.

[0183] 1) Store the recorded closed-loop data in the experience replay buffer, and update the Q-value of the intervention strategy network using the Q-learning algorithm: Q(s,a) ← Q(s,a) + α[r + γ max_{a'} Q(s',a') − Q(s,a)], where α is the learning rate, with α = 0.1, and γ is the discount factor, with γ = 0.9.

[0184] If the learner fails to correct the error within 60 seconds or makes the same mistake again, a negative reward of r=-0.1 is recorded for optimization.

[0185] 2) Through the continuous accumulation of closed-loop data, the reinforcement learning model continuously optimizes the selection of intervention levels under each state, thereby achieving adaptive evolution of intervention strategies.
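The tabular update rule can be illustrated with a short sketch. The Q-table is assumed here to be a plain list of 36 rows with 4 action values each, and the function name `q_update` is hypothetical; α = 0.1, γ = 0.9, and the rewards +1 and −0.1 are the values given in the text.

```python
from typing import List


def q_update(q: List[List[float]], s: int, a: int, r: float, s_next: int,
             alpha: float = 0.1, gamma: float = 0.9) -> float:
    """Apply Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (td_target - q[s][a])
    return q[s][a]


# One closed-loop record (s, a, r, s') per intervention outcome.
q_table = [[0.0] * 4 for _ in range(36)]
q_update(q_table, s=5, a=1, r=1.0, s_next=6)    # successful correction
q_update(q_table, s=7, a=2, r=-0.1, s_next=7)   # no correction or repeated error
```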

[0186] (3) Extract typical cognitive error patterns from new diagnostic cases and dynamically update the reverse cognitive bias model subgraph.

[0187] The system periodically mines new typical cognitive error patterns from the diagnostic case library to dynamically update the subgraph of the reverse cognitive bias model. The specific process is as follows: 1) Case accumulation: After each successful diagnosis, the evidence, hypothesis and conclusion triplet is stored in the diagnostic case database.

[0188] 2) Pattern mining: Frequent pattern mining algorithms such as FP-Growth are used to extract candidate error patterns from the case library that have a frequency exceeding a preset support threshold (e.g., 5%).

[0189] 3) Typicality screening: Candidate patterns are evaluated using a triple assessment of frequency (≥10 times), confidence (≥0.7), and novelty (similarity with existing patterns <0.3) to screen out typical error patterns.

[0190] 4) Dynamic knowledge graph update: The selected typical patterns are transformed into new error pattern nodes, and causal relationship edges are established with existing evidence nodes and misunderstanding hypothesis nodes to realize the incremental evolution of the reverse cognitive bias model subgraph.

[0191] 5) Conflict and elimination: For new patterns that conflict with existing patterns, a confidence-first strategy is adopted, and outdated patterns that have not been used for a long time are archived and eliminated to ensure the accuracy and timeliness of the graph; 6) Continuous optimization: The updated dynamic knowledge graph is used for subsequent diagnosis. The changes in diagnostic accuracy provide feedback to further optimize the mining parameters, forming a self-evolving closed-loop mechanism.
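The typicality screening of step 3) can be sketched as a three-way filter over mined candidates. The thresholds (frequency ≥ 10, confidence ≥ 0.7, similarity < 0.3) come from the text. Representing a pattern as a set of item labels and measuring similarity with the Jaccard index are assumptions of this example, since the text does not name a similarity measure.

```python
from typing import FrozenSet, Iterable, List, NamedTuple


class Candidate(NamedTuple):
    items: FrozenSet[str]   # evidence/hypothesis labels forming the pattern
    frequency: int          # number of supporting cases
    confidence: float       # rule confidence reported by the miner


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity of two label sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def screen_typical(candidates: Iterable[Candidate],
                   existing: Iterable[FrozenSet[str]],
                   min_freq: int = 10, min_conf: float = 0.7,
                   max_sim: float = 0.3) -> List[Candidate]:
    """Keep candidates that are frequent, confident, and novel."""
    known = list(existing)
    kept = []
    for c in candidates:
        novel = all(jaccard(c.items, e) < max_sim for e in known)
        if c.frequency >= min_freq and c.confidence >= min_conf and novel:
            kept.append(c)
    return kept
```

Each retained candidate would then become a new error pattern node in step 4).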

[0192] Embodiment 2: The purpose of this embodiment is to provide a multimodal learning diagnostic system based on multi-agent and triangulation verification, including: a multimodal perception module, used to collect learners' multimodal learning data; a distributed analysis module, in which multiple analytical agents perform local silent monitoring and feature extraction on the corresponding multimodal learning data and, based on the extracted features, any analytical agent that detects a local modal anomaly generates an asynchronous trigger signal and sends it to the central diagnostic agent; a cognitive fusion module, in which the central diagnostic agent sends evidence retrieval instructions to the other analytical agents based on the asynchronous trigger signal, receives and aggregates the data reported by each analytical agent, constructs a diagnostic context, maps the diagnostic context to a dynamic knowledge graph, and performs triangulation reasoning to generate the final diagnostic conclusion; and an intervention module, in which the teaching agent executes an adaptive tiered intervention strategy based on the final diagnostic conclusion and the learner's current cognitive state, and generates and outputs intervention content.

[0193] It also includes an optimization module, which is used to enter the inhibition-period reset state after the intervention is implemented, monitor the learner's behavior correction and record closed-loop data, and optimize the intervention strategy by updating the reinforcement learning model.

[0194] Based on the distributed cognitive theory architecture, a multi-agent learning and diagnostic system is designed, logically divided into four layers: multimodal perception layer (L1), distributed analysis layer (L2), cognitive fusion layer (L3), and executive intervention layer (L4). Each layer interacts with data through a collaborative communication bus and shares a centralized dynamic knowledge graph database.

[0195] The multimodal perception layer (L1) is the multimodal perception module, responsible for the acquisition and digital mapping of multimodal data. It includes an eye-tracking acquisition unit, a speech acquisition unit, and an interaction log acquisition unit. The eye-tracking acquisition unit is equipped with a high-frame-rate infrared eye tracker (sampling rate 60Hz or higher) to capture the learner's gaze coordinates and pupil diameter data in real time. The speech acquisition unit uses a high-fidelity microphone array and front-end noise reduction algorithms to pick up the learner's spoken thought data. The interaction log acquisition unit records mouse clicks, hovering, dragging, and keyboard input commands in real time through API hooks embedded in the teaching software.

[0196] The distributed analysis layer (L2) is the distributed analysis module, comprising multiple dedicated analytical agents working in parallel: a visual diagnostic agent, a language diagnostic agent, and a behavior diagnostic agent. The visual diagnostic agent incorporates the DBSCAN clustering algorithm and AOI mapping logic, calculating visual attention vectors in real time and monitoring gaze entropy. When it detects gaze pattern divergence or a deviation of the gaze trajectory from the expert mode, it identifies a visual anomaly. The language diagnostic agent integrates an ASR engine and a BERT-based NLU model, using entity mapping to map speech keywords to the domain ontology. When it detects a conflict between the intended expression and the tool's logic, it identifies a concept conflict anomaly. The behavior diagnostic agent maintains a fixed-length operation queue and analyzes it with the GSP sequence-pattern mining algorithm. When it detects inefficient behaviors such as repeated tool / layer switching or destructive editing modes, it identifies behavioral anomalies.
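As an illustration of the gaze entropy monitored by the visual diagnostic agent, the sketch below computes an entropy value from the share of fixation time spent in each area of interest (AOI). Using Shannon entropy in bits is an assumption of this example, since this section does not state the exact formula; the function name and AOI labels are hypothetical.

```python
import math
from typing import Mapping


def gaze_entropy(dwell_by_aoi: Mapping[str, float]) -> float:
    """Shannon entropy (bits) of the fixation-time distribution over AOIs."""
    total = sum(dwell_by_aoi.values())
    if total <= 0:
        return 0.0
    entropy = 0.0
    for dwell in dwell_by_aoi.values():
        if dwell > 0:
            p = dwell / total
            entropy -= p * math.log2(p)
    return entropy


# Attention spread evenly over many AOIs gives a higher value than
# attention concentrated on one AOI.
scattered = gaze_entropy({"layers_panel": 1.0, "canvas": 1.0, "toolbar": 1.0, "menu": 1.0})
focused = gaze_entropy({"layers_panel": 4.0})
```

A value above the agent's preset entropy threshold would be one of the conditions reported as a visual anomaly.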

[0197] The cognitive fusion layer (L3) is the cognitive fusion module, which includes a central diagnostic agent. This agent comprises a state machine controller and a triangulation mechanism engine. The state machine controller maintains the system's operational state (silent monitoring, evidence retrieval, triangulation, intervention decisions, etc.). The triangulation mechanism engine receives heterogeneous evidence from the distributed analysis layer (L2) and executes a multi-hop reasoning algorithm within a dynamic knowledge graph to find a unique misinterpretation hypothesis that can simultaneously explain the multimodal evidence.

[0198] The cognitive fusion module also includes a dynamic knowledge graph database, which comprises three sub-graphs: a domain ontology sub-graph, a reverse cognitive bias model sub-graph, and a teaching strategy model sub-graph. The domain ontology sub-graph stores subject-specific standard concepts and relationships, such as "black_mapped to_completely transparent".

[0199] The reverse cognitive bias model subgraph is used to store typical cognitive error nodes (such as "destructive erasure misunderstanding" and "mask black-and-white reversal misunderstanding") and their hierarchical topology (surface representation layer, middle conceptual error layer, and deep mental model layer); the teaching strategy model subgraph is used to store graded intervention rules for different types of misunderstandings.

[0200] The execution intervention layer (L4) is the execution intervention module, which includes the teaching execution agent, and the teaching execution agent includes the natural language generation module.

[0201] The multimodal learning diagnostic system based on multi-agent and triangulation verification implements the method steps of Embodiment 1.

[0202] Embodiment 3: The purpose of this embodiment is to provide a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the above-described method.

[0203] Embodiment 4: The purpose of this embodiment is to provide a computer-readable storage medium.

[0204] A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, performs the steps of the above method.

[0205] Embodiment 5: The purpose of this embodiment is to provide a computer program product containing instructions that, when run on a computer, cause the computer to perform the methods and functions involved in any of the above embodiments.

[0206] The steps and methods involved in the apparatus of the above embodiments correspond to those in Embodiment 1. For specific implementation details, please refer to the relevant description section of Embodiment 1. The term "computer-readable storage medium" should be understood as a single medium or multiple media including one or more instruction sets; it should also be understood as including any medium capable of storing, encoding, or carrying an instruction set for execution by a processor and enabling the processor to perform any of the methods in this invention.

[0207] Those skilled in the art will understand that the modules or steps of the present invention described above can be implemented using general-purpose computer devices. Optionally, they can be implemented using computer-executable program code, thereby allowing them to be stored in a storage device for execution by a computer device, or they can be fabricated as separate integrated circuit modules, or multiple modules or steps can be fabricated as a single integrated circuit module. The present invention is not limited to any particular combination of hardware and software.

[0208] While the specific embodiments of the present invention have been described above in conjunction with the accompanying drawings, this is not intended to limit the scope of protection of the present invention. Those skilled in the art should understand that various modifications or variations that can be made by those skilled in the art without creative effort based on the technical solutions of the present invention are still within the scope of protection of the present invention.

Claims

1. A multimodal learning diagnostic method based on multi-agent and triangulation mutual verification, characterized in that it includes: collecting learners' multimodal learning data; multiple analytical agents performing local silent monitoring and feature extraction on the corresponding multimodal learning data, and caching them in a local time sliding window; based on the extracted features, when any analytical agent detects a local modal anomaly, generating an asynchronous trigger signal and sending it to the central diagnostic agent; the central diagnostic agent sending evidence retrieval instructions to the other analytical agents based on the asynchronous trigger signal, receiving and aggregating the data reported by each analytical agent, and constructing a diagnostic context; the central diagnostic agent mapping the diagnostic context to a dynamic knowledge graph and performing triangulation reasoning to generate the final diagnostic conclusion; and, based on the final diagnostic conclusion and the learner's current cognitive state, the teaching agent executing an adaptive tiered intervention strategy and generating and outputting intervention content.

2. The multimodal learning diagnostic method based on multi-agent and triangulation verification as described in claim 1, characterized in that, Collect learners' multimodal learning data, which includes at least eye-tracking data, voice interaction data, and operational behavior data. The specific process is as follows: The learner's eye tracking data is collected using an infrared eye tracker. The eye tracking data includes at least the learner's fixation point coordinates and pupil diameter data. A high-fidelity microphone array and front-end noise reduction algorithm are used to acquire learners' verbal thought data, i.e., voice interaction data. By embedding API hooks in the teaching software, operation behavior data is recorded, which includes at least mouse clicks, hovering, dragging, and keyboard input commands.

3. The multimodal learning diagnostic method based on multi-agent and triangulation mutual verification as described in claim 1, characterized in that, Multiple analytical agents perform local silent monitoring and feature extraction on corresponding multimodal learning data. These agents include visual diagnostic agents, language diagnostic agents, and behavioral diagnostic agents. The specific process is as follows: Based on eye-tracking data, the visual diagnostic agent uses a clustering algorithm to calculate the visual attention vector and monitor the gaze entropy value. The language diagnostic agent performs speech recognition and natural language understanding on voice interaction data, and extracts key concept words; The behavioral diagnostic agent uses a sequence pattern mining algorithm to analyze the operation sequence queue and detect inefficient behavioral patterns.

4. The multimodal learning diagnostic method based on multi-agent and triangulation verification as described in claim 3, characterized in that, Based on eye-tracking data, the visual diagnostic agent uses a clustering algorithm to calculate the visual attention vector and monitor the gaze entropy value. The specific process is as follows: Extract the eye-tracking coordinate sequence containing screen coordinates and timestamps from the set time sliding window; Calculate the spatiotemporal distance of each pair of eye-tracking coordinate sequence points; Based on spatiotemporal distance, the number of points in the ε-neighborhood of each point is calculated, and the core point is determined; Starting from an unvisited core point, recursively group all density-reachable points within its ε-neighborhood into the same cluster, forming a gaze cluster; All points not assigned to any cluster are marked as noise points and noise filtering is performed. Calculate the centroid coordinates and fixation duration for each fixation cluster; A visual attention vector is constructed based on the centroid coordinates and fixation duration; Based on the proportion of gaze duration for each region of interest, a visual attention analysis algorithm is used to calculate the gaze entropy value.

5. The multimodal learning diagnostic method based on multi-agent and triangulation verification as described in claim 1, characterized in that, Based on the extracted features, when any analytical agent detects a local modal anomaly, it generates an asynchronous trigger signal and sends it to the central diagnostic agent. The specific process is as follows: When the gaze entropy value exceeds the preset entropy value threshold or the dynamic time warping distance between the gaze trajectory and the preset expert mode exceeds the preset distance threshold, it is judged as a visual abnormality. When a learner's intended expression conflicts with the tool's logic or matches a preset misconception template, it is judged as a language anomaly; When an operation sequence is detected to match a preset inefficient behavior pattern, or when its similarity to a preset expert path is lower than a preset path similarity threshold, it is judged as an abnormal behavior. Based on anomaly detection, an asynchronous trigger signal is generated and sent to the central diagnostic agent.

6. The multimodal learning diagnostic method based on multi-agent and triangulation mutual verification as described in claim 1, characterized in that, The central diagnostic agent sends evidence retrieval instructions to other analytical agents based on asynchronous trigger signals, receives and aggregates data reported by each analytical agent, and constructs a diagnostic context. The specific process is as follows: The central diagnostic agent calculates the backtracking time window based on the asynchronous trigger signal; The central diagnostic agent broadcasts a data backtracking instruction containing the backtracking time window to the other analytical agents, excluding the triggering source; The analytical agents respond to the instruction, retrieve the historical feature data within the backtracking time window from their local sliding windows, and report it along with the trigger source features to the central diagnostic agent; The central diagnostic agent performs timestamp alignment and spatial coordinate normalization on the received data to construct a diagnostic context under a unified spatiotemporal coordinate system.

7. The multimodal learning diagnostic method based on multi-agent and triangulation mutual verification as described in claim 1, characterized in that, The central diagnostic agent maps the diagnostic context to a dynamic knowledge graph and performs triangulation reasoning to generate the final diagnostic conclusion. The specific process is as follows: Construct a dynamic knowledge graph, which includes a domain ontology subgraph, a reverse cognitive bias model subgraph, and a teaching strategy model subgraph. The central diagnostic agent maps historical feature data in the diagnostic context into evidence nodes in a dynamic knowledge graph; Retrieve candidate diagnostic hypotheses that have a path associated with the evidence nodes from the reverse cognitive bias model; Calculate the explanatory power score for each candidate diagnostic hypothesis; The candidate diagnostic hypothesis with the highest explanatory power score that exceeds the preset confirmation threshold is selected as the final diagnostic conclusion. When multiple candidate diagnostic hypotheses with similar scores that exceed a preset confirmation threshold exist, an active detection mechanism is triggered to generate discriminative questions to obtain new evidence.

8. The multimodal learning diagnostic method based on multi-agent and triangulation verification as described in claim 1, characterized in that, Based on the final diagnostic conclusion and the learner's current cognitive load index, the teaching agent executes an adaptive tiered intervention strategy, generates and outputs intervention content, and the specific process is as follows: Set multiple progressive intervention levels; Estimating the current cognitive load index based on pupil diameter variation data and operational responses; The final diagnosis, the cumulative number of historical errors, and the cognitive load index are used as state vectors and input into a pre-trained reinforcement learning model to obtain the optimal intervention level. When the cognitive load index exceeds the preset high load threshold, select the direct teaching level.

9. The multimodal learning diagnostic method based on multi-agent and triangulation mutual verification as described in claim 1, characterized in that, After the intervention is implemented, the system enters the inhibition-period reset state, the learner's behavioral correction is monitored and closed-loop data is recorded, and the intervention strategy is optimized by updating the reinforcement learning model. The specific process is as follows: Monitor learner behavior after intervention and record closed-loop data; Update the Q-value of the reinforcement learning model using closed-loop data to optimize the intervention strategy; Extract typical cognitive error patterns from new diagnostic cases and dynamically update the reverse cognitive bias model subgraph.

10. A multimodal learning diagnostic system based on multi-agent and triangulation verification, characterized in that it includes: a multimodal perception module, used to collect learners' multimodal learning data; a distributed analysis module, in which multiple analytical agents perform local silent monitoring and feature extraction on the corresponding multimodal learning data and cache them in a local time sliding window, and in which, based on the extracted features, any analytical agent that detects a local modal anomaly generates an asynchronous trigger signal and sends it to the central diagnostic agent; a cognitive fusion module, in which the central diagnostic agent sends evidence retrieval instructions to the other analytical agents based on the asynchronous trigger signal, receives and aggregates the data reported by each analytical agent, constructs a diagnostic context, maps the diagnostic context to a dynamic knowledge graph, and performs triangulation reasoning to generate the final diagnostic conclusion; and an intervention module, in which the teaching agent executes an adaptive tiered intervention strategy based on the final diagnostic conclusion and the learner's current cognitive state, and generates and outputs intervention content.