A method for joint diagnosis of self-regulated learning and emotional motivation

By aligning multimodal data using cross-excitation neural ordinary differential equations and a selective state-space model, and combining them with a Poincaré ball curvature-adaptive fusion module, the method addresses the problem of accurately diagnosing learners' self-regulated learning states and emotional motivation on online intelligent teaching platforms, yielding efficient and accurate joint diagnostic results.

CN122196643A · Pending · Publication Date: 2026-06-12 · XIAMEN UNIV OF TECH


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: XIAMEN UNIV OF TECH
Filing Date: 2026-05-14
Publication Date: 2026-06-12

Application Information


IPC: G06F18/24; G06F17/13; G06F18/213; G06F18/25; G06F18/27; G06N3/0499; G06N5/04; G06Q50/20

AI Tagging

Application Domain

Data processing applications; Biological models




Abstract

The application provides a joint diagnostic method for self-regulated learning and emotional motivation, and relates to the technical field of intelligent education. The method comprises the following steps: obtaining original behavior data and text data; performing continuous time domain alignment using a cross-excitation neural ordinary differential equation to obtain an aligned sequence; inputting the aligned sequence into a selective state space model to extract behavior context features, and converting the behavior context features into continuous prompt words by low-rank mapping to guide a large language model in extracting text semantic features; projecting the behavior and text features into a Poincaré ball hyperbolic space and dynamically adjusting the curvature to resolve semantic conflicts, thereby realizing cross-modal fusion; and, through backdoor intervention and counterfactual reasoning in a structural causal model, stripping away the interference of external confounding factors and outputting the self-regulated learning level and the emotional motivation classification probability in parallel. The application effectively processes asynchronous multimodal data, eliminates semantic contradictions between modalities and spurious correlations, and improves the accuracy and explainability of diagnosis.


Description

Technical Field

[0001] This invention relates to the field of intelligent education technology, specifically to a joint diagnostic method for self-regulated learning and emotional motivation.

Background Technology

[0002] Against the backdrop of the rapid development of online education, Massive Open Online Courses (MOOCs) and intelligent tutoring systems have become important pathways for learners to acquire knowledge. However, due to the separation of teachers and students in time and space within the online learning environment, teachers find it difficult to promptly grasp learners' learning status. Therefore, learning status diagnosis has become a core support for achieving personalized teaching and improving learning outcomes. Among these, Self-Regulated Learning (SRL) and affective motivation are key factors influencing learners' learning outcomes, and their accurate diagnosis is of great significance for teaching intervention. SRL encompasses stages such as planning, execution, monitoring, and reflection, reflecting learners' self-directed learning ability and level of control over the learning process; affective motivation reflects learners' emotional tendencies and learning motivation. The two are interconnected and jointly determine learners' learning status and learning outcomes.

[0003] Currently, joint research on SRL and affective motivation remains scarce. Existing technologies generally adopt a separate diagnostic approach, meaning SRL diagnosis and affective motivation diagnosis are conducted separately, with the two being independent and lacking effective correlation. A few technologies attempt simple result stitching, but have not yet achieved true joint modeling and integrated diagnosis. Specifically, existing technologies mainly include the following categories:

[0004] One approach is based on questionnaires. These methods rely on learner feedback, using standardized scales such as the Motivated Strategies for Learning Questionnaire (MSLQ) or the Learning and Study Strategies Inventory (LASSI). Information is collected and statistically analyzed through learner feedback. However, this method is susceptible to individual comprehension biases, recall biases, and willingness to complete questionnaires, resulting in insufficient data objectivity. Furthermore, questionnaires are often collected in phases and offline, making it difficult to capture the dynamic changes in SRL levels and affective motivation during the learning process in real time. This results in poor timeliness and makes it unsuitable for large-scale, routine diagnostic needs.

[0005] The second approach is SRL diagnosis based on learning behavior data. These methods focus on objective behavioral logs generated by learners during online learning. By analyzing learners' clickstreams, video viewing, and assignment submissions, they identify SRL stages and levels using methods such as behavioral rule mapping, behavioral sequence modeling, or behavioral pattern clustering. However, these methods rely solely on objective behavioral logs and ignore internal psychological states such as emotions and motivations. This makes it difficult to explain the true reasons behind the behavior, potentially leading to biased or even erroneous judgments. Rule-dependent methods lack flexibility and struggle to adapt to differences among learners and learning tasks. Temporal modeling methods such as LSTM and Transformer require extensive manually labeled data, resulting in high labeling costs and long cycles, which hinders large-scale application.

[0006] The third approach is affective motivation diagnosis based on learning text data. These methods focus on the text data generated by learners during the learning process and extract affective and motivational information from it through affective lexicon matching, traditional machine learning (such as Term Frequency-Inverse Document Frequency (TF-IDF) or n-gram features combined with a Support Vector Machine (SVM)), or fine-tuning of pre-trained language models. However, these methods analyze affective and motivational information only from the semantics of the text, without considering learning behavior, task scenarios, and process context. This makes it difficult to accurately determine the true direction of emotions and motivations, resulting in insufficient scenario adaptability. Affective lexicons and traditional machine learning can only extract surface text features and struggle to understand complex semantics and implicit emotions. While pre-trained language models can capture deep semantics, they rely on a large number of high-quality labeled samples. Labeled data is scarce in educational scenarios, so these models are prone to overfitting, which limits their stability and practicality.

[0007] Fourth, multimodal diagnostic methods based on simple fusion integrate learning behavior data and learning text data, mainly employing decision-level fusion (e.g., weighted summation after separately training behavior and text models) or feature-level concatenation fusion (e.g., directly concatenating behavior features and text features before inputting them into a classification model). Existing fusion methods do not consider the asynchronous nature of high-frequency temporal sequences in learning behavior and low-frequency discrete sequences in learning text, easily leading to information misalignment and noise interference, resulting in limited fusion effectiveness. Furthermore, the fusion structures are mostly fixed structures with fixed weights, unable to dynamically adjust modal contributions according to the learning stage and task type, lacking flexibility and rationality. More importantly, existing multimodal fusion methods only improve the diagnostic effect of a single dimension (SRL only or affective motivation only), failing to achieve joint modeling, associative reasoning, and synchronous output of SRL and affective motivation, thus failing to truly leverage the complementary advantages of multimodal approaches.
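The two baseline fusion schemes described above can be illustrated with a minimal sketch. The function names, the fixed weight, and the example numbers are hypothetical and only show why a fixed-weight or plain-concatenation structure cannot adapt modal contributions to the learning stage.

```python
def decision_level_fusion(p_behavior, p_text, w_behavior=0.5):
    """Weighted sum of class probabilities from two separately trained models.

    The weight is fixed in advance, which is the limitation noted in the text.
    """
    w_text = 1.0 - w_behavior
    return [w_behavior * pb + w_text * pt for pb, pt in zip(p_behavior, p_text)]


def feature_level_fusion(f_behavior, f_text):
    """Plain concatenation of behavior and text features before a classifier."""
    return list(f_behavior) + list(f_text)


# Hypothetical outputs of a behavior model and a text model that disagree.
p = decision_level_fusion([0.9, 0.1], [0.2, 0.8], w_behavior=0.5)
f = feature_level_fusion([1.0, 2.0, 3.0], [0.5, -0.5])
```

With equal weights, the two contradictory predictions are simply averaged, which is the "feature neutralization" effect the later sections aim to avoid.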

[0008] In summary, existing technologies still have significant shortcomings in terms of diagnostic comprehensiveness, accuracy of data temporal alignment, joint modeling of SRL and emotional motivation, model adaptability, and effectiveness of multimodal data fusion. Especially in scenarios where multimodal behavioral data is highly asynchronous and semantics between modalities are prone to conflict, existing deep learning models struggle to accurately trace learners' true SRL states and emotional motivations.

[0009] In view of the above, this application is hereby submitted.

Summary of the Invention

[0010] This invention provides a joint diagnostic method for self-regulated learning and emotional motivation, which can at least partially alleviate the above-mentioned problems.

[0011] To achieve the above objectives, the present invention adopts the following technical solution: a joint diagnostic method for self-regulated learning and affective motivation, comprising the following steps. The original behavioral data and text data are acquired and aligned using a cross-excitation neural ordinary differential equation to obtain the aligned behavioral and semantic feature sequence. The aligned behavioral and semantic feature sequence is input into a selective state-space model with linear time complexity to obtain a behavior context feature vector, and the behavior context feature vector is input into a preset low-rank mapping network to obtain a text semantic feature vector. To resolve the semantic conflict between the behavior context feature vector and the text semantic feature vector, deep fusion of cross-modal features is performed to obtain the final joint feature vector. The joint feature vector is subjected to causal intervention processing using a pre-built confounding factor memory to obtain pure features. Based on these pure features, diagnostic calculations of SRL competence and affective motivation are performed simultaneously to generate the diagnostic results.
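The causal intervention step can be related to the standard backdoor adjustment, P(y | do(x)) = Σ_z P(y | x, z) P(z). The sketch below is only an illustration of that textbook formula over a discrete confounder. The variable meanings and all probabilities are hypothetical, and the structure of the patent's confounding factor memory is not specified in this section.

```python
def backdoor_adjust(p_y_given_xz, p_z):
    """P(y | do(x)) = sum_z P(y | x, z) * P(z), for one fixed x and y."""
    return sum(p_y_given_xz[z] * p_z[z] for z in p_z)


def observational(p_y_given_xz, p_z_given_x):
    """P(y | x) = sum_z P(y | x, z) * P(z | x), shown for comparison."""
    return sum(p_y_given_xz[z] * p_z_given_x[z] for z in p_z_given_x)


# Hypothetical setting: y = "low motivation", x = "short study sessions",
# z = task difficulty (the external confounder).
p_y_given_xz = {"easy": 0.2, "hard": 0.7}
p_z = {"easy": 0.5, "hard": 0.5}          # marginal distribution of difficulty
p_z_given_x = {"easy": 0.1, "hard": 0.9}  # difficulty is skewed among x cases

p_do = backdoor_adjust(p_y_given_xz, p_z)
p_obs = observational(p_y_given_xz, p_z_given_x)
```

In this toy setting the observational estimate is inflated because hard tasks co-occur with short sessions; averaging over the marginal of the confounder removes that skew.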

[0012] In summary, existing online intelligent teaching platforms suffer from extreme asynchrony of multimodal learning data and frequent semantic conflicts between modalities, which prevent deep learning models from accurately tracing learners' true self-regulated learning (SRL) state and emotional motivation. To address these shortcomings, this invention proposes a joint diagnostic method for SRL and emotional motivation. The method aims to overcome the following deep technical challenges:
1. How to avoid the large amount of invalid zero-vector padding that the traditional fixed sliding time window causes by forcibly truncating data in extremely irregular and discrete online learning scenarios such as "idling / lurking," dispensing with forced time segmentation and thus losslessly reconstructing the true dynamic evolution trajectory of learners' cognitive states on a continuous time axis.
2. How to break the isolation of dual-branch feature extraction that is detached from behavioral context and establish early deep perception across modalities, so that the Large Language Model (LLM) is controlled in real time by continuous micro-behavioral actions when extracting text sentiment, thereby avoiding semantic misjudgment caused by the inability to perceive the business context at the moment of text generation.
3. When learners' objective behavioral representations and their subjective textual expressions exhibit severe semantic contradictions, how to move beyond conventional linear weighting in Euclidean space, which forcibly neutralizes features, and instead trace upwards and non-linearly resolve the common underlying causes behind the two contradictory representations, thus avoiding the trivialization of feature fusion.
4. In multi-task joint diagnosis, how to introduce interference-resistant reasoning logic that proactively eliminates emotional fluctuations caused by external environmental factors, such as a sudden temporary increase in task difficulty, and eliminates spurious correlations based solely on surface statistical correlations in the data, thereby outputting logically supported objective evaluation results.

[0013] This method takes the asynchronous behavior sequences and text sequences (i.e., the raw behavior data and text data) generated on an online learning platform as input, and deeply modifies the underlying feature engineering and multimodal diagnostic architecture. First, in the data access stage, cross-excitation neural ODEs are constructed, abandoning the traditional approach of forced time segmentation. Discrete text events act as instantaneous impulses that dynamically modulate the evolution trajectory of continuous behavior, thereby losslessly restoring the true dynamic evolution of the learning state in the continuous time domain. Subsequently, in the feature extraction stage, a behavior manifold-driven continuous prompting mechanism for large models is constructed. High-frequency behavior latent state vectors extracted by Mamba (Linear-Time Sequence Modeling with Selective State Spaces) are transformed through low-rank mapping into continuous prompt words at the input layer of the Large Language Model (LLM), so that the extraction of text semantics is controlled in real time by the underlying micro-actions, achieving cross-modal deep perception that incorporates behavioral context. Then, in the feature fusion stage, a curvature-adaptive fusion module based on the Poincaré ball model of hyperbolic space is introduced, which projects the dual-branch features into a non-Euclidean geometric space. When a semantic conflict between modalities is encountered, the module dynamically increases the local negative curvature, forcing the model to trace upwards along the knowledge hierarchy and thereby resolve, in a non-linear manner, the common underlying causes behind contradictory appearances. Finally, in the diagnostic output stage, a Structural Causal Model (SCM) routing network is constructed to eliminate spurious correlations. Through backdoor intervention and counterfactual reasoning mechanisms, it actively removes temporary emotional fluctuations caused by the external environment or task difficulty. Ultimately, it outputs logically supported objective diagnostic results in parallel, closing the automated technical loop from low-level asynchronous multimodal perception to front-end interference-resistant assessment.
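The projection of features into a Poincaré ball can be sketched with the standard exponential map at the origin and the geodesic distance for a ball of curvature -c (c > 0). This is a minimal illustration of those textbook formulas only. The feature values are hypothetical, and the rule by which the method adapts the curvature on conflict is not given in this section, so the sketch simply evaluates two fixed curvature values.

```python
import math


def expmap0(v, c):
    """Exponential map at the origin: tanh(sqrt(c)*|v|) * v / (sqrt(c)*|v|)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]


def mobius_add(x, y, c):
    """Mobius addition x (+)_c y in the Poincare ball of curvature -c."""
    xy = sum(a * b for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    den = 1 + 2 * c * xy + c * c * x2 * y2
    return [((1 + 2 * c * xy + c * y2) * a + (1 - c * x2) * b) / den
            for a, b in zip(x, y)]


def poincare_dist(x, y, c):
    """d_c(x, y) = (2 / sqrt(c)) * artanh(sqrt(c) * |(-x) (+)_c y|)."""
    diff = mobius_add([-a for a in x], y, c)
    n = math.sqrt(sum(d * d for d in diff))
    # Clamp to keep artanh finite for points numerically on the boundary.
    return 2 / math.sqrt(c) * math.atanh(min(math.sqrt(c) * n, 1 - 1e-12))


behavior_feat = [0.6, 0.2]   # hypothetical behavior-branch feature
text_feat = [-0.4, 0.5]      # hypothetical text-branch feature
results = {}
for c in (0.5, 2.0):
    p, q = expmap0(behavior_feat, c), expmap0(text_feat, c)
    results[c] = (p, q, poincare_dist(p, q, c))
```

Every projected point lies strictly inside the ball of radius 1/sqrt(c), so a fusion module that changes c also changes the geometry in which the two branches are compared.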

[0014] Compared with existing technologies, this method effectively overcomes the spatiotemporal alignment discontinuity of asynchronous irregular data, maintaining feature integrity and high robustness even in sparse long-tail data scenarios such as "idling" and "lurking". It significantly reduces the semantic perception limitations caused by the physical isolation of modalities, achieving higher sentiment extraction accuracy for complex textual expressions such as irony and perfunctory statements. It provides a non-Euclidean geometric fusion scheme for handling semantic conflicts between modalities, avoiding feature neutralization and loss of discriminative power in complex scenarios where behavioral appearances contradict textual expressions. It enhances the interference resistance and interpretability of diagnostic assessment by stripping away temporary emotional fluctuations caused by the external environment or task difficulty, so that assessment results better reflect the learner's intrinsic stable characteristics and the misjudgment rate is reduced. It reduces model training costs, decreases reliance on expensive manually labeled data, and alleviates multi-task gradient conflicts, supporting the stability and convergence of high-order neural networks during joint training. The resulting structured diagnostic profile includes modal conflict levels and causal reasoning, providing more objective and richer data support for downstream adaptive teaching interventions.

Attached Figure Description

[0015] Figure 1 is a flowchart of the joint diagnostic method for self-regulated learning and emotional motivation provided in this embodiment of the invention.

Detailed Implementation

[0016] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.

[0017] Referring to Figure 1, the first embodiment of the present invention discloses a joint diagnostic method for self-regulated learning and emotional motivation, which uses a joint diagnostic model for processing. This method can be executed by a joint diagnostic device for self-regulated learning and emotional motivation (hereinafter referred to as the diagnostic device), specifically by one or more processors within the diagnostic device, to implement the following method: S1. Obtain the original behavioral data and text data, and align them using a cross-excitation neural ordinary differential equation to obtain the aligned behavioral and semantic feature sequence. Specifically, step S1 further includes: acquiring the learner's original behavioral data and text data during the learning process, and setting a continuous hidden state vector to represent the learner's comprehensive cognitive evolution, wherein the learning process is a continuous time period. When it is determined that no text data is triggered at time t (i.e., the continuous evolution stage), the evolution rate (derivative) of the continuous hidden state vector is determined based on the continuous hidden state vector and the behavioral data at time t, and fitting and parameterization are performed based on the evolution rate and the evolution differential equation (the continuous evolution process is carried out by a deep neural network). The evolution differential equation is: $\frac{dh(t)}{dt} = f_{\theta}(h(t), x(t))$, where $f_{\theta}$ is a multilayer deep neural network with parameters $\theta$, used to capture the complex nonlinear dynamics behind behavioral sequences and compute the derivative of the state; $h(t) \in \mathbb{R}^{d_h}$ is the continuous hidden state vector at time t, representing the current baseline of the learning state; $x(t) \in \mathbb{R}^{d_x}$ is the behavioral data at time t; and $\mathbb{R}$ is the set of real numbers. When text data is detected at time t (i.e., when text data generated by the learner is captured at a discrete time), cross-excitation is triggered and the impulse jump phase begins. At this point, no segmentation along the time dimension is performed; instead, the existing continuous hidden state vector is instantaneously modulated and reset using an impulse update function. This mechanism allows the dense emotion and semantics contained in the text to change the trajectory of the learning state instantly. The impulse update function is: $h(t_i^{+}) = h(t_i^{-}) + g_{\phi}(h(t_i^{-}), e_i)$, where $t_i$ is the discrete timestamp at which the i-th valid text data occurs; $h(t_i^{-})$ is the implicit cognitive state accumulated by previous behaviors at the instant just before the text data occurs (i.e., as time approaches $t_i$ from below); $g_{\phi}$ is a multilayer spiking neural network with parameters $\phi$, used to evaluate the influence of the current text features on the previously accumulated state and compute the state jump offset; $e_i$ is the i-th text data collected at time $t_i$ (such as the initial word embedding of the text); and $h(t_i^{+})$ is the latest cognitive state vector, combining behavior and semantics, obtained after absorbing the instantaneous impulse information from the text. With $h(t_i^{+})$ as the new starting point of the evolution differential equation at the next moment, the method continues to determine whether text data is triggered at the next moment and, depending on the result, either performs fitting and parameterization or enters the impulse jump phase. These steps are repeated until the behavioral and text data at every time step of the learning process have been processed. The continuous hidden state vectors from the last time step are then used as the aligned behavioral and semantic feature sequence.
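The alternation between continuous evolution and impulse jumps in step S1 can be sketched as follows. This is a minimal illustration under stated assumptions: the names `f_theta`, `g_phi`, and `integrate` are hypothetical, the two "networks" are fixed closed-form stand-ins rather than trained models, and a simple Euler step replaces whatever ODE solver an implementation would use.

```python
import math


def f_theta(h, x):
    """Stand-in for the drift network: returns dh/dt from state and behavior."""
    return [math.tanh(0.5 * x_i - 0.1 * h_i) for h_i, x_i in zip(h, x)]


def g_phi(h_minus, e):
    """Stand-in for the impulse network: returns the state jump offset."""
    return [0.8 * math.tanh(e_i - h_i) for h_i, e_i in zip(h_minus, e)]


def integrate(h0, behavior, text_events, t_end, dt=0.1):
    """Evolve h continuously and apply a jump at each text timestamp.

    behavior: callable t -> behavior vector x(t)
    text_events: list of (t_i, e_i), sorted by time, with 0 <= t_i <= t_end
    Returns a list of (t, h) samples; no zero padding is ever inserted.
    """
    h, t = list(h0), 0.0
    pending = list(text_events)
    trajectory = [(t, list(h))]
    while t < t_end:
        next_t = min(t + dt, t_end)
        if pending and pending[0][0] < next_t:
            next_t = max(t, pending[0][0])  # stop exactly at the text event
        dh = f_theta(h, behavior(t))
        h = [h_i + (next_t - t) * d_i for h_i, d_i in zip(h, dh)]
        t = next_t
        while pending and pending[0][0] <= t:
            _, e = pending.pop(0)
            offset = g_phi(h, e)  # h here plays the role of the pre-jump state
            h = [h_i + o_i for h_i, o_i in zip(h, offset)]
        trajectory.append((t, list(h)))
    return trajectory


behavior = lambda t: [math.sin(t), 1.0]                    # hypothetical signal
events = [(0.37, [1.0, -1.0]), (1.42, [-0.5, 0.5])]        # irregular text times
traj = integrate([0.0, 0.0], behavior, events, t_end=2.0)
baseline = integrate([0.0, 0.0], behavior, [], t_end=2.0)  # behavior only
```

Because the integrator stops exactly at each text timestamp, irregularly spaced events need no window alignment, and the state between events is driven by behavior alone.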

[0018] In this embodiment, this step is mainly responsible for receiving the raw log data returned by the front-end online learning platform and solving the problem of asynchronous and irregular sampling of multimodal data in the feature access stage. In a real online learning environment, the data generated by learners exhibits modal spatiotemporal asynchrony, that is, objective operational behaviors (such as video playback and page scrolling) are high-frequency and approximately continuous in time, while subjective text expressions (such as posting comments and taking notes) are low-frequency and highly discrete. Traditional processing paradigms usually use a fixed-step sliding time window to forcibly truncate and align these two types of data. However, this approach has a fatal flaw: when there is a lack of text data in a certain time window, zero vectors must be used for crude filling, which not only causes sparsity and fragmentation in the feature space, but also seriously disrupts the continuous dynamic evolution trajectory of the learner's cognitive state on the real time axis. In order to completely solve the data discontinuity problem caused by this irregular sampling, this step abandons the traditional discrete time window segmentation strategy in the data access stage and introduces the Neural Ordinary Differential Equations (Neural ODEs) technique.
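The zero-padding problem of fixed windows can be made concrete with a small count. The session length, window size, and text timestamps below are hypothetical and chosen only to illustrate how sparse the text modality becomes after windowing.

```python
import math


def window_text_coverage(text_times, t_end, window):
    """Return (number of fixed windows, number of windows with no text event).

    Every window without a text event would need a zero vector for the
    text modality under the fixed sliding-window scheme.
    """
    n_windows = math.ceil(t_end / window)
    has_text = [False] * n_windows
    for t in text_times:
        idx = min(int(t // window), n_windows - 1)
        has_text[idx] = True
    return n_windows, has_text.count(False)


# Hypothetical 60-minute session, 30-second windows, three text events.
n_windows, n_empty = window_text_coverage([125.0, 1310.0, 3400.0], 3600, 30)
# n_windows == 120, n_empty == 117
```

In this example 117 of 120 windows carry no text and would be zero-filled, which is the sparsity and fragmentation the paragraph describes.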

[0019] The fundamental reason for introducing neural ordinary differential equations (ODEs) is that this technique does not require the assumption that data is sampled at equal intervals. It can directly parameterize the latent state derivatives within continuous time using neural networks, thus perfectly fitting time series modeling with irregular intervals. Building on this, considering the interactive characteristics of multimodal data in educational scenarios, this method further transforms the technique into Cross-Excitation Neural ODEs. The reason for adopting a cross-excitation mechanism is that learners' implicit cognitive states are driven by continuous operational behaviors for the vast majority of the time, but their cognition or emotion often undergoes a step change at the moment of generating explicit textual expression. Therefore, this step uses high-frequency behavioral sequence features as continuous driving forces in the differential equation to smoothly deduce the state trajectory, while using low-frequency textual events as instantaneous impulses of the state to modulate the derivatives, thereby achieving a natural fusion of the two modalities in the continuous time domain.

[0020] Specifically, in this embodiment, the raw behavioral data and raw text data generated by the learner throughout the learning process are acquired. The learning process is treated as a continuous time period, and a continuous hidden state vector is defined to represent the learner's comprehensive cognitive evolution trajectory from the start to the end of learning; this vector changes continuously over time. During the data alignment phase, the data at each moment is processed sequentially in chronological order. When it is determined that no text data is triggered at the current moment t, evolution is driven solely by behavioral data: based on the continuous hidden state vector and the behavioral data at the current moment t, the evolution rate of the continuous hidden state vector is determined, and fitting and parameterization are performed based on this evolution rate and a preset evolution differential equation. The evolution differential equation is implemented by a parameterized multilayer deep neural network that maps the current hidden state and behavioral data to the derivative of the hidden state, thereby describing the continuous dynamics of the learning state in the absence of external semantic input. When it is determined that text data is triggered at the current moment t, the cross-excitation mechanism is triggered and the impulse jump phase begins. The existing continuous hidden state vector is then instantaneously modulated and reset by a dedicated impulse update function, implemented by another parameterized multilayer spiking neural network. This function receives the implicit cognitive state accumulated by previous behaviors just before the text event occurs, together with the text data collected at the current moment, calculates a jump offset, and adds this offset to the pre-event hidden state to obtain the new hidden state after the impulse update. This new hidden state absorbs the instantaneous impulse information of the text and becomes the new starting point for subsequent continuous evolution. From this new starting point, the method continues to determine whether text data is triggered at the next moment: if there is no text, it re-enters the evolution differential equation for continuous fitting; if there is text, it triggers another impulse jump. The process is repeated until the behavioral and text data at every moment of the learning process have been processed. Finally, the continuous hidden state trajectory along the entire time axis is output as the aligned behavioral and semantic feature sequence. Through this cross-excitation neural ordinary differential equation mechanism, this embodiment overcomes the large amount of invalid zero-vector padding caused when a traditional fixed time window forcibly truncates data. It losslessly restores the true dynamic evolution trajectory of the learner's cognitive state on the continuous time axis, significantly improving feature completeness and robustness in extremely irregular, discretely distributed scenarios such as "idling" and "lurking".

[0021] In short, this step uses cross-excitation neural ordinary differential equations to model the original asynchronous logs in the continuous time domain, eliminating the forced segmentation required by traditional fixed time windows and fundamentally avoiding the invalid zero-vector padding caused by missing single-modal data. By using low-frequency text as instantaneous impulses that dynamically modulate the continuous evolution derivative of high-frequency behavior, this step preserves the temporal coherence and true dynamic characteristics of multimodal data to the greatest extent. In business scenarios, when facing extreme long-tail data with irregular sampling such as "long-term idling" or "deep lurking," this step enables the method to initiate real-time, lossless diagnostic sampling at any specified continuous time, significantly improving the robustness of underlying feature access and the engineering feasibility of online diagnosis.

[0022] S2. The aligned behavioral and semantic feature sequence is input into a selective state space model (Mamba) with linear time complexity to obtain the behavior context feature vector, and the behavior context feature vector is input into the preset low-rank mapping network to obtain the text semantic feature vector. Specifically, step S2 further includes: sampling the aligned behavioral and semantic feature sequence at a fixed time step $\Delta t$ to obtain a discrete behavioral feature sequence; and using a learnable noise filtering gate to compute a noise filtering coefficient and suppress invalid noise behaviors (such as accidental clicks and very short dwell times), yielding the denoised effective behavioral features. The formulas are: $g_k = \sigma(W_g u_k + b_g)$, $\tilde{z}_k = g_k \odot z_k$. This ensures that subsequent feature extraction focuses only on learning behaviors strongly correlated with the SRL stage. Here $\tilde{z}_k$ is the denoised effective behavioral feature at the k-th time step; $g_k$ is the noise filtering coefficient at the k-th time step; $\odot$ is element-wise multiplication; $z_k$ is the discrete behavioral feature at the k-th time step; $\sigma$ is the Sigmoid activation function; $W_g$ is the learnable weight; $u_k$ is the noise-determining sub-feature; and $b_g$ is the bias of the noise filtering gate. The closer $g_k$ is to 0, the closer the behavior is to invalid noise, and the more strongly the feature is suppressed.
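The noise filtering gate described above is a sigmoid gate followed by element-wise multiplication. A minimal sketch is shown below; the function name, the choice of sub-features (dwell time and click interval), and the weights are hypothetical, since a trained gate would learn them.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def noise_gate(z_k, u_k, W_g, b_g):
    """Compute the gate from sub-features u_k and apply it to features z_k.

    gate = sigmoid(W_g u_k + b_g); output = gate * z_k (element-wise).
    """
    gate = [sigmoid(sum(w * u for w, u in zip(row, u_k)) + b)
            for row, b in zip(W_g, b_g)]
    return gate, [g * z for g, z in zip(gate, z_k)]


# Hypothetical weights: a large first sub-feature marks the step as noise.
W_g = [[-6.0, 0.0], [-6.0, 0.0]]
b_g = [0.0, 0.0]
z_k = [1.5, -0.7]  # discrete behavioral feature at step k

gate_noise, out_noise = noise_gate(z_k, [1.0, 0.2], W_g, b_g)   # accidental click
gate_valid, out_valid = noise_gate(z_k, [-1.0, 0.2], W_g, b_g)  # deliberate action
```

A gate value near 0 suppresses the step almost entirely, while a value near 1 passes the feature through largely unchanged.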

[0023] Based on the discretized state transition mechanism, the denoised effective behavioral features $\tilde{z}_k$ are processed recursively to obtain the evolved hidden state vector: $s_k = \bar{A}_k s_{k-1} + \bar{B}_k \tilde{z}_k$, where $s_k$ is the evolved hidden state vector at the k-th time step; $\bar{A}_k$ and $\bar{B}_k$ are dynamic state-space matrices discretized using a zero-order hold, whose parameters change dynamically with the current input, giving the model the ability to selectively filter out invalid actions; and $s_{k-1}$ is the evolved hidden state vector at the (k-1)-th time step, used to memorize and transmit the historical operation trajectory. The evolved hidden state vectors are input into the pooling layer, where SRL temporal keyframe pooling computes an SRL feature saliency score $\alpha_k$ at each time step from the L2 norm $\lVert \cdot \rVert_2$ of the evolved hidden state vectors $s_j$, with $j$ indexing the time steps. Based on the SRL feature saliency scores, the keyframe hidden states whose scores are greater than or equal to the selection threshold $\tau$ are selected, $S_{\text{key}} = \{ s_k \mid \alpha_k \geq \tau \}$, and a high-purity behavior context feature vector is output through a fusion pooling operation, $c = \mathrm{Concat}(\mathrm{MaxPool}(S_{\text{key}}), \mathrm{AvgPool}(S_{\text{key}}))$, where $\mathrm{Concat}$ is the concatenation operation, $\mathrm{MaxPool}$ is the max pooling operation, and $\mathrm{AvgPool}$ is the average pooling operation.
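The recurrence and keyframe pooling in this paragraph can be sketched as follows. Several choices here are assumptions made for illustration rather than details confirmed by the text: the state matrix is diagonal, only the step size depends on the input, the input is a scalar per step, the saliency score is the L2 norm of each state divided by the sum of norms over all steps, and the fallback to the top-scoring state when nothing passes the threshold is added for robustness. All function and parameter names are hypothetical.

```python
import math


def softplus(value):
    return math.log1p(math.exp(value))


def selective_scan(inputs, a_diag, b_vec, w_delta=1.0, b_delta=0.0):
    """Run state = A_bar * state + B_bar * input with zero-order-hold matrices.

    For a diagonal A with negative entries a and step size delta:
      A_bar = exp(delta * a),  B_bar = (exp(delta * a) - 1) / a * b
    The step size delta depends on the current input (the selective part).
    """
    state = [0.0] * len(a_diag)
    states = []
    for value in inputs:
        delta = softplus(w_delta * value + b_delta)
        a_bar = [math.exp(delta * a) for a in a_diag]
        b_bar = [(ab - 1.0) / a * b for ab, a, b in zip(a_bar, a_diag, b_vec)]
        state = [ab * s + bb * value for ab, bb, s in zip(a_bar, b_bar, state)]
        states.append(list(state))
    return states


def keyframe_pool(states, tau):
    """Score steps by normalised L2 norm, keep scores >= tau, then pool."""
    norms = [math.sqrt(sum(v * v for v in s)) for s in states]
    total = sum(norms) or 1.0
    scores = [n / total for n in norms]  # assumed normalisation
    keep = [s for s, sc in zip(states, scores) if sc >= tau]
    if not keep:  # assumed fallback: keep the single most salient step
        keep = [states[scores.index(max(scores))]]
    dim = len(states[0])
    max_pool = [max(s[d] for s in keep) for d in range(dim)]
    avg_pool = [sum(s[d] for s in keep) / len(keep) for d in range(dim)]
    return scores, max_pool + avg_pool  # concatenation of the two pools


denoised = [0.0, 2.0, 0.1, 1.5]  # hypothetical denoised inputs, one per step
states = selective_scan(denoised, a_diag=[-1.0, -0.5], b_vec=[1.0, 1.0])
scores, context = keyframe_pool(states, tau=1.0 / len(denoised))
```

The context vector has twice the state dimension because max pooling and average pooling are concatenated, and steps with zero input contribute nothing to the first state and therefore score zero.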

[0024] The behavior context feature vector $c$ is dynamically transformed into continuous prompt word vectors (soft prompts) $P = W_p c + b_p$ whose dimensions are strictly consistent with the word embedding space of the large language model. In essence, this translates the learner's silent actions into a mathematical context that the large model can understand. Here $W_p$ is the learnable weight matrix in the low-rank mapping network and $b_p$ is the learnable bias vector in the low-rank mapping network. To prevent catastrophic forgetting of parameters in small-sample educational scenarios, the weights of the preset backbone network of the large language model are frozen. The continuous prompt vectors are prefix-concatenated with the word embedding vectors of the original text data along the sequence dimension to form a joint input feature matrix. This matrix is then fed into the frozen large language model for deep semantic extraction, yielding a high-dimensional text semantic feature vector: $v_{\text{text}} = \mathrm{LLM}_{\text{frozen}}([P ; E])$, where $\mathrm{LLM}_{\text{frozen}}$ is the forward propagation function of the frozen large language model, $E$ is the initial word embeddings obtained by table lookup from the original text data, and $[\,\cdot\,;\,\cdot\,]$ denotes concatenation of $P$ and $E$ along the sequence length dimension.
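The mapping from a behavior context vector to prefix prompt vectors can be sketched as below. The factorisation of the weight matrix into two thin matrices (`down`, `up`), the number of prompt vectors, and all dimensions are assumptions chosen for illustration. The frozen language model itself is not included, so the sketch stops at the joint input matrix.

```python
import random


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def make_soft_prompt(context, down, up, bias, n_prompt, d_model):
    """Map the context vector to n_prompt vectors of width d_model.

    The weight is factorised as up @ down (rank = len(down)), one assumed
    way to realise a low-rank mapping; the bias is added afterwards.
    """
    bottleneck = matvec(down, context)
    flat = [y + b for y, b in zip(matvec(up, bottleneck), bias)]
    return [flat[i * d_model:(i + 1) * d_model] for i in range(n_prompt)]


def prefix_concat(prompt, token_embeddings):
    """Concatenate along the sequence dimension, prompt vectors first."""
    return prompt + token_embeddings


random.seed(0)
d_ctx, rank, n_prompt, d_model, n_tokens = 4, 2, 3, 5, 6
down = [[random.gauss(0, 0.1) for _ in range(d_ctx)] for _ in range(rank)]
up = [[random.gauss(0, 0.1) for _ in range(rank)]
      for _ in range(n_prompt * d_model)]
bias = [0.0] * (n_prompt * d_model)

context = [0.3, -0.2, 0.8, 0.1]  # hypothetical behavior context vector
token_embeddings = [[random.gauss(0, 1) for _ in range(d_model)]
                    for _ in range(n_tokens)]  # stand-in for embedding lookup

prompt = make_soft_prompt(context, down, up, bias, n_prompt, d_model)
joint = prefix_concat(prompt, token_embeddings)  # shape: 9 rows of width 5
```

Only `down`, `up`, and `bias` would be trainable in such a setup; the language model that consumes `joint` keeps its weights frozen, as the paragraph states.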

[0025] Preferably, the selective state space model adopts a multi-layer stacked structure, with its hidden layer dimension set to 256 and the state expansion coefficient set to 16. The specific number of layers can be flexibly adjusted according to the sequence length, preferably 4 layers, to correspond to the four stages of SRL: planning, execution, monitoring, and reflection. Specifically, for the characteristics of educational scenarios, the selective state space model undergoes three embedded reconstructions, transforming it from a general long sequence extractor into a feature extractor specific to educational behavioral context and SRL stage perception. These are: first, to address the high proportion of invalid noise in educational behavioral sequences, a learnable noise filtering gate is embedded after the Mamba input layer; second, to make the evolution of the hidden state of Mamba conform to the educational logic of the four stages of SRL (planning / execution / monitoring / reflection), the state transition mechanism of the Mamba input layer is initialized with a structured state; and third, because Mamba's global average pooling dilutes the core SRL behavioral features in long sequences, global pooling is replaced with SRL temporal keyframe pooling.

[0026] In this embodiment, this step is mainly responsible for extracting high-dimensional modal deep features while preserving the temporal dynamics of the output of the evolutionary alignment step. It also overcomes the bottleneck of modal isolation in existing technologies, achieving cross-modal early deep perception of text extraction with behavioral context intervention. In existing multimodal learning state diagnosis models, dual-branch feature extraction networks are usually in a state of absolute physical isolation, meaning that behavioral features and text features are extracted in parallel by two independent networks until the final fusion layer interacts. This late-stage fusion paradigm has a serious business logic flaw: when extracting text sentiment semantics, the large language model is completely stripped of the micro-behavioral context at the moment of text generation. For example, the "too easy" left by a learner when quickly dragging the progress bar to skip playback is completely different from the "too easy" left after repeatedly watching the same segment for a long time; the underlying cognition and emotional state are drastically different. Isolated text semantic extraction, lacking the prior constraints of action background, is highly prone to semantic misjudgment of such ironic or perfunctory expressions. To completely address this technical deficiency of being detached from behavioral context, this method constructs a behavior-driven continuous prompting mechanism for large models.

[0027] First, in the behavioral feature extraction branch, the micro-operation behavioral sequences collected by online learning platforms are often extremely long, and traditional Transformer models based on self-attention face computational complexity that grows quadratically with sequence length, making it difficult to process such high-frequency, long sequences efficiently. Therefore, a selective state-space model with linear time complexity is introduced. Second, in the text feature extraction branch, in order to break down the isolation between modalities, this step does not feed the text into a large language model in isolation, but uses the extracted behavioral feature vectors as prior conditions for cross-modal intervention.

[0028] Specifically, in the behavioral feature extraction branch, a selective state-space model with linear time complexity is employed. This model effectively overcomes the bottleneck of traditional self-attention models, where computational complexity increases quadratically with sequence length, significantly improving the extraction efficiency of long-sequence behavioral features when dealing with extremely long sequences of micro-operations. The input behavioral features are processed using a learnable noise filtering gating mechanism. Specifically, a noise filtering coefficient is calculated, obtained by linearly transforming the noise decision features and mapping them using the Sigmoid activation function. This noise filtering coefficient is then multiplied element-wise with the discrete behavioral feature sequence to obtain the denoised effective behavioral features. This mechanism effectively suppresses interference from invalid noise behaviors such as accidental clicks and short-term dwell times, ensuring that the subsequent model's attention focuses on real learning behaviors strongly correlated with the self-regulating learning stage, thereby improving the signal-to-noise ratio and purity of the behavioral features. After denoising, the denoised effective behavioral features are recursively calculated based on a discretized state transition mechanism. Specifically, the dynamic state space matrix, discretized by a zero-order hold, is used in conjunction with the evolved hidden state vector from the previous time step to recursively calculate the evolved hidden state vector for the current time step. The parameters of these state space matrices dynamically change according to the current input, enabling the model to selectively filter out invalid actions and achieve adaptive extraction of effective information from the educational behavior sequence. 
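The gating and recursion described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation of this embodiment: the state matrix is taken to be diagonal, only the step size depends on the current input, each time step carries a single scalar feature, and the names `noise_gate`, `selective_scan`, `w_gate`, `b_gate`, `w_delta` and `b_delta` are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def softplus(v):
    return math.log1p(math.exp(v))

def noise_gate(features, decision_features, w_gate, b_gate):
    """Scale each behavioral feature by a sigmoid gate in (0, 1).

    The gate is a linear transform of the noise decision feature followed by
    a sigmoid; w_gate and b_gate stand in for learnable parameters (assumed).
    """
    return [sigmoid(w_gate * d + b_gate) * x
            for x, d in zip(features, decision_features)]

def selective_scan(inputs, a_diag, b_vec, w_delta, b_delta):
    """Recursion h_k = A_bar_k * h_{k-1} + B_bar_k * x_k with a diagonal A.

    Zero-order-hold discretization per state channel (a must be non-zero):
        A_bar = exp(delta * a),  B_bar = (exp(delta * a) - 1) / a * b
    The step size delta is computed from the current input, which is what
    makes the transition input-dependent (simplifying assumption: B is fixed).
    """
    h = [0.0] * len(a_diag)
    states = []
    for x in inputs:
        delta = softplus(w_delta * x + b_delta)  # positive, input-dependent
        new_h = []
        for a, b, h_prev in zip(a_diag, b_vec, h):
            a_bar = math.exp(delta * a)
            b_bar = (a_bar - 1.0) / a * b
            new_h.append(a_bar * h_prev + b_bar * x)
        h = new_h
        states.append(h)
    return states
```

A typical call would first pass the raw sequence through `noise_gate` and then feed the gated values to `selective_scan`, keeping the returned per-step states for the pooling stage.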
After obtaining the evolved hidden state vector, it is input into a pooling layer and further processed through a self-regulating learning temporal keyframe pooling mechanism. First, the saliency score of the self-regulating learning features at each time step is calculated. This score is obtained by calculating the ratio of the L2 norm of the current hidden state vector to the maximum L2 norm of all historical hidden state vectors. This score effectively characterizes the significance of the current moment relative to historical peaks; the closer the score is to 1, the more likely the current moment is to be a key stage of self-regulating learning. Subsequently, based on a preset screening threshold, keyframe hidden states with saliency scores greater than or equal to the threshold are selected, i.e., only time steps strongly correlated with the self-regulating learning stage are retained. Finally, by fusion pooling, the selected keyframe hidden states are simultaneously subjected to max pooling and average pooling, and the results are concatenated along the channel dimension to output a high-purity behavioral context feature vector. This mechanism effectively solves the problem that traditional global average pooling dilutes the core self-regulating learning behavioral features in long sequences, making the final extracted behavioral features more focused on key stages such as planning, execution, monitoring, and reflection, and significantly improving the representational ability of behavioral context features.
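The keyframe pooling mechanism can be illustrated with a short Python sketch. It assumes that "historical" means all time steps up to and including the current one, so the maximum norm is tracked as a running maximum; the function name `keyframe_pool` and the example threshold are hypothetical.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(c * c for c in v))

def keyframe_pool(states, threshold):
    """Select salient hidden states and fuse them by max and average pooling.

    Saliency at step k = ||h_k|| / max_{j <= k} ||h_j||  (running maximum,
    an assumption about how the historical peak is defined).
    Returns the concatenation [max-pooled channels] + [average-pooled channels].
    """
    running_max = 0.0
    keyframes = []
    for h in states:
        norm = l2_norm(h)
        running_max = max(running_max, norm)
        score = norm / running_max if running_max > 0.0 else 0.0
        if score >= threshold:
            keyframes.append(h)
    if not keyframes:
        raise ValueError("no hidden state reached the saliency threshold")
    dim = len(keyframes[0])
    max_part = [max(h[i] for h in keyframes) for i in range(dim)]
    avg_part = [sum(h[i] for h in keyframes) / len(keyframes) for i in range(dim)]
    return max_part + avg_part
```

With a threshold close to 1, only steps that match or nearly match the historical peak survive, so low-activity steps no longer dilute the pooled vector. The output has twice the hidden dimension because the two pooled vectors are concatenated along the channel dimension.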

[0029] In the text feature extraction branch, the text is not directly input into the large language model in isolation. Instead, the extracted behavioral context feature vectors are used as prior conditions for cross-modal intervention. Specifically, a low-rank mapping network is constructed to dynamically transform the behavioral context feature vectors into continuous cue word vectors that are strictly consistent with the word embedding space dimension of the large language model. This transformation process is achieved through learnable weight matrices and bias vectors, essentially translating the learner's silent actions into mathematical context that the large language model can understand. Subsequently, the pre-set backbone network weights of the pre-trained large language model are forcibly frozen to avoid catastrophic forgetting of parameters in small-sample educational scenarios. The generated continuous cue word vectors and the initial word embedding matrix obtained by looking up the original text data are prefixed and concatenated in the sequence dimension to form a joint input feature matrix. This joint input feature matrix is fed into the frozen large language model for forward propagation calculation, ultimately obtaining a high-dimensional text semantic feature vector. Through the aforementioned behavior manifold-driven large-model continuous prompting mechanism, this embodiment breaks the isolation of traditional two-branch feature extraction from behavioral context, establishing early deep perception across modalities. This enables the large language model to be controlled in real-time by continuous micro-behavioral actions at the underlying level when extracting text sentiment. 
This mechanism allows the model to deeply analyze text semantics in conjunction with specific action contexts, significantly reducing the misjudgment rate of sentiments caused by detachment from business context, such as ironic or perfunctory texts, and ensuring that the extracted semantic features have extremely high business context fidelity.
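The prompt construction can be sketched as follows. This is an illustrative assumption rather than the embodiment's code: the low-rank mapping is modelled as two small matrices (`w_down`, then `w_up`) plus a bias, the result is reshaped into `n_prompt` prompt vectors of width `d_model`, and the frozen large language model itself is left out. All names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def behavior_to_soft_prompt(z_b, w_down, w_up, bias, n_prompt, d_model):
    """Map a behavioral context vector to n_prompt continuous prompt vectors.

    w_down has shape (rank, len(z_b)) and w_up has shape
    (n_prompt * d_model, rank), so the composed map is low-rank.
    """
    low = matvec(w_down, z_b)
    flat = [y + b for y, b in zip(matvec(w_up, low), bias)]
    if len(flat) != n_prompt * d_model:
        raise ValueError("w_up and bias must produce n_prompt * d_model values")
    return [flat[i * d_model:(i + 1) * d_model] for i in range(n_prompt)]

def prefix_concat(prompt_vectors, token_embeddings):
    """Prefix the prompt vectors to the token embeddings along the sequence axis."""
    width = len(prompt_vectors[0])
    if any(len(row) != width for row in token_embeddings):
        raise ValueError("prompt width must equal the word embedding width")
    return prompt_vectors + token_embeddings
```

The joint matrix returned by `prefix_concat` is what would be passed to the frozen model as input embeddings; only `w_down`, `w_up` and `bias` would receive gradients in such a setup.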

[0030] In short, this step employs a Mamba architecture with linear time complexity for the behavior branch and a large language model with continuous behavioral prompts for the text branch. The Mamba architecture avoids the quadratic growth in computation that traditional Transformers incur when handling lengthy, high-frequency click trajectories, significantly improving the extraction efficiency of long-sequence behavioral features. At the same time, the cross-modal prompting mechanism breaks down the traditional isolation between the two branches, so that the attention computation of the large language model is conditioned on the latent states of the underlying micro-actions. This cross-modal early perception mechanism allows the model to analyze text semantics in conjunction with the specific action context, reducing the misjudgment rate for "ironic" or "perfunctory" text sentiment caused by detachment from the business context and keeping the extracted semantic features faithful to that context.

[0031] S3 resolves the semantic conflict between the behavioral context feature vector and the text semantic feature vector, performs deep fusion of the cross-modal features, and obtains the final joint feature vector. Specifically, step S3 further includes: using the exponential map operator to project the behavioral context feature vector $z_b$ onto the Poincaré ball model (with negative curvature $-c$), obtaining the cross-modal behavior feature point $p_b$ which, after manifold mapping, falls within the hyperbolic space of the Poincaré ball: $p_b = \exp_0^c(z_b) = \tanh\!\big(\sqrt{c}\,\|z_b\|_2\big)\,\dfrac{z_b}{\sqrt{c}\,\|z_b\|_2}$, where $\exp_0^c(\cdot)$ is the mapping function, representing the exponential map of the behavioral context feature vector starting from the origin 0 of a Poincaré ball with a fixed negative curvature of absolute value $c$; $c$ is the absolute value of the fixed negative curvature initially set for the current Poincaré ball manifold ($c > 0$); $\tanh(\cdot)$ is the hyperbolic tangent activation function; and $\|\cdot\|_2$ is the L2 norm of the feature vector in Euclidean space, i.e., its modulus. This mapping preserves the direction of the feature and nonlinearly compresses its modulus according to the rules of hyperbolic space, so that the abstract educational features are given strict hierarchical coordinates.

[0032] Using the same exponential map operator, the text semantic feature vector $z_t$ is projected onto the Poincaré ball model, obtaining the cross-modal text feature point $p_t = \exp_0^c(z_t)$ which, after manifold mapping, falls within the hyperbolic space of the Poincaré ball. After the hyperbolic space projection is completed, feature fusion is not performed immediately. Instead, a core semantic conflict metric and curvature adaptation mechanism is introduced: the Möbius distance $d_M(p_b, p_t)$ between the cross-modal behavior feature point $p_b$ and the cross-modal text feature point $p_t$ in hyperbolic space is calculated, and this Möbius distance represents the intensity of semantic conflict across modalities. Unlike the simple linear distance in Euclidean space, the Möbius distance quantifies the degree of semantic conflict between the two modalities within the educational knowledge hierarchy network. When the Möbius distance is determined to be greater than a safety threshold (i.e., when the current learner is determined to have a serious modal discrepancy), the curvature adaptive adjustment mechanism is triggered, dynamically increasing the negative curvature of the local space to obtain a new curvature parameter $c'$ (the curvature of the local hyperbolic space after conflict-adaptive expansion), with $\lambda$ denoting the learnable curvature sensitivity penalty coefficient, which is initialized to 0.01 and optimized through training. This local space refers to the local region of the Poincaré ball model occupied by, or adjacent to, the cross-modal behavior feature point $p_b$ and the cross-modal text feature point $p_t$ currently to be fused. Möbius addition is then used to nonlinearly synthesize $p_b$ and $p_t$, obtaining the fused hyperbolic feature point $p_f$: $p_f = p_b \oplus_{c'} p_t = \dfrac{\big(1 + 2c'\langle p_b, p_t\rangle + c'\|p_t\|^2\big)\,p_b + \big(1 - c'\|p_b\|^2\big)\,p_t}{1 + 2c'\langle p_b, p_t\rangle + c'^2\|p_b\|^2\|p_t\|^2}$, where $\oplus_{c'}$ is Möbius addition under the current curvature parameter $c'$, $\langle p_b, p_t\rangle$ is the Euclidean inner product of the cross-modal behavior feature point and the cross-modal text feature point, and $\|\cdot\|$ is the Euclidean norm. The fused hyperbolic feature point $p_f$ is projected back into Euclidean space through the logarithmic map to obtain the final joint feature vector $z$ for use in the joint diagnosis step: $z = \log_0^{c'}(p_f) = \operatorname{artanh}\!\big(\sqrt{c'}\,\|p_f\|_2\big)\,\dfrac{p_f}{\sqrt{c'}\,\|p_f\|_2}$, where $\operatorname{artanh}(\cdot)$ is the inverse hyperbolic tangent function.
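The hyperbolic operations named in this paragraph can be sketched with the standard Poincaré ball formulas. The sketch assumes a single curvature value `c` (the absolute value of the negative curvature) is used for projection, fusion and back-projection, and it takes the Möbius distance to be the usual Poincaré ball geodesic distance; the function names are illustrative.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    return math.sqrt(dot(x, x))

def exp_map0(v, c):
    """Exponential map at the origin of a Poincare ball with curvature -c."""
    n = norm(v)
    if n == 0.0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * a for a in v]

def log_map0(p, c):
    """Logarithmic map at the origin (inverse of exp_map0)."""
    n = norm(p)
    if n == 0.0:
        return list(p)
    arg = min(math.sqrt(c) * n, 1.0 - 1e-15)  # keep atanh inside its domain
    scale = math.atanh(arg) / (math.sqrt(c) * n)
    return [scale * a for a in p]

def mobius_add(x, y, c):
    """Mobius addition of two points inside the ball."""
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    coef_x = 1.0 + 2.0 * c * xy + c * y2
    coef_y = 1.0 - c * x2
    denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]

def mobius_distance(x, y, c):
    """Geodesic distance d(x, y) = 2/sqrt(c) * artanh(sqrt(c) * ||(-x) (+) y||)."""
    diff = mobius_add([-a for a in x], y, c)
    arg = min(math.sqrt(c) * norm(diff), 1.0 - 1e-15)
    return 2.0 / math.sqrt(c) * math.atanh(arg)

def fuse(z_b, z_t, c):
    """Project both vectors, combine them by Mobius addition, project back."""
    p_b, p_t = exp_map0(z_b, c), exp_map0(z_t, c)
    return log_map0(mobius_add(p_b, p_t, c), c)
```

Because Möbius addition keeps its result inside the ball, the logarithmic map in `fuse` is always well defined. Fusing a vector with the zero vector returns the vector unchanged, which is a convenient sanity check.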

[0033] When the Möbius distance is determined to be greater than the safety threshold, the curvature adaptive adjustment mechanism is triggered to dynamically increase the negative curvature of the local space. Specifically, when the Möbius distance is greater than 0.7, the case is judged as high conflict (for example, the behavior shows "idling" while the text shows "studying diligently"), and the absolute value of the local negative curvature is dynamically increased to 0.5, forcing the model to trace upwards along the educational knowledge hierarchy of learning task difficulty → learning state → emotional expression to resolve the surface conflict; when the Möbius distance is less than 0.3, the case is judged as low conflict and the initial curvature of 0.3 is maintained.
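The threshold rule above can be written as a small decision function. The values 0.7, 0.3 and 0.5 come from this paragraph; how distances between 0.3 and 0.7 are handled is not stated, so the sketch assumes the initial curvature is kept in that band. The function name and the returned labels are hypothetical.

```python
def adapt_curvature(distance, base_c=0.3, high_conflict_c=0.5,
                    high_threshold=0.7, low_threshold=0.3):
    """Return (curvature magnitude, conflict label) for a Mobius distance."""
    if distance > high_threshold:
        return high_conflict_c, "high"   # enlarge the local negative curvature
    if distance < low_threshold:
        return base_c, "low"             # keep the initial curvature
    return base_c, "intermediate"        # assumption: unchanged in this band
```

The returned curvature magnitude would then be the value used by the subsequent Möbius addition and logarithmic map.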

[0034] In this embodiment, this step is mainly responsible for receiving the behavioral feature vectors and text semantic feature vectors and, after resolving any serious semantic conflict between the two, completing the deep fusion of cross-modal features. In existing multimodal learning diagnostic techniques, the feature fusion stage almost always employs linear operations in Euclidean space, such as feature concatenation, vector dot product, or Softmax-based attention-weighted summation. However, the feature targets of the planning, execution, monitoring, and reflection stages of SRL, as well as emotional motivation, are not inherently flat but possess a highly hierarchical tree structure. More challenging still, when the learner's objective behavior seriously contradicts their subjective textual expression, the traditional linear weighting mechanism forcibly neutralizes two types of features that lie at opposite positions in Euclidean space, so the fused features lose their discriminative power and the model suffers modality collapse. To address feature trivialization and modal semantic conflict, this step abandons the Euclidean space assumption and introduces hyperbolic manifolds from non-Euclidean geometry, constructing a curvature-adaptive fusion module based on the Poincaré ball. The fundamental reason for introducing hyperbolic space is that its volume grows exponentially with radius, and this geometric property allows it to embed and represent tree-like hierarchical data with very low distortion.

[0035] Specifically, the behavioral context feature vector is projected onto the Poincaré sphere model using an exponential mapping operator to obtain cross-modal behavioral feature points; simultaneously, the text semantic feature vector is also projected using the same exponential mapping operator to obtain cross-modal text feature points. This projection process preserves the directionality of the features and performs non-linear compression, giving abstract educational features strict hierarchical coordinates. The Möbius distance between the two feature points in hyperbolic space is calculated. When this distance is greater than a safety threshold, an adaptive curvature adjustment mechanism is triggered. Specifically, when the Möbius distance is greater than 0.7, it is considered high conflict, and the local negative curvature is dynamically increased to 0.5; when the Möbius distance is less than 0.3, it is considered low conflict, and the initial curvature of 0.3 remains unchanged. This mechanism forces the model to trace upwards along the educational knowledge hierarchy, non-linearly resolving the common underlying causes behind the contradiction between behavior and text. Then, the two feature points are non-linearly synthesized using Möbius addition to obtain fused hyperbolic feature points. This operation uses Euclidean inner product and Euclidean norm to perform fractional operations, ensuring that the synthesized result remains in hyperbolic space. Finally, a logarithmic mapping is used to project the fused hyperbolic feature points back into Euclidean space, yielding the final joint feature vector.

[0036] This step utilizes exponential mapping to introduce features into the Poincaré sphere hyperbolic space and dynamically adjusts the local negative curvature based on Möbius distance for feature fusion. Leveraging the natural adaptability of hyperbolic space to tree-like hierarchical structures, this mechanism effectively overcomes the "feature neutralization" and "dimensional collapse" defects of traditional Euclidean linear weighting, significantly expanding the representational capacity of heterogeneous features. When learners' objective actions and subjective comments exhibit severe semantic conflicts, the dynamic curvature evolution mechanism forces the model to trace upwards along the educational psychological hierarchy, accurately uncovering the potential common root causes behind contradictory appearances. This maintains the deep discriminative power of cross-modal features while non-linearly resolving modal contradictions, providing a data foundation for subsequent diagnosis by eliminating falsehoods.

[0037] S4 uses a pre-built confounding factor memory to perform causal intervention processing on the joint feature vector to obtain pure features, and simultaneously performs diagnostic calculations of SRL ability and emotional motivation based on the pure features to generate the diagnostic results.

[0038] First, a Directed Acyclic Graph (DAG) containing confounding factors is implicitly constructed at the output. This DAG defines the causal pathways between external confounding variables, learners' temporary emotional states, objective behavioral performance, and actual self-regulated learning abilities. To block the dual spurious influence of confounding variables on SRL ability assessment and affective motivation diagnosis through emotional states, the back-door adjustment criterion in causal inference is introduced.

[0039] Specifically, step S4 further includes: instead of directly calculating the conditional probability, a pre-built confounding factor memory is used to perform causal intervention processing on the joint feature vector $z$. This processing stratifies and integrates over (i.e., weights by) the prior distributions of all possible confounding variables in the confounding factor memory, removing the portion of the variance response of the joint feature vector that is caused by external environmental factors, thus obtaining the pure feature $\tilde{z}$ after removal of the confounding variables. In causal intervention processing, the $do(\cdot)$ operator means cutting off all causal arrows pointing to the variable, so that it is no longer naturally affected by external confounding factors. Based on the pure feature $\tilde{z}$, diagnostic calculations of SRL ability and emotional motivation are performed simultaneously to obtain the causally intervened neural network diagnostic probability $P(Y_{\mathrm{SRL}} \mid do(z))$ and the back-door adjusted neural network diagnostic probability $P(Y_{\mathrm{EM}} \mid do(z))$, approximated as: $P(Y_{\mathrm{SRL}} \mid do(z)) \approx \sum_{u=1}^{U} f_{\mathrm{SRL}}\big(\mathrm{Concat}(\tilde{z}, c_u)\big)\,P(c_u)$, $P(Y_{\mathrm{EM}} \mid do(z)) \approx \sum_{u=1}^{U} f_{\mathrm{EM}}\big(\mathrm{Concat}(\tilde{z}, c_u)\big)\,P(c_u)$, where $Y_{\mathrm{SRL}}$ is the finally predicted true state of each dimension of objective self-regulated learning; $Y_{\mathrm{EM}}$ is the finally predicted true state of each dimension of emotional motivation; $U$ is the total number of confounder strata; $f_{\mathrm{SRL}}$ and $f_{\mathrm{EM}}$ are parameterized multilayer perceptron networks used to output predicted values under a specific confounding condition; $c_u$ is the u-th stratified confounder feature in the confounding factor memory; $\mathrm{Concat}(\tilde{z}, c_u)$ is the concatenation of $\tilde{z}$ and $c_u$; and $P(c_u)$ is the prior probability distribution of the u-th stratified confounder feature in global educational history data. Based on the causally intervened neural network diagnostic probability and the back-door adjusted neural network diagnostic probability, diagnostic results that include the learner's SRL and emotional motivation state are generated.

[0040] In this embodiment, this step is mainly responsible for receiving the cross-modal joint feature vector, eliminating temporary emotional interference caused by external environment or task difficulty, and outputting the final objective learner SRL and emotional motivation state profile in parallel. In existing multi-task joint diagnostic networks, the fused multimodal features are usually directly fed into a parallel multilayer perceptron (MLP) classifier and regressor for result mapping. This mapping paradigm based on traditional deep learning has a fundamental logical flaw: the model can only learn the statistical correlation on the surface of the data and cannot understand the causal mechanism behind the data. In real online learning scenarios, external environmental factors such as a sudden increase in question difficulty or network lag can act as confounders, triggering learners' brief irritability or negative emotions, accompanied by high-frequency disordered clicking behavior. After observing the high correlation between such irritability and disordered behavior, traditional models are prone to misjudging that the learner has poor self-regulated learning (SRL) planning and reflection abilities. In fact, this decline in ability assessment is due to spurious correlation introduced by external confounders, rather than a degradation of the learner's true intrinsic ability. To completely eliminate this spurious correlation, structural causal models (SCM) and counterfactual reasoning mechanisms were introduced. The core purpose of introducing structural causal models is to endow diagnosis with the mathematical reasoning ability to trace the root cause and isolate interference.

[0041] Specifically, the joint feature vector is input into the joint diagnostic step. First, a pre-built confounding factor memory is used to perform causal intervention processing on the joint feature vector. The prior distributions of all confounding variables in the memory are integrated stratum by stratum to remove the variance responses caused by external environmental factors from the joint feature vector, resulting in pure features free of confounding variables. Subsequently, based on these pure features, diagnostic calculations for self-regulated learning ability and emotional motivation are performed simultaneously, yielding the neural network diagnostic probabilities for causal intervention and back-door adjustment, respectively. Specifically, through two independent parameterized multilayer perceptron networks, the pure features are concatenated with each stratified confounder feature and a prediction is made for each stratum. The predictions are then weighted and summed according to the prior probability of each stratum, ultimately outputting the true state probabilities of each dimension of learner self-regulated learning and the classification probabilities of each category of emotional motivation, generating structured diagnostic results.
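The stratified weighting described above corresponds to a back-door style sum over confounder strata, which can be sketched as follows. The sketch assumes the pure feature is already available and that each predictor is any callable returning a list of scores; `backdoor_adjust` and its argument names are hypothetical.

```python
def backdoor_adjust(pure_feature, confounder_strata, priors, predictor):
    """Approximate P(Y | do(z)) as sum_u P(c_u) * f(concat(z_pure, c_u)).

    pure_feature      : list of floats (feature after confounder removal)
    confounder_strata : list of confounder feature vectors c_u from the memory
    priors            : prior probability P(c_u) of each stratum (sums to 1)
    predictor         : callable mapping a concatenated vector to output scores
    """
    if len(confounder_strata) != len(priors):
        raise ValueError("one prior is required per confounder stratum")
    if abs(sum(priors) - 1.0) > 1e-6:
        raise ValueError("stratum priors must sum to 1")
    total = None
    for c_u, p_u in zip(confounder_strata, priors):
        scores = predictor(pure_feature + c_u)  # list concatenation
        weighted = [p_u * s for s in scores]
        total = weighted if total is None else [t + w for t, w in zip(total, weighted)]
    return total
```

The same function would be called twice, once with the SRL predictor and once with the emotional motivation predictor, sharing the strata and priors. Because every stratum contributes in proportion to its prior rather than to how often it co-occurs with the observed feature, a temporarily over-represented confounder cannot dominate the output.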

[0042] This step constructs a directed acyclic causal graph containing confounding factors at the diagnostic output and performs feature decoupling and counterfactual reasoning through causal inference criteria. This mechanism breaks the limitation of deep learning relying solely on superficial statistical correlations for blind fitting, and through algorithmic intervention, severs the spurious correlation paths generated by specific confounding variables (such as a sudden increase in the difficulty of temporary questions) to core indicators. This upgrades the diagnostic network into a white-box system with anti-interference capabilities, and its output self-regulated learning (SRL) evaluation score is no longer misled by learners' temporary negative emotional fluctuations, but objectively and stably reflects their true cognitive characteristics. The structured causal diagnostic profile generated in this step has strong business interpretability, effectively avoiding the downstream adaptive teaching system from issuing incorrect intervention strategies due to misjudgment.

[0043] Preferably, in this embodiment, the training steps of the joint diagnostic model are as follows: Labels for the training data are obtained, abandoning full manual annotation and instead adopting a strategy of "mixing weakly supervised soft labels with a small number of precisely labeled hard labels"; specifically, for the SRL regression task, a prior rule mapping table based on objective background business logs (such as average dwell time and click frequency of specific tools) is constructed to automatically generate continuous SRL soft labels with probability distributions; for the sentiment classification task, hard labels are generated using (a small number) manually precisely labeled anchors combined with pre-trained model automatic pseudo-labeling; this weakly supervised pipeline achieves automated construction of massive training data at extremely low cost.

[0044] This approach abandons fixed weighting coefficients that require manual parameter tuning and introduces a dynamic joint loss function based on homoscedastic uncertainty. Two learnable noise variance parameters (i.e., task uncertainty indices), $\sigma_1$ and $\sigma_2$, are introduced for the SRL regression task and the sentiment classification task respectively, and the loss function for multi-task joint optimization is designed as: $\mathcal{L}_{\mathrm{total}} = \dfrac{1}{2\sigma_1^2}\,\mathcal{L}_{\mathrm{MSE}} + \dfrac{1}{\sigma_2^2}\,\mathcal{L}_{\mathrm{CE}} + \ln\sigma_1 + \ln\sigma_2$. In each backpropagation pass, the current learning difficulty and noise level of each task are automatically assessed based on maximum likelihood estimation. Here $\sigma_1$ is the observation noise parameter of the SRL regression task, adaptively learned by the joint diagnostic model during training; $\sigma_2$ is the observation noise parameter (i.e., uncertainty) of the sentiment classification task, adaptively learned by the joint diagnostic model during training; $\mathcal{L}_{\mathrm{MSE}}$ is the mean squared error loss (MSE Loss) of the SRL regression task; $\mathcal{L}_{\mathrm{CE}}$ is the cross-entropy loss (Cross-Entropy Loss) of the sentiment classification task; $\ln\sigma_1$ is the regularization penalty term of the SRL regression task; $\ln\sigma_2$ is the regularization penalty term of the sentiment classification task, preventing the network from increasing its variance parameter indefinitely to reduce the overall loss; and $\ln$ is the natural logarithm function.

[0045] In this embodiment, this step is primarily responsible for parameter optimization and model training for the complex causal adaptive diagnosis described above before deployment. In the education vertical, acquiring multimodal datasets precisely labeled by psychology experts is extremely costly. Traditional strongly supervised learning often faces a cold-start dilemma of "no labels available" in this scenario. Simultaneously, it requires parallel output of SRL level probabilities (for continuous regression tasks) and classification probabilities of emotional motivation (for discrete classification tasks). The gradient magnitudes and convergence speeds differ significantly between different tasks, making it highly susceptible to gradient conflicts and negative transfer when using traditional fixed-weight multi-task loss functions. To overcome these training obstacles, this step constructs a weakly supervised hybrid label pipeline and a dynamic uncertainty optimization mechanism.

[0046] Specifically, before deploying the aforementioned joint diagnostic method, each neural network needs to be trained. For the self-regulating learning regression task, a prior rule mapping table based on objective backend business logs is constructed to automatically generate continuous soft labels with probability distributions. For the sentiment classification task, a small number of manually labeled anchor points are combined with pre-trained models to automatically generate hard labels using pseudo-labels. Through the above weak supervision strategy, this embodiment significantly reduces the dependence on expensive manually labeled data, effectively alleviating the cold start problem in educational scenarios. In the design of the loss function for multi-task joint optimization, two learnable observation noise parameters are introduced for the regression and classification tasks respectively, which are adaptively learned by the model during training. During each backpropagation, the model automatically evaluates the current learning difficulty and noise level of each task based on maximum likelihood estimation, and prevents the noise parameters from increasing indefinitely through a regularization penalty term. This dynamic joint loss function can automatically balance the gradient weights of multiple tasks, eliminate gradient conflicts and negative transfer phenomena between regression and classification tasks, avoid the trial-and-error costs of manual parameter tuning, and ensure the stability and convergence effect of model training.
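A dynamic joint loss of this kind can be sketched in a few lines. The weighting coefficients below follow the widely used homoscedastic uncertainty formulation (regression term divided by twice the variance, classification term divided by the variance, plus the logarithm of each noise parameter); this specific form is an assumption, as are the function name and the choice to parameterize by the logarithm of each noise parameter for numerical stability.

```python
import math

def uncertainty_weighted_loss(mse_loss, ce_loss, log_sigma_reg, log_sigma_cls):
    """Combine a regression and a classification loss with learnable noise.

    total = mse / (2 * sigma_reg^2) + ce / sigma_cls^2
            + ln(sigma_reg) + ln(sigma_cls)
    The two log terms are the penalties that stop the noise parameters from
    growing without bound just to shrink the weighted losses.
    """
    sigma_reg_sq = math.exp(2.0 * log_sigma_reg)
    sigma_cls_sq = math.exp(2.0 * log_sigma_cls)
    return (mse_loss / (2.0 * sigma_reg_sq)
            + ce_loss / sigma_cls_sq
            + log_sigma_reg
            + log_sigma_cls)
```

For a fixed regression loss, the regression part of this expression is minimized when the squared noise parameter equals that loss, so a noisier task automatically receives a smaller weight. In an actual training loop the two log parameters would be trainable tensors updated by the same optimizer as the network weights.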

[0047] This step employs a weakly supervised soft-labeling strategy based on objective log rule mapping and introduces a dynamic multi-task joint loss function based on homoscedastic uncertainty. This dynamic optimization mechanism automatically assesses the current noise level of multiple tasks during backpropagation and adaptively adjusts the gradient weights of high-noise tasks, eliminating the "gradient conflict" and "negative transfer" phenomena between regression and classification tasks from a mathematical perspective. The weakly supervised pipeline significantly reduces the financial cost of expert labeling for educational data and effectively alleviates the "cold start" problem of large models in vertical domains; while the adaptive loss function avoids the trial-and-error cost of manually tuning fixed weights, strongly ensuring efficient and stable global optimal convergence for highly complex network modules including ordinary differential equations and non-Euclidean geometry in industrial applications.

[0048] In summary, this invention provides a joint diagnostic method for self-regulated learning and affective motivation, enabling joint and interpretable diagnosis of learners' SRL and affective motivation. The overall process follows a logical closed loop of data alignment, context extraction, source tracing and fusion, and causal output. First, high-frequency objective behavior sequences and low-frequency subjective text sequences from an online learning platform are continuously ingested, and lossless alignment is achieved in the continuous time domain using cross-excitation neural ordinary differential equations. Randomly arriving text messages serve as instantaneous trigger signals that correct, in real time, the learning trajectory driven by continuous operational behaviors, yielding a comprehensive feature state that is complete and coherent in time. Subsequently, in the feature extraction stage, the behavioral branch uses a selective state-space model (Mamba) with linear time complexity to extract action features, which are then transformed into continuous prompt vectors through low-rank mapping. These guide the large language model (LLM) to take the specific operational context into account when understanding text, so that it can identify expressions that would otherwise be semantically misjudged because of detachment from context. Following this, the extracted features are projected onto the hyperbolic manifold space of the Poincaré ball, and the semantic conflict between behavior and text is quantified by calculating the Möbius distance. When the conflict is high, the curvature of the space is adaptively adjusted so that the model traces upwards along the educational knowledge hierarchy and discovers the common underlying cause of the apparent contradiction (such as the learner encountering a learning bottleneck), thereby resolving the contradiction in the data. Finally, at the diagnostic output end, a structural causal model containing confounding factors such as external task difficulty is constructed. Using the counterfactual reasoning mechanism, interference from temporary emotional fluctuations caused by external objective factors is eliminated, and the self-regulated learning (SRL) level probability and the emotional motivation classification probability are output in parallel with logical support, generating an objective and interference-resistant structured profile of the learner's SRL and emotional motivation state.

[0049] Compared with existing technologies, this method has the following advantages:
1. It overcomes the alignment discontinuity of asynchronous, irregular data. This invention introduces a cross-excitation neural ordinary differential equation to fit the dynamic evolution trajectory of the learning state in the continuous time domain. Compared with the crude truncation and zero-vector padding of traditional sliding time windows, this better preserves the temporal continuity of the data and significantly improves feature completeness and system robustness in sparse, long-tail scenarios such as idling or lurking.
2. It reduces the semantic perception limits caused by isolating the modalities. Traditional dual-branch networks tend to analyze text detached from its behavioral context. This invention transforms the low-level action latent states extracted by the linear-time Mamba model into continuous prompt vectors at the input layer of a large language model, so that text feature extraction can refer to the micro-action context. This helps alleviate the semantic ambiguity, such as irony and perfunctory replies, that text-only analysis faces, and improves the accuracy of emotion extraction.
3. It provides a non-Euclidean geometric fusion scheme for handling semantic conflicts between modalities. When learner behavior contradicts textual expression, this invention projects the features into hyperbolic space and introduces a curvature adaptation mechanism that guides the model to trace back and associate information along the knowledge hierarchy. Compared with conventional Euclidean linear weighting, this mechanism avoids, to a certain extent, the loss of discriminative power caused by features cancelling each other out, and strengthens feature representation in conflict scenarios.
4. It enhances the robustness and interpretability of the diagnostic evaluation. Multi-task diagnosis is susceptible to interference from the external environment or task difficulty. This invention introduces a structural causal model and a counterfactual reasoning mechanism at the output, aiming to remove the interference of temporary emotional fluctuations caused by confounding variables. The SRL evaluation results therefore better reflect the learner's intrinsic stable characteristics, and the misjudgment rate caused by external noise is reduced.
5. It reduces model training costs and mitigates gradient conflicts in multi-task scenarios. This invention employs a weakly supervised soft-label generation strategy based on objective log statistics, reducing reliance on expensive manually labeled data. It also introduces a dynamic joint loss function based on homoscedastic uncertainty that automatically evaluates task noise and dynamically balances gradient weights, helping to ensure stability and convergence of the networks during multi-task joint training.
6. It supports more refined adaptive teaching interventions. The structured learner SRL and emotional motivation state profiles generated by this invention include not only quantified diagnostic scores but also modality conflict levels and causal reasoning evidence. The profile can serve as standardized input to downstream educational big data platforms and intervention rule bases, giving the system richer and more objective data for developing personalized teaching strategies.

[0050] The above description represents the preferred embodiments of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principles of the present invention, and these improvements and modifications are also considered to be within the scope of protection of the present invention.

Claims

1. A joint diagnostic method for self-regulated learning and emotional motivation, characterized in that processing is performed using a joint diagnostic model and comprises: acquiring original behavioral data and text data, and aligning them using a cross-excitation neural ordinary differential equation to obtain an aligned behavioral and semantic feature sequence; inputting the aligned behavioral and semantic feature sequence into a selective state-space model with linear time complexity to obtain a behavioral context feature vector, and inputting the behavioral context feature vector into a preset low-rank mapping network to obtain a text semantic feature vector; performing deep cross-modal feature fusion that resolves the semantic conflict between the behavioral context feature vector and the text semantic feature vector to obtain a final joint feature vector; and performing causal intervention on the joint feature vector using a pre-built confounding factor memory to obtain pure features, and, based on the pure features, performing diagnostic calculations of SRL ability and emotional motivation simultaneously to generate a diagnostic result.

2. The joint diagnostic method for self-regulated learning and emotional motivation according to claim 1, characterized in that acquiring the original behavioral data and text data and aligning them using the cross-excitation neural ordinary differential equation to obtain the aligned behavioral and semantic feature sequence specifically comprises: acquiring the learner's original behavioral data and text data during the learning process, and setting a continuous hidden state vector to represent the learner's comprehensive cognitive evolution, where the learning process is a continuous time period; when it is determined that no text data is triggered at time $t$, determining the evolution rate of the continuous hidden state vector from the continuous hidden state vector and the behavioral data at time $t$, and fitting and parameterizing the evolution rate with the evolution differential equation $\frac{dz(t)}{dt} = f_\theta(z(t), b(t))$, where $f_\theta$ is a multilayer deep neural network with parameters $\theta$, $z(t)$ is the continuous hidden state vector at time $t$, and $b(t)$ is the behavioral data at time $t$; when text data is detected at time $t$, triggering cross-excitation and entering the pulse jump stage, in which the existing continuous hidden state vector is instantaneously modulated and reset by the pulse update function $z(t_i^{+}) = g_\phi(z(t_i^{-}), e_i)$, where $t_i$ is the discrete timestamp at which the $i$-th valid text data occurs, $z(t_i^{-})$ is the implicit cognitive state accumulated by prior behaviors in the instant before the text data is generated, $g_\phi$ is a multilayer spiking neural network with parameters $\phi$, $e_i$ is the $i$-th text data, collected at time $t_i$, and $z(t_i^{+})$ is the continuous hidden state vector immediately after the update, that is, the latest cognitive state vector after absorbing the instantaneous pulse information from the text; taking $z(t_i^{+})$ as the new starting point of the evolution differential equation, continuing to determine whether text data is triggered at the next moment, and, according to the result, either performing the fitting and parameterization or entering the pulse jump stage; and repeating the above steps until the behavioral data and text data at every time step of the learning process have been processed, and then taking the continuous hidden state vectors from the last time step as the aligned behavioral and semantic feature sequence.
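The evolve-or-jump procedure of claim 2 can be illustrated with a minimal sketch. It is not the claimed implementation: the functions `drift` and `jump` are hypothetical hand-written stand-ins for the two learned networks, a fixed-step Euler update replaces whatever ODE solver is actually used, and the names `align`, `text_events`, and `dt` are introduced here only for illustration.

```python
def drift(z, b):
    """Hypothetical stand-in for the learned evolution network:
    pulls the hidden state toward the current behavioral input."""
    return [bi - zi for zi, bi in zip(z, b)]


def jump(z, e):
    """Hypothetical stand-in for the learned pulse update network:
    blends the pre-jump state with the text features."""
    return [0.5 * zi + 0.5 * ei for zi, ei in zip(z, e)]


def align(z0, behaviour, text_events, dt):
    """Walk through the time steps once.

    behaviour   : list of behavioral feature vectors, one per time step
    text_events : dict mapping a step index to a text feature vector
    dt          : assumed fixed Euler step size
    Returns the hidden state after every step.
    """
    z = list(z0)
    trajectory = []
    for step, b in enumerate(behaviour):
        if step in text_events:
            # Text detected: pulse jump stage resets the state at once.
            z = jump(z, text_events[step])
        else:
            # No text: continuous evolution, approximated by one Euler step.
            z = [zi + dt * di for zi, di in zip(z, drift(z, b))]
        trajectory.append(list(z))
    return trajectory


states = align([0.0], [[1.0], [1.0], [1.0]], {1: [3.0]}, dt=0.5)
print(states)
```

In this toy run the state drifts toward the behavioral input, is reset by the text event at step 1, and then resumes drifting from the reset value, which is the alternation the claim describes.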

3. The joint diagnostic method for self-regulated learning and emotional motivation according to claim 1, characterized in that the selective state-space model adopts a multi-layer stacked structure with a hidden layer dimension of 256 and a state expansion coefficient of 16, and, to suit the characteristics of the education scenario, incorporates three embedded modifications: a learnable noise filtering gate embedded after the Mamba input layer, structured-state initialization of the state transition mechanism of the Mamba input layer, and SRL temporal keyframe pooling in place of global pooling.

4. The joint diagnostic method for self-regulated learning and emotional motivation according to claim 2, characterized in that inputting the aligned behavioral and semantic feature sequence into the selective state-space model with linear time complexity to obtain the behavioral context feature vector specifically comprises: sampling the aligned behavioral and semantic feature sequence at a fixed time step to obtain a discrete behavioral feature sequence; calculating a noise filtering coefficient with the learnable noise filtering gate and suppressing ineffective noise behavior to obtain the denoised effective behavioral features, according to $\tilde{x}_k = g_k \odot x_k$ and $g_k = \sigma(W_g n_k + b_g)$, where $\tilde{x}_k$ is the denoised effective behavioral feature at the $k$-th time step, $g_k$ is the noise filtering coefficient at the $k$-th time step, $\odot$ is element-wise multiplication, $x_k$ is the discrete behavioral feature at the $k$-th time step, $\sigma$ is the Sigmoid activation function, $W_g$ is a learnable weight, $n_k$ is the noise-determining sub-feature, and $b_g$ is the bias of the noise filtering gate; recursively calculating the evolved hidden state vector from the denoised effective behavioral features based on the discretized state transition mechanism, according to $h_k = \bar{A} h_{k-1} + \bar{B} \tilde{x}_k$, where $h_k$ is the evolved hidden state vector at the $k$-th time step, $\bar{A}$ and $\bar{B}$ are the dynamic state-space matrices after zero-order-hold discretization, and $h_{k-1}$ is the evolved hidden state vector at the $(k-1)$-th time step; inputting the evolved hidden state vectors into the pooling layer and performing SRL temporal keyframe pooling, in which the SRL feature saliency score $s_k$ at each time step is calculated, where $s_k$ is the SRL feature saliency score at the $k$-th time step, $\|\cdot\|_2$ is the L2 norm, and $h_j$ is the evolved hidden state vector at the $j$-th time step; and, based on the SRL feature saliency scores, selecting the keyframe hidden states whose score is greater than or equal to the selection threshold $\tau$ and outputting a high-purity behavioral context feature vector through a fused pooling operation, according to $H_{key} = \{h_k \mid s_k \ge \tau\}$ and $v_b = \mathrm{Concat}(\mathrm{MaxPool}(H_{key}), \mathrm{AvgPool}(H_{key}))$, where $\mathrm{Concat}$ is the concatenation operation, $\mathrm{MaxPool}$ is the max pooling operation, and $\mathrm{AvgPool}$ is the average pooling operation.
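The gate, recurrence, and keyframe pooling steps of claim 4 can be sketched as follows. This is a simplified illustration under several assumptions that are not in the text: the gate and the discretized state-space matrices are reduced to scalars (`w_g`, `b_g`, `a_bar`, `b_bar`), and the saliency score is assumed to be each state's L2 norm divided by the sum of all norms, since the exact score formula did not survive in the text. All function and parameter names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def behaviour_context(xs, noise_feats, w_g, b_g, a_bar, b_bar, tau):
    """xs          : discrete behavioral feature vectors, one per step
    noise_feats : one scalar noise-determining sub-feature per step
    w_g, b_g    : scalar gate weight and bias (simplifying assumption)
    a_bar, b_bar: scalar stand-ins for the discretized matrices
    tau         : keyframe selection threshold
    """
    h = [0.0] * len(xs[0])
    states = []
    for x, n in zip(xs, noise_feats):
        g = sigmoid(w_g * n + b_g)                 # noise filtering coefficient
        x_den = [g * xi for xi in x]               # denoised features
        h = [a_bar * hi + b_bar * xi for hi, xi in zip(h, x_den)]
        states.append(h)

    # ASSUMED saliency: normalized L2 norm of each evolved hidden state.
    norms = [math.sqrt(sum(v * v for v in s)) for s in states]
    total = sum(norms) or 1.0
    scores = [n / total for n in norms]

    # Keep keyframes at or above the threshold (fall back to all states).
    keep = [s for s, sc in zip(states, scores) if sc >= tau] or states
    max_pool = [max(col) for col in zip(*keep)]
    avg_pool = [sum(col) / len(col) for col in zip(*keep)]
    return max_pool + avg_pool, scores             # concatenated pooling


vector, scores = behaviour_context(
    [[2.0], [0.0], [4.0]], [0.0, 0.0, 0.0],
    w_g=0.0, b_g=0.0, a_bar=0.5, b_bar=1.0, tau=0.25)
print(vector, scores)
```

With these toy inputs the middle step has the lowest saliency and is dropped, so the output concatenates the max and the mean of the two remaining hidden states.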

5. The joint diagnostic method for self-regulated learning and emotional motivation according to claim 4, characterized in that inputting the behavioral context feature vector into the preset low-rank mapping network to obtain the text semantic feature vector specifically comprises: dynamically transforming the behavioral context feature vector $v_b$ into a continuous prompt vector $P$ whose dimension is strictly consistent with the word embedding space of the large language model, according to $P = W_p v_b + b_p$, where $W_p$ is the learnable weight matrix of the low-rank mapping network and $b_p$ is the learnable bias vector of the low-rank mapping network; freezing the weights of the preset backbone network of the large language model; prefix-concatenating the continuous prompt and the word embeddings of the original text data along the sequence dimension to form a joint input feature matrix; and feeding this matrix into the frozen large language model for deep semantic extraction to obtain a high-dimensional text semantic feature vector $v_t = \mathrm{LLM}_{frozen}(\mathrm{Concat}_{seq}(P, E))$, where $\mathrm{LLM}_{frozen}$ is the forward propagation function of the frozen large language model, $E$ is the initial word embedding matrix obtained by looking up the original text data, and $\mathrm{Concat}_{seq}$ denotes concatenating $P$ and $E$ along the sequence length dimension.
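The prompt construction in claim 5 can be sketched with plain lists. The sketch makes assumptions that are not in the text: the low-rank mapping is written as two small matrices (`down`, then `up`), the result is reshaped into `n_tokens` prompt rows, and `frozen_llm` is a hypothetical mean-pooling stand-in for a real frozen language model, used only so the example runs.

```python
def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def continuous_prompt(v_b, down, up, bias, n_tokens):
    """Low-rank map: v_b -> rank-r code -> n_tokens * d_model values,
    reshaped into n_tokens prompt rows of width d_model."""
    code = matvec(down, v_b)
    flat = [x + b for x, b in zip(matvec(up, code), bias)]
    d_model = len(flat) // n_tokens
    return [flat[i * d_model:(i + 1) * d_model] for i in range(n_tokens)]


def frozen_llm(embeddings):
    """Hypothetical stand-in for the frozen model: mean-pools the rows.
    It has no trainable parameters, mirroring the frozen backbone."""
    return [sum(col) / len(col) for col in zip(*embeddings)]


def text_semantic_vector(v_b, text_embeddings, down, up, bias, n_tokens):
    prompt = continuous_prompt(v_b, down, up, bias, n_tokens)
    joint = prompt + text_embeddings   # prefix concat along the sequence axis
    return frozen_llm(joint), joint


vec, joint = text_semantic_vector(
    [1.0, 2.0], [[1.0, 1.0]],
    down=[[1.0, 1.0]], up=[[1.0], [0.0], [0.0], [1.0]],
    bias=[0.0, 0.0, 0.0, 0.0], n_tokens=2)
print(joint, vec)
```

Only `down`, `up`, and `bias` would be trained in this arrangement; the prompt rows simply sit in front of the text embeddings so that the frozen model reads the behavioral context first.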

6. The joint diagnostic method for self-regulated learning and emotional motivation according to claim 5, characterized in that performing the deep cross-modal feature fusion that resolves the semantic conflict between the behavioral context feature vector and the text semantic feature vector to obtain the final joint feature vector specifically comprises: projecting the behavioral context feature vector $v_b$ onto the Poincaré ball model with the exponential mapping operator to obtain the cross-modal behavioral feature point $p_b = \exp_0^c(v_b) = \tanh(\sqrt{c}\,\|v_b\|)\,\frac{v_b}{\sqrt{c}\,\|v_b\|}$, which after manifold mapping falls within the hyperbolic space of the Poincaré ball, where $\exp_0^c(\cdot)$ is the mapping function that applies the exponential map to the behavioral context feature vector starting from the origin $0$ of a Poincaré ball whose fixed negative curvature has absolute value $c$, $c$ is the absolute value of the fixed negative curvature initially set for the current Poincaré ball manifold, and $\tanh$ is the hyperbolic tangent function; projecting the text semantic feature vector $v_t$ onto the Poincaré ball model with the same exponential mapping operator to obtain the cross-modal text feature point $p_t = \exp_0^c(v_t)$, which after manifold mapping falls within the hyperbolic space of the Poincaré ball; calculating the Möbius distance $d_c(p_b, p_t)$ between the cross-modal behavioral feature point $p_b$ and the cross-modal text feature point $p_t$ in hyperbolic space; when the Möbius distance is determined to be greater than the safety threshold, triggering the curvature adaptive adjustment mechanism to dynamically increase the negative curvature of the local space and obtain a new curvature parameter $c'$, where $\lambda$ is the learnable curvature-sensitivity penalty coefficient and the local space is the region of the Poincaré ball model occupied by or adjacent to the cross-modal behavioral feature point $p_b$ and the cross-modal text feature point $p_t$ to be fused; nonlinearly combining $p_b$ and $p_t$ by Möbius addition to obtain the fused hyperbolic feature point $p_f = p_b \oplus_c p_t = \frac{(1 + 2c\langle p_b, p_t\rangle + c\|p_t\|^2)\,p_b + (1 - c\|p_b\|^2)\,p_t}{1 + 2c\langle p_b, p_t\rangle + c^2\|p_b\|^2\|p_t\|^2}$, where $\oplus_c$ is Möbius addition, $\langle p_b, p_t\rangle$ is the Euclidean inner product of $p_b$ and $p_t$, and $\|\cdot\|$ is the Euclidean norm; and projecting the fused hyperbolic feature point $p_f$ back into Euclidean space through the logarithmic map to obtain the final joint feature vector $v_f = \log_0^c(p_f) = \operatorname{artanh}(\sqrt{c}\,\|p_f\|)\,\frac{p_f}{\sqrt{c}\,\|p_f\|}$, where $\operatorname{artanh}$ is the inverse hyperbolic tangent function.
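The hyperbolic operations named in claim 6 can be sketched with the standard Poincaré ball formulas. This is an illustrative sketch, not the claimed implementation: it assumes the usual textbook definitions of the exponential map at the origin, Möbius addition, the induced distance, and the logarithmic map, and it uses a single curvature value `c` throughout because the text does not say at which step an adjusted curvature takes effect. The function names are hypothetical.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def exp0(v, c):
    """Exponential map at the origin: Euclidean vector -> point in the ball."""
    n = norm(v)
    if n == 0.0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * x for x in v]


def log0(p, c):
    """Logarithmic map at the origin: point in the ball -> Euclidean vector."""
    n = norm(p)
    if n == 0.0:
        return list(p)
    scale = math.atanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * x for x in p]


def mobius_add(x, y, c):
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2
    a = 1.0 + 2.0 * c * xy + c * y2
    b = 1.0 - c * x2
    return [(a * xi + b * yi) / denom for xi, yi in zip(x, y)]


def distance(x, y, c):
    """Hyperbolic distance computed from the Möbius difference of two points."""
    diff = mobius_add([-xi for xi in x], y, c)
    return 2.0 / math.sqrt(c) * math.atanh(math.sqrt(c) * norm(diff))


def fuse(v_b, v_t, c):
    """Project both vectors, combine by Möbius addition, map back."""
    p_b, p_t = exp0(v_b, c), exp0(v_t, c)
    return log0(mobius_add(p_b, p_t, c), c), distance(p_b, p_t, c)


joint, conflict = fuse([0.3, -0.4], [0.1, 0.2], c=0.3)
print(joint, conflict)
```

The returned distance is the quantity a conflict threshold would be compared against, and the returned vector is the fused feature mapped back to Euclidean space.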

7. The joint diagnostic method for self-regulated learning and emotional motivation according to claim 6, characterized in that triggering the curvature adaptive adjustment mechanism to dynamically increase the negative curvature of the local space when the Möbius distance is determined to be greater than the safety threshold specifically comprises: when the Möbius distance is greater than 0.7, judging the conflict to be high and dynamically increasing the local negative curvature to 0.5; and when the Möbius distance is less than 0.3, judging the conflict to be low and maintaining the initial curvature of 0.3.
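The threshold rule of claim 7 reduces to a small selection function. The sketch below uses the four numeric values stated in the claim; the behavior for distances between 0.3 and 0.7 is not specified in the text, so keeping the initial curvature there is an assumption made only so that the function is total. The function name and parameter names are hypothetical.

```python
def select_curvature(mobius_distance, c_init=0.3, c_high=0.5,
                     high_threshold=0.7, low_threshold=0.3):
    """Return the absolute curvature value to use for the local space."""
    if mobius_distance > high_threshold:
        return c_high          # high conflict: increase the curvature
    if mobius_distance < low_threshold:
        return c_init          # low conflict: keep the initial curvature
    # ASSUMPTION: the intermediate band is not covered by the claim;
    # this sketch keeps the initial curvature there as well.
    return c_init


print(select_curvature(0.9), select_curvature(0.1), select_curvature(0.5))
```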

8. The joint diagnostic method for self-regulated learning and emotional motivation according to claim 6, characterized in that performing causal intervention on the joint feature vector using the pre-built confounding factor memory to obtain the pure features and, based on the pure features, performing the diagnostic calculations of SRL ability and emotional motivation simultaneously to generate the diagnostic result specifically comprises: performing causal intervention on the joint feature vector $v_f$ using the pre-built confounding factor memory, in which the prior distributions of all confounding variables in the confounding factor memory are integrated stratum by stratum to remove the part of the variance response of the joint feature vector that is caused by external environmental factors, obtaining the pure features $\hat{v}$ with the confounding variables removed, where $do(\cdot)$ denotes the intervention in the causal intervention processing; based on the pure features $\hat{v}$, performing the diagnostic calculations of SRL ability and emotional motivation simultaneously to obtain the causally intervened neural network diagnostic probability $P(Y_s \mid do(\hat{v}))$ and the backdoor-adjusted neural network diagnostic probability $P(Y_e \mid do(\hat{v}))$, approximated by $P(Y_s \mid do(\hat{v})) \approx \sum_{u=1}^{U} f_s(\mathrm{Concat}(\hat{v}, m_u))\,P(m_u)$ and $P(Y_e \mid do(\hat{v})) \approx \sum_{u=1}^{U} f_e(\mathrm{Concat}(\hat{v}, m_u))\,P(m_u)$, where $Y_s$ is the finally predicted true state of each dimension of self-regulated learning, $Y_e$ is the finally predicted true state of each dimension of emotional motivation, $U$ is the total number of confounder strata, $f_s$ and $f_e$ are both parameterized multilayer perceptron networks, $m_u$ is the $u$-th stratified confounding feature in the confounding factor memory, $\mathrm{Concat}$ concatenates $\hat{v}$ and $m_u$, and $P(m_u)$ is the prior probability distribution of the $u$-th stratified confounding feature in the global educational history data; and generating, from these two diagnostic probabilities, a diagnostic result that includes the learner's SRL state and emotional motivation state.
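The backdoor-style adjustment described in claim 8 amounts to averaging a predictor's output over confounder strata, weighted by each stratum's prior. The sketch below shows only that weighted sum. The predictor `toy_head` is a hypothetical two-class stand-in for the parameterized multilayer perceptron, and the stratum features and priors are made-up illustrative values, not data from the source.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_head(v):
    """Hypothetical stand-in for the diagnostic network: two-class softmax."""
    s = sum(v)
    return softmax([s, -s])


def backdoor_adjust(features, strata, priors, head):
    """Sum over strata u of head(concat(features, m_u)) * P(m_u)."""
    adjusted = None
    for m_u, p_u in zip(strata, priors):
        probs = head(features + m_u)       # list concat = feature splicing
        if adjusted is None:
            adjusted = [0.0] * len(probs)
        adjusted = [a + p_u * q for a, q in zip(adjusted, probs)]
    return adjusted


# Illustrative values only: two strata (e.g. easy task, hard task).
probs = backdoor_adjust([0.2], [[0.0], [1.0]], [0.5, 0.5], toy_head)
print(probs)
```

Because the priors sum to one and each stratum's output is a probability vector, the adjusted output is again a probability vector that no longer depends on which stratum the learner happened to be in.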

9. The joint diagnostic method for self-regulated learning and emotional motivation according to claim 1, characterized in that the joint diagnostic model is trained by the following steps: acquiring labels for the training data, wherein, for the SRL regression task, a prior rule mapping table is constructed from objective back-end business logs to automatically generate continuous SRL soft labels with probability distributions, and, for the sentiment classification task, manually labeled anchors are combined with a pre-trained model to automatically generate hard labels through pseudo-labeling; and designing a loss function for multi-task joint optimization, wherein, in each backpropagation pass, the current learning difficulty and noise level of each task are automatically assessed based on maximum likelihood estimation, and wherein, in the loss function, $\sigma_1$ is the observation noise parameter of the SRL regression task, adaptively learned by the joint diagnostic model during training, $\sigma_2$ is the observation noise parameter of the sentiment classification task, adaptively learned by the joint diagnostic model during training, $\mathcal{L}_{MSE}$ is the mean squared error loss of the SRL regression task, $\mathcal{L}_{CE}$ is the cross-entropy loss of the sentiment classification task, $\log \sigma_1$ is the regularization penalty term of the SRL regression task, $\log \sigma_2$ is the regularization penalty term of the sentiment classification task, and $\log$ is the natural logarithm function.
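A homoscedastic-uncertainty loss of the kind claim 9 describes can be sketched as below. The exact formula is not recoverable from the text, so this sketch assumes one widely used form: the regression loss weighted by 1/(2σ²), the classification loss weighted by 1/σ², plus a log σ penalty per task. The noise parameters are stored as log σ for numerical stability, which is also an assumption, and all names are hypothetical.

```python
import math


def task_weights(log_sigma_reg, log_sigma_cls):
    """Weights implied by the (assumed) uncertainty formulation."""
    w_reg = 0.5 * math.exp(-2.0 * log_sigma_reg)   # 1 / (2 * sigma_reg^2)
    w_cls = math.exp(-2.0 * log_sigma_cls)         # 1 / sigma_cls^2
    return w_reg, w_cls


def joint_loss(mse, cross_entropy, log_sigma_reg, log_sigma_cls):
    """Weighted task losses plus the log-sigma regularization penalties."""
    w_reg, w_cls = task_weights(log_sigma_reg, log_sigma_cls)
    return (w_reg * mse + w_cls * cross_entropy
            + log_sigma_reg + log_sigma_cls)


# A noisier regression task (larger sigma) receives a smaller weight,
# while the log-sigma term keeps sigma from growing without bound.
print(task_weights(0.0, 0.0), task_weights(1.0, 0.0))
print(joint_loss(2.0, 1.0, 0.0, 0.0))
```

In training, the two log σ values would be learnable parameters updated by the same optimizer as the network weights, so the balance between the tasks shifts automatically instead of being tuned by hand.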