Intelligent learning path planning method for vocabulary based on reinforcement learning
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: HEBEI XIONGAN XIONGXIN TECHNOLOGY CO LTD
- Filing Date: 2026-02-04
- Publication Date: 2026-06-12
AI Technical Summary
Existing intelligent vocabulary learning path planning methods lack a deep integration of cognitive state perception and reinforcement learning, making it impossible to accurately quantify knowledge transfer gaps across grades, resulting in low learning efficiency and an inability to perform targeted optimization for the transfer-sensitive period.
The method collects learners' multi-source interactive behavior data in real time, constructs a vocabulary knowledge graph, and generates a comprehensive cognitive state vector. It then sets a hierarchical action space and a combined reward function, initializes a meta-policy network, performs online policy optimization and path generation, evaluates candidate actions in a safe simulation environment to select the optimal one, and adaptively adjusts the reward function weights.
It achieves a deep integration of cognitive and reinforcement learning, accurately quantifies knowledge transfer gaps across grades, dynamically balances cognitive load and learning efficiency, significantly improves personalized learning outcomes, ensures the learning process is within the optimal cognitive range, and enhances decision reliability.
Figure CN122199206A_ABST
Description
Technical Field
[0001] This invention belongs to the field of reinforcement learning technology and specifically relates to an intelligent vocabulary learning path planning method based on reinforcement learning.
Background Technology
[0002] With the deep integration of artificial intelligence and education, personalized learning has become an important direction for modern development; vocabulary, as a fundamental element in building language ability, directly affects reading comprehension, writing expression, and comprehensive language application skills.
[0003] However, existing intelligent vocabulary learning path planning methods still have shortcomings. They lack a deep integration of cognitive state perception and reinforcement learning and cannot accurately quantify knowledge transfer gaps across grades, so path planning neglects key cognitive gaps. Because they rely solely on static learning indicators, they fail to dynamically capture vocabulary knowledge gaps during grade transfer; the learning path therefore exhibits frequent cognitive jumps when crossing grades, learning efficiency is low, and the transfer-sensitive period cannot be optimized. In the cognitive state representation stage, only basic static indicators are used, and vocabulary activation intensity is not calculated dynamically from real-time interactive behavior, so the perception of cross-grade knowledge gaps is not strengthened. In the strategy initialization stage, training uses random weights or simple historical data, without stratified sampling of historical data or prioritized use of high-value data, which leads to an imbalanced distribution of training samples. Therefore, a reinforcement learning-based intelligent vocabulary learning path planning method is proposed.
Summary of the Invention
[0004] The purpose of this invention is to provide a vocabulary intelligent learning path planning method based on reinforcement learning to solve the problems mentioned in the background art.
[0005] To achieve the above objectives, the present invention provides the following technical solution: a vocabulary intelligent learning path planning method based on reinforcement learning, comprising the following steps:
[0006] S1. Collect learners' multi-source interactive behavior data in real time and perform standardized preprocessing;
[0007] S2. Based on the time-series data and the preset knowledge graph, perform dynamic cognitive state representation and generate a comprehensive cognitive state vector;
[0008] S3. Preset a hierarchical action space and a combined reward function to configure decision-making;
[0009] S4. Based on the comprehensive cognitive state vector and decision, pre-train on historical data to initialize the meta-policy network;
[0010] S5. Based on the initial strategy, combined with real-time user status and rewards, perform online strategy optimization and personal path generation;
[0011] S6. Before making critical decisions, simulate multiple future paths in a safe simulation environment and evaluate them through a value network to select the optimal immediate action;
[0012] S7. Based on historical and planned paths, visualize the data and adaptively adjust the reward function weights based on user feedback.
[0013] Preferably, in step S1, during the interaction process of the learning interface, the user's response to vocabulary practice, reaction time to complete a single question, and practice type are captured in real time. The user's operation trajectory, learning path behavior, and temporal structure of the learning session are tracked in real time. The raw data collected in real time is preprocessed in real time, and the preprocessed data is aggregated into a continuous sequence according to the time window. The data is divided into segments of equal length based on a single user learning session. Each segment contains a multi-dimensional data vector within the time period, generating a standardized temporal data segment sequence.
[0014] Preferably, the vocabulary knowledge graph is constructed as follows. A basic vocabulary list is extracted from authoritative educational corpora and standard dictionaries, such as the CEFR level vocabulary list and English textbook corpora. Duplicate words are removed and word forms are standardized to form a basic vocabulary database. Based on the CEFR level standard, each word is assigned a corresponding difficulty level. Through matching with educational standard documents, the frequency of word occurrence is statistically analyzed based on a large corpus, and a standardized word frequency index is calculated to reflect how often the word is used in real-world language environments. Linguistic tools are used to analyze the root, prefix, and suffix structures of words, assess the difficulty of word form changes, and generate quantitative complexity indicators. The occurrence records of words in each grade are extracted from multiple versions of textbooks, and the distribution frequency of each word across grades is statistically analyzed to form cross-grade transfer data. Semantic associations between words are established based on a thesaurus and a semantic relation database to construct a semantic network topology. The connection strength of words in the semantic network is calculated using semantic analysis tools to reflect the core position of words in the knowledge system. The associations between word nodes are defined based on the semantic network to construct semantic edges. Words are used as nodes and semantic relations as edges to form a directed graph structure, where node attributes include learning feature vectors and edge attributes include relation strength. CEFR level, word frequency index, word formation complexity, cross-grade frequency of occurrence, and semantic network centrality are integrated into a multi-dimensional learning feature vector and injected into each word node.
[0015] Preferably, in step S2, a pre-built vocabulary knowledge graph is loaded, a learning feature vector is injected into each vocabulary node, vocabulary operation behaviors in time-series data segments are dynamically mapped to corresponding vocabulary nodes in the knowledge graph, behavior events and timestamps are recorded, a real-time behavior-vocabulary association index is established, and the activation intensity of each vocabulary node is calculated based on the mapping results. This is implemented as follows:
[0016] ,
[0017] In the formula, the symbols denote, in order: the dynamic activation intensity of word i at time t; the set of behavioral events for word i; the cognitive accuracy of behavior k; the reaction time of behavior k; the reaction time offset constant, in the same unit as the reaction time (for example, seconds); the time decay coefficient; the base activation offset; the transfer enhancement coefficient; and the indicator function of the transfer-sensitive period.
[0018] Preferably, in step S2, four core cognitive indicators—mastery, memory stability, cognitive load, and transfer readiness—are simultaneously calculated from the activation intensity. Based on the semantic relationships of the knowledge graph, a two-hop semantic diffusion is performed on the activation intensity. The diffusion weight decays according to edge strength and cognitive distance. The multi-dimensional cognitive indicators are aggregated into a basic state vector. The time sequence is processed through double exponential smoothing to generate a standardized comprehensive cognitive state vector, which is implemented as follows:
[0019] ,
[0020] In the formula, the symbols denote, in order: the comprehensive cognitive state vector, which is normalized by a norm constraint; the dynamically weighted mastery, which is defined in terms of the historical accuracy of word i; the historical accuracy of word i, which is defined in terms of the number of times word i is used correctly and the total number of times word i is used; the double exponential smoothing function; the level smoothing coefficient; and the trend smoothing coefficient.
[0021] Preferably, in S3, a hierarchical architecture of macro-action layer and micro-action layer is preset; the macro layer includes four core learning modes: word memorization, fun review, vocabulary test, and grade change; the micro layer generates an operable subset for each macro action based on dynamic mapping of comprehensive cognitive state vector; when a grade transition sensitive period is detected, the vocabulary test action is forcibly frozen and fun review is inserted; when the cognitive load index exceeds a preset threshold, the word memorization action is automatically downgraded to a low-difficulty mode and restorative fun review is inserted; based on the comprehensive cognitive state vector, a four-element reward component is constructed, including short-term mastery reward, long-term retention reward, cognitive load reward, and grade adaptation reward;
[0022] The reward priority weights are dynamically adjusted based on the progress of the learning session, and the basic components and dynamic weights are combined to construct the final reward function. Based on historical data statistics, dynamic safety constraints are preset for micro-action parameters. Decision-making uses a dual-channel policy network: the macro channel outputs the probability distribution over learning modes, and the micro channel generates specific parameter configurations.
[0023] Preferably, in step S4, complete learning session data is extracted from the system database to construct a state-action-reward triplet sequence, where the state is the comprehensive cognitive state vector, the action is a decision in the hierarchical action space, and the reward is the value of the combined reward function. Historical data is sampled by strata defined by the learning stage factor, and high-value data from the transfer-sensitive period and from cognitive load peaks is selected first to construct a balanced training sample distribution.
[0024] Preferably, in step S4, the dual-channel neural network serves as the meta-policy network. The input layer receives the comprehensive cognitive state vector, the hidden layer employs a multi-layer structure, and the output layer generates the macro-action probability distribution and the micro-action parameter configuration. The network weights are randomly initialized from a Gaussian distribution. A preset objective function based on the policy gradient is used:
[0025] ,
[0026] In the formula, the symbols denote, in order: the policy gradient loss function; the trainable parameter vector of the policy network; the expectation operator under the policy, where s represents the state variable and z represents the action variable; and the estimated state-action value function. During training, learning constraint weights are introduced to dynamically adjust the proportion of the cognitive load penalty term in the loss function. A preset validation set evaluation mechanism is used: when the policy's performance metrics on the validation set reach a preset target threshold and the cognitive load volatility remains within a reasonable range, pre-training is considered complete, and the converged network weights are saved as the initial meta-policy.
[0027] Preferably, in step S5, during the user learning session, a comprehensive cognitive state vector and combined rewards are continuously acquired to construct a continuous time-series data stream based on the initialized meta-policy. By using incremental reinforcement learning to update the policy network parameters online, the optimized policy is applied to the real-time state, and a two-layer decision sequence is output: the macro layer specifies the learning mode, and the micro layer generates specific execution parameters, forming a continuous sequence of personal learning paths.
[0028] Preferably, in step S6, a differentiable virtual learning environment is constructed based on the comprehensive cognitive state vector and the reward function; the virtual learning environment accurately simulates the state transition dynamics and the reward generation mechanism. Using the policy network optimized in S5, Monte Carlo path generation is performed from the current comprehensive cognitive state vector: K future paths are generated through random sampling, each containing an action sequence and a cumulative reward sequence. The pre-trained value network is invoked to evaluate the value of the initial action of each path, and the total value of each path is calculated using a discount factor. During evaluation, the constraints of each path are checked dynamically; the path with the highest value is selected from the valid paths, and its first action is extracted as the immediate decision. If all paths are invalid, a safety rollback mechanism is triggered and the default safety action is selected.
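As a non-limiting illustration of the path simulation and rollback logic in S6, the following Python sketch samples K paths, discards paths that violate a constraint, and returns the first action of the best remaining path. The function name `plan_first_action`, the callables `policy`, `step`, `value`, and `is_valid`, the default parameter values, and the choice to add the value estimate of the final simulated state to the discounted rewards are all assumptions made for illustration; the text above does not specify them.

```python
import random

def plan_first_action(state, policy, step, value, is_valid,
                      k=8, horizon=3, gamma=0.9,
                      safe_action="fun_review", seed=0):
    """Return the first action of the best valid simulated path.

    All callables are hypothetical stand-ins (assumptions):
      policy(state, rng) -> action              # optimized policy network
      step(state, action) -> (state, reward)    # virtual learning environment
      value(state) -> float                     # pre-trained value network
      is_valid(state) -> bool                   # dynamic constraint check
    """
    rng = random.Random(seed)
    best_action, best_value = safe_action, float("-inf")
    for _ in range(k):
        s, total, first, valid = state, 0.0, None, True
        for t in range(horizon):
            a = policy(s, rng)
            if first is None:
                first = a
            s, r = step(s, a)
            if not is_valid(s):
                valid = False  # constraint violated: drop the whole path
                break
            total += (gamma ** t) * r  # discounted reward
        if not valid:
            continue
        # Assumption: bootstrap the tail of the path with the value network.
        total += (gamma ** horizon) * value(s)
        if total > best_value:
            best_action, best_value = first, total
    # If every path was invalid, best_action is still the safe default.
    return best_action
```

If no sampled path satisfies the constraints, the function returns `safe_action`, which plays the role of the default safety action.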
[0029] Preferably, in step S7, the personal historical learning path and the planned future path are displayed side by side, and the changes in cognitive state are presented in the form of a timeline. Key action nodes are marked with heat maps, and the execution effect of the historical path is compared with the expected performance of the planned path in a visual way. User feedback content is analyzed in real time, high-frequency keywords are identified, and specific optimization links are located in combination with cognitive state indicators. Based on the analysis results, the reward function weight is automatically adjusted, and the reward weight distribution is updated in real time on the interface.
[0030] Specifically, the reward function weights are automatically adjusted based on real-time user feedback on the learning path, such as difficulty being too high, content being repetitive, or easy to forget. High-frequency keywords are identified through NLP. If the percentage of negative feedback / frequency of high-frequency keywords exceeds a threshold, the weight of the corresponding learning stage is adjusted. If the feedback indicates that the difficulty is too high and appears frequently, the reward weight related to the difficulty challenge is reduced. If the feedback indicates a lack of relevance and appears frequently, the reward weight related to personalized recommendations is increased.
[0031] Real-time monitoring of cognitive state indicators such as vocabulary mastery rate, forgetting rate, and learning efficiency. If these indicators deviate from preset targets by more than a threshold, weight adjustments are triggered. Historical path execution performance is compared with the expected performance of the planned path over a timeline. If the difference exceeds a threshold, adjustments are triggered to related weights such as path time allocation and action efficiency to narrow the gap between execution and planning. The reward function weights are automatically adjusted as follows:
[0032] ,
[0033] In the formula, the symbols denote, in order: the new weight; the old weight; the learning rate; the error; and the sensitivity.
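For illustration only, one simple update that uses the quantities listed above (old weight, learning rate, error, sensitivity) is an additive correction followed by renormalization. The additive form, the clipping at zero, the renormalization, and the name `adjust_weights` are assumptions, since the formula itself is not reproduced in this text.

```python
def adjust_weights(weights, errors, sensitivity, lr=0.1):
    """Sketch of a reward-weight update (assumed form, not the patented formula).

    weights, errors, sensitivity: dicts keyed by reward component name.
    new = old + lr * sensitivity * error, clipped at 0 and renormalized to sum to 1.
    """
    raw = {
        name: max(0.0, w + lr * sensitivity.get(name, 0.0) * errors.get(name, 0.0))
        for name, w in weights.items()
    }
    total = sum(raw.values())
    if total <= 0.0:
        return dict(weights)  # degenerate case: keep the old weights
    return {name: v / total for name, v in raw.items()}
```

For example, feedback that the difficulty is too high could be encoded as a negative error on the difficulty-related component, which lowers that weight relative to the others.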
[0034] The system analyzes user feedback in real time to identify high-frequency keywords; it then combines these with cognitive state indicators to pinpoint specific learning stages that need optimization; based on the analysis results, the system automatically adjusts the reward function weights to optimize the learning path; and it updates the interface in real time, allowing users to visually see the adjusted reward weight distribution.
[0035] Compared with the prior art, the beneficial effects of the present invention are:
[0036] 1. This invention achieves a deep integration of cognition and reinforcement learning by constructing a complete technical closed loop of cognitive state perception, dynamic decision-making, security verification, and adaptive optimization. It accurately quantifies knowledge transfer gaps across grades, dynamically balances cognitive load and learning efficiency, and significantly improves personalized learning outcomes.
[0037] 2. This invention achieves a refined representation of cognitive state by injecting feature vectors into a knowledge graph and dynamically calculating the activation intensity of words. It not only considers static attributes such as CEFR level and word frequency of words, but also combines behavioral events to calculate activation intensity in real time, which strengthens the perception of cross-grade knowledge gaps during the sensitive period of grade transfer. The four cognitive indicators generated simultaneously are processed by two-hop semantic diffusion and double exponential smoothing to form a standardized comprehensive cognitive state vector, enabling the system to accurately capture the learner's true cognitive state.
[0038] 3. This invention constructs a high-quality policy starting point by pre-training the initial meta-policy network on historical data. It performs stratified sampling of historical data based on learning stage factors, prioritizing high-value data such as transfer-sensitive periods and load peaks, to construct a balanced training set. The dual-channel neural network design matches the dimensions of the action space, and the constraint weights dynamically balance the training process to ensure that the policy meets the target. The validation mechanism evaluates policy performance and cognitive load fluctuation and saves the initial policy that meets the standard, which improves the practical value of the system.
[0039] 4. This invention achieves dynamic adaptation of learning paths through online strategy optimization and personal path generation; it continuously acquires real-time cognitive status and rewards, performs incremental updates based on the initialization strategy, and dynamically integrates constraint weights; the two-layer decision output forms a personalized learning path, and verifies cognitive load and transfer status in real time during the generation process; enabling the system to adaptively adjust according to the learner's real-time status, ensuring that the learning process is always in the optimal cognitive range.
[0040] 5. This invention simulates multiple future paths in a virtual environment through safe simulation and optimal action selection, evaluates their value and dynamically verifies constraints, and selects only the safe and effective optimal action; if all paths are invalid, a safe rollback mechanism is triggered; this avoids the trial-and-error costs of real-time decision-making, and greatly improves the reliability of decision-making while ensuring the learning effect of the system.
Attached Figure Description
[0041] Figure 1 is operational flow diagram 1 of the reinforcement learning-based intelligent vocabulary learning path planning method of the present invention;
[0042] Figure 2 is operational flow diagram 2 of the reinforcement learning-based intelligent vocabulary learning path planning method of the present invention;
[0043] Figure 3 is operational flow diagram 3 of the reinforcement learning-based intelligent vocabulary learning path planning method of the present invention;
[0044] Figure 4 is operational flow diagram 4 of the reinforcement learning-based intelligent vocabulary learning path planning method of the present invention.
Detailed Implementation
[0045] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0046] Example
[0047] Referring to Figures 1-4, the present invention provides a technical solution comprising the following steps:
[0048] S1. Collect learners' multi-source interactive behavior data in real time and perform standardized preprocessing;
[0049] S2. Based on the time-series data and the preset knowledge graph, perform dynamic cognitive state representation and generate a comprehensive cognitive state vector;
[0050] S3. Preset a hierarchical action space and a combined reward function to configure decision-making;
[0051] S4. Based on the comprehensive cognitive state vector and decision, pre-train on historical data to initialize the meta-policy network;
[0052] S5. Based on the initial strategy, combined with real-time user status and rewards, perform online strategy optimization and personal path generation;
[0053] S6. Before making critical decisions, simulate multiple future paths in a safe simulation environment and evaluate them through a value network to select the optimal immediate action;
[0054] S7. Based on historical and planned paths, visualize the data and adaptively adjust the reward function weights based on user feedback.
[0055] In this embodiment, in S1, during interaction with the learning interface, the user's responses to vocabulary exercises, the reaction time for each question, and the exercise type are captured in real time. The following are also tracked in real time: the user's operation trajectory, such as click location, page dwell time, and review interval; learning path behavior, such as skipping questions, requesting hints, and repeating exercises; and the temporal structure of the learning session.
[0056] Specifically, real-time data preprocessing is performed on the raw data collected in real time, including removing invalid records, filling missing values, and detecting and correcting data drift. The preprocessed data is aggregated into a continuous sequence according to time windows. The data is divided into segments of equal length based on a single user learning session. Each segment contains a multi-dimensional data vector within the time period, forming a structured time-series input and generating a standardized time-series data segment sequence.
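The preprocessing and segmentation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the `Event` record, the three-element segment vector (accuracy, mean reaction time, event count), and mean-value filling of missing reaction times are hypothetical choices, and drift detection is omitted.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Event:
    t: float                      # seconds since the session started
    correct: Optional[bool]       # None marks a missing response
    reaction_s: Optional[float]   # None marks a missing reaction time

def segment_session(events: List[Event], window_s: float,
                    n_segments: int) -> List[List[float]]:
    """Split one session into n_segments equal-length time windows.

    Each segment is [accuracy, mean reaction time, event count]
    (an assumed feature layout).
    """
    # Remove invalid records: negative timestamps or missing responses.
    valid = [e for e in events if e.t >= 0 and e.correct is not None]
    # Fill missing reaction times with the mean of the observed ones.
    observed = [e.reaction_s for e in valid if e.reaction_s is not None]
    fill = sum(observed) / len(observed) if observed else 0.0
    segments = []
    for i in range(n_segments):
        lo, hi = i * window_s, (i + 1) * window_s
        chunk = [e for e in valid if lo <= e.t < hi]
        if chunk:
            acc = sum(e.correct for e in chunk) / len(chunk)
            rt = sum(e.reaction_s if e.reaction_s is not None else fill
                     for e in chunk) / len(chunk)
        else:
            acc, rt = 0.0, 0.0  # empty window
        segments.append([acc, rt, float(len(chunk))])
    return segments
```

Every session yields the same number of segments, so downstream components receive a fixed-shape time-series input.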
[0057] Preferably, the vocabulary knowledge graph is constructed by extracting a basic vocabulary list from authoritative educational corpora and standard dictionaries, such as the CEFR graded vocabulary list and English textbook corpora, cleaning up duplicate words and unifying word forms to form a basic vocabulary database, and assigning a corresponding difficulty level to each word according to the CEFR graded standard, such as A1 and B2. Through matching with educational standard documents, the frequency of word occurrence is statistically analyzed based on a large corpus, and a standardized word frequency index is calculated to reflect the frequency of word use in real language environments.
[0058] By analyzing the root, prefix, and suffix structures of words using linguistic tools, the difficulty of word form changes is assessed, and quantitative complexity indicators are generated. The occurrence records of words in each grade are extracted from multiple versions of textbooks, and the distribution frequency of each word in different grades is statistically analyzed to form cross-grade transfer data. Based on a thesaurus and a semantic relation database, semantic associations between words are established, such as synonyms, antonyms, and hyponyms, to construct a semantic network topology.
[0059] The connection strength of words in the semantic network is calculated using semantic analysis tools to reflect the core position of words in the knowledge system. Based on the semantic network, the associations between word nodes are defined to construct semantic edges. Words are used as nodes and semantic relations as edges to form a directed graph structure, in which node attributes include learning feature vectors and edge attributes include relation strength. CEFR level, word frequency index, word formation complexity, cross-grade frequency, and semantic network centrality are integrated into a multi-dimensional learning feature vector and injected into each word node.
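A minimal Python sketch of such a graph structure is given below for illustration. The class names `WordNode` and `VocabGraph`, the ordering of the feature vector, and the use of normalized weighted degree as the centrality measure are assumptions; the description does not state which centrality measure is used.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class WordNode:
    word: str
    # Assumed layout: [cefr_level, freq_index, morph_complexity,
    #                  cross_grade_freq, centrality]
    features: List[float]

@dataclass
class VocabGraph:
    nodes: Dict[str, WordNode] = field(default_factory=dict)
    # Directed edges: source word -> list of (target, relation, strength)
    edges: Dict[str, List[Tuple[str, str, float]]] = field(default_factory=dict)

    def add_word(self, word, cefr, freq, morph, cross_grade):
        self.nodes[word] = WordNode(word, [cefr, freq, morph, cross_grade, 0.0])
        self.edges.setdefault(word, [])

    def add_relation(self, src, dst, relation, strength):
        self.edges[src].append((dst, relation, strength))

    def update_centrality(self):
        # Assumption: weighted degree (incoming plus outgoing strength),
        # scaled to [0, 1], stands in for semantic network centrality.
        total = {w: 0.0 for w in self.nodes}
        for src, outs in self.edges.items():
            for dst, _, s in outs:
                total[src] += s
                total[dst] += s
        top = max(total.values(), default=0.0) or 1.0
        for w, node in self.nodes.items():
            node.features[4] = total[w] / top
```

Calling `update_centrality` after all relations are added writes the centrality value into the last slot of each node's feature vector.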
[0060] In this embodiment, in step S2, a pre-built vocabulary knowledge graph is loaded, and a learning feature vector is injected into each vocabulary node, including CEFR level, word frequency index, word formation complexity, cross-grade frequency of occurrence, and semantic network centrality, forming a structured semantic foundation. The vocabulary operation behavior in the time-series data segment is dynamically mapped to the corresponding vocabulary node in the knowledge graph, the behavior event and timestamp are recorded, and a real-time association index of behavior-vocabulary is established.
[0061] Specifically, based on the mapping results, the activation strength of each word node is calculated, as follows:
[0062] ,
[0063] In the formula, the symbols denote, in order: the dynamic activation intensity of word i at time t; the set of behavioral events for word i; the cognitive accuracy of behavior k; the reaction time of behavior k; the reaction time offset constant, in the same unit as the reaction time (for example, seconds); the time decay coefficient, which is expressed in terms of the lexical half-life; the base activation offset; the transfer enhancement coefficient, which is determined by the initial transfer enhancement intensity, the decay rate of the transfer effect, and the baseline of the transfer effect; and the indicator function of the transfer-sensitive period, which is determined by the duration threshold of the transfer-sensitive period and the critical time point of the transfer switch. The duration threshold is a length of time; extending it to the left (times earlier than the switch point) and to the right (times later than the switch point) of the switch point gives the time interval of the transfer-sensitive period. The critical time point is the actual time of the user's transfer event, which is triggered directly by user operation: when the system detects a grade change, for example when the user selects a new grade on the learning interface, the current system time is recorded as its value. The final symbol denotes the grade transfer difficulty coefficient.
[0064] Specifically, the system identifies transfer-sensitive period data triggered by grade changes and applies the grade transfer difficulty coefficient to weight and enhance the activation intensity of vocabulary nodes in adjacent grades, strengthening the perception of knowledge gaps across grades.
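To make the roles of the listed quantities concrete, the following Python sketch shows one plausible way they could combine. The functional form is an assumption made for illustration and is not the formula of this application: each event contributes accuracy divided by offset reaction time, decayed exponentially with a rate derived from the half-life, and the result is multiplied by a boost inside the transfer-sensitive window. All parameter names are hypothetical.

```python
import math

def activation(events, t, half_life_s, rt_offset_s, base,
               switch_t=None, window_s=0.0,
               boost0=0.0, boost_decay=0.0, boost_base=0.0,
               difficulty=1.0):
    """Illustrative activation intensity of one word at time t (assumed form).

    events: list of (event_time, accuracy, reaction_time) for the word.
    """
    lam = math.log(2) / half_life_s  # decay coefficient from the half-life
    total = base                     # base activation offset
    for ev_t, accuracy, rt in events:
        # Faster, more accurate, and more recent responses contribute more.
        total += accuracy / (rt + rt_offset_s) * math.exp(-lam * (t - ev_t))
    # Indicator function: 1 inside [switch_t - window_s, switch_t + window_s].
    in_window = switch_t is not None and abs(t - switch_t) <= window_s
    if in_window:
        boost = boost_base + boost0 * math.exp(-boost_decay * abs(t - switch_t))
        total *= 1.0 + difficulty * boost  # weighted by transfer difficulty
    return total
```

Outside the sensitive window the boost term has no effect, so the function reduces to the decayed sum of behavioral contributions plus the base offset.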
[0065] In this embodiment, in step S2, four core cognitive indicators are simultaneously calculated from the activation intensity: mastery, memory stability, cognitive load, and transfer readiness.
[0066] Specifically, based on the semantic relationships of the knowledge graph, the activation intensity is subjected to two-hop semantic diffusion, the diffusion weight is decayed according to the edge strength and cognitive distance, and multi-dimensional cognitive indicators are aggregated as the base state vector.
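The two-hop diffusion can be sketched as below. This is an illustrative simplification: a fixed per-hop factor `hop_decay` stands in for the decay by cognitive distance, edge strengths are assumed to lie in [0, 1], and the function name and argument layout are hypothetical.

```python
def two_hop_diffusion(activation, edges, hop_decay=0.5):
    """Spread activation to neighbors at most two hops away.

    activation: dict word -> activation intensity
    edges: dict source word -> list of (target word, edge strength)
    """
    spread = dict(activation)  # every word keeps its own activation
    for src, a in activation.items():
        for mid, s1 in edges.get(src, []):
            # First hop: attenuated by edge strength and one hop of distance.
            spread[mid] = spread.get(mid, 0.0) + a * s1 * hop_decay
            for dst, s2 in edges.get(mid, []):
                if dst != src:  # do not diffuse back to the source
                    # Second hop: both edge strengths, two hops of distance.
                    spread[dst] = spread.get(dst, 0.0) + a * s1 * s2 * hop_decay ** 2
    return spread
```

With `hop_decay` below 1, a word two hops away always receives less than the intermediate word on the same path, which matches the stated decay with distance.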
[0067] By performing bi-exponential smoothing on the time series, a standardized comprehensive cognitive state vector is generated, which is implemented as follows:
[0068] ,
[0069] In the formula, the symbols denote, in order: the comprehensive cognitive state vector, which is normalized by a norm constraint; the minimum stability norm threshold; the maximum safety range threshold; the dynamically weighted mastery, which is defined in terms of the historical accuracy of word i; the historical accuracy of word i, which is defined in terms of the number of times word i is used correctly and the total number of times word i is used; the double exponential smoothing function; the level smoothing coefficient; and the trend smoothing coefficient.
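For reference, double exponential smoothing with a level coefficient and a trend coefficient is commonly implemented as Holt's linear method. The sketch below applies that standard recurrence to one indicator series and then rescales a vector so that its norm lies between a minimum and a maximum threshold. Treating the norm constraint as this kind of clamping, and the names `double_exp_smooth` and `clamp_norm`, are assumptions.

```python
import math

def double_exp_smooth(series, alpha, beta):
    """Holt's linear (double exponential) smoothing; returns the level series."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level, trend = series[0], series[1] - series[0]
    out = [level]
    for x in series[1:]:
        prev = level
        level = alpha * x + (1 - alpha) * (level + trend)    # level update
        trend = beta * (level - prev) + (1 - beta) * trend   # trend update
        out.append(level)
    return out

def clamp_norm(vec, min_norm, max_norm):
    """Rescale vec so its Euclidean norm lies in [min_norm, max_norm] (assumed reading)."""
    n = math.sqrt(sum(v * v for v in vec))
    if n == 0.0:
        return list(vec)  # a zero vector cannot be rescaled
    target = min(max(n, min_norm), max_norm)
    return [v * target / n for v in vec]
```

On a perfectly linear series the smoothed level reproduces the input, which is a quick sanity check for the recurrence.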
[0070] In this embodiment, S3 pre-defines a hierarchical architecture of macro-action layer and micro-action layer; the macro layer includes four core learning modes: word memorization, fun review, vocabulary test, and grade change; the micro layer generates an operable subset for each macro action based on dynamic mapping of comprehensive cognitive state vector.
[0071] Specifically, when a sensitive period for grade transition is detected, the vocabulary test is forcibly frozen and fun review is inserted; when the cognitive load index exceeds a preset threshold, the word memorization action is automatically downgraded to a low-difficulty mode and restorative fun review is inserted.
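The two override rules can be illustrated with a short Python sketch that post-processes the macro-action probabilities. How probability mass is redistributed (all test mass moved to fun review, half of the memorization mass moved when the load is high) and the names used are assumptions chosen only to show the mechanism.

```python
MODES = ["word_memorization", "fun_review", "vocabulary_test", "grade_change"]

def apply_safety_rules(mode_probs, in_sensitive_period,
                       cognitive_load, load_threshold):
    """Return (adjusted probabilities, difficulty mode for memorization)."""
    probs = dict(mode_probs)
    difficulty = "normal"
    if in_sensitive_period:
        # Freeze the vocabulary test and insert fun review instead.
        probs["fun_review"] += probs["vocabulary_test"]
        probs["vocabulary_test"] = 0.0
    if cognitive_load > load_threshold:
        # Downgrade memorization to low difficulty and insert restorative review.
        difficulty = "low"
        shift = 0.5 * probs["word_memorization"]  # assumed share
        probs["word_memorization"] -= shift
        probs["fun_review"] += shift
    return probs, difficulty
```

Because mass is only moved between modes, the adjusted values still sum to the same total as the input distribution.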
[0072] Specifically, based on the comprehensive cognitive state vector, a four-element reward component is constructed, including:
[0073] Short-term mastery reward: a positive incentive based on the rate of change in mastery;
[0074] Long-term retention reward: the long-term value based on the predicted memory retention rate after a preset number of hours;
[0075] Cognitive load reward: negative penalty based on load indicators;
[0076] Grade-level adaptation reward: During the transfer-sensitive period, a special incentive is provided based on the growth rate of vocabulary activation intensity in the new grade.
[0077] Specifically, the reward priority weights are dynamically adjusted based on the progress of the learning session, and the basic components and dynamic weights are combined to construct the final reward function. Based on historical data statistics, dynamic safety constraints are preset for micro-action parameters. Decision-making uses a dual-channel policy network: the macro channel outputs the probability distribution over learning modes, and the micro channel generates specific parameter configurations.
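As a hedged illustration of combining the four reward components with progress-dependent weights, consider the Python sketch below. The specific weight values, the linear shift from short-term to long-term emphasis as the session progresses, and the function names are assumptions; only the four components and the idea of dynamic weights come from the description.

```python
def session_weights(progress):
    """Weights for the four components as a function of session progress in [0, 1].

    Assumption: emphasis moves linearly from short-term mastery to
    long-term retention; the weights always sum to 1.
    """
    p = min(max(progress, 0.0), 1.0)
    return {"short": 0.5 - 0.3 * p, "long": 0.2 + 0.3 * p,
            "load": 0.2, "grade": 0.1}

def combined_reward(d_mastery, predicted_retention, load,
                    activation_growth, in_sensitive_period, progress):
    w = session_weights(progress)
    # The grade adaptation reward applies only in the transfer-sensitive period.
    grade = activation_growth if in_sensitive_period else 0.0
    return (w["short"] * d_mastery            # short-term mastery reward
            + w["long"] * predicted_retention  # long-term retention reward
            - w["load"] * load                 # cognitive load penalty
            + w["grade"] * grade)              # grade adaptation reward
```

The cognitive load term enters with a negative sign, so a higher load lowers the reward for an otherwise identical step.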
[0078] In this embodiment, in step S4, complete learning session data is extracted from the system database, and a state-action-reward triplet sequence is constructed. The state is the comprehensive cognitive state vector, the action is a decision in the hierarchical action space, and the reward is the value of the combined reward function. Historical data is sampled by strata defined by the learning stage factor, and high-value data from the transfer-sensitive period and from cognitive load peaks is selected first to construct a balanced training sample distribution.
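A simple way to realize stratified sampling with priority for high-value data is sketched below. The dictionary keys `stage`, `sensitive`, and `load_peak`, the equal quota per stratum, and the function name are hypothetical; the description states only that strata follow the learning stage factor and that high-value data is selected first.

```python
import random

def stratified_sample(transitions, per_stratum, seed=0):
    """Draw up to per_stratum transitions from each learning stage.

    Within each stage, transitions flagged as transfer-sensitive or as
    cognitive load peaks are taken before ordinary ones.
    """
    rng = random.Random(seed)
    strata = {}
    for tr in transitions:
        strata.setdefault(tr["stage"], []).append(tr)
    sample = []
    for items in strata.values():
        high = [x for x in items if x["sensitive"] or x["load_peak"]]
        rest = [x for x in items if not (x["sensitive"] or x["load_peak"])]
        rng.shuffle(high)
        rng.shuffle(rest)
        # High-value data first, then fill the quota with ordinary data.
        sample.extend((high + rest)[:per_stratum])
    return sample
```

An equal quota per stage keeps any single learning stage from dominating the pre-training set.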
[0079] In this embodiment, in step S4, the dual-channel neural network serves as the meta-policy network. The input layer receives the comprehensive cognitive state vector, the hidden layer employs a multi-layer structure, and the output layer generates the macroscopic action probability distribution and microscopic action parameter configuration. The network weights are randomly initialized using a Gaussian distribution. A pre-defined objective function is used based on the policy gradient.
[0080] $$L(\theta) = -\,\mathbb{E}_{(s,z)\sim\pi_\theta}\left[\log \pi_\theta(z \mid s)\,\hat{Q}(s,z)\right],$$
[0081] In the formula, $L(\theta)$ represents the policy gradient loss function, which is the optimization objective of the policy network; $\theta$ represents the trainable parameter vector of the policy network; $\mathbb{E}_{(s,z)\sim\pi_\theta}$ indicates the expectation operator under the policy $\pi_\theta$, where $s$ represents the state variable and $z$ represents the action variable; and $\hat{Q}(s,z)$ represents the estimated state-action value function.
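A policy-gradient loss built from the quantities defined above is commonly written as the negative expectation of $\log \pi_\theta(z \mid s)\,\hat{Q}(s,z)$. The sketch below is a minimal sample-based estimate under that assumption for a discrete macro action. Folding the cognitive load penalty into the value estimate through `load_penalty_weight` is a further assumption of this example, not a form stated in the description.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over macro-action logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_loss(batch: List[Tuple[Sequence[float], int, float, float]],
                         load_penalty_weight: float = 0.0) -> float:
    """Sample estimate of -E[log pi(z|s) * (Q_hat - weight * load)].

    Each batch item is (logits, chosen action index z, Q_hat(s, z), load index).
    """
    total = 0.0
    for logits, z, q_hat, load in batch:
        log_prob = math.log(softmax(logits)[z])
        # Assumption: the load penalty reduces the value credited to the action.
        penalized_value = q_hat - load_penalty_weight * load
        total += -log_prob * penalized_value
    return total / len(batch)
```

In an actual training loop the same expression would be evaluated in an automatic-differentiation framework so that gradients flow through the logits; the pure-Python version only shows the arithmetic.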
[0082] Specifically, during training, learning constraint weights are introduced to dynamically adjust the proportion of the cognitive load penalty term in the loss function, ensuring that the policy optimization process always conforms to the learning specification. A validation set evaluation mechanism is preset; when the policy's performance metrics on the validation set reach a preset target threshold and the cognitive load volatility remains within a reasonable range, pre-training is considered complete, and the converged network weights are saved as the initial meta-policy.
[0083] In this embodiment, in step S5, during the user learning session, a comprehensive cognitive state vector and combined rewards are continuously acquired to construct a continuous time-series data stream based on the initialized meta-policy. The policy network parameters are updated online through incremental reinforcement learning. The update process aims to maximize the expected cumulative reward and dynamically integrates the learning constraint weights to ensure that the policy optimization always conforms to the cognitive load and grade transfer norms.
[0084] Specifically, the optimized strategy is applied to the real-time state, outputting a two-layer decision sequence: the macro layer specifies the learning mode, and the micro layer generates specific execution parameters, forming a continuous sequence of personal learning paths. During the path generation process, the cognitive load index and grade transfer status are dynamically checked. If the cognitive load index exceeds the threshold or is in a transfer-sensitive period, the path parameters are automatically adjusted.
[0085] In this embodiment, in step S6, a differentiable virtual learning environment is constructed based on the comprehensive cognitive state vector and the reward function; the virtual learning environment simulates the state-transition dynamics and the reward generation mechanism. Using the policy network optimized in S5, Monte Carlo path generation is performed from the current comprehensive cognitive state vector.
[0086] Specifically, K future paths are generated through random sampling, and each path contains an action sequence and a cumulative reward sequence over a prediction horizon of length H. The pre-trained value network is invoked to evaluate the value of the initial action of each path, and the total value of each path is calculated using a discount factor.
[0087] During the evaluation process, the constraints of each path are dynamically checked, the most valuable path is selected from the valid paths, and the first action is extracted as an immediate decision. If all paths are invalid, a safety rollback mechanism is triggered, and the default safety action is selected.
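The rollout-and-select procedure described above can be sketched as follows. This is a simplified illustration: the path value is taken to be the discounted sum of simulated rewards, the value-network term is omitted, and `select_action`, `toy_policy`, and `toy_simulate` are hypothetical stand-ins for the policy network and the virtual learning environment.

```python
import random
from typing import Callable, List, Tuple


def discounted_return(rewards: List[float], gamma: float) -> float:
    """Sum of gamma**t * r_t over one simulated path."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def select_action(state, policy: Callable, simulate: Callable, is_valid: Callable,
                  K: int = 16, H: int = 5, gamma: float = 0.9,
                  default_action: str = "fun_review", seed: int = 0) -> str:
    """Simulate K paths of length H and return the first action of the best valid one."""
    rng = random.Random(seed)
    best_value, best_action = None, default_action
    for _ in range(K):
        s, actions, rewards, states = state, [], [], [state]
        for _ in range(H):
            a = policy(s, rng)
            s, r = simulate(s, a)
            actions.append(a)
            rewards.append(r)
            states.append(s)
        if not is_valid(states):  # constraint check over the whole path
            continue
        value = discounted_return(rewards, gamma)
        if best_value is None or value > best_value:
            best_value, best_action = value, actions[0]
    # If every path was invalid, best_action is still the default safe action.
    return best_action


def toy_simulate(load: float, action: str) -> Tuple[float, float]:
    """Toy environment: the state is a load value; returns (next_load, reward)."""
    if action == "word_memorization":
        return load + 0.3, 1.0
    return max(0.0, load - 0.2), 0.2


def toy_policy(load: float, rng: random.Random) -> str:
    """Toy stochastic policy choosing between two modes uniformly."""
    return rng.choice(["word_memorization", "fun_review"])
```

A call such as `select_action(0.0, toy_policy, toy_simulate, lambda states: max(states) <= 1.0)` rejects any path whose simulated load ever exceeds 1.0 and falls back to the default action when no path survives.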
[0088] In this embodiment, in step S7, the personal historical learning path and the planned future path are displayed side by side. The changes in cognitive state are presented in the form of a timeline, and key action nodes are marked with heatmaps to intuitively compare the execution effect of the historical path with the expected performance of the planned path. User feedback is analyzed in real time to identify high-frequency keywords. The specific steps to be optimized are located in combination with cognitive state indicators. Based on the analysis results, the reward function weight is automatically adjusted, the reward weight distribution is updated in real time on the interface, and a brief explanation is added.
[0089] Specifically, the reward function weights are automatically adjusted based on real-time user feedback on the learning path, such as the difficulty being too high, the content being repetitive, or the material being easy to forget. High-frequency keywords are identified through NLP. If the percentage of negative feedback / frequency of high-frequency keywords exceeds a threshold, the weight of the corresponding learning stage is adjusted. If feedback indicating that the difficulty is too high appears frequently, the reward weight related to the difficulty challenge is reduced. If feedback indicating a lack of relevance appears frequently, the reward weight related to personalized recommendations is increased.
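A minimal sketch of the feedback rule above is shown below. Plain substring matching stands in for the NLP keyword identification, and the keyword phrases, weight names, threshold, and step size are all hypothetical values chosen for illustration.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical mapping: keyword phrase -> (reward weight name, adjustment direction).
KEYWORD_RULES = {
    "too hard": ("difficulty_challenge", -1),
    "not relevant": ("personalization", +1),
}


def adjust_from_feedback(comments: List[str], weights: Dict[str, float],
                         threshold: float = 0.3, step: float = 0.1) -> Dict[str, float]:
    """Adjust weights whose keyword appears in more than `threshold` of comments."""
    counts = Counter()
    for comment in comments:
        text = comment.lower()
        for keyword in KEYWORD_RULES:
            if keyword in text:
                counts[keyword] += 1
    adjusted = dict(weights)  # leave the caller's weights untouched
    for keyword, n in counts.items():
        if n / len(comments) > threshold:
            name, direction = KEYWORD_RULES[keyword]
            adjusted[name] = max(0.0, adjusted[name] + direction * step)
    return adjusted
```

With three comments of which two contain "too hard", the difficulty challenge weight drops by one step while the personalization weight is unchanged.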
[0090] Specifically, the system monitors cognitive state indicators such as vocabulary mastery rate, forgetting rate, and learning efficiency in real time. If these indicators deviate from preset targets by more than a threshold, weight adjustments are triggered. The system also compares the historical execution results with the expected performance of the planned path along a timeline. If the difference exceeds a threshold, for example when the actual learning time is 20% longer than expected or the actual mastery rate is 15% lower than expected, it triggers adjustments to related weights such as path time allocation and action efficiency, narrowing the gap between execution and planning. The reward function weights are automatically adjusted as follows:
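[0091] $$w_{\text{new}} = w_{\text{old}} + \eta \cdot e \cdot \sigma,$$
[0092] In the formula, $w_{\text{new}}$ indicates the new weight, $w_{\text{old}}$ indicates the old weight, $\eta$ represents the learning rate, $e$ represents the error, and $\sigma$ indicates the sensitivity.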
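The update is defined in terms of an old weight, a learning rate, an error, and a sensitivity. The sketch below assumes an additive update in which these three factors are multiplied, and it assumes the error is the relative gap between actual and expected values; both are illustrative assumptions, as are the function names.

```python
def relative_gap(actual: float, expected: float) -> float:
    """Signed relative deviation of the actual value from the expected value."""
    return (actual - expected) / expected


def update_weight(w_old: float, error: float, sensitivity: float,
                  learning_rate: float = 0.1) -> float:
    """Assumed form: w_new = w_old + learning_rate * error * sensitivity."""
    return w_old + learning_rate * error * sensitivity


def maybe_adjust(w_old: float, actual: float, expected: float, threshold: float,
                 sensitivity: float = 1.0, learning_rate: float = 0.1) -> float:
    """Adjust the weight only when the gap exceeds the trigger threshold."""
    error = relative_gap(actual, expected)
    if abs(error) <= threshold:
        return w_old  # within tolerance: no adjustment is triggered
    return update_weight(w_old, error, sensitivity, learning_rate)
```

Using the 20% learning-time example from the description: a session planned for 60 minutes that takes 66 minutes (10% over) leaves the weight unchanged, while one that takes 75 minutes (25% over) triggers an update.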
[0093] The system analyzes user feedback in real time to identify high-frequency keywords; it then combines these with cognitive state indicators to pinpoint specific learning stages that need optimization; based on the analysis results, the system automatically adjusts the reward function weights to optimize the learning path; and it updates the interface in real time, allowing users to visually see the adjusted reward weight distribution.
[0094] In this embodiment, in a junior high school English learning scenario, the user uses a smart learning app to accumulate vocabulary. In phase S1, the system captures the user's vocabulary practice responses in real time, such as correct / incorrect answers, single-question reaction time, exercise type, and operation trajectory, such as clicking prompt buttons, page dwell time, and skipping questions. The system then performs real-time cleaning and aggregation of the raw data to generate a standardized time-series data segment sequence based on a single learning session.
[0095] In the S2 phase, the system loads a pre-built vocabulary knowledge graph. This graph integrates learning features such as vocabulary difficulty level, word frequency index, word formation complexity, cross-grade transfer distribution, and semantic network centrality based on an authoritative educational corpus. The user's vocabulary operation behavior is dynamically mapped to the corresponding node in the knowledge graph. The system records the behavior events and timestamps and calculates the activation intensity of each vocabulary node. When the system recognizes that the user has just completed the transition from primary to junior high school, triggering grade transfer, it automatically applies the transfer sensitive period indicator function to weight and enhance the activation intensity of vocabulary nodes in adjacent grades, strengthening the perception of cross-grade knowledge gaps. At the same time, the system simultaneously calculates four core indicators from the activation intensity: mastery, memory stability, cognitive load, and transfer readiness. Through two-hop semantic diffusion aggregation of indicators, it generates a standardized comprehensive cognitive state vector.
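To make the two-hop semantic diffusion step concrete, the following is a minimal sketch over a small weighted graph. The per-hop `decay` factor, the adjacency format, and the additive aggregation are assumptions for illustration; the description states only that diffusion weights decay with edge strength and cognitive distance.

```python
from typing import Dict


def two_hop_diffusion(activation: Dict[str, float],
                      edges: Dict[str, Dict[str, float]],
                      decay: float = 0.5) -> Dict[str, float]:
    """Add neighbor activation to each node, attenuated per hop.

    edges[i][j] is the edge strength between words i and j (assumed in [0, 1]).
    """
    diffused = {}
    for i, a_i in activation.items():
        total = a_i
        for j, w_ij in edges.get(i, {}).items():
            # One hop: attenuated by edge strength and one decay factor.
            total += decay * w_ij * activation.get(j, 0.0)
            for k, w_jk in edges.get(j, {}).items():
                if k != i:
                    # Two hops: product of edge strengths and squared decay.
                    total += decay ** 2 * w_ij * w_jk * activation.get(k, 0.0)
        diffused[i] = total
    return diffused
```

In a chain run, runner, marathon with strengths 0.8 and 0.5 and only "run" activated at 1.0, "runner" receives 0.4 through one hop and "marathon" receives 0.1 through two hops.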
[0096] In the S3 phase, the system pre-defines a hierarchical action space. The macro layer includes four modes: word memorization, fun review, vocabulary test, and grade change; the micro layer dynamically generates operable subsets based on the user's comprehensive cognitive state. When the system detects that the user is in a grade transition sensitive period, such as having just moved up to junior high school, it forcibly freezes the vocabulary test action and inserts the fun review action. When the cognitive load indicator shows that the user's reaction time is too long, it automatically downgrades the word memorization action to a low-difficulty mode and inserts restorative fun review. The system constructs a four-element reward component: a short-term mastery reward based on the rate of change in mastery, a long-term retention reward based on the predicted future memory retention rate, a cognitive load reward applied as a negative penalty based on the load indicator, and a grade-level adaptation reward, which is a special incentive based on the growth rate of the new grade's vocabulary activation intensity during the transition sensitive period. The system dynamically adjusts the reward priority according to the progress of the learning session.
[0097] In the S4 stage, the system extracts complete learning session data from the historical database, constructs a state-action-reward triplet sequence, and trains a dual-channel policy network, using the learning stage factor to prioritize sampling of data from the transfer-sensitive period and cognitive load peaks. After the network weights are initialized from a Gaussian distribution, the proportion of the cognitive load penalty term is dynamically adjusted within the policy gradient optimization objective to ensure that the policy optimization conforms to the learning specification, and the resulting weights are saved as the initial meta-policy.
[0098] In the S5 stage, during the user's real-time learning session, the system continuously acquires the comprehensive cognitive state vector and combined rewards, and updates the policy parameters online through incremental reinforcement learning. The optimized policy outputs a two-layer decision sequence: the macro layer specifies the learning mode, and the micro layer generates specific execution parameters, forming a continuous personal learning path sequence. During the path generation process, the system dynamically checks the cognitive load status, and if the load is too high, it automatically adjusts the path parameters to avoid learning overload.
[0099] In the S6 stage, before making critical decisions, the system constructs a differentiable virtual learning environment to simulate dynamic state transitions and reward generation; using the policy network optimized in S5, multiple future paths are randomly generated, and the pre-trained value network is invoked to evaluate the value of the paths and dynamically check educational constraints; the path with the highest value is selected from the effective paths, and the first action is extracted as the immediate decision; if all paths are invalid, a safety rollback mechanism is triggered.
[0100] In the S7 phase, the system juxtaposes the user's historical learning path with their planned future path on a timeline, using heatmaps to highlight key nodes for a direct comparison of execution results and expected performance. The system analyzes user feedback in real time, identifying high-frequency keywords through NLP and using cognitive state indicators to pinpoint the steps to be optimized. When feedback indicating excessive difficulty is detected, the system automatically lowers the weight of difficulty challenge-related rewards. When feedback indicating a lack of relevance is detected, the system increases the weight of personalized recommendations. The adjusted reward weight distribution is updated in real time on the interface, with brief explanations such as "Due to your feedback that the difficulty was too high, the difficulty challenge weight has been lowered," ensuring the learning path dynamically adapts to the user's current learning pace and improves the learning experience.
[0101] Working principle: The system captures, in real time on the learning interface, the user's responses to vocabulary exercises, the reaction time for completing a single question, the exercise type, the operation trajectory, and the learning path behavior. These raw data undergo real-time preprocessing, including removing invalid records, filling missing values, and detecting and correcting data drift. The preprocessed data is aggregated into a continuous sequence according to time windows and divided into equal-length segments based on a single learning session. Each segment contains a multi-dimensional data vector within the time period, forming a structured temporal input.
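The cleaning and segmentation steps described above might be sketched as follows. The event fields (`word`, `correct`, `reaction_ms`), the median fill rule, and the function names are hypothetical; drift detection is omitted from this example.

```python
from statistics import median
from typing import Dict, List


def clean(events: List[Dict]) -> List[Dict]:
    """Drop invalid records and fill missing reaction times with the median."""
    # A record is treated as invalid if it lacks a word or a correctness label.
    valid = [e for e in events if e.get("word") and e.get("correct") is not None]
    known = [e["reaction_ms"] for e in valid if e.get("reaction_ms") is not None]
    fill = median(known) if known else 0.0
    return [
        {**e, "reaction_ms": e["reaction_ms"] if e.get("reaction_ms") is not None else fill}
        for e in valid
    ]


def segment(events: List[Dict], length: int) -> List[List[Dict]]:
    """Split a session into equal-length segments, discarding a short tail."""
    return [events[i:i + length] for i in range(0, len(events) - length + 1, length)]
```

Discarding the incomplete tail keeps every segment the same length, which matches the equal-length requirement at the cost of losing a few trailing events per session.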
[0102] A pre-built vocabulary knowledge graph is loaded, and a feature vector is injected into each vocabulary node, including the vocabulary difficulty identifier, usage frequency, word form change difficulty, knowledge transfer node record, and semantic network importance assessment. Vocabulary operation behaviors in the time-series data segments are dynamically mapped to the corresponding nodes in the knowledge graph, and behavior events and timestamps are recorded to establish a real-time behavior-vocabulary association index. Based on the mapping results, the activation intensity of each vocabulary node is calculated, with special identification of transfer-sensitive period data triggered by grade-level changes. The activation intensity of vocabulary nodes in adjacent grades is weighted and enhanced using a grade-level transfer difficulty coefficient to strengthen the perception of cross-grade knowledge gaps. Simultaneously, four core cognitive indicators are calculated from the activation intensity. Based on the semantic relationships of the knowledge graph, two-hop semantic diffusion is performed, with diffusion weights decaying according to edge strength and cognitive distance. The aggregated indicators form a basic state vector, and a standardized comprehensive cognitive state vector is generated through double exponential smoothing of the temporal sequence. A hierarchical action space is constructed, consisting of a macro-action layer and a micro-action layer. The macro layer includes four core learning modes: word memorization, fun review, vocabulary testing, and grade change. The micro layer dynamically maps the comprehensive cognitive state vector to generate operable subsets. When a grade-change sensitive period is detected, the vocabulary testing action is forcibly frozen, and a fun review action is inserted.
When the cognitive load indicator exceeds a threshold, the word memorization action is automatically downgraded to a lower difficulty mode, and a restorative fun review action is inserted. The system constructs four-element reward components, including short-term mastery rewards based on the rate of change in mastery, long-term retention rewards based on the predicted future memory retention rate, cognitive load rewards applied as negative penalties according to load indicators, and specific incentives based on the growth rate of new grade vocabulary activation intensity during the transfer-sensitive period. The reward priority weights are dynamically adjusted according to the learning session progress. The final reward function is constructed by integrating the basic components and dynamic weights, and dynamic safety constraints for micro-action parameters are preset based on historical data statistics. A dual-channel policy network is constructed for decision-making, with the macro-channel outputting the probability distribution of learning modes and the micro-channel generating specific parameter configurations. Complete historical learning session data is extracted from the database to construct a state-action-reward triplet sequence, where the state is the comprehensive cognitive state vector, the action is a hierarchical action space decision, and the reward is the combined reward function. Historical data is stratified and sampled based on learning stage factors, prioritizing high-value data from transfer-sensitive periods and peak cognitive load to construct a balanced training sample distribution covering multiple scenarios. A dual-channel neural network is used as the meta-policy network. The input layer receives the comprehensive cognitive state vector, the hidden layer uses a multi-layer structure, and the output layer generates macroscopic action probability distributions and microscopic action parameter configurations.
Network weights are randomly initialized using a Gaussian distribution. The objective function is defined based on policy gradient theory. Constraint weights are introduced during training to dynamically adjust the proportion of the cognitive load penalty term in the loss function, ensuring that policy optimization always conforms to the learning specification. A validation set evaluation mechanism is set up. When the performance indicators of the policy on the validation set reach the preset target and the cognitive load volatility remains within a reasonable range, the pre-training is considered complete, and the converged network weights are saved as the initial meta-policy. Dynamic adaptation of the learning path is achieved through online policy optimization and personal path generation. The system continuously acquires the real-time cognitive state and rewards, performs incremental updates based on the initial policy, and dynamically integrates constraint weights. A two-layer decision output forms a personalized learning path, and cognitive load and transfer state are verified in real time during the generation process. The safety of key decisions is ensured through safe simulation and optimal action selection. The system simulates multiple future paths in a virtual environment, evaluates their value, and dynamically verifies constraints, selecting only safe and valid optimal actions. If all paths are invalid, a safety rollback mechanism is triggered. A user-participatory continuous optimization loop is established through visualization and adaptive adjustment. The system displays historical and planned paths side by side, visually presenting changes in cognitive state using a timeline and heatmap, and embeds a lightweight feedback entry to collect user experience data. After intelligent analysis of the feedback, the reward weights are dynamically adjusted, the weight distribution is updated in real time, and the reasons for the adjustment are explained.
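The double exponential smoothing mentioned in the working principle is sketched below using Holt's linear method, which maintains a level and a trend with separate smoothing coefficients. Applying it to a single indicator series, the initialization from the first two points, and the default coefficients are assumptions of this example; the description does not give the exact smoothing formula.

```python
from typing import List, Sequence


def double_exponential_smoothing(series: Sequence[float],
                                 alpha: float = 0.5,
                                 beta: float = 0.3) -> List[float]:
    """Holt's linear smoothing: returns the smoothed level at each time step.

    alpha is the level smoothing coefficient; beta is the trend smoothing coefficient.
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    # Initialize the level with the first value and the trend with the first difference.
    level, trend = series[0], series[1] - series[0]
    smoothed = [level]
    for x in series[1:]:
        prev_level = level
        level = alpha * x + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        smoothed.append(level)
    return smoothed
```

On a perfectly linear series such as 1, 2, 3, 4 the smoothed levels reproduce the input, while noisy indicator readings are pulled toward the running trend, which damps jitter in the state vector between consecutive interactions.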
[0103] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.
[0104] The present invention and its embodiments have been described above. This description is not restrictive, and the accompanying drawings are only one embodiment of the present invention; the actual structure is not limited thereto. In conclusion, if those skilled in the art are inspired by this description and design similar structures and embodiments without departing from the spirit of the invention, such designs should fall within the protection scope of the present invention.
Claims
1. A vocabulary intelligent learning path planning method based on reinforcement learning, characterized in that it includes the following steps: S1. Collect learners' multi-source interactive behavior data in real time and perform standardized preprocessing; S2. Based on the time-series data and the preset knowledge graph, perform dynamic cognitive state representation and generate a comprehensive cognitive state vector; S3. Preset a hierarchical action space and a combined reward function to configure decision-making; S4. Based on the comprehensive cognitive state vector and decision, pre-train on historical data to initialize the meta-policy network; S5. Based on the initial policy, combined with real-time user status and rewards, perform online policy optimization and personal path generation; S6. Before making critical decisions, simulate multiple future paths in a safe simulation environment and evaluate them through a value network to select the optimal immediate action; S7. Based on historical and planned paths, visualize the data and adaptively adjust the reward function weights based on user feedback.
2. The vocabulary intelligent learning path planning method based on reinforcement learning according to claim 1, characterized in that: In S1, during the interaction process of the learning interface, the user's response to vocabulary practice, reaction time to complete a single question, and practice type are captured in real time. The user's operation trajectory, learning path behavior, and temporal structure of the learning session are tracked in real time. The raw data collected in real time is preprocessed in real time. The preprocessed data is aggregated into a continuous sequence according to the time window. The data is divided into segments of equal length based on a single user learning session. Each segment contains a multi-dimensional data vector within the time period, generating a standardized temporal data segment sequence.
3. The vocabulary intelligent learning path planning method based on reinforcement learning according to claim 1, characterized in that: In step S2, a pre-built vocabulary knowledge graph is loaded, a learning feature vector is injected into each vocabulary node, vocabulary operation behaviors in time-series data segments are dynamically mapped to corresponding vocabulary nodes in the knowledge graph, behavior events and timestamps are recorded, a real-time behavior-vocabulary association index is established, and the activation intensity of each vocabulary node is calculated based on the mapping results by a formula whose terms are: the dynamic activation intensity of word i at time t; the set of behavioral events for word i; the cognitive accuracy of behavior k; the reaction time of behavior k; a reaction time offset constant; a time decay coefficient; a base activation offset; a migration enhancement coefficient; the indicator function for the migration-sensitive period; and the difficulty level of grade-level transfer.
4. The vocabulary intelligent learning path planning method based on reinforcement learning according to claim 3, characterized in that: In step S2, based on the semantic relationships of the knowledge graph, two-hop semantic diffusion is performed on the activation intensity. The diffusion weight decays according to the edge strength and cognitive distance. Multi-dimensional cognitive indicators are aggregated into a basic state vector. The time series is processed through double exponential smoothing to generate a standardized comprehensive cognitive state vector by a formula whose terms are: the comprehensive cognitive state vector; normalization by a norm constraint; the dynamically weighted degree of control; the double exponential smoothing function; the level smoothing coefficient; and the trend smoothing coefficient.
5. The vocabulary intelligent learning path planning method based on reinforcement learning according to claim 1, characterized in that: In S3, a hierarchical architecture of macroscopic action layer and microscopic action layer is preset. The macro level includes four core learning modes: vocabulary memorization, fun review, vocabulary testing, and grade level switching; The micro-level generates an operable subset for each macro-action based on dynamic mapping of comprehensive cognitive state vectors. Based on the comprehensive cognitive state vector, a four-element reward component is constructed, including short-term mastery reward, long-term retention reward, cognitive load reward, and grade adaptation reward. The reward priority weights are dynamically adjusted based on the progress of the learning session, and the basic components and dynamic weights are combined to construct the final reward function. Based on historical data statistics, dynamic safety constraints are preset for micro-action parameters; a dual-channel policy network decision-making is constructed: the macro-channel outputs the probability distribution of the learning mode, and the micro-channel generates specific parameter configurations.
6. The vocabulary intelligent learning path planning method based on reinforcement learning according to claim 1, characterized in that: In step S4, complete learning session data is extracted from the system database to construct a state-action-reward triplet sequence, where the state is a comprehensive cognitive state vector, the action is a hierarchical action space decision, and the reward is a combined reward function. Based on the learning stage factor, historical data is sampled in layers, and high-value data from the transfer-sensitive period and cognitive load peak are selected first to construct a balanced training sample distribution.
7. The vocabulary intelligent learning path planning method based on reinforcement learning according to claim 6, characterized in that: In step S4, the dual-channel neural network serves as the meta-policy network; the objective function is preset based on the policy gradient: $$L(\theta) = -\,\mathbb{E}_{(s,z)\sim\pi_\theta}\left[\log \pi_\theta(z \mid s)\,\hat{Q}(s,z)\right],$$ In the formula, $L(\theta)$ represents the policy gradient loss function; $\theta$ represents the trainable parameter vector of the policy network; $\mathbb{E}_{(s,z)\sim\pi_\theta}$ indicates the expectation operator under the policy $\pi_\theta$, where $s$ represents the state variable and $z$ represents the action variable; and $\hat{Q}(s,z)$ represents the estimated state-action value function. During training, learning constraint weights are introduced to dynamically adjust the proportion of the cognitive load penalty term in the loss function. A preset validation set evaluation mechanism is used. When the policy's performance metrics on the validation set reach a preset target threshold and the cognitive load volatility remains within a reasonable range, pre-training is considered complete, and the converged network weights are saved as the initial meta-policy.
8. The vocabulary intelligent learning path planning method based on reinforcement learning according to claim 1, characterized in that: In step S5, during the user learning session, a comprehensive cognitive state vector and combined rewards are continuously acquired to construct a continuous temporal data stream based on the initialized meta-policy. By using incremental reinforcement learning to update the policy network parameters online, the optimized policy is applied to the real-time state, and a two-layer decision sequence is output: the macro layer specifies the learning mode, and the micro layer generates specific execution parameters, forming a continuous sequence of personal learning paths.
9. The vocabulary intelligent learning path planning method based on reinforcement learning according to claim 1, characterized in that: In step S6, a differentiable virtual learning environment is constructed based on the comprehensive cognitive state vector and the reward function. Using the S5-optimized policy network, Monte Carlo path generation is performed on the current comprehensive cognitive state vector. K future paths are generated through random sampling, each path containing an action sequence and a cumulative reward sequence. The pre-trained value network is invoked to evaluate the value of the starting action of each path, calculate the total value of the path, and dynamically check the constraint indicators of each path during the evaluation process. The path with the highest value is selected from the effective paths, and the first action is extracted as the immediate decision. If all paths are invalid, a safety fallback mechanism is triggered, and the default safe action is selected.
10. The vocabulary intelligent learning path planning method based on reinforcement learning according to claim 1, characterized in that: In S7, the personal historical learning path and the planned future path are displayed side by side. The changes in cognitive state are presented in the form of a timeline, and key action nodes are marked with heat maps to intuitively compare the execution effect of the historical path with the expected performance of the planned path. The system analyzes user feedback in real time, identifies high-frequency keywords, and uses cognitive state indicators to pinpoint specific areas for optimization. Based on the analysis results, it automatically adjusts the reward function weights and updates the reward weight distribution in real time on the interface.