An educational large model architecture for personalized learning
By constructing an educational knowledge graph and fine-tuning with a joint loss function, the invention addresses the limitations of large language models in educational applications and the challenges of deployment, enabling efficient, accurate, and low-cost generation of educational content for personalized learning.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING WISDOM RONGSHENG TECH CO LTD
- Filing Date
- 2026-03-26
- Publication Date
- 2026-06-19
AI Technical Summary
Existing educational language models lack a deep understanding of the educational knowledge system in educational applications, leading to conceptual confusion or incorrect deductions when answering questions, generating content that does not conform to the facts of the subject, failing to accurately identify students' weak knowledge areas, lacking scientific consideration in learning path planning, and reducing the usability of the technology due to high deployment costs and complex operations.
We construct an educational knowledge graph, obtain knowledge embedding vectors through graph neural networks, perform cross-attention fusion by combining a pre-trained large language model, fine-tune the teacher model using a joint loss function, perform knowledge distillation to generate a lightweight education vertical model, and provide personalized learning content by combining user learning profiles.
It enables personalized learning to run efficiently on ordinary computing devices, generating highly accurate and factual educational content, identifying learning weaknesses and providing targeted guidance, reducing deployment costs, and improving teaching reliability and efficiency.
Figure CN122242569A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of educational artificial intelligence technology, and more specifically to an educational large model architecture for personalized learning.
Background Technology
[0002] Current intelligent assistive technologies in education face several deep-seated problems. General-purpose large language models have made significant progress in natural language processing, but they show clear limitations in educational applications. They lack a deep understanding of educational knowledge systems, which often leads to conceptual confusion or incorrect derivations when answering questions; for example, when explaining mathematical theorems they fail to grasp the logical connections between concepts, causing students to form misconceptions. These models are also prone to "hallucination", generating content that seems plausible but is factually wrong, such as fabricating historical evidence when explaining the causes of historical events, which seriously undermines the reliability of teaching. Existing systems also fall short in personalized learning support: they fail to accurately identify specific weaknesses in a student's knowledge network, so the recommended content does not address the student's actual learning obstacles. Learning path planning lacks scientific consideration of knowledge dependencies and cognitive patterns; for example, in mathematics it does not take into account that understanding "functions" is a prerequisite for learning "derivatives", leaving students confused when knowledge points are skipped. At the technical implementation level, the integration of knowledge and models remains at the superficial level of concatenation or external invocation and does not truly internalize structured knowledge into model capabilities, resulting in a disconnect between answers and the knowledge base. Multi-objective training is difficult to balance: models are either fluent but lacking in subject expertise, or expert but stilted in expression, making it difficult to meet both needs of educational scenarios at once. Furthermore, the high deployment cost of large models is a significant obstacle to widespread adoption. Most educational institutions cannot afford the necessary computing resources, leaving advanced AI technology disconnected from grassroots educational scenarios. Teachers have also found in practice that, even where deployment conditions are met, complex operation and high technical barriers significantly reduce the technology's usability.
[0003] In view of this, the present invention proposes an educational large model architecture for personalized learning to solve the above problems.
Summary of the Invention
[0004] To overcome the aforementioned shortcomings of the existing technology, this invention provides the following technical solution: an educational large model architecture for personalized learning, comprising:
a knowledge graph construction module, used to acquire and preprocess multi-source educational data, perform quality assessment and screening of the preprocessed educational data, extract the semantic relationships between educational entities from the screened educational data, and perform entity disambiguation and knowledge consistency verification to construct an educational knowledge graph;
an embedding encoding module, used to perform message-passing encoding on the educational knowledge graph through a relation type-aware graph neural network to obtain the knowledge embedding vector of each entity node;
a knowledge fusion module, used to perform cross-attention fusion of the knowledge embedding vectors and the hidden state representations of lexical units in a pre-trained large language model to obtain knowledge-enhanced representations, filter the knowledge-enhanced representations according to the semantic relevance of the context, and dynamically adjust the knowledge contribution through gating weights to obtain the fused contextual representation;
a joint fine-tuning module, used to fine-tune the pre-trained large language model based on the fused contextual representation, using a joint loss function that includes language model loss, knowledge alignment loss, and logical consistency loss, combined with a curriculum learning strategy, to obtain the teacher model;
a knowledge distillation module, used to perform multi-level distillation training on a student model with fewer parameters than the teacher model, based on the soft label probability distribution, intermediate layer feature representation, and true labels output by the teacher model on the training samples, to obtain a lightweight education vertical model;
a personalized learning service module, used to generate personalized learning content and dynamic learning paths based on the lightweight education vertical model combined with user learning profile information and knowledge forgetting curves.
[0005] The technical effects and advantages of the educational large model architecture for personalized learning of this invention are as follows. The invention significantly reduces model size through knowledge distillation, enabling efficient operation on ordinary computing devices. This lowers the hardware barrier to AI education applications, making intelligent education technology affordable for more educational institutions. It can be flexibly deployed in cloud, local, or hybrid modes. Regarding content quality, the architecture provides reliable knowledge constraints for the generation process through deep integration of knowledge graphs and large models. This ensures that model answers are built upon a structured knowledge system, effectively suppressing "hallucination" and significantly improving the accuracy and factuality of the content. More importantly, in personalized teaching, this invention not only provides correct answers but also offers problem-solving strategies and methodologies that align with teaching principles, helping students build a systematic knowledge framework. In specific scenarios, it can accurately identify learners' weaknesses and provide targeted learning resources and guidance.
Attached Figure Description
[0006] Figure 1 is a schematic diagram of an educational large model architecture for personalized learning according to the present invention.
Detailed Implementation
[0007] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0008] This application provides an educational large-scale model architecture for personalized learning. The execution entities of the architecture include, but are not limited to, intelligent education platforms, personalized learning systems, educational resource recommendation engines, and knowledge graph-assisted learning systems, which can be regarded as general computing nodes of this application. The educational model system includes, but is not limited to, at least one of the following: a cloud-based educational large-scale model inference engine, a distributed knowledge graph construction system, and a lightweight model deployment environment.
[0009] Referring to Figure 1, in this embodiment of the invention, an educational large model architecture for personalized learning includes the following modules. The knowledge graph construction module is used to acquire and preprocess multi-source educational data, perform quality assessment and screening of the preprocessed data, extract semantic relationships between educational entities from the screened data, and perform entity disambiguation and knowledge consistency verification to construct an educational knowledge graph. The multi-source educational data includes e-textbooks, teaching syllabi, question banks, past exam questions, teaching video subtitles, and encyclopedia entries, acquired through multi-channel data acquisition interfaces. E-textbooks record the systematic knowledge system of each subject, teaching syllabi clarify the ability requirements for different grade levels, question banks and past exam questions reflect how knowledge points are examined and how difficult they are, teaching video subtitles contain the teacher's explanation logic, and encyclopedia entries provide extensive background knowledge. This data provides comprehensive raw material for subsequent graph construction, supporting the completeness and accuracy of the educational knowledge graph.
[0010] The embedding encoding module is used to perform message-passing encoding on the educational knowledge graph using a relation-type-aware graph neural network, obtaining a knowledge embedding vector for each entity node. The knowledge embedding vector includes the semantic representation of the entity, the directional features of the relations, and the structured representation of the knowledge hierarchy. The semantic representation describes the core meaning of educational concepts, the directional features of the relations reflect the logical order and dependencies between knowledge points, and the structured representation quantifies the organizational structure of the knowledge system. These multi-dimensional features together constitute the vectorized representation of educational knowledge, providing a mathematical foundation for subsequent knowledge fusion.
[0011] The knowledge fusion module performs cross-attention fusion between knowledge embedding vectors and the hidden state representations of lexical units in a pre-trained large language model to obtain knowledge-enhanced representations. These enhanced representations are then filtered based on contextual semantic relevance, and the knowledge contribution is dynamically adjusted using gating weights to obtain the fused contextual representation. This module first performs cross-modal mapping on the knowledge embedding vectors to make them compatible with the representation space of the large language model. Then, through a cross-attention mechanism, it deeply fuses structured knowledge with textual representations, accurately locating semantically related regions. Finally, through an adaptive gating mechanism, it dynamically adjusts the proportion of knowledge introduced based on contextual needs, providing semantically enhanced input representations for joint fine-tuning.
[0012] The joint fine-tuning module is used to fine-tune the pre-trained large language model based on the fused contextual representation, employing a joint loss function that includes language model loss, knowledge alignment loss, and logical consistency loss, combined with curriculum learning strategies, to obtain the teacher model. This module uses a multi-objective optimization method to simultaneously train the model's language comprehension and knowledge reasoning abilities, enabling the model to accurately adhere to facts and rules in the educational domain while maintaining fluent expression. This results in an expert-level teacher model that understands both language and knowledge, laying the foundation for subsequent knowledge distillation.
[0013] The knowledge distillation module is used to perform multi-level distillation training on student models, which have fewer parameters than the teacher model, based on the soft label probability distribution, intermediate layer feature representation, and true labels output by the teacher model on training samples. This results in a lightweight education-specific model. The module employs a temperature-smoothed soft label training and feature matching strategy to efficiently transfer the "hidden knowledge" of the teacher model to the smaller student model. This allows the student model to mimic the reasoning process and knowledge representation of the teacher model, while significantly reducing computational resource requirements and enabling efficient deployment.
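The soft-label and true-label parts of this distillation objective can be illustrated with a minimal sketch. It assumes the commonly used formulation (temperature-scaled softmax, KL divergence between teacher and student, plus cross-entropy on the true label); the function names, the temperature of 2.0, and the weight `alpha` are illustrative assumptions rather than values from this application, and the intermediate-layer feature matching term is omitted for brevity.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a smoother distribution."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-label term and a hard-label term.

    temperature and alpha are hypothetical defaults chosen for illustration.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions, scaled by T^2
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    soft_term = (temperature ** 2) * kl
    # Cross-entropy against the true label at temperature 1
    hard_term = -math.log(softmax(student_logits)[true_label])
    return alpha * soft_term + (1 - alpha) * hard_term

teacher = [2.0, 1.0, 0.0]
loss_matching = distillation_loss([2.0, 1.0, 0.0], teacher, true_label=0)
loss_mismatched = distillation_loss([0.0, 1.0, 2.0], teacher, true_label=0)
```

A student whose logits match the teacher's incurs no soft-label penalty, so `loss_matching` is smaller than `loss_mismatched`.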
[0014] The personalized learning service module is used to generate personalized learning content and dynamic learning paths based on a lightweight education vertical model combined with user learning profile information and knowledge forgetting curves. This module collects and analyzes user learning behavior data to construct a personalized knowledge state graph, and combines this with the Ebbinghaus forgetting curve model to predict the memory status of knowledge points, thereby generating targeted learning suggestions and optimal learning paths to provide users with a truly personalized learning experience.
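As a rough illustration of how a forgetting curve can drive review suggestions, the sketch below assumes the widely used exponential approximation of the Ebbinghaus curve, R = exp(-t / S), where t is the time since the last review and S is a per-knowledge-point memory stability. The data layout, the stability values, and the 0.6 threshold are hypothetical and not taken from this application.

```python
import math

def retention(hours_since_review, stability_hours):
    """Estimated recall probability under R = exp(-t / S)."""
    return math.exp(-hours_since_review / stability_hours)

def points_due_for_review(knowledge_state, threshold=0.6):
    """Return knowledge points whose estimated retention is below the threshold,
    weakest first. knowledge_state maps name -> (hours_since_review, stability_hours)."""
    scored = [(retention(t, s), name) for name, (t, s) in knowledge_state.items()]
    return [name for r, name in sorted(scored) if r < threshold]

# Hypothetical learner state
state = {
    "functions": (2.0, 48.0),     # reviewed recently, stable memory
    "derivatives": (72.0, 24.0),  # reviewed long ago, weak memory
    "limits": (30.0, 36.0),
}
due = points_due_for_review(state)
```

Here `due` lists "derivatives" before "limits", while "functions" is still well retained and is not scheduled.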
[0015] In this embodiment of the invention, the detailed implementation steps for constructing an educational knowledge graph include: A pre-trained language model is used to vectorize the text in the selected educational data, obtaining the contextual semantic vector for each word. Vectorization encoding is a fundamental step in knowledge extraction, transforming text into a machine-processable numerical representation. The encoding process employs a BERT-like pre-trained language model. After word segmentation of the input text, the model's multi-layer Transformer structure generates word representations that consider contextual information. The contextual semantic vector for each word is typically 768 or 1024 dimensions, containing deep semantic information about the word within a specific context. For educational texts, using a language model further fine-tuned on educational corpora can more accurately capture the semantics of specialized terms and concepts, improving the accuracy of subsequent entity recognition. The encoding results serve as feature inputs for subsequent entity recognition and relation extraction, providing a semantic foundation for graph construction.
[0016] The contextual semantic vector is input into a bidirectional long short-term memory network (BiLSTM), and the output feature vector sequence is globally decoded through a conditional random field (CRF) layer to identify educational entities and their category labels in the text. Entity recognition is the core step in graph construction, responsible for identifying educational concepts and terms from the text. The recognition process adopts a BiLSTM-CRF model structure. First, the BiLSTM captures long-distance dependencies in the sequence. Then, the CRF layer takes the transition constraints between labels into account to achieve globally optimal sequence labeling. For educational texts, entity categories include concepts, theorems, formulas, people, and events. A refined labeling scheme can improve the structure of the graph. The model is trained on a large-scale labeled dataset containing core concepts and knowledge points from various disciplines, and the parameters are optimized using a cross-entropy loss function. Recognition accuracy is typically evaluated using the F1 score. The entity recognition accuracy for different disciplines generally reaches over 85%, providing a high-quality node set for subsequent graph construction.
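The global decoding performed by a CRF layer is usually implemented with the Viterbi algorithm. The following is a minimal sketch of that decoding step only, assuming per-token emission scores (as a BiLSTM would produce) and a label transition score matrix; the label set, the scores, and the function name are made up for illustration.

```python
def viterbi_decode(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions[t][k]: score of label k at position t (e.g. from a BiLSTM).
    transitions[j][k]: score of moving from label j to label k.
    """
    n_labels = len(labels)
    score = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for k in range(n_labels):
            best_prev = max(range(n_labels), key=lambda j: score[j] + transitions[j][k])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][k] + emissions[t][k])
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final label
    best = max(range(n_labels), key=lambda k: score[k])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return [labels[k] for k in path]

labels = ["O", "B-CONCEPT", "I-CONCEPT"]
# Hypothetical scores for the tokens "the", "Pythagorean", "theorem"
emissions = [
    [2.0, 0.5, 0.1],
    [0.2, 1.0, 1.5],  # locally prefers I-CONCEPT, which is invalid right after O
    [0.1, 0.3, 2.0],
]
# The transition O -> I-CONCEPT is heavily penalized; all others are neutral
transitions = [
    [0.0, 0.0, -1e4],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
decoded = viterbi_decode(emissions, transitions, labels)
```

A greedy per-token choice would label the second token I-CONCEPT directly after O; the transition constraint makes the globally best path O, B-CONCEPT, I-CONCEPT instead.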
[0017] Entity disambiguation is performed on the identified educational entities, merging different expressions of the same concept into a unified entity. Entity disambiguation is a crucial step in ensuring graph consistency, reducing redundancy and confusion by merging synonymous entities. The process first constructs feature representations of entities, including contextual semantic vectors, lexical features, and co-occurrence patterns; then, it calculates the similarity between entity pairs to identify potential synonymous relationships; finally, it merges highly similar entities into a unified concept using clustering algorithms or graph alignment techniques. Common ambiguities in the education domain, such as the same concept expressed under different names (e.g., the "Pythagorean theorem" and the "Gougu theorem", two names for the same result), can be effectively resolved by combining a domain dictionary with contextual similarity. Disambiguation significantly reduces graph redundancy, improves the standardization and consistency of knowledge representation, and provides a clear entity foundation for subsequent relation extraction.
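One simple way to combine a domain dictionary with vector similarity is sketched below: mentions are merged when a dictionary lists them as aliases or when the cosine similarity of their vectors exceeds a threshold, and a union-find structure yields the resulting clusters. The vectors, the alias dictionary, and the 0.9 threshold are illustrative assumptions; the application does not specify a particular clustering algorithm.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def merge_synonyms(vectors, alias_dict=None, threshold=0.9):
    """Group entity mentions into clusters with union-find.
    Two mentions are merged if the dictionary lists them as aliases or
    their vectors have cosine similarity >= threshold."""
    alias_dict = alias_dict or {}
    names = list(vectors)
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            is_alias = alias_dict.get(a) == b or alias_dict.get(b) == a
            if is_alias or cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), set()).add(n)
    return sorted(sorted(c) for c in clusters.values())

# Hypothetical feature vectors for four mentions
vectors = {
    "Pythagorean theorem": [0.9, 0.1, 0.2],
    "Pythagoras' theorem": [0.88, 0.12, 0.21],  # merged by similarity
    "Gougu theorem": [0.5, 0.5, 0.2],           # merged by dictionary only
    "quadratic formula": [0.1, 0.9, 0.3],
}
aliases = {"Gougu theorem": "Pythagorean theorem"}
clusters = merge_synonyms(vectors, aliases)
```

The three names of the theorem end up in one cluster and "quadratic formula" stays separate; the dictionary catches the alias whose vector alone falls below the similarity threshold.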
[0018] After disambiguation, entity pairs are constructed for educational entities, and an attention-based relation classification model is used to determine the semantic relationship type between each entity pair. Relation extraction is a key step in constructing graph edges, determining the semantic connections and logical relationships between entities. The extraction process first identifies sentences containing entity pairs from the text as candidate relation samples; then, an attention-enhanced classification model is used to determine the relationship type between entity pairs. The model structure uses a pre-trained language model as the encoder, highlighting entity positions through entity tagging and position encoding, then capturing the interaction features between entities through a multi-head attention mechanism, and finally outputting the relation type probability through a multi-layer classifier. Relation types in the education domain include various semantic relations such as "belongs to," "premise," "applies to," "contains," and "leads to," covering various logical connections between knowledge points. The relation classification model is trained on large-scale labeled data and optimized using a cross-entropy loss function, achieving an average relation extraction accuracy of approximately 80%, effectively constructing the connection structure between entities in the graph.
[0019] Knowledge consistency verification is performed on extracted triples to detect and eliminate contradictory relation descriptions. Knowledge consistency verification is a necessary step to ensure the quality of the graph, eliminating potential knowledge contradictions through logical rule verification. The verification process first defines a series of domain constraint rules, such as mutual exclusion, transitivity, and symmetry of relations; then, these rules are applied to all extracted triples to detect instances of constraint violations; finally, contradictory triples are retained or removed based on evidence support and credibility. In practice, graph-based reasoning algorithms, such as path reasoning or rule mining, are used to automatically discover inconsistency patterns in the graph. For the specific characteristics of educational knowledge, such as its hierarchical structure and logical dependencies, specific consistency rules are designed, such as preventing circular dependencies (A is a prerequisite for B, and B is a prerequisite for A). Consistency verification significantly improves the reliability and logical rationality of the graph, providing a high-quality knowledge foundation for subsequent knowledge applications.
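The circular-dependency rule mentioned above can be checked with a depth-first search over the prerequisite edges. The sketch below is one possible way to do this, assuming triples of the form (head, relation, tail) in which (A, "prerequisite", B) means A is a prerequisite for B; the relation name and the sample triples are hypothetical.

```python
def find_prerequisite_cycle(triples):
    """Return one cycle of 'prerequisite' edges as a list of nodes, or None."""
    graph = {}
    for head, relation, tail in triples:
        if relation == "prerequisite":
            graph.setdefault(head, []).append(tail)
            graph.setdefault(tail, [])

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in graph}
    stack = []

    def visit(node):
        color[node] = GRAY
        stack.append(node)
        for nxt in graph[node]:
            if color[nxt] == GRAY:  # back edge closes a cycle
                return stack[stack.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        stack.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None

consistent = [
    ("functions", "prerequisite", "derivatives"),
    ("derivatives", "prerequisite", "integrals"),
    ("functions", "contains", "domain"),
]
contradictory = consistent + [("integrals", "prerequisite", "functions")]
cycle = find_prerequisite_cycle(contradictory)
```

For the contradictory set, `cycle` is the path functions, derivatives, integrals, functions; which triple on that path to drop would then be decided by evidence support and credibility, as described above. The recursive search is adequate for a sketch, but a very deep graph would call for an iterative version.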
[0020] Disambiguated educational entities are used as nodes, validated semantic relationships as edges, and each node is labeled with a knowledge difficulty level to construct an educational knowledge graph. Graph construction is the final step in integrating the results of the preceding steps, forming a structured knowledge representation. The construction process first maps entities and relationships to nodes and edges in the graph database; then, attribute information is added to nodes and edges, such as the entity's definition, attributes, examples, and difficulty level; finally, an index structure is built to optimize query efficiency. The labeling of knowledge difficulty levels is based on multiple factors, including the grade level specified in the teaching syllabus, the frequency and difficulty of historical exam questions, and the abstract complexity of concepts, typically divided into three levels (beginner, intermediate, advanced) or five more detailed levels. The final constructed educational knowledge graph not only contains static knowledge points but also defines the complex logical connections and difficulty levels between them, providing a structured knowledge foundation for subsequent embedding encoding and model training, and serving as the core support for the entire educational large-scale model architecture.
[0021] In this embodiment of the invention, the detailed implementation steps for obtaining the knowledge embedding vector of each entity node through a relation type-aware graph neural network include: An initial embedding vector is initialized for each entity node in the educational knowledge graph, and a relation embedding vector is initialized for each relation type. Initialization is the starting point for graph neural network encoding, providing the basic representation for subsequent message passing. The initialization process uses multiple feature sources: for entity nodes, initial representations are generated by combining semantic vectors extracted from a pre-trained language model with structural features of the knowledge graph (such as degree centrality); for relation types, dedicated relation vectors are created based on the semantics of the relation names and their statistical properties (such as frequency and connection patterns). The initial embedding vector dimension is typically set to 128-256, which is sufficient to capture complex semantic and structural features. To improve the quality of the initial representations, knowledge graph embedding methods such as TransE or RotatE can be used to pre-train the initial vectors so that they capture the basic semantics of entities and relations. Although these initial vectors reflect only local features and do not yet capture the multi-hop structure of the graph, they provide a good starting point for the subsequent graph neural network, accelerate training convergence, and improve the quality of the final embeddings.
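For readers unfamiliar with TransE, its core idea is that for a true triple (head, relation, tail) the vector head + relation should lie close to tail. The sketch below shows the distance and the margin ranking loss used to train it; the two-dimensional vectors and the margin of 1.0 are toy values chosen for illustration, not parameters from this application.

```python
import math

def transe_distance(head, relation, tail):
    """L2 distance ||h + r - t||; small for plausible triples."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))

def margin_loss(pos_triple, neg_triple, margin=1.0):
    """Margin ranking loss for one true triple and one corrupted triple."""
    return max(0.0, margin + transe_distance(*pos_triple) - transe_distance(*neg_triple))

# Toy embeddings (hypothetical values)
functions = [0.0, 0.0]
prerequisite = [1.0, 0.0]
derivatives = [1.0, 0.1]
unrelated = [-1.0, 2.0]

true_triple = (functions, prerequisite, derivatives)
corrupted_triple = (functions, prerequisite, unrelated)
loss = margin_loss(true_triple, corrupted_triple)
```

With these values the true triple is already much closer than the corrupted one, so the loss is zero; training would adjust the vectors only where the margin is violated.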
[0022] For each entity node, all its neighbor nodes and their corresponding relation types are obtained. The embedding vectors of the neighbor nodes are then transformed based on the relation embedding vectors of those relation types to obtain a relation-aware neighbor representation. Relation-aware transformation is the core innovation of the relation type-aware graph neural network, improving representational capability through differentiated processing of neighbor information based on different relations. The transformation process first collects all neighbors and their corresponding relation types for each node; then, for each neighbor node, a specific transformation function is used to process the neighbor's embedding vector based on the type of relation that connects it. The transformation function is designed as a combination of a relation-specific linear transformation and a non-linear activation function, with the following formula: $\tilde{h}_j^{(r)} = \sigma(W_r h_j + b_r)$; where $\tilde{h}_j^{(r)}$ is the representation of neighbor $j$ after transformation by relation $r$, $h_j$ is the current embedding vector of neighbor $j$, $W_r$ is the transformation matrix specific to relation $r$, $b_r$ is the bias vector of relation $r$, and $\sigma$ is a non-linear activation function (such as ReLU or tanh).
[0023] This relation-specific transformation allows the model to differentiate neighbor information based on different relation types. For example, the knowledge features conveyed by "premise" and "application" relations are different and require different transformations to capture correctly. The transformed neighbor representation retains the original semantic information while incorporating the directionality and semantics of the relation, providing rich feature input for subsequent attention aggregation.
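The relation-specific transformation can be sketched in a few lines: each relation type owns its own weight matrix and bias, and the same neighbor vector is transformed differently depending on the relation through which it is reached. The parameter values below are arbitrary illustrations and ReLU is chosen as the activation; in practice the matrices would be learned.

```python
def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu(vector):
    return [max(0.0, x) for x in vector]

def relation_transform(neighbor_vec, relation, params):
    """Apply ReLU(W_r h_j + b_r); params maps relation name -> (W_r, b_r)."""
    weight, bias = params[relation]
    linear = [z + b for z, b in zip(matvec(weight, neighbor_vec), bias)]
    return relu(linear)

# Hypothetical per-relation parameters
params = {
    "premise": ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    "application": ([[0.0, 1.0], [-1.0, 0.0]], [0.5, 0.0]),
}
h_j = [0.2, 0.8]
as_premise = relation_transform(h_j, "premise", params)
as_application = relation_transform(h_j, "application", params)
```

The same neighbor embedding yields two different representations, which is what lets the later attention step distinguish a "premise" neighbor from an "application" neighbor.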
[0024] The attention weights between the entity node and each relation-aware neighbor representation are calculated. Based on these attention weights, all relation-aware neighbor representations are weighted and aggregated to obtain a neighbor aggregation vector. Attention aggregation is a key step in adaptively integrating neighbor information, improving representation accuracy by learning neighbor importance weights. The aggregation process first calculates the attention score between the center node and each transformed neighbor, quantifying the importance of that neighbor to the current node; then, the transformed neighbor representations are weighted by these scores and summed to obtain the final neighbor aggregation vector. The attention score calculation uses the core mechanism of the Graph Attention Network (GAT), with the following formula: $\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a^{\top}\left[W h_i \,\|\, W \tilde{h}_j^{(r)}\right]\right)\right)}{\sum_{k \in \mathcal{N}(i)} \exp\left(\mathrm{LeakyReLU}\left(a^{\top}\left[W h_i \,\|\, W \tilde{h}_k^{(r)}\right]\right)\right)}$; where $\alpha_{ij}$ is the attention weight of node $i$ to neighbor $j$, $h_i$ is the embedding vector of node $i$, $\tilde{h}_j^{(r)}$ is the neighbor representation after relation transformation (with $r$ the relation type of the edge to that neighbor), $W$ is the attention parameter matrix, $a$ is the attention vector, $\|$ denotes vector concatenation, $\mathcal{N}(i)$ is the set of neighbors of node $i$, and LeakyReLU is the activation used in the standard GAT formulation.
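[0025] The formula for calculating the aggregation vector is: $m_i = \sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, \tilde{h}_j^{(r)}$; where $m_i$ is the neighbor aggregation vector of node $i$, that is, the sum of all transformed neighbor representations $\tilde{h}_j^{(r)}$ over the neighbor set $\mathcal{N}(i)$, each weighted by its attention weight $\alpha_{ij}$.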
[0026] This attention mechanism allows the model to automatically identify and emphasize important neighboring nodes. For example, for the concept of a "function," core related concepts such as "domain" and "range" receive higher attention weights, while peripheral related concepts are less affected. Attention aggregation effectively improves the targeting and efficiency of message passing, enabling the model to capture complex knowledge structures.
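A minimal sketch of this attention aggregation, following the standard GAT scoring scheme (shared projection, concatenation, attention vector, LeakyReLU, softmax), is given below. The two-dimensional embeddings, the identity projection matrix, and the attention vector are toy values assumed for illustration; the neighbor vectors stand for representations that have already been relation-transformed.

```python
import math

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def attention_aggregate(h_i, neighbors, W, a):
    """Score each neighbor against the center node, softmax the scores,
    and return (weights, attention-weighted sum of the neighbor vectors)."""
    wh_i = matvec(W, h_i)
    scores = []
    for h_j in neighbors:
        concat = wh_i + matvec(W, h_j)  # list concatenation: [W h_i || W h_j]
        scores.append(leaky_relu(sum(p * x for p, x in zip(a, concat))))
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(neighbors[0])
    aggregated = [sum(w * h_j[d] for w, h_j in zip(weights, neighbors)) for d in range(dim)]
    return weights, aggregated

# Toy setup: center node "function" with three transformed neighbors
h_function = [1.0, 0.0]
neighbors = [
    [0.9, 0.1],  # "domain"
    [0.8, 0.2],  # "range"
    [0.0, 1.0],  # a peripheral concept
]
W = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical projection (identity)
a = [0.0, 0.0, 1.0, -1.0]     # hypothetical attention vector
weights, m_function = attention_aggregate(h_function, neighbors, W, a)
```

With these values the weights decrease from "domain" to "range" to the peripheral concept, so the aggregated vector is dominated by the core related concepts.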
[0027] The updated node embedding vector is obtained by concatenating the neighbor aggregation vector with the entity node's own embedding vector and performing a nonlinear transformation. Node update is a key step in fusing self-information and neighbor information, generating richer node representations through reasonable feature combinations and transformations. The update process first concatenates the node's own embedding vector with the neighbor aggregation vector to form a joint representation containing both self-information and contextual information; then, a nonlinear transformation is performed using a multilayer perceptron (MLP) to fuse and compress features; finally, a residual connection is applied to preserve the original information and alleviate the vanishing-gradient problem in deep graph neural networks. The update formula is: $h_i' = \sigma\left(W_u \left[h_i \,\|\, m_i\right] + b_u\right) + h_i$; where $h_i'$ is the updated embedding vector of node $i$, $h_i$ is the current embedding vector of node $i$, $m_i$ is the neighbor aggregation vector, $W_u$ is the update transformation matrix, $b_u$ is the bias vector, $\|$ denotes vector concatenation, and $\sigma$ is a non-linear activation function, such as ReLU.
[0028] The residual connection (adding the node's current embedding vector back to the transformed output) ensures that the original node information is not lost even after multiple layers of message passing, while allowing the model to flexibly learn the balance between old and new information. This update mechanism enables node representations to retain their own semantic features while incorporating structural context information, forming a more comprehensive knowledge representation and providing a solid foundation for knowledge reasoning and application.
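The update step with its residual connection can be sketched as follows, assuming a single linear layer with ReLU in place of a full MLP. The update matrix must map the concatenated vector back to the node dimension so that the residual addition is well defined; all numeric values are illustrative.

```python
def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def update_node(h_i, m_i, W_u, b_u):
    """Compute ReLU(W_u [h_i || m_i] + b_u) + h_i, where the final addition
    is the residual connection."""
    concat = h_i + m_i  # list concatenation: [h_i || m_i]
    transformed = [max(0.0, z + b) for z, b in zip(matvec(W_u, concat), b_u)]
    return [t + h for t, h in zip(transformed, h_i)]

h_i = [1.0, 0.0]  # current node embedding
m_i = [0.5, 0.5]  # neighbor aggregation vector
# Hypothetical 2x4 update matrix and bias
W_u = [[0.0, 0.0, 1.0, 0.0],
       [0.0, 0.0, 0.0, -1.0]]
b_u = [0.0, 0.0]
h_new = update_node(h_i, m_i, W_u, b_u)

# With an all-zero update matrix the residual path returns the input unchanged
h_same = update_node(h_i, m_i, [[0.0] * 4, [0.0] * 4], b_u)
```

The second call shows the point made above: even when the transformation contributes nothing, the node's original information passes through unchanged.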
[0029] After repeatedly performing transformation, aggregation, and nonlinear transformation steps to a predetermined number of layers, a knowledge embedding vector for each entity node is obtained. Multi-layer iteration is a key mechanism for expanding the receptive field and capturing higher-order relationships, constructing a globally perceptive node representation through repeated message passing. The iterative process organizes the above three steps (relation-aware transformation, attention aggregation, and node update) into a graph neural network layer, and then stacks multiple such layers to achieve multi-hop message passing in the graph. Typically, 2-4 layers are stacked, enabling each node to perceive the knowledge structure within a 2-4 hop neighborhood. During multi-layer passing, to prevent oversmoothing, each layer can use different parameter configurations to adjust the strength of the relation transformation and the scope of attention aggregation. The output of the final layer serves as the final knowledge embedding vector for each node, with dimensions typically remaining the same as the initialization (128-256), but the content has been integrated with rich structured knowledge information. These embedding vectors not only encode the semantics of the entity but also contain the entity's position, relationship, and importance in the knowledge graph, providing a high-quality knowledge representation for subsequent integration with large language models.
[0030] In this embodiment of the invention, the detailed implementation steps for obtaining knowledge-enhanced representations include: The hidden state representation of the currently processed word in the pre-trained large language model is used as the query vector. The hidden state representation is an intermediate result of the large language model's text processing and contains the contextual semantic information of the word. When the model processes the input text, each word passes through a multi-layer Transformer structure, generating rich feature representations. For each word $w_t$, its hidden state representation $h_t$ is typically a vector of 768 to 4096 dimensions, with the specific dimension depending on the model size (e.g., BERT-base uses 768 dimensions, while larger GPT-3-like models can reach 4096 dimensions). These vectors encode the semantic information of words in the current sentence context, including their meaning, grammatical role, and contextual relationships. Using these hidden states as query vectors means that relevant knowledge graph information is retrieved based on the current text processing state, ensuring that the introduced knowledge is closely related to the content being processed. The quality of the query vectors directly affects the accuracy of subsequent knowledge retrieval; therefore, using deep representations from pre-trained large language models provides a richer semantic foundation.
[0031] The knowledge embedding vectors of all entity nodes in the educational knowledge graph are used as key vectors and value vectors, respectively. These knowledge embedding vectors are the result of graph neural network encoding and contain semantic and structural information about the entities. These vectors serve as keys and values in the attention mechanism, enabling the model to locate and extract relevant information from the knowledge graph. In practice, to ensure dimensionality compatibility, the knowledge embedding vectors are typically linearly projected onto the same dimensional space as the query vector. The key vectors are used to calculate the relevance to the query, while the value vectors provide the actual knowledge content. For each entity in the educational knowledge graph... Its knowledge embedding vector The key vector is obtained after projection transformation. Sum value vector The formula is: ; ;in, and These are the projection matrices of the keys and values, respectively, which transform the original knowledge embedding vector into a representation suitable for attention computation.
[0032] This dual-use knowledge embedding (as key and value) allows for the identification of relevant knowledge while preserving a complete representation of the knowledge, providing comprehensive knowledge information for subsequent fusion.
[0033] The dot product of the query vector and each key vector is calculated. The result is divided by the square root of the key vector's dimension and then normalized to obtain the attention score for each entity node. Attention calculation is a core step in determining knowledge relevance, quantifying the matching degree between text and knowledge through vector similarity. The calculation process employs a scaled dot product attention mechanism. First, the inner product of the query vector and each key vector is calculated to measure their similarity; then, scaling is applied to prevent the gradient vanishing problem at high dimensionality; finally, normalization is performed using the softmax function to obtain the attention score distribution. The formula is: $\alpha_i = \frac{\exp(q \cdot k_i / \sqrt{d_k})}{\sum_{j=1}^{N} \exp(q \cdot k_j / \sqrt{d_k})}$; where $\alpha_i$ is the attention score of the query vector $q$ for entity $i$, $k_i$ is the key vector of entity $i$, $d_k$ is the dimension of the key vector, and $N$ is the total number of entities in the knowledge graph.
[0034] The scaling factor $1/\sqrt{d_k}$ normalizes the dot product results so that, in high-dimensional space, excessively large dot product values do not cause the softmax gradient to vanish. This attention calculation mechanism enables the model to automatically identify the knowledge entities most relevant to the currently processed content without manually setting matching rules, exhibiting high flexibility and adaptability. For example, when processing text about "photosynthesis," it can automatically focus on concepts related to photosynthesis in the knowledge graph, such as "chlorophyll" and "carbon dioxide."
[0035] The value vectors of all entity nodes are weighted and summed based on the attention scores to obtain the knowledge-enhanced representation corresponding to the current word. Knowledge aggregation is the final step in generating the knowledge-enhanced representation, forming a comprehensive representation by weighted combination of related knowledge. The aggregation process uses the attention scores calculated in the previous step to perform a weighted average of the value vectors of all entities; high-scoring entities contribute more information, while low-scoring entities have less impact. The formula is: $c = \sum_{i=1}^{N} \alpha_i v_i$; where $c$ is the knowledge-enhanced representation, $\alpha_i$ is the attention score for entity $i$, $v_i$ is the value vector of entity $i$, and $N$ is the total number of entities.
[0036] This weighted aggregation mechanism ensures that the final knowledge-enhanced representation primarily includes the knowledge most relevant to the current text, while also retaining the proportional contribution of other related knowledge, resulting in a comprehensive yet focused knowledge representation. The dimension of the knowledge-enhanced representation is typically the same as the query vector, facilitating subsequent fusion operations. This dynamic retrieval and aggregation mechanism adaptively extracts the most relevant information from the knowledge graph for each input term, forming precise knowledge enhancement and laying the foundation for subsequent knowledge fusion.
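The retrieval steps above (projection to keys and values, scaled dot-product scoring, softmax normalization, weighted sum) can be sketched in a few lines of plain Python. This is a minimal illustration only: the function names, the toy two-dimensional embeddings, and the identity projection matrices are assumptions made for the example and do not come from the embodiment.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def knowledge_attention(query, entity_embeddings, W_K, W_V):
    """Return (attention scores, knowledge-enhanced representation)."""
    keys = [matvec(W_K, e) for e in entity_embeddings]
    values = [matvec(W_V, e) for e in entity_embeddings]
    d_k = len(keys[0])
    # Scaled dot product between the query and every key.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    alpha = softmax(scores)
    # Attention-weighted sum of the value vectors.
    dim = len(values[0])
    c = [sum(a * v[j] for a, v in zip(alpha, values)) for j in range(dim)]
    return alpha, c

# Toy example (hypothetical values): three entities, identity projections.
W_I = [[1.0, 0.0], [0.0, 1.0]]
entities = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
alpha, c = knowledge_attention([2.0, 0.0], entities, W_I, W_I)
```

In the toy example the first entity points in the same direction as the query, so it receives the largest attention score and dominates the aggregated representation.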
[0037] In this embodiment of the invention, the detailed implementation steps for filtering knowledge-enhanced representations based on contextual semantic relevance and dynamically adjusting knowledge contribution include: The cosine similarity between the hidden state representation of the current word and the knowledge-enhanced representation is calculated as the semantic relevance score. Semantic relevance calculation is a crucial step in determining whether knowledge is applicable to the current context, quantifying the fit between knowledge and text through vector similarity. The calculation uses cosine similarity, a classic metric for measuring the consistency of vector direction that is unaffected by vector length and is suitable for semantic comparison. The formula is: $s = \frac{h \cdot c}{\lVert h \rVert_2 \, \lVert c \rVert_2}$; where $s$ is the similarity score between the hidden state $h$ and the knowledge-enhanced representation $c$, with range $[-1, 1]$, and $\lVert h \rVert_2$ and $\lVert c \rVert_2$ are the L2 norms of the vectors $h$ and $c$.
[0038] A similarity value closer to 1 indicates that the two vectors are more aligned, meaning the current text is more semantically relevant to the retrieved knowledge; a value closer to 0 indicates almost no relevance; and a value closer to -1 indicates opposite semantic directions. This similarity calculation effectively identifies the degree of fit between knowledge and context, providing a quantitative standard for subsequent knowledge filtering and avoiding the introduction of irrelevant knowledge that could interfere with model understanding. For example, when dealing with a mathematical problem, it can distinguish whether physics knowledge needs to be introduced, even when there is some overlap in keywords.
[0039] When the semantic relevance score is below a preset relevance threshold, the knowledge-enhanced representation is set to zero; when the semantic relevance score is greater than or equal to the preset relevance threshold, the knowledge-enhanced representation is retained. Knowledge filtering is an important mechanism to prevent interference from irrelevant knowledge, cutting off the influence of low-relevance knowledge through a hard threshold. The filtering process compares the similarity score calculated in the previous step with the preset threshold to make a binary decision. The preset relevance threshold $\theta$ is typically set between 0.3 and 0.5. This range filters out most irrelevant knowledge while retaining weakly relevant but potentially useful information. The filtering operation can be represented as: $\tilde{c} = \begin{cases} c, & s \ge \theta \\ \mathbf{0}, & s < \theta \end{cases}$; where $\tilde{c}$ is the filtered knowledge-enhanced representation and $\mathbf{0}$ is a zero vector.
[0040] This hard filtering mechanism effectively avoids interference from low-relevance knowledge, preventing the model from becoming biased or misleading due to the introduction of incorrect knowledge. For example, when processing content related to "the sum of the interior angles of a triangle," if knowledge of "quadrilateral" is retrieved but the similarity is insufficient, this knowledge will be filtered out to ensure the accuracy of knowledge application. This mechanism is particularly suitable for educational scenarios because educational content typically requires precise knowledge support, rather than vague associations.
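The relevance check and hard-threshold filter described above can be sketched as follows. The function names and the default threshold of 0.4 (a value inside the 0.3 to 0.5 range mentioned in the text) are illustrative assumptions, as is the choice to treat a zero-norm vector as having zero similarity.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: undefined similarity is treated as irrelevant
    return dot / (norm_a * norm_b)

def filter_knowledge(hidden, knowledge, threshold=0.4):
    """Keep the knowledge vector if it is relevant enough, else zero it out."""
    score = cosine_similarity(hidden, knowledge)
    if score >= threshold:
        return list(knowledge), score
    return [0.0] * len(knowledge), score

kept, s_kept = filter_knowledge([1.0, 0.0], [1.0, 0.1])
dropped, s_dropped = filter_knowledge([1.0, 0.0], [0.0, 1.0])
```

The first call retains the nearly parallel knowledge vector, while the second call replaces the orthogonal one with a zero vector.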
[0041] The hidden state representation of the current word is concatenated with the filtered knowledge-enhanced representation. A linear transformation layer and an activation function are then applied to output a gating value between zero and one. Gating computation is a crucial step in dynamically adjusting knowledge contribution, adaptively controlling the knowledge fusion ratio by learning contextual requirements. The computation process first concatenates the hidden state with the filtered knowledge representation to form a joint feature vector; then, a linear transformation and a sigmoid activation function are used to map the joint feature vector to a gating value between 0 and 1. The formula is: $g = \sigma(W_g [h ; \tilde{c}] + b_g)$; where $g$ is the gating value, with range $[0, 1]$, $h$ is the hidden state of the current word, $\tilde{c}$ is the filtered knowledge-enhanced representation, $W_g$ is the gating transformation matrix, $b_g$ is the bias vector, $\sigma$ is the sigmoid function, and $[\cdot \, ; \cdot]$ denotes vector concatenation.
[0042] The sigmoid function ensures that the output value is strictly controlled within the [0,1] interval, giving the gating value a clear probabilistic interpretation: values close to 1 indicate a high reliance on knowledge-enhanced representations, while values close to 0 indicate that the original language model representation should be primarily preserved. This gating mechanism can make different decisions in different situations, such as increasing knowledge contribution when explaining concepts and reducing knowledge intervention when generating fluent text, thus achieving contextual adaptability in knowledge application.
[0043] The filtered knowledge-enhanced representation is multiplied by the gating value, and the hidden state representation of the current word is multiplied by one minus the gating value. The two products are then added together to obtain the fused context representation. Knowledge fusion is a crucial step in generating the final representation, achieving a balanced integration of knowledge and the language model through weighted combination. The fusion process uses linear interpolation, weighting the knowledge representation and the original representation according to the gating value to form the final fused representation. The formula is: $h' = g \cdot \tilde{c} + (1 - g) \cdot h$; where $h'$ is the fused context representation, $g$ is the gating value, $\tilde{c}$ is the filtered knowledge-enhanced representation, and $h$ is the original hidden state.
[0044] This linear fusion mechanism maintains computational efficiency while achieving a smooth transition between knowledge and the original representation, avoiding the incoherence issues that may arise from direct substitution. The advantage of gated fusion lies in its preservation of the fluency and generality of the original language model while introducing the precision and specialization of structured knowledge. This allows the model to adaptively adjust the contribution ratio of the two information sources according to different contextual needs. For example, when explaining specialized concepts, the influence of the knowledge graph is increased, while when generating coherent text, the original language capabilities are relied upon more. The final fused representation will serve as the basis for joint fine-tuning, enabling the model to simultaneously master both linguistic expression and professional knowledge.
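A minimal sketch of the gating and fusion steps is given below. For brevity it assumes a single scalar gate per word (one weight vector `w_g` and a scalar bias `b_g`); the embodiment may equally use a vector-valued gate, and all names and numbers here are hypothetical.

```python
import math

def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(hidden, knowledge, w_g, b_g):
    """Fuse the hidden state with the filtered knowledge representation.

    w_g has one weight per element of the concatenated vector
    [hidden ; knowledge]; b_g is a scalar bias (scalar-gate assumption).
    """
    joint = list(hidden) + list(knowledge)
    gate = sigmoid(sum(w * x for w, x in zip(w_g, joint)) + b_g)
    # Linear interpolation between knowledge and the original hidden state.
    fused = [gate * k + (1.0 - gate) * h for h, k in zip(hidden, knowledge)]
    return gate, fused

# Zero weights and zero bias give a gate of exactly 0.5 (an even blend).
g_even, fused_even = gated_fusion([1.0, 3.0], [3.0, 1.0], [0.0] * 4, 0.0)
# A strongly negative bias pushes the gate toward 0 (keep the hidden state).
g_low, fused_low = gated_fusion([1.0, 3.0], [3.0, 1.0], [0.0] * 4, -50.0)
```

Note that when the filtered knowledge vector is all zeros, the fused output reduces to the hidden state scaled by one minus the gate, so the learned gate still determines how much of the original representation passes through.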
[0045] In this embodiment of the invention, the detailed implementation steps of the method for calculating the joint loss function include: The cross-entropy between the probability distribution of the next word predicted from the fused context representation and the actual word is calculated as the language model loss. The language model loss is a core objective in maintaining the model's basic language capabilities, and its expressive power is trained through a standard sequence prediction task. The calculation process follows traditional language model training methods. First, the fused context representation is mapped to a probability distribution the size of the vocabulary through a linear layer and a softmax function, representing the model's prediction of the next word. Then, the cross-entropy between this distribution and the one-hot encoding of the actual next word is calculated to quantify the accuracy of the prediction. The formula is: $\mathcal{L}_{LM} = -\sum_{t=1}^{T} \log P(w_t \mid w_{<t}, x)$; where $\mathcal{L}_{LM}$ is the language model loss, $P(w_t \mid w_{<t}, x)$ is the probability that the model predicts the next word $w_t$ given the preceding words $w_{<t}$ and the input $x$, and $T$ is the sequence length.
[0046] In practice, negative log-likelihood loss is typically used, where the negative logarithm of the prediction at each position is taken and then summed over the entire sequence. This loss encourages the model to generate content that conforms to linguistic rules and contextual coherence, preserving the core capabilities of the original large language model. Language model loss is a fundamental component of the joint loss function, ensuring that the model incorporates knowledge without sacrificing the fluency and generality of the language.
[0047] Entity pairs and their relation labels that match the educational knowledge graph are extracted from the training data. The cross-entropy between the model's predicted probability of the relationship between entity pairs and the true relation label is calculated as the knowledge alignment loss. Knowledge alignment loss is a key objective in guiding the model to learn structured knowledge, and the relation prediction task strengthens the model's grasp of facts in the knowledge graph. The calculation process first identifies text fragments from the training data that correspond to triples (head entity, relation, tail entity) in the knowledge graph; then, a relation prediction task is constructed, allowing the model to predict the tail entity based on the head entity and relation, or predict the relationship between them based on the head and tail entities; finally, the cross-entropy loss between the prediction result and the true label is calculated. The relation prediction formula is: $P(r \mid e_h, e_t) = \mathrm{softmax}\big(f(e_h, e_t)\big)_r$; where $P(r \mid e_h, e_t)$ is the predicted probability of relation $r$ given the head entity $e_h$ and the tail entity $e_t$, and $f$ is a relation prediction function, typically implemented as a neural network layer.
[0048] The formula for knowledge alignment loss is: $\mathcal{L}_{KA} = -\sum_{(e_h, r, e_t) \in \mathcal{S}} \log P(r \mid e_h, e_t)$; where $\mathcal{L}_{KA}$ is the knowledge alignment loss and $\mathcal{S}$ is the set of triples that appear in the training data.
[0049] This type of loss directly supervises the model's learning of entity relationships defined in the knowledge graph, enabling the model to internalize structured knowledge and reducing the likelihood of generating incorrect facts or logical relationships. Knowledge alignment loss is a crucial guarantee for the model's ability to acquire professional knowledge, enabling it to provide accurate conceptual explanations and relational descriptions in educational tasks.
[0050] For samples in the training data containing logical reasoning chains, the sequence matching loss between the model's output reasoning steps and the standard reasoning steps is calculated as the logical consistency loss. Logical consistency loss is a dedicated objective for training the model to perform canonical reasoning, improving the model's logical ability through supervised reasoning. The calculation process first selects samples from the training data containing explicit reasoning steps, such as mathematical proofs, scientific explanations, or logical arguments; then, the model generates a complete reasoning chain, including each reasoning step; finally, the sequence matching loss between the generated reasoning chain and the standard answer is calculated to evaluate the correctness and completeness of the reasoning process. The matching loss uses a variant of edit distance, considering the consistency of step order and content, and the formula is: $\mathcal{L}_{logic} = \frac{1}{M} \sum_{i=1}^{M} D(\hat{R}_i, R_i)$; where $\mathcal{L}_{logic}$ is the logical consistency loss, $D$ is a sequence distance function, $\hat{R}_i$ is the sequence of reasoning steps predicted by the model for the $i$-th sample, $R_i$ is the corresponding reference sequence of reasoning steps, and $M$ is the number of reasoning samples.
[0051] The sequence distance function $D$ can be a modified edit distance or a negative ROUGE score, ensuring that both the accuracy of the steps and the logical order between them are considered. This loss directly supervises the model to learn correct reasoning patterns, enabling it to provide educationally sound problem-solving processes and logical explanations, rather than simply providing the correct answer. Logical consistency loss is key to the model acquiring educational reasoning ability, allowing it to guide students step-by-step in understanding complex concepts, much like a teacher.
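One way to realize a step-level sequence distance is a Levenshtein distance computed over whole reasoning steps rather than characters. The sketch below is an assumption-laden illustration: it compares steps by exact string equality, normalizes by the longer chain, and averages over samples. Because edit distance is not differentiable, a value computed this way would in practice serve as a scoring or reward signal rather than a directly backpropagated loss.

```python
def step_edit_distance(pred_steps, ref_steps):
    """Levenshtein distance where each reasoning step is one symbol."""
    prev = list(range(len(ref_steps) + 1))
    for i, p in enumerate(pred_steps, 1):
        cur = [i]
        for j, r in enumerate(ref_steps, 1):
            cost = 0 if p == r else 1  # assumption: exact-match comparison
            cur.append(min(prev[j] + 1,          # delete a predicted step
                           cur[j - 1] + 1,       # insert a missing step
                           prev[j - 1] + cost))  # substitute / match
        prev = cur
    return prev[-1]

def logic_consistency_score(pred_chains, ref_chains):
    """Average normalized step distance over all reasoning samples."""
    total = 0.0
    for pred, ref in zip(pred_chains, ref_chains):
        denom = max(len(pred), len(ref), 1)
        total += step_edit_distance(pred, ref) / denom
    return total / len(pred_chains)

reference = ["expand brackets", "collect terms", "divide by 2"]
swapped = ["expand brackets", "divide by 2", "collect terms"]
truncated = ["expand brackets", "collect terms"]
```

With these toy chains, a swapped pair of steps costs two edits and a missing final step costs one, so both order and completeness are penalized.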
[0052] The language model loss, knowledge alignment loss, and logical consistency loss are weighted and summed to obtain the joint loss function value. The joint loss function is a comprehensive expression of multi-objective training, balancing the learning priorities of different abilities through weights. The function design uses a linear weighted combination, integrating the three sub-losses into a single optimization objective according to preset weights. The formula is: $\mathcal{L} = \lambda_1 \mathcal{L}_{LM} + \lambda_2 \mathcal{L}_{KA} + \lambda_3 \mathcal{L}_{logic}$; where $\mathcal{L}$ is the joint loss function value, and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weight coefficients for the three sub-losses, which satisfy a preset normalization constraint (for example, $\lambda_1 + \lambda_2 + \lambda_3 = 1$).
[0053] Weighting coefficients are typically set based on the importance and difficulty of the task; initial values of $\lambda_1$, $\lambda_2$, and $\lambda_3$ are chosen first and then adjusted according to performance on a validation set. In the early stages of training, the weight of the language model loss can be increased to ensure the stability of basic language abilities; as training progresses, the proportion of knowledge alignment and logical consistency losses is gradually increased to strengthen the learning of specialized abilities. This multi-task joint optimization strategy enables the model to develop different capabilities in a balanced way, maintaining the fluent expression of a general language model while acquiring the knowledge reserves and reasoning skills of educational experts, forming an all-around educational assistant. The joint loss function is the core guideline for model optimization, directly determining the final teacher model's capability structure and performance balance.
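The weighted combination and the idea of shifting weight from the language model loss toward the other two losses over training can be sketched as below. The start and end weight triples and the linear interpolation are purely illustrative assumptions; the text does not fix particular values or a particular schedule.

```python
import math

def language_model_loss(token_probs):
    """Negative log-likelihood summed over the sequence.

    token_probs[t] is the probability the model gave to the actual
    word at position t.
    """
    return -sum(math.log(p) for p in token_probs)

def loss_weights(progress, start=(0.8, 0.1, 0.1), end=(0.4, 0.3, 0.3)):
    """Linearly interpolate (lm, alignment, logic) weights.

    progress is the fraction of training completed, clamped to [0, 1].
    The start/end triples are hypothetical and each sums to 1.
    """
    p = min(max(progress, 0.0), 1.0)
    return tuple(s + (e - s) * p for s, e in zip(start, end))

def joint_loss(l_lm, l_ka, l_logic, weights):
    """Weighted sum of the three sub-losses."""
    w_lm, w_ka, w_logic = weights
    return w_lm * l_lm + w_ka * l_ka + w_logic * l_logic

early = loss_weights(0.0)
late = loss_weights(1.0)
example = joint_loss(2.0, 1.0, 1.0, (0.5, 0.3, 0.2))
```

Because both endpoint triples sum to 1, every interpolated triple does as well, so the relative scale of the joint loss stays comparable across training.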
[0054] In this embodiment of the invention, the detailed implementation steps of the curriculum learning strategy include: Based on the difficulty level of each knowledge point in the educational knowledge graph, the training samples are divided into multiple difficulty levels. Difficulty stratification is a fundamental step in curriculum learning, simulating the gradual learning process of humans by organizing training data according to knowledge structure. The stratification process first extracts the difficulty label of each knowledge point from the knowledge graph; these labels may come from the syllabus, expert assessments, or historical exam statistics. Then, the training samples are analyzed to identify the main knowledge points involved in each sample and their difficulty levels. Finally, based on the highest difficulty level or average difficulty of the knowledge points in the sample, it is assigned to the corresponding difficulty level. Typically, 3-5 levels are used, such as basic, beginner, intermediate, advanced, and expert levels, with each level containing a set of training samples that meet the corresponding difficulty standard. Difficulty assessment can comprehensively consider various factors, such as the abstractness of the knowledge points, the complexity of dependencies, and the diversity of application scenarios. This knowledge structure-based sample organization method provides the model with a clear learning path, enabling it to learn like a student, starting from simple concepts and gradually mastering complex knowledge to form a systematic knowledge system.
[0055] In the initial stages of fine-tuning training, low-difficulty training samples are used to update model parameters. As training epochs increase, higher-difficulty training samples are gradually introduced. Progressive training is the core mechanism of curriculum learning, optimizing the learning process through data arrangement with increasing difficulty. The training arrangement follows the principle of "from easy to difficult." First, several epochs are conducted using the most basic sample set to establish a basic domain cognitive framework for the model. Then, higher-difficulty samples are gradually introduced to expand the breadth and depth of the model's knowledge. Specifically, an epoch-based switching rule can be set, such as using only basic samples in the first 5 epochs, introducing intermediate samples in epochs 6-10, advanced samples in epochs 11-15, and so on. Alternatively, a smoother transition strategy can be adopted, gradually increasing the proportion of high-difficulty samples by changing the mixing ratio of samples of different difficulties. For example, a function $p_k(t)$ can be used to control the sampling probability of samples at each difficulty level at training step $t$, where $p_k(t)$ is the sampling probability of the $k$-th difficulty level at step $t$, $w_k$ is a difficulty weight (usually positively correlated with difficulty), $K$ is the total number of difficulty levels, and $T_{total}$ is the total number of training steps.
[0056] This progressive training strategy significantly improves the efficiency and stability of model learning, preventing the model from falling into suboptimal solutions due to overly complex samples in the early stages, and also avoiding the problem of stagnation in the later stages of training due to overly simple samples.
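As an illustration of a smooth easy-to-hard transition, the sketch below uses one possible sampling schedule. The specific functional form (a softmax over negative difficulty weights that flattens as training progresses) is an assumption chosen for the example, not the formula of the embodiment; the function name and weights are likewise hypothetical.

```python
import math

def sampling_probabilities(step, total_steps, difficulty_weights):
    """Sampling probability for each difficulty level at a training step.

    Assumed form: p_k(t) is proportional to exp(-w_k * (1 - t / total)),
    so low-difficulty levels dominate early and the distribution
    becomes uniform by the end of training.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    logits = [-w * (1.0 - progress) for w in difficulty_weights]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

weights = [1.0, 2.0, 3.0]  # hypothetical weights for easy, medium, hard
p_start = sampling_probabilities(0, 1000, weights)
p_mid = sampling_probabilities(500, 1000, weights)
p_end = sampling_probabilities(1000, 1000, weights)
```

The resulting list can be passed as the `weights` argument of `random.choices` to draw the difficulty level of each training batch.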
[0057] The decision to proceed to the next difficulty level is based on the model's accuracy on the validation set at the current difficulty level. Dynamic adjustment is an adaptive mechanism for curriculum learning, controlling the learning progress through performance feedback to ensure the model truly masters the current knowledge. The adjustment process establishes a performance checkpoint system. At each preset evaluation period, the model is tested on the validation set at the current difficulty level, calculating key metrics such as accuracy, F1 score, or a custom comprehensive score. These metrics are then compared to preset thresholds to determine whether the model is allowed to proceed to the next difficulty level. The decision rule can be expressed as: $d_k = \begin{cases} 1, & \mathrm{Acc}_k \ge \theta_k \\ 0, & \mathrm{Acc}_k < \theta_k \end{cases}$; where $d_k$ is the decision on whether to proceed to the next stage, $\mathrm{Acc}_k$ is the model's accuracy on the validation set of the $k$-th difficulty level, and $\theta_k$ is the corresponding threshold (usually set between 0.7 and 0.9).
[0058] If the model's performance does not meet the requirements, training continues at the current difficulty level, potentially employing more granular learning rate adjustments or sample resampling strategies, until the performance meets the requirements. Once the requirements are met, samples from the next difficulty level are allowed to be introduced. This ability-based progress control makes the training process more personalized and efficient, ensuring that the model builds a solid foundation at each difficulty level, avoiding knowledge gaps or capability discontinuities, and ultimately forming a teacher model with comprehensive knowledge and balanced capabilities.
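The checkpoint-based progression rule can be sketched as a small state update applied after each evaluation period. The threshold values and the sequence of validation accuracies below are made up for illustration, and the function names are hypothetical.

```python
def next_level(current_level, val_accuracy, thresholds):
    """Advance one difficulty level if the accuracy threshold is met.

    thresholds[k] is the accuracy required to leave level k. The last
    level has nowhere to advance to, so it is never left.
    """
    is_last = current_level == len(thresholds) - 1
    if not is_last and val_accuracy >= thresholds[current_level]:
        return current_level + 1
    return current_level

def run_curriculum(accuracies, thresholds):
    """Replay a sequence of evaluation results and record the level used."""
    level = 0
    history = []
    for acc in accuracies:
        history.append(level)
        level = next_level(level, acc, thresholds)
    return history, level

thresholds = [0.8, 0.8, 0.85]  # hypothetical per-level thresholds
history, final_level = run_curriculum([0.6, 0.82, 0.7, 0.9, 0.9], thresholds)
```

In this toy run the model stays at the first level after scoring 0.6, advances after 0.82, repeats the second level after 0.7, and then reaches the final level.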
[0059] In this embodiment of the invention, the detailed implementation steps of multi-level distillation training include: Training samples are input into the teacher model to obtain the raw scores of the teacher model's output layer for all candidate words. These raw scores are divided by a preset temperature parameter and then normalized to obtain the soft-label probability distribution. Soft-label generation is the starting point of knowledge distillation, extracting the teacher model's "hidden knowledge" through temperature scaling. The generation process first prepares diverse training samples, including domain text, question-answer pairs, and reasoning cases. These samples are then input into the teacher model, recording the prediction scores (usually logit values before softmax) of the output layer for each word in the vocabulary. Finally, temperature scaling and a softmax transformation are applied to obtain a smooth probability distribution. The formula is: $p_i^{T}(x; \tau) = \frac{\exp(z_i / \tau)}{\sum_{j=1}^{|V|} \exp(z_j / \tau)}$; where $p_i^{T}(x; \tau)$ is the soft-label probability that the teacher model predicts word $i$ for input $x$, $z_i$ is the corresponding raw score, $\tau$ is the temperature parameter (usually set between 1 and 5), and $|V|$ is the size of the vocabulary.
[0060] The temperature parameter $\tau$ controls the smoothness of the distribution. The larger the value of $\tau$, the smoother the distribution and the smaller the difference in probability among the various words; the closer $\tau$ is to 1, the closer the result is to the original distribution. Higher temperatures help capture the relative relationships and subtle differences between words, revealing the teacher model's "hesitation" and "preference" for suboptimal choices. This information is completely ignored in hard labels but is crucial for the student model's learning. For example, for a math problem, the teacher model might provide soft labels {correct answer: 0.8, approximate answer 1: 0.1, approximate answer 2: 0.05, others: 0.05}, which contains richer knowledge information than a single correct answer {correct answer: 1.0, others: 0.0}.
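The effect of the temperature on the soft labels can be seen in a short sketch. The logit values are made up for the example and the function name is hypothetical.

```python
import math

def soft_labels(logits, temperature=2.0):
    """Temperature-scaled softmax over raw output scores."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [4.0, 2.0, 1.0, 0.0]  # hypothetical teacher scores
sharp = soft_labels(teacher_logits, temperature=1.0)
smooth = soft_labels(teacher_logits, temperature=4.0)
```

Raising the temperature lowers the probability of the top-ranked word and raises the probabilities of the alternatives, while the ranking of the words is unchanged.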
[0061] The cross-entropy between the student model's output probability distribution and the ground truth labels is calculated as the hard label loss; the KL divergence between the student model's output probability distribution and the soft label probability distribution is calculated as the soft label loss. The label losses serve as the dual objective of distillation training, improving student model performance through dual supervision of ground truth and soft labels. The calculation process consists of two parts: the hard label loss uses standard cross-entropy to measure the degree of matching between the student model and the ground truth labels; the soft label loss uses KL divergence to measure the difference between the student model's output distribution and the teacher model's output distribution. The formula for hard label loss is: $\mathcal{L}_{hard} = -\sum_{i=1}^{|V|} y_i \log p_i^{S}$; where $\mathcal{L}_{hard}$ is the hard label loss, $y$ is the one-hot encoding of the true label, and $p^{S}$ is the output probability distribution of the student model.
[0062] The formula for soft label loss is: $\mathcal{L}_{soft} = \tau^2 \cdot \mathrm{KL}(p_\tau^{T} \parallel p_\tau^{S})$; where $\mathcal{L}_{soft}$ is the soft label loss, $\mathrm{KL}$ is the KL divergence, $p_\tau^{T}$ and $p_\tau^{S}$ are the output distributions of the teacher and student models, respectively, at the same temperature $\tau$, and the $\tau^2$ term is used to balance the gradient magnitude.
[0063] The KL divergence formula is: $\mathrm{KL}(P \parallel Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$. This measures the difference between two probability distributions; the smaller the value, the more similar the distributions are.
[0064] The dual-loss design ensures that the student model maintains basic prediction accuracy while learning the decision-making patterns and knowledge structure of the teacher model, thus fully inheriting the teacher model's capabilities. This combined hard-label and soft-label supervision approach is a distinctive advantage of knowledge distillation compared to ordinary training, preserving as much of the original model's performance and characteristics as possible even with a significantly reduced model size.
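The two label losses can be sketched together as follows. The sketch assumes the gradient-balancing factor on the soft label loss is the squared temperature, as in the standard distillation formulation; the function names and logits are illustrative.

```python
import math

def softmax_t(logits, temperature=1.0):
    """Softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def hard_label_loss(student_logits, true_index):
    """Cross-entropy against a one-hot label: -log p_student[true]."""
    return -math.log(softmax_t(student_logits)[true_index])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def soft_label_loss(teacher_logits, student_logits, temperature=2.0):
    """Temperature-squared KL between teacher and student distributions."""
    p_teacher = softmax_t(teacher_logits, temperature)
    p_student = softmax_t(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)

teacher = [4.0, 2.0, 1.0]
matched = soft_label_loss(teacher, [4.0, 2.0, 1.0])
mismatched = soft_label_loss(teacher, [1.0, 2.0, 4.0])
uniform_hard = hard_label_loss([0.0, 0.0, 0.0], 1)
```

A student whose logits match the teacher's incurs zero soft label loss, while a student with uniform logits over three words incurs a hard label loss of log 3.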
[0065] Multiple corresponding intermediate layers are selected for the teacher and student models. The mean squared error between the feature representations of these intermediate layers is calculated as the intermediate layer distillation loss. Intermediate layer distillation is an important strategy for deep knowledge transfer, promoting the student model's learning of the teacher model's internal representations through feature matching. The selection process first determines the hierarchical mapping relationship between the teacher and student models, usually based on relative position or functional similarity. For example, each Transformer layer of the student model can be mapped to a corresponding layer of the teacher model or a proportionally selected layer. Then, the output features of these layers are extracted from both models. Finally, the mean squared error between corresponding features is calculated as the matching target for the intermediate layers. Since the feature dimensions of the teacher and student models may differ, linear projection is usually required to adjust the dimensions. The intermediate layer loss formula is: $\mathcal{L}_{mid}^{(l)} = \lVert W_l h_l^{S} - h_l^{T} \rVert_2^2$; where $\mathcal{L}_{mid}^{(l)}$ is the loss of the $l$-th intermediate layer, $h_l^{S}$ and $h_l^{T}$ are the feature representations of the student and teacher models at layer $l$, respectively, $W_l$ is the dimension adjustment matrix, and $\lVert \cdot \rVert_2^2$ is the squared L2 norm.
[0066] The total intermediate layer loss is the weighted sum over all selected layers: $\mathcal{L}_{mid} = \sum_{l \in \mathcal{I}} \omega_l \, \mathcal{L}_{mid}^{(l)}$; where $\mathcal{I}$ is the selected set of layers and $\omega_l$ are the weighting coefficients for each layer.
[0067] Intermediate layer distillation directly supervises the student model's learning of the teacher model's internal feature representations and processing patterns, rather than just the final output. This makes knowledge transfer more comprehensive and in-depth, helping the student model better understand and imitate the teacher model's reasoning process. For example, when solving mathematical problems, it's not enough to just give the correct answer; one must also grasp the problem-solving approach and steps. Intermediate layer distillation facilitates the transfer of this deeper level of ability.
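A minimal sketch of proportional layer mapping and feature matching with a dimension-adjusting projection is shown below. The mapping rule, the dictionary-based data layout, and all numbers are assumptions for illustration. The per-layer term here averages the squared error over feature dimensions, which differs from a squared norm only by a constant factor.

```python
def map_layers(num_student_layers, num_teacher_layers):
    """Map student layer i (1-indexed) to a proportionally placed teacher layer."""
    return {i: (i * num_teacher_layers) // num_student_layers
            for i in range(1, num_student_layers + 1)}

def layer_mse(student_feat, teacher_feat, projection):
    """Mean squared error after projecting student features to teacher size."""
    projected = [sum(w * x for w, x in zip(row, student_feat))
                 for row in projection]
    return sum((p - t) ** 2
               for p, t in zip(projected, teacher_feat)) / len(teacher_feat)

def intermediate_loss(student_feats, teacher_feats, projections,
                      layer_map, layer_weights):
    """Weighted sum of per-layer feature-matching losses."""
    return sum(layer_weights[s] * layer_mse(student_feats[s],
                                            teacher_feats[t],
                                            projections[s])
               for s, t in layer_map.items())

# Toy setup: 2-dim student features projected into a 3-dim teacher space.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mapping = map_layers(4, 12)
loss = intermediate_loss({1: [1.0, 2.0]}, {3: [1.0, 2.0, 5.0]},
                         {1: W}, {1: 3}, {1: 0.5})
```

With a 4-layer student and a 12-layer teacher, the proportional rule pairs student layers 1 through 4 with teacher layers 3, 6, 9, and 12.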
[0068] The distillation loss is obtained by weighted summation of the hard-label loss, soft-label loss, and intermediate-layer distillation loss. The parameters of the student model are updated based on this distillation loss until convergence, resulting in a lightweight education-specific model. Distillation training is an optimization process that integrates multiple supervisory signals, guiding the student model to learn efficiently through a comprehensive loss. The training process first designs a comprehensive distillation loss, combining the aforementioned three losses into a single optimization objective according to weights; then, standard gradient descent or its variants (such as Adam) are used to iteratively update the student model parameters; finally, the training progress is monitored through validation set performance until the model converges or reaches the preset number of training epochs. The formula for the comprehensive distillation loss is: $\mathcal{L}_{distill} = \alpha \mathcal{L}_{hard} + \beta \mathcal{L}_{soft} + \gamma \mathcal{L}_{mid}$; where $\mathcal{L}_{distill}$ is the total distillation loss, and $\alpha$, $\beta$, and $\gamma$ are the weighting coefficients, which satisfy a preset normalization constraint (for example, $\alpha + \beta + \gamma = 1$).
[0069] The weighting coefficients $\alpha$, $\beta$, and $\gamma$ are usually set based on experience and can be adjusted through small-scale experiments. During training, learning rate scheduling strategies, such as cosine annealing or linear decay, can be employed to make training more stable and efficient. The goal of distillation training is to make the student model approach the performance of the teacher model as closely as possible while reducing the number of parameters (typically by 80-95%). Through multi-level, multi-objective knowledge transfer, the student model can inherit most of the capabilities of the teacher model while significantly reducing computational overhead, ultimately forming a lightweight and efficient education-specific model that provides a practical foundation for personalized learning services.
[0070] In this embodiment of the invention, the detailed implementation steps of the method for obtaining user learning profile information include: Acquiring users' historical learning interaction records, including quiz records, learning duration records, and review intervals, is a fundamental step in building user profiles. This multi-dimensional data collection provides a comprehensive understanding of user learning behavior. The acquisition process begins by designing comprehensive data collection points covering all stages of the user's learning process. Then, user interaction information is recorded and stored in real time to ensure data integrity and timeliness. Finally, the data is cleaned and standardized for subsequent analysis. Quiz records include question IDs, knowledge point tags, user answers, correctness, answer time, and thinking time, reflecting the user's knowledge mastery. Learning duration records include learning session IDs, start and end times, learning content identifiers, and interaction frequency, reflecting the user's learning engagement. Review intervals record the time distribution of users repeatedly learning the same knowledge points, such as the interval between the first learning and the first review, or the interval between the first and second reviews, reflecting the user's learning strategies and memory patterns. These multi-dimensional data collectively constitute a digital profile of user learning behavior, providing a comprehensive behavioral foundation for subsequent knowledge status analysis and enabling a deep understanding of users' learning characteristics and needs at the behavioral level.
[0071] Based on the user's answer records, the accuracy rate on each knowledge point is computed, and knowledge points with an accuracy rate below a preset threshold are marked as weak knowledge points. Knowledge point assessment is a crucial step in identifying learning weaknesses, using accuracy rate analysis to pinpoint areas that need improvement. The assessment process first groups the user's answer records by knowledge point and calculates the historical accuracy rate for each knowledge point. Then, these accuracy rates are compared with the preset threshold to identify the user's weaknesses. Finally, the weak knowledge points are prioritized to provide direction for subsequent personalized learning guidance. The accuracy rate is calculated as: $Acc_{u,k} = \frac{C_{u,k}}{N_{u,k}}$; where $Acc_{u,k}$ is the accuracy rate of user $u$ on knowledge point $k$, $C_{u,k}$ is the number of questions answered correctly, and $N_{u,k}$ is the total number of attempts.
[0072] The criterion for identifying weak knowledge points is: $W_{u,k} = \begin{cases} 1, & Acc_{u,k} < \theta \\ 0, & \text{otherwise} \end{cases}$; where $W_{u,k}$ indicates whether knowledge point $k$ is a weak point for user $u$, $Acc_{u,k}$ is the accuracy rate of user $u$ on knowledge point $k$, and $\theta$ is the preset accuracy threshold, usually set between 0.6 and 0.7.
[0073] To account for the impact of sample size, confidence interval adjustment can be introduced to reduce the certainty of judgment for knowledge points with fewer attempts. This performance-based knowledge point assessment method can accurately identify users' learning difficulties and provide targeted learning support, avoiding a one-size-fits-all learning path and forming a fundamental step in personalized learning.
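The accuracy-based identification of weak knowledge points, together with one possible confidence interval adjustment, can be sketched as follows. The record format (pairs of knowledge point and correctness), the function names, and the choice of the Wilson score interval are assumptions for illustration; the embodiment does not specify which interval is used. In this sketch a knowledge point is marked weak under the adjustment only when the upper bound of its interval is still below the threshold, so points with very few attempts are not flagged prematurely.

```python
import math
from collections import defaultdict


def accuracy_counts(records):
    """Group (knowledge_point, is_correct) records into (correct, attempts)."""
    stats = defaultdict(lambda: [0, 0])
    for point, is_correct in records:
        stats[point][0] += int(is_correct)
        stats[point][1] += 1
    return {point: (c, n) for point, (c, n) in stats.items()}


def wilson_upper(correct, attempts, z=1.96):
    """Upper bound of the Wilson score interval for a proportion."""
    p = correct / attempts
    denom = 1.0 + z * z / attempts
    center = (p + z * z / (2.0 * attempts)) / denom
    half = z * math.sqrt(p * (1.0 - p) / attempts
                         + z * z / (4.0 * attempts * attempts)) / denom
    return center + half


def weak_points(records, threshold=0.6, use_confidence=False):
    """Return knowledge points whose (adjusted) accuracy is below threshold."""
    weak = []
    for point, (correct, attempts) in accuracy_counts(records).items():
        score = (wilson_upper(correct, attempts) if use_confidence
                 else correct / attempts)
        if score < threshold:
            weak.append(point)
    return sorted(weak)
```

With `use_confidence=False` the function applies the plain threshold rule; with `use_confidence=True` a single wrong answer on a knowledge point is not enough to mark it as weak.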
[0074] Based on the weak knowledge points, the system queries the educational knowledge graph for their predecessor and successor knowledge points, constructing a user-specific knowledge state subgraph. Knowledge association analysis is a deep-seated step in understanding the root causes of learning difficulties, revealing the dependency structure between knowledge points through graph queries. The analysis process first locates the identified weak knowledge points in the educational knowledge graph; then, it queries the predecessor nodes (prerequisite knowledge) and successor nodes (application extensions) of these knowledge points; finally, based on these relationships, a personalized subgraph centered on the weak knowledge points is constructed. The subgraph construction algorithm uses breadth-first search, starting from the weak knowledge point and expanding forward and backward by 2-3 hops, while pruning edges based on their importance to ensure the compactness and relevance of the subgraph. Prerequisite knowledge points represent the foundation required to understand the current knowledge; insufficient mastery of these prerequisites is often the root cause of difficulty with the current knowledge point. Successor knowledge points reflect the application scenarios and development directions of the current knowledge, playing an important role in stimulating learning motivation and building knowledge connections. This personal knowledge state subgraph includes both the user's knowledge gaps and the key knowledge framework needed to address those gaps, providing a structured knowledge map for generating personalized learning paths and enabling the development of targeted learning strategies from the knowledge structure level.
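The breadth-first expansion around weak knowledge points can be sketched as follows. The edge representation (pairs of prerequisite and successor), the function name, and the default of two hops are assumptions for illustration, and the importance-based edge pruning described above is omitted because its scoring rule is not specified.

```python
from collections import defaultdict, deque


def knowledge_state_subgraph(edges, weak_points, max_hops=2):
    """Collect nodes within max_hops of the weak points, following
    prerequisite edges backward and successor edges forward."""
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    for pre, succ in edges:
        successors[pre].add(succ)
        predecessors[succ].add(pre)

    def bfs(adjacency):
        seen = set(weak_points)
        queue = deque((point, 0) for point in weak_points)
        while queue:
            node, depth = queue.popleft()
            if depth == max_hops:
                continue  # do not expand beyond the hop limit
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
        return seen

    nodes = bfs(predecessors) | bfs(successors)
    # Keep only the edges whose endpoints both fall inside the subgraph.
    kept_edges = [(a, b) for a, b in edges if a in nodes and b in nodes]
    return nodes, kept_edges
```

Running the backward and forward searches separately keeps the two directions distinct: the backward set contains candidate root causes, while the forward set contains the applications that depend on the weak point.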
[0075] By fitting a knowledge forgetting curve to the review intervals and the changes in historical accuracy, the forgetting coefficient for each knowledge point is calculated. Forgetting pattern analysis is a key step in optimizing learning efficiency, quantifying the user's memory characteristics through a mathematical model. The analysis process first collects the user's test performance on the same knowledge point at different time points, constructing data pairs of {time interval, accuracy}; then, the Ebbinghaus forgetting curve model is fitted to these data to obtain personalized forgetting parameters; finally, based on these parameters, the forgetting coefficient for each knowledge point is calculated to predict future memory status. The basic form of the Ebbinghaus forgetting curve model is: $R(t) = e^{-t/S}$; where $R(t)$ is the memory retention rate after time $t$ has elapsed, and $S$ is the memory strength parameter.
[0076] The improved model for educational scenarios can be represented as: $R_{u,k}(t) = b + (1 - b)\, e^{-t/S_{u,k}}$; where $R_{u,k}(t)$ is the memory retention rate of user $u$ for knowledge point $k$ after time $t$, $b$ is the base memory level (usually set at around 0.2), and $S_{u,k}$ is the memory strength parameter of user $u$ for this knowledge point.
[0077] The forgetting coefficient can be defined as the reciprocal of memory strength: $\lambda_{u,k} = \frac{1}{S_{u,k}}$. The larger the forgetting coefficient $\lambda_{u,k}$, the faster user $u$ forgets knowledge point $k$, and the more frequently it needs to be reviewed.
[0078] By analyzing a large amount of user data, a model is established to show the relationship between the difficulty of knowledge points and the average forgetting coefficient, providing initial predictions for new knowledge points. This memory model, based on cognitive science, can scientifically predict the retention status of knowledge, optimize review timing and frequency, and maximize learning efficiency. It is a core technology for time-efficient personalized learning.
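Fitting the forgetting coefficient from {time interval, accuracy} pairs can be sketched as follows. The sketch assumes a retention curve of the form R(t) = b + (1 - b)·exp(-λt) with a fixed base level b, and uses a least-squares fit through the origin on the log-transformed data; the function names, the fixed base level, and the 0.6 review target are assumptions for illustration.

```python
import math


def fit_forgetting_coefficient(observations, base=0.2, eps=1e-6):
    """Fit lambda in R(t) = base + (1 - base) * exp(-lambda * t).

    observations: iterable of (time_interval, accuracy) pairs.
    Taking logs gives y = ln((R - base) / (1 - base)) = -lambda * t,
    so lambda is the negated least-squares slope through the origin.
    """
    num = 0.0
    den = 0.0
    for t, acc in observations:
        r = min(max(acc, base + eps), 1.0)  # keep the log argument in (0, 1]
        y = math.log((r - base) / (1.0 - base))
        num += t * y
        den += t * t
    if den == 0.0:
        raise ValueError("need at least one observation with t > 0")
    return -num / den


def retention(t, lam, base=0.2):
    """Predicted retention after time t for forgetting coefficient lam."""
    return base + (1.0 - base) * math.exp(-lam * t)


def next_review_time(lam, target=0.6, base=0.2):
    """Time at which predicted retention falls to the target level."""
    if lam <= 0.0 or not base < target < 1.0:
        raise ValueError("requires lam > 0 and base < target < 1")
    return -math.log((target - base) / (1.0 - base)) / lam
```

A larger fitted coefficient yields an earlier `next_review_time`, which matches the rule that quickly forgotten knowledge points should be reviewed more often.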
[0079] The user's personal knowledge state subgraph and the forgetting coefficient of each knowledge point are encoded into a user learning profile vector, which serves as the user learning profile information. Profile vectorization is a comprehensive step in understanding the user, achieving a digital representation of the user's state through multi-dimensional feature encoding. The encoding process first designs a comprehensive feature architecture, including dimensions of knowledge mastery, learning progress, cognitive habits, and time efficiency; then, user data is mapped onto these dimensions to generate a structured feature representation; finally, dimensionality reduction and normalization are used to form a compact learning profile vector. The knowledge state subgraph is encoded with a graph neural network, which compresses the graph structure information into a fixed-length vector representation; the forgetting coefficients are encoded through statistical aggregation and distribution parameterization. The complete user learning profile vector can be represented as: $U = [\, h_G \,;\, h_F \,;\, h_P \,;\, h_H \,]$; where $U$ is the user learning profile vector, $h_G$ is the embedded representation of the knowledge state subgraph, $h_F$ is the representation of the forgetting coefficient distribution, $h_P$ is the learning progress vector, and $h_H$ is the learning habit feature vector.
[0080] Vector dimensions are typically 128-256, sufficient to capture the complexity and diversity of user learning characteristics. To improve the interpretability of vectors, self-supervised learning methods can be employed, enabling the encoding process to both retain original information and predict future user learning performance. This multi-dimensional, structured user representation allows for a comprehensive understanding of the user's learning state and characteristics, providing an accurate user model for subsequent personalized recommendations and path planning, thus forming the foundation for achieving deep personalization.
[0081] In this embodiment of the invention, the detailed implementation steps for generating personalized learning content and dynamic learning paths include: The user's current learning request is concatenated with the user's learning profile vector, serving as the input to the lightweight education vertical model. Request processing is the starting point for personalized services, forming a comprehensive input by combining immediate needs and historical features. The processing first parses the user's current request, extracting its core intent and content elements; then, this immediate information is structurally concatenated with the user's learning profile vector; finally, feature encoding converts it into an input format that the model can process. The concatenation combines feature blocks so that the boundaries between different types of information remain clear, which facilitates model processing. The complete input representation can be expressed as: $X = [\, Q \,;\, U \,;\, C \,]$; where $X$ is the complete model input, $Q$ is the vector representation of the current learning request, $U$ is the user learning profile vector, and $C$ is the contextual information about the current learning environment.
[0082] To improve the model's understanding accuracy, the encoding of the current request is typically processed using a pre-trained language model to capture the semantic structure and intent information of the request. This input approach, which integrates immediate needs and historical features, enables the model to understand and respond to the current request in a targeted manner with a full understanding of the user's background. It balances the needs of personalization and immediacy, providing a comprehensive information foundation for generating high-quality personalized content.
[0083] The lightweight education vertical model generates answers tailored to users' weak knowledge points based on input. Content generation is the core step in personalized service, using a professional model to output learning materials adapted to user characteristics. The generation process employs a lightweight model trained with knowledge distillation, reasoning based on comprehensive input information to form professional answers specific to user needs. The model generation strategy uses controlled autoregressive decoding, adjusting the creativity and determinism of the output through temperature parameters, while introducing knowledge graph-based constraints to ensure the accuracy and educational relevance of the content. The generation of answer content pays special attention to users' weak knowledge points, guiding the model to focus on these key areas through additional attention weights. To improve the teaching effectiveness of the answers, a multi-layered expression strategy is employed, comprehensively using concept explanations, analogies, examples, and application extensions to make the content both in-depth and broad, professional yet easy to understand. For different types of questions, the model also uses different answer templates and structures; for example, problem-solving questions emphasize step breakdown and method explanation, concept questions emphasize definition and relationship clarification, and application questions emphasize scenario recreation and practical value. This intelligent and differentiated content generation mechanism can provide professional and personalized learning support, effectively meeting users' specific learning needs, and is the core embodiment of the educational value of the model.
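The interaction between the temperature parameter and a decoding constraint can be sketched as follows. Representing the knowledge graph-based constraint as a per-token allow mask is an assumption made for this example, as are the function name and the list-based inputs; the embodiment does not specify how the constraint is applied.

```python
import math


def constrained_temperature_probs(logits, allowed, temperature=1.0):
    """Temperature-scaled softmax over the tokens permitted by a mask.

    logits: raw scores for each candidate token.
    allowed: booleans of the same length; False removes a token.
    """
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature if ok else None
              for score, ok in zip(logits, allowed)]
    kept = [s for s in scaled if s is not None]
    if not kept:
        raise ValueError("the mask removes every candidate token")
    peak = max(kept)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) if s is not None else 0.0 for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Lower temperatures concentrate probability on the highest-scoring permitted token and make the output more deterministic, while higher temperatures flatten the distribution; masked tokens receive zero probability at any temperature.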
[0084] Based on the predecessor dependencies and forgetting coefficients of weak knowledge points in the user's personal knowledge state subgraph, the recommended learning order of knowledge points is determined, and this order is converted into a dynamic learning path. Path planning is a key step in optimizing learning effectiveness, and a well-founded knowledge acquisition sequence is designed using graph algorithms. The planning process first constructs a learning target network centered on weak knowledge points based on the user's knowledge state subgraph; then, combining the predecessor dependencies and forgetting coefficients between knowledge points, an improved topological sorting algorithm is applied to generate the learning order; finally, the abstract knowledge point sequence is transformed into a concrete and executable learning path, including learning resources and activity suggestions for each step. The optimization objective is to simultaneously satisfy knowledge dependency constraints and minimize the risk of forgetting, which is expressed as: $\min_{\pi} \sum_{k \in K} w_k \, \lambda_k \, T_{\pi}(k)$; where $\pi$ is the order in which the knowledge points are learned, $K$ is the set of knowledge points to be learned, $w_k$ is the importance weight of knowledge point $k$, $\lambda_k$ is its forgetting coefficient, and $T_{\pi}(k)$ is the time required to learn knowledge point $k$ in sequence $\pi$.
[0085] The constraints are: $pos_{\pi}(i) < pos_{\pi}(j), \ \forall (i, j) \in E$; where $E$ is the set of knowledge dependency edges, $(i, j)$ indicates that knowledge point $i$ is a predecessor of knowledge point $j$, and $pos_{\pi}(k)$ is the position of knowledge point $k$ in the sequence $\pi$.
[0086] This comprehensive learning path planning, which takes into account both knowledge structure and cognitive patterns, ensures that the learning process follows the inherent logic of knowledge while adapting to the user's memory characteristics, maximizing learning efficiency and effectiveness. The dynamic learning path not only includes what to learn and in what order, but also suggestions on when and how to learn. It is a holistic learning plan that provides users with clear guidance and support, making self-directed learning more efficient and focused.
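One way to combine predecessor dependencies with forgetting coefficients is a priority-based variant of Kahn's topological sort, sketched below. This is a greedy heuristic chosen for illustration, not necessarily the improved algorithm of the embodiment, and it does not guarantee a globally optimal order; the function name and the priority rule (importance weight multiplied by forgetting coefficient, highest first) are assumptions.

```python
import heapq
from collections import defaultdict


def plan_learning_order(points, dependency_edges, weight, forgetting):
    """Order knowledge points so every prerequisite comes first; among
    points that are ready, prefer a higher weight * forgetting product."""
    indegree = {p: 0 for p in points}
    successors = defaultdict(list)
    for pre, succ in dependency_edges:
        successors[pre].append(succ)
        indegree[succ] += 1

    def key(p):
        # Negate the priority because heapq is a min-heap; the name
        # breaks ties deterministically.
        return (-weight[p] * forgetting[p], p)

    ready = [key(p) for p in points if indegree[p] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, point = heapq.heappop(ready)
        order.append(point)
        for succ in successors[point]:
            indegree[succ] -= 1
            if indegree[succ] == 0:
                heapq.heappush(ready, key(succ))
    if len(order) != len(points):
        raise ValueError("dependency edges contain a cycle")
    return order
```

Because a point enters the ready heap only after all of its predecessors have been placed, the dependency constraint holds for any priority rule; the priority only decides the order among points that are simultaneously available.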
[0087] The answer content and dynamic learning path are returned to the user, and the user's learning profile information is updated based on user feedback. Feedback is the final step in closed-loop optimization, continuously improving the architecture's personalization capabilities through user interaction. The feedback process first presents the generated answer content and learning path in a user-friendly manner, ensuring the information is clear and easy to understand; then, it collects user interaction data, including content clicks, dwell time, completion status, and explicit evaluations; finally, based on this feedback information, the user's learning profile is updated, achieving continuous model optimization. Feedback collection employs a multi-channel strategy, combining explicit feedback (such as ratings and likes) and implicit feedback (such as completion rates and repeat views) to comprehensively capture the user's true reactions to the content. Profile updates use an incremental learning approach, fusing new data with historical data using time decay weights to ensure the profile reflects the latest state while retaining key historical features. The update formula is: $U_{new} = \alpha \, U_{old} + (1 - \alpha) \, \Delta U$; where $U_{old}$ and $U_{new}$ are the user profile vectors before and after the update, respectively, $\Delta U$ is the profile change calculated from the new feedback, and $\alpha$ is the smoothing factor (usually set between 0.7 and 0.9).
[0088] This feedback-based closed-loop optimization mechanism can learn and improve from every user interaction, continuously adjust and improve personalized strategies, forming a virtuous cycle of understanding users better and providing more precise services. It is a key guarantee for continuously improving user experience and learning effectiveness, and also reflects adaptive learning capabilities and long-term value.
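The smoothed profile update can be sketched as follows, assuming the exponentially weighted form in which the previous profile receives weight α and the feedback-derived vector receives weight 1 - α. The function name, the plain-list vector representation, and the default α of 0.8 are assumptions for illustration.

```python
def update_profile(old_profile, feedback_vector, alpha=0.8):
    """Blend the previous profile with a feedback-derived vector.

    alpha is the smoothing factor: values near 1 favor history,
    values near 0 favor the newest feedback.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(old_profile) != len(feedback_vector):
        raise ValueError("vectors must have the same length")
    return [alpha * old + (1.0 - alpha) * new
            for old, new in zip(old_profile, feedback_vector)]
```

Applying the update repeatedly gives older feedback geometrically decreasing influence, which corresponds to the time decay weighting described above.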
[0089] This invention achieves the end-to-end construction and application of an educational large model through knowledge graph construction, embedding encoding, knowledge fusion, joint fine-tuning, knowledge distillation, and personalized learning services. The joint fine-tuning mechanism of the knowledge graph and the large language model in this invention can deeply internalize structured knowledge into the model parameters, effectively reducing hallucinations; at the same time, multi-level knowledge distillation makes the model lightweight, significantly reducing deployment costs and providing reliable, efficient, and intelligent technical support for personalized education.
[0090] The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing embodiments or make equivalent substitutions for some of the technical features. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
[0091] It should be noted that all formulas in this specification are calculated on dimensionless numerical values. The formulas are obtained through software simulation on a large amount of collected data so as to approximate real-world conditions as closely as possible. The preset parameters and thresholds in the formulas are set by those skilled in the art according to the actual situation.
[0092] Although embodiments of the invention have been shown and described, those skilled in the art will understand that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
Claims
1. An educational large model architecture for personalized learning, characterized in that it comprises: a graph construction module, used to acquire multi-source educational data and preprocess it, conduct quality assessment and screening of the preprocessed educational data, extract the semantic relationships between educational entities from the screened educational data, and perform entity disambiguation and knowledge consistency verification to construct an educational knowledge graph; an embedding encoding module, used to perform message passing encoding on the educational knowledge graph through a relation type-aware graph neural network to obtain the knowledge embedding vector of each entity node; a knowledge fusion module, used to perform cross-attention fusion between the knowledge embedding vectors and the hidden state representations of tokens in a pre-trained large language model to obtain a knowledge-enhanced representation, to filter the knowledge-enhanced representation based on contextual semantic relevance, and to dynamically adjust the knowledge contribution through gating weights to obtain a fused context representation; a joint fine-tuning module, used to fine-tune the pre-trained large language model based on the fused context representation, using a joint loss function that includes a language model loss, a knowledge alignment loss, and a logical consistency loss, combined with a curriculum learning strategy, to obtain a teacher model; a knowledge distillation module, used to perform multi-level distillation training on a student model with fewer parameters than the teacher model, based on the soft label probability distribution and intermediate layer feature representations output by the teacher model on the training samples and on the true labels, to obtain a lightweight education vertical model; and a personalized learning service module, used to generate personalized learning content and dynamic learning paths based on the lightweight education vertical model, combined with user learning profile information and knowledge forgetting curves.
2. The educational large model architecture for personalized learning according to claim 1, characterized in that the construction of the educational knowledge graph includes: using a pre-trained language model to vectorize the text in the screened educational data to obtain the contextual semantic vector of each token; inputting the contextual semantic vectors into a bidirectional long short-term memory network, and globally decoding the output feature vector sequence through a conditional random field layer to identify educational entities and their category labels in the text; disambiguating the identified educational entities to merge different expressions of the same concept into a unified entity; constructing entity pairs from the disambiguated educational entities, and using an attention-based relation classification model to determine the semantic relationship type between each entity pair; performing knowledge consistency verification on the extracted triples, and detecting and eliminating contradictory relationship descriptions; and constructing the educational knowledge graph by using the disambiguated educational entities as nodes and the verified semantic relationships as edges, and labeling each node with a knowledge difficulty level.
3. The educational large model architecture for personalized learning according to claim 1, characterized in that obtaining the knowledge embedding vector of each entity node through the relation type-aware graph neural network includes: initializing an embedding vector for each entity node in the educational knowledge graph, and initializing a relation embedding vector for each relation type; for each entity node, obtaining all of its neighbor nodes and the corresponding connection relation types, and transforming the embedding vectors of the neighbor nodes according to the relation embedding vectors of the connection relation types to obtain relation-aware neighbor representations; calculating the attention weight between the entity node and each relation-aware neighbor representation, and performing weighted aggregation over all relation-aware neighbor representations according to the attention weights to obtain a neighbor aggregation vector; concatenating the neighbor aggregation vector with the entity node's own embedding vector and applying a nonlinear transformation to obtain the updated node embedding vector; and repeating the transformation, aggregation, and nonlinear transformation steps for a preset number of layers to obtain the knowledge embedding vector of each entity node.
4. The educational large model architecture for personalized learning according to claim 1, characterized in that obtaining the knowledge-enhanced representation includes: using the hidden state representation of the currently processed token in the pre-trained large language model as the query vector; using the knowledge embedding vectors of all entity nodes in the educational knowledge graph as the key vectors and the value vectors, respectively; calculating the dot product between the query vector and each key vector, dividing the dot product result by the square root of the vector dimension, and then normalizing it to obtain the attention score for each entity node; and performing a weighted sum of the value vectors of all entity nodes based on the attention scores to obtain the knowledge-enhanced representation corresponding to the current token.
5. The educational large model architecture for personalized learning according to claim 1, characterized in that filtering the knowledge-enhanced representation based on contextual semantic relevance and dynamically adjusting the knowledge contribution includes: calculating the cosine similarity between the hidden state representation of the current token and the knowledge-enhanced representation, and using it as a semantic relevance score; when the semantic relevance score is lower than a preset relevance threshold, setting the knowledge-enhanced representation to zero, and when the semantic relevance score is higher than or equal to the preset relevance threshold, retaining the knowledge-enhanced representation; concatenating the hidden state representation of the current token with the filtered knowledge-enhanced representation, and outputting a gating value in the range of zero to one through a linear transformation layer and an activation function; and multiplying the knowledge-enhanced representation by the gating value, multiplying the hidden state representation of the current token by one minus the gating value, and adding the two products together to obtain the fused context representation.
6. The educational large model architecture for personalized learning according to claim 1, characterized in that calculating the joint loss function includes: calculating the cross-entropy between the next-token probability distribution predicted by the pre-trained large language model from the fused context representation and the actual token, as the language model loss; extracting entity pairs and their relation labels that match the educational knowledge graph from the training data, and calculating the cross-entropy between the model's predicted probability of the relation between each entity pair and the actual relation label, as the knowledge alignment loss; for samples in the training data that contain logical reasoning chains, calculating the sequence matching loss between the reasoning steps output by the model and the standard reasoning steps, as the logical consistency loss; and obtaining the joint loss function value as a weighted sum of the language model loss, the knowledge alignment loss, and the logical consistency loss.
7. The educational large model architecture for personalized learning according to claim 1, characterized in that the curriculum learning strategy includes: dividing the training samples into multiple difficulty levels based on the difficulty level of each knowledge point in the educational knowledge graph; in the early stage of fine-tuning, using low-difficulty training samples to update the model parameters, and gradually introducing higher-difficulty training samples as the number of training rounds increases; and determining whether to proceed to the next difficulty level according to the model's accuracy on the validation set at the current difficulty level.
8. The educational large model architecture for personalized learning according to claim 1, characterized in that the multi-level distillation training includes: inputting the training samples into the teacher model to obtain the raw scores of the teacher model's output layer for all candidate tokens, dividing the raw scores by a preset temperature parameter, and then normalizing them to obtain the soft label probability distribution; calculating the cross-entropy between the output probability distribution of the student model and the true label as the hard label loss, and calculating the KL divergence between the output probability distribution of the student model and the soft label probability distribution as the soft label loss; selecting multiple corresponding intermediate layers of the teacher model and the student model, and calculating the mean square error between the feature representations of the corresponding intermediate layers as the intermediate layer distillation loss; and obtaining the distillation loss as a weighted sum of the hard label loss, the soft label loss, and the intermediate layer distillation loss, and updating the parameters of the student model based on the distillation loss until convergence to obtain the lightweight education vertical model.
9. The educational large model architecture for personalized learning according to claim 1, characterized in that obtaining the user learning profile information includes: obtaining the user's historical learning interaction records, which include answer records, learning duration records, and review intervals; based on the answer records, calculating the user's accuracy rate on each knowledge point, and marking knowledge points with an accuracy rate lower than a preset accuracy threshold as weak knowledge points; based on the weak knowledge points, querying their predecessor and successor knowledge points in the educational knowledge graph to construct the user's personal knowledge state subgraph; fitting a knowledge forgetting curve to the review intervals and the changes in historical accuracy, and calculating the user's forgetting coefficient for each knowledge point; and encoding the user's personal knowledge state subgraph and the forgetting coefficient of each knowledge point into a user learning profile vector, which serves as the user learning profile information.
10. The educational large model architecture for personalized learning according to claim 9, characterized in that generating the personalized learning content and dynamic learning paths includes: concatenating the user's current learning request with the user learning profile vector and using the result as the input to the lightweight education vertical model; generating, by the lightweight education vertical model and based on the input, answer content targeting the user's weak knowledge points; determining the recommended learning order of knowledge points based on the predecessor dependencies of the weak knowledge points in the user's personal knowledge state subgraph and on the forgetting coefficients, and converting this order into a dynamic learning path; and returning the answer content and the dynamic learning path to the user, and updating the user learning profile information based on user feedback.