Method and system for standardizing extraction of educational knowledge atoms based on human-in-the-loop

By defining a unified standard for knowledge atom extraction and a deep learning model, combined with natural language processing technology, the problem of inconsistent knowledge atom extraction in educational knowledge graphs has been solved, achieving efficient and accurate knowledge atom extraction, improving the quality and completeness of knowledge graphs, and supporting intelligent applications in the education field.

CN122242448A · Pending · Publication Date: 2026-06-19 · Chinese People's Liberation Army Cyberspace Force Information Engineering University

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications (China)
Current Assignee / Owner
Chinese People's Liberation Army Cyberspace Force Information Engineering University
Filing Date
2026-02-25
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

The existing educational knowledge graphs lack a unified standard for extracting knowledge atoms. Existing extraction methods have limitations in accuracy, efficiency, comprehensiveness, and interpretability, making it difficult to fully explore the potential semantics and complex knowledge structures in educational texts.

Method used

We adopt a human-in-the-loop (HIL) method for standardized extraction of educational knowledge atoms. By defining a unified standard for knowledge atom extraction, we combine word vector models, dependency parsing, semantic role labeling, improved TF-IDF methods, and deep learning models to construct a model that combines CNN and RNN. We also introduce an attention mechanism to achieve efficient and accurate extraction of knowledge atoms. The results are validated through manual review and evaluation by domain experts.

Benefits of technology

It enables efficient and accurate extraction of knowledge atoms, improves the quality and completeness of knowledge graphs, reduces ambiguity and inconsistency, enriches the content of knowledge graphs, and supports intelligent applications in the education field.


Abstract

This invention relates to the field of educational knowledge graph construction technology, and particularly to a human-in-the-loop (HIL) method and system for standardized extraction of educational knowledge atoms. The method defines a knowledge atom as the smallest semantically independent knowledge unit in the educational field, and sets granularity standards for knowledge atoms based on pre-defined granularity principles and the needs of educational stages and teaching objectives. It performs source-level structuring, data cleaning and standardization, word segmentation, part-of-speech tagging, and named entity recognition on educational texts. Based on a word vector model, words in the text are mapped to low-dimensional vectors, and sentence structure information is obtained by combining dependency parsing and semantic role labeling. A deep learning model combining CNN and RNN is constructed, and an attention mechanism is introduced into the model. The trained model is used to extract candidate knowledge atoms from the educational texts, and these are then filtered based on the knowledge atom extraction criteria. This invention achieves efficient and accurate extraction of educational knowledge atoms.

Description

Technical Field

[0001] This invention relates to the field of educational knowledge graph construction technology, and in particular to a human-in-the-loop method and system for the standardized extraction of educational knowledge atoms.

Background Technology

[0002] With the rapid development of information technology, the education field is undergoing profound changes. Educational knowledge graphs, as an emerging technology, provide powerful support for educational informatization and demonstrate enormous potential in promoting educational innovation and improving educational quality, gradually becoming a hot topic in educational research and application. In today's digital age, educational data is exploding, and how to effectively organize, manage, and utilize this data has become a key issue. Knowledge graphs describe concepts, entities, and their relationships in the education field in a structured form, integrating fragmented knowledge into an organic whole, providing a new perspective and method for the processing and analysis of educational data. By constructing educational knowledge graphs, various educational resources, such as textbooks, courseware, test questions, and videos, can be linked with knowledge points to form a vast knowledge network, enabling rapid knowledge retrieval, intelligent recommendation, and in-depth analysis, thereby greatly promoting the process of educational informatization and improving the efficiency of educational resource utilization.

[0003] Knowledge atoms, as the most basic units of knowledge in a knowledge graph, directly impact the quality and application effectiveness of the knowledge graph through the accuracy and quality of their extraction. Currently, there are a series of problems in knowledge atom extraction.

[0004] The lack of unified standards for knowledge atom extraction means that different research and applications often define and extract knowledge atoms based on their own needs and understanding. This leads to differences in the granularity and semantic expression of the extracted knowledge atoms, making it difficult to effectively integrate and share knowledge from different sources. When extracting knowledge atoms from Chinese language textbooks, some researchers may treat a paragraph as a knowledge atom, while others may treat a sentence or even a single word. This difference makes subsequent knowledge integration and application difficult.

[0005] Existing methods for knowledge atom extraction have limitations. Rule-based methods, while highly accurate, rely heavily on manually written rules, resulting in inefficiency and difficulty in handling complex and varied educational texts. Statistical methods, capable of processing large-scale data, are susceptible to data noise, leading to concerns about the reliability of extraction results. Deep learning-based methods offer performance improvements but require extensive labeled data for training, and their interpretability is poor. When extracting knowledge points from educational literature, rule-based methods may miss important points due to incomplete rules; statistical methods may misclassify frequently occurring but non-critical words as knowledge atoms; and while deep learning-based methods can automatically learn the characteristics of knowledge atoms, it is difficult to explain why the model extracts these specific atoms.

[0006] Furthermore, existing extraction methods often fail to fully uncover the potential information in the rich semantic relationships and complex knowledge structures of the education field, resulting in insufficiently comprehensive and accurate extracted knowledge atoms, which cannot meet the demand for high-quality knowledge in educational knowledge graphs.

Summary of the Invention

[0007] To address the lack of unified standards for knowledge atom extraction in existing educational knowledge graphs, the limitations of existing extraction methods in terms of accuracy, efficiency, comprehensiveness, and interpretability, and the difficulty in fully exploring the potential semantics and complex knowledge structures in educational texts, this invention proposes a human-in-the-loop standardized method and system for extracting educational knowledge atoms. By unifying knowledge atom extraction standards and optimizing extraction methods and models, this method achieves efficient and accurate extraction of educational knowledge atoms, thereby improving the quality of educational knowledge graphs and the effectiveness of intelligent applications in the education field.

[0008] To achieve the above objectives, the technical solution adopted is:

[0009] This invention provides a human-in-the-loop method for the standardized extraction of educational knowledge atoms, comprising the following steps:

[0010] S1: Define the knowledge atom extraction standard: Define the knowledge atom as the smallest knowledge unit with independent semantics in the field of education, and set the granularity standard of the knowledge atom according to the educational stage and teaching objectives based on the preset granularity definition principle.

[0011] S2: Text preprocessing: The educational texts are processed through source structuring, data cleaning and standardization, word segmentation, part-of-speech tagging and named entity recognition to obtain structured text data;

[0012] S3: Feature Representation: First, words in the text are mapped into low-dimensional vectors based on the word vector model, and sentence structure information is obtained by combining dependency parsing and semantic role labeling; then, the improved TF-IDF method, which integrates ontology concept similarity and concept co-occurrence characteristics, is used to optimize feature representation.

[0013] S4: Knowledge Atom Extraction Model Construction: Construct a deep learning model that combines CNN and RNN, introduce an attention mechanism into the model, and train the model using labeled educational text data;

[0014] S5: Knowledge Atom Extraction and Screening: Extract candidate knowledge atoms from educational texts using the trained model and screen them based on the knowledge atom extraction criteria in step S1.

[0015] S6: Knowledge Atom Verification and Update: Verify the selected knowledge atoms through manual review, comparison with the existing knowledge base, or evaluation by domain experts, and establish a dynamic update mechanism for knowledge atoms.

[0016] According to the human-in-the-loop educational knowledge atom standardization extraction method of the present invention, the granularity definition principle in step S1 further includes: the knowledge atom should fully express an independent knowledge meaning and can be understood without additional explanation; the granularity of the knowledge atom is adaptively adjusted according to the educational stage and teaching objectives.

[0017] According to the human-in-the-loop educational knowledge atom standardization extraction method of the present invention, the text preprocessing in step S2 further includes the following steps: source structuring: using the Python libraries pdfplumber and python-docx to segment the structure of PDF and Word documents respectively, while removing meaningless pages; data cleaning and standardization: removing noisy data and unifying text format, encoding, and capitalization; word segmentation: using dictionary-based, statistical model-based, or deep learning-based word segmentation methods for Chinese text, and using spaces and punctuation marks for English text; part-of-speech tagging: using a rule-based or statistical model-based part-of-speech tagger to tag the part of speech of each word; and named entity recognition: using a deep learning-based named entity recognition model to identify terminology named entities in the text.

[0018] According to the human-in-the-loop educational knowledge atom standardization extraction method of the present invention, the feature representation in step S3 specifically includes the following steps: first, the Word2Vec word vector model is used to map the words in the text into low-dimensional vectors; then, the text is subjected to dependency parsing to extract subject-verb-object and attributive-adverbial-complement relationships; next, semantic role labeling is used to assign agent, patient, time, and place roles to each predicate; finally, the improved TF-IDF method, which integrates ontology concept similarity and concept co-occurrence characteristics, is used to optimize the feature representation.

[0019] According to the human-in-the-loop educational knowledge atom standardization extraction method of the present invention, the improved TF-IDF method further includes: calculating the shortest path between concepts and the depth of their nearest common parent node in the ontology hierarchical network to obtain the semantic similarity of the concepts; constructing a set of similar concepts based on a word frequency threshold and a similarity threshold, and adjusting the feature word frequency; constructing a word co-occurrence matrix, and extracting word pairs with a high co-occurrence rate to form a co-occurrence feature set; and fusing the adjusted set of similar concepts and the co-occurrence feature set to obtain the final feature set.

[0020] According to the human-in-the-loop educational knowledge atom standardization extraction method of the present invention, in the knowledge atom extraction model described in step S4, the CNN extracts local text features through convolutional kernels of various sizes, and the RNN uses LSTM or GRU structures to capture text context information and semantic dependencies; the outputs of the CNN and RNN are concatenated or merged by weighting; and the attention mechanism assigns weights according to the importance of the text to strengthen the focus on key information.

[0021] According to the human-in-the-loop educational knowledge atom standardization extraction method of the present invention, during the training process in step S4, cross-entropy is used as the loss function; the optimization algorithm is stochastic gradient descent, Adagrad, Adadelta, or Adam; the learning rate schedule follows a cosine decay strategy; and Dropout, L1, or L2 regularization is used to prevent overfitting.

[0022] According to the human-in-the-loop educational knowledge atom standardization extraction method of the present invention, the knowledge atom screening in step S5 further includes: screening the knowledge atoms extracted by the model according to the knowledge atom extraction criteria, and removing candidate knowledge atoms that are semantically incomplete, vaguely expressed, or irrelevant to the field of education.

[0023] Furthermore, the present invention also provides a human-in-the-loop system for the standardized extraction of educational knowledge atoms, comprising:

[0024] The text preprocessing module is used to perform source structuring processing, data cleaning and standardization, word segmentation, part-of-speech tagging and named entity recognition on educational texts to obtain structured text data.

[0025] The feature representation module first maps words in the text into low-dimensional vectors based on a word vector model, and then obtains sentence structure information by combining dependency parsing and semantic role labeling; then it optimizes the feature representation by using an improved TF-IDF method that integrates ontology concept similarity and concept co-occurrence characteristics.

[0026] The model building module is used to build a deep learning model that combines CNN and RNN, introduces an attention mechanism into the model, and trains the model using labeled educational text data.

[0027] The knowledge atom extraction and filtering module is used to extract candidate knowledge atoms from educational texts using the trained model and to filter them based on knowledge atom extraction criteria.

[0028] The knowledge atom verification and update module is used to verify the selected knowledge atoms through manual review, comparison with the existing knowledge base, or evaluation by domain experts, and to establish a dynamic update mechanism for knowledge atoms.

[0029] The beneficial effects achieved by adopting the above technical solution are:

[0030] This invention proposes a human-in-the-loop (HIL) method for standardized extraction of educational knowledge atoms. In terms of improving knowledge graph quality, the unified and clear extraction standards ensure consistency in granularity and semantic expression among extracted knowledge atoms, reducing ambiguity and inconsistency, thereby enhancing the accuracy and reliability of the knowledge graph. By employing advanced natural language processing and machine learning techniques, it is possible to more comprehensively and deeply mine knowledge from educational texts, enriching the content of the knowledge graph and improving its completeness. Utilizing a model structure combining convolutional neural networks and recurrent neural networks, along with attention mechanisms and other techniques, it can accurately capture key knowledge from texts and extract high-quality knowledge atoms, laying a solid foundation for constructing high-quality educational knowledge graphs.

Attached Figure Description

[0031] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings of the embodiments of the present invention will be briefly described below. The drawings are merely illustrative of some embodiments of the present invention and are not intended to limit the scope of the present invention to all embodiments.

[0032] Figure 1 is a flowchart of the human-in-the-loop method for the standardized extraction of educational knowledge atoms according to an embodiment of the present invention.

[0033] Figure 2 is a schematic diagram of the knowledge atom extraction model according to an embodiment of the present invention.

Detailed Implementation

[0034] The exemplary solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Unless otherwise defined, the technical or scientific terms used in this invention should have the ordinary meaning understood by one of ordinary skill in the art.

[0035] This invention discloses a human-in-the-loop method for the standardized extraction of educational knowledge atoms. The flowchart in Figure 1 illustrates the entire process from text preprocessing to knowledge atom selection and verification, presenting the logical relationships and data flow between the steps. The text preprocessing step includes sub-steps such as cleaning, word segmentation, part-of-speech tagging, and named entity recognition. Each sub-step has clearly defined inputs and outputs; for example, the cleaning sub-step removes noisy data and outputs standardized text, and the word segmentation sub-step divides the text into words, providing a foundation for subsequent processing. The feature representation step uses techniques such as word vector models and syntactic analysis to extract semantic and syntactic features from the text. These features serve as input to the knowledge atom extraction model, which outputs candidate knowledge atoms. Finally, the knowledge atom selection and verification step filters and verifies the candidate knowledge atoms according to set criteria, ensuring the accuracy and reliability of the finally extracted knowledge atoms. The specific process is as follows:

[0036] Step S1: Define the knowledge atom extraction standard: Define the knowledge atom as the smallest knowledge unit with independent semantics in the field of education, and set the granularity standard of the knowledge atom according to the educational stage and teaching objectives based on the preset granularity definition principle.

[0037] Definition of a knowledge atom: A knowledge atom is defined as the smallest unit of knowledge with independent semantics in the field of education. These knowledge atoms can accurately express a specific concept, fact, principle, rule, or other knowledge element, and are the cornerstone of constructing an educational knowledge graph. In mathematics, the Pythagorean theorem is a knowledge atom, clearly expressing the specific knowledge of the quantitative relationship between the three sides of a right triangle; in language arts, the definition and characteristics of the rhetorical device of metaphor can also be considered a knowledge atom.

[0038] Granularity Definition: The granularity definition of knowledge atoms follows these principles: it should be able to fully express an independent knowledge meaning without being overly detailed, leading to fragmentation and redundancy. The standard for describing knowledge points is to accurately convey the core content and be understandable without additional explanation. In history, when describing the "Industrial Revolution," its key elements—such as its starting time, major inventions, and significant socio-economic impact—are considered as a single knowledge atom, rather than treating each invention's details as a separate atom. This ensures the integrity of the knowledge while avoiding over-detailing. Furthermore, the granularity of knowledge atoms should be appropriately adjusted according to the needs of different educational stages and teaching objectives. In basic education, the granularity of knowledge atoms is relatively coarse, focusing on the transmission of fundamental knowledge; in higher education, the granularity can be finer to meet the needs of in-depth professional learning.

[0039] Step S2: Text preprocessing: The educational text is processed by source structuring, data cleaning and standardization, word segmentation, part-of-speech tagging and named entity recognition to obtain structured text data.

[0040] First, the educational texts undergo source structuring. The Python libraries pdfplumber and python-docx are used to segment the structure of PDF and Word documents, respectively, while removing meaningless pages such as the preface of a PDF. Next, data cleaning and normalization are performed to remove noisy data, such as irrelevant punctuation, special characters, and garbled text. At the same time, the text is normalized by unifying the format and encoding and converting the text to lowercase to eliminate inconsistencies caused by capitalization differences. Word segmentation is then performed using natural language processing techniques to divide the continuous text sequence into individual words or phrases. For English text, simple segmentation using spaces and punctuation marks is sufficient. For Chinese text, which has no obvious separators between words, dictionary-based, statistical model-based, or deep learning-based segmentation methods are employed, for example the Harbin Institute of Technology LTP segmentation tool or Jieba. Next, part-of-speech tagging labels each word with its part of speech (noun, verb, adjective, adverb, and so on) to facilitate subsequent analysis of the text's grammatical structure and semantics. Rule-based or statistical model-based part-of-speech taggers, such as Hidden Markov Model (HMM) taggers and maximum entropy taggers, can be used. Finally, named entity recognition is performed to identify named entities in the text. A deep learning-based named entity recognition model, such as the BiLSTM-CRF model, is employed. This model uses a Bidirectional Long Short-Term Memory (BiLSTM) network to learn the contextual information of the text and combines it with a Conditional Random Field (CRF) to accurately identify entity boundaries, laying a solid foundation for knowledge atom extraction.
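The cleaning, normalization, and English segmentation sub-steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation disclosed in the patent: the function names `clean_text` and `tokenize_english`, the choice of NFKC normalization, and the token pattern are all assumptions made for the example. Source structuring (pdfplumber, python-docx) and Chinese segmentation (LTP, Jieba) are left out because they depend on external tools.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Unify the text form, drop garbled characters, and lowercase (illustrative rules)."""
    # NFKC folds full-width and compatibility characters into one canonical form.
    text = unicodedata.normalize("NFKC", raw)
    # Turn every whitespace character (tabs, newlines, ...) into a plain space.
    text = re.sub(r"\s", " ", text)
    # Remove control characters and the Unicode replacement character left by bad decoding.
    text = "".join(ch for ch in text if ch.isprintable() and ch != "\ufffd")
    text = text.lower()
    # Collapse the runs of spaces left behind by the removals above.
    return re.sub(r" +", " ", text).strip()


def tokenize_english(text: str) -> list[str]:
    """Split cleaned English text on spaces and punctuation, keeping word-like tokens."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text)


sample = "  The Pythagorean\x00 THEOREM\ufffd\trelates a, b and c. "
cleaned = clean_text(sample)
tokens = tokenize_english(cleaned)
```

Cleaning runs before segmentation so that the tokenizer only ever sees one casing and one character form, which is the consistency the paragraph above aims for.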

[0041] Step S3: Feature Representation: First, the words in the text are mapped into low-dimensional vectors based on the word vector model, and the sentence structure information is obtained by combining dependency parsing and semantic role labeling; then, the improved TF-IDF method, which integrates ontology concept similarity and concept co-occurrence characteristics, is used to optimize the feature representation.

[0042] First, the Word2Vec word vector model is used to map words in the text into low-dimensional vectors, capturing semantic similarity and semantic relationships between words, providing semantic-level feature representations for knowledge atom extraction. For example, the word vector model is first trained on a large-scale educational corpus to obtain low-dimensional vectors for each word, making semantically similar words closer together in the vector space. Then, dependency parsing is performed on the text to extract subject-verb-object, attributive, adverbial, and complement relationships between words. Then, semantic role labeling (SRL) is combined to assign agent, patient, time, and place roles to each predicate. This preserves the semantic vectors of words and clarifies the structural information of sentences, providing complete feature representations for knowledge atom extraction. For example, in the sentence "The teacher is giving a lecture to the students in the classroom," the word vector stage obtains vectors for "teacher," "classroom," "student," and "lecture." Dependency parsing reveals that "teacher" is the subject, "lecture" is the predicate, "student" is the object, and "in the classroom" is the adverbial. SRL labeling further confirms that the agent is the teacher, the patient is the student, and the place is the classroom. Therefore, knowledge atoms (teachers, lectures, students) and their corresponding vector representations can be extracted using subsequent knowledge atom extraction models, providing reliable semantic features for similarity matching, clustering, or knowledge graph construction.
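As a rough illustration of how the dependency relations in the example sentence lead to the (teacher, lecture, student) result, the sketch below reads subject-predicate-object triples from a hand-written parse. No parser is called: the `Arc` structure, the relation labels, and the function `extract_triples` are hypothetical names introduced for this example, and a real pipeline would obtain the arcs from a dependency parser.

```python
from dataclasses import dataclass


@dataclass
class Arc:
    word: str      # dependent token
    head: str      # governing token ("ROOT" for the sentence predicate)
    relation: str  # hypothetical label set: "root", "subject", "object", "adverbial"


def extract_triples(arcs):
    """Return (subject, predicate, object) triples found in one parsed sentence."""
    triples = []
    predicates = [a.word for a in arcs if a.relation == "root"]
    for pred in predicates:
        subjects = [a.word for a in arcs if a.head == pred and a.relation == "subject"]
        objects = [a.word for a in arcs if a.head == pred and a.relation == "object"]
        triples.extend((s, pred, o) for s in subjects for o in objects)
    return triples


# Hand-written parse of "The teacher is giving a lecture to the students in the classroom".
parsed = [
    Arc("teacher", "lecture", "subject"),
    Arc("lecture", "ROOT", "root"),
    Arc("student", "lecture", "object"),
    Arc("classroom", "lecture", "adverbial"),
]
triples = extract_triples(parsed)
```

The adverbial "classroom" is deliberately left out of the triple; in the scheme above it would be carried by the place role from semantic role labeling.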

[0043] Furthermore, an improved TF-IDF method that integrates ontology concept similarity and concept co-occurrence characteristics can be used to optimize feature representation. This method first leverages the ontology's standardized description of concepts and relationships within the domain, employing the Li et al. method to calculate semantic similarity by combining the shortest path and nearest common parent node depth of concepts in the ontology hierarchy. It then filters text elements by setting word frequency and similarity thresholds to construct a set of similar concepts, and adjusts the word frequency of feature elements based on semantic similarity to eliminate redundancy and enhance feature independence. Next, to address the issue of important low-frequency features being ignored, a co-occurrence matrix is constructed by calculating word co-occurrence rates (based on the number of times a word appears alone and the number of times it co-occurs), and high-co-occurrence-rate word pairs are extracted to form a co-occurrence feature set. Finally, the adjusted set of similar concepts and the co-occurrence feature set are merged, and low-weight features with similarity exceeding 90% are removed to obtain the final feature set.
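The two building blocks of this step, ontology-based similarity and co-occurrence rate, can be sketched as below. Several parts are assumptions rather than details from the patent: the toy ontology, the parameter values `alpha` and `beta`, counting the root at depth 1, and the co-occurrence rate formula (documents containing both words divided by documents containing either). The similarity follows the commonly cited form attributed to Li et al., exp(-alpha * l) * tanh(beta * h), where l is the shortest path length and h the depth of the nearest common parent.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy ontology: child -> parent (the root has parent None).
PARENT = {
    "mathematics": None,
    "geometry": "mathematics",
    "algebra": "mathematics",
    "triangle": "geometry",
    "right triangle": "triangle",
    "equation": "algebra",
}


def ancestors(concept):
    """Path from the concept up to the root, starting with the concept itself."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def li_similarity(c1, c2, alpha=0.2, beta=0.6):
    """exp(-alpha * l) * tanh(beta * h); alpha and beta are placeholder values."""
    path1, path2 = ancestors(c1), ancestors(c2)
    common = next(c for c in path1 if c in path2)       # nearest common parent
    length = path1.index(common) + path2.index(common)  # shortest path l
    depth = len(ancestors(common))                      # depth h, root counted as 1
    return math.exp(-alpha * length) * math.tanh(beta * depth)


def cooccurrence_pairs(docs, min_rate=0.5):
    """Word pairs whose assumed co-occurrence rate reaches min_rate."""
    doc_freq, pair_freq = Counter(), Counter()
    for tokens in docs:
        vocab = sorted(set(tokens))
        doc_freq.update(vocab)
        pair_freq.update(combinations(vocab, 2))
    rates = {}
    for (a, b), both in pair_freq.items():
        rate = both / (doc_freq[a] + doc_freq[b] - both)
        if rate >= min_rate:
            rates[(a, b)] = rate
    return rates


docs = [["force", "mass", "acceleration"], ["force", "mass"], ["force", "energy"]]
pairs = cooccurrence_pairs(docs)
```

In this toy data "acceleration" appears only once yet is retained through its pairing with "mass", which is the kind of low-frequency feature the co-occurrence set is meant to preserve.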

[0044] Step S4: Knowledge Atom Extraction Model Construction: Construct a deep learning model that combines CNN and RNN, introduce an attention mechanism into the model, and train the model using labeled educational text data.

[0045] Figure 2 illustrates the structure of a knowledge atom extraction model combining Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), detailing the structure and connections of each layer. The input layer receives preprocessed and feature-extracted text data. Convolutional layers use kernels of different sizes to perform convolution operations on the text, extracting local features. Pooling layers downsample the convolution results to reduce the amount of data. Recurrent layers employ RNN structures, such as LSTM or GRU, to capture the contextual information of the text. Fully connected layers map the output of the recurrent layers to obtain the final prediction result. The attention mechanism layer assigns different weights based on the importance of the text, highlighting key information and improving the model's accuracy.

[0046] CNNs excel at extracting local features by sliding convolutional kernels of varying sizes across text sequences to quickly capture local n-gram semantic features. RNNs (such as LSTM / GRU) recursively process sequence data in the temporal dimension, capturing contextual information and semantic dependencies within the text. Concatenating or weighting the features of both in a fusion layer allows the model to achieve both fine-grained local awareness and global semantic coherence. Introducing an attention mechanism during the model's decoding stage allows it to automatically focus on key information relevant to knowledge atom extraction, improving its ability to capture important information and enhancing performance. When processing the sentence "In mathematics, the Pythagorean theorem is a very important theorem that describes the relationship between the three sides of a right triangle," the attention mechanism allows the model to pay closer attention to key information such as "Pythagorean theorem," "right triangle," and "relationship between the three sides," thereby significantly improving the accuracy and robustness of knowledge atom extraction.
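The fusion and attention steps can be illustrated without a deep learning framework. The sketch below concatenates per-token CNN and RNN feature vectors and applies dot-product attention with a softmax over tokens. The patent does not specify the scoring function, so the dot-product form, the `query` vector, and the toy feature values are assumptions; in a trained model the features and the query would be learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(cnn_feats, rnn_feats):
    """Concatenate the CNN and RNN feature vectors of each token."""
    return [c + r for c, r in zip(cnn_feats, rnn_feats)]


def attention_pool(token_feats, query):
    """Weight tokens by softmax(feature . query) and return (weights, context vector)."""
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in token_feats]
    weights = softmax(scores)
    dim = len(token_feats[0])
    context = [sum(w * feat[i] for w, feat in zip(weights, token_feats)) for i in range(dim)]
    return weights, context


# Toy one-dimensional features for three tokens; the middle token is the "key" one.
cnn = [[0.1], [0.9], [0.2]]
rnn = [[0.0], [0.8], [0.1]]
fused = fuse(cnn, rnn)
weights, context = attention_pool(fused, query=[1.0, 1.0])
```

The token with the largest score receives the largest weight, so the context vector is pulled toward its features, which is the "focus on key information" behaviour described above.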

[0047] During model training, a large amount of labeled educational text data is used as the training set. Cross-entropy is chosen as the loss function, and optimization algorithms such as stochastic gradient descent (SGD), Adagrad, Adadelta, and Adam are employed. A cosine decay strategy is used for learning rate scheduling to accelerate convergence in the early stages and refine parameters in the later stages, enabling the model to accurately learn the features and patterns of knowledge atoms. In the early stages of model training, a larger learning rate can be set to accelerate convergence; as training progresses, the learning rate is gradually reduced to prevent the model from oscillating around the optimal solution. Simultaneously, to prevent overfitting, regularization techniques such as L1 regularization, L2 regularization, and Dropout are used to limit the model's complexity and improve its generalization ability.
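The cosine decay strategy mentioned above is commonly written as lr(t) = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T)). A minimal sketch follows; the function name and the default `lr_max` and `lr_min` values are illustrative assumptions, not values given in the patent.

```python
import math


def cosine_decay_lr(step, total_steps, lr_max=1e-3, lr_min=1e-5):
    """Learning rate at a given step under cosine decay from lr_max down to lr_min."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Learning rates over a 100-step run, sampled every 25 steps.
schedule = [cosine_decay_lr(s, 100) for s in range(0, 101, 25)]
```

The rate starts at `lr_max`, falls slowly at first, fastest around the midpoint, and flattens out near `lr_min`, matching the large-then-small learning rate behaviour the paragraph describes.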

[0048] Step S5: Knowledge Atom Extraction and Screening: Extract candidate knowledge atoms from the educational text using the trained model, and screen them based on the knowledge atom extraction criteria in Step S1.

[0049] Based on pre-defined knowledge atom extraction criteria, the knowledge atoms extracted by the model are screened, and candidate knowledge atoms that do not meet the criteria are removed, such as those with incomplete semantics, vague expressions, or content unrelated to the education field.
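One possible shape for this screening step is a rule filter that records why each candidate was rejected. The patent names the three rejection reasons but not how they are detected, so everything concrete below is a hypothetical stand-in: the word lists, the minimum token count, and the function name `screen`.

```python
# Hypothetical word lists; a real system would derive these from the S1 standard.
VAGUE_MARKERS = {"something", "stuff", "etc", "things"}
DOMAIN_TERMS = {"theorem", "triangle", "metaphor", "function", "revolution"}


def screen(candidates, min_tokens=2):
    """Split candidates into kept atoms and (atom, reason) rejections."""
    kept, rejected = [], []
    for cand in candidates:
        tokens = set(cand.lower().split())
        if len(cand.split()) < min_tokens:
            rejected.append((cand, "semantically incomplete"))
        elif VAGUE_MARKERS & tokens:
            rejected.append((cand, "vaguely expressed"))
        elif not DOMAIN_TERMS & tokens:
            rejected.append((cand, "unrelated to education"))
        else:
            kept.append(cand)
    return kept, rejected


kept, rejected = screen([
    "Pythagorean theorem",
    "theorem",
    "some stuff about triangle",
    "weather forecast today",
])
```

Keeping the rejection reason alongside each discarded candidate gives the reviewers in step S6 something to audit, rather than a silent drop.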

[0050] Step S6: Knowledge Atom Verification and Update: Verify the selected knowledge atoms through manual review, comparison with the existing knowledge base, or evaluation by domain experts, and establish a dynamic update mechanism for knowledge atoms.

[0051] Extracted knowledge atoms are verified through manual review, comparison with existing knowledge bases, or evaluation by domain experts to ensure their accuracy and reliability. For knowledge atoms that are controversial or uncertain, domain experts can be organized to discuss and ultimately determine whether they should be included as valid knowledge atoms in the educational knowledge graph.

[0052] Establish a dynamic knowledge atom update mechanism: As the field of education develops and new knowledge emerges, the extracted knowledge atoms should be updated and supplemented in a timely manner to ensure the timeliness and completeness of the knowledge graph. New educational text data should be collected periodically, processed using the extraction method described in this scheme, and the newly extracted knowledge atoms should be integrated with the existing knowledge graph to achieve dynamic updates.
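The verification and update loop of step S6 might be organized as in the sketch below. It is an assumption-laden illustration: the knowledge base is modelled as a plain set of strings, the expert decision is passed in as a callable named `expert_review`, and both function names are hypothetical.

```python
def verify(candidates, knowledge_base, expert_review):
    """Accept atoms already in the knowledge base; send the rest to a reviewer."""
    accepted, rejected = [], []
    for atom in candidates:
        if atom in knowledge_base:
            accepted.append(atom)   # confirmed by comparison with the existing base
        elif expert_review(atom):
            accepted.append(atom)   # confirmed by manual or expert review
        else:
            rejected.append(atom)
    return accepted, rejected


def update_knowledge_base(knowledge_base, accepted):
    """Merge newly verified atoms into the knowledge base and return the additions."""
    new_atoms = [a for a in accepted if a not in knowledge_base]
    knowledge_base.update(new_atoms)
    return new_atoms


kb = {"pythagorean theorem"}
# Stand-in for a human decision; a real reviewer would be asked interactively.
reviewer = lambda atom: atom == "metaphor"
accepted, rejected = verify(["pythagorean theorem", "metaphor", "flat earth"], kb, reviewer)
added = update_knowledge_base(kb, accepted)
```

Running the same two functions on each periodic batch of new texts gives the dynamic update described above, with the reviewer consulted only for atoms the knowledge base cannot already confirm.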

[0053] The extraction method employed in this scheme can extract high-quality knowledge atoms, ensuring the accuracy and reliability of the knowledge graph. In supporting education and teaching, intelligent teaching systems based on high-quality knowledge graphs can use the knowledge connections and reasoning capabilities within the knowledge graph to provide students with personalized learning paths and precise learning resource recommendations based on their learning progress and knowledge mastery. This meets the learning needs of different students and improves learning outcomes. For example, when students are learning mathematical functions, the system can recommend suitable learning materials and exercises based on the connections between function-related knowledge in the knowledge graph and the students' previous learning records, helping them better master the topic. Teachers can use knowledge graphs to systematically organize and analyze teaching content and to understand the structure and context of the subject's knowledge system, thereby optimizing teaching plans and improving teaching quality. Through knowledge graphs, teachers can quickly find connections between knowledge points, design more reasonable teaching processes, and guide students to build a complete knowledge system. Knowledge graphs can also provide rich data support for educational research, helping researchers analyze the knowledge structure and development trends in the field of education from a macro perspective, discover new research questions and directions, and promote academic research and innovative development in the field. Researchers can use knowledge graphs to analyze a large amount of educational literature, uncover research hotspots and cutting-edge trends, and provide a reference for the formulation of education policies and education reform.

[0054] Corresponding to the above method, the present invention also discloses a human-in-the-loop-based system for standardized extraction of educational knowledge atoms, comprising:

[0055] The text preprocessing module is used to perform source structuring processing, data cleaning and standardization, word segmentation, part-of-speech tagging and named entity recognition on educational texts to obtain structured text data.

[0056] The feature representation module first maps words in the text into low-dimensional vectors based on a word vector model, and then obtains sentence structure information by combining dependency parsing and semantic role labeling; then it optimizes the feature representation by using an improved TF-IDF method that integrates ontology concept similarity and concept co-occurrence characteristics.

[0057] The model building module is used to build a deep learning model that combines CNN and RNN, introduces an attention mechanism into the model, and trains the model using labeled educational text data.

[0058] The knowledge atom extraction and filtering module is used to extract candidate knowledge atoms from educational texts using the trained model and to filter them based on knowledge atom extraction criteria.

[0059] The knowledge atom verification and update module is used to verify the selected knowledge atoms through manual review, comparison with the existing knowledge base, or evaluation by domain experts, and to establish a dynamic update mechanism for knowledge atoms.
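The concept co-occurrence part of the improved TF-IDF method used by the feature representation module can be illustrated with a short sketch. It assumes sentence-level co-occurrence and a simple count threshold (min_count); the window definition, the threshold, and the function name are assumptions made for illustration, and the ontology-based similarity part of the method is not covered here.

```python
from collections import Counter
from itertools import combinations


def high_cooccurrence_pairs(sentences, min_count: int = 2):
    """Count unordered word pairs per sentence and keep the frequent ones.

    `sentences` is a list of token lists; the result maps each retained
    (word_a, word_b) pair, sorted alphabetically, to its sentence count.
    """
    counts = Counter()
    for tokens in sentences:
        # Each pair is counted at most once per sentence.
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}
```

The retained pairs would form the co-occurrence feature set, which is then fused with the adjusted set of similar concepts to obtain the final feature set.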

[0060] In summary, this invention aims to overcome the problems of inconsistent standards and limitations in extraction methods for knowledge atoms in existing educational knowledge graphs. It establishes a unified, scientific, and reasonable standard for knowledge atom extraction and provides an efficient and accurate standardized method for extracting educational knowledge atoms. This improves the quality of educational knowledge graph construction, provides a solid knowledge foundation for intelligent applications in education, promotes the innovation and development of teaching models, and enhances the efficiency and effectiveness of education. Through the method of this invention, key knowledge in educational texts can be more accurately extracted, enabling deep integration and effective utilization of knowledge, and meeting the urgent need for high-quality knowledge graphs in the education field.

[0061] A specific example is given below to illustrate this solution.

[0062] Taking the extraction of knowledge atoms in advanced mathematics as an example, textbooks, teaching materials, and past exam questions are collected as text data sources. First, text preprocessing is performed, using Python's regular expression library to remove special characters and garbled text, such as stray symbols produced by format conversion. The Jieba word segmentation tool is used to segment the Chinese text; for example, the Chinese sentence meaning "a function is an important mathematical concept" is segmented into tokens that correspond to "function," "is," "a kind of," "important," "of," "mathematics," and "concept." A part-of-speech tagging tool based on a Hidden Markov Model is then used to tag the segmentation results and determine the part of speech of each word, such as "function" as a noun and "is" as a verb. Finally, a BiLSTM-CRF model is used for named entity recognition, identifying mathematical terms, formulas, and other entities in the text; for example, mathematical concepts such as "Pythagorean theorem" and "arithmetic sequence" are identified as entities.
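The cleaning step in this example can be sketched with Python's standard library alone. The choice of which characters count as "garbled" (here the Unicode replacement character and ASCII control characters) and the use of NFKC normalization are assumptions made for illustration; the text does not specify them.

```python
import re
import unicodedata

# Replacement character plus ASCII control characters, excluding tab and newline.
_DEBRIS = re.compile(r"[\ufffd\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
_SPACES = re.compile(r"[ \t]+")


def clean_text(raw: str) -> str:
    """Normalize character forms, strip conversion debris, and collapse spacing."""
    # NFKC folds full-width letters and punctuation into their standard forms.
    text = unicodedata.normalize("NFKC", raw)
    text = _DEBRIS.sub("", text)
    text = _SPACES.sub(" ", text)
    return text.strip()
```

The cleaned string would then be passed to the word segmenter. One design point to weigh: NFKC also folds characters such as superscript digits into plain digits, which may be undesirable for text that contains formulas.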

[0063] In the feature extraction stage, a Word2Vec model is trained on mathematical text to generate word vectors, mapping each mathematical word to a low-dimensional vector that captures the semantic relationships between words. For example, the word vectors of "function" and "mapping" are close in the vector space, reflecting their semantic relatedness. Dependency parsing tools are used to analyze the syntactic structure of sentences, extracting subject-verb-object, attributive, adverbial, and complement relationships. For the sentence "In the Cartesian coordinate system, we can draw function graphs," the analysis shows that "we" is the subject, "draw" is the predicate, "function graphs" is the object, and "in the Cartesian coordinate system" is the adverbial. Semantic role labeling is used to determine the semantic roles of the arguments of each predicate. For example, in "Xiaoming proved the Pythagorean theorem," "Xiaoming" is the agent of the action "proved," and "Pythagorean theorem" is the patient.
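"Close in the vector space" is usually measured with cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real Word2Vec vectors are learned from the corpus and typically have on the order of a hundred or more dimensions.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors (hypothetical values, not trained embeddings).
toy_vectors = {
    "function": [0.8, 0.1, 0.3],
    "mapping": [0.7, 0.2, 0.4],
    "ellipse": [-0.2, 0.9, 0.1],
}
```

With these toy values, cosine_similarity(toy_vectors["function"], toy_vectors["mapping"]) is much larger than the similarity between "function" and "ellipse", mirroring the relationship described in the paragraph.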

[0064] A knowledge atom extraction model is constructed, employing a combination of CNN and RNN structures. The model uses multiple convolutional layers with different kernel sizes, such as 3×1 and 5×1 kernels, to perform convolution operations on the text and extract local features. After the convolutional layers, pooling layers downsample the output to reduce the amount of data. An LSTM is used as the recurrent layer to capture the contextual information of the text. Fully connected layers map the output of the recurrent layer to the prediction results. During model training, a large amount of labeled mathematical text data is used as the training set. The Adam optimization algorithm is used to iteratively update the model parameters, with an initial learning rate of 0.001 that is multiplied by 0.9 every 10 epochs. To prevent overfitting, a Dropout layer with a Dropout probability of 0.5 is added before the fully connected layers. By introducing an attention mechanism, the model can automatically focus on key information related to mathematical knowledge atoms when processing text. For example, when processing the sentence "Trigonometric functions include sine, cosine, and tangent functions", the attention mechanism makes the model pay more attention to key concepts such as "trigonometric functions", "sine function", "cosine function", and "tangent function".
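Two numerical details of this example can be made concrete with a short sketch. The first function reads the schedule in the example as the learning rate being multiplied by 0.9 after every 10 epochs, which is an interpretation of the text. The second shows softmax-based attention pooling in its simplest form; the attention scores are taken as given inputs here, whereas in a trained model they would come from a learned scoring layer. All function and parameter names are hypothetical.

```python
import math


def step_decay_lr(epoch: int, lr0: float = 0.001,
                  factor: float = 0.9, every: int = 10) -> float:
    """Learning rate for a 0-based epoch index under step decay."""
    return lr0 * factor ** (epoch // every)


def attention_pool(hidden_states, scores):
    """Weight per-token hidden states by softmax(scores) and sum them.

    Returns the attention weights and the pooled context vector.
    """
    if not hidden_states or len(hidden_states) != len(scores):
        raise ValueError("need one score per hidden state")
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, hidden_states))
               for i in range(dim)]
    return weights, context
```

Under this reading, epochs 0 to 9 train at 0.001, epochs 10 to 19 at 0.0009, and so on. In the attention function, tokens with higher scores, such as the term tokens in the trigonometric example, receive larger weights and therefore contribute more to the pooled representation.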

[0065] After model training, knowledge atoms are extracted from the test text. Extracted knowledge atoms include concepts such as "definition of a function," "general term formula of an arithmetic sequence," and "standard equation of an ellipse." Based on pre-defined knowledge atom extraction criteria, the results are filtered to remove content that does not meet the criteria, such as vague statements like "there are some interesting things in mathematics." The filtered knowledge atoms are verified by comparing them with mathematics textbooks and professional mathematical knowledge bases to ensure their accuracy and reliability. For controversial knowledge atoms, such as "the ε-δ definition of a limit," experts in the field of mathematics are organized to discuss and evaluate them, ultimately determining whether they should be included as valid knowledge atoms in the educational knowledge graph.

[0066] Finally, it should be noted that the above-described embodiments are merely specific implementations of the present invention, used to illustrate its technical solutions and not to limit them, and the scope of protection of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed in the present invention, they can still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of the technical features. Such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be covered within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.

Claims

1. A method for standardized extraction of educational knowledge atoms based on human-in-the-loop, characterized in that it includes the following steps:
S1: Define the knowledge atom extraction standard: define the knowledge atom as the smallest knowledge unit with independent semantics in the field of education, and set the granularity standard of the knowledge atom according to the educational stage and teaching objectives based on the preset granularity definition principles;
S2: Text preprocessing: process the educational texts through source structuring, data cleaning and standardization, word segmentation, part-of-speech tagging, and named entity recognition to obtain structured text data;
S3: Feature representation: first map words in the text into low-dimensional vectors based on a word vector model, and obtain sentence structure information by combining dependency parsing and semantic role labeling; then use an improved TF-IDF method, which integrates ontology concept similarity and concept co-occurrence characteristics, to optimize the feature representation;
S4: Knowledge atom extraction model construction: construct a deep learning model that combines CNN and RNN, introduce an attention mechanism into the model, and train the model using labeled educational text data;
S5: Knowledge atom extraction and screening: extract candidate knowledge atoms from educational texts using the trained model, and screen them based on the knowledge atom extraction standard in step S1;
S6: Knowledge atom verification and update: verify the screened knowledge atoms through manual review, comparison with the existing knowledge base, or evaluation by domain experts, and establish a dynamic update mechanism for knowledge atoms.

2. The method for standardized extraction of educational knowledge atoms based on human-in-the-loop according to claim 1, characterized in that the granularity definition principles in step S1 include: a knowledge atom should fully express an independent knowledge meaning and be understandable without additional explanation; and the granularity of the knowledge atom should be adaptively adjusted according to the educational stage and teaching objectives.

3. The method for standardized extraction of educational knowledge atoms based on human-in-the-loop according to claim 1, characterized in that the text preprocessing in step S2 specifically includes: source structuring, using the Python libraries pdfplumber and python-docx to segment the structure of PDF and Word documents respectively, while removing meaningless pages; data cleaning and standardization, to remove noisy data and unify text format, encoding, and capitalization; word segmentation, using dictionary-based, statistical model-based, or deep learning-based word segmentation methods for Chinese text, and splitting on spaces and punctuation marks for English text; part-of-speech tagging, using a rule-based or statistical model-based part-of-speech tagger to tag the part of speech of each segmented word; and named entity recognition, using a deep learning-based named entity recognition model to identify terminological named entities in the text.

4. The method for standardized extraction of educational knowledge atoms based on human-in-the-loop according to claim 1, characterized in that the feature representation in step S3 specifically includes: first, using the Word2Vec word vector model to map words in the text into low-dimensional vectors; then, performing dependency parsing on the text to extract subject-verb-object and attributive-adverbial-complement relationships; next, combining semantic role labeling to assign agent, patient, time, and place roles to the arguments of each predicate; and finally, using an improved TF-IDF method that integrates ontology concept similarity and concept co-occurrence characteristics to optimize the feature representation.

5. The method for standardized extraction of educational knowledge atoms based on human-in-the-loop according to claim 4, characterized in that the improved TF-IDF method that integrates ontology concept similarity and concept co-occurrence characteristics specifically includes: calculating the shortest path between concepts and the depth of their nearest common parent node in the ontology hierarchical network to obtain the semantic similarity of the concepts; constructing a set of similar concepts based on a word frequency threshold and a similarity threshold, and adjusting the feature word frequencies; constructing a word co-occurrence matrix, and extracting word pairs with a high co-occurrence rate to form a co-occurrence feature set; and fusing the adjusted set of similar concepts and the co-occurrence feature set to obtain the final feature set.

6. The method for standardized extraction of educational knowledge atoms based on human-in-the-loop according to claim 1, characterized in that, in the knowledge atom extraction model in step S4, the CNN extracts local text features through convolutional kernels of various sizes, and the RNN uses an LSTM or GRU structure to capture text context information and semantic dependencies; the outputs of the CNN and the RNN are concatenated or merged by weighting; and the attention mechanism assigns weights according to the importance of the text, enhancing the focus on key information.

7. The method for standardized extraction of educational knowledge atoms based on human-in-the-loop according to claim 1, characterized in that, in step S4, during model training, cross-entropy is used as the loss function; the optimization algorithm is stochastic gradient descent, Adagrad, Adadelta, or Adam; the learning rate is scheduled using a cosine decay strategy; and Dropout, L1 regularization, or L2 regularization is used to prevent overfitting.

8. The method for standardized extraction of educational knowledge atoms based on human-in-the-loop according to claim 1, characterized in that the knowledge atom screening in step S5 includes: screening the knowledge atoms extracted by the model according to the knowledge atom extraction standard, and removing candidate knowledge atoms that are semantically incomplete, vaguely expressed, or irrelevant to the education field.

9. A human-in-the-loop-based system for standardized extraction of educational knowledge atoms, characterized in that it comprises:
a text preprocessing module, used to perform source structuring, data cleaning and standardization, word segmentation, part-of-speech tagging, and named entity recognition on educational texts to obtain structured text data;
a feature representation module, used to first map words in the text into low-dimensional vectors based on a word vector model and obtain sentence structure information by combining dependency parsing and semantic role labeling, and then to optimize the feature representation using an improved TF-IDF method that integrates ontology concept similarity and concept co-occurrence characteristics;
a model building module, used to build a deep learning model that combines CNN and RNN, introduce an attention mechanism into the model, and train the model using labeled educational text data;
a knowledge atom extraction and screening module, used to extract candidate knowledge atoms from educational texts using the trained model and to screen them based on the knowledge atom extraction standard; and
a knowledge atom verification and update module, used to verify the screened knowledge atoms through manual review, comparison with the existing knowledge base, or evaluation by domain experts, and to establish a dynamic update mechanism for knowledge atoms.