A method and device for optimizing education-domain knowledge base search based on question generation
By generating questions from the educational knowledge base and performing semantic matching, the low recall rate of knowledge base search in the education field is addressed, thereby improving user experience and learning efficiency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ZHEJIANG LAB
- Filing Date
- 2023-09-28
- Publication Date
- 2026-06-23
Abstract
Description
Technical Field
[0001] This invention relates to the fields of educational knowledge bases, search, and question generation technologies, and in particular to a method and apparatus for optimizing the search of educational knowledge bases based on question generation.
Background Technology
[0002] Education is the foundation of a nation. With the advancement of educational informatization, intelligent education driven by internet technology is entering a period of rapid development. Education and knowledge are naturally linked, and knowledge graphs, as a key technological foundation of cognitive intelligence, play a decisive role in making education intelligent. Among these applications, knowledge search and question answering, both intelligent applications built on educational knowledge graphs, are a crucial link in intelligent education and among the functions that educators and learners rely on most. Currently, knowledge base search in the education field is limited to knowledge graph triplet data and often fails to meet users' actual needs. For example, when learners encounter unfamiliar knowledge areas, imprecise search terms often make results hard to retrieve, leaving user needs unmet. Therefore, effectively improving knowledge base search performance will greatly enhance the user experience, helping learners learn better and educators prepare lessons more conveniently.
[0003] Common knowledge base search optimization methods, such as directly using fuzzy search, query suggestions, or semantic matching, have limited effectiveness in the precision-critical field of education. Other approaches combine existing search technologies with knowledge bases, such as: 1. Matching search terms with processed strings from knowledge base content; 2. Calculating similarity between search terms and knowledge in the database after unified encoding using an optimized semantic model; 3. Utilizing the existing graph structure of the knowledge base for multi-hop auxiliary methods. These solutions still focus on improving recall within the current knowledge content and have limited effectiveness for ambiguity, polysemy, and sparsity issues.
[0004] Because knowledge must be represented in a standardized way, knowledge bases in the education field often contain only concise, unified terminology. Current methods are limited to matching or similarity calculation over text within the knowledge base, which handles searches that use precise terminology. However, they often struggle to match the search terms users actually enter, such as the question-style queries commonly used by learners. Simply optimizing matching performance with the above methods is insufficient to address this pain point and meet user needs. The present method, in addition to optimizing matching performance, uses question generation technology to expand the search database, greatly improving recall.
[0005] Other solutions include dynamic knowledge bases that attempt to improve low recall based on user feedback. However, these solutions sacrifice user experience in the early stages, and due to user inertia, they require long-term accumulation with minimal results.
Summary of the Invention
[0006] The purpose of this invention is to address the shortcomings of existing technologies by providing a method and apparatus for optimizing knowledge base search in the education field based on question generation, using question generation technology to remedy the low recall rate of knowledge base search results.
[0007] The objective of this invention is achieved through the following technical solution: a question-generation-based knowledge base search optimization method for the education field, comprising the following steps:
[0008] (1) Obtain educational domain texts by parsing domain corpora related to the education knowledge base;
[0009] (2) Use educational texts to perform pre-trained language model transfer learning to obtain a semantic model;
[0010] (3) Design fixed question-answer pair templates based on structured text information in the knowledge base to obtain question-answer pairs in the knowledge base;
[0011] (4) Train a question generation model using knowledge base question-answer pair data and open-source Chinese question-answer pair data, and deploy a question generation reasoning service;
[0012] (5) Generate question-answer pairs to expand the knowledge base: Use the question generation reasoning service to process texts in the education field, generate questions, and obtain question-answer pair data; use the semantic model to semantically encode both the entity node text in the structured text information of the knowledge base and the question text in the question-answer pairs, and build a vector library that is used for semantic similarity calculation once a user query is entered;
[0013] (6) Online semantic matching to retrieve the best results: The semantic model is used to encode the user query online, and the query code is used to retrieve the result with the highest semantic similarity in the vector library; if it is the text of the knowledge base entity node, the structured text information corresponding to the knowledge base entity node is returned; if it is the question text, the corresponding answer text is returned.
[0014] Further, step (1) specifically involves: obtaining textbooks, courseware, and professional-related web page content for the corresponding subject area; performing OCR recognition or layout analysis on the textbooks and courseware; performing web page content crawling and parsing to obtain complete text content; removing invalid characters, duplicate content, and syntactically incoherent corpora to obtain educational text.
[0015] Further, step (2) specifically involves: using pre-trained language models for natural language understanding such as BERT, ELECTRA, and ALBERT, conducting MLM and MPNet pre-training tasks with educational texts, performing transfer learning experiments, and obtaining a semantic model.
[0016] Furthermore, in step (3), the structured text information is triplet structured data.
[0017] Furthermore, by utilizing the relationships or attributes between entities in the triplet structure data, fixed question-answer pair templates can be designed.
[0018] Furthermore, step (3) also includes: using the prerequisite and successor relationships between knowledge points to design questions and obtain knowledge base question-answer pairs.
[0019] Furthermore, step (4) includes the following sub-steps:
[0020] (4.1) Search for open-source question-answer pairs corpora, process the different corpora in terms of format, convert them into the question-answer format consistent with the knowledge base question-answer pairs and save them, and merge them with the knowledge base question-answer pairs as training samples for question-answer pairs.
[0021] (4.2) Enhance the training samples of question-answer pairs: construct negative samples by using the training samples of question-answer pairs in step (4.1) as positive sample data, construct difficult negative samples by using contrastive learning methods, and construct simple negative samples by replacing keywords or changing the word order of samples; increase the proportion of positive and negative sample data of question-answer pairs in the knowledge base to improve domain capabilities.
[0022] (4.3) Construct a prompt template and learn the question generation task on top of a generative pre-trained language model: divide the enhanced training samples into a training set and an evaluation set, feed the model input in the format of SEP character plus answer text plus CLS character plus prompt template characters, and use the question text as the output. Train the model on the training set and evaluate it on the evaluation set to obtain the optimal question generation model.
[0023] (4.4) Construct a question generation service using the optimal question generation model combined with a score threshold; specifically: the optimal question generation model outputs the question text and corresponding score, and filters according to the score threshold. If the score is greater than or equal to the score threshold, it is retained; otherwise, it is removed; the score threshold is 0.7 to 0.9.
[0024] Furthermore, in step (4.3), the generative pre-trained language model is a GPT, T5, or BART model.
[0025] An educational domain knowledge base search optimization device based on question generation includes one or more processors for implementing the aforementioned educational domain knowledge base search optimization method based on question generation.
[0026] A computer-readable storage medium having a program stored thereon, which, when executed by a processor, is used to implement the above-described question-generation-based knowledge base search optimization method for the education domain.
[0027] The beneficial effects of this invention are: it greatly improves the recall rate of search results returned to users, enhances learning efficiency, and improves user experience.
Attached Figure Description
[0028] To more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0029] Figure 1 is a schematic diagram of the overall framework of the method of the present invention;
[0030] Figure 2 is a diagram of the construction and inference structure of the question generation service;
[0031] Figure 3 is a schematic diagram of the internal structure of the question generation model;
[0032] Figure 4 is a diagram of the overall architecture of the online search module;
[0033] Figure 5 is a hardware structure diagram of the present invention.
Detailed Implementation
[0034] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the invention as detailed in the appended claims.
[0035] The present invention will now be described in detail with reference to the accompanying drawings. Unless otherwise specified, the features of the following embodiments and implementations can be combined with each other.
[0036] This invention processes readily available domain-specific corpora in the education field to obtain domain-specific text. Then, it utilizes question generation technology to expand the education domain knowledge base. Simultaneously, it optimizes retrieval performance using a semantic model of the data source, and further employs data augmentation and threshold filtering strategies to effectively address the low recall rate issue in education domain knowledge base searches.
[0037] This invention provides a question-generation-based knowledge base search optimization method for the education field. As shown in Figure 1, it includes the following steps:
[0038] (1) Obtain educational domain texts by parsing domain corpora related to the education knowledge base;
[0039] Taking robotics as an example, we explore resources within the field. We collect textbooks used in robotics-related courses; text format is used directly, while PDF textbooks are extracted using layout analysis or OCR techniques. Other resources, such as content from professional robotics websites, encyclopedia pages, and school intranet educational resources, are also collected. Downloadable content (PowerPoint presentations, videos, etc.) is extracted using parsing techniques, while non-downloadable content is crawled. After obtaining the raw text, we perform a series of text cleaning and analysis steps to remove invalid characters, duplicate content, and syntactically incoherent corpora, compiling the final text for the robotics field.
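The cleaning pass described above can be sketched as follows. This is a minimal illustration only: the function name `clean_corpus`, the `min_chars` cutoff, and the use of a length check as a stand-in for detecting "syntactically incoherent" text are assumptions and are not specified in the source.

```python
import re
import unicodedata


def clean_corpus(lines, min_chars=5):
    """Remove invalid characters, exact duplicates, and very short fragments."""
    seen = set()
    cleaned = []
    for line in lines:
        # Normalise compatibility forms (full-width letters, ligatures, ...).
        text = unicodedata.normalize("NFKC", line)
        # Turn every whitespace run (tabs, newlines) into a single space first,
        # so that dropping control characters below cannot glue words together.
        text = re.sub(r"\s+", " ", text)
        # Drop control/format/unassigned characters (Unicode category C*).
        text = "".join(ch for ch in text if unicodedata.category(ch)[0] != "C")
        text = re.sub(r"\s+", " ", text).strip()
        if len(text) < min_chars:
            continue  # assumed proxy for incoherent fragments
        if text in seen:
            continue  # duplicate content
        seen.add(text)
        cleaned.append(text)
    return cleaned
```

A real pipeline would likely replace the length check with a language-model or rule-based coherence filter; the structure of the pass stays the same.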
[0040] (2) Use educational texts to perform pre-trained language model transfer learning to obtain a semantic model;
[0041] Based on a pre-trained language model, a semantic model with better performance in the robotics field is obtained using transfer learning. Using the obtained robotics-related text, a Chinese sentence-BERT model is used as the backbone model for fine-tuning. Sufficient training is performed using different pre-training tasks such as MLM and MPNet, and different hyperparameters. The resulting semantic model demonstrates superior semantic understanding capabilities in the robotics field. The testing method involves using a small, manually prepared word and sentence similarity test set, such as comparing the relative similarity scores of entity words related to robot dynamics with those related to robot statics and rigid bodies, to test the different trained semantic models.
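One way to run the relative-similarity test described above is to check, for hand-written triples, whether the model scores the anchor closer to the related term than to the unrelated one. The sketch below assumes an `embed` callable that maps text to a vector; the function names and the triple layout are illustrative, not taken from the source.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relative_similarity_accuracy(embed, triples):
    """Share of (anchor, closer, farther) triples the model ranks correctly.

    `embed` is any callable returning a vector for a piece of text; in practice
    it would wrap the fine-tuned semantic model (an assumption of this sketch).
    """
    if not triples:
        return 0.0
    hits = 0
    for anchor, closer, farther in triples:
        a = embed(anchor)
        if cosine(a, embed(closer)) > cosine(a, embed(farther)):
            hits += 1
    return hits / len(triples)
```

Candidate semantic models trained with different tasks or hyperparameters can then be compared by this single accuracy figure on the same small test set.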
[0042] (3) Design fixed question-answer pair templates based on structured text information in the knowledge base to obtain question-answer pairs in the knowledge base;
[0043] The robot knowledge base primarily consists of triplet data from a constructed robotics knowledge graph, mainly containing relationships between different entities and their corresponding attribute information. For example, entities of the theoretical knowledge point class have the attribute "definition," which yields the question: "What is the definition of this entity?" The corresponding attribute value becomes the answer, producing attribute-based question-answer pairs. Entities of the theoretical knowledge point class and practical skills-based knowledge points have a supporting relationship, which yields the question: "What relationship exists between these entities?" The corresponding relationship category becomes the answer, producing relationship-based question-answer pairs. Furthermore, the prerequisite and successor relationships between knowledge points give a question pattern for prerequisite knowledge points. For example, the entity "rigid body attributes" is a prerequisite knowledge point for the entity "rigid body dynamics," leading to the question: "What prerequisite knowledge points are included in robot dynamics?" The answer is: "rigid body dynamics, robot motion." Other structured information can be used in the same way to design question-answer pair templates, ultimately yielding the knowledge base question-answer pairs.
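The template step can be illustrated with a short sketch. The template wording follows the examples in the paragraph above, while the triple layout `(head, relation, tail)`, the relation label `prerequisite_of`, and the function name are assumptions made for illustration.

```python
from collections import defaultdict

ATTRIBUTE_TEMPLATE = "What is the {attribute} of {entity}?"
RELATION_TEMPLATE = "What relationship exists between {head} and {tail}?"
PREREQ_TEMPLATE = "What prerequisite knowledge points are included in {entity}?"
PREREQ_RELATION = "prerequisite_of"  # hypothetical relation label


def templated_qa_pairs(attribute_triples, relation_triples):
    """Turn knowledge-graph triples into fixed-template question-answer pairs."""
    pairs = []
    # Attribute triples: (entity, attribute, value) -> the value is the answer.
    for entity, attribute, value in attribute_triples:
        question = ATTRIBUTE_TEMPLATE.format(attribute=attribute, entity=entity)
        pairs.append({"question": question, "answer": value})
    # Relation triples: (head, relation, tail).
    prerequisites = defaultdict(list)
    for head, relation, tail in relation_triples:
        if relation == PREREQ_RELATION:
            # Group all prerequisites of the same knowledge point into one answer.
            prerequisites[tail].append(head)
        else:
            question = RELATION_TEMPLATE.format(head=head, tail=tail)
            pairs.append({"question": question, "answer": relation})
    for entity, heads in prerequisites.items():
        question = PREREQ_TEMPLATE.format(entity=entity)
        pairs.append({"question": question, "answer": ", ".join(heads)})
    return pairs
```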
[0044] (4) Train a question generation model using knowledge base question-answer pair data and open-source Chinese question-answer pair data, and deploy a question generation reasoning service;
[0045] Open-source question-answer (QA) corpora are surveyed and selected, acquiring high-quality Chinese open-source QA corpora similar to those in the robotics field. SQuAD-2.0, the most commonly used dataset in the QA domain, is used to learn the semantic logic of QA pairs, and the webQA dataset is used to learn real users' questioning styles. These are combined with the constructed robotics knowledge base QA data to form the training samples. The training samples are then augmented: they serve as positive samples from which negative samples are constructed. Contrastive learning methods are used to construct difficult negative samples, for example using PGD (Projected Gradient Descent) for adversarial training. Rule-based methods are also employed, such as replacing key entities in the question text and adjusting the order of short clauses in the answer text, to construct simple negative samples. Additionally, since QA texts in the robotics field are few compared with the open-source corpora, the proportion of positive and negative samples from the knowledge base QA data is increased to enhance domain-specific capability.
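The rule-based construction of simple negative samples can be sketched as below. The helper names, the `label` field, and the choice to rotate clauses (rather than shuffle them) so the result always differs from the original are assumptions; the adversarial (PGD) hard negatives are not covered here.

```python
import random


def replace_entity(question, entity, entity_pool, rng):
    """Swap the key entity in a question for a different entity from the pool."""
    candidates = [e for e in entity_pool if e != entity]
    if entity not in question or not candidates:
        return None
    return question.replace(entity, rng.choice(candidates))


def reorder_clauses(answer, sep=", "):
    """Rotate the short clauses of an answer by one position."""
    clauses = answer.split(sep)
    if len(clauses) < 2:
        return None
    return sep.join(clauses[1:] + clauses[:1])


def simple_negatives(pair, entity, entity_pool, seed=0):
    """Build rule-based negative samples (label 0) from one positive QA pair."""
    rng = random.Random(seed)
    negatives = []
    swapped = replace_entity(pair["question"], entity, entity_pool, rng)
    if swapped is not None:
        negatives.append({"question": swapped, "answer": pair["answer"], "label": 0})
    rotated = reorder_clauses(pair["answer"])
    if rotated is not None and rotated != pair["answer"]:
        negatives.append({"question": pair["question"], "answer": rotated, "label": 0})
    return negatives
```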
[0046] Preferably, the p-tuning method is used to train the question generation model. An open-source generative pre-trained model such as GPT3, T5, or BART is selected as the backbone model for experiments. As shown in Figure 2, a multi-layer Transformer-style encoder-decoder structure is used, with the `answer` field as input and the corresponding `question` as output text. The model learns real users' questioning habits and performs the question generation task on top of a large-scale generative pre-trained model. The augmented training samples are divided into a training set and an evaluation set. Using a prompt template method, the model input consists of a SEP character, the answer text, a CLS character, and the prompt template characters; the SEP character plus the question text serves as the output label. The training set is used for model learning, and comparative experiments with multiple models and parameter settings yield several candidate question generation models. The evaluation set is used to evaluate these candidates by calculating BLEU and ROUGE metrics, and the optimal question generation model is selected.
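The input and label layout described above can be made concrete with a small sketch. The literal token strings `[SEP]` and `[CLS]`, the prompt placeholder tokens, and the 10% evaluation split are assumptions; the source specifies only the ordering of the parts.

```python
import random

SEP, CLS = "[SEP]", "[CLS]"  # assumed BERT-style special-token strings


def build_example(answer, question, prompt_tokens):
    """Format one training example: SEP + answer + CLS + prompt -> SEP + question."""
    model_input = f"{SEP}{answer}{CLS}{''.join(prompt_tokens)}"
    label = f"{SEP}{question}"
    return model_input, label


def split_samples(samples, eval_ratio=0.1, seed=0):
    """Shuffle deterministically and split into (training set, evaluation set)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, round(len(shuffled) * eval_ratio))
    return shuffled[n_eval:], shuffled[:n_eval]
```

With p-tuning, the prompt tokens would be trainable placeholders whose embeddings are learned; here they are plain strings so that the layout is visible.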
[0047] A question generation reasoning service is built from the inference capability of the question generation model together with related rules. The model inference module is constructed and deployed, and its effectiveness is tested using texts from the education domain. Multiple question texts and their corresponding scores are obtained, and the scores are evaluated against the corresponding performance metrics. The generation rules are then adjusted, for example by setting a score threshold (ranging from 0.7 to 0.9, chosen according to the scores and observed performance) and removing results below the threshold. Figure 3 shows the overall architecture of the question generation service, which includes offline model training and online inference. Offline, samples are built from open-source datasets and knowledge base datasets for model training experiments. After the optimal model is obtained, the service is deployed online; robotics-related text and knowledge base nodes are then encoded and stored in a database, which serves as the retrieval library for the semantic engine.
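The threshold rule can be sketched in a few lines. The function name and the `(question, score)` tuple layout are assumptions; the 0.7 to 0.9 range and the "greater than or equal" comparison come from the text.

```python
def filter_by_score(candidates, threshold=0.8):
    """Keep generated questions whose score is at least the threshold.

    `candidates` is a list of (question_text, score) tuples as produced by the
    question generation model (layout assumed for this sketch).
    """
    if not 0.7 <= threshold <= 0.9:
        raise ValueError("threshold is expected to lie in the range 0.7 to 0.9")
    return [(question, score) for question, score in candidates if score >= threshold]
```

A higher threshold trades recall of generated questions for precision, which is why the text leaves the exact value to be tuned against observed performance.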
[0048] (5) Generate question-answer pairs to expand the knowledge base: Use the constructed question generation reasoning service to process factual statements in educational texts, generate questions, and obtain question-answer pair data; use the semantic model obtained in step (2) to semantically encode the entity node text in the structured text information of the knowledge base and the question text in the question-answer pairs to construct a vector library. As shown in the overall online search architecture in Figure 4, the offline part uses the semantic model to encode texts and build the vector database, while the online part encodes the user's query and performs high-dimensional dense vector retrieval over the complete vector database, returning the result with the highest cosine similarity.
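A brute-force version of the vector library illustrates the retrieval step. The class and method names are assumptions, and the vectors would in practice come from the semantic model of step (2); a production system would more likely use an approximate nearest-neighbour index than this linear scan.

```python
import math


class VectorLibrary:
    """In-memory store of unit vectors with top-1 cosine-similarity search."""

    def __init__(self):
        self._items = []  # list of (unit_vector, payload)

    @staticmethod
    def _normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        if norm == 0:
            raise ValueError("zero vector cannot be normalised")
        return [x / norm for x in vector]

    def add(self, vector, payload):
        """Offline step: store an encoded entity-node text or question text."""
        self._items.append((self._normalize(vector), payload))

    def search(self, query_vector):
        """Online step: return (payload, cosine similarity) of the best match."""
        if not self._items:
            raise LookupError("vector library is empty")
        q = self._normalize(query_vector)
        # On unit vectors the dot product equals the cosine similarity.
        scored = ((sum(a * b for a, b in zip(q, v)), p) for v, p in self._items)
        score, payload = max(scored, key=lambda item: item[0])
        return payload, score
```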
[0049] (6) Online semantic matching to retrieve the best results: The semantic model is used to encode the user query online, and the query encoding is used to retrieve the result with the highest semantic similarity in the vector knowledge base. If it is knowledge base node text, the structured text information corresponding to the knowledge base node is returned; if it is question text, the corresponding answer text field is returned.
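The branching on the type of the best match can be sketched as follows. The record layout (`kind`, `text`) and the two lookup dictionaries are hypothetical; the source specifies only which content is returned for each kind of hit.

```python
def resolve_hit(hit, entity_info, qa_answers):
    """Map the best-matching vector-library record to the content returned to the user.

    hit         -- {"kind": "entity" | "question", "text": str} (assumed layout)
    entity_info -- entity node text -> structured text information (e.g. triples)
    qa_answers  -- generated or templated question text -> answer text
    """
    kind, text = hit["kind"], hit["text"]
    if kind == "entity":
        # Knowledge base entity node: return its structured information.
        return entity_info[text]
    if kind == "question":
        # Question text: return the paired answer text.
        return qa_answers[text]
    raise ValueError(f"unknown record kind: {kind!r}")
```

Keeping the record kind alongside each vector lets entity nodes and generated questions share one vector library while still returning different content.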
[0050] Corresponding to the aforementioned embodiment of a question-generation-based knowledge base search optimization method for the education field, the present invention also provides an embodiment of a question-generation-based knowledge base search optimization device for the education field.
[0051] Referring to Figure 5, the present invention provides an educational domain knowledge base search optimization device based on question generation, comprising one or more processors for implementing the educational domain knowledge base search optimization method based on question generation described in the above embodiments.
[0052] The processor can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor can be a microprocessor or any conventional processor.
[0053] An embodiment of the question-generation-based knowledge base search optimization device for the education field according to the present invention can be applied to any device with data processing capabilities, such as a computer. The device embodiment can be implemented in software, hardware, or a combination of both. Taking software implementation as an example, the device, as a logical device, is formed by the processor of the data processing device on which it resides loading the corresponding computer program instructions from non-volatile memory into memory for execution. From a hardware perspective, Figure 5 shows a hardware structure diagram of a device with data processing capabilities on which the question-generation-based knowledge base search optimization device of the present invention resides. In addition to the processor, memory, network interface, and non-volatile memory shown in Figure 5, the data processing device in the embodiment may also include other hardware depending on its actual function, which will not be described in detail here.
[0054] The specific implementation process of the functions and roles of each unit in the above device can be found in the implementation process of the corresponding steps in the above method, and will not be repeated here.
[0055] For the device embodiments, since they basically correspond to the method embodiments, the relevant parts can be referred to in the description of the method embodiments. The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of the present invention according to actual needs. Those skilled in the art can understand and implement this without creative effort.
[0056] This invention also provides a computer-readable storage medium storing a program thereon, which, when executed by a processor, implements the question-generation-based knowledge base search optimization method for the education field described in the above embodiments.
[0057] The computer-readable storage medium can be an internal storage unit of any data processing device described in any of the foregoing embodiments, such as a hard disk or memory. The computer-readable storage medium can also be an external storage device of any data processing device, such as a plug-in hard disk, smart media card (SMC), SD card, or flash card equipped on the device. Furthermore, the computer-readable storage medium can include both the internal storage units and the external storage devices of any data processing device. The computer-readable storage medium is used to store the computer program and other programs and data required by the data processing device, and can also be used to temporarily store data that has been output or will be output.
[0058] The above embodiments are only used to illustrate the design concept and features of the present invention, and their purpose is to enable those skilled in the art to understand the content of the present invention and implement it accordingly. The protection scope of the present invention is not limited to the above embodiments. Therefore, all equivalent changes or modifications made based on the principles and design ideas disclosed in the present invention are within the protection scope of the present invention.
Claims
1. A method for optimizing knowledge base search in the education field based on question generation, characterized in that it includes the following steps:
(1) Obtain educational domain texts by parsing domain corpora related to the education knowledge base;
(2) Use educational texts to perform pre-trained language model transfer learning to obtain a semantic model;
(3) Design fixed question-answer pair templates based on structured text information in the knowledge base to obtain knowledge base question-answer pairs; in step (3), the structured text information is triplet structured data, and fixed question-answer pair templates are designed using the relationships or attributes between entities in the triplet structured data; step (3) also includes: using the prerequisite and successor relationships between knowledge points to design questions and obtain knowledge base question-answer pairs;
(4) Train a question generation model using knowledge base question-answer pairs and open-source Chinese question-answer pairs, and deploy a question generation reasoning service; step (4) includes the following sub-steps:
(4.1) Search for open-source question-answer pair corpora, process the different corpora in terms of format, convert them into the question-answer format consistent with the knowledge base question-answer pairs and save them, and merge them with the knowledge base question-answer pairs as training samples for question-answer pairs;
(4.2) Enhance the training samples of question-answer pairs: construct negative samples using the training samples of question-answer pairs in step (4.1) as positive sample data, construct difficult negative samples using contrastive learning methods, and construct simple negative samples by replacing keywords or changing the word order of samples; increase the proportion of positive and negative sample data of question-answer pairs in the knowledge base to improve domain capabilities;
(4.3) Construct a prompt template and learn the question generation task on top of a generative pre-trained language model: divide the enhanced training samples into a training set and an evaluation set, feed the model input in the format of SEP character plus answer text plus CLS character plus prompt template characters, and use the question text as the output; train the model on the training set and evaluate it on the evaluation set to obtain the optimal question generation model;
(4.4) Construct a question generation service using the optimal question generation model combined with a score threshold; specifically: the optimal question generation model outputs the question text and corresponding score, and filters according to the score threshold; if the score is greater than or equal to the score threshold, it is retained; otherwise, it is removed; the score threshold is 0.7 to 0.9;
(5) Generate question-answer pairs to expand the knowledge base: Use the question generation reasoning service to process texts in the education field, generate questions, and obtain question-answer pair data; use the semantic model to semantically encode both the entity node texts in the structured text information of the knowledge base and the question texts in the question-answer pairs, and construct a vector library that is used for semantic similarity calculation once a user query is entered;
(6) Online semantic matching to retrieve the best results: The semantic model is used to encode the user query online, and the query encoding is used to retrieve the result with the highest semantic similarity in the vector library; if it is the text of a knowledge base entity node, the structured text information corresponding to the knowledge base entity node is returned; if it is question text, the corresponding answer text is returned.
2. The method for optimizing knowledge base search in the education field based on question generation according to claim 1, characterized in that step (1) is specifically as follows: obtain textbooks, courseware, and professional web page content for the corresponding subject area, perform OCR recognition or layout analysis on the textbooks and courseware, perform web page content crawling and parsing, obtain complete text content, remove invalid characters, duplicate content, and syntactically incoherent corpora, and obtain educational field text.
3. The method for optimizing knowledge base search in the education field based on question generation according to claim 1, characterized in that step (2) is specifically as follows: based on the BERT, ELECTRA, and ALBERT natural language understanding pre-trained language models, use educational texts to perform MLM and MPNet pre-training tasks, conduct transfer learning experiments, and obtain a semantic model.
4. The method for optimizing knowledge base search in the education field based on question generation according to claim 1, characterized in that, in step (4.3), the generative pre-trained language model is a GPT, T5, or BART model.
5. A knowledge base search optimization device for the education field based on question generation, characterized in that it includes one or more processors for implementing the question-generation-based knowledge base search optimization method for the education domain according to any one of claims 1-4.
6. A computer-readable storage medium having a program stored thereon, characterized in that, when executed by a processor, the program is used to implement the question-generation-based knowledge base search optimization method for the education domain according to any one of claims 1-4.