A multimodal teaching content recommendation system and methods of establishing and using the same

CN119917648B · Active · Publication Date: 2026-06-26 · EAST CHINA UNIV OF SCI & TECH +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
EAST CHINA UNIV OF SCI & TECH
Filing Date
2024-12-13
Publication Date
2026-06-26


Abstract

The present application provides a method for establishing a multimodal teaching content recommendation system, comprising: constructing a knowledge base and generating category labels; meanwhile, preparing a training set and a test set according to the text content of each slice, constructing a vocabulary, and performing text encoding on the training set and test set; training a Transformer encoder architecture; establishing a user profile function based on the input question and historical dialogue to generate a user profile; and integrating the multimodal teaching content recommendation system so that, according to the input question and the historical dialogue, it obtains the question-related slices and the slices of interest to the user, and retrieves and outputs the corresponding course resources from the knowledge base. The present application also provides a corresponding system and method of use. The method combines the text understanding and generation capability of a large model with the semantic classification and retrieval capability of the encoder architecture, which can effectively improve the retrieval accuracy for user questions and recommend a personalized learning route for users.

Description

Technical Field

[0001] This invention relates to the fields of information processing technology and recommendation systems, and in particular to a multimodal teaching content recommendation system and its establishment and usage methods.

Background Technology

[0002] In recent years, large language models have revolutionized multiple fields, with a particularly strong presence in Natural Language Processing (NLP). These models, pre-trained on massive datasets, learn rich linguistic features and world knowledge, enabling them to understand and generate unprecedentedly complex text. Their capabilities extend beyond basic text analysis to complex semantic understanding and generation tasks, providing unprecedented opportunities for industries including educational technology, especially in the automated extraction of instructional content, knowledge structuring, and interactive learning assistance.

[0003] Furthermore, in the field of education, traditional "one-size-fits-all" teaching methods are increasingly unable to meet the ever-diversifying learning needs. Modern educational philosophies emphasize "individualized instruction," meaning providing personalized teaching content based on each student's abilities, interests, and learning pace. This approach requires educators not only to understand each student's specific needs but also to be able to provide corresponding teaching resources, which presents significant challenges in practice. Therefore, there is an urgent need to leverage technology to achieve personalized retrieval and recommendation of teaching content to improve teaching effectiveness and student satisfaction.

[0004] Currently, combining large language model technology with a private-domain course resource knowledge base and a workflow that branches to answer user questions can effectively integrate the course's own information and materials and improve the accuracy of answers about course knowledge. However, relying solely on keywords or on the large model's own capabilities may cause the workflow to fail to connect to the correct slice of the knowledge base. The system then falls back on the large model's own knowledge, or even misinterprets the information, and outputs answers that do not match the actual course content. A preprocessing method based on large-model data balancing, summary generation, and the chi-square test can effectively construct a vocabulary and assist text encoding; a neural network then classifies the question to locate the content to retrieve, thereby effectively addressing the problem of large-model misinterpretation.

[0005] Most existing recommender systems employ collaborative filtering or content-based recommendation methods. However, with the widespread use of user-generated content (UGC) and the development of large-scale models, effectively utilizing historical dialogue information and generating personalized recommendations through deep learning algorithms has become a crucial issue. Traditional collaborative filtering may face sparsity problems, while personalized recommendations based on user dialogue content can more accurately understand user needs and provide relevant content.

[0006] Therefore, it is necessary to provide a new multimodal teaching content recommendation system that recommends relevant knowledge to users based on their question history.

Summary of the Invention

[0007] The purpose of this invention is to provide a multimodal teaching content recommendation system, its establishment method, and its usage method, so as to effectively improve the retrieval accuracy of user question information through the combination of deep learning and recommendation technology.

[0008] To achieve the above objectives, the present invention provides a method for establishing a multimodal teaching content recommendation system, comprising:

[0009] S100: Prepare course resources;

[0010] S200: Construct a knowledge base with multiple slices based on course resources and generate category labels for identifying slices. The knowledge base is used to output course resources corresponding to the slices to the user based on the category labels. At the same time, prepare a training set and a test set based on the text content of each slice, construct a vocabulary, and perform text encoding on the training set and test set.

[0011] S300: Using the text encoding result as input sample, train the encoder architecture of Transformer to obtain a trained neural network model. The neural network model is used to identify the user's input question to obtain the category label of the question-related slice.

[0012] S400: Establish a user profile generation function. The user profile generation function is set to extract user preference information based on the user's input questions and historical dialogues, and generate a user profile based on the user preference information. The user profile includes slices that the user is interested in.

[0013] S500: Integrate the knowledge base, the neural network model, and the user profile generation function into a multimodal teaching content recommendation system. This system obtains the question-related slices and the slices that the user is interested in based on the user's input questions and historical dialogues, and retrieves and outputs the corresponding course resources from the knowledge base.

[0014] Step S100 includes:

[0015] Provide course materials and video recordings in PDF or Markdown format as course resources;

[0016] Using gpt-4-vision-preview and the Tongyi Tingwu model, generate image text descriptions from the images in the course materials and video text summaries from the course explanation videos.

[0017] Step S200 further includes:

[0018] S210: The textual materials of the teaching materials are segmented and organized according to chapters and sections to form a course text knowledge base that is easy to manage and retrieve; then, an image knowledge base and a video knowledge base are established, in which the relational data of the image knowledge base and the video knowledge base includes extracted image text descriptions and video text summaries, as well as image addresses and video addresses; then, the chapters and sections are combined as category labels for the prediction model.

[0019] S230: Prepare training and testing sets based on the text content of each slice, and construct a vocabulary list based on high-frequency words in the text content; wherein, the training set includes the segmented text and corresponding category labels; the testing set includes test questions and category labels corresponding to each test question; and perform text encoding on the segmented text and test questions based on the vocabulary list.

[0020] After step S210 and before step S230, step S220 is also included: using a large language model to perform text balancing on the text content of each chapter and section, and generating summary text; and/or after step S230, step S240 is also included: padding the training samples and test samples obtained after text encoding to ensure consistent length.
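The text encoding of step S230 and the padding of step S240 can be sketched as follows. This is a minimal illustration only, assuming the text has already been segmented into tokens; the special tokens `<pad>` and `<unk>`, the function names, and the truncation of over-long samples are hypothetical choices that the source does not specify.

```python
from typing import Dict, List

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens, not named in the source


def build_vocab(tokenized_texts: List[List[str]], min_freq: int = 1) -> Dict[str, int]:
    """Map each sufficiently frequent token to an integer index."""
    counts: Dict[str, int] = {}
    for tokens in tokenized_texts:
        for tok in tokens:
            counts[tok] = counts.get(tok, 0) + 1
    vocab = {PAD: 0, UNK: 1}
    for tok in sorted(counts):  # sorted for a reproducible index assignment
        if counts[tok] >= min_freq:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Replace tokens by their indices; unknown tokens map to <unk>."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]


def pad_batch(seqs: List[List[int]], max_len: int, pad_id: int = 0) -> List[List[int]]:
    """Right-pad (or truncate) every sequence to exactly max_len indices."""
    return [s[:max_len] + [pad_id] * (max_len - len(s)) for s in seqs]
```

Padding to a common length lets the encoded samples be stacked into a single tensor per batch.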

[0021] Step S300 specifically includes:

[0022] S320: Configure and initialize the neural network model of the Transformer encoder architecture, which includes a word embedding layer, a positional encoding layer, a Transformer encoder layer, and an output layer; the word embedding layer maps the index of each word in the input sample to a 128-dimensional vector to obtain a first word embedding tensor; the positional encoding layer captures the relationships between word positions by combining the first word embedding tensor with sine and cosine positional encoding to obtain a third result tensor containing positional information; the Transformer encoder layer applies a multi-head attention mechanism to obtain a fourth tensor, the encoder-layer output; the output layer converts the output of the Transformer encoder layer into a score for each category of the sample;

[0023] The third result tensor containing positional information is obtained by adding the second position tensor to the first word embedding tensor corresponding to the training sample; in the positional encoding tensor from which the second position tensor is derived (before dimensional transformation), the positional encoding of each position pos in dimension i is calculated as follows:

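[0024] If i is even: $PE(pos, i) = \sin\left(pos / 10000^{\,i / d_{model}}\right)$

[0025] If i is odd: $PE(pos, i) = \cos\left(pos / 10000^{\,(i-1) / d_{model}}\right)$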

[0026] Where pos is the position index, i is the dimension index, and d_model is the dimension of the word embedding.
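As an illustration, the sine and cosine positional encoding can be computed as below. This sketch assumes the standard sinusoidal scheme, in which even dimensions use the sine and odd dimensions use the cosine of pos / 10000^(2k / d_model), with 2k the even index at or just below i; the function name and the list-of-lists layout are hypothetical.

```python
import math
from typing import List


def positional_encoding(max_len: int, d_model: int) -> List[List[float]]:
    """Return a max_len x d_model table of sinusoidal position encodings."""
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(d_model):
            # Each (even, odd) pair of dimensions shares one frequency.
            exponent = (i - (i % 2)) / d_model
            angle = pos / (10000 ** exponent)
            table[pos][i] = math.sin(angle) if i % 2 == 0 else math.cos(angle)
    return table


# 128 matches the word embedding dimension stated in step S320.
pe = positional_encoding(max_len=4, d_model=128)
```

Row `pos` of this table is what gets added to the word embedding at position `pos`.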

[0027] S330: The neural network model is trained using the training tensor from the text encoding result. The cross-entropy loss function CELLoss is used as the loss function and the Adam optimizer is used for training. A confidence threshold is set for each category. The neural network model outputs the prediction result only when the predicted probability of the category exceeds this confidence threshold.
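The confidence-threshold rule in step S330 can be sketched as follows. This is only an illustration under assumptions: the class scores are converted to probabilities with a softmax, class indices are 0-based, and the threshold values are made up for the example; the source does not state the actual thresholds.

```python
import math
from typing import List, Optional, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_threshold(
    scores: List[float], thresholds: List[float]
) -> Tuple[Optional[int], float]:
    """Return (class index, probability), or (None, probability) when the
    top class does not exceed its own confidence threshold."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] > thresholds[best]:
        return best, probs[best]
    return None, probs[best]
```

Returning `None` lets the calling workflow decide what to do when no category is predicted confidently enough.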

[0028] Before step S320, step S310 is included: loading training samples and test samples from the text encoding result to obtain a training dataset and a test dataset, with the training samples shuffled during loading so that their order is random in each training cycle; and/or after step S330, step S340 is included: testing the neural network model using test samples from the text encoding result, and evaluating the model performance of the trained neural network model based on the test results.

[0029] Step S400 includes:

[0030] S410: Establish a user preference information extraction module, which is configured to analyze user preference information based on the user's input questions and historical dialogues; the user preference information includes preference for text, images or videos, and whether the current question is relevant to the course content;

[0031] S420: Establish a user profile generation function, which is configured to collect historical conversations and call the LLMChain chain to generate a user profile, which includes slices of interest to the user.

[0032] The function for generating user profiles is obtained in the following way:

[0033] S421: Select the recommendation model;

[0034] S422: Construct a chat prompt template, the chat prompt template including prompt words for generating user profiles;

[0035] S423: Define an LLMChain chain, which is configured to receive user input questions, historical conversations, and created chat prompt templates, and pass the chat prompt templates as input prompt words to the recommendation model, so that the recommendation model can generate a corresponding user profile based on this, wherein the user profile includes chapters and sections that the user is interested in.
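The flow of steps S421 to S423 can be sketched without any framework dependency. In the sketch below, a plain function stands in for the LLMChain, the recommendation model is any callable that maps a prompt string to a reply, and the prompt wording is hypothetical; none of these names come from the source.

```python
from typing import Callable, List

# Hypothetical prompt template; the source does not give the actual wording.
PROFILE_TEMPLATE = (
    "You are a course assistant. Based on the dialogue history and the new "
    "question, list the chapters and sections the user is most interested in.\n"
    "History:\n{history}\n"
    "Question: {question}\n"
    "User profile:"
)


def make_profile_chain(
    model: Callable[[str], str], template: str = PROFILE_TEMPLATE
) -> Callable[[str, List[str]], str]:
    """Bind a model and a prompt template into one callable, mimicking a chain."""

    def run(question: str, history: List[str]) -> str:
        prompt = template.format(history="\n".join(history), question=question)
        return model(prompt)

    return run


def echo_model(prompt: str) -> str:
    """Stand-in for the recommendation model: returns the prompt unchanged."""
    return prompt


chain = make_profile_chain(echo_model)
preview = chain("What is an ERP system?", ["Q: What is MIS?", "A: See chapter 1."])
```

Swapping `echo_model` for a real model client would leave the rest of the chain unchanged.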

[0036] In another aspect, the present invention provides a multimodal teaching content recommendation system, which is established using the method for establishing a multimodal teaching content recommendation system described above.

[0037] In yet another aspect, the present invention provides a method for using a multimodal teaching content recommendation system, including:

[0038] A1: Provide the multimodal teaching content recommendation system described above;

[0039] A2: Utilize the trained neural network model to identify question-related slices based on the user's input questions, and retrieve and output the corresponding course resources from the knowledge base;

[0040] A3: Utilize the user profile generation function to obtain the slices that the user is interested in based on the user's input questions and historical dialogues, and retrieve and output the recommended links of course resources for the corresponding slices from the knowledge base.

[0041] The multimodal teaching content recommendation system of this invention utilizes a neural network based on a Transformer encoder architecture to build a knowledge retrieval and classification model oriented towards user questions. Simultaneously, it provides personalized content recommendations through a content-based recommendation algorithm. Thus, it organically combines the text understanding and generation capabilities of a large model with the semantic classification and retrieval capabilities of the Transformer-Encoder architecture, effectively improving the retrieval accuracy of user question information, alleviating the hallucination of large models, and recommending personalized learning paths for users. This makes the recommendation system efficient, accurate, and user-friendly, suitable for various scenarios such as e-books, online learning platforms, and information aggregation applications.

Attached Figure Description

[0042] Figure 1 is an architecture diagram of a multimodal teaching content recommendation system according to an embodiment of the present invention.

[0043] Figure 2 is a schematic diagram of steps S100-S300 of the multimodal teaching content recommendation system according to an embodiment of the present invention.

Detailed Implementation

[0044] The preferred embodiments of the present invention are given below with reference to the accompanying drawings and described in detail.

[0045] As shown in Figure 1, the workflow of the multimodal teaching content recommendation system established in this invention mainly includes two parts:

[0046] First, a neural network model based on the Transformer Encoder architecture is used to predict relevant chapters and sections based on the user's query input, and to retrieve and output recommended links of corresponding course resources from the knowledge base and through the large language model.

[0047] Secondly, by using content-based recommendation algorithms, combined with user query results and historical dialogues, we can obtain the chapters and sections that users are interested in, and retrieve and output recommended links of corresponding course resources from the knowledge base and through a large language model.

[0048] In other words, the recommendation system obtained by the method for establishing the multimodal teaching content recommendation system of this invention not only improves the accuracy and relevance of content recommendations, but also enhances the user experience through intelligent content integration and output. First, the neural network part adopts the Transformer-Encoder model to effectively process and understand the natural language of user queries, extracting high-quality feature vectors for subsequent content matching and recommendation decisions. Second, the content-based recommendation algorithm analyzes content features in the knowledge base and user interaction data to identify and recommend content similar to the user's query, providing richer and more accurate recommendation options.

[0049] As shown in Figure 2, according to an embodiment of the present invention, the method for establishing a multimodal teaching content recommendation system based on a private-domain course resource knowledge base includes the following steps:

[0050] S100: Prepare course resources.

[0051] Step S100 includes:

[0052] Provide course materials and video recordings in PDF or Markdown format as course resources;

[0053] Using gpt-4-vision-preview and the Tongyi Tingwu model, generate image text descriptions from the images in the course materials and video text summaries from the course explanation videos.

[0054] In this embodiment, the course resources utilize a textbook and lecture videos from a university's Information Management Engineering program. A multimodal teaching content recommendation system is established based on these materials. The system aims to push relevant text explanations and course videos to students through self-questioning, enabling interactive course retrieval. A large-scale model is used for refinement and editing. The information management engineering textbook and lecture videos are organized as follows: 10 chapters, each with 2-5 subsections. Students can locate relevant knowledge points through chapters and subsections. Images in the course materials are embedded within the chapters and subsections to aid in understanding the text; these images are stored in the textbook's courseware files. The videos are explanations of each subsection of the textbook, accompanied by the teacher's audio narration.

[0055] Each section of the textbook courseware contains several embedded images. These images serve as a graphical representation of the theoretical content of that section, or as a mind map, summarizing or supplementing the content of each section. This invention uses the gpt-4 model developed by the OpenAI team for image processing. The selected model version is "gpt-4-vision-preview", which can understand images and generate natural language text through deep learning technology. The process of obtaining image text descriptions involves: first, obtaining the gpt-4 model API from the OpenAI website; then, encoding the image file into Base64 format using Python code; and finally, calling the gpt-4 model API to complete the image processing and obtain the image text description `intro_for_pics`. The API's message parameters are set as: "role": "system", "content": "Please summarize and describe the knowledge points introduced in this image".
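The Base64 encoding step and the message parameters can be sketched as follows. The sketch only builds the request body and makes no network call. It assumes the chat-style message layout in which an image is passed as a data URL inside an `image_url` content part; placing the image in a separate "user" message, the MIME type, and the function name are assumptions, not details given in the source.

```python
import base64
from typing import Any, Dict


def build_image_request(image_bytes: bytes, mime: str = "image/png") -> Dict[str, Any]:
    """Build a request body asking the model to describe one courseware image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "gpt-4-vision-preview",
        "messages": [
            {
                "role": "system",
                "content": "Please summarize and describe the knowledge "
                           "points introduced in this image",
            },
            {
                # Assumption: the image travels in a user message as a data URL.
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:{mime};base64,{encoded}"},
                    }
                ],
            },
        ],
    }


payload = build_image_request(b"\x89PNG-demo-bytes")
```

The model's reply to such a request would then be stored as `intro_for_pics`.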

[0056] For each lesson video segment, speech-to-text recognition is performed on Alibaba's Tongyi Tingwu website using a dedicated large-scale model, resulting in a video-to-text description (video_to_txt). In this invention, the "glm-4" large-scale language model developed by the Zhipu Qingyan team is used for video text processing and correction. The selected model version is "glm-4-air," which is more economical to use than OpenAI's large-scale models and focuses on text processing.

[0057] Therefore, `video_to_txt` and `abstract_for_video` are obtained from the video for use in generating the video index knowledge base and the video knowledge base, specifically including:

[0058] Obtain the API for GLM-4-air from the Zhipu official website;

[0059] Import the video-to-text result (video_to_txt) extracted by Tongyi Tingwu using Python code;

[0060] Invoke the GLM-4-air model to summarize, refine, and correct the extracted video text to generate a video text summary (abstract_for_video). This ensures that there are no typos or semantic inconsistencies after recognition. The API message parameter is set as follows: "role": "system", "content": "The following text content is a video-to-text version of a university course. Please summarize, refine, and modify it to ensure that there are no statements that do not conform to the semantics and actual classroom situation."

[0061] Step S200: Construct a knowledge base with multiple slices based on the course resources and generate category labels for identifying the slices. The knowledge base is used to output the course resources of the corresponding slices to the user based on the category labels. At the same time, prepare a training set and a test set based on the text content of each slice, construct a vocabulary, and perform text encoding on the training set and test set.

[0062] The knowledge base includes a course text knowledge base, an image knowledge base, and a video knowledge base.

[0063] Step S200 specifically includes:

[0064] Step S210: Slice and organize the textual materials of the teaching materials according to chapters and sections to form a course text knowledge base that is easy to manage and retrieve; then establish an image knowledge base and a video knowledge base, wherein the relational data in the image knowledge base and video knowledge base includes extracted image text descriptions and video text summaries, as well as image addresses and video addresses; then, combine the chapters and sections as category labels for the prediction model.

[0065] Step S210 specifically includes:

[0066] Step S211: Convert the text data to a different file format and organize it into sections and subsections to form a course text knowledge base.

[0067] In step S211, the PDF file is converted to Markdown format using the PyMuPDF library. Based on the chapter content, Markdown documents corresponding to different sections within different chapters are first extracted. Specifically, the chapter and section titles are standardized using Markdown symbols such as # and ##. The corresponding documents are then stored in the course text knowledge base `base_for_textbook`. Thus, the course text knowledge base includes a JSON overview file and a file library arranged by chapters and sections. The JSON overview file records the chapter and section to which each document belongs. For example, if document 1 records the content of the first section of the first chapter, then the first record would be: `{"text_id": "text1", "title": "Introduction", "chapter": 1, "section": 1, "url": "https://example.com/1"}`.
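A minimal sketch of building and querying such a JSON overview file is given below. The field names follow the example record above; the helper names and the lookup by (chapter, section) are hypothetical additions for illustration.

```python
import json
from typing import Any, Dict, List, Optional


def make_text_record(index: int, title: str, chapter: int, section: int,
                     url: str) -> Dict[str, Any]:
    """Create one overview record in the format of the example above."""
    return {"text_id": f"text{index}", "title": title,
            "chapter": chapter, "section": section, "url": url}


def find_record(records: List[Dict[str, Any]], chapter: int,
                section: int) -> Optional[Dict[str, Any]]:
    """Return the first record stored for a given chapter and section."""
    for record in records:
        if record["chapter"] == chapter and record["section"] == section:
            return record
    return None


records = [make_text_record(1, "Introduction", 1, 1, "https://example.com/1")]
overview_json = json.dumps(records, ensure_ascii=False, indent=2)
```

The same lookup pattern would apply to the image and video overview files, which use the same chapter and section fields.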

[0068] Step S212: Process the images and videos; that is, generate an image index knowledge base and a cloud-based image library, as well as a video index knowledge base and a cloud-based video library, based on the image text descriptions and video text summaries. The cloud-based image and video libraries store the images and corresponding videos from the courseware; the image and video index knowledge bases record and store the image text descriptions, video text summaries, and cloud storage links.

[0069] Each section of the textbook courseware contains several embedded images. Therefore, the images serve as a graphical representation of the theoretical content of that section, or as a mind map, summarizing or supplementing the content of each section. This invention uses the gpt-4 model developed by the OpenAI team for image processing, specifically the version "gpt-4-vision-preview". This gpt-4-vision-preview model, through deep learning technology, can understand images and generate natural language text. Based on the image descriptions, an image index knowledge base and an image knowledge base are formed. Specifically, this includes replacing the original images with image descriptions (intro_for_pics), building an image index knowledge base based on the images and their descriptions, storing the images in the folder directory of the image index knowledge base to obtain image addresses (url_for_pics), and importing the image addresses (url_for_pics) and image descriptions (intro_for_pics) into the image knowledge base (base_for_pics) in a relational data format.

[0070] In this embodiment, the image knowledge base `base_for_pics` includes a JSON overview file and a file library organized by chapters and sections. The JSON overview file records the chapter and section to which each image belongs. For example, if image 1 belongs to the first section of the first chapter, then the first record would be: `{"pic_id": "pic1", "title": "Introduction", "chapter": 1, "section": 1, "url": "https://example.com/video1"}`. Images are stored in folders within the image index knowledge base, arranged in chapter-section order.

[0071] Therefore, based on the video-to-text result (video_to_txt) and the video text summary (abstract_for_video), a video index knowledge base and a video knowledge base are generated, specifically including:

[0072] A video index knowledge base (located on a resource cloud drive) is built based on the video and video text summary (abstract_for_video). The videos are stored in the folder directory of the video index knowledge base, and the video address (url_for_video) of each video is obtained. The video address (url_for_video), video-to-text (video_to_txt), and video text summary (abstract_for_video) are put into the video description table (video_excel) and imported into the local video knowledge base (base_for_video).

[0073] In this embodiment, the video knowledge base `base_for_video` includes a general JSON overview file and a file library arranged by chapters and sections. The JSON file records the chapter and section to which each video belongs. For example, if video 1 records the content of the first section of the first chapter, then the first record would be: `{"video_id": "video1", "title": "Introduction", "chapter": 1, "section": 1, "url": "https://example.com/video1"}`. Videos are stored in folder directories, which are arranged by chapter and section.

[0074] Step S213: Perform course text processing; that is, extract from the course text knowledge base the text content under each first-level chapter name (corresponding to `chapter`) and second-level title name (corresponding to `session`), and merge the image text description `intro_for_pics`, the video-to-text `video_to_txt`, and the video text summary `abstract_for_video` into the corresponding text content `content`, so that each piece of text content uniquely points to one (chapter, session) pair.

[0075] The teaching materials used in this invention consist of 10 chapters. The chapter titles have been marked with # in step S211, representing the first-level chapter names (i.e., the chapter titles) in Markdown format. Each chapter title is followed by 2-5 subsections, and the second-level headings (i.e., the subsection titles) have been marked with ## in Markdown format in step S211 above.

[0076] Therefore, in step S213, the content under the headings is extracted using regular expressions in Python code. The regular expression for the first-level chapter name is: ^#[^#].*, and the regular expression for the second-level heading name is: ^##[^#].*. The text content "content" under each first-level chapter name and second-level heading name is extracted. Subsequently, following the image and video processing, the image text description "intro_for_pics", the video-to-text "video_to_txt", and the video text summary "abstract_for_video" are added to the subsections of the corresponding chapters as part of the text content "content" of the subsection (i.e., the second-level heading name). Finally, each piece of text content "content" uniquely points to one (chapter, session) pair.
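The heading-based extraction can be sketched with the two regular expressions quoted above. The rest of the sketch is an assumption made for illustration: chapters and sections are numbered in order of appearance, blank lines are dropped, and text between a chapter heading and its first section heading is ignored.

```python
import re
from typing import Dict, List, Tuple

CHAPTER_RE = re.compile(r"^#[^#].*")   # first-level chapter name
SECTION_RE = re.compile(r"^##[^#].*")  # second-level heading name


def extract_sections(markdown: str) -> Dict[Tuple[int, int], str]:
    """Collect the text under each (chapter, section) heading pair."""
    lines_by_key: Dict[Tuple[int, int], List[str]] = {}
    chapter = 0
    section = 0
    for line in markdown.splitlines():
        if CHAPTER_RE.match(line):
            chapter += 1
            section = 0  # section numbering restarts in every chapter
        elif SECTION_RE.match(line):
            section += 1
            lines_by_key[(chapter, section)] = []
        elif section > 0 and line.strip():
            lines_by_key[(chapter, section)].append(line.strip())
    return {key: "\n".join(lines) for key, lines in lines_by_key.items()}


sample = "# Ch1\nintro\n## S1\nalpha\nbeta\n## S2\ngamma\n# Ch2\n## S1\ndelta\n"
contents = extract_sections(sample)
```

Each key of `contents` is a (chapter, section) pair, so every piece of text points to exactly one pair, as the paragraph above requires.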

[0077] Step S214: Data labeling, that is, combining the first-level chapter name and the second-level heading name as the category label of the prediction model, so as to transform the bi-objective prediction model into a single-objective prediction model;

[0078] For example, the text content under the first section of the first chapter is predicted to have a category label of 1, the text content under the second section of the first chapter is predicted to have a category label of 2, and so on, until the content categories under all chapters and sections are labeled.
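The labeling scheme amounts to numbering the (chapter, section) pairs consecutively. A minimal sketch follows, assuming the number of sections per chapter is known; the function name and the example chapter sizes are hypothetical.

```python
from typing import Dict, List, Tuple


def build_label_map(sections_per_chapter: List[int]) -> Dict[Tuple[int, int], int]:
    """Assign consecutive category labels, starting at 1, to every
    (chapter, section) pair in reading order."""
    label_map: Dict[Tuple[int, int], int] = {}
    label = 1
    for chapter, n_sections in enumerate(sections_per_chapter, start=1):
        for section in range(1, n_sections + 1):
            label_map[(chapter, section)] = label
            label += 1
    return label_map


# Example: chapter 1 has two sections and chapter 2 has three.
label_map = build_label_map([2, 3])
# The inverse map turns a predicted label back into (chapter, section).
inverse_map = {label: key for key, label in label_map.items()}
```

The inverse map is what lets a single predicted label be turned back into a chapter and a section for retrieval.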

[0079] In this invention, the prediction model is the Transformer-Encoder model.

[0080] Step S220 (optional): Use a large language model to perform text equalization on the text content of each chapter and section, and generate summary text to obtain integrated text content, which is used to help understand key information.

[0081] Step S220 specifically includes:

[0082] Step S221: Perform text balancing, that is, count the number of words in each section of each chapter (i.e., across all categories) and, based on the 3σ principle, calculate the mean (μ) and standard deviation (σ) of the number of words across all sections. If the number of words in a section is less than μ − 3σ, the text content under that section is enhanced.

[0083] Enhance the text content in this section by calling the API of the GLM-4-air large model based on Python code. The API message parameters are set as follows: {"role": "system", "content": "Assuming you are a senior education expert, you are compiling a textbook"}, {"role": "user", "content": "The above text is the knowledge point content of section x of chapter x of this textbook. Due to the large difference in text volume compared to other sections, in order to ensure the effectiveness of subsequent modeling and prediction, please enhance the knowledge point content involved in this section without extending to other irrelevant knowledge points, expand its text, and add your output after the original section text"}.
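The selection of sections to enhance can be sketched as a simple lower-bound check on the word counts. The sketch assumes the population standard deviation; the source does not say whether the population or the sample form is used, and the function name and example counts are hypothetical.

```python
import statistics
from typing import Dict, List, Tuple


def sections_needing_enhancement(
    word_counts: Dict[Tuple[int, int], int]
) -> List[Tuple[int, int]]:
    """Return the (chapter, section) keys whose word count is below mu - 3*sigma."""
    counts = list(word_counts.values())
    mu = statistics.mean(counts)
    sigma = statistics.pstdev(counts)  # assumption: population standard deviation
    lower_bound = mu - 3 * sigma
    return [key for key, n in word_counts.items() if n < lower_bound]


# Nineteen sections of 1000 words and one unusually short section.
word_counts = {(1, s): 1000 for s in range(1, 20)}
word_counts[(2, 1)] = 100
short_sections = sections_needing_enhancement(word_counts)
```

Only the sections returned here would be sent to the GLM-4-air model for expansion.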

[0084] Step S222: Perform text enhancement, that is, use Python to call the GLM-4-air API to summarize the text content under each category label to obtain the abstract text.

[0085] The text content is summarized, mainly outlining the knowledge points and methods, to generate an abstract text for each category label's text content. This abstract text enhances the original text.

[0086] Step S223: Text integration, that is, by simply concatenating the abstract text, the abstract text is directly added to the end of the original text content, to obtain the integrated text content complete_file.

[0087] Here, a <sep> mark is added as a separator between the original text content (content) and the abstract text (abstract). Furthermore, since this invention targets professional knowledge-based textbooks, whose texts are highly descriptive and devoid of emotional tone, the impact of punctuation on semantic understanding can be disregarded. Therefore, a unified strategy is adopted to standardize all punctuation in the text: punctuation marks (, ; ! ? ... ——) are uniformly replaced with the <sep> mark. The integrated text is saved as a txt file, denoted as the integrated text content complete_file.
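The concatenation and punctuation replacement can be sketched as below. The punctuation set covers the marks listed above plus their full-width counterparts, which is an assumption, as is collapsing a run of consecutive punctuation marks into a single `<sep>`; the function names are hypothetical.

```python
import re

SEP = "<sep>"

# Listed marks (, ; ! ? ... and a double dash) plus full-width variants.
# Runs of consecutive marks collapse into one <sep> (an assumption).
PUNCT_RE = re.compile(
    r"(?:\u2026\u2026|\u2014\u2014|\.\.\.|[,;!?\uFF0C\uFF1B\uFF01\uFF1F])+"
)


def normalize_punctuation(text: str) -> str:
    """Replace every run of the listed punctuation marks with <sep>."""
    return PUNCT_RE.sub(SEP, text)


def integrate(content: str, abstract: str) -> str:
    """Append the abstract to the original content, separated by <sep>."""
    return normalize_punctuation(content) + SEP + normalize_punctuation(abstract)


complete_file = integrate("A, B; C", "D!")
```

Writing `complete_file` to a txt file then gives the integrated text used in step S230.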

[0088] Step S230: Prepare training and testing sets based on the text content of each slice, and construct a vocabulary list based on high-frequency words in the text content; wherein, the training set includes the segmented text and corresponding category labels; the testing set includes test questions and category labels corresponding to each test question; and perform text encoding on the segmented text and test questions based on the vocabulary list.

[0089] Word segmentation was performed using the jieba library. Initial word frequency filtering was conducted, and tokens with a frequency ≥ 3 underwent a chi-square test to calculate the p-value of the chi-square statistic. The significance level threshold for the p-value was set to 0.05. When p < 0.05, the token was retained in the vocabulary, ultimately constructing a vocabulary suitable for teaching content.

[0090] Step S231: Segment the text content to obtain segmented text, prepare a training set based on the segmented text, and prepare a test set based on the text content;

[0091] The text content is segmented, specifically by using Python to call the API of the large model GLM-4-air to segment the integrated text content complete_file in each chapter and section.

[0092] The API's message parameters are set as follows: {"role": "system", "content": "Assuming you are a professor of a relevant course at a university."}, {"role": "user", "content": "Now you need to segment the following text. The segmentation method can determine whether to segment by using keywords and concept recognition. If a section is logically divided into different parts (such as explanation, examples, summaries, etc.), it can also be segmented. Ensure that each segment corresponds to the same knowledge point. Please output the original text directly, adding the '&' symbol between the two segments that need to be segmented."}

[0093] The purpose of segmentation is to break down lengthy textbook content (including abstract text) into smaller, more easily processed and understood segments. Each segment should relate to a single knowledge point, but a segment is not necessarily a complete knowledge point; a knowledge point may span multiple segments. Segmentation is solely for making each text segment more structured and easier to process. The '&' symbol is used as a delimiter to mark the segmentation position. Using Python's `split` function, the segments to the left and right of each '&' symbol are separated, resulting in segmented knowledge text for the different sections within each chapter. The segmented knowledge text for the same section within the same chapter corresponds to the same label. The segmented text is stored as a file named `split_knowledge.csv`. This CSV file has two columns: the first column is the segmented knowledge text, and the second column is the corresponding category label, which was obtained in step S214 above.
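
The splitting and storage described above can be sketched as follows. This is a minimal example under stated assumptions: the helper names `split_segments` and `write_split_knowledge`, the header names `knowledge` and `label`, and the sample label `ch1_sec2` are hypothetical, and an in-memory buffer stands in for the file `split_knowledge.csv`.

```python
import csv
import io


def split_segments(llm_output: str, label: str):
    """Split the model output on '&' and pair each non-empty segment with its label."""
    return [(seg.strip(), label) for seg in llm_output.split("&") if seg.strip()]


def write_split_knowledge(rows, fileobj):
    """Write (segment, label) rows as a two-column CSV."""
    writer = csv.writer(fileobj)
    writer.writerow(["knowledge", "label"])  # assumed header names
    writer.writerows(rows)


rows = split_segments("Definition of a slice. & Worked example. & Summary.", "ch1_sec2")

# Stand-in for open("split_knowledge.csv", "w", newline="", encoding="utf-8").
buffer = io.StringIO()
write_split_knowledge(rows, buffer)
```

All segments produced from one section share that section's category label, which matches the statement that segmented text from the same section corresponds to the same label.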

[0094] The training set is a CSV file, specifically the split_knowledge.csv file obtained above, which contains the segmented text "knowledge" and its corresponding category label. The text content for each section is compiled based on the actual content of each section in the textbook.

[0095] Using Python to call the GLM-4-air API, 100 relevant academic questions (test_qusetion) are generated for each section, serving as the test set text. The API's message parameters are set as follows: {"role": "system", "content": "Assume you are a student with higher education."}, {"role": "user", "content": "Now you are studying the following text. Please ask questions about different concepts, examples, and logical structures in the text from different perspectives, with different content, and different ways of thinking. The questions must conform to academic standards."}. The test set is also stored as a test_qusetion.csv file, which has two columns: the first column is the test question (test_qusetion), and the second column is the category label corresponding to the test question, indicating which section (i.e., which category) the test question is asking.

[0096] Step S232: Construct a preliminary vocabulary based on the high-frequency words in the text content, and then perform a chi-square test on the high-frequency words to filter out high-frequency words that are independent of the category label, thus obtaining the final vocabulary.

[0097] The process of constructing a preliminary vocabulary based on the high-frequency words in the text content includes: importing the integrated text content complete_file obtained in step S223 using the pandas library, segmenting the integrated text content complete_file using the jieba library, counting the frequency of each word, obtaining the high-frequency words in the text content, and constructing a preliminary vocabulary vocab.

[0098] Here, apart from the `<sep>` separator, words that appear 3 or more times (i.e., with a frequency ≥ 3) are defined as high-frequency words. The high-frequency words are then added to the vocabulary list vocab.

[0099] A chi-square test is performed on high-frequency words to filter out those independent of the category label, resulting in the final vocabulary, which includes:

[0100] Step S2321: Calculate the expected frequency of high-frequency words under the assumption that high-frequency words and category labels are independent.

[0101] For a high-frequency word $w$ and a category label $c$: the total frequency $f_w$ of the high-frequency word in the entire corpus is the total number of times $w$ appears in the entire corpus; the total term frequency $f_c$ of the category label is the total number of times words appear under the specific category label $c$; $N$ is the total word frequency of the entire corpus, i.e., the sum of the occurrences of every word in the entire corpus; and the expected frequency $E_{w,c}$ is the number of times $w$ would be expected to occur under $c$ if the high-frequency word and the category label were independent.

[0102] The expected frequency of the high-frequency word is calculated as $E_{w,c} = \frac{f_w \cdot f_c}{N}$.

[0103] Step S2322: For each high-frequency word $w$, calculate its chi-square statistic across all category labels.

[0104] The expression for the chi-square statistic is: $\chi^2_{w,c} = \frac{(O_{w,c} - E_{w,c})^2}{E_{w,c}}$,

[0105] where $O_{w,c}$ is the observed frequency of the high-frequency word in the category label, i.e., the actual number of times $w$ appears under $c$, and $E_{w,c}$ is the expected frequency of the high-frequency word.

[0106] Step S2322: Evaluate the significance of each high-frequency word $w$ in order to optimize the vocabulary.

[0107] Step S2322 specifically includes: calculating the p-value of the chi-square statistic and setting the significance level threshold for the p-value to 0.05. If the p-value of the chi-square statistic of a high-frequency word for any category label is less than this threshold, the null hypothesis (that the word is independent of the category) is rejected, the word is considered to have a significant association with that category label, and the word is retained in the vocabulary vocab. If a word is not significant for any category label, i.e., its p-value is greater than the threshold for every category label, the word is considered unrelated to the categories, fails the chi-square test, and is removed from the vocabulary vocab. Finally, a special unknown-word tag `<unk>` is added to the vocabulary vocab; it is used during subsequent encoding to represent low-frequency words that are not stored in the vocabulary. The vocabulary `vocab` is a dictionary whose keys are words and whose values are positive integers assigned in sequence starting from 2. The value of the key `<unk>` is 0 and the value of the key `<sep>` is 1. `<sep>` is the symbol that replaces punctuation marks and joins the original text and the abstract text.
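
The vocabulary filtering of steps S2321 and S2322 can be illustrated by the following sketch. It is not asserted to be the implementation used in the embodiment. It assumes that tokens have already been produced (for example by jieba), that the statistic for a word and a category label is (observed - expected)^2 / expected with expected = f_w * f_c / N, and that the p-value is taken from a chi-square distribution with one degree of freedom, for which the survival function equals erfc(sqrt(x / 2)). The degrees of freedom, the function name `build_vocab`, and the sample data are assumptions.

```python
import math
from collections import Counter


def build_vocab(docs, alpha=0.05, min_freq=3):
    """docs: list of (tokens, label). Returns a dict mapping word -> index."""
    word_total = Counter()   # f_w: occurrences of each word in the corpus
    label_total = Counter()  # f_c: occurrences of all words under each label
    observed = Counter()     # O_{w,c}: occurrences of word w under label c
    for tokens, label in docs:
        for tok in tokens:
            if tok == "<sep>":
                continue  # the separator is excluded from frequency statistics
            word_total[tok] += 1
            label_total[label] += 1
            observed[(tok, label)] += 1
    n = sum(word_total.values())  # N: total word frequency of the corpus

    kept = []
    for word, f_w in word_total.items():
        if f_w < min_freq:
            continue  # not a high-frequency word
        for label, f_c in label_total.items():
            expected = f_w * f_c / n
            chi2 = (observed[(word, label)] - expected) ** 2 / expected
            p_value = math.erfc(math.sqrt(chi2 / 2.0))  # assumes 1 degree of freedom
            if p_value < alpha:
                kept.append(word)  # significant for at least one label
                break

    vocab = {"<unk>": 0, "<sep>": 1}
    for index, word in enumerate(kept, start=2):
        vocab[word] = index
    return vocab


docs = [
    (["loss"] * 10 + ["the"] * 10, "A"),
    (["tree"] * 10 + ["the"] * 10, "B"),
]
vocab = build_vocab(docs)
```

In this toy corpus the word "the" is spread evenly over both labels, so its statistic is 0 and it is dropped, while "loss" and "tree" are each concentrated under one label and are retained.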

[0108] Step S233: Text encoding, that is, according to the vocabulary, the segmented text knowledge of the training set is text encoded to obtain the training tensor train_sequences, and the test set is text encoded to obtain the test tensor test_sequences.

[0109] The segmented text "knowledge" is the text that has already been segmented using GLM-4-air. Based on the segmented text file `split_knowledge.csv` obtained in step S231 and the vocabulary obtained in step S232, an index-based representation method is used to map the vocabulary to an index. The segmented text is then precisely segmented using jieba, and the index is mapped. When traversing each word in the sample, it is checked whether the word is in the vocabulary. If not, it is converted to 0; if it is in the vocabulary, it is represented as the key value of the corresponding word in the vocabulary. Similarly, each test question in the test set `test_qusetion.csv` is encoded based on the vocabulary index to obtain the test sample.

[0110] Since the maximum length of the split_content of all training samples is the maximum sample length max_len, after step S230, step S240 can also be included: padding the split_content of training samples whose sample length is less than the maximum sample length max_len, with a padding value of 0, so that the length of the split_content of all training samples is consistent; similarly, padding the encoding results of test questions whose sample length is less than the maximum length of test samples, with a padding value of 0, so that the length of the encoding of all test text and training text is consistent, maintaining the uniformity and stability of the model.

[0111] Ultimately, each section (i.e., each slice) under each chapter in the course file is encoded to obtain a corresponding training sample `splited_content`. Different training samples from different slices are combined to form the training tensor `train_sequences`. Different encodings of different test samples are combined to form the test tensor `test_sequences`.
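
Steps S233 and S240 can be sketched as follows, assuming the text has already been tokenized. The helper names `encode` and `pad_sequences` and the sample vocabulary are hypothetical. As in the description, out-of-vocabulary words are mapped to 0 and padding also uses the value 0.

```python
def encode(tokens, vocab):
    """Map each token to its vocabulary index; unknown tokens map to <unk> (0)."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in tokens]


def pad_sequences(sequences, pad_value=0):
    """Right-pad every sequence with pad_value up to the longest sequence length."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]
    return padded, max_len


vocab = {"<unk>": 0, "<sep>": 1, "gradient": 2, "descent": 3}
samples = [["gradient", "descent", "<sep>", "momentum"], ["descent"]]
train_sequences, max_len = pad_sequences([encode(s, vocab) for s in samples])
```

Because the padding value and the `<unk>` index are both 0 in this scheme, a padded position and an unknown word are indistinguishable from the index alone; a separate mask is needed if they must be told apart.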

[0112] Step S300: Using the text encoding result as input sample, train the encoder architecture of Transformer to obtain a trained neural network model. The neural network model is used to identify the user's input question to obtain the category label of the question-related slice.

[0113] Step S300 specifically includes:

[0114] Step S310 (optional): Load the training and test samples from the text encoding results. In step S200, the training samples (including the text under each chapter and subsection) and the test samples (including each question in the test set) have been text encoded, generating training tensors `train_sequences` and test tensors `test_sequences`. To facilitate batch loading and training, PyTorch's DataLoader is used for data loading and management. First, the training samples are loaded. The encoded training sample sequences are converted into PyTorch Dataset objects and loaded in batches using DataLoader, saved as the training dataset `train_data_loader`. The tensor shape of the loaded training tensors is [batch_size, seq_len], where `batch_size` represents the batch size, which is the number of samples the model processes simultaneously during one training cycle, and `seq_len` represents the sequence length, which is the number of characters in each input sample. During training, `shuffle=True` is set to ensure the random order of training samples within each training cycle. Second, the test samples are loaded. Based on the obtained test set file test_qusetion.csv, the encoded test samples are also converted into PyTorch Dataset objects and loaded in batches using DataLoader, saved as test_data_loader. The tensor shape of the loaded test tensors is [batch_size, seq_len]. Since the test samples do not need to be randomized, the DataLoader configuration for the test set does not use shuffle.

[0115] Step S320: Configure and initialize the neural network model of the encoder architecture of the Transformer model, the neural network model including word embedding layer, position encoding layer, Transformer encoder layer and category output layer.

[0116] It should be noted that the encoder architecture of the Transformer model is existing, and this invention only uses its encoder part, without using the decoder part.

[0117] The process involves several layers: a word embedding layer maps the index of each word in the input sample to a 128-dimensional vector, resulting in the first word embedding tensor. A sine / cosine positional encoding layer, based on the first word embedding tensor and using sine / cosine positional encoding to capture the relationships between words, produces a third result tensor containing positional information. The Transformer encoder layer uses a multi-head attention mechanism to produce the output tensor of the fourth encoder layer. The category output layer converts the output of the Transformer encoder layer into a score for each sample across all categories.

[0118] For the word embedding layer, the input data format is a tensor [batch_size, seq_len], and the output data format is a tensor [batch_size, seq_len, embedding_dim]. For the sine and cosine positional encoding layer, the input data format is a tensor [batch_size, seq_len, embedding_dim], and the output data format is also a tensor [batch_size, seq_len, embedding_dim]. For the Transformer encoding layer, the input data format is the tensor [batch_size, seq_len, embedding_dim] output from the word embedding layer and the positional encoding layer, and the output is a tensor with shape [batch_size, seq_len, hidden_dim] after processing by the self-attention mechanism and the feedforward neural network. For the output layer, the input is the output tensor [batch_size, seq_len, hidden_dim] of the Transformer encoding layer, and the output has shape [batch_size, num_classes], representing the score or probability distribution of each sample over the classes.

[0119] Here, `batch_size` represents the batch size, which is the number of samples the model processes simultaneously during one training cycle; `seq_len` and `sequence_length` represent the sequence length, which are the number of characters in each input sample; and `embedding_dim` represents the embedding dimension, which is the dimension of the vector representing each input character. During forward propagation, the input is transposed to ensure it conforms to the requirements of `nn.TransformerEncoder` in PyTorch. `nn.TransformerEncoder` is the encoder part of the Transformer model, responsible for processing the input sequence and generating context-sensitive representations.

[0120] The word embedding layer maps the index of each word in the input sample to a 128-dimensional vector. The input to the word embedding layer is the batches of the training tensor loaded by `train_data_loader` and of the test tensor loaded by `test_data_loader`, where each input batch is a tensor of size [batch_size, seq_len]. In the word embedding layer, each word (i.e., the word index) is mapped to its corresponding high-dimensional vector representation by looking up the word embedding matrix. Each word is mapped to an embedding vector with dimension `embedding_dim=128`. After word embedding, the output tensor has the shape [batch_size, seq_len, embedding_dim], meaning each word corresponds to a 128-dimensional vector representation. This results in two embedding tensors: `train_embedding_sequences`, containing all training samples (denoted as the first word embedding tensor), and `test_embedding_sequences`, containing all test samples.

[0121] The sine and cosine positional encoding layer captures the relationships between words based on sine and cosine positional encoding. Define a PositionalEncoding class to add positional information to the input, where the input is a tensor output by the word embedding layer, in the format [batch_size, seq_len, embedding_dim].

[0122] First, a zero-filled tensor `pe0` of size (max_len, d_model) is created to initialize the data structure. Creating this basic tensor pre-allocates memory for the positional encoding information of each position, so that a tensor with the matching structure and size exists before the actual positional encoding is computed and the computed values can be written into it directly. Here, `max_len` is the maximum length of the input sequence, and its size is equal to the maximum sample length `max_len`; `d_model` is the dimension of the word embedding, i.e., `embedding_dim`, which is set to 128 here, consistent with the `embedding_dim` in the tensor output by the word embedding layer.

[0123] Next, a first position tensor, `position`, is generated. It is a series of values from 0 to `max_len-1`, representing the index of each position. The format of the first position tensor `position` is `(max_len,)`.

[0124] The `unsqueeze(1)` function changes the dimension of the first position tensor `position` from `(max_len,)` to `(max_len, 1)`, inserting a new dimension into the first dimension of `position` to adapt it for subsequent broadcast operations. This ensures that the dimension of the first position tensor `position` matches the dimension `d_model` of the word embedding after expansion, allowing for correct addition operations with positional encodings.

[0125] The first position tensor is expanded to a larger dimension based on the broadcast operation to match the dimension d_model of the word embedding, resulting in the position-encoded tensor pe stored at the all-zero tensor pe0.

[0126] The position encoding tensor pe consists of multiple position encoding vectors, each with dimension d_model, where d_model = 128. For each position in each dimension, the position encoding is calculated using two formulas. For even-numbered dimensions, the position encoding is calculated as $PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right)$. For odd-numbered dimensions, the position encoding is calculated as $PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)$. Here, pos is the position index, representing the position of the word in the sequence, which is an integer from 0 to max_len-1; i is the dimension index, representing a pair of dimensions in the embedding vector, and the value of i is an integer between 0 and 63.

[0127] Finally, the unsqueeze(0) function is used to insert a new dimension at position 0 of the position encoding tensor pe, transforming its dimensions to (1, max_len, d_model) so that it broadcasts over the batch dimension. Where the sequence-first layout is used, the transpose(0, 1) function swaps the 0th and 1st dimensions, which gives the dimensions (max_len, 1, d_model). The tensor obtained from the position encoding tensor pe through this dimension transformation is the second position tensor.

[0128] The second position tensor is added to the first word embedding tensor corresponding to the training sample to obtain a third result tensor containing position information, in the format [batch_size, seq_len, embedding_dim], which can help the model understand semantic information.

[0129] The Transformer encoder layer uses a multi-head attention mechanism to obtain the output tensor of the fourth encoder layer.

[0130] The Transformer encoder layer is created using nn.TransformerEncoderLayer, and it employs a multi-head attention mechanism and a feedforward neural network to extract contextual information from the input sequence. The multi-head attention mechanism enables the model to process multiple information points in the sequence in parallel, thereby capturing semantics and dependencies more comprehensively and enhancing the model's information integration capabilities.

[0131] The Transformer encoder layer consists of a self-attention mechanism and a feedforward network, and typically also includes layer normalization and residual connections.

[0132] The maximum length of each input sample in the Transformer encoder layer is `max_len`. The generated attention mask `attention_mask` has the shape `[batch_size, seq_len]`; it controls which positions the Transformer-Encoder model attends to when calculating self-attention, thereby masking the padding in the input data. In this embodiment, the specific configuration is `hidden_dim=128`, `num_head=4`, `dim_feedforward=512`, and `dropout=0.1`, with two stacked encoder layers. The hidden layer dimension (`hidden_dim`) is the feature dimension of the outputs of the multi-head attention mechanism and of the feedforward network, and is set to 128 in this invention. The number of attention heads (`num_head`) is the number of independent attention mechanisms operating simultaneously in the multi-head attention mechanism, and is set to 4. The feedforward network dimension (`dim_feedforward`) is the dimension of the hidden layer in the feedforward neural network, and is set to 512. The dropout rate (`dropout`) is the proportion of units randomly discarded during training to prevent overfitting, and is set to 0.1. The Transformer encoder layer uses ReLU as the activation function. Furthermore, the model's processing power is enhanced by stacking two encoder layers (`num_layers=2`). The overall encoding process is completed by nn.TransformerEncoder, resulting in a fourth encoder layer output tensor with the shape [batch_size, seq_len, hidden_dim]. This output contains the representation of each position after multiple Transformer encodings, capturing the dependencies within the sequence.

[0133] The category output layer transforms the output of the Transformer encoder layer into scores for each sample per class. Using an indexed approach (src=src[:,-1,:]), the output of the last time step of each sample sequence from the fourth encoder layer's output tensor is used as input to the linear layer, aiming to capture and utilize the cumulative information of the entire sequence. The linear layer `nn.Linear(hidden_dim,num_class)` performs a linear transformation, mapping each feature vector extracted from the encoder from the `hidden_dim` dimension to a new dimensional space (`num_class` dimension), where each dimension represents a score for a class, set according to the predefined number of class labels for chapters and subsections. The output of this layer is the fifth category output layer's output tensor `logits`, in the format `[batch_size, num_class]`. Each row of the fifth category output layer's output tensor `logits` represents the score for a sample per class.

[0134] The output layer shows the probability distribution of each class for each sample. The output of the class output layer is the output tensor of the fifth class output layer, where each row represents the score of a sample for each class. These scores are then converted into predicted class probabilities using the softmax function, forming a tensor of shape [batch_size, num_class], which outputs the probability distribution of each class for each input sample.
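
For illustration only, the four layers described in step S320 can be assembled in PyTorch roughly as follows. The class names `PositionalEncoding` and `SliceClassifier`, the vocabulary size, the number of classes, and the random input are assumptions; the hyperparameters (128-dimensional embedding, 4 heads, feedforward dimension 512, dropout 0.1, 2 layers, ReLU) follow the embodiment. The input is transposed to the sequence-first layout before the encoder, as the description states.

```python
import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    """Sine / cosine positional encoding added to the word embeddings."""

    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        pe = torch.zeros(max_len, d_model)                                # pe0
        position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)  # (max_len, 1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(0))   # (1, max_len, d_model)

    def forward(self, x):  # x: [batch_size, seq_len, d_model]
        return x + self.pe[:, : x.size(1)]


class SliceClassifier(nn.Module):
    def __init__(self, vocab_size, num_class, max_len, embedding_dim=128,
                 num_head=4, dim_feedforward=512, dropout=0.1, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.pos_encoding = PositionalEncoding(embedding_dim, max_len)
        layer = nn.TransformerEncoderLayer(
            d_model=embedding_dim, nhead=num_head,
            dim_feedforward=dim_feedforward, dropout=dropout, activation="relu",
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(embedding_dim, num_class)

    def forward(self, token_ids, padding_mask=None):
        x = self.pos_encoding(self.embedding(token_ids))  # [batch, seq, dim]
        x = x.transpose(0, 1)                             # [seq, batch, dim]
        x = self.encoder(x, src_key_padding_mask=padding_mask)
        x = x.transpose(0, 1)                             # [batch, seq, dim]
        return self.classifier(x[:, -1, :])               # logits: [batch, num_class]


model = SliceClassifier(vocab_size=50, num_class=6, max_len=16).eval()
token_ids = torch.randint(0, 50, (3, 16))  # hypothetical encoded batch
with torch.no_grad():
    logits = model(token_ids)
    probs = torch.softmax(logits, dim=1)   # per-sample class probabilities
```

The optional `padding_mask` is a boolean tensor of shape [batch_size, seq_len] in which True marks padded positions to be ignored by self-attention.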

[0135] The loss function and optimizer of the neural network model are set as follows. This invention uses the cross-entropy loss function nn.CrossEntropyLoss() (denoted CELLoss) as the training loss function, $L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log(p_{ij})$, which is used to calculate the gap between the model's predictions and the actual target. Here, $N$ is the total number of samples, and $M$ is the total number of classes. $y_{ij}$ is an indicator variable; it is 1 if sample $i$ belongs to category $j$, and 0 otherwise. $p_{ij}$ represents the probability that the $i$-th sample is predicted to be of class $j$. The optimizer chosen is Adam (optim.Adam()), with an initial learning rate of 0.001. The Adam optimizer adaptively adjusts the learning rate based on gradient descent, which accelerates convergence and improves model stability. The Adam algorithm flow is as follows:

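[0136] Obtain the gradient values for the next round: $g_t = \nabla_\theta f_t(\theta_{t-1})$

[0137] Update the first moment vector: $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$

[0138] Update the second moment vector: $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$

[0139] Calculate the bias-corrected first moment vector: $\hat{m}_t = m_t / (1 - \beta_1^t)$

[0140] Calculate the bias-corrected second moment vector: $\hat{v}_t = v_t / (1 - \beta_2^t)$

[0141] Update parameters: $\theta_t = \theta_{t-1} - \alpha \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$, until convergence.

[0142] where $\alpha$ is the learning rate, initially set to 0.001; $\beta_1$ and $\beta_2$ are the exponential decay rates of the first and second moment estimates; and $\epsilon$ is a small constant that prevents division by zero.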
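
The Adam update flow above can be traced with a small scalar example. This is only a sketch for understanding the update rule; in the embodiment the optimizer is optim.Adam() from PyTorch. The function name `adam_minimize`, the quadratic objective, and the larger demonstration learning rate of 0.1 are assumptions; the decay rates 0.9 and 0.999 and the constant 1e-8 are the commonly used defaults.

```python
import math


def adam_minimize(grad_fn, theta, lr=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Run Adam on a scalar parameter and return the final value."""
    m = 0.0  # first moment estimate
    v = 0.0  # second moment estimate
    for t in range(1, steps + 1):
        g = grad_fn(theta)                      # gradient for this round
        m = beta1 * m + (1 - beta1) * g         # update first moment
        v = beta2 * v + (1 - beta2) * g * g     # update second moment
        m_hat = m / (1 - beta1 ** t)            # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)            # bias-corrected second moment
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Minimize f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_star = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0, lr=0.1, steps=500)
```

The bias correction matters mainly in the first steps, when m and v are still close to their zero initialization.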

[0143] Step S330: Train the neural network model using the training tensor from the text encoding result. The cross-entropy loss function CELLoss is used as the loss function and the Adam optimizer is used for training. A confidence threshold is set for each category. The neural network model outputs the prediction result only when the predicted probability of the category exceeds this confidence threshold.

[0144] The training process of the neural network model is carried out within multiple training epochs. Within each epoch, the training dataset `train_data_loader` obtained in step S310 based on the training tensor is traversed. This `train_data_loader` divides the training tensor `train_sequences` from the text encoding results into multiple mini-batches, allowing the neural network model to process only a small portion of the data in each training step, rather than loading the entire training dataset, effectively managing memory usage and improving computational efficiency. Each batch of data is then input into the neural network model for forward propagation. The original output of the neural network model is the fifth-class output layer output tensor `logits`, corresponding to the original predicted score for each class.

[0145] When evaluating model loss using the loss function, the softmax function followed by a logarithm (log-softmax) is applied to the output tensor logits of the fifth category output layer to obtain the model's predicted log probabilities log_probs. These are the log probabilities predicted by the neural network model for each class based on the input text, not the true log probabilities. When processing batch data containing padding, a mask is used to identify valid data points, ensuring that only non-padding, valid logits are converted into log probabilities; log_probs[mask] contains the log probabilities of the valid data points only. The class labels of the training samples are read from the file split_knowledge.csv stored above and serve as the model's target labels targets; the targets of the valid data points, targets[mask], are selected with the same mask. Then, PyTorch's nn.CrossEntropyLoss (i.e., the cross-entropy loss function CELLoss) is used to calculate the loss value between the model's predicted log probabilities log_probs[mask] and the target labels targets[mask]. The mask ensures that the loss is calculated only over the valid input portion (excluding the padded portion). `optimizer.zero_grad()` is called to clear the previously calculated gradients, backpropagation is performed using `loss.backward()` to calculate the gradients, and finally `optimizer.step()` is called to update the model parameters. The loss for each batch is accumulated in the total loss `total_loss`. At the end of each epoch, the total loss for the current training epoch is output to monitor the trend of the loss during training. After training, in practical applications the model applies a confidence threshold to the predicted probability of each class; in this embodiment, the confidence threshold is 0.5.

[0146] The predicted category is one of the category labels of the training samples stored in the split_knowledge.csv file. The neural network model only outputs a prediction result when the predicted probability exceeds this confidence threshold, ensuring high reliability of the output. If no category's predicted probability exceeds the confidence threshold, the input is judged to be an "irrelevant question," and no retrieval or answer support is provided to the user.
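
The confidence-threshold decision can be sketched as follows, assuming the model has already produced one row of logits for a question. The function names `softmax` and `predict_slice`, the label names, and the sample logits are hypothetical; the threshold of 0.5 follows the embodiment.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_slice(logits, labels, threshold=0.5):
    """Return the predicted category label, or None for an irrelevant question."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] > threshold:
        return labels[best]
    return None  # no category exceeds the confidence threshold


labels = ["ch1_sec1", "ch1_sec2", "ch2_sec1"]
confident = predict_slice([4.0, 0.5, 0.2], labels)
uncertain = predict_slice([0.3, 0.2, 0.1], labels)
```

With three nearly equal scores, no class reaches a probability of 0.5, so the second call falls into the "irrelevant question" branch.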

[0147] Step S340: Test the neural network model using the test samples from the text encoding results, and evaluate the performance of the trained neural network model based on the test results.

[0148] The test dataset `test_data_loader`, obtained from the test tensor `test_sequences` in step S310, is traversed. During testing, `test_data_loader` loads samples sequentially according to the order in the dataset, maintaining the continuity of the data but loading it in batches. During testing, the neural network model uses `torch.no_grad()` to disable gradient calculation, preventing backpropagation and improving computational efficiency. Forward propagation is performed on the input data to generate the model output. The predicted category for each sample is obtained, i.e., the predicted label for the category involved in the test text question, and compared with the target label of the test set.

[0149] In one exemplary embodiment, the present invention employs the following computer hardware and software configuration: a 13th generation Intel(R) Core(TM) i9-13900HX central processing unit (CPU), equipped with two graphics processing units (GPUs), including Intel(R) UHD Graphics and an NVIDIA GeForce RTX 4050 Laptop GPU. The system uses CUDA compiler version 12.4, specifically V12.4.131. The system runs on a Windows 11 operating system, uses Python 3.8.19 as the programming language, employs the PyTorch 1.9.1 deep learning framework, and supports the GPU acceleration capabilities of CUDA 11.6 for building and training neural network models.

[0150] The performance of the trained neural network model is evaluated based on the test results. Specifically, the neural network model traverses all samples in the training dataset train_data_loader in each training epoch. Subsequently, the test set is not used as training samples, but only used to evaluate the training function value of the trained neural network model.

[0151] The test results showed that during the 11th to 13th epochs, the training loss of the neural network model tended to stabilize with the increase of epochs and no longer gradually decreased, triggering the early stopping mechanism. This indicates that the neural network model has converged.

[0152] Evaluating the performance of the trained neural network model based on test results includes: obtaining the model's predicted class based on `output.argmax(dim=1)`, which determines the predicted class label by finding the class index of the maximum value in each sample (i.e., each row of the output tensor). `dim=1` indicates that the function searches for the maximum value along the class dimension (i.e., the class score in each row). The predicted class label is compared with the actual target class label, `targets label`, and the number of correct predictions is calculated and accumulated. Finally, the overall accuracy (acc) of the neural network model on the test set is calculated and output.
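
The accuracy computation described above reduces to an argmax per row followed by a comparison with the target labels. The following framework-free sketch mirrors `output.argmax(dim=1)`; the function name `accuracy` and the sample scores are hypothetical.

```python
def accuracy(outputs, targets):
    """Fraction of rows whose highest-scoring class index equals the target index."""
    correct = 0
    for scores, target in zip(outputs, targets):
        predicted = max(range(len(scores)), key=scores.__getitem__)  # argmax over classes
        correct += int(predicted == target)
    return correct / len(targets)


outputs = [
    [2.1, 0.3, -1.0],
    [0.2, 1.7, 0.4],
    [0.9, 0.1, 0.8],
    [0.0, 0.2, 3.0],
]
targets = [0, 1, 2, 2]
acc = accuracy(outputs, targets)
```

Here the third row is misclassified (predicted class 0, target class 2), so three of four predictions are correct.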

[0153] The examples demonstrate that analyzing the scope of a user's question using text preprocessing and a Transformer-Encoder model is more efficient than directly relying on the computational power of a large language model to search various knowledge bases, and improves accuracy by approximately 15%. This strategy effectively locates the scope of the user's question and outputs a solution.

[0154] The performance of the trained neural network model was also evaluated on irrelevant questions. This included using Python to call the GLM-4-air API to generate 50 sets of irrelevant questions from the financial and technology sectors, storing these questions in a file named `not_related_questions.csv`, and then encoding and importing them as the test dataset. The results showed that the neural network model predicted 48 of the 50 sets of questions as irrelevant and provided no output to the user for them. This indicates that the model's ability to capture question relevance exceeds 90%.

[0155] Step S400: Establish a user profile generation function. The user profile generation function is set to extract user preference information based on the user's input questions and historical dialogues, and generate a user profile based on the user preference information. The user profile includes slices that the user is interested in.

[0156] Specifically, step S400 includes:

[0157] S410: Establish a user preference information extraction module, which is configured to analyze user preference information (such as interests, needs and preferences) based on the user's input questions and historical dialogues.

[0158] The user preference information extraction module is configured to perform the following steps:

[0159] Step S411: Determine if there is a history of dialogue. If yes, proceed to step S412. Otherwise, obtain the category label of the question-related slice based only on the user's input question.

[0160] In this embodiment, the `get_conversation_history(conversation_id)` function is used to retrieve the current user's question history in the conversation, with a maximum limit of 5 question records. If there are more than 5 question records, it is determined that there is a historical conversation, and step S412 is executed to generate personalized recommendations for the user based on the content of the historical conversation.

[0161] Step S412: Extract user preference information from the obtained user's historical dialogues based on the large language model GLM-4-air.

[0162] The input parameters for calling the GLM-4-air API are set as follows: {"role": "system", "content": "As a course consultant, your goal is to identify users' needs for interpreting text materials or querying resources. Please analyze users' historical conversations to extract user preference information. Specifically, you need to determine whether users prefer course text materials or other resources (such as images, videos, etc.). At the same time, assess whether the user's current question is closely related to the course content. Please output your analysis results in the following format: 'User Preferences: Textbook Materials / Image Materials / Video Materials; Question Relevance: Relevant / Irrelevant'."} If the current user question is irrelevant to the course, it will not be answered or recommended.

[0163] Therefore, the extracted user preference information includes whether the user prefers text, images, or videos, and whether the current question is relevant to the course content. Consequently, the following recommendations will incorporate user preference information to output relevant content (text, images, or videos).
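As a non-limiting sketch, the fixed output format requested in the system prompt ('User Preferences: ...; Question Relevance: ...') can be parsed as follows before deciding whether to answer. The function names `parse_preference` and `should_answer` and the regular expression are assumptions introduced for illustration; the embodiment does not specify how the reply is parsed.

```python
import re

# Matches the reply format requested in the system prompt.
PREFERENCE_PATTERN = re.compile(
    r"User Preferences:\s*(?P<pref>[^;]+?)\s*;\s*"
    r"Question Relevance:\s*(?P<rel>Relevant|Irrelevant)",
    re.IGNORECASE,
)


def parse_preference(reply):
    """Extract the preferred material type and the relevance flag from a reply."""
    match = PREFERENCE_PATTERN.search(reply)
    if match is None:
        return None  # the reply did not follow the requested format
    return {
        "preference": match.group("pref").strip(),
        "relevant": match.group("rel").lower() == "relevant",
    }


def should_answer(reply):
    """Questions judged irrelevant to the course are neither answered nor recommended."""
    parsed = parse_preference(reply)
    return bool(parsed and parsed["relevant"])


example = parse_preference(
    "User Preferences: Video Materials; Question Relevance: Relevant"
)
```

Returning `None` for a malformed reply lets the caller fall back to a default behaviour instead of guessing the user's preference.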

[0164] Step S420: Establish a user profile generation function. The user profile generation function is set to collect historical conversations and call the LLMChain chain to generate user profiles. The user profiles include slices that the user is interested in.

[0165] The user profile user_profile_ans is generated by the user profile generation function generate_user_profile_and_extract_info. The user profile user_profile_ans only includes [Chapter] and [Session]. The former is used to identify the course chapters that the user is interested in, and the latter is used to identify the subsections or units within the course that the user is interested in.

[0166] The user profile generation function is an asynchronous method designed to generate user profiles based on a user's chat history.

[0167] The input parameters for the user profile generation function include the user's chat history message list (chat_messages), the prompt word used to generate the user profile (user_profile_prompt), and the GLM-4-air instance (model). The GLM-4-air instance (model) is the model loaded by calling the GLM-4-air API. When instantiating the model, a temperature setting should also be set to adjust the randomness of the generated content.

[0168] The user profile generation function is obtained through the following steps: creating a chat prompt template for generating the user profile, and defining an LLMChain chain that receives user input questions, historical conversations, and the created chat prompt template, and sends them to the recommendation model. Specifically, the user profile generation function `generate_user_profile_and_extract_info` is obtained through the following steps:

[0169] Step S421: Select a recommendation model, which is used to generate recommended content that meets user needs.

[0170] In this embodiment, the recommendation model uses the GLM-4-air large language model. The default sampling temperature for the generated content is 0.7, and there is no limit to the maximum number of tokens in the generated content. Based on the GLM-4-air large language model and the temperature setting, the system calls the large language model to generate recommended content that meets the user's needs. By adjusting the temperature parameter, the system can control the randomness of the generated content to adapt to different user scenarios.

[0171] Step S422: Construct a chat prompt template, the chat prompt template including the prompt word user_profile_prompt for generating user profiles.

[0172] Step S422 specifically includes: First, a chat prompt template is created using the function `ChatPromptTemplate.from_messages()`. This chat prompt template contains a user's dialogue input, namely the prompt word `user_profile_prompt` used to generate the user profile. This is the core prompt information for the model to generate the user profile. By constructing appropriate prompt words, the chat prompt template enables the user profile generation function to understand that its task is to generate a user profile based on the chat history and extract key information.

[0173] The chat prompt template includes the prompt term `user_profile_prompt` for generating user profiles. When organizing the `user_profile_prompt`, the user's chat history list `history(List)` needs to be input into the GLM-4-air large language model first. Then, the `user_profile_prompt` is written after the message list to ensure the model processes the historical message list. The specific content of the prompt term `user_profile_prompt` is: Please analyze the current user's needs based on this historical dialogue record `history[List]` and generate a user profile. The user profile should be formatted as follows: `[Chapter]`: Extract the most matching chapter name from the course chapters. `[Session]`: Extract the relevant section name from the course knowledge base, ensuring that chapter names and section names are processed separately. This represents different levels of user interests and needs, enabling more detailed and specific personalized services to be provided to the user.
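The ordering described above, with the historical message list first and `user_profile_prompt` appended after it, can be sketched without any framework as follows. This is an illustrative stand-in rather than the `ChatPromptTemplate.from_messages()` call itself; the function name `build_profile_messages`, the message dictionary layout, and the sample history are hypothetical, and the prompt text is abbreviated from the paragraph above.

```python
USER_PROFILE_PROMPT = (
    "Please analyze the current user's needs based on this historical dialogue "
    "record and generate a user profile. "
    "[Chapter]: Extract the most matching chapter name from the course chapters. "
    "[Session]: Extract the relevant section name from the course knowledge base."
)


def build_profile_messages(history, prompt=USER_PROFILE_PROMPT):
    """Place the chat history first and the profile prompt last."""
    messages = [{"role": m["role"], "content": m["content"]} for m in history]
    messages.append({"role": "user", "content": prompt})
    return messages


# Hypothetical chat history for one conversation.
history = [
    {"role": "user", "content": "How does multi-head attention work?"},
    {"role": "assistant", "content": "It runs several attention heads in parallel."},
]
messages = build_profile_messages(history)
```

Placing the prompt last means the model has already read the full history when it reaches the instruction to summarize it.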

[0174] Step S423: Define the LLMChain chain, which is set to receive user input queries, historical dialogues, and created chat prompt templates, and pass the chat prompt templates as input prompt words to the recommendation model, so that the recommendation model can generate corresponding user profiles based on them.

[0175] Define the LLMChain class, which encapsulates all the necessary logic for interacting with the recommendation model, specifying that the recommendation model to be called is the GLM-4-air large language model. Instantiate the LLMChain class, which is a chained implementation that uses the large language model to handle specific tasks.

[0176] The LLMChain class receives a created chat prompt template, which is passed as an input prompt word to the GLM-4-air model, which then generates the corresponding output based on this prompt.

[0177] The LLMChain class also executes the main functionality of the chain object through the invoke method of the user profile chain function user_profile_chain, allowing the chain to perform configured tasks. Specifically, the user profile chain function user_profile_chain receives input data from the function caller, including the user's input question query and chat history message list, analyzes and processes the question using the GLM-4-air model, and generates a user profile user_profile_ans. The user profile user_profile_ans includes [Chapter] and [Session], representing the chapters and sections (slices) of interest to the user. To ensure that the generated [Chapter] and [Session] prompts are within the scope of the course titles, this invention predefines the correspondence between course chapters and sections in the prompts that call the GLM-4 main model: informing the main model of this correspondence, such as [1-1, 1-2, 2-1, 2-2], requesting the main model to ensure that the output chapters and sections are within this range.
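In addition to instructing the model about the permitted range in the prompt, the generated profile can be checked after the fact. The sketch below assumes, for illustration only, that chapters and sections are returned as numeric identifiers such as `2` and `2-1`; the function names and the regular expressions are hypothetical, and the allowed set simply reuses the example range [1-1, 1-2, 2-1, 2-2].

```python
import re

ALLOWED_SECTIONS = {"1-1", "1-2", "2-1", "2-2"}  # example range from the description


def extract_profile(text):
    """Pull the [Chapter] and [Session] values out of the generated profile text."""
    chapter = re.search(r"\[Chapter\]\s*:?\s*(\d+)", text)
    session = re.search(r"\[Session\]\s*:?\s*(\d+-\d+)", text)
    if chapter is None or session is None:
        return None
    return chapter.group(1), session.group(1)


def validate_profile(profile, allowed=ALLOWED_SECTIONS):
    """Accept a profile only if its section is allowed and belongs to its chapter."""
    if profile is None:
        return False
    chapter, session = profile
    return session in allowed and session.split("-")[0] == chapter


profile = extract_profile("[Chapter]: 2\n[Session]: 2-1")
```

A profile that fails validation could be discarded or regenerated, so that downstream retrieval never queries a chapter or section that does not exist in the knowledge base.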

[0178] Step S500: Integrate the knowledge base, neural network model, and user profile generation function into a multimodal teaching content recommendation system. This system generates question-related slices and user-interested slices based on the user's input questions and historical dialogues, and retrieves and outputs corresponding course resources from the knowledge base. Specifically, the multimodal teaching content recommendation system uses a trained neural network model to identify question-related slices based on the user's input questions and uses a user profile generation function to generate user-interested slices based on the user's input questions and historical dialogues.

[0179] Therefore, one function of the multimodal teaching content recommendation system is to output basic retrieval answers based on the user's input question and the knowledge base. Specifically, the multimodal teaching content recommendation system uses a trained neural network model to identify question-related slices (i.e., relevant chapters and sections) based on the user's input question. Based on these question-related slices, it retrieves and outputs the corresponding course materials, including text, images, and video materials, from the knowledge base. During the preprocessing of textbook text, images, and course videos, the text, images, and course videos are organized into knowledge bases according to each section of each chapter: base_for_textbook for course text, base_for_pics for images, and base_for_video for videos. The Transformer-Encoder model successfully locates the relevant question slices (i.e., chapters and sections). Therefore, it can directly retrieve the corresponding text, image, and video information from the knowledge base based on the slices using GLM-4-air, that is, directly obtain the text, image, and video information of the entire section. The prompts for calling the GLM-4-air large model based on Python include: obtaining category labels based on the user's input question and the neural network model, and requiring the generation of text information, image links, and video links corresponding to the category labels.

[0180] Another function of the multimodal teaching content recommendation system is personalized recommendation. After extracting the user profile through the user profile generation function, the system uses the personalized recommendation module to call the knowledge base search service MilvusKBService to retrieve the most relevant documents to the user profile. Based on user preference information (preferring text, images, or videos) and the user profile, it performs a document similarity search from the course text knowledge base (base_for_textbook), the image knowledge base (base_for_pics), and the video knowledge base (base_for_video), returning the top three most relevant files from the knowledge base. This not only supplements the relevant knowledge points retrieved for users through the Transformer-Encoder architecture but also customizes relevant knowledge recommendations for users based on their user profiles.
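The top-three similarity search can be illustrated with an in-memory stand-in for the knowledge base search service. This sketch does not use or reproduce the MilvusKBService interface; the cosine-similarity ranking, the preference-to-knowledge-base mapping, the function names, and the two-dimensional vectors are assumptions chosen to keep the example small.

```python
import math

# Assumed mapping from the extracted preference to a knowledge base name.
KB_BY_PREFERENCE = {
    "Textbook Materials": "base_for_textbook",
    "Image Materials": "base_for_pics",
    "Video Materials": "base_for_video",
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, docs, k=3):
    """Return the names of the k documents most similar to the query vector."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d["vector"]), reverse=True)
    return [d["name"] for d in ranked[:k]]


def recommend(preference, profile_vec, knowledge_bases, k=3):
    """Pick the knowledge base matching the preference, then search it."""
    base_name = KB_BY_PREFERENCE.get(preference, "base_for_textbook")
    return base_name, top_k(profile_vec, knowledge_bases[base_name], k)


# Hypothetical embedded documents in the video knowledge base.
knowledge_bases = {
    "base_for_video": [
        {"name": "video_1-1", "vector": [1.0, 0.0]},
        {"name": "video_1-2", "vector": [0.9, 0.1]},
        {"name": "video_2-1", "vector": [0.0, 1.0]},
        {"name": "video_2-2", "vector": [0.5, 0.5]},
        {"name": "video_3-1", "vector": [-1.0, 0.0]},
    ]
}
base, hits = recommend("Video Materials", [1.0, 0.0], knowledge_bases)
```

In a deployed system the vectors would come from an embedding model and the ranking would be delegated to the vector database rather than computed in Python.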

[0181] Therefore, step S500 specifically includes: defining an asynchronous function `recommend_base_chat()`, and integrating the knowledge base, neural network model, and user profile generation function using the asynchronous function `recommend_base_chat()`. This allows the multimodal teaching content recommendation system to use the trained neural network model to identify question-related slices and slices of interest to the user based on the user's input question and historical dialogue, and to retrieve and output corresponding course resources from the knowledge base. This design pattern enables the function to effectively process input while generating high-quality answers and recommendations using the GLM-4-air large language model without blocking the main execution thread, thereby improving user experience and system performance. In this invention, the asynchronous function calls the GLM-4-air API to generate answers and recommendations for the user based on user preference information (preferring text, images, or videos), combined with the dialogue context and data from the knowledge base.

[0182] Table 1: Forms, processing procedures, and input methods of multiple parameters for asynchronous functions

[0183]

[0184] The asynchronous function `recommend_base_chat()` is a multi-parameter API interface that accepts multiple input parameters, including the user-inputted question (`query`), user ID (`user_id`), conversation ID (`conversation_id`), conversation name (`conversation_name`), knowledge base name (`knowledge_base_name`), and user's conversation history. The format, processing, and input methods of the multiple parameters of the asynchronous function `recommend_base_chat()` are shown in Table 1 above.

[0185] For the large language model GLM-4-air, its output is the generated text. This text combines the category labels of the question-related slices, obtained by recognizing the user's input question with the Transformer-Encoder model described above, with the course text, image, and video information retrieved from the knowledge base, and addresses the course-related question in the user's inquiry to provide an easily understandable and readable output. Enabling the boolean parameter `stream` of the asynchronous function `recommend_base_chat()` triggers an asynchronous data processing flow that uses Python's `async for` statement to iterate over the streamed data. This setting achieves streaming output: the system returns the generated content to the user segment by segment, rather than waiting for all content to be generated.
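The streaming behaviour can be sketched with Python's standard `asyncio` module. The generator `fake_llm_stream` stands in for the GLM-4-air API and is purely hypothetical, as are the function name `recommend_base_chat_sketch`, the chunk size, and the sample query; only the use of `async for` over an asynchronous iterator follows the description.

```python
import asyncio


async def fake_llm_stream(text, chunk_size=8):
    """Hypothetical stand-in for a streaming model API: yields text in chunks."""
    for start in range(0, len(text), chunk_size):
        await asyncio.sleep(0)  # give control back to the event loop
        yield text[start:start + chunk_size]


async def recommend_base_chat_sketch(query, stream=True):
    """Yield the answer segment by segment when stream is True, else all at once."""
    answer = f"Answer to: {query}"
    if stream:
        async for chunk in fake_llm_stream(answer):
            yield chunk
    else:
        yield answer


async def collect(query, stream):
    """Gather every yielded segment into a list."""
    return [c async for c in recommend_base_chat_sketch(query, stream=stream)]


streamed = asyncio.run(collect("What is attention?", True))
whole = asyncio.run(collect("What is attention?", False))
```

Because each segment is yielded as soon as it is available, a caller can forward it to the user immediately while the remaining content is still being generated.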

[0186] In another aspect, the present invention provides a multimodal teaching content recommendation system, which is established using the method for establishing a multimodal teaching content recommendation system described above.

[0187] As shown in Figure 1, in a further aspect, the present invention provides a method for using a multimodal teaching content recommendation system, which includes:

[0188] Step A1: Provide a multimodal teaching content recommendation system;

[0189] Step A2: Use a trained neural network model to identify question-related segments based on the user's input question, and retrieve and output the corresponding course resources (including course text, images, and video materials) from the knowledge base.

[0190] Step A3: Implement a content-based recommendation algorithm using a user profile generation function. Based on the user's input question and historical dialogue, obtain the slices that the user is interested in, and retrieve and output the recommended links of course resources for the corresponding slices from the knowledge base.

[0191] The above description is merely a preferred embodiment of the present invention and is not intended to limit the scope of the invention. Various variations can be made to the above embodiments of the present invention. That is, all simple and equivalent changes and modifications made based on the claims and description of this invention fall within the protection scope of the claims of this patent. All aspects not described in detail in this invention are conventional technical content.

Claims

1. A method for establishing a multimodal teaching content recommendation system, characterized in that it includes: Step S100: Prepare course resources; Step S200: Construct a knowledge base with multiple slices based on the course resources and generate category labels for identifying the slices. The knowledge base is used to output the course resources corresponding to the slices to the user based on the category labels. At the same time, prepare training sets and test sets according to each slice, construct a vocabulary list, and perform text encoding on the training sets and test sets. Step S300: Using the text encoding result as the input sample, train the encoder architecture of the Transformer to obtain a trained neural network model. The neural network model is used to identify the user's input question to obtain the category label of the question-related slice. Step S400: Establish a user profile generation function. The user profile generation function is set to generate user profiles based on input questions and historical dialogues. The user profiles include slices that the user is interested in.
Step S500: Integrate the knowledge base, neural network model, and user profile generation function into a multimodal teaching content recommendation system, so that the multimodal teaching content recommendation system can obtain question-related slices and user-interested slices based on the input question and historical dialogue, and retrieve and output corresponding course resources from the knowledge base; Step S200 further includes: Step S210: Slice and organize the textual materials of the teaching materials according to chapters and sections to form a course text knowledge base that is easy to manage and retrieve; then establish an image knowledge base and a video knowledge base, wherein the relational data in the image knowledge base and video knowledge base includes extracted image text descriptions and video text summaries, as well as image addresses and video addresses; then, combine the chapters and sections as category labels for the prediction model; Step S220: Use a large language model to perform text equalization on the text content of each chapter and section, and generate summary text to obtain integrated text content, which is used to help understand key information; Step S230: Prepare a training set and a test set based on the text content of each slice, and construct a vocabulary list based on the high-frequency words in the text content; wherein, the training set includes the segmented text and the corresponding category labels; the test set includes test questions and the category labels corresponding to each test question; and perform text encoding on the segmented text and test questions according to the vocabulary list; Step S240: Pad the training and test samples obtained after text encoding to ensure that the lengths are consistent; Step S220 specifically includes: Step S221: Count the number of words in each section of each chapter and, based on the 3σ principle, calculate the mean (μ) and standard deviation (σ) of the number of words in all sections. If the number of words in a section is less than μ - 3σ, the text content under that section is enhanced, specifically by calling the API of the GLM-4-air large model from Python code and setting the API's message parameters; Step S222: Using Python to call the GLM-4-air API, summarize the text content under each category label to obtain the abstract text. Step S223: By simple concatenation, append the abstract text directly to the end of the original text content to obtain the integrated text content complete_file; Step S230 includes: Step S231: Segment the text content to obtain segmented text, prepare a training set based on the segmented text, and prepare a test set based on the text content; the purpose of segmentation is to break down the text content into smaller segments that are easier to process and understand, and each segment corresponds to only one knowledge point; Step S232: Construct a preliminary vocabulary based on the high-frequency words in the text content, and then perform a chi-square test on the high-frequency words to filter out high-frequency words that are independent of the category label, thus obtaining the final vocabulary. The process of constructing a vocabulary list based on the high-frequency words in the text content includes: importing the integrated text content complete_file obtained in step S223 using the pandas library, segmenting the integrated text content complete_file using the jieba library, counting the frequency of each word, obtaining the high-frequency words in the text content, and constructing a vocabulary list vocab.
A chi-square test is performed on high-frequency words to filter out those independent of the category label, resulting in the final vocabulary, which includes: Step S2321: Calculate the expected frequency of high-frequency words under the assumption that high-frequency words and category labels are independent; Step S2322: For each high-frequency word, calculate its chi-square statistic across all category labels; Step S2323: Evaluate the significance of the p-value of each high-frequency word to optimize the vocabulary; step S2323 specifically includes: calculating the p-value of the chi-square statistic and setting the significance threshold for the p-value to 0.05; as long as the p-value of the chi-square statistic of a high-frequency word with any category is less than the threshold, the hypothesis that the word is independent of the category is rejected, the word is considered to have a significant association with that category label, and words with significant p-values are retained in the vocabulary vocab. Step S233: Based on the vocabulary, the segmented text knowledge of the training set is text-encoded to obtain the training tensor train_sequences, and the test set is text-encoded to obtain the test tensor test_sequences. The segmented text "knowledge" is the text that has already been segmented by GLM-4-air. Based on the segmented text file split_knowledge.csv obtained in step S231 and the vocabulary obtained in step S232, an index-based representation method is used to map the vocabulary to an index. The segmented text is then tokenized using jieba's precise mode, and the resulting words are mapped to indices. When traversing each word in the sample, it is checked whether the word is in the vocabulary. If it is not in the vocabulary, it is converted to 0; if it is in the vocabulary, it is represented as the key value of the corresponding word in the vocabulary.
Similarly, each test question in the test set file test_qusetion.csv is encoded based on the vocabulary index to obtain the test sample.

2. The method for establishing a multimodal teaching content recommendation system according to claim 1, characterized in that, Step S100 includes: Provide course materials and video recordings in PDF or Markdown format as course resources; Based on gpt-4-vision-preview and the Tongyi Listening and Comprehension Model, images in course materials and course explanation videos are used to generate image text descriptions and video text summaries.

3. The method for establishing a multimodal teaching content recommendation system according to claim 1, characterized in that, Step S300 specifically includes: Step S320: Configure and initialize the neural network model of the Transformer encoder architecture. The neural network model includes a word embedding layer, a positional encoding layer, a Transformer encoder layer, and an output layer. The word embedding layer maps the index of each word in the input sample to a 128-dimensional vector to obtain the first word embedding tensor. The sine and cosine positional encoding layer captures the relationship between words based on the first word embedding tensor and sine and cosine positional encoding to obtain a third result tensor containing positional information. The Transformer encoder layer uses a multi-head attention mechanism to obtain the output tensor of the fourth encoder layer. The category output layer converts the output of the Transformer encoder layer into a score for each category of the sample. The third result tensor containing positional information is obtained by adding the second position tensor to the first word embedding tensor corresponding to the training sample; for the position encoding tensor of the second position tensor before dimensional transformation, the calculation formula for the position encoding of each position in dimension i is as follows: if i is even, PE(pos, i) = sin(pos / 10000^(i / d_model)); if i is odd, PE(pos, i) = cos(pos / 10000^((i - 1) / d_model)); where pos is the position index, i is the dimension index, and d_model is the dimension of the word embedding; Step S330: Train the neural network model using the training tensor from the text encoding result. The cross-entropy loss function CELLoss is used as the loss function and the Adam optimizer is used for training. A confidence threshold is set for each category. The neural network model outputs the prediction result only when the predicted probability of the category exceeds this confidence threshold.

4. The method for establishing a multimodal teaching content recommendation system according to claim 3, characterized in that, Before step S320, step S310 is also included: loading training samples and test samples from the text encoding result to obtain a training dataset and a test dataset, with the loading configured to ensure a random order of the training samples in each training cycle; and/or Following step S330, step S340 is further included: testing the neural network model using test samples from the text encoding results, and evaluating the performance of the trained neural network model based on the test results.

5. The method for establishing a multimodal teaching content recommendation system according to claim 1, characterized in that, Step S400 includes: Step S410: Establish a user preference information extraction module. The user preference information extraction module is set to analyze and obtain user preference information based on the user's input questions and historical dialogues. The user preference information includes preference for text, images or videos, and whether the current question is relevant to the course content. Step S420: Establish a user profile generation function, which is configured to collect historical conversations and call the LLMChain chain to generate user profiles, the user profiles including slices of interest to the user.

6. The method for establishing a multimodal teaching content recommendation system according to claim 5, characterized in that, The function for generating user profiles is obtained in the following way: Step S421: Select the recommendation model; Step S422: Construct a chat prompt template, the chat prompt template including prompt words for generating user profiles; Step S423: Define an LLMChain chain, which is set to receive user input questions, historical dialogues, and created chat prompt templates, and pass the chat prompt templates as input prompt words to the recommendation model, so that the recommendation model can generate corresponding user profiles based on them. The user profiles include chapters and sections that the user is interested in.

7. A multimodal teaching content recommendation system, characterized in that, It is established using the method for establishing a multimodal teaching content recommendation system according to any one of claims 1-6.

8. A method for using a multimodal teaching content recommendation system, characterized in that it includes: Step A1: Establish a multimodal teaching content recommendation system using the method for establishing a multimodal teaching content recommendation system according to any one of claims 1-6, which includes a trained neural network model and a user profile generation function; Step A2: Use a trained neural network model to identify question-related slices based on the user's input question, and retrieve and output the corresponding course resources from the knowledge base; Step A3: Implement a content-based recommendation algorithm using a user profile generation function. Based on the user's input question and historical dialogue, obtain the slices that the user is interested in, and retrieve and output the recommended links of course resources for the corresponding slices from the knowledge base.