A question explanation interaction method and system based on multi-model fusion

A multi-model fusion approach to interactive question explanation that combines ASR, VLM, and TTS models to recognize student speech and question images and to generate personalized, dynamic explanations. It addresses the lack of personalization and interactivity in existing systems, improving tutoring efficiency and the learning experience in smart education.

CN122308598A · Pending · Publication Date: 2026-06-30 · NINGBO SHENQI INTELLIGENT TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Application (China)
Current Assignee / Owner
NINGBO SHENQI INTELLIGENT TECHNOLOGY CO LTD
Filing Date
2026-02-09
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing smart education systems struggle to provide personalized, real-time, and interactive explanations, especially when dealing with complex questions. They lack effective visual element recognition and dynamic adjustment capabilities, resulting in a lack of targeted explanations.

Method used

Employing a multi-model fusion approach, this method uses an Automatic Speech Recognition (ASR) model, a Visual Language Model (VLM), and a Text-to-Speech (TTS) model to recognize students' voice questions and question images, generate personalized and dynamic explanation strategies, and support multi-round interactive optimization.

Benefits of technology

It enables convenient voice interaction between students and AI, accurately understands question information, generates personalized and dynamic explanation content, improves homework tutoring efficiency and learning experience, and achieves intelligent adaptive tutoring.


Smart Images

  • Figure CN122308598A_ABST

Abstract

This invention relates to a problem-solving interaction method and system based on multi-model fusion. The method includes: receiving a voice signal containing a question input through a terminal device; preprocessing the voice signal to convert it into text data and extracting question keywords related to the question from the text data; preprocessing the problem image and recognizing the processed problem image to generate structured problem data containing problem element information; fusing the text data, question keywords, and structured problem data as interactive data to obtain explanation requirements; generating personalized explanation strategies based on the explanation requirements; and generating explanation text based on the explanation strategies. This method achieves convenient voice interaction and comprehensive problem information recognition. Students can directly ask questions via voice, and the system can accurately understand all information in the problem, including text, graphics, formulas, etc., and generate personalized and dynamic explanation strategies based on this.

Description

Technical Field

[0001] This invention relates to the field of smart education technology, and in particular to a question explanation and interactive method and system based on multi-model fusion.

Background Technology

[0002] In the current field of smart education, homework tutoring is an important need in the student learning process. Traditional methods of explaining problems mainly rely on teacher guidance, textual explanations of teaching materials, or pre-set video tutorials, which have many limitations: First, teacher guidance is limited by time, space, and the number of teachers, making it difficult to provide personalized and timely tutoring for each student; second, textual explanations are relatively abstract, and the visualization of complex problems (such as geometric proofs and physics experiments) is insufficient, making it difficult for students to understand; third, pre-set video tutorials have fixed content, making it impossible to flexibly adjust the explanation approach and focus according to students' questions, and they lack interactivity, preventing students from actively raising targeted questions.

[0003] With the development of artificial intelligence technology, various AI tutoring products have emerged, but existing products still have obvious shortcomings: many products still rely primarily on text input for interaction, requiring students to manually type in questions or problem content, which is inefficient, especially for younger students or questions involving complex symbols and graphics, where the input threshold is high and does not meet students' demand for immediate answers; some products that support voice or image input often have limited recognition capabilities to text content, lacking effective structured recognition and understanding of visual elements such as graphics, charts, mathematical formulas, and chemical structural formulas that carry key information in the questions, making it difficult to fully understand the questions, and the generated explanations are prone to bias and lack specificity; most products rely on preset answer databases or static knowledge graphs for their explanations, generating monotonous analyses, unable to dynamically adjust the depth, pace, order, and expression of the explanations based on the specific focus of the student's questions, historical interaction records, and real-time feedback, making it difficult to achieve true personalized teaching.

[0004] Therefore, how to build a system that can naturally understand students' questions, fully comprehend the content of the questions, and provide personalized and interactive explanations of the questions has become a key technical issue in the field of smart education.

Summary of the Invention

[0005] The purpose of this invention is to overcome the shortcomings of the prior art and provide a multi-model fusion-based interactive method and system for question explanation that enables voice interaction between students and AI, accurately and comprehensively identifies questions, and dynamically generates personalized and targeted explanation content.

[0006] Firstly, the technical solution adopted by this invention is a question explanation interaction method based on multi-model fusion, which includes the following steps:
S1. Receive a voice signal containing a question or problem input through a terminal device;
S2. Preprocess the speech signal, convert the processed speech signal into text data using an automatic speech recognition (ASR) model, and extract question keywords related to the question from the text data;
S3. Obtain the question image through the terminal device, preprocess the question image, use the Visual Language Model (VLM) to recognize the processed question image, and generate structured question data containing question element information;
S4. The information fusion unit integrates the text data, question keywords, and structured question data as interactive data to obtain the explanation requirements, and generates a personalized explanation strategy based on the explanation requirements;
S5. Generate explanation text according to the explanation strategy, and use a speech synthesis TTS model to convert the explanation text into a speech signal for output, thus completing a single round of explanation interaction;
S6. If a new voice question is initiated, steps S1 to S5 are repeated to conduct a new round of interaction. During the new round of interaction, the information fusion unit optimizes the explanation strategy generated in the new round by associating the interaction data from the previous round.

[0007] The beneficial effects of this invention are as follows: By adopting the above-mentioned interactive method for explaining problems based on multi-model fusion, convenient voice interaction and comprehensive problem information recognition are achieved. Students can directly ask questions via voice, and the system can accurately understand all the information in the problem, such as text, graphics, and formulas, and generate personalized and dynamic explanation strategies based on this. It supports multi-round interactive optimization, making the explanation content more targeted, effectively improving the efficiency of homework tutoring and students' learning experience, and achieving intelligent and adaptive tutoring.

[0008] Preferably, in step S2, the preprocessing of the speech signal specifically includes: noise reduction, filtering, and enhancement processing of the speech signal.

[0009] Preferably, in step S2, the Automatic Speech Recognition (ASR) model is an end-to-end speech recognition model based on the Transformer architecture; the specific process of using the ASR model to convert the processed speech signal into text data and extracting question keywords related to the question from the text data includes: The preprocessed speech signal is input into the Automatic Speech Recognition (ASR) model, which then performs the speech-to-text conversion to generate the corresponding first text information. Semantic analysis is performed on the first text information to identify and extract at least one question keyword that reflects the student's question; The first text information and the at least one question keyword are used as the output.

[0010] In step S3, the preprocessing of the question image specifically includes: grayscale conversion, binarization, tilt correction, and noise reduction of the question image.

[0011] Preferably, in step S3, the processed question image is identified using the Visual Language Model (VLM). The specific process includes: The processed question image is input into the Visual Language Model (VLM). The Visual Language Model (VLM) is used to analyze and recognize the text, graphics, and formula elements in the problem image. The Visual Language Model (VLM) outputs structured problem data containing the problem stem, known conditions, questions to be solved, graphical features, and formula parameters.

[0012] Preferably, step S4 includes the following steps:
S41. Map and associate the text data and question keywords with specific elements in the structured question data, perform deep semantic analysis based on the text data to obtain the question intent and cognitive state, and locate specific knowledge difficulties by combining the associated structured question data;
S42. Based on the question intent, cognitive state, and knowledge difficulties, generate a multi-dimensional demand vector through a demand classification model. The multi-dimensional demand vector includes the knowledge point understanding level and the demand dimension of the problem-solving process;
S43. Based on the multi-dimensional demand vector, query the dynamic knowledge base, combine the preset teaching logic rules and the student's historical interaction profile, and dynamically assemble and generate an executable explanation strategy. The explanation strategy specifies the content sequence, level of detail, and expression method of the explanation.

[0013] Preferably, step S41 includes the following steps:
S411. Establish an internal index for the structured question data, wherein each identified question element is assigned a unique identifier;
S412. Perform named entity recognition and coreference resolution on the text data to identify entity mentions and referring expressions in the text data;
S413. By calculating semantic similarity and context matching degree, link the entity mentions and referring expressions identified in the text data to the corresponding question element identifiers in the internal index;
S414. Input the text data into a pre-trained intent recognition model, which outputs a fine-grained question intent classification corresponding to the voice question;
S415. Based on the text data and historical interaction data, generate a cognitive state vector representing the student's current confidence level and level of confusion;
S416. Based on the linked question elements and the fine-grained question intent classification, query the subject knowledge graph;
S417. In the subject knowledge graph, locate the core knowledge points that are related to the question elements and the question intent, and use them as the specific knowledge difficulties.

[0014] Preferably, the dynamic knowledge base includes: a structured knowledge point graph, a question-knowledge point association library, a lecture template and logic rule library, and a student historical interaction profile library; the generation process of the lecture strategy is controlled by the weak knowledge points and past questioning patterns recorded in the student historical interaction profile.

[0015] Secondly, a problem-solving interactive system based on multi-model fusion includes:
The terminal interaction module, used to collect speech signals containing questions and question images, and to output the explanation speech signals;
The signal preprocessing module, used to preprocess the speech signal and the question image respectively;
The multi-model processing module, which includes: the ASR model unit, used to convert the preprocessed speech signal into text data and extract question keywords from the text data; the VLM model unit, used to recognize the preprocessed question images and generate structured question data; and the TTS model unit, used to synthesize explanation text into a speech signal;
The information fusion and strategy generation module, used to fuse and analyze the text data, question keywords, and structured question data, and to generate personalized explanation strategies based on the analysis results;
The content generation module, used to generate explanation text according to the explanation strategy.

[0016] The aforementioned interactive problem-solving system based on multi-model fusion allows students to ask questions directly via voice. The system accurately understands all information in the problem, including text, graphics, and formulas, and generates personalized and dynamic explanation strategies accordingly. It supports multi-round interactive optimization, making the explanations more targeted, effectively improving the efficiency of homework tutoring and the student learning experience, thus achieving intelligent and adaptive tutoring.

Description of the Drawings

[0017] Figure 1 is a flowchart of a problem explanation and interaction method based on multi-model fusion according to the present invention.

Detailed Implementation

[0018] The invention will be further described below with reference to the accompanying drawings and specific embodiments, so that those skilled in the art can implement it based on the description. The scope of protection of the invention is not limited to these specific embodiments.

[0019] This invention relates to a multi-model fusion-based interactive method for explaining questions. Its core lies in the collaboration and information fusion of three models, Automatic Speech Recognition (ASR), Visual Language Model (VLM), and Text-to-Speech (TTS), to achieve intelligent understanding of students' voice questions and personalized voice explanation. Figure 1 shows the overall flowchart of one embodiment of the method, which mainly includes: students initiate voice questions and take pictures of the questions through terminal devices; the system preprocesses the voice and images, and then converts them into text data and structured question data through the ASR and VLM models respectively; the information fusion unit fuses and analyzes these multimodal data to determine the students' core needs and generate personalized explanation strategies; explanation text is generated according to the strategy, synthesized into natural speech by the TTS model, and output to the students. If students have follow-up questions, the system can conduct multiple rounds of interaction and continuously optimize the explanation using historical data. The specific method includes the following steps:

S1. Voice Interaction Initiation Phase: Students initiate voice questions related to the problem through the voice acquisition module (such as a microphone) of their terminal device. The terminal device transmits the acquired raw voice signal to the signal preprocessing unit of the backend system. For example, a student asks through the built-in microphone of the terminal device, "Teacher, how do I add auxiliary lines to this geometry problem?" The terminal device performs analog-to-digital conversion on the acquired raw analog voice signal, generates a digital voice signal, encapsulates it into a data packet, and transmits it over the network to the AI problem explanation service backend deployed in the cloud.

[0020] S2. Speech Signal Recognition Stage: The signal preprocessing unit performs standardization processing on the original speech signal, including noise reduction, filtering, and enhancement. Specifically, spectral subtraction or deep learning-based noise reduction algorithms are used to eliminate environmental noise; bandpass filters are used to retain the main frequency range of human voice; and signal enhancement is performed to improve speech clarity and intelligibility. Subsequently, the processed speech signal is input into the Automatic Speech Recognition (ASR) model. The ASR model converts the speech signal into corresponding text information and performs initial semantic analysis to extract keywords reflecting the student's core intent, such as "how to prove," "auxiliary lines," and "formula application." The generated text data and core question keywords are then transmitted to the information fusion unit.
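The three preprocessing operations named above (filtering, noise reduction, enhancement) can be illustrated with a minimal sketch. It is not the method of the invention: DC-offset removal stands in for the band-pass filter, a moving average stands in for spectral subtraction or a learned denoiser, and peak normalization stands in for signal enhancement. All function names are hypothetical.

```python
from statistics import fmean


def remove_dc_offset(samples):
    """Subtract the mean so the waveform is centred on zero."""
    mean = fmean(samples)
    return [s - mean for s in samples]


def moving_average(samples, window=3):
    """Crude low-pass smoothing: average each sample with its neighbours."""
    half = window // 2
    smoothed = []
    for i in range(len(samples)):
        lo = max(0, i - half)
        hi = min(len(samples), i + half + 1)
        smoothed.append(fmean(samples[lo:hi]))
    return smoothed


def normalize_peak(samples, target_peak=1.0):
    """Scale the signal so its largest absolute value equals target_peak."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)
    return [s * target_peak / peak for s in samples]


def preprocess_speech(samples, window=3):
    """Filtering -> noise reduction -> enhancement, in that order."""
    return normalize_peak(moving_average(remove_dc_offset(samples), window))
```

A production system would more likely use a proper band-pass filter and a spectral or neural denoiser; the sketch only shows how the stages chain before the signal reaches the ASR model.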

[0021] S3. Question Information Acquisition and Recognition Stage: The terminal device acquires the question image (including question text, graphics, formulas, tables, etc.) through the image acquisition module and transmits it to the image preprocessing unit. The image preprocessing unit performs grayscale conversion, binarization, tilt correction, and noise reduction on the question image to obtain a clear, standardized question image. The standardized question image is input into the Visual Language Model (VLM). The VLM performs structured recognition on the text, graphics, formulas, and other elements in the question image to generate structured question data containing information such as the question stem, known conditions, questions to be solved, graphic features, and formula parameters. The VLM transmits the structured question data to the information fusion unit.
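As a rough illustration of the first two image preprocessing operations, the sketch below converts an RGB image to grayscale and then binarizes it with a fixed global threshold. The image is assumed to be a nested list of (R, G, B) tuples, the luminance weights are the common ITU-R BT.601 coefficients, and the threshold of 128 is an arbitrary choice. Tilt correction and noise reduction are omitted.

```python
def to_grayscale(rgb_image):
    """Convert rows of (R, G, B) tuples to luminance using BT.601 weights."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_image
    ]


def binarize(gray_image, threshold=128):
    """Map each pixel to 0 (dark) or 255 (light) with a fixed global threshold."""
    return [[255 if px >= threshold else 0 for px in row] for row in gray_image]


# A 2x2 toy image: white, black, light grey, dark grey.
image = [
    [(255, 255, 255), (0, 0, 0)],
    [(200, 200, 200), (30, 30, 30)],
]
binary = binarize(to_grayscale(image))
```

An adaptive threshold would usually cope better with unevenly lit photographs of homework pages; the fixed threshold is used here only to keep the example short.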

[0022] S4. Information Fusion and Explanation Strategy Generation Stage: The information fusion unit receives text data transmitted by the Automatic Speech Recognition (ASR) model, core question keywords, and structured question data transmitted by the Visual Language Model (VLM). It performs fusion analysis on the three types of data to determine the students' core needs, such as basic explanation, difficulty breakthrough, and extension. The fused information is then input into the explanation strategy generation unit. Based on a preset knowledge point map, question type library, and explanation logic rules, the explanation strategy generation unit generates personalized explanation strategies in combination with the students' core needs. The explanation strategies include the explanation order (e.g., first review the knowledge points → then analyze the question stem → then solve the problem), the explanation depth (basic version / advanced version), and the focus (e.g., focusing on formula application / focusing on logical reasoning).

[0023] S5. Explanation Content Generation and Voice Output Stage: The explanation content generation unit generates corresponding explanation text according to the explanation strategy. The explanation text includes knowledge point association explanations, key information analysis of the question stem, problem-solving ideas, step breakdown, answer derivation, and tips on common mistakes. The explanation text is input into the speech synthesis TTS model, which converts the explanation text into a natural and fluent speech signal. The timbre and speech rate of the speech signal can be adjusted according to the student's preset preferences. The speech signal generated by the speech synthesis TTS model is played through the audio output module of the terminal device, completing one AI question explanation interaction.
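Steps S2 to S5 form one round of interaction, and the data flow between the units can be sketched as below. Every model is passed in as a plain callable, so the sketch makes no claim about any particular ASR, VLM, or TTS implementation. The names `run_round` and `RoundResult` and the stub models in the usage lines are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RoundResult:
    question_text: str
    keywords: list
    question_data: dict
    strategy: dict
    explanation_text: str
    speech: bytes


def run_round(
    audio: bytes,
    image: bytes,
    asr: Callable[[bytes], tuple],
    vlm: Callable[[bytes], dict],
    fuse: Callable[[str, list, dict], dict],
    generate: Callable[[dict], str],
    tts: Callable[[str], bytes],
) -> RoundResult:
    """Run steps S2-S5 once on already preprocessed audio and image data."""
    text, keywords = asr(audio)                      # S2: speech -> text + keywords
    question_data = vlm(image)                       # S3: image -> structured data
    strategy = fuse(text, keywords, question_data)   # S4: fusion -> strategy
    explanation = generate(strategy)                 # S5: strategy -> explanation text
    speech = tts(explanation)                        # S5: text -> speech signal
    return RoundResult(text, keywords, question_data, strategy, explanation, speech)


# Usage with stub models standing in for the real ones.
result = run_round(
    b"audio",
    b"image",
    asr=lambda a: ("how do I add auxiliary lines", ["auxiliary lines"]),
    vlm=lambda i: {"graphics": ["G001"]},
    fuse=lambda t, k, q: {"focus": k[0], "elements": q["graphics"]},
    generate=lambda s: "Explain " + s["focus"],
    tts=lambda text: text.encode("utf-8"),
)
```

Passing the models in as callables keeps the orchestration independent of the deployment, which matches the modular unit structure described in the text.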

[0024] S6. Multi-round interaction optimization stage: If a student raises a new voice question based on this explanation, repeat steps S1 to S5. In subsequent interactions, the information fusion unit will associate historical interaction data (such as the questions that students have already asked and the knowledge points they have already mastered) to optimize the explanation strategy and achieve targeted multi-round interactive explanation.
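One simple way to picture the use of historical interaction data is a per-student session that records each round and removes already mastered knowledge points from the next explanation plan. This is a minimal sketch under that assumption. The `Session` class, its fields, and the filtering rule are hypothetical and not taken from the patent text.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Keeps the interaction data of earlier rounds for one student."""
    rounds: list = field(default_factory=list)

    def record(self, question_text, keywords, mastered_points):
        """Store one finished round."""
        self.rounds.append(
            {
                "text": question_text,
                "keywords": list(keywords),
                "mastered": list(mastered_points),
            }
        )

    def mastered(self):
        """Knowledge points marked as mastered in any earlier round."""
        seen = []
        for entry in self.rounds:
            for point in entry["mastered"]:
                if point not in seen:
                    seen.append(point)
        return seen


def optimize_strategy(planned_points, session):
    """Drop points mastered earlier so the new round stays targeted."""
    done = set(session.mastered())
    return [p for p in planned_points if p not in done]
```

For example, after a round in which "Pythagorean Theorem" was recorded as mastered, a follow-up plan of that theorem plus "Trigonometric Functions" would be reduced to the second item only.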

[0025] The Automatic Speech Recognition (ASR) model adopts an end-to-end speech recognition model based on the Transformer architecture. It is fine-tuned and trained using speech corpora in the education field (including common student questioning voices and knowledge-related voices) to improve the recognition accuracy of specific voices in educational scenarios. The ASR model also has dialect adaptation capabilities and can recognize questioning voices in multiple mainstream dialects.

[0026] The Visual Language Model (VLM) employs a visual-language pre-trained model, fine-tuned using massive amounts of educational question image data (including questions from multiple disciplines such as mathematics, physics, chemistry, and Chinese, covering various elements such as text, geometric figures, function graphs, experimental setup diagrams, and formulas). It possesses the ability to accurately identify elements of different subjects and question types. The VLM model also has formula recognition and parsing functions, which can convert formulas in images into editable and computable structured formula data.

[0027] The TTS speech synthesis model adopts an end-to-end speech synthesis architecture based on deep neural networks, which includes a text encoder, a prosody prediction module, an acoustic feature decoder, and a neural vocoder. The prosody prediction module can generate corresponding prosodic features based on the semantic content of the explanatory text and preset emotion tags to achieve emotional speech synthesis. The model also integrates a timbre control module, which changes the timbre of the output speech by switching different speaker embedding vectors.

[0028] Specifically, in step S2, the process of extracting keywords reflecting students' core intentions through initial semantic analysis of the text includes: The preprocessed speech signal is input into the Automatic Speech Recognition (ASR) model, which then performs the speech-to-text conversion to generate the corresponding first text information. Semantic analysis is performed on the first text information to identify and extract at least one question keyword that reflects the student's question; The first text information and the at least one question keyword are used as the output.
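The output of step S2 is the first text information together with at least one question keyword. The sketch below shows the shape of that output. It uses plain phrase matching against a small vocabulary as a stand-in for the semantic analysis described above. The vocabulary reuses the example keywords from the description, and the function names are hypothetical.

```python
QUESTION_KEYWORDS = ("how to prove", "auxiliary lines", "formula application")


def extract_keywords(text, vocabulary=QUESTION_KEYWORDS):
    """Return vocabulary phrases found in the text, in order of appearance."""
    lowered = text.lower()
    hits = [(lowered.find(kw), kw) for kw in vocabulary if kw in lowered]
    return [kw for _, kw in sorted(hits)]


def asr_output(first_text):
    """Bundle the transcript with its keywords, since step S2 outputs both."""
    return {"text": first_text, "keywords": extract_keywords(first_text)}


output = asr_output("Teacher, how do I add auxiliary lines to this geometry problem?")
```

Here `output["keywords"]` contains only "auxiliary lines", because that is the only vocabulary phrase present in the transcript.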

[0029] In step S3, the preprocessing of the question image specifically includes: grayscale conversion, binarization, tilt correction, and noise reduction of the question image.

[0030] Specifically, in step S3, the processed question image is identified using the Visual Language Model (VLM). The specific process includes: The processed question image is input into the Visual Language Model (VLM). The Visual Language Model (VLM) is used to analyze and recognize the text, graphics, and formula elements in the problem image. The Visual Language Model (VLM) outputs structured problem data containing the problem stem, known conditions, questions to be solved, graphical features, and formula parameters.

[0031] For example, it can identify a right triangle ABC and extract its attributes ∠C=90° and side AC=5cm; it can identify the component connections in a physical circuit diagram; and in particular, it can parse mathematical formulas and chemical equations in images, converting them into structured formula data such as LaTeX format or computable built-in expression trees. Finally, the Visual Language Model (VLM) outputs the structured question data, organized in a format such as JSON and clearly containing lists of text, graphic objects, formulas, and their interrelationships, and sends it to the information fusion unit.
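To make the JSON organization concrete, the following sketch shows one possible layout for the right-triangle example. Only the triangle, ∠C = 90°, and AC = 5 cm come from the description. The field names, the identifier `T001`, the formula entry, and the relation entry are illustrative assumptions.

```python
import json

question_data = {
    "text": [
        {"id": "T001", "content": "In right triangle ABC, ∠C = 90° and AC = 5 cm."}
    ],
    "graphics": [
        {
            "id": "G001",
            "type": "right_triangle",
            "vertices": ["A", "B", "C"],
            "attributes": {"angle_C_deg": 90, "AC_cm": 5},
        }
    ],
    # Hypothetical formula entry, stored as LaTeX.
    "formulas": [{"id": "F001", "latex": "AC^2 + BC^2 = AB^2"}],
    # Hypothetical relation linking the formula to the figure it applies to.
    "relations": [{"from": "F001", "to": "G001", "kind": "applies_to"}],
}

# ensure_ascii=False keeps the angle and degree symbols readable in the output.
serialized = json.dumps(question_data, ensure_ascii=False, indent=2)
```

Giving every element an identifier at this stage lets the information fusion unit later link spoken references to a specific element.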

[0032] Specifically, the process of step S4 is as follows: Step S41: The information fusion unit receives text data and keywords from ASR, as well as structured question data from VLM.

[0033] First, an internal index is created for each question element in the structured question data. These question elements can be text paragraphs, graphic objects (G001), formulas (F001), etc. Then, Named Entity Recognition (NER) and coreference resolution are performed on the text data output by ASR. For example, it is identified that "this edge" in the text points to the graphic object "edge AB" (ID: G001_E01). By calculating semantic similarity, entities in the text ("auxiliary lines") are linked to the question element type ("graphic - addable line segments").
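The linking step can be sketched as a nearest-match search over the internal index. In this minimal example, token-overlap (Jaccard) similarity stands in for the semantic similarity computation, which in practice would more likely use embeddings. The index descriptions, the `min_score` threshold, and the function names are assumptions for illustration.

```python
def tokens(text):
    """Lower-case the text and split it into a set of word tokens."""
    return set(text.lower().replace("-", " ").split())


def jaccard(a, b):
    """Token-overlap similarity in [0, 1]; a stand-in for semantic similarity."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def link_mention(mention, index, min_score=0.2):
    """Return the ID of the best-matching question element, or None."""
    best_id, best_score = None, 0.0
    for element_id, description in index.items():
        score = jaccard(mention, description)
        if score > best_score:
            best_id, best_score = element_id, score
    return best_id if best_score >= min_score else None


# Internal index: element ID -> short description of the element.
index = {
    "G001": "right triangle ABC",
    "G001_E01": "edge AB",
    "F001": "formula",
}
linked = link_mention("this edge", index)
```

With this index, "this edge" links to `G001_E01`, while a mention with no overlapping tokens returns `None` and is left unlinked.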

[0034] The complete question text data is input into a pre-trained intent recognition model. The model outputs a fine-grained intent classification, such as "REQUEST_SOLUTION_STEP" (requesting steps to solve a problem). Simultaneously, by combining the sentiment tendency of the current question text (through a sentiment analysis model) and the student's historical interaction data (such as past question frequency and error correction records), a cognitive state vector is generated, numerically representing the student's current "confusion level" and "confidence level."
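The description does not specify how the cognitive state vector is computed, so the following is only a sketch of one plausible blend. It combines a sentiment score for the current question with two history-derived rates using equal weights. The input names, the value ranges, and the 0.5 weights are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class CognitiveState:
    confusion: float   # 0 = clear, 1 = very confused
    confidence: float  # 0 = no confidence, 1 = fully confident


def clamp01(x):
    """Limit a value to the range [0, 1]."""
    return max(0.0, min(1.0, x))


def cognitive_state(sentiment, repeat_question_rate, error_rate):
    """Blend current sentiment with history; the equal weights are an assumption.

    sentiment: -1 (negative) to 1 (positive), from some sentiment model.
    repeat_question_rate, error_rate: 0 to 1, from historical interaction data.
    """
    negativity = (1.0 - sentiment) / 2.0
    confusion = clamp01(0.5 * negativity + 0.5 * repeat_question_rate)
    confidence = clamp01(1.0 - (0.5 * negativity + 0.5 * error_rate))
    return CognitiveState(confusion, confidence)


state = cognitive_state(sentiment=-0.5, repeat_question_rate=0.5, error_rate=0.25)
```

In a real system these weights would presumably be learned or tuned rather than fixed by hand.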

[0035] Based on the specific question elements linked (e.g., "Right Triangle G001") and the identified intent ("Requesting solution steps"), the system queries the subject knowledge graph in the backend. This graph stores knowledge points (e.g., "Pythagorean Theorem," "Trigonometric Functions") and their relationships (e.g., "Prerequisites," "Application Scenarios") in a graph structure. Through graph reasoning, the system identifies core knowledge points that may be involved in solving the current problem but that the student has not yet mastered. For example, it determines that the student's difficulty may lie in "failing to actively construct the connection between right triangles and the definition of trigonometric functions."
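The graph query can be pictured as a walk over prerequisite relations that collects the knowledge points the student has not yet mastered. The sketch below uses a plain dictionary as the graph. The knowledge point names come from the description, but the prerequisite edges between them and the depth-first traversal are illustrative assumptions rather than the graph reasoning actually used.

```python
KNOWLEDGE_GRAPH = {
    # knowledge point -> prerequisite knowledge points (illustrative edges)
    "Trigonometric Functions": ["Right Triangle", "Pythagorean Theorem"],
    "Pythagorean Theorem": ["Right Triangle"],
    "Right Triangle": [],
}


def locate_difficulties(target, mastered, graph=KNOWLEDGE_GRAPH):
    """Walk prerequisites depth-first; return unmastered points, prerequisites first."""
    ordered, visited = [], set()

    def visit(point):
        if point in visited:
            return
        visited.add(point)
        for prereq in graph.get(point, []):
            visit(prereq)
        if point not in mastered:
            ordered.append(point)

    visit(target)
    return ordered


difficulties = locate_difficulties("Trigonometric Functions", {"Right Triangle"})
```

Returning prerequisites first gives the explanation strategy a natural teaching order for the missing knowledge points.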

[0036] Step S42: Input the output of step S41 (intent, cognitive state, knowledge difficulties) into a demand classification model, which maps it into a multi-dimensional demand vector. For example, the multi-dimensional demand vector is: [knowledge level: application, problem-solving demand: step guidance, cognitive support: graphical visualization, detail: high].
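The mapping from the S41 outputs to a demand vector can be sketched with a few hand-written rules standing in for the demand classification model. The four dimensions and the values "application", "step guidance", "graphical visualization", and "high" follow the example above. The remaining values and the 0.5 confusion threshold are hypothetical.

```python
def demand_vector(intent, confusion, difficulties):
    """Rule-based stand-in for the demand classification model."""
    return {
        "knowledge_level": "application" if difficulties else "recall",
        "problem_solving_demand": (
            "step guidance"
            if intent == "REQUEST_SOLUTION_STEP"
            else "concept explanation"
        ),
        "cognitive_support": (
            "graphical visualization" if confusion >= 0.5 else "text"
        ),
        "detail": "high" if confusion >= 0.5 else "normal",
    }


vector = demand_vector("REQUEST_SOLUTION_STEP", 0.625, ["Pythagorean Theorem"])
```

For these inputs the result reproduces the example vector: application level, step guidance, graphical visualization, and high detail.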

[0037] Step S43: The explanation strategy generation unit queries the dynamic knowledge base based on the generated multi-dimensional demand vector.

[0038] Specifically, the dynamic knowledge base includes a structured knowledge point graph, a question-knowledge point association library, a lecture template and logic rule library, and a student historical interaction profile library; the generation process of the lecture strategy is controlled by the weak knowledge points and past questioning patterns recorded in the student historical interaction profile.

[0039] Among them, the structured knowledge point graph is associated with the subject knowledge graph.

[0040] Question-Knowledge Point Association Database: Records the combinations of knowledge points tested in various types of questions.

[0041] Explanation Templates and Logic Rule Base: Stores explanation paradigms for different intentions and question types, such as "Logic for explaining proof questions: first restate the conditions -> clarify the conclusion to be proved -> reverse analyze the intermediate theorems required -> forward write the steps".

[0042] Student Historical Interactive Profile Database: Records the student's weak knowledge points, preferred learning styles (such as preference for examples), and historical questioning patterns.
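The assembly of an explanation strategy from these knowledge base components can be sketched as a template lookup adjusted by the demand vector and the student profile. The proof-question template reuses the paradigm quoted above. The dictionary layout, the profile keys `weak_points` and `prefers_examples`, the question type label, and the fallback step are hypothetical.

```python
TEMPLATES = {
    # (intent, question type) -> ordered explanation steps
    ("REQUEST_SOLUTION_STEP", "proof"): [
        "restate the conditions",
        "clarify the conclusion to be proved",
        "reverse analyze the intermediate theorems required",
        "forward write the steps",
    ],
}


def assemble_strategy(intent, question_type, demand, profile, templates=TEMPLATES):
    """Combine a template with the demand vector and the student profile."""
    sequence = list(templates.get((intent, question_type), ["explain the solution"]))
    # Weak knowledge points from the profile are reviewed before the template steps.
    review = [f"review {p}" for p in profile.get("weak_points", [])]
    expression = (
        "worked example" if profile.get("prefers_examples") else "direct explanation"
    )
    return {
        "sequence": review + sequence,
        "detail": demand.get("detail", "normal"),
        "expression": expression,
    }


strategy = assemble_strategy(
    "REQUEST_SOLUTION_STEP",
    "proof",
    {"detail": "high"},
    {"weak_points": ["Pythagorean Theorem"], "prefers_examples": True},
)
```

The resulting strategy specifies the three things the text requires of it: content sequence, level of detail, and expression method.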

[0043] This invention also relates to a problem-solving interactive system based on multi-model fusion, which can be deployed using a cloud-edge collaborative architecture, including:
The terminal interaction module, used to collect speech signals containing questions and question images, and to output the explanation speech signals;
The signal preprocessing module, used to preprocess the speech signal and the question image respectively;
The multi-model processing module, which includes: the ASR model unit, used to convert the preprocessed speech signal into text data and extract question keywords from the text data; the VLM model unit, used to recognize the preprocessed question images and generate structured question data; and the TTS model unit, used to synthesize explanation text into a speech signal;
The information fusion and strategy generation module, used to fuse and analyze the text data, question keywords, and structured question data, and to generate personalized explanation strategies based on the analysis results;
The content generation module, used to generate explanation text according to the explanation strategy.

[0044] The aforementioned interactive problem-solving system based on multi-model fusion allows students to ask questions directly via voice. The system accurately understands all information in the problem, including text, graphics, and formulas, and generates personalized and dynamic explanation strategies accordingly. It supports multi-round interactive optimization, making the explanations more targeted, effectively improving the efficiency of homework tutoring and the student learning experience, thus achieving intelligent and adaptive tutoring.

Claims

1. An interactive question explanation method based on multi-model fusion, characterized in that the method includes the following steps:
S1. Receive a voice signal containing a question or problem input through a terminal device;
S2. Preprocess the speech signal, convert the processed speech signal into text data using an automatic speech recognition (ASR) model, and extract question keywords related to the question from the text data;
S3. Obtain the question image through the terminal device, preprocess the question image, use the Visual Language Model (VLM) to recognize the processed question image, and generate structured question data containing question element information;
S4. The information fusion unit integrates the text data, question keywords, and structured question data as interactive data to obtain the explanation requirements, and generates a personalized explanation strategy based on the explanation requirements;
S5. Generate explanation text according to the explanation strategy, and use a speech synthesis TTS model to convert the explanation text into a speech signal for output, thus completing a single round of explanation interaction;
S6. If a new voice question is initiated, steps S1 to S5 are repeated to conduct a new round of interaction. During the new round of interaction, the information fusion unit optimizes the explanation strategy generated in the new round by associating the interaction data from the previous round.

2. The interactive question explanation method based on multi-model fusion according to claim 1, characterized in that: In step S2, the preprocessing of the speech signal specifically includes: noise reduction, filtering, and enhancement processing of the speech signal.

3. The interactive question explanation method based on multi-model fusion according to claim 1 or 2, characterized in that: in step S2, the automatic speech recognition (ASR) model is an end-to-end speech recognition model based on the Transformer architecture, and the process of using the ASR model to convert the processed speech signal into text data and extracting question keywords related to the question from the text data includes: inputting the preprocessed speech signal into the ASR model, which performs speech-to-text conversion to generate corresponding first text information; performing semantic analysis on the first text information to identify and extract at least one question keyword that reflects the student's question; and using the first text information and the at least one question keyword as the output.

4. The interactive question explanation method based on multi-model fusion according to claim 3, characterized in that: In step S3, the preprocessing of the question image specifically includes: grayscale conversion, binarization, tilt correction, and noise reduction of the question image.
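A minimal sketch of three of the four named image steps (grayscale conversion, binarization, and noise reduction) on a nested-list image is given below. The luma weights are the standard ITU-R BT.601 coefficients; the fixed threshold of 128, the 3x3 median filter, and all function names are assumptions, and tilt correction is omitted because it needs geometric estimation beyond a short example.

```python
def to_grayscale(rgb):
    """Convert rows of (r, g, b) tuples to gray levels using BT.601 luma weights."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb]

def binarize(gray, threshold=128):
    """Map each pixel to 255 (at or above threshold) or 0 (below it)."""
    return [[255 if v >= threshold else 0 for v in row] for row in gray]

def median3x3(img):
    """Noise reduction stand-in: replace each pixel by the median of its
    in-bounds 3x3 neighbourhood (upper median when the count is even)."""
    h = len(img)
    w = len(img[0]) if h else 0
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = sorted(
                img[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            )
            row.append(vals[len(vals) // 2])
        out.append(row)
    return out

photo = [[(255, 255, 255), (0, 0, 0)],
         [(200, 200, 200), (10, 10, 10)]]
binary = binarize(to_grayscale(photo))

speckled = [[0, 0, 0],
            [0, 255, 0],
            [0, 0, 0]]
despeckled = median3x3(speckled)
```

In practice a library such as OpenCV or Pillow would handle these operations on real images; the pure-Python version only makes the order and effect of each step explicit.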

5. The interactive question explanation method based on multi-model fusion according to claim 4, characterized in that: in step S3, the process of recognizing the processed question image using the Visual Language Model (VLM) includes: inputting the processed question image into the VLM; analyzing and recognizing, by the VLM, the text, graphics, and formula elements in the question image; and outputting, by the VLM, structured question data containing the question stem, known conditions, questions to be solved, graphical features, and formula parameters.
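The five fields named in this claim suggest a simple record type. The sketch below assumes, purely for illustration, that the VLM returns its result as a JSON object; the claim does not state the output format, and the class name, field names, and `parse_vlm_output` helper are hypothetical.

```python
import json
from dataclasses import dataclass, field

@dataclass
class StructuredProblem:
    stem: str                                                # question stem
    known_conditions: list = field(default_factory=list)     # given facts
    questions_to_solve: list = field(default_factory=list)   # what is asked
    graphical_features: list = field(default_factory=list)   # figures, labels
    formula_parameters: dict = field(default_factory=dict)   # symbols and values

def parse_vlm_output(raw: str) -> StructuredProblem:
    """Validate and load a JSON string (assumed VLM output format)."""
    data = json.loads(raw)
    if not isinstance(data, dict) or not data.get("stem"):
        raise ValueError("VLM output must be a JSON object with a non-empty 'stem'")
    return StructuredProblem(
        stem=data["stem"],
        known_conditions=list(data.get("known_conditions", [])),
        questions_to_solve=list(data.get("questions_to_solve", [])),
        graphical_features=list(data.get("graphical_features", [])),
        formula_parameters=dict(data.get("formula_parameters", {})),
    )

raw = json.dumps({
    "stem": "A right triangle has legs a and b.",
    "known_conditions": ["a = 3", "b = 4"],
    "questions_to_solve": ["Find the hypotenuse c."],
    "formula_parameters": {"a": 3, "b": 4},
})
problem = parse_vlm_output(raw)
```

Missing optional fields default to empty containers, so downstream fusion code can iterate over them without checking for absence.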

6. The interactive question explanation method based on multi-model fusion according to claim 5, characterized in that step S4 includes the following steps:
S41. Map and associate the text data and question keywords with specific elements in the structured question data, perform deep semantic analysis on the text data to obtain the questioning intent and cognitive state, and locate specific knowledge difficulties by combining the associated structured question data;
S42. Based on the questioning intent, cognitive state, and knowledge difficulties, generate a multi-dimensional demand vector through a demand classification model, the multi-dimensional demand vector including a knowledge point understanding level dimension and a problem-solving process demand dimension;
S43. Based on the multi-dimensional demand vector, query the dynamic knowledge base and, combining the preset teaching logic rules with the student's historical interaction profile, dynamically assemble an executable explanation strategy, the explanation strategy specifying the content sequence, level of detail, and expression method of the explanation.
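Steps S42 and S43 can be pictured with a small rule-based stand-in. The claim calls for a demand classification model and a dynamic knowledge base; neither is specified, so the lookup table, thresholds, intent labels, and function names below are assumptions chosen only to show how a two-dimensional demand vector might drive the three strategy attributes (content sequence, level of detail, expression method).

```python
from dataclasses import dataclass

@dataclass
class DemandVector:
    knowledge_understanding: float  # 0 = no grasp of the knowledge point, 1 = full grasp
    process_need: float             # 0 = answer only, 1 = full step-by-step process

# Hypothetical intent labels mapped to how much of the solving process is wanted.
PROCESS_NEED_BY_INTENT = {"ask_steps": 1.0, "ask_concept": 0.4, "check_answer": 0.1}

def classify_demand(intent: str, confusion: float) -> DemandVector:
    """S42 stand-in: derive the demand vector from intent and cognitive state."""
    return DemandVector(
        knowledge_understanding=1.0 - confusion,
        process_need=PROCESS_NEED_BY_INTENT.get(intent, 0.5),
    )

def assemble_strategy(demand: DemandVector, difficulty: str, weak_points: set) -> dict:
    """S43 stand-in: assemble sequence, detail level, and expression method."""
    sequence = []
    # Review the knowledge point first if understanding is low or the
    # student's profile marks it as a weak point.
    if demand.knowledge_understanding < 0.5 or difficulty in weak_points:
        sequence.append(f"review:{difficulty}")
    sequence.append("walkthrough" if demand.process_need >= 0.5 else "key_idea")
    sequence.append("summary")
    return {
        "sequence": sequence,
        "detail": "high" if demand.knowledge_understanding < 0.5 else "normal",
        "expression": "step_by_step" if demand.process_need >= 0.5 else "concise",
    }

confused = assemble_strategy(
    classify_demand("ask_steps", confusion=0.8),
    "pythagorean_theorem",
    weak_points={"pythagorean_theorem"},
)
confident = assemble_strategy(
    classify_demand("check_answer", confusion=0.1),
    "pythagorean_theorem",
    weak_points=set(),
)
```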

7. The interactive question explanation method based on multi-model fusion according to claim 6, characterized in that the specific process of step S41 includes the following steps:
S411. Establish an internal index for the structured question data, wherein each identified question element is assigned a unique identifier;
S412. Perform named entity recognition and coreference resolution on the text data to identify entity mentions and referring expressions in the text data;
S413. By calculating semantic similarity and context matching degree, link the entity mentions and referring expressions identified in the text data to the corresponding question element identifiers in the internal index;
S414. Input the text data into a pre-trained intent recognition model, which outputs a fine-grained questioning intent classification corresponding to the voice question;
S415. Based on the text data and historical interaction data, generate a cognitive state vector representing the student's current confidence level and level of confusion;
S416. Based on the linked question elements and the fine-grained questioning intent classification, query the subject knowledge graph;
S417. In the subject knowledge graph, locate the core knowledge points related to the question elements and the questioning intent, and use them as the specific knowledge difficulties.
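Steps S411 and S413 amount to indexing the question elements and linking each spoken mention to the best-matching element. The sketch below substitutes token-overlap (Jaccard) similarity for the unspecified semantic similarity and context matching computation; the identifier scheme `E1`, `E2`, ..., the `min_score` cutoff, and the function names are assumptions.

```python
def build_index(elements):
    """S411 stand-in: assign each question element a unique identifier."""
    return {f"E{i}": text for i, text in enumerate(elements, start=1)}

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity, used here in place of a semantic model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def link_mentions(mentions, index, min_score=0.1):
    """S413 stand-in: link each mention to its best element, or None if
    no element reaches the minimum score."""
    links = {}
    for mention in mentions:
        best_id, best_score = max(
            ((eid, jaccard(mention, text)) for eid, text in index.items()),
            key=lambda pair: pair[1],
            default=(None, 0.0),
        )
        links[mention] = best_id if best_score >= min_score else None
    return links

index = build_index([
    "triangle ABC with right angle at C",
    "formula a^2 + b^2 = c^2",
    "side AB length 5",
])
links = link_mentions(["the right angle", "that formula", "the weather"], index)
```

Returning `None` for unmatched mentions lets later steps distinguish a question about the problem from an off-topic remark instead of forcing a spurious link.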

8. The interactive question explanation method based on multi-model fusion according to claim 7, characterized in that: the dynamic knowledge base includes a structured knowledge point graph, a question-knowledge point association library, an explanation template and logic rule library, and a student historical interaction profile library; and the generation of the explanation strategy is constrained by the weak knowledge points and past questioning patterns recorded in the student's historical interaction profile.

9. An interactive question explanation system based on multi-model fusion, used to implement the interactive question explanation method based on multi-model fusion according to any one of claims 1 to 8, characterized in that the system includes:
a terminal interaction module, used to collect speech signals containing questions and question images, and to output explanation speech signals;
a signal preprocessing module, used to preprocess the speech signal and the question image respectively;
a multi-model processing module, including: an ASR model unit, used to convert the preprocessed speech signal into text data and extract question keywords from the text data; a VLM model unit, used to recognize the preprocessed question image and generate structured question data; and a TTS model unit, used to synthesize explanation text into a speech signal;
an information fusion and strategy generation module, used to fuse and analyze the text data, question keywords, and structured question data, and to generate a personalized explanation strategy based on the analysis results; and
a content generation module, used to generate explanation text according to the explanation strategy.