A voice interaction system for knowledge training

By deploying a lightweight model and knowledge graph-based voice interaction system on edge devices, the network dependency and knowledge update lag issues of existing knowledge training systems are resolved, enabling multimodal interaction and offline access, and improving the adaptability and professionalism of training.

CN122201264A · Pending · Publication Date: 2026-06-12 · SHANGHAI BUSINESS SCHOOL


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: SHANGHAI BUSINESS SCHOOL
Filing Date: 2026-01-22
Publication Date: 2026-06-12


IPC: G10L15/07; G10L15/06; G10L15/16; G10L15/18; G10L15/183; G10L15/26; G10L13/027; G10L13/08; G10L25/63; G06N5/022; G06N5/04

AI Tagging

Application Domain

Speech recognition · Inference methods


Abstract

The application discloses a voice interaction system for knowledge training, which is deployed on an edge device and comprises a model inference layer, a knowledge enhancement layer and an interaction perception layer; the edge device is composed of a microcontroller chip and an extension interface, and is used for loading the model inference layer, the knowledge enhancement layer and the interaction perception layer to realize voice interaction with a user; the model inference layer is used for loading a light language model; the light language model is a light model obtained by model compression of a large language model; the knowledge enhancement layer is used for establishing, storing and maintaining a knowledge graph to realize offline knowledge access in application; and the interaction perception layer realizes multi-modal information fusion interaction based on a context perception algorithm to generate interaction content with the user. The application deploys a large language model on an edge device to realize voice interaction, effectively realizes model volume compression, relieves the knowledge lag problem and strengthens the availability in a weak network environment, so that the application greatly expands the applicable scenarios of the system.


Description

Technical Field

[0001] This invention relates to the fields of edge computing and deep learning, and more specifically, to a voice interaction system for knowledge training.

Background Technology

[0002] Knowledge training is a crucial way to improve the professional skills of practitioners and promote the standardized development of the industry. In particular, some training programs that need to be widely disseminated (such as e-commerce training) may need to be conducted in remote areas and reach people of various ages. Currently, this type of training mainly relies on two technological approaches: one is web-based online learning platforms (such as Taobao University and JD Academy), and the other is online learning through general voice assistants (such as Xiao Ai and Tmall Genie). However, with the widespread development of training, the geographical scope of training needs is expanding, placing demands on the convenience of training; based on the training objectives, more professional training content is required; and based on the diversification of the training audience, higher requirements are placed on language interaction.

[0003] Current knowledge training systems suffer from the following problems: 1) Excessive reliance on the network: existing knowledge training systems rely on cloud-based AI services, resulting in high average response latency; in weak network environments (<50 Kbps) availability drops further, so they cannot meet the training needs of rural areas. 2) Overly general domain knowledge: existing knowledge training systems rely on general-purpose large models (such as GPT and Wenxin Yiyan), which have insufficient knowledge coverage in vertical domains and cannot be updated dynamically with customized platform rules. 3) Limited interactive experience: existing systems are mostly one-way information transmission systems, lacking context-aware multi-turn dialogue capabilities and failing to achieve a complete learning loop of "problem-analysis-answer-verification". 4) Lagging knowledge update mechanism: the rule update cycle averages 6-12 months, which is seriously out of step with actual rule changes (such as the monthly updates of Douyin rules).

[0004] Therefore, there is an urgent need for a technical solution to improve the network adaptability, domain expertise, interactive intelligence, and timeliness of knowledge training.

Summary of the Invention

[0005] To achieve the above objectives, this application provides a voice interaction system for knowledge training. The voice interaction system is deployed on an edge device and includes a model inference layer, a knowledge enhancement layer, and an interaction perception layer. The edge device is composed of a microcontroller chip and an expansion interface, and is used to load the model inference layer, the knowledge enhancement layer, and the interaction perception layer to realize voice interaction with users. The model inference layer is used to load a lightweight language model, which is obtained by compressing a large language model. The knowledge enhancement layer is used to build, store, and maintain knowledge graphs, enabling offline knowledge access in applications; it supports multiple knowledge graphs so that knowledge training can cover different industries and fields. The interaction perception layer uses a context-aware algorithm to achieve multimodal information fusion and interaction, and generates interactive content with users.

[0006] The edge device includes: a microcontroller chip which, as the main control chip, performs the core computation for voice interaction and voice response content generation, generates voice text, and supports the execution of tasks such as model inference, voice processing, and communication; a voice input module, used to collect user voice input, which integrates an adaptive noise reduction algorithm to improve voice quality; a voice output module, used to convert the voice text generated by the system into voice output, which uses an adaptive audio enhancement algorithm to dynamically adjust the output volume according to the ambient noise; a communication module, used to exchange data with the cloud server, which supports knowledge base updates and user data synchronization; a storage module, used to store the compressed lightweight language model and the knowledge base in support of the main control chip's computation; and a power management module, which employs a low-power design to provide power to all components of the edge device.

[0007] Furthermore, components for voice input, voice output, communication, and power management, an expansion slot for an external storage card, and the storage card are integrated with the microcontroller chip and packaged in a single box to form an edge device.

[0008] The model compression includes the following steps: Structure-aware analysis is performed on the large language model to identify key layers and redundant layers. During this analysis, all layers in the large language model are classified by calculating a relevance score for each network layer, as follows: the weight distribution entropy $H$ of each network layer is calculated as $H = -\sum_i p_i \log p_i$, where $p_i$ is the normalized distribution probability of the layer's weight matrix after histogram statistics; the task relevance $R$ of each network layer is obtained; the relevance score is then calculated [formula missing in source]; network layers whose relevance score is greater than a specified value (e.g., 0.7) are identified as key layers, and network layers whose relevance score is less than or equal to the specified value are identified as redundant layers. Progressive quantization is performed to optimize model inference, which effectively reduces the size of the model. Model lightweighting is performed based on a channel importance assessment of the redundant layers, and redundant channels are removed. Removing redundant channels includes: assessing the importance of each channel in a redundant layer to the speech response content generation task, where the importance value is scored by the formula Importance = (weight distribution entropy) × (task relevance); and setting an importance threshold and removing channels whose importance is less than that threshold. Knowledge distillation is then performed: the original large language model is used as the teacher model, and the lightweight language model is trained as the student model, so that the key capabilities of the lightweight language model are maintained. For the lightweight language model, TensorFlow Lite for Microcontrollers is used to accelerate inference, merging Conv+BN+ReLU into a single operator to reduce memory allocation.

[0009] Specifically, when establishing the knowledge graph, an entity recognition model is used to extract key entities and relationships from knowledge documents to form the knowledge graph; the knowledge documents include rule documents, cases, and industry reports; when maintaining the knowledge graph, a knowledge enhancement mechanism is used to accurately optimize the knowledge graph, and an incremental learning mechanism is used to continuously maintain the knowledge graph.

[0010] Among them, the knowledge enhancement mechanism refers to achieving accurate knowledge matching based on semantic similarity calculation using graph neural networks; and then using a dynamic weight adjustment algorithm to dynamically adjust the weight of the search results according to the semantic similarity between the user's question and the knowledge graph.

[0011] The incremental learning mechanism refers to automatically identifying key update points based on changes in platform rules through an incremental update algorithm for the knowledge base; after identifying the key update points, transfer learning techniques are used to fine-tune the changed parts. The knowledge base incremental update algorithm is based on the difference comparison algorithm and includes: identifying update points based on text similarity, locating the knowledge graph based on the update points, and making local fine adjustments to the located content.

[0012] Furthermore, the difference comparison algorithm employs a two-layer filtering mechanism, including: The first layer uses SimHash for text deduplication, dividing the newly acquired rule document into 4KB text blocks and comparing them with the Hamming distance of the existing document fingerprints in the knowledge graph. Text blocks with a distance > 3 are marked as candidate update blocks. The second layer uses semantic comparison based on BERT sentence vector similarity. Candidate update blocks are encoded into vectors using Sentence-BERT and their cosine similarity is calculated with the entity description vectors in the knowledge graph. Points with a similarity of <0.75 are considered key update points.

[0013] The multimodal fusion interaction is implemented by a multimodal interaction algorithm based on dialogue state tracking, including: The Transformer-based dialogue state tracking model updates the dialogue context in real time; it introduces an attention mechanism, sets the dialogue state encoding dimension to 128 dimensions, focuses on key dialogue information, and ignores irrelevant information. A multimodal fusion network is adopted to perform weighted calculation and fusion of multimodal information based on semantic similarity; The response content is dynamically generated and adjusted based on the dialogue status.

[0014] Furthermore, the implementation of the interactive awareness layer includes the following steps: It acquires and processes user voice input, uses an independent speech recognition model to perform speech recognition on edge devices, converts speech information into text information, and provides input data for upper-layer dialogue management. Dialogue management is achieved through dialogue state tracking and user intent recognition based on recurrent neural networks; It integrates a knowledge graph query engine to retrieve relevant knowledge entries in real time based on the characteristics of input data; it performs multi-dimensional comprehensive sorting based on semantic relevance, information timeliness, and user preferences, and outputs response text. The text is converted into speech output, and speech synthesis is optimized for specific scenarios.

[0015] This invention integrates multimodal information such as voice input, knowledge graph, and role-playing to construct an adaptive interaction mechanism based on dialogue state tracking. Addressing the rapid iteration of knowledge, an incremental learning algorithm is employed to achieve quarterly updates of the knowledge base, effectively alleviating the problem of knowledge lag. Simultaneously, it supports offline access to the knowledge base and, combined with adaptive noise reduction and audio enhancement technologies, enhances usability in weak network environments, thereby greatly expanding the system's applicable scenarios.

Attached Figure Description

[0016] Figure 1 is a schematic diagram of the structure of a voice interaction system for knowledge training provided according to an embodiment of the present invention; Figure 2 is a functional structure diagram of an edge device provided according to an embodiment of the present invention; Figure 3 is a schematic diagram illustrating the functional implementation of the interaction perception layer of a voice interaction system according to an embodiment of the present invention; Figure 4 is a schematic diagram of an edge device operating a voice interaction system according to an embodiment of the present invention.

Detailed Implementation

[0017] The specific implementation of the present invention will now be described in detail with reference to the accompanying drawings.

[0018] The voice interaction system for knowledge training provided by this invention is deployed on an edge device. Its structure, as shown in Figure 1, includes a model inference layer, a knowledge enhancement layer, and an interaction perception layer. The edge device P100 is composed of a microcontroller chip and an expansion interface, and is used to load the model inference layer, the knowledge enhancement layer, and the interaction perception layer to realize voice interaction with users.

[0019] The expansion interface supports voice input, voice output, storage, communication, and power supply. A microphone for voice input, a power amplifier and speaker for voice output, an expansion slot with an external memory card, communication components, and power management components are integrated with the microcontroller chip into a single package to form the edge device.

[0020] The functional structure of the edge device, as shown in Figure 2, includes the following parts:
1) Main control chip: in this invention, a microcontroller chip is used as the main control chip to perform the core computation for voice interaction and voice response content generation, generate voice text, and support the execution of tasks such as model inference, voice processing, and communication.
2) Voice input module: collects user voice input and integrates an adaptive noise reduction algorithm to improve voice quality.
3) Voice output module: converts the system-generated voice text into voice output, using an adaptive audio enhancement algorithm to dynamically adjust the output volume according to the ambient noise.
4) Communication module: includes Wi-Fi, Bluetooth, and other components to exchange data with the cloud server, and supports knowledge base updates and user data synchronization.
5) Storage module: stores the compressed lightweight language model, knowledge bases, user learning records, and other information in support of the main control chip's computation. The storage module can be integrated with the main control chip or implemented through an external SD card or TF card via an expansion slot. For example, using the ESP32-S3's built-in 16MB Flash and an external TF card expansion slot, the TF card stores the lightweight language model and the built-in Flash stores the operating system and temporary data.
6) Power management module: adopts a low-power design to provide stable power to all components of the edge device.
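The adaptive volume adjustment in the voice output module is described only at a functional level. A minimal sketch of one way it could work is given here, assuming normalized ambient-noise samples in [-1, 1] and a linear mapping from noise level to output gain; the function names and every numeric default are hypothetical and do not come from the patent.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of normalized audio samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def output_gain(noise_samples, quiet_rms=0.01, loud_rms=0.2,
                min_gain=0.3, max_gain=1.0):
    """Map ambient noise level linearly onto [min_gain, max_gain].

    All thresholds are illustrative assumptions: noise at or below
    quiet_rms yields min_gain, noise at or above loud_rms yields max_gain.
    """
    level = rms(noise_samples)
    t = (level - quiet_rms) / (loud_rms - quiet_rms)
    t = max(0.0, min(1.0, t))  # clamp to the valid interpolation range
    return min_gain + t * (max_gain - min_gain)
```

A louder environment produces a higher gain, so the synthesized speech stays audible without being needlessly loud in a quiet room.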

[0021] In one embodiment of this invention, the microcontroller chip is an ESP32-S3 (480MHz main frequency, 64MB RAM, 16MB Flash); the voice input device is a MAX9814 microphone with a sampling rate of 16kHz and a signal-to-noise ratio >70dB; the voice output component is an I2S digital amplifier with an output power of 3W and a frequency response of 20Hz-20kHz; the communication components are Wi-Fi 6 and Bluetooth 5.0; and power is supplied through a 5V/2A USB-C port, with a battery life of 8 hours. Together these form the edge device shown in Figure 4. In this embodiment, the dual-core architecture of the ESP32-S3 is used, with one core running model inference and the other core handling voice I/O to avoid interrupt conflicts.

[0022] The model inference layer P110 is used to load a lightweight language model, which is constructed by compressing a large language model. Model compression is implemented using PyTorch, specifically including its model quantization and pruning frameworks, and comprises the following steps: 1) Perform structure-aware analysis on the large language model to identify key layers and redundant layers. In the structure-aware analysis, all layers in the large language model are classified. Generally, attention layers and feedforward layers are classified as key layers, and some convolutional layers are classified as redundant layers. Classification is achieved by calculating a relevance score for each network layer, as follows: the weight distribution entropy $H$ of each network layer is calculated as $H = -\sum_i p_i \log p_i$, where $p_i$ is the normalized distribution probability of the layer's weight matrix after histogram statistics; the task relevance $R$ of each network layer is determined by masking that layer on the question-answering validation set; the relevance score is then calculated [formula missing in source]. Network layers whose relevance score is greater than the specified value (e.g., 0.7, as in paragraph [0008]) are identified as key layers (such as the first 2/3 of the Transformer self-attention and feedforward layers); network layers whose relevance score is less than or equal to that value are identified as redundant layers (such as the last 1/3 of the layers and the embedding layers of a feedforward network).
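The histogram-based entropy and the threshold classification in step 1) can be sketched as follows. This is an illustration under assumptions: the bin count of 16 and the natural logarithm are choices made here, and because the text does not give the formula that combines entropy and task relevance into a relevance score, `classify_layers` simply takes precomputed scores as input. The function names are hypothetical.

```python
import math


def weight_entropy(weights, bins=16):
    """Shannon entropy of a histogram built over a flat list of weights."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return 0.0  # all weights identical: one occupied bin, zero entropy
    counts = [0] * bins
    for w in weights:
        # Map each weight to a bin index; the maximum falls in the last bin.
        idx = min(int((w - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    total = len(weights)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)


def classify_layers(scores, threshold=0.7):
    """Split layers into key (score > threshold) and redundant (score <= threshold)."""
    key = [name for name, s in scores.items() if s > threshold]
    redundant = [name for name, s in scores.items() if s <= threshold]
    return key, redundant
```

A layer whose weights are spread evenly across many bins has high entropy, while a layer whose weights collapse into a few bins has low entropy.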

[0023] 2) Perform progressive quantization to optimize model inference: this includes using INT8 quantization for key layers and FP16 quantization for redundant layers; progressive quantization reduces the precision of the model representation and effectively reduces the size of the model. 3) Lightweight the model based on a channel importance assessment of the redundant layers and remove redundant channels: after the redundant layers are identified, the importance of each of their channels to the speech response content generation task is evaluated. The importance is expressed as an importance value, scored by the formula Importance = (weight distribution entropy) × (task relevance). The weight distribution entropy is obtained by calculating the singular value decomposition entropy of the channel weight matrix, as follows: the two-dimensional weight matrix $W$ of a channel in a convolutional layer is decomposed by SVD as $W = U \Sigma V^{T}$, where $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal matrix; the singular values are extracted from $\Sigma$ to form a vector $\sigma = (\sigma_1, \sigma_2, \ldots, \sigma_r)$, with $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r \ge 0$; the vector $\sigma$ is normalized to obtain a discrete probability distribution $P = (p_1, \ldots, p_r)$, where $p_i = \sigma_i / \sum_{j=1}^{r} \sigma_j$, satisfying $\sum_{i=1}^{r} p_i = 1$; based on the discrete probability distribution $P$, the Shannon entropy is calculated as the final weight distribution entropy value $H = -\sum_{i=1}^{r} p_i \log p_i$.
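The channel importance computation in step 3) can be illustrated with a short sketch. It assumes the singular values of each channel's weight matrix have already been computed elsewhere (for example with an SVD routine from a numerical library), uses the natural logarithm, and takes the task relevance as a given number. The helper names and the dictionary layout are hypothetical.

```python
import math


def svd_entropy(singular_values):
    """Shannon entropy of singular values normalized to sum to 1."""
    total = sum(singular_values)
    if total <= 0:
        return 0.0
    probs = [s / total for s in singular_values if s > 0]
    return -sum(p * math.log(p) for p in probs)


def channel_importance(singular_values, task_relevance):
    """Importance = (weight distribution entropy) x (task relevance)."""
    return svd_entropy(singular_values) * task_relevance


def channels_to_keep(channels, threshold):
    """Return names of channels whose importance is not below the threshold.

    channels maps a channel name to (singular_values, task_relevance).
    """
    return [name for name, (sv, rel) in channels.items()
            if channel_importance(sv, rel) >= threshold]
```

A channel whose weight matrix is dominated by a single singular value has entropy near zero and is therefore among the first to be pruned.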

[0024] In this step, an importance threshold is set, and channels with an importance less than the threshold are pruned. Pruning channels reduces the number of parameters. Furthermore, the model has already been quantized when pruning redundant channels, and the quantization effect is preserved during the pruning process, further reducing computational load.
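The patent names INT8 quantization without specifying a scheme. As a hedged illustration of what quantizing a weight tensor involves, the sketch below uses symmetric per-tensor quantization, which is one common choice and an assumption here; the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from integers and the shared scale."""
    return [q * scale for q in quantized]
```

Because each remaining weight keeps its own integer code and the shared scale, dropping whole channels afterwards does not disturb the quantized values that are left.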

[0025] 4) Finally, knowledge distillation is performed, that is, the original large language model is used as the teacher model, and the lightweight language model after quantization and channel pruning is trained as the lightweight student model, while maintaining the key capabilities of the lightweight language model.

[0026] The loss function for knowledge distillation is $L = \alpha L_{CE} + \beta L_{KL} + \gamma L_{att}$, where $L_{CE}$ is the standard cross-entropy loss, $L_{KL}$ is the KL divergence between the outputs of the teacher and student models, $L_{att}$ is the L2 distance loss between their attention matrices, and $\alpha$, $\beta$, $\gamma$ are the weighting coefficients.
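A minimal sketch of a three-term distillation loss of this kind follows: a weighted sum of cross-entropy, teacher-to-student KL divergence, and an L2 distance between attention matrices. The coefficient names `alpha`, `beta`, `gamma`, their default values, and the KL direction (teacher relative to student) are assumptions made for illustration; the patent does not state them.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """Standard cross-entropy of the student's prediction against the true label."""
    return -math.log(softmax(student_logits)[label])


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) over the two output distributions."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def attention_l2(teacher_attn, student_attn):
    """L2 distance between two attention matrices given as lists of rows."""
    return math.sqrt(sum((t - s) ** 2
                         for t_row, s_row in zip(teacher_attn, student_attn)
                         for t, s in zip(t_row, s_row)))


def distillation_loss(student_logits, teacher_logits, label,
                      student_attn, teacher_attn,
                      alpha=0.5, beta=0.4, gamma=0.1):
    """Weighted sum of the three loss terms; the weights are illustrative."""
    return (alpha * cross_entropy(student_logits, label)
            + beta * kl_divergence(teacher_logits, student_logits)
            + gamma * attention_l2(teacher_attn, student_attn))
```

When the student reproduces the teacher's outputs and attention exactly, the last two terms vanish and only the cross-entropy term remains.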

[0027] Furthermore, TensorFlow Lite for Microcontrollers is used to accelerate inference for the lightweight student model, merging Conv+BN+ReLU into a single operator to reduce memory allocation and optimize computational power.

[0028] In this embodiment of the invention, the Wenxin Yiyan large model is compressed, reducing the original large language model from 0.3GB to 45MB and shortening the inference time from 1.2 seconds to 185ms per inference. After training on a professional question test set, the model accuracy is no less than 2.5%.
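As a quick check of the figures in this embodiment, the ratios can be worked out directly, taking 0.3GB as 300MB (decimal units, an assumption):

```latex
\[
\text{size ratio} = \frac{300\ \text{MB}}{45\ \text{MB}} \approx 6.7,
\qquad
\text{latency ratio} = \frac{1200\ \text{ms}}{185\ \text{ms}} \approx 6.5
\]
```

Under that assumption, the model shrinks by a factor of roughly 6.7 and each inference runs roughly 6.5 times faster.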

[0029] The P120 knowledge enhancement layer is used to build, store, and maintain the knowledge graph, enabling offline knowledge access and professional switching in applications. The knowledge graph consists of rules, cases, and industry reports.

[0030] Specifically, knowledge documents, such as rule documents, case studies, and industry reports, are obtained from the platform; when building the knowledge graph, a BERT+BiLSTM entity recognition model is used to extract key entities and relationships from the knowledge documents to form the knowledge graph; when maintaining the knowledge graph, a knowledge enhancement mechanism is used to optimize the knowledge graph precisely, and an incremental learning mechanism is used to continuously maintain the knowledge graph.

[0031] The knowledge enhancement mechanism refers to achieving accurate knowledge matching through semantic similarity calculation based on graph neural networks; then, a dynamic weight adjustment algorithm is used to dynamically adjust the weight of the search results based on the semantic similarity between the user's question and the knowledge graph.

[0032] The incremental learning mechanism refers to automatically identifying key update points based on changes in platform rules through an incremental update algorithm for the knowledge base; after identifying key update points, transfer learning techniques are used to fine-tune the changed parts to reduce the consumption of update resources.

[0033] The incremental update algorithm for the knowledge base is based on a difference comparison algorithm. It identifies update points based on text similarity, locates the knowledge graph based on these update points, and performs local fine-tuning of the located content. Therefore, each update only updates the parameters of the affected subgraph (approximately 3%-5% of nodes) in the knowledge graph. The final knowledge graph serves as a local knowledge base integrated on edge devices. During the training process, it supports offline knowledge access, thus eliminating reliance on cloud AI services, reducing average response latency, and making it particularly suitable for weak network environments.

[0034] Specifically, the difference comparison algorithm employs a two-layer filtering mechanism, which includes: The first layer: based on SimHash text deduplication, the newly acquired rule document is divided into 4KB text blocks, and the Hamming distance is compared with the existing document fingerprints in the knowledge graph. Text blocks with a distance > 3 are marked as candidate update blocks. The second layer is based on semantic comparison using BERT sentence vector similarity. Candidate update blocks are encoded into vectors using Sentence-BERT and their cosine similarity is calculated with the entity description vectors in the knowledge graph. Points with a similarity of <0.75 are considered key update points.
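The two-layer filter can be sketched as follows. Several details are assumptions made here rather than statements from the patent: the SimHash is 64 bits wide, tokens are split on whitespace and hashed with MD5, 4KB blocks are approximated as 4096 characters, and a block counts as a candidate when its fingerprint is more than 3 bits away from every stored fingerprint. Sentence-BERT encoding is outside the sketch, so the second layer takes precomputed vectors. All function names are hypothetical.

```python
import hashlib
import math


def split_blocks(text, size=4096):
    """Split a document into fixed-size blocks (characters stand in for bytes)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def simhash(text, bits=64):
    """SimHash fingerprint over whitespace tokens, using MD5 as the token hash."""
    counts = [0] * bits
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def is_candidate(block, known_fingerprints, max_distance=3):
    """Layer 1: candidate if no stored fingerprint lies within max_distance bits."""
    fp = simhash(block)
    return all(hamming(fp, k) > max_distance for k in known_fingerprints)


def is_key_update(block_vec, entity_vecs, min_similarity=0.75):
    """Layer 2: key update point if the best cosine similarity is below the cutoff."""
    best = max((cosine(block_vec, e) for e in entity_vecs), default=0.0)
    return best < min_similarity
```

The cheap fingerprint comparison discards unchanged blocks first, so the more expensive sentence-vector comparison only runs on the blocks that survive.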

[0035] In this embodiment of the invention, after obtaining knowledge files from platforms such as Taobao, JD.com, Pinduoduo, and Douyin, more than 120,000 nodes and 180,000 relationships are extracted to construct a knowledge graph of the knowledge files; incremental learning and fine-tuning are carried out quarterly to achieve regular maintenance of the knowledge base.

[0036] The knowledge enhancement layer supports multiple knowledge graphs to support knowledge training in different industries and fields. For example, a single edge device can support e-commerce knowledge training and literary appreciation training. The interaction perception layer obtains instructions for knowledge base switching to enable professional switching, triggering the pointing to the knowledge graph corresponding to the specified profession in practical applications, thus meeting the need for training in different professional knowledge on the same edge device.
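One way to picture the switching between knowledge graphs is a small registry that holds several graphs and directs queries at whichever one is active. The sketch below is an illustration only: the class, its methods, and the representation of a graph as a dictionary from entity to (relation, target) pairs are all hypothetical.

```python
class KnowledgeGraphRegistry:
    """Holds several knowledge graphs and points queries at the active one."""

    def __init__(self):
        self._graphs = {}
        self._active = None

    def register(self, domain, graph):
        """Add a graph; the first registered domain becomes active."""
        self._graphs[domain] = graph
        if self._active is None:
            self._active = domain

    def switch(self, domain):
        """Make another registered domain the active one."""
        if domain not in self._graphs:
            raise KeyError(f"no knowledge graph registered for {domain!r}")
        self._active = domain

    @property
    def active(self):
        return self._active

    def neighbors(self, entity):
        """Look up (relation, target) pairs for an entity in the active graph."""
        if self._active is None:
            return []
        return self._graphs[self._active].get(entity, [])


registry = KnowledgeGraphRegistry()
registry.register("e-commerce", {"refund rule": [("applies_to", "order")]})
registry.register("literature", {"sonnet": [("is_a", "poem")]})
registry.switch("literature")
```

After the switch, lookups resolve against the literature graph only, which mirrors training in different professions on the same device.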

[0037] The P130 interactive perception layer is based on a context-aware algorithm to achieve multimodal information fusion interaction and generate interactive content with users, thereby improving the continuity of dialogue and the problem-solving rate.

[0038] The interactive perception layer supports multimodal information including voice information, knowledge graphs, and role information; through context-aware algorithms and LSTM-based dialogue state tracking, it achieves the understanding and memorization of dialogue context.

[0039] The multimodal fusion interaction in this step is implemented by a dialogue state tracking-based multimodal interaction algorithm (DST-MMI), including: 1) Dialogue state tracking: A Transformer-based dialogue state tracking model is adopted to update the dialogue context in real time; an attention mechanism is introduced, and the dialogue state encoding dimension is set to 128 dimensions to focus on key dialogue information and ignore irrelevant information.

[0040] 2) Multimodal information fusion: Based on multimodal information such as voice input, knowledge graph, and role-playing, a multimodal fusion network (MMFNet) is adopted, and weighted calculation and fusion are performed based on semantic similarity to achieve feature-level fusion.
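Similarity-weighted fusion of this kind can be sketched as below. The internals of MMFNet are not disclosed, so this is an assumption-laden illustration: each modality is represented by a feature vector of equal length, the weights are a softmax over cosine similarity to a query vector, and the fused feature is the weighted sum. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def fusion_weights(query, features):
    """Softmax over the cosine similarity between the query and each feature."""
    sims = [cosine(query, f) for f in features]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(query, features):
    """Feature-level fusion: similarity-weighted sum of the modality vectors."""
    weights = fusion_weights(query, features)
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
```

A modality whose features align with the current query contributes more to the fused vector than one that points elsewhere.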

[0041] 3) Context-aware generation: Based on a context-aware generation strategy, the response content is dynamically generated and adjusted according to the dialogue state. When generating the response content, a conditional generation mechanism is adopted to ensure that the response is consistent with the current dialogue context. Finally, the response content is adjusted based on the quality assessment results of the generated content using ROUGE-L and BERTScore.
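Of the two quality metrics named here, ROUGE-L is simple enough to sketch directly: it scores a candidate against a reference by the length of their longest common subsequence. The version below tokenizes on whitespace and returns the F-measure with `beta=1.0`, both of which are choices made for illustration; BERTScore needs a pretrained model and is not shown.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-measure between a candidate and a reference sentence."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision = lcs / len(c)
    recall = lcs / len(r)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```

A generated response that scores low against the retrieved knowledge text could then be regenerated or adjusted, which is one reading of how the assessment feeds back into generation.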

[0042] During user interaction, the functional implementation process of the interaction perception layer, as shown in Figure 3, is as follows:
1) Acquire and process user voice input, using an independent speech recognition model to complete speech recognition on the edge device, converting voice information into text information, and providing input data for upper-layer dialogue management.
2) Dialogue management is implemented based on Dialogue State Tracking (DST) and user intent recognition using Recurrent Neural Networks (RNNs). During dialogue management, the dialogue context is dynamically updated to ensure effective information transmission during interaction. User intent recognition employs a hybrid model combining Bidirectional Long Short-Term Memory (BiLSTM) and Conditional Random Fields (CRF) to accurately interpret the semantic purpose of questions. The dialogue state is encoded as a 128-dimensional vector, integrating user intent, key information, and historical dialogue summaries to form a complete conversation representation. Dialogue management achieves continuous tracking and smooth updating of the dialogue state, thereby ensuring the logical consistency and overall fluency of complex multi-turn interactions.
3) Knowledge enhancement integrates external knowledge to improve the professionalism and accuracy of the answers. In practice, it integrates a knowledge graph query engine to retrieve relevant knowledge items in real time based on the characteristics of the input data, and performs multi-dimensional comprehensive sorting based on semantic relevance, information timeliness, and user preferences to output the response text.

[0043] 4) During speech synthesis, text is converted into speech output, optimized for specific scenarios, to ensure a real-time interactive experience and complete the closed loop of multimodal interaction.
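The multi-dimensional sorting in step 3) can be pictured as a weighted score over the three named dimensions. The sketch below is an assumption throughout: the patent does not say how the dimensions are combined, so a weighted sum is used, timeliness is modeled as exponential decay with a 90-day half-life, and all weights and names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    text: str
    relevance: float   # semantic relevance in [0, 1]
    age_days: float    # days since the entry was last updated
    preference: float  # user preference in [0, 1]


def timeliness(age_days, half_life_days=90.0):
    """Decay from 1.0 for a fresh entry, halving every half_life_days."""
    return 0.5 ** (age_days / half_life_days)


def rank(entries, w_rel=0.6, w_time=0.25, w_pref=0.15):
    """Sort entries by a weighted sum of relevance, timeliness, and preference."""
    def score(e):
        return (w_rel * e.relevance
                + w_time * timeliness(e.age_days)
                + w_pref * e.preference)
    return sorted(entries, key=score, reverse=True)
```

With these weights semantic relevance dominates, and timeliness breaks ties between entries that are equally relevant.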

[0044] This invention proposes a voice interaction system for knowledge training, implemented by deploying a large language model on edge devices. To adapt to the operating environment of edge devices, a hybrid quantization and pruning algorithm is used during the deployment of the large language model, effectively compressing the model size and reducing inference latency. To ensure the professionalism of the knowledge, a knowledge graph containing over 120,000 nodes is constructed, and a semantic similarity calculation method based on graph neural networks is proposed, effectively improving the accuracy of professional questions. To adapt to the diversity of training subjects, multimodal information such as voice input, knowledge graph, and role-playing is fused to realize an adaptive interaction system based on dialogue state tracking. To address the effectiveness of the knowledge, an incremental learning-based knowledge base update algorithm is adopted to achieve quarterly knowledge base updates, solving the problem of knowledge update lag. To reduce dependence on the network, offline access to the knowledge base is supported, and adaptive noise reduction and audio enhancement algorithms are used to improve the system's availability in weak network environments, significantly expanding the application scenarios.

[0045] The above-disclosed embodiments are merely a few specific examples of the present invention. However, the present invention is not limited thereto, and any variations that can be conceived by those skilled in the art should fall within the protection scope of the present invention.

Claims

1. A voice interaction system for knowledge training, characterized in that the voice interaction system is deployed on an edge device and includes a model inference layer, a knowledge enhancement layer, and an interaction perception layer; the edge device is composed of a microcontroller chip and an expansion interface, and is used to load the model inference layer, the knowledge enhancement layer, and the interaction perception layer to realize voice interaction with the user; the model inference layer is used to load a lightweight language model, the lightweight language model being constructed by compressing a large language model; the knowledge enhancement layer is used to build, store, and maintain knowledge graphs, enabling offline knowledge access in applications, and supports multiple knowledge graphs for knowledge training in different industries and fields; the interaction perception layer uses a context-aware algorithm to achieve multimodal information fusion and interaction, and generates interactive content with the user.

2. The voice interaction system according to claim 1, characterized in that the edge device includes: the microcontroller chip, which serves as the main control chip, performs the core computation for voice interaction and voice response content generation, generates voice text, and supports tasks such as model inference, voice processing, and communication; a voice input module, used to collect user voice input and integrating an adaptive noise reduction algorithm to improve voice quality; a voice output module, used to convert the voice text generated by the system into voice output, with an adaptive audio enhancement algorithm that dynamically adjusts the output volume according to ambient noise; a communication module, used for data exchange with a cloud server and supporting knowledge base updates and user data synchronization; a storage module, used to store the compressed lightweight language model and the knowledge base in support of the main control chip's computation; and a power management module, which employs a low-power design and supplies power to all components of the edge device.

3. The voice interaction system according to claim 2, characterized in that the components for voice input, voice output, communication, and power management, together with an external memory card expansion slot and a memory card, are integrated and packaged with the microcontroller chip in a single box to form the edge device.

4. The voice interaction system according to claim 1, characterized in that the model compression includes the following steps: A structure-aware analysis is performed on the large language model to identify critical and redundant layers. In this analysis, every layer of the large language model is classified by calculating a relevance score for each network layer, which includes: calculating the weight distribution entropy H of each network layer, expressed as $H = -\sum_i p_i \log p_i$, where $p_i$ is the normalized distribution probability of the network layer's weight matrix after histogram statistics; obtaining the task relevance R of each network layer; and calculating the relevance score [formula not reproduced]. Network layers whose relevance score is greater than a specified value are identified as critical layers, and network layers whose relevance score is less than or equal to the specified value are identified as redundant layers. Incremental quantization and model inference optimization are applied to reduce the size of the model. The model is made lightweight by assessing channel importance in the redundant layers and pruning redundant channels. The pruning of redundant channels includes: assessing the importance of each channel of a redundant layer to the voice response content generation task, the importance being expressed as an importance value given by the formula Importance = (weight distribution entropy) × (task relevance); and setting an importance threshold and pruning channels whose Importance is less than the threshold. The original large language model is used as the teacher model and the lightweight language model as the student model for learning, so that the lightweight language model retains the key capabilities.
Inference of the lightweight language model is accelerated using TensorFlow Lite for Microcontrollers, merging Conv+BN+ReLU into a single operator to reduce the number of memory allocations.
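
The entropy calculation and importance-based channel pruning of claim 4 can be sketched as follows. This is an illustrative sketch only: the number of histogram bins, the natural logarithm, the use of per-channel weights for the entropy term, and the function names `weight_entropy` and `prune_channels` are assumptions, and the task relevance values are taken as given because the claim does not state how they are obtained.

```python
import math


def weight_entropy(weights, bins=16):
    """Shannon entropy of the histogram of a list of weights."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return 0.0  # all weights identical: one occupied bin, zero entropy
    counts = [0] * bins
    for w in weights:
        # Map the weight to a bin index; the maximum falls in the last bin.
        idx = min(int((w - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    total = len(weights)
    # p_i is the normalized frequency of bin i; empty bins contribute nothing.
    return -sum((c / total) * math.log(c / total) for c in counts if c)


def prune_channels(channels, relevance, threshold):
    """Return the indices of channels kept after pruning.

    Importance = (weight distribution entropy) x (task relevance);
    channels with Importance below the threshold are pruned.
    """
    kept = []
    for i, (weights, r) in enumerate(zip(channels, relevance)):
        importance = weight_entropy(weights) * r
        if importance >= threshold:
            kept.append(i)
    return kept


channels = [
    [0.0, 0.0, 0.0, 0.0],     # constant weights carry no information
    [-1.0, -0.5, 0.5, 1.0],   # spread-out weights have higher entropy
]
kept = prune_channels(channels, relevance=[0.9, 0.9], threshold=0.5)
```

In this toy example the constant channel has zero entropy and is pruned regardless of its task relevance, while the second channel is kept.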

5. The voice interaction system according to claim 1, characterized in that, when the knowledge graph is established, an entity recognition model is used to extract key entities and relationships from knowledge documents to form the knowledge graph, the knowledge documents including rule documents, cases, and industry reports; and when the knowledge graph is maintained, a knowledge enhancement mechanism is used to improve the precision of the knowledge graph, and an incremental learning mechanism is used to maintain it continuously.

6. The voice interaction system according to claim 5, characterized in that the knowledge enhancement mechanism achieves accurate knowledge matching through semantic similarity calculation based on graph neural networks, and then uses a dynamic weight adjustment algorithm to adjust the weights of the search results according to the semantic similarity between the user's question and the knowledge graph.

7. The voice interaction system according to claim 5, characterized in that the incremental learning mechanism automatically identifies key update points from changes in platform rules through a knowledge base incremental update algorithm; after the key update points are identified, transfer learning techniques are used to fine-tune the changed parts; the knowledge base incremental update algorithm is based on a difference comparison algorithm and includes: identifying update points based on text similarity, locating the corresponding content in the knowledge graph based on the update points, and making local fine adjustments to the located content.

8. The voice interaction system according to claim 7, characterized in that the difference comparison algorithm is implemented using a two-layer filtering mechanism: the first layer uses SimHash for text deduplication, dividing the newly acquired rule document into 4 KB text blocks and computing the Hamming distance between each block's fingerprint and the fingerprints of existing documents in the knowledge graph, and text blocks with a distance greater than 3 are marked as candidate update blocks; the second layer performs semantic comparison based on BERT sentence vector similarity, encoding the candidate update blocks into vectors using Sentence-BERT and calculating their cosine similarity with the entity description vectors in the knowledge graph, and candidate blocks with a similarity below 0.75 are treated as key update points.
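
The two-layer filter of claim 8 can be sketched in a few lines. The sketch makes several assumptions that the claim does not specify: a 64-bit SimHash over whitespace tokens hashed with SHA-256, a block counting as a candidate only when it is more than 3 bits away from every stored fingerprint, and a candidate counting as a key update point only when its best cosine match is below 0.75. The sentence vectors are passed in as plain lists; in a real system they would come from a Sentence-BERT encoder, which is not shown here. All function names are hypothetical.

```python
import hashlib
import math


def simhash(text, bits=64):
    """SimHash fingerprint over whitespace-separated tokens."""
    totals = [0] * bits
    for token in text.split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for b in range(bits):
            totals[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if totals[b] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def is_candidate(block, known_fingerprints, max_distance=3):
    """Layer 1: candidate if no stored fingerprint is within max_distance."""
    fp = simhash(block)
    return all(hamming(fp, k) > max_distance for k in known_fingerprints)


def is_key_update(block_vec, entity_vecs, min_similarity=0.75):
    """Layer 2: key update point if the best match is below min_similarity."""
    best = max((cosine(block_vec, e) for e in entity_vecs), default=0.0)
    return best < min_similarity


stored = [simhash("refunds are accepted within seven days")]
unchanged = is_candidate("refunds are accepted within seven days", stored)
```

The cheap fingerprint comparison discards unchanged blocks first, so the more expensive sentence encoding only runs on blocks that survive the first layer.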

9. The voice interaction system according to claim 1, characterized in that the multimodal information fusion and interaction is implemented by a multimodal interaction algorithm based on dialogue state tracking, including: a Transformer-based dialogue state tracking model that updates the dialogue context in real time and introduces an attention mechanism, with the dialogue state encoding set to 128 dimensions, to focus on key dialogue information and ignore irrelevant information; a multimodal fusion network that performs weighted calculation and fusion of multimodal information based on semantic similarity; and dynamic generation and adjustment of the response content based on the dialogue state.
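
The similarity-weighted fusion in claim 9 can be illustrated with a small sketch. The claim does not specify the fusion network, so the approach here is an assumption: each modality vector is weighted by a softmax over its cosine similarity to the dialogue state vector, and the fused result is the weighted sum. The example uses 2-dimensional vectors for brevity where the claim uses 128 dimensions, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(state, modality_vecs):
    """Fuse modality vectors, weighting each by its similarity to the state."""
    weights = softmax([cosine(state, v) for v in modality_vecs])
    fused = [
        sum(w * v[d] for w, v in zip(weights, modality_vecs))
        for d in range(len(state))
    ]
    return fused, weights


state = [1.0, 0.0]              # current dialogue state encoding
speech_vec = [1.0, 0.0]         # aligned with the dialogue state
graph_vec = [0.0, 1.0]          # unrelated to the dialogue state
fused, weights = fuse(state, [speech_vec, graph_vec])
```

The modality that agrees more closely with the current dialogue state receives the larger weight, so it contributes more to the fused representation.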

10. The voice interaction system according to claim 1, characterized in that the interaction perception layer performs the following steps: acquiring and processing user voice input, performing speech recognition on the edge device with an independent speech recognition model, converting the speech into text, and providing it as input to upper-layer dialogue management; implementing dialogue management through dialogue state tracking and user intent recognition based on recurrent neural networks; retrieving relevant knowledge entries in real time with an integrated knowledge graph query engine based on the characteristics of the input data, ranking them comprehensively by semantic relevance, information timeliness, and user preferences, and outputting the response text; and converting the response text into speech output, with speech synthesis optimized for specific scenarios.