Speech interaction method, system, device and medium based on large language model

By combining convolutional neural networks and pre-trained language models, speech-semantic alignment and multi-dimensional feature clustering are achieved, generating natural and personalized speech responses. This solves the problems of low feature alignment accuracy and poor naturalness of speech interaction in existing technologies.

CN122245312A · Pending · Publication Date: 2026-06-19 · HUAZHONG NORMAL UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: HUAZHONG NORMAL UNIV
Filing Date: 2026-03-26
Publication Date: 2026-06-19

Application Information

IPC: G10L15/22; G10L15/18; G10L15/02; G10L25/30; G10L15/06

AI Tagging

Application Domain

Speech recognition

AI Technical Summary

Technical Problem

Existing voice interaction technologies have low alignment accuracy between speech and semantic features, are easily affected by noise and prosody changes, and do not adequately mine user emotional and behavioral features, resulting in poor naturalness and accuracy of interaction and difficulty in adapting to complex scenarios.

Method used

Convolutional neural networks are used for time-frequency domain feature extraction, combined with a pre-trained language model for speech-semantic alignment, and a comprehensive intent representation is constructed through cross-modal contrastive learning and KL divergence verification. Dynamic weighted clustering is performed based on multi-dimensional classification indicators, and a Transformer decoder is used to generate response text. Finally, a speech synthesis engine is used to generate natural speech responses.

Benefits of technology

It improves the accuracy of voice-semantic feature alignment, deeply mines user emotional and behavioral features, and generates a natural and personalized interactive experience, solving the problems of low feature alignment accuracy and poor interaction naturalness in traditional methods.

✦ Generated by Eureka AI based on patent content.

Abstract

This application relates to a voice interaction method, system, device, and medium based on a large language model. The method includes: acquiring user voice signals and extracting time-frequency domain feature vectors from the user voice signals using a deep learning network; simultaneously performing acoustic decoding and semantic encoding on the user voice to generate a semantic embedding vector; aligning voice and semantic features through cross-modal contrastive learning and constructing a comprehensive intent representation vector using KL divergence verification; constructing a weighted fusion tensor based on three categories of indicators: dialogue sentiment, behavioral tendency, and intent confidence; and using a density peak algorithm to optimize clustering and generate indicator clusters; and integrating the indicator clusters and intent vectors through a Transformer decoder to generate optimized response text and voice content. This method improves the naturalness and accuracy of voice interaction through multi-dimensional feature decoupling and dynamic weighting strategies.

Description

Technical Field

[0001] This invention belongs to the field of intelligent voice interaction, and in particular relates to voice interaction methods, systems, devices and media based on large language models.

Background Technology

[0002] With the rapid development of intelligent voice interaction technology, voice interaction technologies integrating large language models have emerged. Leveraging the semantic understanding capabilities of these large models, this technology achieves direct mapping from speech signals to semantic content, becoming a core application technology in scenarios such as smart terminals and intelligent customer service. Current mainstream voice interaction methods are all developed based on this technological framework. Traditional technologies typically involve first acoustically decoding the user's speech signal to obtain text, then inputting the text into a language model to generate response text, which is then converted into a speech response by a speech synthesis engine. This process employs a sequential feature processing and modality conversion workflow. However, such methods suffer from low alignment accuracy between speech and semantic features, are susceptible to speech noise and prosodic variations, and lack sufficient mining of user dialogue emotion and behavioral tendencies. Response generation relies solely on pure semantic information, resulting in poor naturalness and accuracy of the interaction. Furthermore, the feature processing lacks dynamic weighting and clustering optimization, making it difficult to adapt to complex real-world interaction scenarios.

Summary of the Invention

[0003] Therefore, it is necessary to provide a voice interaction method, system, device, and medium based on a large language model that can solve the above problems.

[0004] Firstly, this application provides a voice interaction method based on a large language model, including:

[0005] The user's speech signal is collected and input into a convolutional neural network to extract time-frequency domain features and obtain a speech representation vector.

[0006] Based on the user's speech signal, acoustic text decoding is performed to obtain a text sequence, and the text sequence is input into a pre-trained language model for semantic encoding to obtain a semantic embedding vector;

[0007] Based on the speech representation vector and semantic embedding vector, speech-semantic alignment is performed to obtain the comprehensive intent representation vector;

[0008] Based on the comprehensive intent representation vector, clustering is performed according to preset classification indicators to generate indicator clusters;

[0009] The indicator clusters and the comprehensive intent representation vector are input into the Transformer decoder to generate the response text;

[0010] The response text, indicator clusters, and comprehensive intent representation vector are input into the speech synthesis engine to generate the speech response content.

[0011] In one embodiment, the convolutional neural network includes a temporal convolutional layer, a frequency-domain convolutional layer, and a channel-domain convolutional layer;

[0012] The user's speech signal is input into a convolutional neural network for time-frequency domain feature extraction, resulting in a speech representation vector, including:

[0013] Mel spectrograms are extracted using a Mel filter based on the user's speech signal;

[0014] The Mel spectrogram is input into the temporal convolutional layer to extract temporal features, resulting in a time-domain feature tensor.

[0015] The time-domain feature residuals are calculated based on the time-domain feature tensor and Mel spectrogram.

[0016] The time-domain feature tensor is input into the frequency-domain convolutional layer, and frequency-domain dimension convolution operation is performed to obtain the initial time-frequency feature tensor. Based on the initial time-frequency feature tensor, the time-frequency fusion feature tensor is generated by combining the time-domain feature residual.

[0017] The time-frequency fusion feature tensor is input into the channel domain convolutional layer to perform channel dimension feature mapping, thereby obtaining the mapped feature tensor. The mapped feature tensor is then subjected to multi-scale downsampling to generate a speech representation vector.
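As a rough illustration of the layer sequence in [0013] to [0017], the following NumPy sketch chains a temporal convolution, a residual, a frequency-dimension convolution, a channel mapping, and multi-scale downsampling. It is not the network described in this application: the kernels, the definition of the residual as the Mel spectrogram minus the temporal features, the fusion by element-wise sum, the treatment of Mel bins as channels, and the pooling scales are all assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def conv_along_axis(x, kernel, axis):
    """1-D 'same' convolution applied to every slice of x along the given axis."""
    return np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="same"), axis, x)

def speech_representation(mel, k_time, k_freq, w_channel, scales=(1, 2, 4)):
    """mel: (n_mels, n_frames). Returns a vector of length len(scales) * w_channel.shape[0]."""
    temporal = conv_along_axis(mel, k_time, axis=1)      # temporal convolution
    residual = mel - temporal                            # assumed residual definition
    tf_init = conv_along_axis(temporal, k_freq, axis=0)  # frequency-dimension convolution
    tf_fused = tf_init + residual                        # assumed fusion: element-wise sum
    mapped = np.maximum(w_channel @ tf_fused, 0.0)       # channel mapping (Mel bins as channels) + ReLU
    parts = []
    for s in scales:
        n = (mapped.shape[1] // s) * s                   # drop frames that do not fill a window
        pooled = mapped[:, :n].reshape(mapped.shape[0], -1, s).mean(axis=2)  # average-pool by s
        parts.append(pooled.max(axis=1))                 # summarise each scale over time
    return np.concatenate(parts)
```

A trained model would learn the kernels and the channel matrix; here they are plain arrays supplied by the caller.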

[0018] In one embodiment, speech-semantic alignment is performed based on the speech representation vector and the semantic embedding vector to obtain a comprehensive intent representation vector, including:

[0019] Based on semantic embedding vectors, a context-aware window with a preset number of scales and window size is constructed;

[0020] Based on the context-aware window, the speech representation vector is divided into time-frequency domains to obtain a set of local feature blocks;

[0021] Based on local feature block sets and semantic embedding vectors, cross-modal contrastive learning is performed to generate an alignment weight matrix;

[0022] Based on the set of local feature blocks and semantic embedding vectors, time-frequency domain alignment is performed according to the alignment weight matrix to generate a preliminary intent representation vector.

[0023] The KL divergence calculation method is used to perform semantic consistency verification between the preliminary intent representation vector and the semantic embedding vector, and the consistency verification result is obtained.

[0024] Based on the consistency verification results, a binary gating mechanism is used to suppress cross-modal noise and generate a comprehensive intent representation vector.
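The alignment, verification, and gating steps in [0021] to [0024] can be sketched as follows. This is a minimal example under stated assumptions, not the claimed procedure: alignment weights are taken as a temperature-scaled softmax over cosine similarities, both vectors are turned into distributions with a softmax before computing the KL divergence, and the binary gate falls back to the semantic embedding when the divergence exceeds a threshold. The names `align_and_gate`, `temperature`, and `kl_threshold` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def align_and_gate(blocks, semantic, temperature=0.1, kl_threshold=0.5):
    """blocks: (n_blocks, d) local feature blocks; semantic: (d,) semantic embedding."""
    b = blocks / np.linalg.norm(blocks, axis=1, keepdims=True)
    s = semantic / np.linalg.norm(semantic)
    weights = softmax(b @ s / temperature)            # alignment weights over the blocks
    preliminary = weights @ blocks                    # weighted fusion: preliminary intent vector
    p, q = softmax(preliminary), softmax(semantic)    # assumed: compare as distributions
    kl = float(np.sum(p * (np.log(p) - np.log(q))))   # KL(p || q)
    gate = 1.0 if kl <= kl_threshold else 0.0         # binary gate on the consistency result
    fused = gate * preliminary + (1.0 - gate) * semantic
    return fused, weights, kl
```

With a single semantic vector the "alignment weight matrix" reduces to one row; a token-level semantic sequence would give one row per token.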

[0025] In one embodiment, the classification indicators include dialogue sentiment indicators, dialogue behavior tendency indicators, and dialogue intent confidence indicators, as well as indicator weighting coefficients preset according to each classification indicator.

[0026] Based on the comprehensive intent representation vector, clustering is performed according to preset classification indicators to generate indicator clusters, including:

[0027] Based on the comprehensive intent representation vector, the tone prosody feature vector, semantic role feature vector and context association feature vector are extracted by multi-dimensional feature decoupling;

[0028] Based on the indicator weighting coefficients, the tone prosody feature vector, the semantic role feature vector, and the context association feature vector are weighted and fused to obtain the indicator fusion tensor. The tone prosody feature vector uses the indicator weighting coefficient of the dialogue sentiment indicator, the semantic role feature vector uses that of the dialogue behavior tendency indicator, and the context association feature vector uses that of the dialogue intent confidence indicator.

[0029] A similarity matrix between the indicator fusion tensors is calculated, and initial indicator clusters are generated based on the similarity matrix and a similarity threshold.

[0030] The density peak algorithm is used to optimize the cluster boundaries of the initial indicator clusters to obtain the indicator clusters.
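One way to picture the weighted fusion and threshold-based initial grouping of [0028] and [0029] is the sketch below. It assumes that the fusion tensor simply stacks the three weighted feature vectors, that similarity is cosine similarity, and that tensors whose similarity reaches the threshold are merged with a union-find structure. None of these choices is stated in the text, and `fuse` and `initial_clusters` are hypothetical names.

```python
import numpy as np

def fuse(prosody, role, context, w_sentiment, w_behavior, w_confidence):
    """Stack the three weighted feature vectors into a (3, d) fusion tensor."""
    return np.stack([w_sentiment * prosody, w_behavior * role, w_confidence * context])

def initial_clusters(tensors, threshold):
    """Group fusion tensors whose cosine similarity reaches the threshold."""
    flat = np.array([t.ravel() for t in tensors])
    unit = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sim = unit @ unit.T                       # similarity matrix
    parent = list(range(len(tensors)))

    def find(i):                              # union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(tensors)):
        for j in range(i + 1, len(tensors)):
            if sim[i, j] >= threshold:
                parent[find(j)] = find(i)     # merge the two groups
    return [find(i) for i in range(len(tensors))], sim
```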

[0031] In one embodiment, the Transformer decoder presets a text generator, intonation pattern mapping rules, entity relationship graph, and discourse coherence correction rules;

[0032] The indicator clusters and the comprehensive intent representation vector are input into the Transformer decoder to generate the response text, including:

[0033] Based on the comprehensive intent representation vector, a text generator is used to generate the initial response text;

[0034] Based on the dialogue sentiment indicator of the indicator clusters, the intonation control vector of the initial response text is generated by the intonation pattern mapping rules.

[0035] Based on the dialogue behavior tendency indicator of the indicator clusters, combined with the initial response text, the entity relationship graph is used to retrieve the associated entity triples, and semantic slot filling instructions are generated according to the associated entity triples.

[0036] Contextual fragments are extracted from the initial response text and, combined with the dialogue intent confidence indicator of the indicator clusters, a discourse coherence correction vector is generated according to the discourse coherence correction rules.

[0037] Integrate intonation control vectors, semantic slot filling instructions, and discourse coherence correction vectors to generate a text optimization instruction set;

[0038] The initial response text is iteratively optimized according to the text optimization instruction set until the preset convergence condition is met, and the response text is obtained.

[0039] In one embodiment, the formula used to calculate the similarity matrix between the indicator fusion tensors is:

[0040] (formula image not preserved in this text)

[0041] where the formula is defined in terms of the following quantities (their symbols were not preserved): the similarity value between the i-th and j-th indicator fusion tensors; the dynamic weighting coefficients of the corresponding classification indicators, namely the dialogue sentiment indicator, the dialogue behavior tendency indicator, and the dialogue intent confidence indicator; the k-th type feature subspace of the i-th indicator fusion tensor; the feature space scaling factor; the bias correction term; the LeakyReLU activation function; and the tensor inner product operation.
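Because the formula itself is not available here, the following is only one plausible way to combine the listed ingredients, offered as an assumption rather than a reconstruction: for each pair of tensors, the inner products of matching feature subspaces are weighted by the indicator coefficients, summed, divided by the scaling factor, shifted by the bias term, and passed through LeakyReLU. The function names, argument names, and the LeakyReLU slope are hypothetical.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x >= 0, x, slope * x)

def similarity_matrix(tensors, weights, scale, bias=0.0):
    """tensors: (n, 3, d) fusion tensors, one row per feature subspace; weights: (3,)."""
    inner = np.einsum("ikd,jkd->ijk", tensors, tensors)  # per-subspace inner products, (n, n, 3)
    return leaky_relu(inner @ weights / scale + bias)    # weighted sum, scaling, bias, activation
```

The result is symmetric by construction, since each subspace inner product is symmetric in i and j.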

[0042] In one embodiment, the response text, indicator clusters, and comprehensive intent representation vector are input into the speech synthesis engine to generate speech response content, including:

[0043] Based on the response text, extract basic speech synthesis feature parameters;

[0044] Based on the dialogue sentiment indicator, dialogue behavior tendency indicator, and dialogue intent confidence indicator of the indicator clusters, speech adjustment parameters are generated using a Mel-cepstral coefficient transform; the speech adjustment parameters include the speech fundamental frequency profile and energy distribution.

[0045] Based on the comprehensive intent representation vector, the speech adjustment parameters are corrected in the time-frequency domain to generate corrected speech adjustment parameters;

[0046] The basic speech synthesis feature parameters and the corrected speech adjustment parameters are input into the speech synthesis engine, and the original waveform of the speech response is generated by the waveform generation algorithm to obtain the speech response content.

[0047] Secondly, this application also provides a voice interaction system based on a large language model, including:

[0048] The speech feature extraction module is used to collect user speech signals and input the user speech signals into a convolutional neural network to extract time-frequency domain features and obtain speech representation vectors.

[0049] The semantic embedding encoding module is used to perform acoustic text decoding based on user speech signals to obtain text sequences, and input the text sequences into a pre-trained language model for semantic encoding to obtain semantic embedding vectors.

[0050] The speech-semantic alignment module is used to perform speech-semantic alignment based on the speech representation vector and the semantic embedding vector to obtain a comprehensive intent representation vector;

[0051] The indicator clustering generation module is used to cluster indicators based on the comprehensive intent representation vector and according to preset classification indicators to generate indicator clusters.

[0052] The response text generation module is used to input the indicator clusters and the comprehensive intent representation vector into the Transformer decoder to generate response text;

[0053] The speech response synthesis module is used to input the response text, indicator clusters, and comprehensive intent representation vectors into the speech synthesis engine to generate speech response content.

[0054] Thirdly, this application also provides a computer device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the above-described voice interaction method based on a large language model.

[0055] Fourthly, this application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the above-described voice interaction method based on a large language model.

[0056] In the above voice interaction method, system, device, and medium based on a large language model, user voice signals are acquired and a convolutional neural network performs time-frequency domain feature extraction, which suppresses interference from voice noise and prosodic variation and improves the robustness of the feature representation. Acoustic decoding and semantic encoding are performed simultaneously on the user voice signals to generate semantic embedding vectors, achieving accurate semantic mapping of the voice content and reducing the error accumulation of traditional serial processing. Cross-modal contrastive learning aligns the voice representation vectors with the semantic embedding vectors and, combined with KL divergence verification, constructs a comprehensive intent representation vector, improving the accuracy of voice-semantic feature alignment. Based on the comprehensive intent representation vector, dynamic weighted clustering is performed according to multi-dimensional classification indicators such as dialogue sentiment, behavioral tendency, and intent confidence to generate indicator clusters, deeply mining the emotional and behavioral features of the user interaction. The indicator clusters and the comprehensive intent representation vector are input into a Transformer decoder to generate response text, and multi-dimensional optimization through the instruction set ensures the coherence and accuracy of the response. The response text and multi-dimensional features are input into a speech synthesis engine to generate voice response content, achieving a natural and personalized interactive experience and comprehensively addressing the technical problems of low feature alignment accuracy, insufficient multi-dimensional feature mining, and poor interaction naturalness in traditional methods.

Brief Description of the Drawings

[0057] To more clearly illustrate the technical solutions in the embodiments or related technologies of this application, the accompanying drawings used in the description of the embodiments or related technologies will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0058] Figure 1 is a flowchart of the voice interaction method based on a large language model according to the present invention;

[0059] Figure 2 is a structural diagram of the voice interaction system based on a large language model according to the present invention.

Detailed Description

[0060] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0061] In one embodiment, as shown in Figure 1, a voice interaction method based on a large language model is provided. This embodiment illustrates the application of this method to a terminal. It is understood that this method can also be applied to a server, or to a system including both a terminal and a server, and implemented through the interaction between the terminal and the server. In the implementation environment of this application, the hardware architecture mainly includes a user terminal device (such as a smartphone, smart speaker, or in-vehicle system) and a cloud server. The terminal device can be equipped with a microphone array, audio codec, and local processor for real-time acquisition of user voice signals and preliminary time-frequency domain feature extraction. The server deploys a pre-trained language model to support complex calculations such as semantic encoding, cross-modal alignment, and response generation. In application scenarios such as intelligent customer service or personalized assistants, when a user requests voice interaction, the terminal collects voice signals through a microphone, and the local convolutional neural network quickly extracts Mel-spectrum features and transmits them to the server via the network. After receiving the voice representation vector, the server simultaneously performs acoustic decoding and semantic encoding, aligns the voice-semantic features using cross-modal contrastive learning, and constructs a comprehensive intent representation vector using KL divergence verification. The server dynamically weights and clusters indicators based on dialogue sentiment, behavioral tendencies, and other multi-dimensional metrics to generate indicator clusters, integrates them through a Transformer decoder to generate optimized response text, and drives a speech synthesis engine to generate personalized voice response content. The server returns the voice data to the terminal, which outputs it through a speaker, achieving low-latency, highly natural multi-turn interaction. Throughout this process, the terminal handles front-end signal acquisition and simple processing, while the server undertakes heavy model computation. A distributed architecture ensures real-time performance and scalability, meeting the accuracy and adaptability requirements of voice interaction in complex scenarios.

[0062] In this embodiment, the method includes the following steps:

[0063] S01. Collect user speech signals and input them into a convolutional neural network to extract time-frequency domain features and obtain speech representation vectors.

[0064] The process involves acquiring raw speech signals using audio acquisition devices (such as microphone arrays) and preprocessing them to enhance signal quality. A convolutional neural network (CNN), a deep learning model, is configured to extract multi-level features from the speech signal across both the time and frequency domains. Time-frequency feature extraction is achieved by converting the speech signal into a spectral representation (such as a Mel spectrogram) and applying convolution operations to capture local temporal patterns and global spectral characteristics. The speech representation vector, a high-dimensional embedding vector, encodes key acoustic properties of the speech and can be generated through hierarchical filtering of the convolutional layers of the CNN, multi-scale downsampling via pooling operations, and nonlinear transformations of activation functions. Recurrent neural networks or attention mechanisms can also be used to assist feature extraction; this step provides a stable input basis for subsequent cross-modal alignment by suppressing environmental noise and prosodic variation interference.
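The conversion of a speech signal into a Mel spectrogram mentioned above can be sketched with NumPy alone. This is a generic log-Mel computation, not necessarily the one used in this application; the sampling rate, FFT size, hop length, number of Mel bands, and the common 2595 · log10(1 + f/700) Mel scale are assumed defaults chosen for illustration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters with centres spaced evenly on the Mel scale."""
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    bank = np.zeros((n_mels, freqs.size))
    for m in range(n_mels):
        lo, mid, hi = edges[m:m + 3]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mel_spectrogram(signal, sr=16000, n_fft=512, hop=160, n_mels=40):
    """Returns a log-Mel spectrogram of shape (n_mels, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2       # (n_frames, n_fft // 2 + 1)
    return np.log(mel_filterbank(n_mels, n_fft, sr) @ power.T + 1e-10)
```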

[0065] S02. Based on the user's speech signal, perform acoustic text decoding to obtain a text sequence, and input the text sequence into a pre-trained language model for semantic encoding to obtain a semantic embedding vector.

[0066] The conversion from acoustic to text can be achieved through automatic speech recognition technology. Acoustic text decoding can employ end-to-end deep learning-based models, such as recurrent neural networks trained with connectionist temporal classification loss or Transformer-based architectures, mapping the time-frequency features of the speech signal to discrete text units (such as phonemes, words, or sub-words). The text sequence is input into a pre-trained language model, such as a large-scale Transformer-based model (like BERT or GPT series). This model, pre-trained on a large amount of unlabeled text, can capture contextual dependencies through self-attention mechanisms. Semantic encoding is achieved by extracting deep representations of the text, such as embeddings using [CLS] tags or sequence average pooling, generating high-dimensional semantic embedding vectors that encode the semantic essence of the text and support subsequent cross-modal alignment. Alternatively, hybrid acoustic models (such as Hidden Markov Models combined with deep neural networks) can be used for decoding, or lightweight pre-trained models (such as distilled versions) can be employed to adapt to resource-constrained scenarios. A robust mapping from speech signals to the semantic space can be achieved through a decoding-encoding pipeline, reducing error propagation in traditional serial processing and improving the semantic understanding capability of the interactive system.
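The two pooling options named above, taking the [CLS] position or averaging over the sequence, can be illustrated on an array of token-level hidden states. The sketch assumes the language model has already produced `hidden` with shape (sequence length, dimension) and a padding mask; both function names are hypothetical and no particular model library is implied.

```python
import numpy as np

def cls_embedding(hidden):
    """Use the hidden state at the first position ([CLS]) as the sentence embedding."""
    return hidden[0]

def mean_pool(hidden, mask):
    """Average hidden states over real tokens only; mask is 1 for tokens, 0 for padding."""
    m = np.asarray(mask, dtype=float)[:, None]
    return (hidden * m).sum(axis=0) / np.maximum(m.sum(), 1.0)
```

Masked averaging keeps padding positions from diluting the embedding when sequences in a batch have different lengths.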

[0067] S03. Based on the speech representation vector and semantic embedding vector, speech-semantic alignment is performed to obtain the comprehensive intent representation vector.

[0068] This process achieves consistent mapping of multimodal features through cross-modal fusion. Speech-semantic alignment involves constructing a context-aware window to divide the speech features in the time-frequency domain and using a cross-modal contrastive learning mechanism to calculate an alignment weight matrix. This matrix is then used to perform weighted fusion of the local feature blocks and the semantic vectors to generate a preliminary intent representation. Semantic consistency can be verified using divergence metrics (such as KL divergence), and a binary gating mechanism can be combined to suppress cross-modal noise, outputting a robust comprehensive intent representation vector. Alternatively, attention mechanisms can be used to dynamically adjust modal weights, or adversarial learning can be introduced to improve the generalization of the alignment. Feature-level interaction can be used to reduce modal differences and enhance the completeness and accuracy of the intent representation. This step improves the adaptability and naturalness of the voice interaction system in complex scenarios by decoupling and aligning multi-dimensional features.
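The text does not say which contrastive objective is used. A common choice for cross-modal contrastive learning is a symmetric InfoNCE loss over a batch of paired embeddings, sketched below as an assumption: row i of `speech` and row i of `text` are treated as a matching pair and all other rows as negatives. The temperature value and function names are hypothetical.

```python
import numpy as np

def log_softmax(x, axis):
    z = x - x.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def info_nce(speech, text, temperature=0.07):
    """speech, text: (batch, d) paired embeddings. Returns the symmetric InfoNCE loss."""
    s = speech / np.linalg.norm(speech, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    logits = s @ t.T / temperature                       # cosine similarities, scaled
    idx = np.arange(len(s))
    speech_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_speech = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (speech_to_text + text_to_speech)
```

Minimising this loss pulls matching speech and text embeddings together and pushes mismatched pairs apart, which is the behaviour an alignment weight matrix would rely on.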

[0069] S04. Based on the comprehensive intent representation vector, perform clustering according to preset classification indicators to generate indicator clusters.

[0070] The structured representation of user intent can be achieved through multi-dimensional feature analysis and dynamic grouping techniques. The comprehensive intent representation vector, as a high-dimensional embedding after cross-modal alignment, encodes the fusion information of speech and semantics. Pre-defined classification indicators may include dialogue sentiment indicators (reflecting the user's emotional state, such as pleasure or anger), dialogue behavior tendency indicators (indicating the user's behavioral intent, such as inquiry or instruction), and dialogue intent confidence indicators (measuring semantic certainty). These indicators can be quantified using predefined weighting coefficients. During clustering, a feature decoupling method can be used to extract multi-dimensional sub-features from the comprehensive intent representation vector, such as tone prosody feature vectors (associating with sentiment), semantic role feature vectors (associating with behavior), and contextual association feature vectors (associating with confidence). These are then weighted and fused based on the indicator weighting coefficients to generate an indicator fusion tensor. A similarity matrix can be constructed through similarity calculation (such as using inner product operations and activation functions), and a clustering algorithm can be applied for initial grouping to optimize cluster boundaries and form robust indicator clusters. Other classification metrics (such as dialogue urgency or domain specificity) can also be introduced, or alternative clustering methods (such as density-based DBSCAN or graph clustering) can be used to capture the hierarchical structure of intent through multi-scale clustering, thereby improving the system's ability to discriminate user intent.
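For reference, the generic density peak clustering procedure (local density, distance to the nearest denser point, centre selection, label propagation) looks roughly like this. How the application applies it to refine the boundaries of the initial clusters is not specified, so the sketch shows only the standard algorithm on raw points, with a cut-off kernel and a caller-supplied number of clusters as assumptions.

```python
import numpy as np

def density_peaks(points, d_c, n_clusters):
    """points: (n, d). d_c: cut-off distance. Returns an integer label per point."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    rho = (dist < d_c).sum(axis=1) - 1              # local density, excluding the point itself
    order = np.argsort(-rho, kind="stable")         # indices by decreasing density
    delta = np.zeros(n)                             # distance to the nearest denser point
    nearest_denser = np.full(n, -1)
    delta[order[0]] = dist[order[0]].max()          # convention for the densest point
    for rank in range(1, n):
        i = order[rank]
        denser = order[:rank]
        j = denser[np.argmin(dist[i, denser])]
        delta[i], nearest_denser[i] = dist[i, j], j
    centers = np.argsort(-(rho * delta), kind="stable")[:n_clusters]  # high density and isolation
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_clusters)
    for i in order:                                 # propagate labels from dense to sparse points
        if labels[i] == -1:
            labels[i] = labels[nearest_denser[i]]
    return labels
```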

[0071] S05. Input the indicator clusters and the comprehensive intent representation vector into the Transformer decoder to generate the response text.

[0072] The indicator clusters, dynamically weighted groupings of features based on multi-dimensional classification indicators such as dialogue sentiment, behavioral tendency, and intent confidence, encode the hierarchical structure of user intent. The Transformer decoder, a sequence generation model based on a self-attention mechanism, is configured to receive input from the indicator clusters and the comprehensive intent representation vector, generating a coherent response. It can generate an initial response text based on a preset text generator (such as a Transformer-based generative component), and combine these with dialogue sentiment indicators from the indicator clusters to generate a tone control vector through tone pattern mapping rules; retrieve associated entity triples based on the dialogue behavioral tendency indicator using an entity relationship graph to generate semantic slot filling instructions; and extract contextual fragments based on the dialogue intent confidence indicator and generate a correction vector through discourse coherence correction rules. These elements are integrated to form a text optimization instruction set, iteratively optimizing the initial response until convergence. Other sequence generation architectures (such as those based on recurrent neural networks or encoder-decoder structures) or additional optimization rules (such as sentiment enhancement or style adaptation) can also be used to enhance the overall interactive experience by leveraging the clustered intent features and cross-modal vectors.
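The iterate-until-convergence step can be pictured as a fixed-point loop over the instruction set. In this sketch each instruction is modelled as a plain function from text to text and convergence means that a full pass leaves the text unchanged; both are simplifying assumptions, since the application describes control vectors and slot-filling instructions rather than string functions. `refine` and `max_iters` are hypothetical names.

```python
from typing import Callable, Sequence

Instruction = Callable[[str], str]

def refine(text: str, instructions: Sequence[Instruction], max_iters: int = 10) -> str:
    """Apply every instruction in order, repeating until a full pass changes nothing."""
    for _ in range(max_iters):
        updated = text
        for instruction in instructions:
            updated = instruction(updated)
        if updated == text:          # assumed convergence condition
            return updated
        text = updated
    return text                      # stop after max_iters even without convergence
```

The iteration cap guards against instruction sets that never settle, for example two rules that undo each other.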

[0073] S06. Input the response text, indicator clusters, and comprehensive intent representation vector into the speech synthesis engine to generate the speech response content.

[0074] The speech synthesis engine is a system component based on waveform generation algorithms (such as neural vocoders or parametric synthesizers). It is configured to receive text and feature inputs and generate natural speech. In implementation, basic speech synthesis feature parameters (such as phoneme sequences or prosodic boundaries) can be extracted from the response text. Based on the dialogue sentiment indicator, dialogue behavior tendency indicator, and dialogue intent confidence indicator of the indicator clusters, speech adjustment parameters (including the fundamental frequency profile to regulate pitch and the energy distribution to control intensity) are generated using techniques such as Mel-cepstral coefficient transformation. Time-frequency domain correction is performed based on the comprehensive intent representation vector (e.g., adjusting spectral features using linear predictive coding or deep learning models) to generate corrected speech adjustment parameters. The basic parameters and corrected parameters are input into the engine, and the original waveform of the speech response is synthesized using waveform generation algorithms (such as WaveNet or Griffin-Lim). Alternatively, end-to-end speech synthesis models (such as Tacotron or FastSpeech) can be used to directly integrate multi-dimensional inputs, or adaptive learning mechanisms can be used to dynamically optimize parameters. Through feature-level correction and multi-source information fusion, the speech output can better match the user's emotional expression, behavioral tendencies, and intent, enhancing naturalness and improving the user experience.

[0075] In one embodiment, the convolutional neural network includes a temporal convolutional layer, a frequency-domain convolutional layer, and a channel-domain convolutional layer;

[0076] The user's speech signal is input into a convolutional neural network for time-frequency domain feature extraction, resulting in a speech representation vector, including:

[0077] S11, based on the user's speech signal, use a Mel filter to extract the Mel spectrogram;

[0078] S12, input the Mel spectrogram into the temporal convolutional layer to extract temporal features and obtain the temporal feature tensor;

[0079] S13, calculate the time-domain feature residuals based on the time-domain feature tensor and Mel spectrogram;

[0080] S14, input the time-domain feature tensor into the frequency-domain convolutional layer, perform frequency-domain dimension convolution operation to obtain the initial time-frequency feature tensor, and generate the time-frequency fusion feature tensor based on the initial time-frequency feature tensor and the time-domain feature residual;

[0081] S15, input the time-frequency fusion feature tensor into the channel-domain convolutional layer, perform channel dimension feature mapping to obtain the mapped feature tensor, and perform multi-scale downsampling on the mapped feature tensor to generate the speech representation vector.
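
Steps S12 to S15 can be illustrated with a minimal pure-Python sketch. It is not the patented network: the Mel spectrogram of S11 is assumed to be given as a small [time][mel bin] matrix, each "layer" is a single hand-written filter, the residual of S13 is assumed to be an element-wise difference, the fusion of S14 is assumed to be an element-wise sum, and the multi-scale downsampling of S15 is assumed to be max pooling over time followed by averaging. All function names are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]  # indexed as [time_frame][mel_bin]


def conv_time(x: Matrix, kernel: Sequence[float]) -> Matrix:
    """Zero-padded 1-D cross-correlation along the time axis, per mel bin."""
    n_t, n_f, half = len(x), len(x[0]), len(kernel) // 2
    out = [[0.0] * n_f for _ in range(n_t)]
    for t in range(n_t):
        for k, w in enumerate(kernel):
            src = t + k - half
            if 0 <= src < n_t:
                for f in range(n_f):
                    out[t][f] += w * x[src][f]
    return out


def transpose(x: Matrix) -> Matrix:
    return [list(col) for col in zip(*x)]


def conv_freq(x: Matrix, kernel: Sequence[float]) -> Matrix:
    """The same filter applied along the frequency axis."""
    return transpose(conv_time(transpose(x), kernel))


def max_pool_time(x: Matrix, size: int) -> Matrix:
    """Non-overlapping max pooling over blocks of `size` frames."""
    return [[max(col) for col in zip(*x[i:i + size])]
            for i in range(0, len(x), size)]


def mean(x: Matrix) -> float:
    return sum(sum(row) for row in x) / (len(x) * len(x[0]))


def speech_representation(mel: Matrix, k_time, k_freq, channel_weights,
                          scales=(1, 2, 4)) -> List[float]:
    temporal = conv_time(mel, k_time)                      # S12: temporal features
    residual = [[m - t for m, t in zip(mr, tr)]            # S13: assumed to be a difference
                for mr, tr in zip(mel, temporal)]
    initial_tf = conv_freq(temporal, k_freq)               # S14: frequency-axis convolution
    fused = [[a + r for a, r in zip(ar, rr)]               # S14: assumed additive fusion
             for ar, rr in zip(initial_tf, residual)]
    vector: List[float] = []
    for w in channel_weights:                              # S15: one scalar weight per channel
        mapped = [[w * v for v in row] for row in fused]
        vector.extend(mean(max_pool_time(mapped, s)) for s in scales)
    return vector
```

With identity kernels `[0.0, 1.0, 0.0]` the residual is zero and the fused tensor equals the input, which makes the pooling stage easy to check by hand. A real implementation would use learned multi-channel kernels in a deep-learning framework.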

[0082] For example,

[0083] In one embodiment, speech-semantic alignment is performed based on the speech representation vector and the semantic embedding vector to obtain a comprehensive intent representation vector, including:

[0084] S21, Based on semantic embedding vectors, construct a context-aware window with a preset number of scales and window size;

[0085] S22, Based on the context-aware window, the speech representation vector is divided into time-frequency domain to obtain a set of local feature blocks;

[0086] S23, based on the set of local feature blocks and the semantic embedding vector, perform cross-modal contrastive learning to generate an alignment weight matrix;

[0087] S24, Based on the local feature block set and semantic embedding vector, time-frequency domain alignment is performed according to the alignment weight matrix to generate a preliminary intent representation vector;

[0088] S25, the KL divergence calculation method is used to perform semantic consistency verification between the preliminary intent representation vector and the semantic embedding vector to obtain the consistency verification result;

[0089] S26. Based on the consistency verification result, a binary gating mechanism is used to suppress cross-modal noise and generate a comprehensive intent representation vector.

[0090] Specifically, a context-aware window with a preset number of scales and window size can be constructed based on the semantic embedding vector. This window is dynamically adjusted according to the context length of the semantic vector. For example, a multi-head attention mechanism can be used to define windows of different scales to capture local and global dependencies. The context-aware window is used to partition the speech representation vector in the time and frequency domains, dividing the speech features along the time and frequency axes into a set of local feature blocks. Each block corresponds to an acoustic unit of the speech segment, achieving fine-grained feature extraction. Cross-modal contrastive learning is performed based on the set of local feature blocks and the semantic embedding vector. By calculating the similarity score between the speech block and the semantic vector, a contrastive loss function (such as InfoNCE loss) is applied to optimize alignment, generating an alignment weight matrix. This matrix quantifies the relevance weights between the speech and semantic modalities. The set of local feature blocks and the semantic embedding vector are then weighted and fused according to the alignment weight matrix to achieve time-frequency domain alignment processing. For example, weighted features can be integrated through matrix multiplication to generate a preliminary intent representation vector, which initially encodes cross-modal consistency. The KL divergence method can be used to verify the semantic consistency between the initial intent representation vector and the semantic embedding vector. The divergence value between their probability distributions is calculated as the consistency verification result; if the divergence is below a threshold, the alignment is considered effective. 
Based on the consistency verification result, a binary gating mechanism is applied (for example, a sigmoid function whose output is thresholded to produce a binary mask) to suppress cross-modal noise components in the preliminary intent representation vector, preserving high-consistency features and outputting a robust comprehensive intent representation vector. Through context awareness, contrastive learning, and noise suppression, high-precision alignment of speech and semantic features is achieved, improving the intent understanding capability of the interactive system in complex environments.
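
The alignment flow of S23 to S26 can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: local feature blocks are assumed to be already projected to the dimension of the semantic embedding, similarity is taken to be cosine similarity, the alignment weights are a temperature softmax over those scores, the distributions compared by KL divergence are softmaxes of the two vectors, and the gate acts per dimension on each dimension's KL contribution. The names `align`, `info_nce`, `temperature`, and `gate_threshold` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(xs: Sequence[float], temperature: float = 1.0) -> List[float]:
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)


def info_nce(sims: Sequence[float], positive: int, temperature: float = 0.1) -> float:
    """InfoNCE loss for one anchor: negative log-probability of the positive pair."""
    return -math.log(softmax(sims, temperature)[positive])


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def align(blocks: Sequence[Sequence[float]], semantic: Sequence[float],
          temperature: float = 0.1,
          gate_threshold: float = 0.05) -> Tuple[List[float], float, List[float]]:
    sims = [cosine(b, semantic) for b in blocks]          # S23: similarity scores
    weights = softmax(sims, temperature)                  # S23: alignment weights
    prelim = [sum(w * b[d] for w, b in zip(weights, blocks))
              for d in range(len(semantic))]              # S24: weighted fusion
    p, q = softmax(prelim), softmax(semantic)
    terms = [pi * math.log((pi + 1e-12) / (qi + 1e-12)) for pi, qi in zip(p, q)]
    kl = sum(terms)                                       # S25: KL(p || q)
    # S26: binary mask; sigmoid(threshold - |term|) >= 0.5 keeps consistent dimensions
    mask = [1.0 if sigmoid(gate_threshold - abs(t)) >= 0.5 else 0.0 for t in terms]
    gated = [m * v for m, v in zip(mask, prelim)]
    return weights, kl, gated
```

In training, `info_nce` would be minimized so that matching speech and semantic pairs score higher than mismatched ones; at inference only the forward path in `align` is needed.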

[0091] In one embodiment, the classification indicators include dialogue sentiment indicators, dialogue behavior tendency indicators, and dialogue intent confidence indicators, as well as indicator weighting coefficients preset according to each classification indicator.

[0092] Based on the comprehensive intent representation vector, clustering is performed according to preset classification indicators to generate indicator clusters, including:

[0093] S31, based on the comprehensive intent representation vector, extract the tone prosody feature vector, semantic role feature vector, and context association feature vector through multi-dimensional feature decoupling;

[0094] S32, based on the indicator weighting coefficients, perform weighted fusion of the tone prosody feature vector, semantic role feature vector, and context association feature vector to obtain the indicator fusion tensor; wherein the tone prosody feature vector uses the weighting coefficient of the dialogue sentiment indicator, the semantic role feature vector uses the weighting coefficient of the dialogue behavior tendency indicator, and the context association feature vector uses the weighting coefficient of the dialogue intent confidence indicator.

[0095] S33, calculate the similarity matrix between the indicator fusion tensors, and generate initial indicator clusters based on the similarity matrix and the similarity threshold;

[0096] S34, use the density peak algorithm to optimize the cluster boundaries of the initial indicator clusters to obtain the indicator clusters.

[0097] For example, based on the comprehensive intent representation vector, a multi-dimensional feature decoupling technique can be used to extract a prosodic feature vector, a semantic role feature vector, and a contextual association feature vector. The prosodic feature vector captures emotion-related features by analyzing the fundamental frequency profile and energy fluctuations of speech. The semantic role feature vector uses dependency parsing to identify the subject, object, and behavior relationships in the sentence, mapping behavioral tendencies. The contextual association feature vector calculates the correlation between the current dialogue and the historical context through an attention mechanism, reflecting the intent confidence. The three feature vectors are weighted and fused based on preset index weighting coefficients. The prosodic feature vector uses the weighting coefficients corresponding to the dialogue's emotion index, the semantic role feature vector uses the weighting coefficients corresponding to the dialogue's behavioral tendency index, and the contextual association feature vector uses the weighting coefficients corresponding to the dialogue's intent confidence index. Tensor addition operations are used to generate an index fusion tensor, which integrates the weighted contributions of the multi-dimensional features. The similarity matrix between the index fusion tensors is calculated, and tensors with similarity higher than a threshold are grouped into the same cluster. The initial cluster boundaries are optimized using a density peak algorithm. By calculating the local density and relative distance of each data point, the density peak is identified as the cluster center, and boundary points are reallocated to eliminate noise interference, resulting in a stable index cluster. 
This step enhances the discriminability and robustness of the intent representation through feature decoupling, dynamic weighting, and cluster optimization, thereby improving the interactive system's ability to adapt to multi-dimensional user emotions, behaviors, and intents.
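
The weighted fusion and the density peak step can be sketched in a few lines of Python. The sketch makes several assumptions that the text does not fix: the three feature vectors have equal length, distances are Euclidean, local density is a neighbor count within a hypothetical `cutoff` radius, and the number of clusters is supplied by the caller. It shows the density peak idea (local density, distance to the nearest denser point, assignment through that point) rather than the boundary refinement of a specific embodiment.

```python
import math
from typing import List, Sequence


def fuse(prosody: Sequence[float], role: Sequence[float], context: Sequence[float],
         w_sent: float, w_beh: float, w_conf: float) -> List[float]:
    """Weighted element-wise sum of three equal-length feature vectors (S32)."""
    return [w_sent * p + w_beh * r + w_conf * c
            for p, r, c in zip(prosody, role, context)]


def density_peaks(points: Sequence[Sequence[float]], cutoff: float,
                  n_clusters: int) -> List[int]:
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    # Local density: number of other points closer than `cutoff`.
    rho = [sum(1 for j in range(n) if j != i and d[i][j] < cutoff) for i in range(n)]
    # Visit points from densest to sparsest; ties are broken by index.
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta, parent = [0.0] * n, [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(d[i])  # densest point: use its largest distance
        else:
            # Nearest point among those visited earlier (higher density).
            parent[i] = min(order[:rank], key=lambda h: d[i][h])
            delta[i] = d[i][parent[i]]
    # Cluster centers combine high density with a large distance to denser points.
    centers = sorted(range(n), key=lambda i: -rho[i] * delta[i])[:n_clusters]
    label_of = {c: idx for idx, c in enumerate(centers)}
    labels = [-1] * n
    for i in order:
        if i in label_of:
            labels[i] = label_of[i]
        elif parent[i] >= 0:
            labels[i] = labels[parent[i]]  # inherit from nearest denser neighbor
    return labels
```

For example, six fused points forming two tight groups around (0, 0) and (10, 10) are split into two clusters with `cutoff=2.0` and `n_clusters=2`.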

[0098] In one embodiment, the Transformer decoder presets a text generator, intonation pattern mapping rules, entity relationship graph, and discourse coherence correction rules;

[0099] The indicator clusters and the comprehensive intent representation vector are input into the Transformer decoder to generate the response text, including:

[0100] S41, based on the comprehensive intent representation vector, use the text generator to generate the initial response text;

[0101] S42, based on the dialogue sentiment indicator of the indicator clusters, use the intonation pattern mapping rules to generate the intonation control vector of the initial response text;

[0102] S43, based on the dialogue behavior tendency indicator of the indicator clusters, combined with the initial response text, use the entity relationship graph to retrieve the associated entity triples, and generate semantic slot filling instructions based on the associated entity triples;

[0103] S44, extract the contextual fragments of the initial response text and, based on the contextual fragments combined with the dialogue intent confidence indicator of the indicator clusters, generate the discourse coherence correction vector according to the discourse coherence correction rules;

[0104] S45, integrate the intonation control vector, semantic slot filling instructions, and discourse coherence correction vector to generate a text optimization instruction set;

[0105] S46, perform iterative optimization on the initial response text according to the text optimization instruction set until the preset convergence condition is met, and obtain the response text.

[0106] Specifically, an initial response text is generated based on a comprehensive intent representation vector using a text generator (such as an autoregressive generative model based on Transformer). This generator takes the intent vector as conditional input and decodes it through a self-attention mechanism to generate a coherent text sequence, ensuring that the content of the text sequence is initially aligned with the user's intent. A tone control vector can be generated based on dialogue sentiment indicators (such as sentiment polarity values) from the indicator clusters, using tone pattern mapping rules. This rule converts sentiment values into tone feature vectors through a predefined sentiment-tone mapping table (e.g., mapping positive sentiment to a rising tone parameter), used to regulate the prosodic style of the response. Based on dialogue behavior tendency indicators (such as instruction or query type) from the indicator clusters, and combined with the initial response text, an entity relationship graph is used to retrieve associated entity triples. This graph stores domain entities and their relationships (such as "product-attribute-price"). A graph traversal algorithm is used to match behavior tendency-related triples and generate semantic slot filling instructions to guide the supplementation and structuring of information points in the response. The process extracts contextually relevant segments from the initial response text (e.g., extracting text segments overlapping with historical dialogues via a sliding window) and combines these with dialogue intent confidence metrics from clustered metrics (e.g., confidence scores). Based on discourse coherence correction rules (e.g., coherence calculation based on attention weights), a discourse coherence correction vector is generated, encoding contextual consistency adjustment information. 
The tone control vector, semantic slot filling instructions, and discourse coherence correction vector are integrated, and a text optimization instruction set is generated through concatenation and normalization operations. This instruction set serves as a multi-dimensional constraint for iterative optimization. The initial response text is iteratively optimized based on the text optimization instruction set, such as applying instructions through a sequence-to-sequence editing model. The response quality is evaluated in each iteration (e.g., using perplexity or coherence scores) until a preset convergence condition is met (e.g., the change in the loss function is less than a threshold), at which point the final response text is output. This process, through multi-rule collaboration and iteration, enhances the naturalness and accuracy of the response in terms of emotional expression, behavioral adaptation, and discourse coherence.
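
The integration and iteration of S45 and S46 can be sketched as a small control loop. The sketch assumes that all three instruction components are already encoded as numeric vectors, that normalization means scaling the concatenation to unit L2 norm, and that the editing model and the quality metric are supplied by the caller as the hypothetical callables `edit` and `score`; none of these choices is fixed by the text.

```python
import math
from typing import Callable, List, Sequence


def build_instruction_set(tone: Sequence[float], slots: Sequence[float],
                          coherence: Sequence[float]) -> List[float]:
    """S45: concatenate the three components and scale to unit L2 norm."""
    concat = list(tone) + list(slots) + list(coherence)
    norm = math.sqrt(sum(v * v for v in concat)) or 1.0  # avoid division by zero
    return [v / norm for v in concat]


def optimize_response(initial_text: str,
                      instructions: Sequence[float],
                      edit: Callable[[str, Sequence[float]], str],
                      score: Callable[[str], float],
                      tol: float = 1e-3,
                      max_iters: int = 10) -> str:
    """S46: re-edit the text until the score changes by less than `tol`."""
    text, prev = initial_text, score(initial_text)
    for _ in range(max_iters):
        candidate = edit(text, instructions)
        current = score(candidate)
        converged = abs(current - prev) < tol
        text, prev = candidate, current
        if converged:
            break
    return text
```

A toy run: with an `edit` function that appends a missing final period and `score=len`, the text "hello" becomes "hello." on the first pass and the loop stops on the second pass because the score no longer changes. The `max_iters` cap guards against an editor that never settles.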

[0107] In one embodiment, S51, the formula used to calculate the similarity matrix between the fusion tensors of each index is:

[0108]

[0109] where the terms of the formula denote, in order: the similarity value between the i-th and j-th indicator fusion tensors; the dynamic weighting coefficients of the corresponding classification indicators, namely the dialogue sentiment indicator, the dialogue behavior tendency indicator, and the dialogue intent confidence indicator; the k-th feature subspace of the i-th indicator fusion tensor; the feature space scaling factor; the bias correction term; the LeakyReLU activation function; and the tensor inner product operation.

[0110] Specifically, this formula is used to calculate the similarity matrix between the fusion tensors of the various indicators. By quantifying the correlation between multi-dimensional features, it provides a measurable similarity benchmark for the subsequent clustering operations and supports the generation of indicator clusters. Through dynamic weighting and nonlinear transformation, the formula integrates the contributions of the three feature subspaces (tone prosody, semantic role, and contextual association) into a normalized similarity value, so that the similarity calculation reflects the intrinsic relationships between features while remaining numerically stable enough for the clustering algorithm. The roles of the variables are as follows. The dynamic weighting coefficients adjust the weight of each feature subspace according to the importance of the dialogue sentiment, behavioral tendency, and intent confidence indicators; for example, when the sentiment indicator weight is higher, the similarity of the tone prosody features contributes more to the overall similarity. The feature space scaling factor controls the range of the inner product and avoids numerical deviations caused by excessively large or small feature values. The bias correction term compensates for imbalance in the feature distribution. The LeakyReLU activation function introduces nonlinearity, makes the formula more robust to negative inputs, and prevents vanishing gradients. The tensor inner product directly computes the dot product of the i-th and j-th tensors in the k-th feature subspace to capture the angular similarity of the feature vectors. The denominator applies L2 normalization to constrain the similarity value to a reasonable range and avoid bias caused by scale differences during clustering.

[0111] This formula relies on an index fusion tensor, which is obtained by fusing feature vectors of tone prosody, semantic roles, and contextual associations through weighted coefficients. Therefore, the input of the formula directly inherits from the feature decoupling and dynamic weighting process. The output of the similarity matrix is used to generate initial index clusters based on a similarity threshold, providing a foundation for optimizing the density peak algorithm and forming a complete process from feature fusion to cluster optimization. This design not only enhances the accuracy and interpretability of similarity calculation but also ensures the coherence of the entire clustering process, enabling adaptive handling of multi-dimensional intent features in complex interactive scenarios.
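
The formula itself is not reproduced in this text, so the following Python sketch only shows one form that is consistent with the terms described above; it should be read as an assumption, not as the claimed formula. For each of the three feature subspaces it computes the inner product, applies a scaling factor and a bias, passes the result through LeakyReLU, divides by the product of the L2 norms, and sums the results with the indicator weights. The names `similarity`, `gamma`, `bias`, and `initial_clusters`, and the greedy threshold grouping, are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Tensor = Dict[str, Sequence[float]]  # feature subspace name -> vector


def leaky_relu(x: float, slope: float = 0.01) -> float:
    return x if x >= 0 else slope * x


def similarity(t_i: Tensor, t_j: Tensor, weights: Dict[str, float],
               gamma: float = 1.0, bias: float = 0.0) -> float:
    """Assumed form: sum_k w_k * LeakyReLU(gamma * <a, b> + bias) / (|a| * |b|)."""
    total = 0.0
    for name, w in weights.items():
        a, b = t_i[name], t_j[name]
        inner = sum(x * y for x, y in zip(a, b))
        norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        total += w * leaky_relu(gamma * inner + bias) / (norms + 1e-12)
    return total


def initial_clusters(tensors: Sequence[Tensor], weights: Dict[str, float],
                     threshold: float) -> List[List[int]]:
    """Greedy grouping: join the first cluster whose first member is similar enough."""
    clusters: List[List[int]] = []
    for idx, t in enumerate(tensors):
        for members in clusters:
            if similarity(tensors[members[0]], t, weights) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append([idx])  # no similar cluster found: start a new one
    return clusters
```

With `gamma=1` and `bias=0`, a tensor compared with itself scores approximately the sum of the weights, and tensors that are orthogonal in every subspace score 0, so a threshold between those values separates them.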

[0112] In one embodiment, the response text, indicator clusters, and comprehensive intent representation vector are input into the speech synthesis engine to generate speech response content, including:

[0113] S61, based on the response text, extract basic speech synthesis feature parameters;

[0114] S62, based on the dialogue sentiment indicator, dialogue behavior tendency indicator, and dialogue intent confidence indicator of the indicator clusters, use the Mel-cepstral coefficient transform to generate speech adjustment parameters, the speech adjustment parameters including the speech fundamental frequency profile and energy distribution;

[0115] S63, based on the comprehensive intent representation vector, perform time-frequency domain correction on the speech adjustment parameters to generate corrected speech adjustment parameters;

[0116] S64, input the basic speech synthesis feature parameters and the corrected speech adjustment parameters into the speech synthesis engine, and generate the original waveform of the speech response through the waveform generation algorithm to obtain the speech response content.

[0117] For example, basic speech synthesis feature parameters, such as phoneme sequences, phoneme durations, and spectral envelopes, are extracted from the response text and generated through text front-end processing modules (such as word segmenters and phoneme converters) and acoustic models (such as LSTM-based prediction networks), providing a basic acoustic framework for speech synthesis. Based on dialogue sentiment indicators, dialogue behavior tendency indicators, and dialogue intent confidence indicators from indicator clusters, Mel-Cepstral Coefficient Transform is used to generate speech adjustment parameters. The sentiment indicator controls the fundamental frequency profile of the speech through a predefined sentiment-prosodic mapping table (e.g., mapping positive sentiment to a higher fundamental frequency variation range), the behavior tendency indicator adjusts the energy distribution pattern through statistical models (such as Gaussian mixture models) (e.g., stronger energy concentration corresponds to command-type behaviors), and the intent confidence indicator adjusts the speech rate and pause duration through linear interpolation algorithms, forming speech adjustment parameters with emotional expression and behavioral adaptation characteristics. The speech adjustment parameters are corrected in the time and frequency domains based on the comprehensive intent representation vector. Frequency domain filtering techniques (such as least mean square adaptive filters) are used to correct local fluctuations in the fundamental frequency profile, and time-domain interpolation algorithms are employed to smooth the energy distribution, generating corrected speech adjustment parameters to eliminate cross-modal bias. The basic speech synthesis feature parameters and the corrected speech adjustment parameters are input into the speech synthesis engine. 
Waveform generation algorithms (such as neural vocoders or source-filter models) are used for parameter fusion and waveform reconstruction. The basic parameters provide the acoustic skeleton, while the corrected parameters inject personalized features, generating multi-dimensional speech response content that conforms to the user's intent. This step, through parameter cascading and correction mechanisms, ensures that the speech output maintains naturalness while achieving consistent expression of emotion, behavior, and intent.
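
The parameter adjustments described in this embodiment can be illustrated with a short Python sketch. It covers only the control logic, not the Mel-cepstral transform or the vocoder, and every numeric value is a made-up placeholder: the sentiment table, the speech rate bounds, and the smoothing window are assumptions chosen for illustration. Time-domain smoothing is shown as a simple moving average.

```python
from typing import Dict, List, Sequence

# Hypothetical sentiment-to-prosody table: factor applied to the F0 variation range.
SENTIMENT_F0_RANGE: Dict[str, float] = {
    "positive": 1.2,
    "neutral": 1.0,
    "negative": 0.85,
}


def adjust_f0(f0_contour: Sequence[float], sentiment: str) -> List[float]:
    """Widen or narrow the F0 excursions around the contour mean."""
    scale = SENTIMENT_F0_RANGE[sentiment]
    center = sum(f0_contour) / len(f0_contour)
    return [center + scale * (f - center) for f in f0_contour]


def speech_rate(confidence: float, slow: float = 0.9, fast: float = 1.1) -> float:
    """Linear interpolation between a slow and a fast rate by intent confidence."""
    c = min(max(confidence, 0.0), 1.0)  # clamp to [0, 1]
    return slow + c * (fast - slow)


def smooth(values: Sequence[float], window: int = 3) -> List[float]:
    """Centered moving average; the window is truncated at the edges."""
    half = window // 2
    out = []
    for i in range(len(values)):
        segment = values[max(0, i - half): i + half + 1]
        out.append(sum(segment) / len(segment))
    return out
```

For instance, a two-point contour of 100 Hz and 200 Hz under "positive" sentiment is stretched to roughly 90 Hz and 210 Hz, and an energy spike `[0, 3, 0]` is smoothed to `[1.5, 1.0, 1.5]`.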

[0118] The aforementioned speech interaction method based on a large language model acquires user speech signals and extracts time-frequency domain features using a convolutional neural network with convolutional layers in the time, frequency, and channel domains. It generates a Mel spectrogram based on a Mel filter and generates speech representation vectors through residual connections and multi-scale downsampling. This effectively suppresses environmental noise and prosodic variation interference, improving the robustness and stability of feature representations. Acoustic text decoding is performed synchronously based on the user speech signal to obtain a text sequence, which is then input into a pre-trained language model for semantic encoding to generate semantic embedding vectors. This achieves accurate semantic mapping of speech content, reducing error accumulation and semantic distortion in traditional serial processing. A context-aware window is constructed to partition the speech representation vector in the time and frequency domains. An alignment weight matrix is generated by combining cross-modal contrastive learning, and speech-semantic alignment is performed using KL divergence check and binarization gating mechanisms to obtain a comprehensive intent representation vector. This enhances feature alignment accuracy, suppresses cross-modal noise, and improves the accuracy of intent understanding. Based on a comprehensive intent representation vector, this system extracts prosodic, semantic role, and contextual association feature vectors through multi-dimensional feature decoupling. These vectors are then dynamically weighted and fused according to dialogue sentiment, behavioral tendencies, and intent confidence indices. Density peak clustering is used to generate index clusters, deeply mining emotional and behavioral features in user interactions and improving the system's adaptability and discriminative ability in complex scenarios. 
The indicator clusters and comprehensive intent representation vector are input into a Transformer decoder, which integrates a text generator, intonation pattern mapping rules, an entity relationship graph, and discourse coherence correction rules to generate the initial response text. Iterative optimization of the generated response text enhances the naturalness and accuracy of the response in terms of emotional expression, behavioral adaptation, and contextual coherence. Finally, the response text, indicator clusters, and comprehensive intent representation vector are input into a speech synthesis engine. The Mel-cepstral coefficient transform is used to generate speech adjustment parameters, which are then corrected in the time-frequency domain to generate personalized speech response content, achieving a low-latency, highly natural interactive experience. Overall, this method effectively addresses the problems of low alignment accuracy, insufficient feature mining, and poor interaction naturalness in traditional technologies through multi-feature decoupling, dynamic weighting, and clustering optimization throughout the entire process, thereby comprehensively improving the accuracy of voice interaction and user experience.

[0119] It should be understood that although the steps in the flowcharts of the embodiments described above are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the embodiments described above may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.

[0120] Based on the same inventive concept, this application also provides a large language model-based voice interaction system for implementing the aforementioned large language model-based voice interaction method. The solution provided by this system is similar to the implementation scheme described in the above method. Therefore, the specific limitations of one or more embodiments of the large language model-based voice interaction system provided below can be found in the limitations of the large language model-based voice interaction method described above, and will not be repeated here.

[0121] In one exemplary embodiment, as shown in Figure 2, a voice interaction system based on a large language model is provided, including:

[0122] The speech feature extraction module 101 is used to collect user speech signals and input the user speech signals into a convolutional neural network to perform time-frequency domain feature extraction to obtain a speech representation vector.

[0123] The semantic embedding encoding module 102 is used to perform acoustic text decoding based on the user's speech signal to obtain a text sequence, and input the text sequence into a pre-trained language model for semantic encoding to obtain a semantic embedding vector.

[0124] The speech-semantic alignment module 103 is used to perform speech-semantic alignment based on the speech representation vector and the semantic embedding vector to obtain a comprehensive intent representation vector;

[0125] The indicator clustering generation module 104 is used to cluster indicators based on the comprehensive intent representation vector and according to preset classification indicators to generate indicator clusters.

[0126] The response text generation module 105 is used to input the indicator clusters and the comprehensive intent representation vector into the Transformer decoder to generate response text;

[0127] The speech response synthesis module 106 is used to input the response text, indicator clusters, and comprehensive intent representation vector into the speech synthesis engine to generate speech response content.
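
The data flow between modules 101 to 106 can be sketched as a simple composition of callables. The class name, field names, and the `respond` method are hypothetical; each field stands in for the corresponding module, whose internals are outside the scope of this sketch.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class VoiceInteractionSystem:
    extract_features: Callable[[Any], Any]        # module 101: speech representation vector
    encode_semantics: Callable[[Any], Any]        # module 102: semantic embedding vector
    align: Callable[[Any, Any], Any]              # module 103: comprehensive intent vector
    cluster: Callable[[Any], Any]                 # module 104: indicator clusters
    generate_text: Callable[[Any, Any], str]      # module 105: response text
    synthesize: Callable[[str, Any, Any], Any]    # module 106: speech response content

    def respond(self, speech_signal: Any) -> Any:
        speech_vec = self.extract_features(speech_signal)
        semantic_vec = self.encode_semantics(speech_signal)
        intent = self.align(speech_vec, semantic_vec)
        clusters = self.cluster(intent)
        text = self.generate_text(clusters, intent)
        return self.synthesize(text, clusters, intent)
```

Because modules 101 and 102 both read the same speech signal and do not depend on each other, an implementation could run them concurrently before the alignment step.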

[0128] In one embodiment, the convolutional neural network in the speech feature extraction module 101 includes a temporal convolutional layer, a frequency domain convolutional layer, and a channel domain convolutional layer.

[0129] The speech feature extraction module 101 is also used for:

[0130] Mel spectrograms are extracted using a Mel filter based on the user's speech signal;

[0131] The Mel spectrogram is input into a temporal convolutional layer to extract temporal features, resulting in a temporal feature tensor.

[0132] The time-domain feature residuals are calculated based on the time-domain feature tensor and Mel spectrogram.

[0133] The time-domain feature tensor is input into the frequency-domain convolutional layer, and frequency-domain dimension convolution operation is performed to obtain the initial time-frequency feature tensor. Based on the initial time-frequency feature tensor, the time-frequency fusion feature tensor is generated by combining the time-domain feature residual.

[0134] The time-frequency fusion feature tensor is input into the channel domain convolutional layer to perform channel dimension feature mapping, thereby obtaining the mapped feature tensor. The mapped feature tensor is then subjected to multi-scale downsampling to generate a speech representation vector.

[0135] In one embodiment, the speech-semantic alignment module 103 is further configured to:

[0136] Based on semantic embedding vectors, a context-aware window with a preset number of scales and window size is constructed;

[0137] Based on the context-aware window, the speech representation vector is divided into time-frequency domains to obtain a set of local feature blocks;

[0138] Based on local feature block sets and semantic embedding vectors, cross-modal contrastive learning is performed to generate an alignment weight matrix;

[0139] Based on the set of local feature blocks and semantic embedding vectors, time-frequency domain alignment is performed according to the alignment weight matrix to generate a preliminary intent representation vector.

[0140] The KL divergence calculation method is used to perform semantic consistency verification between the preliminary intent representation vector and the semantic embedding vector, and the consistency verification result is obtained.

[0141] Based on the consistency verification results, a binary gating mechanism is used to suppress cross-modal noise and generate a comprehensive intent representation vector.

[0142] In one embodiment, the classification indicators in the indicator clustering generation module 104 include dialogue sentiment indicators, dialogue behavior tendency indicators, and dialogue intent confidence indicators, as well as indicator weighting coefficients preset according to each classification indicator.

[0143] The indicator clustering generation module 104 is also used for:

[0144] Based on the comprehensive intent representation vector, the tone prosody feature vector, semantic role feature vector and context association feature vector are extracted by multi-dimensional feature decoupling;

[0145] Based on the index weighting coefficients, the prosodic feature vector, semantic role feature vector, and context association feature vector are weighted and fused to obtain the index fusion tensor. Among them, the prosodic feature vector adopts the index weighting coefficient of the corresponding dialogue sentiment index, the semantic role feature vector adopts the index weighting coefficient of the corresponding dialogue behavior tendency index, and the context association feature vector adopts the index weighting coefficient of the corresponding dialogue intent confidence index.

[0146] Calculate the similarity matrix between the fusion tensors of each index, and generate an initial index cluster based on the similarity matrix and the similarity threshold.

[0147] The density peak algorithm is used to optimize the cluster boundaries of the initial index clusters to obtain the index clusters.

[0148] In one embodiment, the Transformer decoder in the response text generation module 105 presets a text generator, intonation pattern mapping rules, entity relationship graph, and discourse coherence correction rules.

[0149] The response text generation module 105 is also used for:

[0150] Based on the comprehensive intent representation vector, a text generator is used to generate the initial response text;

[0151] Based on the dialogue sentiment index of index clustering, the intonation control vector of the initial response text is generated by the intonation pattern mapping rule.

[0152] Based on the dialogue behavior tendency index of index clustering, combined with the initial response text, the entity relationship graph is used to retrieve the associated entity triples, and semantic slot filling instructions are generated according to the associated entity triples.

[0153] Extract contextual fragments from the initial response text, and based on these contextual fragments, combine them with the dialogue intent confidence index of the index cluster, and generate a discourse coherence correction vector according to the discourse coherence correction rules.

[0154] Integrate intonation control vectors, semantic slot filling instructions, and discourse coherence correction vectors to generate a text optimization instruction set;

[0155] The initial response text is iteratively optimized according to the text optimization instruction set until the preset convergence condition is met, and the response text is obtained.

[0156] In one embodiment, the formula used by the indicator clustering generation module 104 to calculate the similarity matrix between the indicator fusion tensors is:

[0157]

[0158] The terms of the formula are: the similarity value between the i-th and j-th indicator fusion tensors; the dynamic weighting coefficient of each classification indicator, the classification indicators being the dialogue sentiment indicator, the dialogue behavior tendency indicator, and the dialogue intent confidence indicator; the k-th feature subspace of the i-th indicator fusion tensor; the feature space scaling factor; the bias correction term; the LeakyReLU activation function; and the tensor inner product operation.
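The terms listed above suggest one plausible form of the similarity computation, sketched below for illustration. The combination assumed here (LeakyReLU applied to the weighted sum, over the feature subspaces, of tensor inner products, divided by the scaling factor and offset by the bias correction term) is an assumption and not the formula of this application, which is not reproduced in this text. The function names, parameter names, and array layout are hypothetical.

```python
import numpy as np


def leaky_relu(x, negative_slope=0.01):
    """LeakyReLU: identity for x >= 0, a small linear slope for x < 0."""
    return np.where(x >= 0, x, negative_slope * x)


def similarity_matrix(subspaces, weights, scale, bias):
    """Assumed form: S[i, j] = LeakyReLU(sum_k w_k * <T_i^(k), T_j^(k)> / scale + bias).

    subspaces: array of shape (n, K, d); subspaces[i, k] is the k-th feature
               subspace of the i-th fusion tensor, flattened to d values.
    weights:   array of shape (K,), one weighting coefficient per classification
               indicator (sentiment, behavior tendency, intent confidence).
    scale:     feature space scaling factor.
    bias:      bias correction term.
    """
    subspaces = np.asarray(subspaces, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # inner[i, j, k] is the inner product of the k-th subspaces of tensors i and j.
    inner = np.einsum("ikd,jkd->ijk", subspaces, subspaces)
    weighted = inner @ weights               # weighted sum over k, shape (n, n)
    return leaky_relu(weighted / scale + bias)
```

Because the inner product is symmetric in i and j, the resulting matrix is symmetric under this assumed form, which is convenient for the threshold-based clustering that follows.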

[0159] In one embodiment, the voice response synthesis module 106 is further configured to:

[0160] Based on the response text, extract basic speech synthesis feature parameters;

[0161] Based on the dialogue sentiment indicator, dialogue behavior tendency indicator, and dialogue intent confidence indicator of the indicator clusters, speech adjustment parameters are generated using a Mel-cepstral coefficient transform; the speech adjustment parameters include the fundamental frequency contour and the energy distribution of the speech;

[0162] Based on the comprehensive intent representation vector, the speech adjustment parameters are corrected in the time-frequency domain to generate corrected speech adjustment parameters;

[0163] The basic speech synthesis feature parameters and the corrected speech adjustment parameters are input into the speech synthesis engine, and the original waveform of the speech response is generated by the waveform generation algorithm to obtain the speech response content.
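To make the role of the speech adjustment parameters concrete, the following toy Python sketch derives an adjusted fundamental frequency contour and energy distribution from the three indicators. It deliberately replaces the Mel-cepstral coefficient transform with simple linear scaling, and it omits the time-frequency correction step. The indicator range [0, 1], the scaling constants, and the function name are assumptions, not values from this application.

```python
import numpy as np


def speech_adjustment_parameters(f0_hz, energy, sentiment, behavior, confidence):
    """Toy mapping from indicator values (assumed in [0, 1]) to prosody adjustments.

    f0_hz:  per-frame fundamental frequency in Hz, 0 for unvoiced frames.
    energy: per-frame energy.
    Returns the adjusted F0 contour and the adjusted energy distribution.
    """
    f0_hz = np.asarray(f0_hz, dtype=float)
    energy = np.asarray(energy, dtype=float)
    voiced = f0_hz > 0

    # Assumed rules: sentiment shifts overall pitch by up to +/-10 %,
    # behavior tendency shifts energy by up to +/-15 %,
    # and low intent confidence narrows the pitch range around its mean.
    pitch_scale = 1.0 + 0.2 * (sentiment - 0.5)
    energy_scale = 1.0 + 0.3 * (behavior - 0.5)
    range_scale = 0.5 + 0.5 * confidence

    mean_f0 = f0_hz[voiced].mean() if voiced.any() else 0.0
    contour = np.where(
        voiced,
        (mean_f0 + range_scale * (f0_hz - mean_f0)) * pitch_scale,
        0.0,  # unvoiced frames stay unvoiced
    )
    return contour, energy * energy_scale
```

With neutral indicator values (sentiment 0.5, behavior 0.5, confidence 1.0) the function returns its inputs unchanged, so the basic speech synthesis feature parameters pass through unmodified in that case.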

[0164] In one embodiment, a computer device is provided, including a memory and a processor, the memory storing a computer program, the processor executing the computer program to implement the steps of the previously described large language model-based voice interaction method.

[0165] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the steps in the above method embodiments.

[0166] For the device embodiments, since they basically correspond to the method embodiments, the relevant parts can be referred to in the description of the method embodiments. The device embodiments described above are merely illustrative. The components described as separate parts may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this disclosure according to actual needs. Those skilled in the art can understand and implement this without creative effort.

[0167] The above-described embodiments are merely illustrative of several implementation methods of the embodiments of this application, and their descriptions are relatively specific and detailed. However, they should not be construed as limiting the scope of the patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the embodiments of this application, and these modifications and improvements all fall within the protection scope of the embodiments of this application.

Claims

1. A voice interaction method based on a large language model, characterized in that the method includes:
collecting a user's speech signal and inputting the user's speech signal into a convolutional neural network for time-frequency domain feature extraction to obtain a speech representation vector;
performing acoustic text decoding based on the user's speech signal to obtain a text sequence, and inputting the text sequence into a pre-trained language model for semantic encoding to obtain a semantic embedding vector;
performing speech-semantic alignment based on the speech representation vector and the semantic embedding vector to obtain a comprehensive intent representation vector;
clustering, based on the comprehensive intent representation vector, according to preset classification indicators to generate indicator clusters;
inputting the indicator clusters and the comprehensive intent representation vector into a Transformer decoder to generate response text; and
inputting the response text, the indicator clusters, and the comprehensive intent representation vector into a speech synthesis engine to generate speech response content.

2. The method according to claim 1, characterized in that the convolutional neural network includes a time-domain convolutional layer, a frequency-domain convolutional layer, and a channel-domain convolutional layer;
the step of inputting the user's speech signal into the convolutional neural network for time-frequency domain feature extraction to obtain the speech representation vector includes:
extracting a Mel spectrogram from the user's speech signal using a Mel filter;
inputting the Mel spectrogram into the time-domain convolutional layer to extract time-domain features and obtain a time-domain feature tensor;
calculating a time-domain feature residual based on the time-domain feature tensor and the Mel spectrogram;
inputting the time-domain feature tensor into the frequency-domain convolutional layer for convolution along the frequency dimension to obtain an initial time-frequency feature tensor;
generating a time-frequency fusion feature tensor based on the initial time-frequency feature tensor in combination with the time-domain feature residual;
inputting the time-frequency fusion feature tensor into the channel-domain convolutional layer for channel-dimension feature mapping to obtain a mapped feature tensor; and
performing multi-scale downsampling on the mapped feature tensor to generate the speech representation vector.

3. The method according to claim 2, characterized in that the step of performing speech-semantic alignment based on the speech representation vector and the semantic embedding vector to obtain the comprehensive intent representation vector includes:
constructing, based on the semantic embedding vector, a context-aware window with a preset number of scales and window size;
dividing the speech representation vector in the time-frequency domain based on the context-aware window to obtain a set of local feature blocks;
performing cross-modal contrastive learning based on the set of local feature blocks and the semantic embedding vector to generate an alignment weight matrix;
performing time-frequency domain alignment of the set of local feature blocks and the semantic embedding vector according to the alignment weight matrix to generate a preliminary intent representation vector;
performing semantic consistency verification between the preliminary intent representation vector and the semantic embedding vector using KL divergence to obtain a consistency verification result; and
suppressing cross-modal noise using a binary gating mechanism based on the consistency verification result to generate the comprehensive intent representation vector.

4. The method according to claim 1, characterized in that the classification indicators include a dialogue sentiment indicator, a dialogue behavior tendency indicator, and a dialogue intent confidence indicator, together with a preset indicator weighting coefficient for each of the classification indicators;
the step of clustering, based on the comprehensive intent representation vector, according to the preset classification indicators to generate the indicator clusters includes:
extracting a tone prosody feature vector, a semantic role feature vector, and a context association feature vector from the comprehensive intent representation vector through multi-dimensional feature decoupling;
performing weighted fusion of the tone prosody feature vector, the semantic role feature vector, and the context association feature vector based on the indicator weighting coefficients to obtain an indicator fusion tensor, wherein the tone prosody feature vector uses the indicator weighting coefficient corresponding to the dialogue sentiment indicator, the semantic role feature vector uses the indicator weighting coefficient corresponding to the dialogue behavior tendency indicator, and the context association feature vector uses the indicator weighting coefficient corresponding to the dialogue intent confidence indicator;
calculating a similarity matrix between the indicator fusion tensors, and generating initial indicator clusters based on the similarity matrix and a similarity threshold; and
optimizing the cluster boundaries of the initial indicator clusters using the density peak algorithm to obtain the indicator clusters.

5. The method according to claim 4, characterized in that the Transformer decoder is preconfigured with a text generator, intonation pattern mapping rules, an entity relationship graph, and discourse coherence correction rules;
the step of inputting the indicator clusters and the comprehensive intent representation vector into the Transformer decoder to generate the response text includes:
generating initial response text using the text generator based on the comprehensive intent representation vector;
generating an intonation control vector for the initial response text using the intonation pattern mapping rules based on the dialogue sentiment indicator of the indicator clusters;
retrieving associated entity triples from the entity relationship graph based on the dialogue behavior tendency indicator of the indicator clusters in combination with the initial response text, and generating semantic slot filling instructions according to the associated entity triples;
extracting contextual fragments of the initial response text, and generating a discourse coherence correction vector according to the discourse coherence correction rules based on the contextual fragments in combination with the dialogue intent confidence indicator of the indicator clusters;
integrating the intonation control vector, the semantic slot filling instructions, and the discourse coherence correction vector to generate a text optimization instruction set; and
iteratively optimizing the initial response text according to the text optimization instruction set until a preset convergence condition is met, to obtain the response text.

6. The method according to claim 4, characterized in that the similarity matrix between the indicator fusion tensors is calculated by a formula whose terms are: the similarity value between the i-th and j-th indicator fusion tensors; the dynamic weighting coefficient of each classification indicator, the classification indicators being the dialogue sentiment indicator, the dialogue behavior tendency indicator, and the dialogue intent confidence indicator; the k-th feature subspace of the i-th indicator fusion tensor; the feature space scaling factor; the bias correction term; the LeakyReLU activation function; and the tensor inner product operation.

7. The method according to claim 5, characterized in that the step of inputting the response text, the indicator clusters, and the comprehensive intent representation vector into the speech synthesis engine to generate the speech response content includes:
extracting basic speech synthesis feature parameters based on the response text;
generating speech adjustment parameters using a Mel-cepstral coefficient transform based on the dialogue sentiment indicator, the dialogue behavior tendency indicator, and the dialogue intent confidence indicator of the indicator clusters, the speech adjustment parameters including the fundamental frequency contour and the energy distribution of the speech;
correcting the speech adjustment parameters in the time-frequency domain based on the comprehensive intent representation vector to generate corrected speech adjustment parameters; and
inputting the basic speech synthesis feature parameters and the corrected speech adjustment parameters into the speech synthesis engine, and generating the original waveform of the speech response using a waveform generation algorithm to obtain the speech response content.

8. A voice interaction system based on a large language model, characterized in that the system includes:
a speech feature extraction module, configured to collect a user's speech signal and input the user's speech signal into a convolutional neural network for time-frequency domain feature extraction to obtain a speech representation vector;
a semantic embedding encoding module, configured to perform acoustic text decoding based on the user's speech signal to obtain a text sequence, and to input the text sequence into a pre-trained language model for semantic encoding to obtain a semantic embedding vector;
a speech-semantic alignment module, configured to perform speech-semantic alignment based on the speech representation vector and the semantic embedding vector to obtain a comprehensive intent representation vector;
an indicator clustering generation module, configured to perform clustering based on the comprehensive intent representation vector according to preset classification indicators to generate indicator clusters;
a response text generation module, configured to input the indicator clusters and the comprehensive intent representation vector into a Transformer decoder to generate response text; and
a speech response synthesis module, configured to input the response text, the indicator clusters, and the comprehensive intent representation vector into a speech synthesis engine to generate speech response content.

9. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 7.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 7.