Intelligent voice human-machine interaction system adapted to multiple dialects
By using an improved adversarial autoencoder network and an attention mechanism, the system separates text content features from dialect accent features, addressing the mismatch between multi-dialect speech recognition and synthesis in existing technologies and achieving stable, well-matched human-computer interaction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHANGHAI HAIHE INFORMATION TECHNOLOGY CO LTD
- Filing Date
- 2026-04-17
- Publication Date
- 2026-06-12
AI Technical Summary
Existing intelligent voice human-computer interaction systems suffer from instability in text recognition and semantic understanding when processing multi-dialect speech due to the coupling of acoustic features with text content-related features. This results in a mismatch between the speech synthesis output and the user's accent, leading to insufficient adaptation.
An improved adversarial autoencoder network is used to decouple deep features, separate text content-invariant features from dialect accent features, combine attention mechanism speech recognition and dialect accent classification to form intermediate semantic results with accent annotation, and generate matching synthetic speech through a multi-dialect speech synthesis model.
It improves the stability of text recognition and the accuracy of semantic understanding, ensures the matching of synthesized speech with user accents, and optimizes the adaptation of voice human-computer interaction in multi-dialect scenarios.
Figure CN122201298A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of intelligent voice interaction technology, and in particular to an intelligent voice human-computer interaction system that adapts to multiple dialect accents.
Background Technology
[0002] When processing speech with multiple dialect accents, existing intelligent voice human-computer interaction systems directly collect speech samples containing dialect accents to extract acoustic features and directly use the overall acoustic features for speech recognition and accent classification. Semantic understanding relies solely on the recognized text sequence to complete intent and slot parsing. Speech synthesis uses a general model to output speech with a fixed accent. Text content-related features and dialect accent-related features are always kept in a coupled state, without independently separating and processing the two types of features.
[0003] Coupled acoustic features can cause dialect accents to directly interfere with the text content recognition process, easily leading to deviations in the generation of text phoneme sequences. Dialect accent classification results are greatly affected by changes in text content, making it difficult to guarantee the stability of the classification results. Semantic understanding models only process pure text information and cannot adapt to the expression logic of dialect contexts by incorporating dialect accent attributes. The accent output by speech synthesis cannot match the dialect accent input by the user, resulting in significant deficiencies in the adaptation effect of voice human-computer interaction in multi-dialect scenarios.
[0004] A dedicated network is needed to perform deep feature decoupling on the standard acoustic feature vectors, separating independent text content-invariant features and dialect accent features to reduce the interference of accent features on text recognition. At the same time, the text phoneme sequence and dialect accent category labels are fused to form semantic results with accent annotation. Then, intent and slot parsing are performed, and the corresponding speech synthesizer is selected according to the dialect accent category label to generate synthesized speech that matches the user's dialect accent.
Summary of the Invention
[0005] The purpose of this invention is to overcome the shortcomings of existing technologies and propose an intelligent voice human-computer interaction system that is adapted to multiple dialects and accents.
[0006] To achieve the above objectives, the present invention adopts the following technical solution: an intelligent voice human-computer interaction system adapted to multiple dialect accents, comprising: The feature decoupling module collects raw speech samples containing multiple dialect accents to obtain standard acoustic feature vectors. It then performs deep feature decoupling on the standard acoustic feature vectors through an improved adversarial autoencoder network to separate text content invariant features and dialect accent features. The recognition and classification module inputs the text content invariant features into the attention-based speech recognition encoder to generate a text phoneme sequence. At the same time, it inputs the dialect accent features into the dialect accent classifier to obtain dialect accent category labels. The semantic understanding module fuses the text phoneme sequence with the dialect accent category label to form an intermediate semantic result with accent annotation, and inputs the intermediate semantic result with accent annotation into the colloquial natural language understanding model for intent and slot parsing; The interactive response module, based on the intent and slot parsing results, calls the corresponding service interface to obtain feedback information, and selects the corresponding target dialect speech synthesizer from the multi-dialect speech synthesis model based on the dialect accent category label. The feedback information is input into the target dialect speech synthesizer to generate target synthesized speech that conforms to the target dialect accent, and the target synthesized speech is output.
[0007] As a further aspect of the present invention, the working principle of the improved adversarial autoencoder network is as follows: The improved adversarial autoencoder network includes a shared encoder, a text content encoder, a dialect accent encoder, a shared decoder, a text content discriminator, and a dialect accent discriminator. The shared encoder performs preliminary encoding on the input standard acoustic feature vector to generate a shared hidden layer vector; The text content encoder processes the shared hidden layer vector to extract the text content invariant features, and the dialect accent encoder processes the shared hidden layer vector to extract the dialect accent features. The shared decoder receives a reconstructed feature vector composed of the text content-invariant features and a randomly sampled reference dialect accent feature, and attempts to reconstruct an output feature that is similar to the input standard acoustic feature vector. The text content discriminator is used to determine whether the input text content invariant feature comes from a real speech sample or is generated by the network, and the dialect accent discriminator is used to determine whether the input dialect accent feature comes from a real speech sample or is generated by the network. During training, by minimizing the reconstruction loss of the shared decoder and maximizing the discrimination error rate of the text content discriminator and the dialect accent discriminator, the improved adversarial autoencoder network effectively decouples the mutually separate and complete text content invariant features and dialect accent features.
[0008] As a further aspect of the present invention, collecting original speech samples containing multiple dialect accents to obtain a standard acoustic feature vector, and performing deep feature decoupling on the standard acoustic feature vector through the improved adversarial autoencoder network to separate text content-invariant features and dialect accent features, includes: The original speech samples are pre-emphasized, framed, and windowed to obtain the pre-processed speech frame sequence. The Mel frequency cepstral coefficient features of the pre-processed speech frame sequence are calculated to obtain a standard acoustic feature vector, and the standard acoustic feature vector is input into the improved adversarial autoencoder network for deep feature decoupling. The improved adversarial autoencoder network separates text content-invariant features related to text content and dialect accent features related to speaker dialect accent from the standard acoustic feature vector.
As a further aspect of the present invention, calculating the Mel frequency cepstral coefficient features of the pre-processed speech frame sequence to obtain a standard acoustic feature vector, and inputting the standard acoustic feature vector into the improved adversarial autoencoder network for deep feature decoupling, specifically includes: A Fast Fourier Transform is performed on each pre-processed speech frame to obtain the amplitude spectrum of each speech frame; the amplitude spectrum is passed through a set of Mel-scale triangular filters, and the logarithmic energy of the output of each filter is calculated; a discrete cosine transform is performed on the logarithmic energy, and a specific number of coefficients is retained to construct the Mel frequency cepstral coefficient features of the speech frame; the Mel frequency cepstral coefficient features of all speech frames are concatenated in chronological order to form the standard acoustic feature vector; after the standard acoustic feature vector is normalized, it is input into the shared encoder of the improved adversarial autoencoder network, and the shared encoder outputs the shared hidden layer vector.
[0009] As a further aspect of the present invention, the text phoneme sequence is fused with the dialect accent category label to form an intermediate semantic result with accent annotation, and the intermediate semantic result with accent annotation is input into a spoken language natural language understanding model for intent and slot parsing, including: The text phoneme sequence is converted into the original text sequence using a phoneme-to-text conversion model; The dialect accent category labels are converted into corresponding dialect identifiers, and the dialect identifiers are inserted as prefixes at the beginning of the original text sequence to generate the intermediate semantic results with accent annotations. The spoken language natural language understanding model has a parameter adaptation layer for different dialect identifiers. When the intermediate semantic result with accent annotation is received, the corresponding parameter adaptation layer is activated according to the dialect identifier. The original text sequence is encoded by the activated parameter adaptation layer to extract a deep semantic vector containing dialect grammar and expression habits; The deep semantic vector is input into the intent classification head and the slot filling head to obtain the user's interaction intent category and the key information fragments related to the interaction intent in the original text sequence, and the key information fragments are the slot values.
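The prefixing step and the selection of a parameter adaptation layer can be illustrated with a minimal sketch. The identifier tokens, the label numbering, and the names `DIALECT_IDENTIFIERS`, `annotate`, and `select_adapter` are hypothetical and are not given in the source; the adaptation layers are represented here only as opaque values in a dictionary.

```python
from dataclasses import dataclass

# Hypothetical mapping from accent category labels to dialect identifiers.
# The source does not list concrete labels or token spellings.
DIALECT_IDENTIFIERS = {0: "<dialect_a>", 1: "<dialect_b>", 2: "<dialect_c>"}


@dataclass
class AnnotatedResult:
    """Intermediate semantic result with accent annotation."""
    dialect_id: str
    text: str

    def as_model_input(self) -> str:
        # The dialect identifier is inserted as a prefix of the text sequence.
        return f"{self.dialect_id} {self.text}"


def annotate(text: str, accent_label: int) -> AnnotatedResult:
    """Convert the accent label to its identifier and attach it to the text."""
    return AnnotatedResult(DIALECT_IDENTIFIERS[accent_label], text)


def select_adapter(result: AnnotatedResult, adapters: dict):
    """Pick the parameter adaptation layer registered for this dialect identifier."""
    return adapters[result.dialect_id]
```

For example, `annotate("book a ticket", 1).as_model_input()` yields the prefixed string that would be passed to the understanding model, and `select_adapter` returns whichever adaptation layer is registered under the same identifier.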
[0010] As a further aspect of the present invention, based on the intent and slot parsing results, the corresponding service interface is invoked to obtain feedback information, and based on the dialect accent category label, the corresponding target dialect speech synthesizer is selected from the multi-dialect speech synthesis model, including: Based on the interaction intent category, determine the type of backend service to be accessed, and construct a query request that conforms to the interface specification of the backend service type based on the slot value. The query request is sent to the service interface, and the structured data returned by the service interface is received as the original feedback information. The original feedback information is filled into a pre-set text template to generate grammatically correct written feedback text. Based on the dialect accent category label, search for dialect speech synthesis model instances associated with the dialect accent category label in the preset model registry, and load the found dialect speech synthesis model instances as the target dialect speech synthesizer; The multi-dialect speech synthesis model contains multiple independently trained dialect speech synthesizer instances, each specifically trained for a particular dialect accent.
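A minimal sketch of the template filling and registry lookup described above is given below. The template wording, the intent name `weather_query`, the registry keys, and the function names are all assumptions made for illustration; real synthesizer instances are replaced by placeholder strings.

```python
from string import Template

# Hypothetical preset text template, keyed by interaction intent category.
FEEDBACK_TEMPLATES = {
    "weather_query": Template("The weather in $city is $condition, $temp degrees."),
}

# Hypothetical model registry: accent category label -> synthesizer instance.
# Strings stand in for loaded model objects.
MODEL_REGISTRY = {
    "dialect_a": "tts_instance_dialect_a",
    "dialect_b": "tts_instance_dialect_b",
}


def build_feedback_text(intent: str, structured_data: dict) -> str:
    """Fill the structured data returned by the service into the preset template."""
    return FEEDBACK_TEMPLATES[intent].substitute(structured_data)


def select_synthesizer(accent_label: str) -> str:
    """Look up the synthesizer instance associated with the accent label."""
    try:
        return MODEL_REGISTRY[accent_label]
    except KeyError:
        raise LookupError(f"no synthesizer registered for {accent_label!r}") from None
```

Keeping the registry as a plain mapping means that registering a synthesizer for a new accent category only requires adding one entry.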
[0011] As a further aspect of the present invention, the feedback information is input into the target dialect speech synthesizer to generate target synthesized speech that conforms to the target dialect accent, including: The target dialect speech synthesizer receives the written feedback text and performs text regularization and word segmentation on the written feedback text; The processed text sequence is input into the dialect phoneme converter inside the target dialect speech synthesizer to convert the standard phoneme sequence into a dialect phoneme sequence that conforms to the pronunciation habits of the target dialect. The dialect phoneme sequence is input into the duration predictor to predict the duration of each dialect phoneme and prosodic boundary. Based on the predicted duration, the dialect phoneme sequence is extended and aligned to generate an aligned phoneme sequence containing temporal information; The aligned phoneme sequence is input into the acoustic feature predictor to predict an acoustic feature sequence containing Mel spectral features; The acoustic feature sequence is input into the corresponding dialect vocoder to synthesize the final target synthesized speech.
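The extension and alignment step can be sketched as a simple length regulator that repeats each phoneme for its predicted number of frames. This is an illustrative assumption about how the alignment might be realized; the source does not specify the unit of duration, and the function name `expand_by_duration` is hypothetical.

```python
def expand_by_duration(phonemes, durations):
    """Repeat each phoneme for its predicted number of frames.

    Durations are assumed to be non-negative integers in frames.
    """
    if len(phonemes) != len(durations):
        raise ValueError("phonemes and durations must have the same length")
    aligned = []
    for phoneme, frames in zip(phonemes, durations):
        aligned.extend([phoneme] * frames)
    return aligned
```

With phonemes `["n", "i", "h"]` and durations `[2, 3, 1]`, the aligned sequence has six entries, one per frame, which the acoustic feature predictor could then consume.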
[0012] As a further aspect of the present invention, the step of inputting the processed text sequence into the dialect phoneme converter inside the target dialect speech synthesizer to convert the standard phoneme sequence into a dialect phoneme sequence that conforms to the pronunciation habits of the target dialect includes: The dialect phoneme converter stores a dialect pronunciation rule base and a dialect-specific vocabulary mapping table. For the processed input text sequence, first replace the standard words in the text with the corresponding dialect words according to the dialect-specific vocabulary mapping table; For the text sequence that has completed word replacement, the initials, finals and tones of standard Mandarin are converted into corresponding pronunciation units of the target dialect by the rule-based initial and final conversion module according to the dialect pronunciation rule library; For dialects with literary and colloquial pronunciations or polyphony, a disambiguation model based on a recurrent neural network is used to determine the correct pronunciation in the current context by combining contextual information. All the converted pronunciation units are arranged in order and output as the dialect phoneme sequence. The symbol set of the dialect phoneme sequence is independent of the standard phoneme symbol set.
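The first two stages of the dialect phoneme converter (vocabulary replacement, then rule-based conversion of initials, finals, and tones) can be sketched as table lookups. Every table entry below is invented purely for illustration and does not describe any real dialect; the context-dependent disambiguation model is omitted.

```python
# Hypothetical dialect-specific vocabulary mapping table (standard word -> dialect word).
VOCAB_MAP = {"shenme": "sha"}

# Hypothetical lexicon: word -> list of (initial, final, tone) in the standard system.
LEXICON = {"ni": [("n", "i", 3)], "sha": [("sh", "a", 4)]}

# Hypothetical pronunciation rule base for the target dialect.
INITIAL_RULES = {"sh": "s"}
FINAL_RULES = {}
TONE_RULES = {3: 1, 4: 2}


def to_dialect_phonemes(words):
    """Replace vocabulary first, then convert each pronunciation unit by rule."""
    units = []
    for word in words:
        word = VOCAB_MAP.get(word, word)  # step 1: dialect word replacement
        for initial, final, tone in LEXICON[word]:
            # step 2: rule-based conversion; unmapped units pass through unchanged
            units.append(
                f"{INITIAL_RULES.get(initial, initial)}"
                f"{FINAL_RULES.get(final, final)}"
                f"{TONE_RULES.get(tone, tone)}"
            )
    return units
```

In a fuller version, words with more than one candidate pronunciation would be routed through the disambiguation model before the units are emitted.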
[0013] As a further aspect of the present invention, the system further includes: The online adaptive module, during the interaction process, marks the input speech as unknown dialect speech to be analyzed when the confidence level of the dialect accent classifier in recognizing the dialect accent of the input speech is lower than a preset threshold. Extract the dialect accent features of the unknown dialect speech to be analyzed and store them in an online buffer pool; When the number of unknown dialect accent speech samples accumulated in the online buffer pool reaches a set scale, incremental clustering analysis is initiated. Unsupervised clustering is performed on all dialect accent features in the online buffer pool to form several feature clusters; For each feature cluster, its central feature vector is calculated, and the similarity of the central feature vector with the known dialect accent feature vector is compared. If there is a known dialect accent with a similarity exceeding the matching threshold, the feature cluster is classified into a subclass of the known dialect accent, and the corresponding category parameter of the dialect accent classifier is updated. If there are no known dialect accents with similarity exceeding the matching threshold, the feature cluster is registered as a new dialect accent category, and a dedicated speech synthesizer is initialized for each new category and added to the multi-dialect speech synthesis model.
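The buffering, clustering, and matching logic of the online adaptive module can be sketched as follows. The source does not name a clustering algorithm, so a simple leader-style clustering on cosine similarity is used here as a stand-in; the thresholds and the function names are assumptions.

```python
import numpy as np


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def leader_cluster(features, join_threshold=0.9):
    """Greedy unsupervised clustering: join the first cluster whose mean is similar enough."""
    clusters = []
    for f in features:
        for members in clusters:
            if cosine(f, np.mean(members, axis=0)) >= join_threshold:
                members.append(f)
                break
        else:
            clusters.append([f])
    # Return the central feature vector of each cluster.
    return [np.mean(members, axis=0) for members in clusters]


def assign_clusters(centroids, known, match_threshold=0.8):
    """Map each centroid to a known accent label, or to "new" if nothing matches."""
    decisions = []
    for c in centroids:
        best_label, best_sim = None, -1.0
        for label, vec in known.items():
            sim = cosine(c, vec)
            if sim > best_sim:
                best_label, best_sim = label, sim
        decisions.append(best_label if best_sim > match_threshold else "new")
    return decisions
```

A cluster mapped to a known label would trigger an update of that category's classifier parameters, while a cluster mapped to `"new"` would be registered as a new accent category.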
[0014] As a further aspect of the present invention, the step of initializing a dedicated speech synthesizer for each new category and adding it to the multi-dialect speech synthesis model includes: Original speech samples corresponding to dialect accent features belonging to the new category are selected from the online buffer pool; Using the improved adversarial autoencoder network, a new category of pure dialect accent feature vectors is extracted from the selected original speech samples; Using a standard Mandarin speech synthesizer as the base model, and with a small amount of parallel speech text corpus of new dialects, the acoustic feature predictor and vocoder of the base model are fine-tuned and trained. During the fine-tuning training process, the pure dialect accent feature vector is used as a conditional input to guide the model to learn the pronunciation characteristics of the new dialect; After training is complete, the fine-tuned model parameters are saved to form a dedicated speech synthesizer instance for the new dialect accent category; Add a new record to the model registry, associating the new dialect accent category label with the dedicated speech synthesizer instance.
[0015] Compared with the prior art, the advantages and positive effects of the present invention are as follows: The improved adversarial autoencoder network performs deep feature decoupling on standard acoustic feature vectors, independently separating text content-related information and dialect accent-related information contained in the acoustic features to form independent text content-invariant features and dialect accent features. When the text content-invariant features are input into the attention-based speech recognition encoder, accent-related features do not interfere with the generation process of the text phoneme sequence, maintaining a stable state and avoiding sequence generation bias caused by accent differences. When the dialect accent features are input separately into the dialect accent classifier, changes in text content do not affect the accent classification process. The dialect accent category label determination results closely match the accent attributes of the original speech sample. The accent classification process is unaffected by text content fluctuations, optimizing the stability of the classification results, and ensuring that the accent attribute determination accurately reflects the actual feature representation of the speech sample.
[0016] After fusing text phoneme sequences with dialect accent category labels, an intermediate semantic result with accent annotations is formed. This semantic result is then input into a spoken language natural language understanding model, which can perform intent and slot parsing based on the accent annotation information. The parsing process is able to fit the expressive features of the dialect context, thus improving the fit of the semantic parsing. Dialect accent category labels can be directly used as the selection basis for multi-dialect speech synthesis models, accurately locating the corresponding target dialect speech synthesizer. After the feedback information is processed by the target dialect speech synthesizer, the accent attributes of the synthesized speech are consistent with the dialect accent of the user's input speech. The accent type of the synthesized speech matches the input speech, and the accent connection between input and output during voice interaction is more in line with actual interaction needs. The adaptation of voice human-computer interaction can meet the usage needs of multi-dialect scenarios, and the voice feature matching degree during the interaction process is continuously optimized.
Attached Figure Description
[0017] Figure 1 is a timing diagram of the intelligent voice human-computer interaction system adapted to multiple dialects as described in this invention; Figure 2 is a flowchart of deep feature decoupling to separate text content-invariant features and dialect accent features; Figure 3 is a flowchart of generating intermediate semantic results with accent annotation and performing intent and slot parsing.
Detailed Implementation
[0018] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.
[0019] In the description of this invention, it should be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," and "outer," etc., indicating orientation or positional relationships, are based on the orientation or positional relationships shown in the accompanying drawings and are only for the convenience of describing the invention and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation, and therefore should not be construed as a limitation of the invention. Furthermore, in the description of this invention, "a plurality of" means two or more, unless otherwise explicitly specified.
[0020] Referring to Figure 1, this invention provides an intelligent voice human-computer interaction system adapted to multiple dialect accents, specifically including: The feature decoupling module collects raw speech samples containing various dialect accents and processes these samples to obtain standard acoustic feature vectors. An improved adversarial autoencoder network receives these standard acoustic feature vectors and performs deep feature decoupling. Its core objective is to separate text content-invariant features, which are directly related to semantic content and are unaffected by the speaker, from dialect accent features, which represent the speaker's unique pronunciation habits. The recognition and classification module processes these two types of features in parallel. Text content-invariant features are input into an attention-based speech recognition encoder, which is responsible for converting them into accurate text phoneme sequences. Simultaneously, dialect accent features are fed into a pre-trained dialect accent classifier, which outputs a specific dialect accent category label. The semantic understanding module then takes over the processing flow. It fuses the text phoneme sequence generated in the previous steps with the dialect accent category label to form an intermediate semantic result with accent annotation. This intermediate semantic result is further input into a spoken language natural language understanding model, which parses the user's interaction intent and related semantic slot information. The interactive response module, based on the intent and slot parsing results output by the semantic understanding module, calls the corresponding service interfaces inside or outside the system to obtain the required feedback information. At the same time, based on the previously obtained dialect accent category label, the module selects a target dialect speech synthesizer that matches the current user's accent from a resource library containing multiple dialect synthesis capabilities, namely the multi-dialect speech synthesis model. The obtained feedback information is formatted and input into this target dialect speech synthesizer. The synthesizer generates target synthesized speech that conforms to the user's habits in terms of both content and accent. The system finally outputs the target synthesized speech, completing one round of human-computer interaction.
[0021] In one embodiment of the present invention, the improved adversarial autoencoder network employs a composite structure to achieve deep feature decoupling. The improved adversarial autoencoder network includes a shared encoder, a text content encoder, a dialect accent encoder, a shared decoder, a text content discriminator, and a dialect accent discriminator. The shared encoder performs a nonlinear transformation on the input standard acoustic feature vector to generate a shared hidden layer vector, which is a low-dimensional representation containing mixed information of the original speech. The text content encoder further processes the shared hidden layer vector, extracting text content-invariant features through a series of convolutional and fully connected layers. These text content-invariant features are vectors independent of the specific speaker. The dialect accent encoder processes the shared hidden layer vector in parallel, extracting dialect accent features through similarly structured but independent network layers. These dialect accent features are vectors representing the speaker's dialect background. The shared decoder receives a concatenated feature vector as input. This vector is formed by concatenating the text-content-invariant features output by the text content encoder with a reference dialect accent feature. The reference dialect accent feature can be randomly sampled from a prior distribution or obtained from other speech samples. The shared decoder attempts to reconstruct an output feature vector that closely approximates the original standard acoustic feature vector from the concatenated feature vector using deconvolutional and fully connected layers. The text content discriminator determines whether the input text-content-invariant features originate from real speech samples, outputting a probability value. 
Similarly, the dialect accent discriminator determines whether the input dialect accent features originate from real speech samples, also outputting a probability value.
[0022] In some embodiments, the standard acoustic feature vector can be a sequence of Mel-frequency cepstral coefficients with a dimension of 80, the shared hidden layer vector has a dimension of 256, the text content-invariant features have a dimension of 128, the dialect accent features have a dimension of 64, and the reconstructed features output by the shared decoder have the same dimension of 80 as the input standard acoustic feature vector. During training, the improved adversarial autoencoder network updates its parameters by optimizing a multipart loss function, which includes a reconstruction loss and an adversarial loss. The reconstruction loss measures the difference between the shared decoder output features and the original standard acoustic feature vector, while the adversarial loss includes a text content discriminator loss and a dialect accent discriminator loss. A specific loss function formula can be expressed as:

$$L_{total} = L_{rec} + \lambda_1 L_{adv}^{text} + \lambda_2 L_{adv}^{accent}$$

where $L_{total}$ represents the total loss of the improved adversarial autoencoder network; $L_{rec}$ represents the reconstruction loss of the shared decoder; $L_{adv}^{text}$ represents the adversarial loss of the text content discriminator; $L_{adv}^{accent}$ represents the adversarial loss of the dialect accent discriminator; and $\lambda_1$ and $\lambda_2$ are the weighting coefficients that balance the two adversarial losses against the reconstruction loss. The reconstruction loss $L_{rec}$ can be calculated as the mean squared error (MSE) between the reconstructed features output by the shared decoder and the original standard acoustic feature vector. The adversarial loss $L_{adv}^{text}$ is a binary cross-entropy loss, used to train the text content discriminator to distinguish between real text content-invariant features and those generated by the text content encoder, while simultaneously training the text content encoder to generate text content-invariant features that are difficult for the text content discriminator to distinguish. Similarly, the adversarial loss $L_{adv}^{accent}$ is a binary cross-entropy loss, used to train the dialect accent discriminator to distinguish between real dialect accent features and dialect accent features generated by the dialect accent encoder, while training the dialect accent encoder to generate dialect accent features that are difficult for the dialect accent discriminator to distinguish.
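As a numerical illustration of the loss described above, the following sketch combines a mean squared error reconstruction term with two binary cross-entropy adversarial terms. The weighted-sum form, the parameter names `lambda_text` and `lambda_accent`, and their default values are assumptions made for illustration; the source only states that weighting coefficients balance the terms.

```python
import numpy as np


def mse(x, x_hat):
    """Reconstruction loss: mean squared error between input and reconstruction."""
    return float(np.mean((np.asarray(x) - np.asarray(x_hat)) ** 2))


def bce(prob, target, eps=1e-7):
    """Binary cross-entropy between discriminator probabilities and 0/1 targets."""
    prob = np.clip(np.asarray(prob, dtype=float), eps, 1.0 - eps)
    target = np.asarray(target, dtype=float)
    return float(-np.mean(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob)))


def total_loss(x, x_hat, p_text, y_text, p_accent, y_accent,
               lambda_text=1.0, lambda_accent=1.0):
    """Weighted sum of the reconstruction loss and the two adversarial losses."""
    return (mse(x, x_hat)
            + lambda_text * bce(p_text, y_text)
            + lambda_accent * bce(p_accent, y_accent))
```

In adversarial training the discriminators and the encoders are usually updated with opposite objectives on the cross-entropy terms, so this single scalar is best read as a bookkeeping aid rather than a complete training step.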
[0023] In specific implementations, the shared encoder can consist of three one-dimensional convolutional layers, each followed by a batch normalization layer and a ReLU activation function. The shared encoder maps the input standard acoustic feature vector from 80 dimensions to a 256-dimensional shared hidden layer vector. The text content encoder can consist of two fully connected layers, mapping the 256-dimensional shared hidden layer vector to 128-dimensional text content-invariant features. The dialect accent encoder has a similar structure to the text content encoder, mapping the 256-dimensional shared hidden layer vector to 64-dimensional dialect accent features. The shared decoder can consist of two fully connected layers and three deconvolutional layers, mapping the 192-dimensional concatenated feature vector back to 80-dimensional reconstructed features. The concatenated feature vector is composed of 128-dimensional text content-invariant features and 64-dimensional reference dialect accent features. The text content discriminator can consist of two fully connected layers, outputting a scalar representing the probability that the input feature is a true feature. The dialect accent discriminator is structurally similar to a text content discriminator. The dialect accent discriminator outputs a scalar representing the probability that the input feature is a true feature.
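The dimensions listed above can be checked with a small shape walk-through. In this sketch, single random dense layers stand in for the convolutional, deconvolutional, and fully connected stacks, so it illustrates only the tensor shapes (80, 256, 128, 64, 192, 80), not the actual architecture. The Gaussian sampling of the reference accent feature and all function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)


def dense(in_dim, out_dim):
    """A random ReLU layer used only as a stand-in for the real sub-network."""
    w = rng.standard_normal((in_dim, out_dim)) * 0.01
    return lambda x: np.maximum(x @ w, 0.0)


shared_encoder = dense(80, 256)   # acoustic features -> shared hidden vector
text_encoder = dense(256, 128)    # shared hidden -> text content-invariant features
accent_encoder = dense(256, 64)   # shared hidden -> dialect accent features
shared_decoder = dense(192, 80)   # concatenated 128 + 64 -> reconstructed features


def forward(frames, reference_accent):
    hidden = shared_encoder(frames)
    text_feat = text_encoder(hidden)
    accent_feat = accent_encoder(hidden)
    # The decoder sees the text features joined with a reference accent feature,
    # not with the accent feature extracted from the same input.
    concat = np.concatenate([text_feat, reference_accent], axis=-1)
    return hidden, text_feat, accent_feat, shared_decoder(concat)


frames = rng.standard_normal((10, 80))            # 10 frames of 80-dim features
reference_accent = rng.standard_normal((10, 64))  # sampled reference accent features
hidden, text_feat, accent_feat, recon = forward(frames, reference_accent)
```

The reconstructed output has the same shape as the input frames, which is what allows the reconstruction loss to be computed element-wise.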
[0024] In practice, the improved adversarial autoencoder network is trained using batch stochastic gradient descent. In each iteration, a batch of standard acoustic feature vectors and their corresponding real text content-invariant features and real dialect accent features are sampled from the dataset. The shared encoder processes the standard acoustic feature vectors to generate a shared hidden layer vector; the text content encoder processes the shared hidden layer vectors to generate text content-invariant features; and the dialect accent encoder processes the shared hidden layer vectors to generate dialect accent features. The shared decoder receives a vector concatenated with the text content-invariant features and randomly sampled reference dialect accent features, and outputs the reconstructed features. The text content discriminator receives the real text content-invariant features or the text content-invariant features generated by the text content encoder, and outputs the discrimination result. The dialect accent discriminator receives the real dialect accent features or the dialect accent features generated by the dialect accent encoder, and outputs the discrimination result. The loss function calculates the total loss. The parameters of the shared encoder, text content encoder, dialect accent encoder, shared decoder, text content discriminator, and dialect accent discriminator are updated through backpropagation.
[0025] Optionally, the reference dialect accent features can be randomly sampled from a uniform or Gaussian distribution and have the same dimensionality as the dialect accent features (64 dimensions). In the early stages of training, the randomly sampled reference dialect accent features may not match the text content-invariant features, so the shared decoder needs to learn how to reconstruct reasonable acoustic features from any combination of dialect accent features and given text content-invariant features. As training progresses, the shared decoder gradually learns to decouple text content and accent information. The text content-invariant features generated by the text content encoder gradually cease to contain accent information, and the dialect accent features generated by the dialect accent encoder gradually cease to contain text content information.
[0026] It can be understood that the training objective of the text content discriminator is to maximize its accuracy in identifying the source of text content-invariant features, while the training objective of the text content encoder is to generate text content-invariant features that maximize the error rate of the text content discriminator. Similarly, the training objective of the dialect accent discriminator is to maximize its accuracy in identifying the source of dialect accent features, while the training objective of the dialect accent encoder is to generate dialect accent features that maximize the error rate of the dialect accent discriminator. The training objective of the shared decoder is to minimize the reconstruction loss, making the reconstructed features as close as possible to the original standard acoustic feature vector. Through this adversarial training, the improved adversarial autoencoder network forces the text content-invariant features and dialect accent features to separate from each other.
[0027] In some embodiments, data comparison can be reflected in feature separation. For example, using speech samples with the same text content but different dialect accents, the cosine similarity between text content-invariant features extracted by the improved adversarial autoencoder network is higher than 0.9, while the cosine similarity between dialect accent features is lower than 0.1. For speech samples with the same dialect accent but different text content, the cosine similarity between extracted dialect accent features is higher than 0.9, while the cosine similarity between text content-invariant features is lower than 0.1. This comparison shows that the improved adversarial autoencoder network effectively decouples text content and dialect accent information.
[0028] In one embodiment of the present invention, when the feature decoupling module starts working, it needs to perform a series of preprocessing steps on the acquired raw speech samples to obtain standard acoustic feature vectors, see Figure 2. The original speech samples are first pre-emphasized to enhance high-frequency components. Then, the pre-emphasized speech signal is framed and windowed to obtain a series of short-term stationary, pre-processed speech frames. For each pre-processed speech frame, its Mel-frequency cepstral coefficients (MFCCs) are calculated. This calculation involves performing a Fast Fourier Transform (FFT) on each frame to obtain its amplitude spectrum. The obtained amplitude spectrum is then passed through a set of Mel-scale triangular filters conforming to human hearing characteristics, and the logarithmic energy of each filter output is calculated. These logarithmic energy sequences are then subjected to a Discrete Cosine Transform (DCT), retaining the first few coefficients that best characterize the vocal tract shape, thus forming the MFCCs of a single speech frame. The MFCCs of all speech frames are concatenated in chronological order to form the standard acoustic feature vector of the speech segment. After the standard acoustic feature vector is obtained, it is normalized to eliminate the influence of differing scales and input into the shared encoder of the improved adversarial autoencoder network. The shared encoder encodes the vector and outputs a shared hidden layer vector. This shared hidden layer vector serves as the common input for the subsequent text content encoder and dialect accent encoder, which separate the text content-invariant features and dialect accent features.
[0029] In practical implementation, the feature decoupling module's processing flow for the original speech samples includes three main stages: signal preprocessing, feature extraction, and feature input. The original speech samples are speech waveform data containing various dialect accents, stored in pulse code modulation format. In the preprocessing stage, the system applies a pre-emphasis filter to the original speech samples to enhance high-frequency components. The pre-emphasis filter is implemented as a first-order high-pass filter whose transfer function is $H(z) = 1 - \alpha z^{-1}$, where the filter coefficient $\alpha$ takes a value in the range 0.95 to 0.97. The pre-emphasized speech signal is then framed, dividing the continuous speech signal into a series of short frames with a frame length of 25 milliseconds and a frame shift of 10 milliseconds. A Hamming window function is applied to each frame of the speech signal to reduce spectral leakage. The expression for the Hamming window function is $w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right)$, $0 \le n \le N-1$, where $n$ is the sample index within the window function and $N$ is the window length, which corresponds to the product of the frame length and the sampling rate. After the framing and windowing operations, the pre-processed speech frame sequence is obtained.
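As a non-limiting illustration, the pre-emphasis, framing, and windowing steps described above can be sketched in Python with NumPy. The function names `pre_emphasize` and `frame_and_window` are hypothetical, the defaults (coefficient 0.97, 16 kHz sampling rate) are taken from the optional values mentioned in this description, and discarding trailing samples that do not fill a whole frame is an assumption of this sketch rather than a requirement of the embodiment.

```python
import numpy as np


def pre_emphasize(signal, alpha=0.97):
    """First-order high-pass filter: y[t] = x[t] - alpha * x[t-1]."""
    signal = np.asarray(signal, dtype=np.float64)
    # The first sample has no predecessor and is passed through unchanged.
    return np.concatenate(([signal[0]], signal[1:] - alpha * signal[:-1]))


def frame_and_window(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a signal into overlapping frames and apply a Hamming window."""
    signal = np.asarray(signal, dtype=np.float64)
    frame_len = int(sample_rate * frame_ms / 1000)    # 400 samples at 16 kHz
    frame_shift = int(sample_rate * shift_ms / 1000)  # 160 samples at 16 kHz
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    # Assumption: trailing samples that do not fill a frame are dropped.
    num_frames = 1 + (len(signal) - frame_len) // frame_shift
    # Row i of idx holds the sample indices that belong to frame i.
    idx = (np.arange(frame_len)[None, :]
           + frame_shift * np.arange(num_frames)[:, None])
    frames = signal[idx]
    # np.hamming(N) evaluates 0.54 - 0.46 * cos(2 * pi * n / (N - 1)).
    return frames * np.hamming(frame_len)
```

With these defaults, one second of 16 kHz audio yields a matrix of 98 frames by 400 samples, which is the input expected by the subsequent feature extraction stage.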
[0030] In some embodiments, Mel-frequency cepstral coefficient features are calculated on the pre-processed speech frame sequence. The Mel-frequency cepstral coefficient feature extraction process is performed independently for each pre-processed speech frame. The first step is to perform a Fast Fourier Transform on each pre-processed speech frame to convert the time-domain speech signal into a frequency-domain representation, obtaining the amplitude spectrum of each speech frame. The second step is to pass the amplitude spectrum through a Mel-scale triangular filter bank, which contains multiple triangular bandpass filters uniformly distributed on the Mel frequency scale. The conversion relationship between the Mel frequency $f_{mel}$ and the linear frequency $f$ (in Hz) is $f_{mel} = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$. The logarithmic energy of the output of each triangular filter is then calculated: the squared amplitude spectrum is weighted by the frequency response of each triangular filter and summed, and the logarithm of the result is taken. The third step is to perform a discrete cosine transform (DCT) on the logarithmic energy sequence. The formula for the DCT is $c(n) = \sum_{m=1}^{M} S(m) \cos\left(\frac{\pi n (m - 0.5)}{M}\right)$, $n = 1, 2, \ldots, L$, where $n$ is the index of the Mel-frequency cepstral coefficient being calculated, $L$ is the total number of Mel-frequency cepstral coefficients, $M$ is the number of filters in the Mel filter bank, and $S(m)$ is the logarithmic energy of the $m$-th Mel filter output. The first 13 Mel-frequency cepstral coefficients are retained, forming the Mel-frequency cepstral coefficient feature vector of a single speech frame. The Mel-frequency cepstral coefficient features of all speech frames are concatenated in chronological order to form a two-dimensional matrix. This two-dimensional matrix is the standard acoustic feature vector. The time dimension of the standard acoustic feature vector is equal to the total number of speech frames, and the feature dimension is equal to the number of Mel-frequency cepstral coefficients retained, which is 13.
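As a non-limiting illustration, the three extraction steps (FFT, Mel filter bank with logarithm, DCT) can be sketched as follows. The sketch assumes the commonly used Mel conversion 2595·log10(1 + f/700) and a DCT of the form sum over m of S(m)·cos(πn(m − 0.5)/M); the FFT size of 512, the small constant added before the logarithm, and all function names are assumptions introduced here for illustration only.

```python
import numpy as np


def hz_to_mel(f_hz):
    """Linear frequency (Hz) to Mel frequency."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=np.float64) / 700.0)


def mel_to_hz(mel):
    """Mel frequency back to linear frequency (Hz)."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=np.float64) / 2595.0) - 1.0)


def mel_filterbank(num_filters=40, n_fft=512, sample_rate=16000):
    """Triangular filters whose edges are uniformly spaced on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2),
                             num_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((num_filters, n_fft // 2 + 1))
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):    # rising slope
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):   # falling slope, peak of 1 at center
            fbank[m - 1, k] = (right - k) / (right - center)
    return fbank


def mfcc(frames, num_ceps=13, num_filters=40, n_fft=512, sample_rate=16000):
    """Return a (num_frames, num_ceps) matrix of cepstral coefficients."""
    # Step 1: amplitude spectrum of each windowed frame (zero-padded to n_fft).
    magnitude = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    # Step 2: weight the squared spectrum by each filter, sum, take the log.
    energies = (magnitude ** 2) @ mel_filterbank(num_filters, n_fft, sample_rate).T
    log_energies = np.log(energies + 1e-10)  # constant avoids log(0)
    # Step 3: DCT over the filter index, keeping coefficients n = 1..num_ceps.
    m = np.arange(1, num_filters + 1)
    n = np.arange(1, num_ceps + 1)
    basis = np.cos(np.pi * n[:, None] * (m[None, :] - 0.5) / num_filters)
    return log_energies @ basis.T
```

The returned matrix has one row per speech frame and 13 columns by default, matching the two-dimensional layout of the standard acoustic feature vector described above.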
[0031] In practice, after obtaining the standard acoustic feature vectors, they need to be normalized using global mean and variance normalization methods. The mean and standard deviation vectors of all standard acoustic feature vectors in the entire training dataset are calculated, with the dimensions of the mean and standard deviation vectors matching the number of Mel-frequency cepstral coefficients. For any given standard acoustic feature vector, the mean vector is subtracted, and then the result is divided by the standard deviation vector to obtain the normalized standard acoustic feature vector. This normalized standard acoustic feature vector is then input into the shared encoder of the improved adversarial autoencoder network. The shared encoder is the first component of the improved adversarial autoencoder network. It receives the normalized standard acoustic feature vectors and performs encoding operations, outputting a shared hidden layer vector. This shared hidden layer vector serves as the common input for the subsequent text content encoder and dialect accent encoder.
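As a non-limiting illustration, global mean and variance normalization can be sketched as follows. The function names are hypothetical, and the small floor applied to the standard deviation is an assumption added here to avoid division by zero.

```python
import numpy as np


def fit_global_stats(feature_matrices):
    """Per-coefficient mean and standard deviation over all training frames.

    feature_matrices: list of arrays shaped (num_frames, num_coefficients).
    """
    stacked = np.concatenate(feature_matrices, axis=0)
    mean = stacked.mean(axis=0)
    std = stacked.std(axis=0)
    # Assumption: floor the standard deviation so constant columns are safe.
    return mean, np.maximum(std, 1e-8)


def normalize(features, mean, std):
    """Subtract the global mean and divide by the global standard deviation."""
    return (np.asarray(features, dtype=np.float64) - mean) / std
```

The statistics would be computed once on the training dataset, saved, and reused unchanged at inference time, so that training and inference inputs share the same scale.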
[0032] Optionally, the pre-emphasis filter coefficient can be set to 0.97. The length of the Hamming window is determined by the sampling rate: at a sampling rate of 16 kHz, a 25-millisecond frame length corresponds to 400 sampling points. The number of filters in the Mel-scale triangular filter bank is usually set to 40, and the number of Mel-frequency cepstral coefficients retained is set to 13; their first-order and second-order difference dynamic features can additionally be concatenated to form a 39-dimensional feature vector. The mean vector and standard deviation vector used for normalization are calculated and saved once from the training dataset before model training, and are directly loaded and used during the model inference stage. It can be understood that each frame in the pre-processed speech frame sequence is a short-time stationary signal segment, and the Mel-frequency cepstral coefficient features simulate the nonlinear frequency perception of the human ear. Normalizing the standard acoustic feature vector helps to improve the training stability and convergence speed of the improved adversarial autoencoder network. The shared hidden layer vector output by the shared encoder is a condensed representation containing the mixed information of the original speech, providing a basis for feature decoupling.
[0033] In one embodiment of the present invention, after the recognition and classification module produces a text phoneme sequence and a dialect accent category label, the semantic understanding module performs a fusion operation, see Figure 3. This module uses a phoneme-to-character conversion model to transform the phoneme sequence into a readable original text sequence. Simultaneously, it converts the dialect accent category label into a short dialect identifier according to a preset mapping rule, and inserts this dialect identifier as a prefix at the beginning of the original text sequence, thus generating an intermediate semantic result with accent annotation. This intermediate semantic result is input into a colloquial natural language understanding model. This model has a parameter adaptation layer designed for each dialect identifier. When the model receives the intermediate semantic result with accent annotation, it parses the initial dialect identifier and activates the corresponding parameter adaptation layer based on that identifier. The parameters of this adaptation layer are specifically optimized for the grammatical habits and common expressions of a particular dialect. By deeply encoding the original text sequence through the activated parameter adaptation layer, a deep semantic vector that better reflects the dialect's expression habits can be extracted. This deep semantic vector is simultaneously input into the model's intent classification head and slot filling head. The intent classification head outputs the interaction intent category corresponding to the user's current speech, while the slot filling head annotates key information fragments related to the identified interaction intent on the original text sequence. These key information fragments are the required slot values.
Based on the parsed interaction intent category, the interaction response module determines the type of backend service to be accessed and constructs a query request conforming to the service interface specification using the slot values. After sending the query request to the corresponding service interface, it receives the returned structured data as the original feedback information. The original feedback information is then filled according to a preset, grammatically correct text template to generate written feedback text. Meanwhile, based on the dialect accent category label, the module searches in a pre-set model registry to locate the pre-trained dialect speech synthesis model instance associated with this label, and loads it as the target dialect speech synthesizer. The multi-dialect speech synthesis model itself contains multiple independently trained dialect speech synthesizer instances, each instance specifically trained for a particular dialect accent.
[0034] In a specific implementation, the semantic understanding module receives a text phoneme sequence and a dialect accent category label from the recognition and classification module. The text phoneme sequence is a symbol sequence generated by a speech recognition encoder based on an attention mechanism, and the dialect accent category label is a discrete category identifier output by a dialect accent classifier. The semantic understanding module first inputs the text phoneme sequence into a pre-trained phoneme-to-character conversion model. The phoneme-to-character conversion model is based on a sequence-to-sequence neural network architecture. The phoneme-to-character conversion model maps the text phoneme sequence to a corresponding original text sequence, and the original text sequence is a readable string composed of Chinese characters or words. At the same time, the semantic understanding module internally maintains a dialect identifier mapping table, which maps the dialect accent category label to a short and unique dialect identifier. For example, it maps the category label "Cantonese - Guangfu dialect area" to the dialect identifier "[YUE]". The semantic understanding module takes the obtained dialect identifier as a prefix and inserts it at the beginning of the original text sequence to generate an intermediate semantic result with accent annotation. The format of the intermediate semantic result with accent annotation is "dialect identifier + original text sequence".
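As a non-limiting illustration, the dialect identifier mapping table and the prefix insertion can be sketched as follows. Only the "Cantonese - Guangfu dialect area" to "[YUE]" entry comes from this description; the fallback identifier "[UNK]", the sample text, and the function names are hypothetical.

```python
# Mapping table from dialect accent category labels to dialect identifiers.
DIALECT_IDENTIFIERS = {
    "Cantonese - Guangfu dialect area": "[YUE]",  # example given in the text
}


def annotate_with_accent(text, accent_label, table=DIALECT_IDENTIFIERS,
                         fallback="[UNK]"):
    """Build 'dialect identifier + original text sequence'."""
    # Assumption: labels missing from the table receive a fallback identifier.
    return table.get(accent_label, fallback) + text


def split_annotation(annotated):
    """Recover (identifier, text) from an accent-annotated string."""
    if annotated.startswith("["):
        end = annotated.find("]")
        if end != -1:
            return annotated[:end + 1], annotated[end + 1:]
    return None, annotated
```

For example, `annotate_with_accent("今日天气点样", "Cantonese - Guangfu dialect area")` produces a string beginning with `[YUE]`, and `split_annotation` separates the identifier again on the receiving side.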
[0035] In some embodiments, the colloquial natural language understanding model is a neural network model with multi-task learning ability. The colloquial natural language understanding model internally contains a shared text encoder, multiple parameter adaptation layers for different dialect identifiers, an intent classification head, and a slot filling head. When the colloquial natural language understanding model receives the intermediate semantic result with accent annotation, the model parses the dialect identifier at the beginning of the string and activates the corresponding parameter adaptation layer in the model according to the parsed dialect identifier. The parameter adaptation layer is a set of learnable weight matrices, and each dialect identifier corresponds to a set of independent parameter adaptation layer weights. The shared text encoder first performs a preliminary encoding on the original text sequence to generate a general context vector sequence. Subsequently, the activated parameter adaptation layer processes this general context vector sequence. The calculation of the parameter adaptation layer can be expressed as $h_d = W_2^{(d)} \, \mathrm{ReLU}\left(W_1^{(d)} h + b_1^{(d)}\right) + b_2^{(d)}$, where $h$ represents the general context vector output by the shared text encoder, $W_1^{(d)}$ and $b_1^{(d)}$ represent the weight matrix and bias vector of the first layer of the parameter adaptation layer corresponding to the dialect identifier $d$, $W_2^{(d)}$ and $b_2^{(d)}$ represent the weight matrix and bias vector of the second layer of the parameter adaptation layer corresponding to the dialect identifier $d$, $\mathrm{ReLU}$ represents the rectified linear unit activation function, and $h_d$ represents the deep semantic vector, obtained after the parameter adaptation layer transformation, containing specific dialect grammar and expression habits. The deep semantic vector is simultaneously input into the intent classification head and the slot filling head.
The intent classification head is a fully connected layer connected to a softmax function, outputting a probability distribution representing the probability that a user's interaction intent belongs to each preset intent category. The category with the highest probability is determined as the interaction intent category. The slot filling head typically employs a bidirectional long short-term memory network connected to a conditional random field layer. Based on the deep semantic vector, the slot filling head labels each character or word in the original text sequence, marking key information fragments related to the identified interaction intent. These marked key information fragments are the slot values.
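As a non-limiting illustration, a two-layer parameter adaptation layer selected by exact dialect identifier match can be sketched with NumPy. The class name, the bottleneck width, the random initialization, and the identifier "[WUU]" in the usage note are hypothetical; in particular, mapping back to the same dimension as the input context vector is an assumption of this sketch.

```python
import numpy as np


class DialectAdapters:
    """One independent two-layer adapter per dialect identifier."""

    def __init__(self, dialect_ids, hidden_dim, bottleneck_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.params = {
            d: {
                "W1": rng.normal(0.0, 0.02, (bottleneck_dim, hidden_dim)),
                "b1": np.zeros(bottleneck_dim),
                "W2": rng.normal(0.0, 0.02, (hidden_dim, bottleneck_dim)),
                "b2": np.zeros(hidden_dim),
            }
            for d in dialect_ids
        }

    def __call__(self, context, dialect_id):
        """Apply W2 * ReLU(W1 * h + b1) + b2 to every row h of `context`.

        context: array shaped (sequence_length, hidden_dim).
        """
        # Exact matching: an unregistered identifier raises KeyError.
        p = self.params[dialect_id]
        hidden = np.maximum(0.0, context @ p["W1"].T + p["b1"])
        return hidden @ p["W2"].T + p["b2"]
```

A call such as `DialectAdapters(["[YUE]", "[WUU]"], hidden_dim=8, bottleneck_dim=4)(context, "[YUE]")` touches only the weights registered for that identifier, so adding a dialect adds one small weight set rather than a full model.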
[0036] In practice, the interaction response module executes subsequent operations based on the interaction intent category and slot values output by the semantic understanding module. Internally, the interaction response module maintains a service routing table that defines the mapping relationship between different interaction intent categories and backend service types. The interaction response module queries the service routing table based on the parsed interaction intent category to determine the type of backend service to be accessed, which could be weather query, flight booking, music playback, and so on. Based on the determined backend service type and its interface specifications, the interaction response module constructs a query request that conforms to the specifications using the slot values. For example, for a weather query intent, the slot values might contain "city" and "date," and the constructed query request would be a Structured Query Language (SQL) statement or a set of API call parameters. The interaction response module sends the constructed query request over the network to the corresponding service interface and receives the structured data returned by the service interface, which serves as the original feedback information. Based on the current interaction intent category, the interaction response module selects a matching text template from a pre-built text template library; the text template is a text string with placeholders. The interaction response module fills the field values from the original feedback information into the corresponding placeholders in the text template, generating grammatically correct written feedback text. Optionally, a dialect identifier mapping table is provided in Table 1.
Table 1: Mapping Table of Dialect Accent Category Labels and Dialect Identifiers
The activation rule of the parameter adaptation layer can be exact matching; that is, only the parameter adaptation layer whose dialect identifier exactly matches the parsed identifier is activated and has its weights loaded. The shared text encoder can adopt a Transformer-based architecture, and the output categories of the intent classification head can include "QueryWeather", "PlayMusic", "SetReminder", etc. The tag system adopted by the slot filling head can be in the BIO (Begin, Inside, Outside) format. The service routing table can be stored in an in-memory database in the form of key-value pairs. A template in the text template library can be "The weather in {city} on {date} is {weather}, and the temperature is {temp} degrees." It can be understood that inserting the dialect identifier as a prefix into the original text sequence provides explicit prior information about the dialect context for subsequent natural language understanding processes. Through the parameter adaptation layer mechanism, the colloquial natural language understanding model can adapt different semantic understanding rules for different dialects without significantly increasing the total number of model parameters. The interaction response module converts the unstructured semantic parsing results into structured service requests and natural language feedback through the service routing table and text templates. The written feedback text is the bridge connecting semantic understanding and speech synthesis.
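As a non-limiting illustration, the service routing table and the template filling step can be sketched as follows. The service name "weather_service", the request layout, and the function names are hypothetical, and the template is a cleaned-up version of the example template given above; no real backend is called.

```python
# Key-value routing from interaction intent category to backend service type.
SERVICE_ROUTES = {
    "QueryWeather": "weather_service",  # hypothetical service name
}

# Text templates with placeholders, keyed by interaction intent category.
TEMPLATES = {
    "QueryWeather": ("The weather in {city} on {date} is {weather}, "
                     "and the temperature is {temp} degrees."),
}


def build_request(intent, slots):
    """Turn an intent and its slot values into a structured query request."""
    service = SERVICE_ROUTES[intent]  # KeyError if the intent is not routed
    return {"service": service, "params": dict(slots)}


def render_feedback(intent, feedback):
    """Fill the fields of the structured feedback into the matching template."""
    return TEMPLATES[intent].format(**feedback)
```

For a weather query with the slots `{"city": "Guangzhou", "date": "2026-05-01"}`, `build_request` yields the API call parameters, and `render_feedback` turns the returned structured data into the written feedback text passed on to speech synthesis.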
[0037] In an embodiment of the present invention, the target dialect speech synthesizer receives the written feedback text generated by the interaction response module. The written feedback text is a string that conforms to grammatical norms. The target dialect speech synthesizer first performs text regularization processing on the written feedback text. The text regularization processing converts numbers, symbols, abbreviations, etc. in the text into corresponding colloquial pronunciation words. For example, it converts "100" to "one hundred" and "km / h" to "kilometers per hour". Subsequently, word segmentation processing is performed on the regularized text. The word segmentation processing divides the continuous text string into a sequence of independent words or sub-word units.
[0038] In some embodiments, the processed text sequence is input to a dialect phoneme converter within the target dialect speech synthesizer. The dialect phoneme converter is one of the core components of the target dialect speech synthesizer, responsible for converting the standard phoneme sequence into a dialect phoneme sequence that conforms to the pronunciation habits of the target dialect. The dialect phoneme converter stores a dialect pronunciation rule base and a dialect-specific vocabulary mapping table. The dialect pronunciation rule base defines the mapping relationship from standard Mandarin initials, finals, and tones to target dialect pronunciation units in the form of rule sets or decision trees. The dialect-specific vocabulary mapping table stores the correspondence between standard vocabulary and dialect vocabulary in key-value pairs. For the input processed text sequence, the dialect phoneme converter first performs vocabulary substitution. According to the dialect-specific vocabulary mapping table, it scans each word in the processed text sequence. If the word exists in the key of the mapping table, it is replaced with the corresponding dialect vocabulary expression value. The text sequence after word replacement enters the rule-based initial-final conversion module. This module, based on a dialect pronunciation rule library, converts the Mandarin pinyin components (initials, finals, and tones) of each character in the text sequence into corresponding pronunciation unit descriptions for the target dialect, character by character or word by word. For literary and colloquial pronunciations or polyphonic phenomena in the target dialect, the rule-based initial-final conversion module combines contextual information and calls a recurrent neural network-based disambiguation model for judgment. 
This model uses the pronunciation units of adjacent characters as contextual features to output the correct pronunciation unit selection for the current character in a specific context. All converted pronunciation units are arranged chronologically to form the final dialect phoneme sequence. The symbol set used in this sequence is independently designed and differs from the standard phoneme symbol set, specifically designed to represent the phonological system of the target dialect. For details on the dialect-specific vocabulary mapping table, please refer to Table 2.
Table 2: Mapping Segments of Standard Mandarin Vocabulary and Dialect Vocabulary
After the dialect phoneme sequence is obtained, it is input into a duration predictor, which is a neural network-based regression model. The duration predictor predicts the duration of each dialect phoneme and its prosodic boundaries in the sequence, with the predicted duration measured in frames. Based on the predicted duration, the original dialect phoneme sequence is processed using an extended alignment algorithm. This algorithm repeats each dialect phoneme symbol according to its predicted duration to generate an aligned phoneme sequence whose length corresponds to the number of target speech frames. This aligned phoneme sequence contains the phoneme identity information for each speech frame.
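As a non-limiting illustration, repeating each dialect phoneme according to its predicted duration in frames can be sketched as follows. The function name and the phoneme symbols used in the usage note are hypothetical.

```python
def expand_by_duration(phonemes, durations):
    """Repeat each phoneme symbol `duration` times, one entry per frame."""
    if len(phonemes) != len(durations):
        raise ValueError("each phoneme needs exactly one duration")
    aligned = []
    for symbol, frames in zip(phonemes, durations):
        if frames < 0:
            raise ValueError("durations must be non-negative frame counts")
        aligned.extend([symbol] * frames)
    return aligned
```

For instance, the symbols `["n", "ei"]` with durations `[2, 3]` expand to `["n", "n", "ei", "ei", "ei"]`, so the length of the aligned sequence equals the total number of predicted frames.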
[0039] Optionally, text regularization can include conversion rules for numbers, dates, currencies, abbreviations, etc. Word segmentation can employ a dictionary-based forward maximum matching algorithm. The disambiguation model based on recurrent neural networks can use a gated recurrent unit structure. The duration predictor can employ a convolutional neural network or a Transformer architecture. The extended alignment algorithm can be a monotonic alignment search algorithm. It can be understood that text regularization and word segmentation ensure the accuracy of text-to-phoneme conversion. The dialect phoneme converter converts standard text into a native dialect pronunciation representation through three steps: word substitution, rule transformation, and context disambiguation. The dialect phoneme sequence is the foundation for subsequent acoustic modeling, and the aligned phoneme sequence provides accurate temporal alignment information for acoustic feature prediction.
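As a non-limiting illustration, dictionary-based forward maximum matching for word segmentation can be sketched as follows. The function name and the sample dictionary are hypothetical, and emitting an unmatched character as a single-character token is an assumption of this sketch.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Greedy left-to-right segmentation that prefers the longest match."""
    if max_len is None:
        # Longest dictionary entry bounds the candidate length (at least 1).
        max_len = max(1, max((len(word) for word in dictionary), default=1))
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Assumption: a single character is always accepted as a token.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens
```

With the dictionary `{"今天", "天气", "怎么样"}`, the string "今天天气怎么样" is split into "今天", "天气", and "怎么样", which are then handed to the dialect phoneme converter word by word.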
[0040] In some embodiments, the aligned phoneme sequence is input to an acoustic feature predictor, which is a sequence-to-sequence neural network model. The acoustic feature predictor receives the aligned phoneme sequence and predicts the corresponding acoustic feature sequence frame by frame. The acoustic feature sequence typically contains Mel-spectral features. The forward computation process of the acoustic feature predictor can be abstractly represented as $\hat{Y} = f_{\theta}(P, e_d)$, where $P$ represents the input aligned phoneme sequence, $f_{\theta}$ is the acoustic feature predictor neural network with parameters $\theta$, $e_d$ is an optional embedding vector related to the target dialect category, and $\hat{Y}$ represents the predicted acoustic feature sequence. The predicted acoustic feature sequence is input to a dialect vocoder that matches the target dialect. The dialect vocoder is a neural network or digital signal processing module that restores the Mel-spectrum feature sequence to a time-domain waveform, generating the final target synthesized speech. Different dialect speech synthesizer instances have their own independent acoustic feature predictor and dialect vocoder parameters, which are trained using speech data from the corresponding dialect.
[0041] In one embodiment of the present invention, the online adaptive module continuously monitors the output of the dialect accent classifier during the interaction process. When the confidence level of the dialect accent classifier in recognizing the dialect accent of a certain input speech segment is lower than a preset threshold, the online adaptive module marks this input speech segment as unknown dialect accent speech to be analyzed. The online adaptive module calls the improved adversarial autoencoder network in the feature decoupling module to extract the dialect accent features of the unknown dialect accent speech to be analyzed. The extracted dialect accent features are a feature vector. The online adaptive module stores the extracted dialect accent features and their corresponding original speech sample pointers into an online buffer pool, which is a circular queue or cache area located in memory or high-speed storage.
[0042] In some embodiments, the online adaptive module checks the number of unknown dialect accent speech samples accumulated in the online buffer pool in real time. When the number of unknown dialect accent speech samples accumulated in the online buffer pool reaches a set size, the online adaptive module initiates an incremental clustering analysis process. The set size is a predefined integer, such as 1000 samples. The incremental clustering analysis process first performs unsupervised clustering on all dialect accent features stored in the online buffer pool. Unsupervised clustering can use K-Means or DBSCAN algorithms. Unsupervised clustering aggregates similar dialect accent features together to form several feature clusters. For each feature cluster formed after unsupervised clustering, the online adaptive module calculates the central feature vector of the feature cluster. The central feature vector can be obtained by calculating the arithmetic mean of all dialect accent features within the feature cluster. The online adaptive module compares the calculated central feature vector with all known dialect accent feature vectors stored in the dialect accent classifier. The similarity comparison can calculate the cosine similarity. The result of the similarity comparison is used to determine the relationship between the new feature cluster and existing dialect categories.
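As a non-limiting illustration, buffering low-confidence samples, triggering the analysis once a set size is reached, and computing a cluster's central feature vector as an arithmetic mean can be sketched as follows. The class and function names are hypothetical, the default threshold and size follow example values from this description, and the clustering step itself is left out of the sketch.

```python
from collections import deque


class OnlineBufferPool:
    """Holds dialect accent features of speech the classifier is unsure about."""

    def __init__(self, confidence_threshold=0.6, trigger_size=1000):
        self.confidence_threshold = confidence_threshold
        self.trigger_size = trigger_size
        self.items = deque()

    def observe(self, confidence, accent_feature, sample_ref):
        """Buffer a low-confidence sample; return True when clustering is due."""
        if confidence < self.confidence_threshold:
            # Store the feature together with a pointer to the raw sample.
            self.items.append((list(accent_feature), sample_ref))
        return len(self.items) >= self.trigger_size


def cluster_center(features):
    """Arithmetic mean of the feature vectors assigned to one cluster."""
    if not features:
        raise ValueError("a cluster needs at least one feature vector")
    count = len(features)
    return [sum(column) / count for column in zip(*features)]
```

Samples recognized with confidence at or above the threshold are ignored by the pool, so only speech with a potentially unknown accent contributes to the incremental clustering analysis.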
[0043] In practice, the similarity comparison uses cosine similarity, calculated as $s_i = \frac{c \cdot v_i}{\lVert c \rVert \, \lVert v_i \rVert}$, where $c$ represents the central feature vector calculated from the feature cluster to be compared, $v_i$ represents the feature vector of the $i$-th known dialect accent category stored in the dialect accent classifier, $\cdot$ represents the vector dot product operation, $\lVert \cdot \rVert$ represents the Euclidean norm of a vector, and $s_i$ represents the calculated cosine similarity value. The online adaptive module presets a matching threshold, for example, 0.85. If the cosine similarity value for any known dialect accent exceeds the matching threshold, the online adaptive module classifies the current feature cluster as a subclass of the known dialect accent with the highest cosine similarity value and updates the parameters of the corresponding category in the dialect accent classifier. This update operation can fine-tune the classification boundary or feature prototype of the category. If none of the calculated cosine similarity values exceeds the matching threshold, the online adaptive module registers the current feature cluster as a new dialect accent category and assigns a unique dialect accent category label to this new category.
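As a non-limiting illustration, the decision between "subclass of a known dialect accent" and "new dialect accent category" can be sketched as follows. The function names, the returned tuple layout, and the labels in the usage note are hypothetical; the default threshold of 0.85 is the example value given above.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar
    return dot / (norm_a * norm_b)


def match_cluster(center, known_vectors, threshold=0.85):
    """Compare a cluster center with every known dialect accent vector.

    Returns ("subclass", label, similarity) when the best similarity exceeds
    the threshold, otherwise ("new", None, best_similarity).
    """
    best_label, best_sim = None, -1.0
    for label, vector in known_vectors.items():
        sim = cosine_similarity(center, vector)
        if sim > best_sim:
            best_label, best_sim = label, sim
    if best_label is not None and best_sim > threshold:
        return ("subclass", best_label, best_sim)
    return ("new", None, best_sim)
```

With known vectors `{"yue": [1.0, 0.0], "wu": [0.0, 1.0]}`, a center of `[0.9, 0.1]` is assigned to "yue" as a subclass, whereas a center of `[1.0, 1.0]` stays below the threshold for both and would be registered as a new category.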
[0044] In some embodiments, a dedicated speech synthesizer is initialized for each newly registered dialect accent category. The initialization process begins by filtering raw speech samples from an online buffer pool. The online adaptive module filters raw speech samples corresponding to the dialect accent features belonging to the new category from the online buffer pool based on feature cluster affiliation information. An improved adversarial autoencoder network is used to extract the clean dialect accent feature vector of the new category from the filtered raw speech samples. The system's pre-built standard Mandarin speech synthesizer is used as the base model, which is a pre-trained speech synthesis neural network. The online adaptive module collects a small amount of parallel speech-text corpus of the new dialect category, containing audio and corresponding text transcriptions. Using the collected parallel speech-text corpus, the acoustic feature predictor and vocoder of the base model are fine-tuned. During fine-tuning, the extracted clean dialect accent feature vector is used as a conditional input. The conditional input guides the acoustic feature predictor and vocoder to learn the pronunciation characteristics of the new dialect through feature concatenation or adaptive layer normalization. After training, the fine-tuned model parameters are saved, creating a dedicated speech synthesizer instance for the new dialect accent category. The online adaptive module adds a new record to the model registry, associating the new dialect accent category label with the storage path of the newly created dedicated speech synthesizer instance.
[0045] Optionally, the preset threshold can be set to 0.6. The size of the online buffer pool can be set to 500 samples. Unsupervised clustering can use density-based spatial clustering of applications with noise (DBSCAN), which does not require pre-specifying the number of clusters. The matching threshold can be set to 0.8. The parallel speech-text corpus used for fine-tuning can be only tens of minutes long, and a small learning rate is used to prevent catastrophic forgetting. It can be understood that when an unknown dialect is identified as a new category, the system needs to build speech synthesis capabilities for it. Fine-tuning from a mature Mandarin synthesizer is a data-efficient way to quickly build a new dialect synthesizer. Using the clean dialect accent feature vector as conditional input can effectively guide the model to learn the acoustic characteristics of the new dialect. Adding a record to the model registry allows the new dialect speech synthesizer to be retrieved and called normally by the interaction response module.
[0046] In practical implementation, data comparison can be reflected in the differences between clustering and classification. For example, for a speech set known to belong to "Cantonese-Guangfu dialect" but carrying a slight new accent, its dialect accent features may form an independent feature cluster after clustering. However, the cosine similarity between the central feature vector of this feature cluster and the known "Cantonese-Guangfu dialect" feature vector is as high as 0.88, exceeding the matching threshold of 0.85. Therefore, this feature cluster is classified as a subclass of "Cantonese-Guangfu dialect". Conversely, for a completely unknown dialect speech set, the maximum cosine similarity between the central feature vector of its feature cluster and all known dialect feature vectors is only 0.45, far below the matching threshold. Therefore, this feature cluster is registered as a completely new category. Another comparison is reflected in the amount of data required for synthesizer initialization. Fine-tuning an existing base model to build a new dialect synthesizer may require only a few hundred sentences of data, while training a synthesizer from scratch requires tens of thousands of sentences of data.
[0047] The above are merely preferred embodiments of the present invention and are not intended to limit the present invention in any form. Any person skilled in the art may use the technical content disclosed above to make changes or modifications that yield equivalent embodiments applicable to other fields. However, any simple modification, equivalent change, or variation made to the above embodiments based on the technical essence of the present invention, without departing from the scope of the present invention, shall still fall within the protection scope of the present invention.
Claims
1. An intelligent voice human-computer interaction system adaptable to multiple dialect accents, characterized in that the system includes:
a feature decoupling module, which collects raw speech samples containing multiple dialect accents to obtain standard acoustic feature vectors, and performs deep feature decoupling on the standard acoustic feature vectors through an improved adversarial autoencoder network to separate text content invariant features and dialect accent features;
a recognition and classification module, which inputs the text content invariant features into an attention-based speech recognition encoder to generate a text phoneme sequence and, at the same time, inputs the dialect accent features into a dialect accent classifier to obtain dialect accent category labels;
a semantic understanding module, which fuses the text phoneme sequence with the dialect accent category label to form an intermediate semantic result with accent annotation, and inputs the intermediate semantic result with accent annotation into a spoken language natural language understanding model for intent and slot parsing; and
an interactive response module, which, based on the intent and slot parsing results, calls the corresponding service interface to obtain feedback information, selects the corresponding target dialect speech synthesizer from a multi-dialect speech synthesis model based on the dialect accent category label, inputs the feedback information into the target dialect speech synthesizer to generate target synthesized speech that conforms to the target dialect accent, and outputs the target synthesized speech.
2. The intelligent voice human-computer interaction system adaptable to multiple dialect accents according to claim 1, characterized in that the improved adversarial autoencoder network works as follows:
the improved adversarial autoencoder network includes a shared encoder, a text content encoder, a dialect accent encoder, a shared decoder, a text content discriminator, and a dialect accent discriminator;
the shared encoder performs preliminary encoding on the input standard acoustic feature vector to generate a shared hidden layer vector;
the text content encoder processes the shared hidden layer vector to extract the text content invariant features, and the dialect accent encoder processes the shared hidden layer vector to extract the dialect accent features;
the shared decoder receives a reconstructed feature vector composed of the text content invariant features and a randomly sampled reference dialect accent feature, and attempts to reconstruct an output feature that is similar to the input standard acoustic feature vector;
the text content discriminator determines whether an input text content invariant feature comes from a real speech sample or is generated by the network, and the dialect accent discriminator determines whether an input dialect accent feature comes from a real speech sample or is generated by the network; and
during training, by minimizing the reconstruction loss of the shared decoder and maximizing the discrimination error rate of the text content discriminator and the dialect accent discriminator, the improved adversarial autoencoder network effectively decouples the text content invariant features and the dialect accent features, so that the two are separate from each other and each is complete.
3. The intelligent voice human-computer interaction system adaptable to multiple dialect accents according to claim 2, characterized in that collecting raw speech samples containing multiple dialect accents to obtain standard acoustic feature vectors, and performing deep feature decoupling on the standard acoustic feature vectors through the improved adversarial autoencoder network to separate text content invariant features and dialect accent features, includes:
pre-emphasizing, framing, and windowing the raw speech samples to obtain a pre-processed speech frame sequence;
calculating the Mel frequency cepstral coefficient features of the pre-processed speech frame sequence to obtain the standard acoustic feature vector, and inputting the standard acoustic feature vector into the improved adversarial autoencoder network for deep feature decoupling; and
separating, by the improved adversarial autoencoder network and from the standard acoustic feature vector, the text content invariant features related to text content and the dialect accent features related to the speaker's dialect accent.
4. The intelligent voice human-computer interaction system adaptable to multiple dialect accents according to claim 3, characterized in that calculating the Mel frequency cepstral coefficient features of the pre-processed speech frame sequence to obtain the standard acoustic feature vector, and inputting the standard acoustic feature vector into the improved adversarial autoencoder network for deep feature decoupling, specifically includes:
performing a fast Fourier transform on each pre-processed speech frame to obtain the amplitude spectrum of each speech frame;
passing the amplitude spectrum through a set of Mel-scale triangular filters, and calculating the logarithmic energy of the output of each filter;
performing a discrete cosine transform on the logarithmic energy and retaining a specific number of coefficients to construct the Mel frequency cepstral coefficient features of the speech frame;
concatenating the Mel frequency cepstral coefficient features of all speech frames in chronological order to form the standard acoustic feature vector; and
normalizing the standard acoustic feature vector and inputting it into the shared encoder of the improved adversarial autoencoder network, the shared encoder outputting the shared hidden layer vector.
5. The intelligent voice human-computer interaction system adaptable to multiple dialect accents according to claim 1, characterized in that fusing the text phoneme sequence with the dialect accent category label to form an intermediate semantic result with accent annotation, and inputting the intermediate semantic result with accent annotation into the spoken language natural language understanding model for intent and slot parsing, includes:
converting the text phoneme sequence into an original text sequence using a phoneme-to-text conversion model;
converting the dialect accent category label into a corresponding dialect identifier, and inserting the dialect identifier as a prefix at the beginning of the original text sequence to generate the intermediate semantic result with accent annotation;
wherein the spoken language natural language understanding model has parameter adaptation layers for different dialect identifiers, and when the intermediate semantic result with accent annotation is received, the corresponding parameter adaptation layer is activated according to the dialect identifier;
encoding the original text sequence through the activated parameter adaptation layer to extract a deep semantic vector containing dialect grammar and expression habits; and
inputting the deep semantic vector into an intent classification head and a slot filling head to obtain the user's interaction intent category and the key information fragments in the original text sequence that relate to the interaction intent, the key information fragments being the slot values.
6. The intelligent voice human-computer interaction system adaptable to multiple dialect accents according to claim 5, characterized in that calling the corresponding service interface based on the intent and slot parsing results to obtain feedback information, and selecting the corresponding target dialect speech synthesizer from the multi-dialect speech synthesis model based on the dialect accent category label, includes:
determining, based on the interaction intent category, the type of backend service to be accessed, and constructing, based on the slot values, a query request that conforms to the interface specification of that backend service type;
sending the query request to the service interface, and receiving the structured data returned by the service interface as original feedback information;
filling the original feedback information into a preset text template to generate grammatically correct written feedback text; and
searching a preset model registry for the dialect speech synthesis model instance associated with the dialect accent category label, and loading the instance found as the target dialect speech synthesizer;
wherein the multi-dialect speech synthesis model contains multiple independently trained dialect speech synthesizer instances, each trained specifically for a particular dialect accent.
7. The intelligent voice human-computer interaction system adaptable to multiple dialect accents according to claim 6, characterized in that inputting the feedback information into the target dialect speech synthesizer to generate target synthesized speech that conforms to the target dialect accent includes:
receiving, by the target dialect speech synthesizer, the written feedback text, and performing text normalization and word segmentation on the written feedback text;
inputting the processed text sequence into a dialect phoneme converter inside the target dialect speech synthesizer to convert the standard phoneme sequence into a dialect phoneme sequence that conforms to the pronunciation habits of the target dialect;
inputting the dialect phoneme sequence into a duration predictor to predict the duration of each dialect phoneme and prosodic boundary;
expanding and aligning the dialect phoneme sequence based on the predicted durations to generate an aligned phoneme sequence containing temporal information;
inputting the aligned phoneme sequence into an acoustic feature predictor to predict an acoustic feature sequence containing Mel spectral features; and
inputting the acoustic feature sequence into the corresponding dialect vocoder to synthesize the final target synthesized speech.
8. The intelligent voice human-computer interaction system adaptable to multiple dialect accents according to claim 7, characterized in that inputting the processed text sequence into the dialect phoneme converter inside the target dialect speech synthesizer to convert the standard phoneme sequence into a dialect phoneme sequence that conforms to the pronunciation habits of the target dialect includes:
storing, in the dialect phoneme converter, a dialect pronunciation rule base and a dialect-specific vocabulary mapping table;
for the processed input text sequence, first replacing the standard words in the text with the corresponding dialect words according to the dialect-specific vocabulary mapping table;
for the text sequence after word replacement, converting the initials, finals, and tones of standard Mandarin into the corresponding pronunciation units of the target dialect by a rule-based initial and final conversion module according to the dialect pronunciation rule base;
for dialects with distinct literary and colloquial readings or with polyphonic characters, using a disambiguation model based on a recurrent neural network, together with contextual information, to determine the correct pronunciation in the current context; and
arranging all converted pronunciation units in order and outputting them as the dialect phoneme sequence, the symbol set of the dialect phoneme sequence being independent of the standard phoneme symbol set.
9. The intelligent voice human-computer interaction system adaptable to multiple dialect accents according to claim 1, characterized in that the system further includes an online adaptive module, which:
during the interaction process, marks the input speech as unknown dialect speech to be analyzed when the confidence of the dialect accent classifier in recognizing the dialect accent of the input speech is lower than a preset threshold;
extracts the dialect accent features of the unknown dialect speech to be analyzed and stores them in an online buffer pool;
initiates incremental clustering analysis when the number of unknown dialect accent speech samples accumulated in the online buffer pool reaches a set scale, performing unsupervised clustering on all dialect accent features in the online buffer pool to form several feature clusters;
for each feature cluster, calculates its central feature vector and compares the similarity between the central feature vector and the known dialect accent feature vectors;
if a known dialect accent has a similarity exceeding a matching threshold, classifies the feature cluster as a subclass of that known dialect accent and updates the corresponding category parameters of the dialect accent classifier; and
if no known dialect accent has a similarity exceeding the matching threshold, registers the feature cluster as a new dialect accent category, initializes a dedicated speech synthesizer for each new category, and adds it to the multi-dialect speech synthesis model.
10. The intelligent voice human-computer interaction system adaptable to multiple dialect accents according to claim 9, characterized in that initializing a dedicated speech synthesizer for each new category and adding it to the multi-dialect speech synthesis model includes:
selecting, from the online buffer pool, the raw speech samples corresponding to the dialect accent features belonging to the new category;
extracting, using the improved adversarial autoencoder network, the pure dialect accent feature vector of the new category from the selected raw speech samples;
using a standard Mandarin speech synthesizer as the base model, fine-tuning the acoustic feature predictor and vocoder of the base model with a small parallel speech-text corpus of the new dialect, the pure dialect accent feature vector being used as a conditional input during fine-tuning to guide the model to learn the pronunciation characteristics of the new dialect;
after training is complete, saving the fine-tuned model parameters to form a dedicated speech synthesizer instance for the new dialect accent category; and
adding a new record to the model registry that associates the new dialect accent category label with the dedicated speech synthesizer instance.
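As a non-limiting illustration, the model registry referred to in claims 6 and 10 can be sketched as a mapping from dialect accent category labels to synthesizer instances. The class `ModelRegistry`, its methods, the label strings, and the stand-in synthesizer functions are all hypothetical. A real synthesizer would return audio rather than tagged text, and the behavior for an unregistered label (raising an error here) is an assumption that the disclosure does not specify.

```python
from typing import Callable, Dict

# Stand-in type: a synthesizer maps written feedback text to an output.
# Here the output is tagged text purely for illustration.
Synthesizer = Callable[[str], str]


class ModelRegistry:
    """Hypothetical registry associating accent labels with synthesizers."""

    def __init__(self) -> None:
        self._records: Dict[str, Synthesizer] = {}

    def add_record(self, accent_label: str, synthesizer: Synthesizer) -> None:
        """Associate a dialect accent category label with a synthesizer."""
        self._records[accent_label] = synthesizer

    def lookup(self, accent_label: str) -> Synthesizer:
        """Return the synthesizer registered for the label.

        Assumption: an unregistered label raises KeyError.
        """
        return self._records[accent_label]


def respond(registry: ModelRegistry, accent_label: str, feedback_text: str) -> str:
    """Select the target dialect synthesizer by label and run it."""
    target_synthesizer = registry.lookup(accent_label)
    return target_synthesizer(feedback_text)
```

In this sketch, registering a newly fine-tuned synthesizer under a new label with `add_record` is enough for a later `respond` call carrying that label to reach it, which mirrors the retrieval path used by the interactive response module.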