A speech synthesis model method capable of synthesizing multi-emotional audio.

By processing audio data through an emotion recognition module and a vocoder with a WaveGlow structure, the problem of existing technologies being unable to synthesize multi-emotional speech is solved, achieving high-quality multi-emotional audio synthesis, simulating real human prosody, and optimizing synthesis speed and result editability.

CN116798403B · Active · Publication Date: 2026-06-30 · UNICOM WOYUEDU TECH CULTURE CO LTD +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: UNICOM WOYUEDU TECH CULTURE CO LTD
Filing Date: 2023-05-06
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing speech synthesis technologies cannot effectively simulate the human vocal tract, especially when synthesizing emotional speech. There is still room for improvement in naturalness and intelligibility, and existing models cannot analyze semantics.

Method used

The system uses an emotion recognition module to process audio data, extracts emotional features with a convolutional neural network, combines variational inference with a normalizing flow to map latent variables to the speech space, uses a vocoder with a WaveGlow structure for phoneme reconstruction, builds a multi-emotion text-to-speech model, and adds a random duration predictor and offline fine-tuning capability.

Benefits of technology

It achieves high-quality synthesis of multi-emotional audio, simulates natural human prosody, reduces machine annotation errors, improves synthesis speed and the editability of results, and solves the problem of multi-emotional text-to-speech.

✦ Generated by Eureka AI based on patent content.

Figure CN116798403B_ABST

Abstract

This invention discloses a speech synthesis model method capable of synthesizing multi-emotion audio, relating to the field of intelligent speech technology. The method includes the following steps: processing raw data, distinguishing between training and validation sets, adding annotation files to each set, and simultaneously delivering the raw dataset to an emotion recognition module for processing; calling the emotion recognition module to preprocess the dataset, decomposing the audio into phonemes and emotion feature files; the complete multi-emotion text-to-speech model and dataset processing are specifically divided into dataset collection, unsupervised preprocessing, encoder training, and online inference. The final output includes a multi-emotion encoder with intermediate outputs and a final online synthesized independent WAV file, capable of achieving multi-emotion output and simulating prosody, making the effect close to that of a real person. No emotion annotation is required during data processing, and the method of constructing a continuous feature value spectrum greatly avoids the problem of inaccurate machine annotation.

Description

Technical Field

[0001] This invention relates to the field of intelligent speech technology, and in particular to a speech synthesis model method capable of synthesizing multi-emotional audio.

Background Technology

[0002] Intelligent voice technology has reached its peak after several years of development, driving the expansion of its market size and the implementation of its commercial applications. With the emergence of new natural language technologies and the continuous maturation of existing technologies, intelligent voice technology has moved from its nascent stage to maturity, promoting large-scale commercial applications.

[0003] Thanks to the rapid development of deep learning, machine-synthesized voices are no longer abrupt and cold, achieving good results in terms of naturalness and intelligibility. However, current speech synthesis technology still cannot fully simulate the human vocal cords, resulting in suboptimal synthesis quality. Furthermore, synthesizing emotional speech remains a challenge. While Tacotron2 has been released, capable of synthesizing natural voices, it cannot analyze semantics to synthesize emotional speech, and its naturalness can still be improved.

[0004] Therefore, it is necessary to design a speech synthesis model that can synthesize multi-emotion audio to solve the above problems.

Summary of the Invention

[0005] This invention provides a speech synthesis model method capable of synthesizing multi-emotional audio, which solves the aforementioned technical problems.

[0006] To address the aforementioned technical problems, this invention provides a speech synthesis model method capable of synthesizing multi-emotional audio, comprising the following steps:

[0007] S1. Process the raw data, distinguish between the training set and the validation set, add annotation files for each set, and deliver the raw dataset to the emotion recognition module for processing.

[0008] S2. Call the emotion recognition module to preprocess the dataset, breaking down the audio into phoneme and emotion feature files;

[0009] S3. Call the encoder to train the decomposed audio. The audio will be automatically classified by the encoder based on the feature file to obtain the trained acoustic model.

[0010] S4. Obtain the model file, insert the vocoder, input text into the input terminal, use the functions provided by the acoustic model to process the text into phonemes, add the duration information using the random duration predictor, and obtain continuous audio results.

[0011] S5. The output obtained from step S4 includes a multi-emotion encoder with intermediate outputs and a separate WAV file synthesized online.

[0012] Among them, S1 requires clear pronunciation, moderate speaking speed, balanced volume, and audio length of 5-10 seconds, and the emotional requirements include at least 5000 training samples and 500 validation samples.

[0013] In S2, the training time is represented by epochs, and it generally converges in about 10,000 epochs. Whether or not it converges is mainly indicated by the offset.

[0014] Furthermore, the emotion recognition module expression training steps in S2 are as follows:

[0015] S201. First, construct ultra-small samples in units of phonemes;

[0016] S202. Then, the emotional features of each segment are extracted using a convolutional neural network.

[0017] S203. Finally, the extracted features are converted into a NumPy array and stored in the sample set.

[0018] Furthermore, the NumPy array was divided into natural groups according to a clustering algorithm.

[0019] Furthermore, the encoder training steps for the decomposed audio in S3 are as follows:

[0020] S301. Variational inference is used to sample latent variables from the latent space;

[0021] S302. The latent variables in the latent space are then mapped to the speech space through a normalizing flow.

[0022] Furthermore, in step S301, the emotional features stored in the NumPy array in step S203 are added as a new dimension to the latent variables passed through the normalizing flow, so that they are mapped into the speech space.

[0023] Furthermore, the mapping process in S302 includes constructing a mel spectrum to convert general audio into data that can be used for machine learning, adding an additional random duration predictor to estimate the distribution of phoneme duration, and adding the emotion representation as a new computational dimension to the prediction.

[0024] Furthermore, the vocoder in S4 adopts a WaveGlow structure, which consists of a WaveNet decoder and a Glow stream encoder.

[0025] Furthermore, the vocoder processing steps are as follows:

[0026] S401. After the text is broken down into phonemes, WaveNet reconstructs the phoneme sequence into a continuous waveform.

[0027] S402. The Glow stream encoder maps the continuous waveform into a discrete speech representation to build the WAV file;

[0028] In S3, the encoder training results will participate in the phoneme sequence reconstruction process as latent variables.

[0029] Furthermore, the S4 online synthesis process is as follows:

[0030] S411. Send the text from the local machine to the server in paragraph form;

[0031] S412. Take the average value of the mode interval of the phoneme emotion expression of the sentence as the emotion expression of the whole sentence;

[0032] S413. Generate key-value pairs of feature values and speech representations, and deliver them to the acoustic model for synthesis. The time complexity is still O(n).

[0033] In S4, the feature values can be manually modified.

[0034] Compared with related technologies, the speech synthesis model method for synthesizing multi-emotion audio provided by this invention has the following beneficial effects:

[0035] This invention provides a method for processing raw data, distinguishing between training and validation sets, adding annotation files to each set, and simultaneously delivering the raw dataset to an emotion recognition module for processing. Next, an emotion feature encoder is extracted, aligning audio and phoneme sequences based on the annotation files. An acoustic model is trained using emotion features, and the trained acoustic model is obtained. The acoustic model is then inserted, and text is input at the input end. A vocoder uses functions provided by the acoustic model to process the text into phonemes, and a random duration predictor adds duration information to obtain continuous audio results. Finally, the output includes a multi-emotion encoder containing intermediate outputs and a final online synthesized independent WAV file. This method solves the problem of multi-emotion text-to-speech conversion and optimizes dataset annotation and online synthesis to a certain extent.

[0036] This invention provides a complete multi-emotion text-to-speech model and dataset processing, specifically divided into dataset collection, unsupervised preprocessing, encoder training, and online inference. The final output includes a multi-emotion encoder with intermediate outputs and an independent WAV file synthesized online. It can achieve multi-emotion output and simulate prosody, making the effect close to that of a real person. No emotion annotation is required when processing data, and the method of constructing a continuous feature value spectrum greatly avoids the problem of inaccurate machine annotation.

[0037] This invention provides a unique emotion analysis input-output architecture for online synthesis of long texts, keeping the time complexity at O(n) and preventing a sharp drop in synthesis speed even with the two added dimensions. It also adds offline fine-tuning capability, enhancing the editability of the output results. Furthermore, the design of a continuous emotion feature value spectrum makes it possible to fine-tune the emotion, effectively solving the problem of multi-emotion text-to-speech. Additionally, it optimizes dataset annotation and online synthesis to a certain extent.

Attached Figure Description

[0038] Figure 1 is a schematic diagram illustrating the steps of the speech synthesis model method capable of synthesizing multi-emotional audio in this invention.

Detailed Implementation

[0039] Embodiment: as shown in Figure 1, a speech synthesis model method capable of synthesizing multi-emotion audio includes the following steps:

[0040] S1. Process the raw data, distinguish between the training set and the validation set, add annotation files to each set, and deliver the raw dataset to the emotion recognition module for processing.

[0041] S1 requires clear pronunciation, moderate speaking speed, balanced volume, and an audio length of 5-10 seconds. The emotional data must include at least 5,000 training samples and 500 validation samples.
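The S1 data requirements can be sketched as a small filtering and splitting routine. This is a minimal illustration, not the patent's implementation: the `Clip` record, the function names, and the random hold-out strategy are assumptions; only the 5-10 second bounds and the 5,000 / 500 sample counts come from the text.

```python
import random
from typing import NamedTuple

class Clip(NamedTuple):
    path: str          # hypothetical path to a WAV file
    duration_s: float  # clip length in seconds
    text: str          # transcript that would go into the annotation file

def filter_and_split(clips, min_s=5.0, max_s=10.0, n_val=500, seed=0):
    """Keep clips within the length bounds, then hold out n_val clips for validation."""
    usable = [c for c in clips if min_s <= c.duration_s <= max_s]
    rng = random.Random(seed)
    rng.shuffle(usable)
    return usable[n_val:], usable[:n_val]  # (training set, validation set)

def check_minimums(train, val, min_train=5000, min_val=500):
    """Return True when both sets meet the sample counts stated for S1."""
    return len(train) >= min_train and len(val) >= min_val
```

Properties such as clear pronunciation or balanced volume are not checked here; they would need audio analysis or manual review.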

[0042] S2. Call the emotion recognition module to preprocess the dataset, breaking down the audio into phoneme and emotion feature files.

[0043] Specifically, emotional expression is no different from textual expression on the data side. Therefore, it can be trained using a dialogue model. During pre-training, the emotion recognition module first constructs ultra-small samples at the phoneme level, and then uses a convolutional neural network (CNN) to extract the emotional features of each segment, converting them into NumPy arrays and storing them in the sample set. Furthermore, to achieve emotion fine-tuning, these NumPy arrays are naturally grouped according to a clustering algorithm, essentially forming a continuous feature value spectrum.
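The idea of grouping features by clustering and then treating the groups as a continuous spectrum can be illustrated as follows. This is a simplified sketch under stated assumptions: each phoneme-level feature is assumed to have already been reduced to a single scalar score, plain Python lists stand in for the NumPy arrays, and k-means is used as the clustering algorithm (the text does not name one). The function names are hypothetical.

```python
def kmeans_1d(values, k, iters=50):
    """Plain Lloyd's algorithm on scalars; returns the cluster centers in sorted order."""
    ordered = sorted(values)
    # Deterministic initialisation: spread the starting centers across the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # An empty group keeps its previous center.
        updated = [sum(g) / len(g) if g else centers[j] for j, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)

def spectrum_position(value, centers):
    """Continuous coordinate: integer part = lower cluster, fraction = progress to the next."""
    if value <= centers[0]:
        return 0.0
    if value >= centers[-1]:
        return float(len(centers) - 1)
    for i in range(len(centers) - 1):
        lo, hi = centers[i], centers[i + 1]
        if lo <= value <= hi:
            return i + (value - lo) / (hi - lo)
```

A score halfway between two cluster centers receives a fractional coordinate rather than being forced into one discrete label, which is one way to read the "continuous feature value spectrum" described above.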

[0044] Generally, training a deep learning-based acoustic model requires 10,000 audio clips of 5-10 seconds each as raw data to reach commercial application level. Achieving multi-emotion output requires an even larger data volume. Extensive annotation would generate an unacceptable amount of manual work; therefore, this invention employs an unsupervised approach to reduce this burden. However, machine annotation errors are far greater than manual annotation errors, and the construction of conventional discrete emotion dictionaries amplifies this problem, leading to high bias values for specific emotion features. This issue can be addressed by constructing a continuous feature value spectrum over the set of natural numbers.

[0045] S3. Call the encoder to train the decomposed audio. The audio will be automatically classified by the encoder based on the feature file to obtain the trained acoustic model.

[0046] The human vocal tract operates with a degree of randomness, and the pauses, pitches, and timbre of each individual's speech constitute prosody. Mel spectrograms are typically constructed as prosodic data that machines can learn. To optimize the synthesis results, a random duration predictor can be used to estimate the distribution of phoneme durations, simulating the vocal tract's different pitches and pause durations for different phonemes, thus better learning the prosody of the data provider.
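To illustrate what "estimating the distribution of phoneme durations" means in practice, the sketch below draws a duration per phoneme from a log-normal distribution instead of using one fixed value. The log-normal choice, the statistics table, and the function name are assumptions made for illustration; the patent does not specify the predictor's internal form.

```python
import math
import random

# Hypothetical (mean, std) of log-duration in seconds; a trained predictor would output these.
LOG_DURATION_STATS = {
    "a": (math.log(0.12), 0.25),
    "s": (math.log(0.08), 0.20),
    "<pause>": (math.log(0.30), 0.40),
}

def sample_durations(phonemes, stats, rng, noise_scale=1.0):
    """Draw one duration (seconds) per phoneme from a log-normal distribution."""
    durations = []
    for p in phonemes:
        mu, sigma = stats[p]
        eps = rng.gauss(0.0, 1.0)  # standard normal noise
        durations.append(math.exp(mu + noise_scale * sigma * eps))
    return durations
```

Setting `noise_scale` to 0 collapses the sampler to a deterministic duration per phoneme, while larger values produce more varied pauses and rhythm between runs.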

[0047] Emotional expressions can be converted into vector data, and the ability to judge emotions by combining context can be trained through random masking and prediction of masked content.
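The random masking step mentioned above can be sketched as follows. The mask token, the masking probability, and the function name are illustrative assumptions; the model that would predict the masked items from context is not shown.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token

def mask_tokens(tokens, mask_prob, rng):
    """Return (masked sequence, {position: original token}) for masked-prediction training."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok  # the model is trained to recover this from context
        else:
            masked.append(tok)
    return masked, targets
```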

[0048] The encoder uses variational inference to sample latent variables from the latent space, and then maps the latent variables from the latent space to the speech space through a normalizing flow.

[0049] To achieve emotional output, this model incorporates the emotional features stored in a NumPy array as a new dimension into the latent variables passed through the normalizing flow, and maps them into the speech space.
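The three operations described here (sampling a latent variable, appending the emotion feature as an extra dimension, and applying an invertible flow) can be sketched in a few lines. This is a toy illustration under stated assumptions: the flow is a single element-wise affine transform with fixed parameters, whereas a trained model would stack learned layers, and all function names are hypothetical.

```python
import math
import random

def sample_latent(mu, log_sigma, rng):
    """Reparameterised sample z = mu + sigma * eps, with eps drawn from N(0, 1)."""
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mu, log_sigma)]

def add_emotion_dimension(z, emotion_value):
    """Append the emotion feature as one extra latent dimension."""
    return z + [emotion_value]

def affine_flow(z, scale, shift):
    """Element-wise invertible map y = z * exp(scale) + shift."""
    return [v * math.exp(s) + b for v, s, b in zip(z, scale, shift)]

def affine_flow_inverse(y, scale, shift):
    """Exact inverse of affine_flow, which is what makes the map a flow."""
    return [(v - b) * math.exp(-s) for v, s, b in zip(y, scale, shift)]
```

Because the transform is invertible, the same parameters can map speech-space values back to the latent space, which is the property normalizing flows rely on during training.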

[0050] During the mapping process, general audio is converted into machine learning data by constructing the mel spectrum. In order to improve the simulation effect of human voice, an additional random duration predictor is added to estimate the distribution of phoneme duration. Emotion representation is also added as a new computational dimension to the prediction. This process is trained adversarially using Hifi-GAN.
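For readers unfamiliar with the mel spectrum, the sketch below shows the frequency warping behind it, using the commonly used formula mel = 2595 · log10(1 + f / 700). The patent does not state which mel variant or how many bands it uses, so the formula choice and the helper names are assumptions.

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to the mel scale (2595 * log10(1 + f / 700) variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_centers(n_mels, f_min, f_max):
    """Center frequencies (Hz) of n_mels bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]
```

Bands that are evenly spaced in mel are densely packed at low frequencies and progressively wider at high frequencies, which roughly follows human pitch perception.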

[0051] S4. Obtain the model file, insert the vocoder, input text into the input terminal, use the functions provided by the acoustic model to process the text into phonemes, add the duration information using the random duration predictor, and obtain continuous audio results.

[0052] Specifically, the vocoder uses a WaveGlow architecture, which consists of a WaveNet decoder and a Glow stream encoder.

[0053] After the text is broken down into phonemes, WaveNet reconstructs the phoneme sequence into a continuous waveform, while the Glow stream encoder maps the continuous waveform into a discrete speech representation to build a WAV file.

[0054] In S3, the encoder training results will participate in the phoneme sequence reconstruction process as latent variables.

[0055] Because two additional dimensions are added to the general speech synthesis model, the training time is longer, and a longer context is needed to make judgments in order to express emotions accurately.

[0056] For practical deployment, this invention implements an online synthesis process: text is sent from the local machine to the server paragraph by paragraph and split into sentences at each period; the average value of the mode interval of the sentence's phoneme emotion representations is taken as the emotion expression of the whole sentence; key-value pairs of feature values and speech representations are generated and delivered to the acoustic model for synthesis. The time complexity remains O(n).
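One possible reading of this online step is sketched below: split the paragraph at periods, bin each sentence's phoneme emotion values into fixed-width intervals, take the most populated interval as the mode interval, and average the values inside it. The bin width, the tie-breaking rule, and the `score_phonemes` callable (standing in for the emotion recognition module) are assumptions, and the pairs here hold the sentence text rather than a real speech representation.

```python
from collections import Counter

def split_sentences(paragraph):
    """Split on periods (the breakpoint named in the text); empty pieces are dropped."""
    return [s.strip() for s in paragraph.split(".") if s.strip()]

def sentence_emotion(phoneme_values, bin_width=0.25):
    """Average the values that fall in the most populated interval (the mode interval)."""
    if not phoneme_values:
        raise ValueError("at least one phoneme value is required")
    bins = Counter(int(v // bin_width) for v in phoneme_values)
    mode_bin = max(bins, key=lambda b: (bins[b], -b))  # ties go to the lower interval
    in_mode = [v for v in phoneme_values if int(v // bin_width) == mode_bin]
    return sum(in_mode) / len(in_mode)

def build_pairs(paragraph, score_phonemes):
    """One (emotion value, sentence) pair per sentence, ready to hand to the acoustic model."""
    return [(sentence_emotion(score_phonemes(s)), s) for s in split_sentences(paragraph)]
```

Each phoneme value is visited a constant number of times, so the total work grows linearly with the length of the text, consistent with the O(n) claim. Splitting on a bare period would mishandle abbreviations and decimals; a production system would need a more careful sentence splitter.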

[0057] To ensure operational effectiveness, an interface for manually modifying feature values is also provided. Since the feature values are a series of continuous values, adjustments can be made to allow the synthesized result to gradually shift from one emotional range to another, achieving fine-tuning of emotions.
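Because the feature values are continuous, a manual adjustment can be as simple as interpolating toward a target value. The following sketch shows two hypothetical helpers for such an interface; the function names and the linear interpolation are assumptions, not the patent's stated mechanism.

```python
def shift_emotion(values, target, amount):
    """Move each feature value toward target; amount=0 keeps it, amount=1 reaches the target."""
    if not 0.0 <= amount <= 1.0:
        raise ValueError("amount must be in [0, 1]")
    return [v + amount * (target - v) for v in values]

def emotion_ramp(start, end, steps):
    """Evenly spaced values from start to end, for a gradual transition across sentences."""
    if steps < 2:
        return [end]
    return [start + (end - start) * i / (steps - 1) for i in range(steps)]
```

For example, `emotion_ramp(0.0, 1.0, 5)` yields one value per sentence for a five-sentence passage that moves steadily from one emotional range to another.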

[0058] S5. Based on step S4, obtain the output, including the multi-emotion encoder with intermediate outputs and the final online synthesized independent WAV file.

[0059] In summary, the process begins by processing the raw data, distinguishing between the training and validation sets, and adding annotation files to each set. The raw dataset is then processed by the emotion recognition module. Next, the emotion feature encoder is extracted and aligned with the audio and phoneme sequences based on the annotation files. An acoustic model is then trained using emotion features. During training, the emotion feature vectors are concatenated after the phoneme output to obtain the trained acoustic model. The trained model is then loaded together with the vocoder, and text is input at the input end. The vocoder uses functions provided by the acoustic model to process the text into phonemes, and a random duration predictor adds duration information to obtain continuous audio output. Finally, the output consists of a multi-emotion encoder containing intermediate outputs and a final online synthesized independent WAV file.

[0060] The complete multi-emotion text-to-speech model and dataset processing in this invention are specifically divided into dataset collection, unsupervised preprocessing, encoder training, and online inference. The final output includes a multi-emotion encoder with intermediate outputs and an independent WAV file synthesized online. It can achieve multi-emotion output and simulate prosody, making the effect close to that of a real person. No emotion annotation is required when processing the data. At the same time, the method of constructing a continuous feature value spectrum greatly avoids the problem of inaccurate machine annotation. A unique emotion analysis input-output architecture is constructed for online synthesis of long texts, ensuring that the time complexity metric is maintained at O(n). This prevents the synthesis speed from dropping drastically when adding two dimensions. Offline fine-tuning capability is added to enhance the editability of the output results. At the same time, the design of the continuous emotion feature value spectrum makes it possible to fine-tune the emotion, effectively solving the problem of multi-emotion text-to-speech. At the same time, certain optimizations are made to the dataset annotation and online synthesis.

Claims

1. A speech synthesis model method capable of synthesizing multi-emotional audio, characterized in that it includes the following steps:

S1. Process the raw data, distinguish between the training set and the validation set, add annotation files for each set, and deliver the raw dataset to the emotion recognition module for processing.

S2. Call the emotion recognition module to preprocess the dataset, breaking down the audio into phoneme and emotion feature files;

S3. Call the encoder to train the decomposed audio. The audio will be automatically classified by the encoder based on the feature file to obtain the trained acoustic model.

S4. Obtain the model file, insert the vocoder, input text into the input terminal, use the functions provided by the acoustic model to process the text into phonemes, add the duration information using the random duration predictor, and obtain continuous audio results.

S5. The output obtained from step S4 includes a multi-emotion encoder with intermediate outputs and a separate WAV file synthesized online.

Among them, S1 requires clear pronunciation, moderate speaking speed, balanced volume, and audio length of 5-10 seconds, and the emotional requirements include at least 5000 training samples and 500 validation samples.

In S2, the training time is represented by epochs, and it generally converges in about 10,000 epochs. Whether it converges or not is mainly marked by the offset.

The emotion recognition module training steps in S2 are as follows:

S201. First, construct ultra-small samples in units of phonemes;

S202. Then, the emotional features of each segment are extracted using a convolutional neural network.

S203. Finally, the features are converted into a NumPy array and stored in the sample set;

The NumPy array is divided into natural groups according to a clustering algorithm;

The encoder training steps for the decomposed audio in S3 are as follows:

S301. Variational inference is used to sample latent variables from the latent space;

S302. The latent variables in the latent space are then mapped to the speech space through a normalizing flow;

In step S301, the emotional features stored in the NumPy array in step S203 are added as a new dimension to the latent variables passed through the normalizing flow, so as to map them into the speech space.

The mapping process in S302 includes constructing a mel spectrum to convert general audio into data that can be used for machine learning, adding an additional random duration predictor to estimate the distribution of phoneme duration, and adding the emotion representation as a new computational dimension to the prediction.

The vocoder in S4 adopts a WaveGlow structure, which consists of a WaveNet decoder and a Glow stream encoder.

The vocoder processing steps are as follows:

S401. After the text is broken down into phonemes, WaveNet reconstructs the phoneme sequence into a continuous waveform.

S402. The Glow stream encoder maps the continuous waveform into a discrete speech representation to build the WAV file;

In S3, the encoder training results will participate in the phoneme sequence reconstruction process as latent variables.

The process for obtaining continuous audio results in S4 is as follows:

S411. Send the text from the local machine to the server in paragraph form, with the period as the breakpoint.

S412. Take the average value of the mode interval of the emotional expression of each phoneme in the sentence, with the period as the breakpoint, as the emotional expression of the whole sentence.

S413. Generate key-value pairs of feature values and speech representations, and deliver them to the acoustic model for synthesis. The time complexity is still O(n).

In S4, the feature values can be manually modified.