Speech synthesis method, device, electronic device and storage medium
A speech synthesis and audio-signal technology, applied to speech synthesis, speech analysis, instruments, etc. It addresses the problem that existing synthesis performance or synthesis quality is difficult to bring to commercial use, with the effects of accelerating the real-time audio synthesis process, reducing the number of vocoder iterations, and speeding up convergence.
Examples
Embodiment 1
[0039] Embodiment 1 provides a speech synthesis method. Target values for the spectrum and phase are obtained from text data, and the text data is converted into a text vector. A neural network model predicts the spectrum and phase, and an overall loss computed from the target and predicted values of both spectrum and phase constrains the direction in which the model parameters are updated during training. The trained model outputs a linear spectrum and an initial phase, which are input into a Griffin-Lim vocoder to obtain the corresponding audio signal. This speech synthesis method reduces the number of vocoder iterations and speeds up vocoder convergence without degrading audio quality, thereby accelerating the real-time audio synthesis process as a whole, and is suitable for systems using the Griffin-Lim algorithm as a ...
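To illustrate why a predicted initial phase helps, the following is a minimal sketch of the classic Griffin-Lim iteration built on SciPy's STFT routines. All function names, frame sizes, and the iteration count are illustrative assumptions, not the patent's implementation; the key point is that `init_phase` can be seeded from a model prediction instead of random noise, which is what allows fewer iterations.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(magnitude, init_phase=None, n_iter=32, nperseg=256, noverlap=128):
    """Iteratively recover a waveform from a magnitude spectrogram.
    Seeding init_phase with a model-predicted phase (instead of the
    random default) lets the loop converge in fewer iterations."""
    if init_phase is None:
        rng = np.random.default_rng(0)
        init_phase = rng.uniform(-np.pi, np.pi, magnitude.shape)
    spec = magnitude * np.exp(1j * init_phase)
    for _ in range(n_iter):
        # inverse STFT -> time signal, forward STFT -> self-consistent phase
        _, x = istft(spec, nperseg=nperseg, noverlap=noverlap)
        _, _, rebuilt = stft(x, nperseg=nperseg, noverlap=noverlap)
        # keep the target magnitude, adopt the rebuilt phase
        spec = magnitude * np.exp(1j * np.angle(rebuilt))
    _, audio = istft(spec, nperseg=nperseg, noverlap=noverlap)
    return audio
```

In this sketch each iteration enforces consistency between the fixed target magnitude and a phase that could have come from a real waveform; a good initial phase simply starts the loop closer to that fixed point.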
Embodiment 2
[0057] Embodiment 2 is an improvement on the basis of Embodiment 1. The linear-spectrum target value and the phase target value are obtained from the text data and used in the loss function for neural network model training; the text data is converted into a text vector that serves as the neural network model's input.
[0058] Audio data matching the text data is obtained and subjected to a short-time Fourier transform to obtain the linear-spectrum target value and the phase target value. The short-time Fourier transform frames and windows the long audio signal, applies a Fourier transform to each frame, and stacks the per-frame transform results along another dimension to yield the linear spectrum and phase of the audio data. This linear spectrum and phase are used as the linear-spectrum target value and phase target value, i.e., the target-value parameters of the loss function for neural network model training. ...
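The target-extraction step above can be sketched with SciPy's `stft`, which performs the framing, windowing, per-frame Fourier transform, and stacking in one call. The sample rate and frame parameters below are illustrative assumptions, not values specified in the patent.

```python
import numpy as np
from scipy.signal import stft

def spectrum_and_phase_targets(audio, fs=16000, nperseg=1024, noverlap=768):
    """Frame and window the audio, Fourier-transform each frame, and
    stack frames along the time axis; the complex STFT then splits into
    the linear-spectrum target (magnitude) and the phase target (angle)."""
    _, _, Z = stft(audio, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return np.abs(Z), np.angle(Z)
```

The two returned arrays share the same (frequency-bin, frame) shape, which is what lets a single network head predict each of them against a like-shaped target.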
Embodiment 3
[0064] Embodiment 3 is an improvement on the basis of Embodiment 1 and/or Embodiment 2. The neural network model of the speech synthesis method adopts the Tacotron model. The text vector is input into the Tacotron model to obtain a linear-spectrum prediction value and a phase prediction value, which are used to calculate the loss function that constrains Tacotron training so that the model's spectrum and phase are trained in the same direction.
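One plausible form of such a combined loss is sketched below, with NumPy standing in for a deep-learning framework. The L1 spectrum term, the phase-wrapping trick, and the weight `w_phase` are all assumptions for illustration; the patent only specifies that spectrum and phase targets jointly constrain the update direction.

```python
import numpy as np

def combined_loss(mag_pred, mag_true, phase_pred, phase_true, w_phase=0.5):
    """Overall loss = spectrum term + weighted phase term, so one scalar
    drives both predictions in the same direction during training.
    Wrapping the phase difference onto (-pi, pi] respects the fact
    that phase is circular (e.g. -pi and pi are the same angle)."""
    mag_loss = np.mean(np.abs(mag_pred - mag_true))
    wrapped = np.angle(np.exp(1j * (phase_pred - phase_true)))
    phase_loss = np.mean(np.abs(wrapped))
    return mag_loss + w_phase * phase_loss
```

Because the two terms are summed into a single scalar, gradient descent on this loss updates the shared model parameters in a direction that improves spectrum and phase predictions together.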
[0065] The Tacotron model consists of a sequentially connected encoder, decoder, and a post-processing network composed of a CBHG unit and fully connected layers. The text vector is processed by the encoder and then the decoder to obtain a mel spectrum; the CBHG unit computes mel-spectrum features from the mel spectrum; and the fully connected layers map those features to the linear-spectrum prediction value and a phase prediction with the same dim...
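The final projection step can be sketched as two fully connected heads over the post-net features, producing magnitude and phase outputs of the same dimension. The feature size, bin count, weight initialization, and output activations below are hypothetical; only the "two like-shaped heads" structure reflects the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_feat, n_bins = 128, 513  # hypothetical CBHG feature size and FFT bin count

# Two fully connected heads projecting post-net features to a
# linear-spectrum prediction and a phase prediction of equal shape.
W_mag = rng.normal(0.0, 0.02, (n_feat, n_bins))
W_phase = rng.normal(0.0, 0.02, (n_feat, n_bins))

def postnet_heads(features):
    """features: (frames, n_feat) -> (frames, n_bins) magnitude and phase."""
    mag = np.log1p(np.exp(features @ W_mag))     # softplus keeps magnitude >= 0
    phase = np.pi * np.tanh(features @ W_phase)  # squash phase into (-pi, pi)
    return mag, phase
```

The softplus and scaled-tanh activations are one reasonable way to keep each head in its valid range; a framework implementation would learn these projections jointly with the rest of the network.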