Speech synthesis method, device, electronic device and storage medium

A speech synthesis and audio signal technology, applied in speech synthesis, speech analysis, instruments, etc. It addresses the problem that the synthesis performance or synthesis quality of existing methods falls short of commercial use, and achieves the effects of reducing the number of iterations, accelerating convergence, and speeding up the real-time audio synthesis process.

Active Publication Date: 2020-05-19
杭州博盾习言科技有限公司
8 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

With the introduction of WaveNet, LPCNet and other technologies, a number of speech synthesis methods that use neural networks as vocoders have appeared, but their synthesis performance or synthesis quality still makes commercial use difficult. Currently, the Griffin-Lim algorithm is widely used as the vocoder in speech synthesis methods.


Examples


Embodiment 1

[0039] Embodiment 1 provides a speech synthesis method. Target values of the spectrum and phase are obtained according to the text data, and the text data is converted into a text vector. A neural network model predicts the spectrum and phase, and an overall loss computed from the target and predicted values of the spectrum and phase constrains the update direction of the model parameters during training. The trained model then outputs a linear spectrum and an initial phase, which are input into the Griffin-Lim vocoder for training to obtain the corresponding audio signal. This method reduces the number of vocoder iterations and speeds up vocoder convergence without reducing audio quality, thereby accelerating real-time audio synthesis as a whole, and is suitable for using the Griffin-Lim algorithm as a ...
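The effect of supplying an initial phase to Griffin-Lim can be sketched as follows. This is a minimal NumPy illustration, not the patent's implementation: the helper names (`stft`, `istft`, `griffin_lim`, `spectral_error`), the frame and hop sizes, and the test chirp are all assumptions, and the "predicted" phase is simulated by adding small noise to the true phase because no trained model is available here.

```python
import numpy as np

N_FFT, HOP = 512, 128          # assumed frame length and hop size
PAD = N_FFT // 2
WINDOW = np.hanning(N_FFT)

def stft(x):
    """Zero-pad, frame, window, and FFT a 1-D signal -> (frames, bins)."""
    x = np.pad(x, PAD)
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP : i * HOP + N_FFT] for i in range(n_frames)])
    return np.fft.rfft(frames * WINDOW, axis=1)

def istft(spec):
    """Least-squares overlap-add inverse of `stft`, with the padding trimmed."""
    frames = np.fft.irfft(spec, n=N_FFT, axis=1) * WINDOW
    length = N_FFT + HOP * (len(spec) - 1)
    out, norm = np.zeros(length), np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * HOP : i * HOP + N_FFT] += frame
        norm[i * HOP : i * HOP + N_FFT] += WINDOW ** 2
    return (out / np.maximum(norm, 1e-8))[PAD:-PAD]

def griffin_lim(magnitude, init_phase, n_iter):
    """Keep the given magnitude fixed and iteratively re-estimate the phase."""
    spec = magnitude * np.exp(1j * init_phase)
    for _ in range(n_iter):
        phase = np.angle(stft(istft(spec)))
        spec = magnitude * np.exp(1j * phase)
    return istft(spec)

def spectral_error(audio, magnitude):
    """Relative mismatch between the audio's STFT magnitude and the target."""
    return np.linalg.norm(np.abs(stft(audio)) - magnitude) / np.linalg.norm(magnitude)

rng = np.random.default_rng(0)
t = np.arange(62 * HOP) / 16000.0
x = np.sin(2 * np.pi * (200.0 * t + 1500.0 * t ** 2))   # synthetic test chirp
true_spec = stft(x)
magnitude = np.abs(true_spec)

# Baseline: random initial phase. Stand-in for the model output: true phase plus noise.
random_phase = rng.uniform(-np.pi, np.pi, magnitude.shape)
predicted_phase = np.angle(true_spec) + 0.1 * rng.standard_normal(magnitude.shape)

audio_random = griffin_lim(magnitude, random_phase, n_iter=10)
audio_warm = griffin_lim(magnitude, predicted_phase, n_iter=10)
err_random = spectral_error(audio_random, magnitude)
err_warm = spectral_error(audio_warm, magnitude)
print("random init:", err_random, "predicted init:", err_warm)
```

With the same iteration budget, the run that starts from a phase close to the true one ends with a smaller spectral error, which is the intuition behind the claimed reduction in vocoder iterations.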

Embodiment 2

[0057] Embodiment 2 is an improvement on Embodiment 1. The linear spectrum target value and the phase target value are obtained according to the text data and are used in the loss function for training the neural network model; the text data is converted into a text vector that serves as the input of the neural network model.

[0058] Audio data matching the text data is obtained and subjected to a short-time Fourier transform (STFT) to obtain the linear spectrum target value and the phase target value. The STFT splits the long audio signal into frames and applies a window to each frame; each frame is Fourier transformed, and the per-frame results are stacked along another dimension to give the linear spectrum and phase of the audio data. These serve as the linear spectrum target value and phase target value, that is, the target-value parameters of the loss function used to train the neural network model. ...
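The framing, windowing, and per-frame transform described above can be sketched in NumPy. The function name `stft_targets`, the frame length, the hop size, and the 440 Hz test tone are assumptions chosen for illustration; the patent text shown here does not specify them.

```python
import numpy as np

def stft_targets(audio, n_fft=512, hop=128):
    """Frame, window, and Fourier-transform `audio`; return (magnitude, phase)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    # Framing: each row is one n_fft-sample slice, offset by `hop` samples.
    frames = np.stack([audio[i * hop : i * hop + n_fft] for i in range(n_frames)])
    # Windowing, then a real FFT per frame; rows are stacked along the time axis.
    spectrum = np.fft.rfft(frames * window, axis=1)
    return np.abs(spectrum), np.angle(spectrum)

sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440.0 * t)      # one second of a 440 Hz tone
mag, phase = stft_targets(audio)
print(mag.shape, phase.shape)
```

The magnitude array plays the role of the linear spectrum target value and the angle array the role of the phase target value; both have one row per frame and one column per frequency bin.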

Embodiment 3

[0064] Embodiment 3 is an improvement on Embodiment 1 and/or Embodiment 2. The neural network model of the speech synthesis method is the Tacotron model. The text vector is input into the Tacotron model to obtain a linear spectrum prediction value and a phase prediction value, which are used to calculate the loss function that constrains Tacotron training so that the spectrum and phase of the model are trained in the same direction.
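One way to combine the spectrum and phase terms into a single loss is sketched below. The text shown here does not give the exact form of the loss, so the L1 distance, the phase wrapping, and the `phase_weight` parameter are assumptions made only for illustration.

```python
import numpy as np

def overall_loss(spec_pred, spec_target, phase_pred, phase_target, phase_weight=1.0):
    """Hypothetical overall loss: L1 spectrum term plus weighted L1 phase term."""
    spec_loss = np.mean(np.abs(spec_pred - spec_target))
    # Wrap the phase difference into (-pi, pi] so that 0 and 2*pi count as equal.
    diff = np.angle(np.exp(1j * (phase_pred - phase_target)))
    phase_loss = np.mean(np.abs(diff))
    return spec_loss + phase_weight * phase_loss

spec_target = np.ones((4, 3))
phase_target = np.zeros((4, 3))
loss = overall_loss(spec_target + 0.5, spec_target, phase_target + 0.25, phase_target)
print(loss)
```

Because both terms enter one scalar, a gradient step on this loss updates the shared model parameters for spectrum and phase together, which is one reading of training them "in the same direction".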

[0065] The Tacotron model consists of an encoder, a decoder, and a post-processing network connected in sequence; the post-processing network consists of a CBHG unit and a fully connected layer. The text vector passes through the encoder and decoder in turn to obtain the mel spectrum. The CBHG unit computes mel spectrum features from the mel spectrum, and the fully connected layer computes from these features the linear spectrum prediction value and phase prediction with the same dim...



Abstract

The invention discloses a speech synthesis method and relates to the field of speech synthesis. The method comprises the following steps: acquiring text data, obtaining target values of a linear frequency spectrum and a phase according to the text data, and converting the text data into a text vector; inputting the text vector into a neural network model to obtain prediction values of the linear frequency spectrum and the phase, then calculating the overall loss for training the neural network model, and obtaining the linear frequency spectrum and the initial phase from the trained neural network model; and inputting the linear frequency spectrum and the initial phase into a Griffin-Lim vocoder for training to obtain an audio signal corresponding to the text data. By training the Griffin-Lim vocoder on the linear frequency spectrum and the initial phase, the method reduces the number of vocoder iterations, accelerates vocoder convergence, and speeds up real-time audio synthesis without reducing audio quality; it is suitable for a speech synthesis device that uses the Griffin-Lim algorithm as the vocoder. The invention further discloses a speech synthesis device, electronic equipment and a computer storage medium.

Description

Technical field

[0001] The invention relates to the field of speech synthesis, in particular to a speech synthesis method, device, electronic equipment and storage medium.

Background

[0002] Speech synthesis is a cutting-edge technology in the field of Chinese information processing. It decomposes the given text input into feature vectors by character or word, converts the feature vectors into audio features, and finally uses a vocoder to restore the audio features to the corresponding audio file output. With the introduction of WaveNet, LPCNet and other technologies, a number of speech synthesis methods that use neural networks as vocoders have appeared, but their synthesis performance or synthesis quality still makes commercial use difficult. Currently, the Griffin-Lim algorithm is widely used as the vocoder in speech synthesis methods. The Griffin-Lim algorithm is an iterative algorithm that uses the spectrum to ...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G10L13/02, G10L25/30
CPC: G10L13/02, G10L25/30
Inventor: 顾王一
Owner: 杭州博盾习言科技有限公司