Speech synthesis method, neural network model training method, and speech synthesis model
A speech synthesis technology applied in the field of neural networks. It addresses problems such as robustness, very large database requirements, and synthesis time and speed, and achieves the effect of improving richness.
Examples
Embodiment 1
[0027] In this embodiment, a new neural network model (also called a speech synthesis model) is provided that can synthesize target speech with an accent (that is, speech other than Mandarin Chinese). To facilitate understanding, the speech synthesis model is described before the implementation of the speech synthesis method.
[0028] Referring to Figure 1A, a schematic diagram of a speech synthesis model is shown. The model includes an encoder, a decoder, and a vocoder.
[0029] The encoder (i.e., the encoder shown in Figure 1A) is used to predict speech features and phonetic posteriorgrams from the phoneme vectors of the text to be synthesized, and the phonetic posteriorgrams carry accent information.
[0030] The speech features may include the fundamental frequency (F0) and energy information (energy) of each phoneme in the target speech to be synthesized, but are not limited thereto. The encoder of this embodiment can not only predict fundam...
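The encoder outputs described above can be pictured with a minimal sketch. It assumes the phonetic posteriorgram (the "posterior graph" in the text) is modeled as one probability distribution per phoneme. The names `EncoderOutput` and `toy_encoder`, the number of classes, and all arithmetic are hypothetical placeholders, not the trained encoder of this embodiment.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class EncoderOutput:
    """Hypothetical container for what the encoder predicts per phoneme."""
    f0: List[float]         # fundamental frequency (F0), one value per phoneme
    energy: List[float]     # energy, one value per phoneme
    ppg: List[List[float]]  # one posterior distribution per phoneme


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def toy_encoder(phoneme_vectors: List[List[float]],
                num_classes: int = 4) -> EncoderOutput:
    """Stand-in for a trained encoder (assumption: placeholder formulas only)."""
    f0, energy, ppg = [], [], []
    for vec in phoneme_vectors:
        mean = sum(vec) / len(vec)
        f0.append(100.0 + 50.0 * abs(mean))     # placeholder F0 value
        energy.append(sum(v * v for v in vec))  # placeholder energy value
        scores = [mean * (k + 1) for k in range(num_classes)]
        ppg.append(softmax(scores))             # posterior over classes
    return EncoderOutput(f0, energy, ppg)


out = toy_encoder([[0.1, 0.2], [0.5, -0.3], [0.0, 0.9]])
```

In a real model the posterior rows would come from a learned network. The sketch only shows the shape of the three outputs that the decoder would consume.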
Embodiment 2
[0074] Referring to Figure 5, a schematic flowchart of the steps of a method for training a neural network model according to Embodiment 2 of the present application is shown.
[0075] This method is used for training the aforementioned speech synthesis model, and it comprises the following steps:
[0076] Step S502: Train the speech synthesis model using audio samples corresponding to a first accent to obtain an initially trained speech synthesis model.
[0077] Here, the first accent may be Mandarin, or another accent type with a relatively large sample size (that is, a relatively long total audio duration).
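One possible way to choose the first accent for step S502 is sketched below, assuming each audio sample is tagged with an accent label and a duration in seconds. Picking the accent with the longest total duration is an illustrative rule only, since the text merely requires a relatively large sample size. The function name `pick_first_accent` and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def pick_first_accent(samples: List[Tuple[str, float]]) -> str:
    """Return the accent label with the longest total audio duration."""
    totals: Dict[str, float] = defaultdict(float)
    for accent, seconds in samples:
        totals[accent] += seconds
    return max(totals, key=lambda accent: totals[accent])


# Hypothetical corpus: (accent label, duration in seconds)
samples = [
    ("mandarin", 3600.0),
    ("mandarin", 5400.0),
    ("accent_a", 600.0),
    ("accent_b", 900.0),
]

first_accent = pick_first_accent(samples)
# Samples that would be used for the initial training of step S502
pretrain_set = [s for s in samples if s[0] == first_accent]
```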
[0078] In this embodiment, audio samples of non-Mandarin accents are insufficient, which makes it difficult to train a usable accented speech synthesis model from those samples alone. To address this problem, Mandarin audio samples, which have a larger sample size, are used to train the speech synthesis model to obtain th...
Embodiment 3
[0108] Referring to Figure 8, a structural block diagram of the speech synthesis device according to Embodiment 3 of the present application is shown.
[0109] In this embodiment, the device includes:
[0110] An obtaining module 802, configured to obtain the phoneme vector of the text to be synthesized;
[0111] A prediction module 804, configured to predict, from the phoneme vector, the speech features and the phonetic posteriorgram corresponding to each phoneme, the phonetic posteriorgram carrying accent information;
[0112] A generation module 806, configured to generate a speech spectrum according to the speech features and the phonetic posteriorgram;
[0113] A synthesis module 808, configured to output a target speech corresponding to the text to be synthesized based on the speech spectrum, where the accent of the target speech matches the accent information.
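The data flow between modules 802 to 808 can be sketched as four callables chained in order. This is a minimal illustration under stated assumptions: the class name `SpeechSynthesisDevice`, the type aliases, and the stub functions are hypothetical, and the stubs use placeholder arithmetic where a real device would use trained networks.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vector = List[float]


@dataclass
class SpeechSynthesisDevice:
    """Hypothetical wiring of the four modules listed above."""
    obtain: Callable[[str], List[Vector]]                                 # module 802
    predict: Callable[[List[Vector]], Tuple[List[Vector], List[Vector]]]  # module 804
    generate: Callable[[List[Vector], List[Vector]], List[Vector]]        # module 806
    synthesize: Callable[[List[Vector]], List[float]]                     # module 808

    def run(self, text: str) -> List[float]:
        phonemes = self.obtain(text)                    # text -> phoneme vectors
        features, posteriors = self.predict(phonemes)   # features + posteriors
        spectrum = self.generate(features, posteriors)  # speech spectrum
        return self.synthesize(spectrum)                # target speech samples


# Placeholder stubs (assumptions, not real models)
device = SpeechSynthesisDevice(
    obtain=lambda text: [[float(ord(c))] for c in text],
    predict=lambda ph: ([[v[0] * 0.5] for v in ph], [[1.0] for _ in ph]),
    generate=lambda feats, posts: [f + p for f, p in zip(feats, posts)],
    synthesize=lambda spec: [sum(frame) for frame in spec],
)

waveform = device.run("ab")
```

Keeping each module behind a plain callable means a stub can be swapped for a trained encoder, decoder, or vocoder without changing `run`.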
[0114] Optionally, the speech features include the fundamental frequency corresponding to each phoneme and the energy ...


