Speech synthesis method, neural network model training method, and speech synthesis model

A speech synthesis technology, applied in the field of neural networks, that addresses the huge database requirements of existing models and their difficulty in synthesizing accented speech, to achieve the effect of improving richness.

Active Publication Date: 2022-05-10
ALIBABA DAMO (HANGZHOU) TECH CO LTD
14 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0002] At present, end-to-end models based on neural networks are constantly improving, and the modeling ability of speech synthesis models keeps increasing, which makes speech synthesis faster and more robust and makes the synthesized speech sound increasingly natural. However, existing speech synthesis models require a huge database and a large amount of computing resources. In addition, in daily life, owing to geographic influence, dialects with heavy accents are widely used, yet existing speech synthesis models have difficulty synthesizing speech audio with accents.

Examples

Embodiment 1

[0027] In this embodiment, a new neural network model (also called a speech synthesis model) capable of synthesizing target speech with an accent (that is, non-Mandarin Chinese) is provided. To facilitate understanding, the speech synthesis model is described before the implementation process of the speech synthesis method.

[0028] Referring to Figure 1A, a schematic diagram of a speech synthesis model is shown. The model includes an encoder, a decoder, and a vocoder.

[0029] The encoder (i.e., the encoder shown in Figure 1A) is used to predict speech features and speech posterior graphs from the phoneme vectors of the text to be synthesized, and the speech posterior graphs carry accent information.

[0030] The speech features may include the fundamental frequency (F0) and energy information (energy) of each phoneme in the target speech to be synthesized, but are not limited thereto. The encoder of this embodiment can not only predict fundam...
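The encoder outputs described in [0029] and [0030] can be pictured with a minimal sketch. It is illustrative only and is not the model in the application: `EncoderOutput`, `toy_encoder`, and `num_classes` are hypothetical names, the arithmetic stands in for learned weights, and the sketch assumes that the speech posterior graph can be represented as one row of class probabilities per phoneme.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class EncoderOutput:
    f0: List[float]               # one fundamental-frequency value per phoneme
    energy: List[float]           # one energy value per phoneme
    posterior: List[List[float]]  # assumed layout: one probability row per phoneme


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def toy_encoder(phoneme_vectors: List[List[float]], num_classes: int = 4) -> EncoderOutput:
    """Hypothetical stand-in for the encoder: fixed arithmetic, no learned weights."""
    f0, energy, posterior = [], [], []
    for vec in phoneme_vectors:
        f0.append(100.0 + 10.0 * sum(vec))      # toy F0 prediction
        energy.append(sum(v * v for v in vec))  # toy energy prediction
        scores = [sum(v * (c + 1) for v in vec) for c in range(num_classes)]
        posterior.append(softmax(scores))       # toy posterior row
    return EncoderOutput(f0, energy, posterior)


out = toy_encoder([[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]])
```

The point of the sketch is the shape of the output: every phoneme gets its own F0 value, energy value, and posterior row, which is the information a downstream decoder would consume.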

Embodiment 2

[0074] Referring to Figure 5, a schematic flowchart of the steps of a method for training a neural network model according to Embodiment 2 of the present application is shown.

[0075] This method is used for training the aforementioned speech synthesis model, and it comprises the following steps:

[0076] Step S502: Use the audio samples corresponding to the first accent to train the speech synthesis model, obtaining an initially trained speech synthesis model.

[0077] The first accent can be Mandarin, or another accent type with a relatively large sample size (that is, a relatively long audio duration).

[0078] In this embodiment, audio samples of non-Mandarin accents are scarce, which makes it difficult to train a usable accented speech synthesis model on them alone. To solve this problem, audio samples of Mandarin, which has a larger sample size, are used to train the speech synthesis model to obtain th...
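Step S502 alone can be sketched as an ordinary training loop over the larger first-accent data set. The sketch below is a toy under stated assumptions: the "model" is a single weight fitted by gradient descent, `train`, `first_accent_samples`, and `initial_weight` are hypothetical names, and the numeric pairs stand in for real audio samples. Later steps of the method are cut off in this excerpt and are not shown.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (input, target) pair standing in for an audio sample


def train(weight: float, samples: List[Sample], lr: float = 0.1, epochs: int = 200) -> float:
    """Fit y = weight * x by gradient descent on the mean squared error."""
    for _ in range(epochs):
        grad = sum(2 * (weight * x - y) * x for x, y in samples) / len(samples)
        weight -= lr * grad
    return weight


# Hypothetical "first accent" data: plentiful samples that follow y = 2x.
first_accent_samples = [(x / 10, 2.0 * x / 10) for x in range(1, 11)]

# Step S502 (toy version): start from an untrained weight, obtain an initial model.
initial_weight = train(0.0, first_accent_samples)
```

The returned weight plays the role of the initially trained speech synthesis model: it is the starting point that any further training would build on.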

Embodiment 3

[0108] Referring to Figure 8, a structural block diagram of the speech synthesis device according to Embodiment 3 of the present application is shown.

[0109] In this embodiment, the device includes:

[0110] An obtaining module 802, configured to obtain the phoneme vector of the text to be synthesized;

[0111] A prediction module 804, configured to predict, from the phoneme vector, the speech features and the speech posterior graph corresponding to each phoneme, where the speech posterior graph carries accent information;

[0112] A generation module 806, configured to generate a speech spectrum according to the speech features and the speech posterior graph;

[0113] A synthesis module 808, configured to output a target speech corresponding to the text to be synthesized based on the speech spectrum, where the accent of the target speech matches the accent information.

[0114] Optionally, the speech features include the fundamental frequency corresponding to each phoneme and the energy ...
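The data flow between modules 802, 804, 806, and 808 can be sketched as four chained functions. This is a hedged illustration, not the device in the application: the function names, the `PHONEME_IDS` table, the one-hot phoneme encoding, and all of the arithmetic are assumptions made for the example, and the sketch starts from a phoneme sequence rather than raw text.

```python
from typing import Dict, List, Tuple

# Hypothetical phoneme table; the text does not give a phoneme inventory.
PHONEME_IDS: Dict[str, int] = {"n": 0, "i": 1, "h": 2, "ao": 3}


def obtain_phoneme_vectors(phonemes: List[str]) -> List[List[float]]:
    """Obtaining module 802: one-hot phoneme vectors (assumed encoding)."""
    size = len(PHONEME_IDS)
    return [[1.0 if PHONEME_IDS[p] == i else 0.0 for i in range(size)] for p in phonemes]


def predict(vectors: List[List[float]]) -> Tuple[List[Tuple[float, float]], List[List[float]]]:
    """Prediction module 804: toy (F0, energy) pair and posterior row per phoneme."""
    features = [(100.0 + 20.0 * v.index(1.0), 1.0) for v in vectors]
    posteriors = [list(v) for v in vectors]  # toy rows with all mass on one class
    return features, posteriors


def generate_spectrum(features, posteriors) -> List[List[float]]:
    """Generation module 806: one toy spectrum frame per phoneme."""
    return [[f0 * energy * p for p in row] for (f0, energy), row in zip(features, posteriors)]


def synthesize(spectrum: List[List[float]]) -> List[float]:
    """Synthesis module 808: flatten the frames into toy waveform samples."""
    return [value for frame in spectrum for value in frame]


vectors = obtain_phoneme_vectors(["n", "i", "h", "ao"])
features, posteriors = predict(vectors)
spectrum = generate_spectrum(features, posteriors)
waveform = synthesize(spectrum)
```

Each function consumes only the output of the previous one, mirroring the order in which the modules are listed: phoneme vectors, then speech features and posterior graph, then speech spectrum, then target speech.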

Abstract

The embodiments of the invention provide a speech synthesis method, a neural network model training method, and a speech synthesis model. The speech synthesis method comprises the following steps: acquiring a phoneme vector of a text to be synthesized; predicting, from the phoneme vector, a speech feature and a speech posterior graph corresponding to each phoneme, wherein the speech posterior graph carries accent information; generating a speech spectrum according to the speech features and the speech posterior graph; and outputting a target speech corresponding to the text to be synthesized based on the speech spectrum, wherein the accent of the target speech matches the accent information. The method can be used for synthesizing speech with a heavy accent.

Description

Technical field

[0001] The embodiments of the present application relate to the technical field of neural networks, and in particular to a speech synthesis method, a neural network model training method, and a speech synthesis model.

Background

[0002] At present, end-to-end models based on neural networks are constantly improving, and the modeling ability of speech synthesis models keeps increasing, which makes speech synthesis faster and more robust and makes the synthesized speech sound increasingly natural. However, existing speech synthesis models require a huge database and a large amount of computing resources. In addition, in daily life, owing to geographic influence, dialects with heavy accents are widely used, yet existing speech synthesis models have difficulty synthesizing speech audio with accents.

Contents of the invention

[0003] In view of this, an embodiment of the pre...

Application Information

IPC(8): G10L13/08, G10L13/027, G10L25/30
CPC: G10L13/08, G10L13/027, G10L25/30
Inventor: 柴萌鑫, 林羽钦, 黄智颖
Owner: ALIBABA DAMO (HANGZHOU) TECH CO LTD