Multi-speaker voice synthesis method based on variational auto-encoder

A voice synthesis technology based on autoencoders, applied in speech synthesis, speech analysis, instruments, etc. It addresses the problems of high recording cost, low efficiency, and the limitation that only a single speaker's voice can be synthesized.

Pending Publication Date: 2021-01-29
INST OF ACOUSTICS CHINESE ACAD OF SCI +1

AI Technical Summary

Problems solved by technology

[0003] Traditional speech synthesis algorithms must record a single-speaker sound library with relatively comprehensive phoneme coverage to ensure that speech can be synthesized from arbitrary text. This leads to high recording costs and low efficiency, and only the speech of a single speaker can be synthesized.




Embodiment Construction

[0070] The technical scheme of the present invention is further described below with reference to the accompanying drawings.

[0071] The present invention proposes a multi-speaker speech synthesis method based on a variational autoencoder, comprising a training stage and a synthesis stage.

[0072] As shown in Figure 1, the training stage includes:

[0073] Step 101) Extract frame-level acoustic parameters, phoneme-level duration parameters, and frame-level and phoneme-level linguistic features from recorded speech signals containing multiple speakers, and normalize each of them.

[0074] The frame-level acoustic parameters have 187 dimensions in total, including: 60-dimensional Mel-cepstral coefficients and their first- and second-order differences, a 1-dimensional fundamental frequency parameter and its first- and second-order differences, a 1-dimensional aperiodicity parameter and its first- and second-order differences, 1-dimensi...
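The feature preparation in steps [0073]–[0074] (static parameters stacked with their first- and second-order differences, then normalized) can be sketched as follows. This is a minimal illustration with numpy; the delta window, the normalization scheme (per-dimension z-score is assumed here), and all function names are illustrative, not specified by the patent.

```python
import numpy as np

def deltas(x):
    """First-order differences along time via central differences,
    keeping the frame count by edge padding. (The patent does not
    specify the exact delta computation; this is one common choice.)"""
    padded = np.pad(x, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

def stack_with_deltas(static):
    """Stack static features with first- and second-order differences,
    tripling the dimensionality (e.g. 60-dim MCEP -> 180 dims)."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return np.concatenate([static, d1, d2], axis=1)

def zscore_normalize(feats, eps=1e-8):
    """Per-dimension mean/variance normalization over the data
    (an assumed normalization; the patent only says 'normalize')."""
    mean = feats.mean(axis=0)
    std = feats.std(axis=0)
    return (feats - mean) / (std + eps), mean, std

# Example: 100 frames of 60-dim Mel-cepstral coefficients
mcep = np.random.randn(100, 60)
full = stack_with_deltas(mcep)            # shape (100, 180)
normed, mean, std = zscore_normalize(full)
```

The saved `mean` and `std` would also be needed at synthesis time to denormalize the predicted parameters before vocoding.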



Abstract

The invention discloses a multi-speaker voice synthesis method based on a variational auto-encoder. The method comprises the following steps: extracting phoneme-level duration parameters and frame-level acoustic parameters from clean speech of the speaker to be synthesized; inputting the normalized phoneme-level duration parameters into a first variational auto-encoder, which outputs a duration speaker label; inputting the normalized frame-level acoustic parameters into a second variational auto-encoder, which outputs an acoustic speaker label; extracting frame-level and phoneme-level linguistic features from the speech signals to be synthesized, which include multiple speakers; inputting the duration speaker label and the normalized phoneme-level linguistic features into a duration prediction network, which outputs the predicted duration of the current phoneme; obtaining the frame-level linguistic features of the phoneme from the predicted duration, and inputting these features together with the acoustic speaker label into an acoustic parameter prediction network, which outputs the normalized acoustic parameters of the predicted speech; and inputting the normalized predicted acoustic parameters into a vocoder, which outputs the synthesized speech signal.
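The data flow described in the abstract (two VAEs producing speaker labels, then a duration network and an acoustic network conditioned on those labels) can be sketched with stub networks. Only the 187-dimensional acoustic vector comes from the patent text; all other dimensions, the linear stand-in networks, and the variable names are assumptions for illustration — the real encoders and prediction networks are trained neural models.

```python
import numpy as np

rng = np.random.default_rng(0)

def vae_encode(features, W_mu):
    """Stand-in for a VAE encoder's mean branch: the speaker label is
    taken as the latent mean, averaged over time so a single vector
    characterizes the utterance. (A linear map replaces the real network.)"""
    return (features @ W_mu).mean(axis=0)

# Hypothetical sizes, not taken from the patent (except acoustic_dim).
T_frames, T_phones = 200, 40
acoustic_dim, duration_dim, ling_dim, latent_dim = 187, 1, 32, 8

# Stage 1: speaker labels from the target speaker's clean speech.
dur_label = vae_encode(rng.standard_normal((T_phones, duration_dim)),
                       rng.standard_normal((duration_dim, latent_dim)))
ac_label = vae_encode(rng.standard_normal((T_frames, acoustic_dim)),
                      rng.standard_normal((acoustic_dim, latent_dim)))

def predict(ling, label, W):
    """Stub prediction network: linguistic features concatenated with
    the (broadcast) speaker label, mapped to the output parameters."""
    inp = np.concatenate([ling, np.tile(label, (len(ling), 1))], axis=1)
    return inp @ W

# Stage 2: phoneme durations from phoneme-level linguistic features.
phone_ling = rng.standard_normal((T_phones, ling_dim))
W_dur = rng.standard_normal((ling_dim + latent_dim, 1))
pred_dur = predict(phone_ling, dur_label, W_dur)      # (T_phones, 1)

# Stage 3: acoustic parameters from frame-level linguistic features
# (the frame count would follow from the predicted durations).
frame_ling = rng.standard_normal((T_frames, ling_dim))
W_ac = rng.standard_normal((ling_dim + latent_dim, acoustic_dim))
pred_acoustic = predict(frame_ling, ac_label, W_ac)   # fed to a vocoder
```

The design point the abstract makes is that the speaker labels are learned latent vectors rather than hand-assigned IDs, so no manual speaker annotation is needed at synthesis time.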

Description

Technical field

[0001] The invention relates to a speech synthesis method, and in particular to a multi-speaker speech synthesis method based on a variational autoencoder.

Background technique

[0002] Speech synthesis converts input text into speech and is an important research topic in the field of human-computer interaction.

[0003] Traditional speech synthesis algorithms must record a single-speaker sound library with relatively comprehensive phoneme coverage to ensure that speech can be synthesized from arbitrary text, which leads to high recording costs, low efficiency, and the limitation that only a single speaker's voice can be synthesized. Multi-speaker speech synthesis supports parallel recording of speech by different speakers and can synthesize the voices of different speakers. Traditional multi-speaker speech synthesis often needs to obtain the speaker information of the current speech and manually mark a speaker la...

Claims


Application Information

Patent Type & Authority: Application (China)
IPC(8): G10L13/08, G10L13/10, G10L25/03, G10L25/27
CPC: G10L13/08, G10L13/086, G10L13/10, G10L25/03, G10L25/27
Inventor: 张鹏远, 蒿晓阳, 颜永红
Owner: INST OF ACOUSTICS CHINESE ACAD OF SCI