Multi-speaker voice synthesis method based on variational auto-encoder

A voice synthesis technology based on autoencoders, applied in speech synthesis, speech analysis, instruments, etc. It addresses the problems of high recording cost, low efficiency, and the limitation that only a single speaker's voice can be synthesized.

Pending Publication Date: 2021-01-29
INST OF ACOUSTICS CHINESE ACAD OF SCI +1

AI Technical Summary

Problems solved by technology

[0003] Traditional speech synthesis algorithms must record a single-speaker sound library with relatively comprehensive phoneme coverage to ensure that speech can be synthesized from arbitrary text. This leads to high recording costs and low efficiency, and only the speech of a single speaker can be synthesized.




Embodiment Construction

[0070] The technical scheme of the present invention is further described below with reference to the accompanying drawings.

[0071] The present invention proposes a multi-speaker speech synthesis method based on a variational autoencoder, comprising a training stage and a synthesis stage.

[0072] As shown in Figure 1, the training stage includes:

[0073] Step 101) Extract frame-level acoustic parameters, phoneme-level duration parameters, and frame-level and phoneme-level linguistic features from recorded speech signals containing multiple speakers, and normalize each of them.

[0074] The frame-level acoustic parameters have 187 dimensions in total, including: 60-dimensional Mel-cepstral coefficients and their first- and second-order differences, a 1-dimensional fundamental frequency parameter and its first- and second-order differences, a 1-dimensional aperiodicity parameter and its first- and second-order differences, 1-dimensi...
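The feature preparation in steps [0073]–[0074] (static parameters stacked with their first- and second-order differences, then normalized) can be sketched as follows. This is a minimal illustration with numpy; the delta window, the normalization scheme (per-dimension z-score is assumed here), and all function names are illustrative, not specified by the patent.

```python
import numpy as np

def deltas(x):
    """First-order differences along time via central differences,
    keeping the frame count by edge padding. (The patent does not
    specify the exact delta computation; this is one common choice.)"""
    padded = np.pad(x, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

def stack_with_deltas(static):
    """Stack static features with first- and second-order differences,
    tripling the dimensionality (e.g. 60-dim MCEP -> 180 dims)."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return np.concatenate([static, d1, d2], axis=1)

def zscore_normalize(feats, eps=1e-8):
    """Per-dimension mean/variance normalization over the data
    (an assumed normalization; the patent only says 'normalize')."""
    mean = feats.mean(axis=0)
    std = feats.std(axis=0)
    return (feats - mean) / (std + eps), mean, std

# Example: 100 frames of 60-dim Mel-cepstral coefficients
mcep = np.random.randn(100, 60)
full = stack_with_deltas(mcep)            # shape (100, 180)
normed, mean, std = zscore_normalize(full)
```

The saved `mean` and `std` would also be needed at synthesis time to denormalize the predicted parameters before vocoding.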



Abstract

The invention discloses a multi-speaker voice synthesis method based on a variational auto-encoder. The method comprises the following steps: extracting phoneme-level duration parameters and frame-level acoustic parameters from clean speech of the speaker to be synthesized; inputting the normalized phoneme-level duration parameters into a first variational auto-encoder, which outputs a duration speaker label; inputting the normalized frame-level acoustic parameters into a second variational auto-encoder, which outputs an acoustic speaker label; extracting frame-level and phoneme-level linguistic features from the speech signals to be synthesized, which include multiple speakers; inputting the duration speaker label and the normalized phoneme-level linguistic features into a duration prediction network, which outputs the predicted duration of the current phoneme; obtaining the frame-level linguistic features of the phoneme from the predicted duration, and inputting these features together with the acoustic speaker label into an acoustic parameter prediction network, which outputs the normalized acoustic parameters of the predicted speech; and inputting the normalized predicted acoustic parameters into a vocoder, which outputs the synthesized speech signal.
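The data flow described in the abstract (two VAEs producing speaker labels, then a duration network and an acoustic network conditioned on those labels) can be sketched with stub networks. Only the 187-dimensional acoustic vector comes from the patent text; all other dimensions, the linear stand-in networks, and the variable names are assumptions for illustration — the real encoders and prediction networks are trained neural models.

```python
import numpy as np

rng = np.random.default_rng(0)

def vae_encode(features, W_mu):
    """Stand-in for a VAE encoder's mean branch: the speaker label is
    taken as the latent mean, averaged over time so a single vector
    characterizes the utterance. (A linear map replaces the real network.)"""
    return (features @ W_mu).mean(axis=0)

# Hypothetical sizes, not taken from the patent (except acoustic_dim).
T_frames, T_phones = 200, 40
acoustic_dim, duration_dim, ling_dim, latent_dim = 187, 1, 32, 8

# Stage 1: speaker labels from the target speaker's clean speech.
dur_label = vae_encode(rng.standard_normal((T_phones, duration_dim)),
                       rng.standard_normal((duration_dim, latent_dim)))
ac_label = vae_encode(rng.standard_normal((T_frames, acoustic_dim)),
                      rng.standard_normal((acoustic_dim, latent_dim)))

def predict(ling, label, W):
    """Stub prediction network: linguistic features concatenated with
    the (broadcast) speaker label, mapped to the output parameters."""
    inp = np.concatenate([ling, np.tile(label, (len(ling), 1))], axis=1)
    return inp @ W

# Stage 2: phoneme durations from phoneme-level linguistic features.
phone_ling = rng.standard_normal((T_phones, ling_dim))
W_dur = rng.standard_normal((ling_dim + latent_dim, 1))
pred_dur = predict(phone_ling, dur_label, W_dur)      # (T_phones, 1)

# Stage 3: acoustic parameters from frame-level linguistic features
# (the frame count would follow from the predicted durations).
frame_ling = rng.standard_normal((T_frames, ling_dim))
W_ac = rng.standard_normal((ling_dim + latent_dim, acoustic_dim))
pred_acoustic = predict(frame_ling, ac_label, W_ac)   # fed to a vocoder
```

The design point the abstract makes is that the speaker labels are learned latent vectors rather than hand-assigned IDs, so no manual speaker annotation is needed at synthesis time.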

Description

Technical field

[0001] The invention relates to a speech synthesis method, and in particular to a multi-speaker speech synthesis method based on a variational autoencoder.

Background technique

[0002] Speech synthesis converts input text into speech and is an important research topic in the field of human-computer interaction.

[0003] Traditional speech synthesis algorithms must record a single-speaker sound library with relatively comprehensive phoneme coverage to ensure that speech can be synthesized from arbitrary text, which leads to high recording costs, low efficiency, and the limitation that only a single speaker's voice can be synthesized. Multi-speaker speech synthesis supports parallel recording of speech by different speakers and can synthesize the voices of different speakers. Traditional multi-speaker speech synthesis often needs to obtain the speaker information of the current speech and manually mark a speaker la...

Claims


Application Information

Patent Type & Authority: Application (China)
IPC(8): G10L13/08, G10L13/10, G10L25/03, G10L25/27
CPC: G10L13/08, G10L13/086, G10L13/10, G10L25/03, G10L25/27
Inventor: 张鹏远, 蒿晓阳, 颜永红
Owner: INST OF ACOUSTICS CHINESE ACAD OF SCI