A unified training method and system for speech synthesis and speech conversion

A speech conversion and speech synthesis technology, applied in speech synthesis, speech analysis, instruments, etc., that addresses the inability to learn a speaker-independent representation of speech content and thereby improves performance.

Active Publication Date: 2022-07-01

INST OF AUTOMATION CHINESE ACAD OF SCI

Problems solved by technology

The attention mechanism is often affected by speaker information, which prevents the model from learning a speaker-independent representation of speech content.

Embodiment example

[0107] The proposed unified TTS and VC training framework is shown in Figure 2, and the structure of each sub-module can be as shown in Figure 3. The number of feedforward transformer (FFT) blocks is 2 in the text encoder and 6 in the decoder module. In each FFT block, the dimension of the hidden state is 256. The kernel size of all 1D convolutions is set to 3. The dropout rate is set to 0.5. The dimension of the last linear layer in the decoder is 80, which matches the dimension of the Mel spectrogram. The size of the last linear layer in each encoder (text encoder, prosody information encoder, content information encoder) is 256. The Adam optimizer is used to update the parameters. The initial learning rate is 0.001, and the learning rate decays exponentially. In the inference stage, HiFi-GAN is used as the vocoder.
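The hyperparameters listed in paragraph [0107] can be collected into a small configuration object, shown here as a minimal Python sketch. The class name `UnifiedTtsVcConfig`, the field names, and the helper `exponential_lr` are hypothetical. The source states only that the learning rate decays exponentially from 0.001, so the decay factor `gamma` and the interval `decay_steps` used below are illustrative assumptions, not values from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnifiedTtsVcConfig:
    """Hyperparameters stated in paragraph [0107] (field names are hypothetical)."""

    text_encoder_fft_blocks: int = 2   # FFT blocks in the text encoder
    decoder_fft_blocks: int = 6        # FFT blocks in the decoder
    hidden_dim: int = 256              # hidden state size in each FFT block
    conv1d_kernel_size: int = 3        # kernel size of all 1D convolutions
    dropout: float = 0.5
    encoder_output_dim: int = 256      # last linear layer of text/prosody/content encoders
    mel_dim: int = 80                  # last linear layer of the decoder
    initial_lr: float = 1e-3           # initial learning rate for Adam
    vocoder: str = "HiFi-GAN"          # used at inference only


def exponential_lr(step: int, initial_lr: float, gamma: float, decay_steps: int) -> float:
    """Return initial_lr * gamma ** (step / decay_steps).

    gamma and decay_steps are assumptions; the source gives neither.
    """
    if step < 0 or decay_steps <= 0:
        raise ValueError("step must be >= 0 and decay_steps must be > 0")
    return initial_lr * gamma ** (step / decay_steps)


config = UnifiedTtsVcConfig()
# Illustrative schedule: halve the learning rate every 10,000 steps (assumed values).
lr_start = exponential_lr(0, config.initial_lr, gamma=0.5, decay_steps=10_000)
lr_later = exponential_lr(10_000, config.initial_lr, gamma=0.5, decay_steps=10_000)
```

With these assumed values the schedule starts at the stated 0.001 and reaches half of that after 10,000 steps; any other positive `gamma` below 1 would also count as exponential decay.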

[0108] In addition, a separate duration model must be trained, as is common in speech synthesis tasks, and the example ...

Abstract

The present invention provides a unified training method and system for speech synthesis and speech conversion. The method decouples the encoding task of speech synthesis and speech conversion into three sub-tasks: extraction of content information, extraction of speaker information, and extraction of prosody information. The content information is linguistic information that is independent of the speaker; the speaker information comprises the characteristics of the speaker; and the prosody information represents how the speaker utters the content, reflecting the rhythm of the speech. The extracted content, speaker, and prosody information are input into the decoding task to obtain the restored speech. The proposed solution unifies the speech synthesis and speech conversion models, avoids the difficulty of building them independently, and improves the performance of both by using unlabeled speech.
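The data flow described in the abstract, three separately extracted representations feeding one decoder, can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: the function name `combine`, the use of plain lists as feature frames, and in particular the fusion rule (element-wise addition with the utterance-level speaker vector broadcast to every frame). The abstract does not say how the three representations are merged before decoding.

```python
from typing import List

Frame = List[float]  # one frame-level feature vector


def combine(content: List[Frame], prosody: List[Frame], speaker: Frame) -> List[Frame]:
    """Fuse content, prosody, and speaker representations into decoder input.

    Assumption: element-wise addition, with the single speaker vector
    broadcast across all frames. The source does not specify the fusion.
    """
    if len(content) != len(prosody):
        raise ValueError("content and prosody must have the same number of frames")
    fused = []
    for c_frame, p_frame in zip(content, prosody):
        if not (len(c_frame) == len(p_frame) == len(speaker)):
            raise ValueError("all representations must share one feature dimension")
        fused.append([c + p + s for c, p, s in zip(c_frame, p_frame, speaker)])
    return fused


# Toy values standing in for encoder outputs (2 frames, 2 dimensions).
content = [[1.0, 2.0], [3.0, 4.0]]   # speaker-independent linguistic content
prosody = [[0.5, 0.5], [0.5, 0.5]]   # how the content is spoken
speaker = [10.0, 20.0]               # characteristics of the target speaker

decoder_input = combine(content, prosody, speaker)
# decoder_input == [[11.5, 22.5], [13.5, 24.5]]
```

Under this reading, the content frames could come from text (speech synthesis) or from source speech (speech conversion) while the rest of the pipeline stays the same, which is the sense in which the two tasks share one model.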

Description

Technical field

[0001] The invention belongs to the technical field of voice cloning, and in particular relates to a unified training method and system for speech synthesis and speech conversion.

Background art

[0002] Cloning a target speaker's voice is an attractive technology that can be applied in various scenarios, such as entertainment creation, personalized mobile assistants, and security. The ideal voice cloning operation takes a single, previously unseen utterance of the target speaker as a reference and can then synthesize any speech in that speaker's voice; this is called single-sample voice cloning. In speech research, speech synthesis and speech conversion are the two mainstream ways to realize voice cloning. The two technologies were previously researched and developed as separate tasks.

[0003] TTS (text-to-speech): speech synthesis;

[0004] VC (voice conversion): voice ...

Application Information

Patent Type & Authority: Patents (China)

IPC(8): G10L13/02, G10L13/027, G10L13/08

CPC: G10L13/02, G10L13/027, G10L13/08

Inventor: 陶建华, 汪涛, 易江燕, 傅睿博, 张震

Owner: INST OF AUTOMATION CHINESE ACAD OF SCI
