A unified training method and system for speech synthesis and speech conversion

A speech synthesis and speech conversion technology, applied in speech synthesis, speech analysis, instruments, etc. It addresses the problem that a speaker-independent representation of speech content cannot be learned, and it improves the performance of both tasks.

Active Publication Date: 2022-07-01
INST OF AUTOMATION CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

The attention mechanism is often affected by speaker information, so it cannot learn a speaker-independent representation of speech content.

Embodiment example

[0107] The proposed unified TTS and VC training framework is shown in Figure 2, and the structure of each sub-module is shown in Figure 3. The text encoder uses 2 feed-forward Transformer (FFT) blocks and the decoder module uses 6. In each FFT block, the hidden-state dimension is 256. The kernel size of all 1D convolutions is 3, and the dropout rate is 0.5. The last linear layer in the decoder has dimension 80, matching the dimension of the Mel spectrogram. The last linear layer in each encoder (text encoder, prosody information encoder, content information encoder) has size 256. The Adam optimizer is used to update the parameters; the initial learning rate is 0.001 and decays exponentially. In the inference stage, HiFi-GAN is used as the vocoder.
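The hyperparameters in paragraph [0107] can be collected into a small configuration object, as in the minimal sketch below. The class name `ModelConfig`, its field names, and the helper `exponential_lr` are hypothetical and chosen only for illustration. The source states that the learning rate decays exponentially but gives neither the decay rate nor the decay interval, so `decay_rate` and `decay_steps` are assumptions that the caller must supply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hyperparameters stated in paragraph [0107]; field names are hypothetical."""
    text_encoder_fft_blocks: int = 2   # FFT blocks in the text encoder
    decoder_fft_blocks: int = 6        # FFT blocks in the decoder
    hidden_dim: int = 256              # hidden-state dimension of each FFT block
    conv1d_kernel_size: int = 3        # kernel size of all 1D convolutions
    dropout: float = 0.5               # dropout rate
    encoder_output_dim: int = 256      # last linear layer of each encoder
    n_mels: int = 80                   # last linear layer of the decoder (Mel bins)
    initial_lr: float = 1e-3           # initial learning rate for Adam


def exponential_lr(initial_lr: float, step: int,
                   decay_rate: float, decay_steps: int) -> float:
    """Exponentially decayed learning rate.

    decay_rate and decay_steps are NOT given in the source; they are
    assumptions supplied by the caller.
    """
    return initial_lr * decay_rate ** (step / decay_steps)


cfg = ModelConfig()
# Example schedule with assumed values: halve the rate every 1000 steps.
lr_at_start = exponential_lr(cfg.initial_lr, 0, decay_rate=0.5, decay_steps=1000)
lr_after_1000 = exponential_lr(cfg.initial_lr, 1000, decay_rate=0.5, decay_steps=1000)
```

Keeping the configuration frozen makes it harder to change a value accidentally between training and inference, which matters here because the decoder output size must stay equal to the number of Mel bins.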

[0108] In addition, a separate duration model must be trained, as is common in speech synthesis tasks, and the example ...

Abstract

The present invention provides a unified training method and system for speech synthesis and speech conversion. The method includes: decoupling the encoding task of speech synthesis and speech conversion into three sub-tasks, namely the extraction of content information, the extraction of speaker information, and the extraction of prosody information. The content information is linguistic information that is independent of the speaker; the speaker information comprises the characteristics of the speaker; the prosody information represents how the speaker utters the content and reflects the rhythm of the speech. The extracted content information, speaker information, and prosody information are input into the decoding task to obtain the restored speech. The proposed solution unifies the speech synthesis and speech conversion models, avoids the difficulty of building them independently, and improves the performance of both speech synthesis and speech conversion by using unlabeled speech.
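The data flow described in the abstract can be illustrated with a toy sketch in which the three kinds of information are extracted separately and then passed to one shared decoder. Everything below is hypothetical: the function names, the use of plain lists as feature vectors, and the trivial "encoders" are stand-ins chosen for illustration, not the patent's networks. The sketch also assumes that speech synthesis takes its content from text and speech conversion takes it from source speech, and that the content and prosody sequences have the same length.

```python
from typing import List, Sequence

Vector = List[float]


def text_content(text: str) -> List[Vector]:
    # Hypothetical text encoder: one 1-D content vector per character.
    return [[float(ord(ch))] for ch in text]


def speech_content(frames: Sequence[Vector]) -> List[Vector]:
    # Hypothetical content encoder: keeps the first feature of each frame.
    return [[frame[0]] for frame in frames]


def speaker_info(frames: Sequence[Vector]) -> Vector:
    # Hypothetical speaker encoder: utterance-level mean of each feature.
    return [sum(col) / len(frames) for col in zip(*frames)]


def prosody_info(frames: Sequence[Vector]) -> List[Vector]:
    # Hypothetical prosody encoder: keeps the last feature of each frame.
    return [[frame[-1]] for frame in frames]


def decode(content: List[Vector], speaker: Vector,
           prosody: List[Vector]) -> List[Vector]:
    # Toy decoder: concatenates content, speaker, and prosody per frame.
    if len(content) != len(prosody):
        raise ValueError("content and prosody must have the same length")
    return [c + speaker + p for c, p in zip(content, prosody)]


reference = [[1.0, 2.0], [3.0, 4.0]]   # speech of the target speaker
source = [[5.0, 6.0], [7.0, 8.0]]      # speech to be converted

# Speech synthesis path: content comes from text.
tts_out = decode(text_content("hi"), speaker_info(reference),
                 prosody_info(reference))

# Speech conversion path: content comes from source speech,
# speaker information from the reference utterance.
vc_out = decode(speech_content(source), speaker_info(reference),
                prosody_info(source))
```

The two paths differ only in where the content information comes from; the speaker and prosody extraction and the decoder are shared, which is the sense in which the two tasks are unified in this sketch.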

Description

Technical field

[0001] The invention belongs to the technical field of voice cloning, and in particular relates to a unified training method and system for speech synthesis and speech conversion.

Background

[0002] Cloning a target speaker's voice is an attractive technology that can be applied in various scenarios, such as entertainment creation, personalized mobile assistants, and security. Ideally, voice cloning requires only a single utterance from a previously unseen target speaker as a reference, after which arbitrary speech in that speaker's voice can be synthesized; this is called single-sample voice cloning. In speech research, speech synthesis and speech conversion are the two mainstream ways to realize voice cloning. The two technologies were previously researched and developed as separate tasks.

[0003] TTS (text-to-speech): speech synthesis;

[0004] VC (voice conversion): voice ...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G10L13/02, G10L13/027, G10L13/08
CPC: G10L13/02, G10L13/027, G10L13/08
Inventor: 陶建华, 汪涛, 易江燕, 傅睿博, 张震
Owner: INST OF AUTOMATION CHINESE ACAD OF SCI