Systems and methods for neural voice cloning with a few samples

An audio and text technology, applied in the field of computer learning systems, capable of solving complex problems

Active Publication Date: 2019-08-16
BAIDU USA LLC
10 Cites, 22 Cited by

AI Technical Summary

Problems solved by technology

Traditional TTS systems are based on complex, multi-stage, hand-engineered pipelines.

Examples

Embodiment approach

[0114] In one or more embodiments, the speaker encoder uses a neural network architecture comprising three parts (one embodiment is shown in Figure 12):

[0115] (i) Spectral processing: In one or more embodiments, a mel spectrogram 1205 of a cloning audio sample is computed and passed to a PreNet 1210, which contains fully connected (FC) layers with exponential linear units (ELUs).

[0116] (ii) Temporal processing: In one or more embodiments, several convolutional layers 1220 with gated linear units and residual connections are used to incorporate temporal context. Average pooling 1225 can then be applied to summarize the entire utterance.

[0117] (iii) Cloning sample attention: Because different cloning audios contain different amounts of speaker information, in one or more embodiments a multi-head self-attention mechanism 1230 can be used to compute weights for the different audios and obtain an aggregated embedding 1235.
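The three parts described in [0115] to [0117] can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the layer sizes, kernel width, random weights, single PreNet layer, and the way attention scores are reduced to one weight per sample (averaging over heads and queries) are all assumptions made for illustration, and every function name here is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def elu(x):
    # Exponential linear unit: x for x > 0, exp(x) - 1 otherwise.
    return np.where(x > 0, x, np.expm1(np.minimum(x, 0.0)))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prenet(mel, w, b):
    # (i) Spectral processing: an FC layer with ELU applied to every frame.
    return elu(mel @ w + b)

def glu_conv_block(x, w, b):
    # (ii) Temporal processing: 1-D convolution over time with "same"
    # padding, a gated linear unit, and a residual connection.
    k = w.shape[0]                                  # odd kernel width
    xp = np.pad(x, ((k // 2, k // 2), (0, 0)))
    out = np.stack([
        sum(xp[i + j] @ w[j] for j in range(k)) + b
        for i in range(x.shape[0])
    ])                                              # shape (T, 2 * D)
    a, g = np.split(out, 2, axis=-1)
    return x + a * (1.0 / (1.0 + np.exp(-g)))       # residual + a * sigmoid(g)

def sample_attention(h, wq, wk, n_heads):
    # (iii) Cloning sample attention: multi-head self-attention across the
    # N per-sample vectors. Assumption: scores are averaged over heads and
    # queries to give one weight per sample.
    n = h.shape[0]
    q = (h @ wq).reshape(n, n_heads, -1).transpose(1, 0, 2)   # (H, N, d_h)
    k = (h @ wk).reshape(n, n_heads, -1).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (H, N, N)
    return softmax(scores, axis=-1).mean(axis=(0, 1))         # (N,), sums to 1

def speaker_encoder(mels, params, n_heads=2):
    pooled = []
    for mel in mels:                                # mel: (T, n_mels)
        x = prenet(mel, params["w_pre"], params["b_pre"])
        for w, b in params["convs"]:
            x = glu_conv_block(x, w, b)
        pooled.append(x.mean(axis=0))               # average pooling over time
    h = np.stack(pooled)                            # (N, D)
    weights = sample_attention(h, params["wq"], params["wk"], n_heads)
    return weights @ h, weights                     # aggregated embedding, weights

def init_params(n_mels=80, d=16, kernel=3, n_convs=2, scale=0.1):
    # Hypothetical sizes; random weights stand in for trained ones.
    return {
        "w_pre": rng.normal(0, scale, (n_mels, d)),
        "b_pre": np.zeros(d),
        "convs": [(rng.normal(0, scale, (kernel, d, 2 * d)), np.zeros(2 * d))
                  for _ in range(n_convs)],
        "wq": rng.normal(0, scale, (d, d)),
        "wk": rng.normal(0, scale, (d, d)),
    }

params = init_params()
# Three cloning samples of different lengths, as random stand-in mel spectrograms.
mels = [rng.normal(size=(t, 80)) for t in (50, 72, 64)]
embedding, weights = speaker_encoder(mels, params)
```

Because average pooling removes the time axis, cloning samples of different lengths all map to vectors of the same size, and the attention weights then form a convex combination of those vectors.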

[0...

Abstract

Voice cloning is a highly desired capability for personalized speech interfaces. Neural network-based speech synthesis has been shown to generate high-quality speech for a large number of speakers. Neural voice cloning systems that take a few audio samples as input are presented herein. Two approaches, speaker adaptation and speaker encoding, are disclosed. Speaker adaptation embodiments are based on fine-tuning a multi-speaker generative model with a few cloning samples. Speaker encoding embodiments are based on training a separate model to directly infer a new speaker embedding from cloning audios, which is used in or with a multi-speaker generative model. Both approaches achieve good performance in terms of naturalness of the speech and its similarity to the original speaker, even with very few cloning audios.
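As a rough illustration of the speaker adaptation idea in the abstract, the toy sketch below fits a new speaker embedding to a few cloning samples by gradient descent. Everything in it is an assumption for illustration: the "generative model" is a frozen linear map rather than a neural TTS model, only the embedding is updated (the abstract does not say which parameters are fine-tuned), and all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_text, d_emb, d_out = 8, 3, 16          # hypothetical sizes

# Frozen toy "multi-speaker generative model": y = x @ W + e @ V,
# where x are text features and e is the speaker embedding.
W = rng.normal(size=(d_text, d_out))
V = rng.normal(size=(d_emb, d_out))

def generate(x, e):
    return x @ W + e @ V

# A few cloning samples from an unseen speaker whose embedding is hidden.
e_true = rng.normal(size=d_emb)
x_clone = rng.normal(size=(5, d_text))
y_clone = generate(x_clone, e_true)

def loss_fn(e):
    # Mean over samples of the squared error summed over output dimensions.
    err = generate(x_clone, e) - y_clone
    return float(np.mean(np.sum(err ** 2, axis=1)))

# Speaker adaptation (assumed embedding-only): gradient descent on e
# while W and V stay frozen.
e = np.zeros(d_emb)
lr = 0.5 / np.linalg.norm(V, 2) ** 2     # step size from the spectral norm of V
initial_loss = loss_fn(e)
for _ in range(500):
    err = generate(x_clone, e) - y_clone
    grad = 2.0 * err.mean(axis=0) @ V.T  # gradient of loss_fn with respect to e
    e -= lr * grad
final_loss = loss_fn(e)
```

Speaker encoding, by contrast, would replace the optimization loop with a single forward pass of a separately trained encoder that maps the cloning audios to an embedding.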

Description

Technical field

[0001] The present disclosure relates generally to systems and methods for computer learning that can provide improved computer performance, features, and uses. More specifically, the present disclosure relates to systems and methods for text-to-speech through deep neural networks.

Background

[0002] Artificial speech synthesis systems, commonly known as text-to-speech (TTS) systems, convert written language into human speech. TTS systems are used in a variety of applications, such as human-machine interfaces, accessibility for the visually impaired, media, and entertainment. Fundamentally, they allow human-computer interaction without a visual interface. Traditional TTS systems are based on complex, multi-stage, hand-engineered pipelines. Typically, these systems first convert text into a compact audio representation, and then convert this representation into audio using an audio waveform synthesis method called a vocoder.

[0003] One goal of a T...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G10L13/10, G10L13/08, G10L13/033, G10L25/60, G06N3/08
CPC: G10L13/10, G10L13/08, G10L13/0335, G10L13/033, G10L25/60, G06N3/08, G10L13/047, G10L13/027
Inventor: Sercan O. Arik, Jitong Chen, Kainan Peng, Wei Ping, Yanqi Zhou
Owner: BAIDU USA LLC