Systems and methods for neural voice cloning with a few samples

An audio and text technology, applied in the field of computer learning systems, capable of solving complex problems

Active Publication Date: 2019-08-16

BAIDU USA LLC

10 Cites, 22 Cited by


AI Technical Summary

Problems solved by technology

Traditional TTS systems are based on complex, multi-stage, hand-engineered pipelines.


Examples


Embodiment approach

[0114] In one or more embodiments, the speaker encoder uses a neural network architecture comprising three parts (one embodiment is shown in Figure 12):

[0115] (i) Spectral processing: In one or more embodiments, mel spectrograms 1205 of the cloning audio samples are computed and passed to a PreNet (pre-network) 1210, which contains fully connected (FC) layers with exponential linear units (ELUs).

[0116] (ii) Temporal processing: In one or more embodiments, several convolutional layers 1220 with gated linear units and residual connections are used to incorporate temporal context. Average pooling 1225 can then be applied to summarize the entire utterance.

[0117] (iii) Cloning sample attention: Because different cloning audios contain different amounts of speaker information, in one or more embodiments a multi-head self-attention mechanism 1230 can be used to compute weights for the different audios and obtain an aggregated embedding 1235.
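The three stages in [0115] to [0117] can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the layer sizes, the random weights, the single PreNet layer, the single convolutional layer, and the way attention scores are reduced to one weight per audio are all hypothetical choices made to keep the example short.

```python
import math
import random

def elu(x):
    return x if x > 0 else math.exp(x) - 1.0

def linear(vec, weight):
    # weight holds one row per output unit
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def rand_matrix(rng, rows, cols, scale=0.3):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def prenet(frames, weight):
    # (i) spectral processing: one FC layer with ELU, applied per mel frame
    return [[elu(h) for h in linear(frame, weight)] for frame in frames]

def gated_conv_residual(frames, w_value, w_gate, width=3):
    # (ii) temporal processing: 1-D convolution over time with a gated
    # linear unit (value * sigmoid(gate)) and a residual connection
    dim = len(frames[0])
    pad = [[0.0] * dim] * (width // 2)
    padded = pad + frames + pad
    out = []
    for t in range(len(frames)):
        window = [x for frame in padded[t:t + width] for x in frame]
        value, gate = linear(window, w_value), linear(window, w_gate)
        glu = [v / (1.0 + math.exp(-g)) for v, g in zip(value, gate)]
        out.append([x + y for x, y in zip(frames[t], glu)])
    return out

def mean_pool(frames):
    # average over time so each utterance becomes one vector
    return [sum(col) / len(frames) for col in zip(*frames)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return [e / sum(exps) for e in exps]

def attention_weights(sample_vecs, heads):
    # (iii) cloning sample attention, simplified (assumption): each head
    # scores every audio against every other, and the attention each audio
    # receives is summed and normalized into one weight per audio
    received = [0.0] * len(sample_vecs)
    for w_query, w_key in heads:
        queries = [linear(v, w_query) for v in sample_vecs]
        keys = [linear(v, w_key) for v in sample_vecs]
        scale = math.sqrt(len(keys[0]))
        for q in queries:
            row = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
            received = [r + a for r, a in zip(received, row)]
    total = sum(received)
    return [r / total for r in received]

def encode_speaker(samples, params):
    vecs = []
    for frames in samples:
        h = prenet(frames, params["prenet"])
        h = gated_conv_residual(h, params["conv_value"], params["conv_gate"])
        vecs.append(mean_pool(h))
    weights = attention_weights(vecs, params["heads"])
    # aggregated embedding: attention-weighted sum of per-audio vectors
    embedding = [sum(w * v[d] for w, v in zip(weights, vecs))
                 for d in range(len(vecs[0]))]
    return embedding, weights

rng = random.Random(0)
n_mels, hidden, width, attn_dim = 8, 4, 3, 4  # hypothetical sizes
params = {
    "prenet": rand_matrix(rng, hidden, n_mels),
    "conv_value": rand_matrix(rng, hidden, width * hidden),
    "conv_gate": rand_matrix(rng, hidden, width * hidden),
    "heads": [(rand_matrix(rng, attn_dim, hidden), rand_matrix(rng, attn_dim, hidden))
              for _ in range(2)],
}
# three fake "mel spectrograms" of different lengths stand in for cloning audios
samples = [[[rng.random() for _ in range(n_mels)] for _ in range(length)]
           for length in (5, 7, 3)]
embedding, weights = encode_speaker(samples, params)
print(len(embedding), [round(w, 3) for w in weights])
```

Because pooling removes the time axis before attention is applied, cloning audios of different lengths all contribute vectors of the same size, and the attention weights decide how much each one counts toward the final embedding.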

[0...


Abstract

Voice cloning is a highly desired capability for personalized speech interfaces. Neural network-based speech synthesis has been shown to generate high-quality speech for a large number of speakers. Neural voice cloning systems that take a few audio samples as input are presented herein. Two approaches, speaker adaptation and speaker encoding, are disclosed. Speaker adaptation embodiments are based on fine-tuning a multi-speaker generative model with a few cloning samples. Speaker encoding embodiments are based on training a separate model to directly infer a new speaker embedding from cloning audios, which is used in or with a multi-speaker generative model. Both approaches achieve good performance in terms of the naturalness of the speech and its similarity to the original speaker, even with very few cloning audios.
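To make the speaker adaptation idea concrete, the toy sketch below fine-tunes only a speaker embedding against a few cloning samples while a stand-in generative model stays frozen. Everything here is an assumption for illustration: the linear "model", the squared-error loss, the learning rate, and the function names `generate` and `adapt_embedding` are hypothetical and are not taken from the patent.

```python
def generate(text_features, model_weights, speaker_embedding):
    # Toy stand-in for a multi-speaker generative model: a linear map of the
    # text features plus a per-speaker offset (assumption, not the real model).
    return [sum(w * x for w, x in zip(row, text_features)) + e
            for row, e in zip(model_weights, speaker_embedding)]

def adapt_embedding(cloning_pairs, model_weights, init_embedding, lr=0.1, steps=200):
    # Speaker adaptation, restricted here to the embedding: gradient descent
    # on mean squared error over the cloning samples, with model_weights frozen.
    emb = list(init_embedding)
    for _ in range(steps):
        grad = [0.0] * len(emb)
        for text, target in cloning_pairs:
            pred = generate(text, model_weights, emb)
            for d, (p, t) in enumerate(zip(pred, target)):
                grad[d] += 2.0 * (p - t) / len(cloning_pairs)
        emb = [e - lr * g for e, g in zip(emb, grad)]
    return emb

model_weights = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]]   # frozen toy model
true_embedding = [0.5, -0.3]                            # the unseen speaker
texts = [[1.0, 0.0, 2.0], [0.5, 1.5, -1.0], [0.0, 1.0, 1.0]]
# a few cloning samples: (text features, audio features from the new speaker)
cloning_pairs = [(t, generate(t, model_weights, true_embedding)) for t in texts]

adapted = adapt_embedding(cloning_pairs, model_weights, [0.0, 0.0])
print([round(e, 4) for e in adapted])
```

Speaker encoding, by contrast, would replace the optimization loop with a single forward pass of a separately trained encoder that maps the cloning audios straight to an embedding.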

Description

Technical field

[0001] The present disclosure relates generally to systems and methods for computer learning that can provide improved computer performance, features, and uses. More specifically, the present disclosure relates to systems and methods for text-to-speech via deep neural networks.

Background

[0002] Artificial speech synthesis systems, commonly known as text-to-speech (TTS) systems, convert written language into human speech. TTS systems are used in a variety of applications, such as human-machine interfaces, accessibility for the visually impaired, media, and entertainment. Fundamentally, TTS allows human-computer interaction without a visual interface. Traditional TTS systems are based on complex, multi-stage, hand-engineered pipelines. Typically, these systems first convert text into a compact audio representation and then convert this representation into audio using an audio waveform synthesis method called a vocoder.

[0003] One goal of a T...


Application Information


Patent Type & Authority: Applications (China)

IPC(8): G10L13/10, G10L13/08, G10L13/033, G10L25/60, G06N3/08

CPC: G10L13/10, G10L13/08, G10L13/0335, G10L13/033, G10L25/60, G06N3/08, G10L13/047, G10L13/027

Inventor: 塞尔坎·O·安瑞克, 陈吉彤, 彭开南, 平伟, 周彥祺

Owner: BAIDU USA LLC
