Clockwork hierarchical variational encoder

A clockwork hierarchical variational encoder technique that addresses ineffective prosody modeling and the resulting lack of expressiveness in synthesized speech.

Pending Publication Date: 2020-11-27

GOOGLE LLC

7 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

While traditional concatenative and parametric synthesis models are able to provide intelligible speech, and recent advances in neural modeling of speech have significantly improved the naturalness of synthesized speech, most existing TTS models are ineffective at modeling prosody, resulting in a lack of expressiveness in the synthesized speech used by important applications.

Embodiment Construction

[0025] Text-to-speech (TTS) models commonly used by speech synthesis systems are generally given only a textual input at runtime, without any reference acoustic representation, and must impute many linguistic factors not provided by the textual input in order to produce realistic-sounding synthesized speech. A subset of these linguistic factors is collectively called prosody, and can include intonation (pitch variation), stress (stressed versus unstressed syllables), duration of sounds, loudness, tone, rhythm, and style of speech. Prosody may indicate the emotional state of the speech, the form of the speech (e.g., statement, question, command, etc.), the presence of irony or sarcasm in the speech, uncertainty in the knowledge of the speech, or other linguistic elements. Thus, a given textual input associated with a high degree of prosodic variation can produce synthesized speech with local changes in pitch and speaking duration to convey different semantic meanings, and also with global changes in the overall ...

Abstract

A method (400) for representing an intended prosody in synthesized speech (152) includes receiving a text utterance (320) having at least one word (250), and selecting an utterance embedding (260) for the text utterance. Each word in the text utterance has at least one syllable (240) and each syllable has at least one phoneme (230). The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration of the syllable by encoding linguistic features (232) of each phoneme of the syllable with a corresponding prosodic syllable embedding (245) for the syllable; predicting a pitch contour of the syllable based on the predicted duration for the syllable; and generating a plurality of fixed-length predicted pitch frames (280) based on the predicted duration for the syllable. Each fixed-length predicted pitch frame represents part of the predicted pitch contour of the syllable.
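The per-syllable flow in the abstract (duration first, then a pitch contour, then fixed-length pitch frames) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the class names, the 5 ms frame length, and the two `toy_*` functions, which are hand-written placeholder rules standing in for the learned predictors and are not the method disclosed in the patent.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

FRAME_MS = 5.0  # assumed fixed frame length; not stated in the abstract


@dataclass
class Phoneme:
    linguistic_features: List[float]


@dataclass
class Syllable:
    phonemes: List[Phoneme]
    prosodic_embedding: List[float]  # stands in for the prosodic syllable embedding


@dataclass
class Word:
    syllables: List[Syllable]


def toy_duration_ms(syl: Syllable, utt_emb: Sequence[float]) -> float:
    """Hypothetical stand-in for a learned duration predictor.

    Combines each phoneme's linguistic features with the utterance
    embedding using an arbitrary hand-written rule.
    """
    per_phoneme = sum(50.0 + 10.0 * p.linguistic_features[0] for p in syl.phonemes)
    return per_phoneme * (1.0 + 0.1 * utt_emb[0])


def toy_pitch_hz(t: float, syl: Syllable, utt_emb: Sequence[float]) -> float:
    """Hypothetical pitch contour over relative position t in [0, 1]."""
    base = 120.0 + 10.0 * utt_emb[1]
    return base + 20.0 * syl.prosodic_embedding[0] * t


def predict_pitch_frames(
    words: Sequence[Word],
    utt_emb: Sequence[float],
    frame_ms: float = FRAME_MS,
) -> List[List[float]]:
    """Return one list of fixed-length pitch frames per syllable."""
    result = []
    for word in words:
        for syl in word.syllables:
            # Step 1: predict the syllable duration.
            duration = toy_duration_ms(syl, utt_emb)
            # Step 2: the duration fixes how many frames cover the syllable.
            n_frames = max(1, math.ceil(duration / frame_ms))
            # Step 3: sample the pitch contour at the center of each frame.
            frames = [
                toy_pitch_hz((i + 0.5) / n_frames, syl, utt_emb)
                for i in range(n_frames)
            ]
            result.append(frames)
    return result
```

The point of the sketch is the ordering: the number of pitch frames for a syllable is not known until its duration has been predicted, so each frame represents one slice of that syllable's contour.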

Description

Technical field

[0001] The present disclosure relates to a clockwork hierarchical variational encoder for predicting prosody.

Background

[0002] Speech synthesis systems use a text-to-speech (TTS) model to generate speech from textual input. The generated/synthesized speech should convey information accurately (intelligibility) while sounding like human speech (naturalness) with intended prosody (expressiveness). While traditional concatenative and parametric synthesis models can provide intelligible speech, and recent advances in neural modeling of speech have significantly improved the naturalness of synthesized speech, most existing TTS models are ineffective at modeling prosody, resulting in a lack of expressiveness in the synthesized speech used by important applications. For example, for applications such as conversational assistants and long-form readers, it is desirable to generate realistic speech by imputing prosodic features that are not conveyed in the text ...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G10L13/047, G10L13/08, G06N3/02, G10L13/10

CPC: G10L13/047, G10L13/10, G06N3/044, G06N3/045, G06N3/047, G06N3/084, G10L2013/105, G06N3/0442, G06N3/0455, G06N3/0475

Inventor: Robert Clark (罗伯特·克拉克), Chun-an Chan (詹竣安), Vincent Wan (文森特·万)

Owner: GOOGLE LLC
