Clockwork hierarchical variational encoder

A clockwork hierarchical variational encoder, applied to prosody prediction for speech synthesis, addresses the ineffective prosody modeling and resulting lack of expressiveness in synthesized speech.

Pending Publication Date: 2020-11-27
GOOGLE LLC

AI Technical Summary

Problems solved by technology

While traditional concatenative and parametric synthesis models are able to provide intelligible speech, and recent advances in neural modeling of speech have significantly improved the naturalness of synthesized speech, most existing TTS models are ineffective at modeling prosody, resulting in a lack of expressiveness in the synthesized speech used by important applications.



Examples


Embodiment Construction

[0025] Text-to-speech (TTS) models commonly used by speech synthesis systems are generally given only a textual input at runtime, without any reference acoustic representation, and must impute many linguistic factors not provided by the textual input in order to produce realistic-sounding synthesized speech. A subset of these linguistic factors is collectively called prosody, and can include intonation (pitch variation), stress (stressed versus unstressed syllables), sound duration, volume, pitch, rhythm, and style of speech. Prosody may indicate the emotional state of the speech, the form of the speech (e.g., statement, question, command, etc.), the presence of irony or sarcasm in the speech, uncertainty in the knowledge of the speech, or other linguistic elements. Thus, a given textual input associated with a high degree of prosodic variation can produce synthesized speech with local changes in pitch and speaking duration to convey different semantic meanings, and also with global changes in the overall ...


Abstract

A method (400) for representing an intended prosody in synthesized speech (152) includes receiving a text utterance (320) having at least one word (250), and selecting an utterance embedding (260) for the text utterance. Each word in the text utterance has at least one syllable (240) and each syllable has at least one phoneme (230). The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration of the syllable by encoding linguistic features (232) of each phoneme of the syllable with a corresponding prosodic syllable embedding (245) for the syllable; predicting a pitch contour of the syllable based on the predicted duration for the syllable; and generating a plurality of fixed-length predicted pitch frames (280) based on the predicted duration for the syllable. Each fixed-length predicted pitch frame represents part of the predicted pitch contour of the syllable.
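The per-syllable flow described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Syllable` class, the function names, the 5 ms frame length, and the toy `tanh`/`sin` formulas are stand-ins for learned models, not the patented networks. Only the order of steps follows the abstract: predict a duration from the phoneme features, the syllable embedding, and the utterance embedding; then derive a pitch contour from that duration and cut it into fixed-length frames.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

FRAME_MS = 5.0  # assumed frame length; the source only says "fixed-length"


@dataclass
class Syllable:
    """Hypothetical container for one syllable's inputs."""
    phoneme_features: List[List[float]]  # one linguistic feature vector per phoneme
    embedding: List[float]               # prosodic syllable embedding


def predict_duration_ms(syl: Syllable, utterance_embedding: List[float]) -> float:
    """Toy stand-in for a learned duration predictor.

    Encodes each phoneme's features together with the syllable embedding
    and the utterance embedding, and sums a per-phoneme duration.
    """
    context = sum(syl.embedding) + sum(utterance_embedding)
    total = 0.0
    for feats in syl.phoneme_features:
        # tanh keeps each phoneme between 20 ms and 60 ms in this toy model.
        total += 40.0 + 20.0 * math.tanh(sum(feats) + context)
    return total


def predict_pitch_frames(utterance_embedding: List[float],
                         duration_ms: float) -> List[float]:
    """Toy pitch contour, sampled as fixed-length frames.

    The number of frames depends on the predicted duration, so duration
    must be predicted before pitch.
    """
    n_frames = max(1, round(duration_ms / FRAME_MS))
    base_hz = 120.0 + 30.0 * math.tanh(sum(utterance_embedding))
    # A simple rise-fall shape over the syllable; each frame is one sample.
    return [base_hz + 10.0 * math.sin(math.pi * (i + 0.5) / n_frames)
            for i in range(n_frames)]


def predict_prosody(syllables: List[Syllable],
                    utterance_embedding: List[float]) -> List[Tuple[float, List[float]]]:
    """Return (duration_ms, pitch_frames) for each syllable, in order."""
    result = []
    for syl in syllables:
        duration = predict_duration_ms(syl, utterance_embedding)
        frames = predict_pitch_frames(utterance_embedding, duration)
        result.append((duration, frames))
    return result
```

Calling `predict_prosody(syllables, utterance_embedding)` with a different utterance embedding changes both the durations and the pitch level for the same text, which is the sense in which the embedding represents an intended prosody.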

Description

Technical field

[0001] The present disclosure relates to a clockwork hierarchical variational encoder for predicting prosody.

Background

[0002] Speech synthesis systems use a text-to-speech (TTS) model to generate speech from textual input. The generated/synthesized speech should convey information accurately (intelligibility) while sounding like human speech (naturalness) with intended prosody (expressiveness). While traditional concatenative and parametric synthesis models can provide intelligible speech, and recent advances in neural modeling of speech have significantly improved the naturalness of synthesized speech, most existing TTS models are ineffective at modeling prosody, resulting in a lack of expressiveness in the synthesized speech used by important applications. For example, for applications such as conversational assistants and long-form readers, it is desirable to generate realistic speech by imputing prosodic features that are not conveyed in the text ...


Application Information

Patent Type & Authority: Applications (China)
IPC (8): G10L13/047, G10L13/08, G06N3/02, G10L13/10
CPC: G06N3/084, G10L13/047, G10L13/10, G10L2013/105, G06N3/047, G06N3/044, G06N3/045
Inventor: Robert Clark, Chun-an Chan, Vincent Wan
Owner: GOOGLE LLC