
Attention-based clockwork hierarchical variational encoder

A technology based on attention and duration prediction, applied in speech analysis, neural network models, and related fields, that addresses ineffective prosody modeling and the resulting lack of expressiveness in synthesized speech.

Pending Publication Date: 2022-07-12
GOOGLE LLC

AI Technical Summary

Problems solved by technology

While traditional concatenative and parametric synthesis models can provide intelligible speech, and recent advances in neural modeling of speech have significantly improved the naturalness of synthesized speech, most existing TTS models are ineffective at modeling prosody, so the synthesized speech used by important applications lacks expressiveness.



Embodiment Construction

[0025] Text-to-speech (TTS) models commonly used by speech synthesis systems are typically given only a text input at run time, without any reference acoustic representation. To produce synthesized speech that sounds realistic, they must impute many linguistic factors that the text input does not provide. A subset of these linguistic factors is collectively referred to as prosody, and can include intonation (pitch variation), stress (stressed versus unstressed syllables), duration of sounds, loudness, tone, rhythm, and style of speech. Prosody may indicate the emotional state of the speech, the form of the speech (e.g., statement, question, command), the presence of irony or sarcasm, uncertainty in the knowledge of the speech, or other linguistic elements that cannot be encoded by the grammar or vocabulary choice of the input text. Thus, a given text input associated with a high degree of prosodic variability can produce synthetic speech with local variations in pit...


Abstract

A method (400) for representing an intended prosody in synthesized speech includes receiving a text utterance (310) having at least one word (240), and selecting an utterance embedding (204) for the text utterance. Each word in the text utterance has at least one syllable (230), and each syllable has at least one phoneme (220). The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method further includes predicting a duration (238) of the syllable by decoding a prosodic syllable embedding (232, 234) for the syllable based on attention by an attention mechanism (340) to linguistic features (222) of each phoneme of the syllable, and generating a plurality of fixed-length predicted frames (260) based on the predicted duration of the syllable.
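The per-syllable step described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent text: the scaled dot-product form of the attention, the linear readout with a softplus, the 100 ms scale factor, the 5 ms frame length, and all function names and numeric values. The sketch only shows the general shape of the idea: attend over the linguistic features of a syllable's phonemes, predict a duration from the syllable embedding plus the attended context, and turn that duration into a number of fixed-length frames.

```python
import math

FRAME_MS = 5.0  # assumed frame length; the abstract only says "fixed-length"


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, phoneme_features):
    """Scaled dot-product attention of one query over per-phoneme features."""
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, feats)) / scale
              for feats in phoneme_features]
    weights = softmax(scores)
    context = [sum(w * feats[i] for w, feats in zip(weights, phoneme_features))
               for i in range(len(query))]
    return weights, context


def predict_duration_ms(syllable_embedding, context, readout, bias):
    """Toy linear readout; softplus keeps the predicted duration positive."""
    inputs = syllable_embedding + context  # list concatenation
    z = sum(r * x for r, x in zip(readout, inputs)) + bias
    return 100.0 * math.log1p(math.exp(z))  # 100 ms scale is an assumption


def predicted_frame_count(duration_ms, frame_ms=FRAME_MS):
    """Number of fixed-length frames covering the predicted duration."""
    return max(1, round(duration_ms / frame_ms))


# Hypothetical syllable with two phonemes and 3-dimensional features.
syllable_embedding = [0.2, -0.1, 0.4]
phoneme_features = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]

weights, context = attend(syllable_embedding, phoneme_features)
duration_ms = predict_duration_ms(syllable_embedding, context,
                                  readout=[0.3] * 6, bias=0.0)
frame_count = predicted_frame_count(duration_ms)
frames = [{"index": i, "length_ms": FRAME_MS} for i in range(frame_count)]
print(f"duration={duration_ms:.1f} ms, frames={len(frames)}")
```

In a trained model the readout weights and embeddings would be learned, and each predicted frame would carry acoustic values rather than only an index; the sketch stops at the frame count because that is the part the abstract ties directly to the predicted syllable duration.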

Description

Technical field

[0001] The present disclosure relates to an attention-based clockwork hierarchical variational encoder.

Background

[0002] Speech synthesis systems use a text-to-speech (TTS) model to generate speech from text input. The generated/synthesized speech should accurately convey the message (intelligibility) while sounding like human speech (naturalness) with the intended prosody (expressiveness). While traditional concatenative and parametric synthesis models can provide intelligible speech, and recent advances in neural modeling of speech have significantly improved the naturalness of synthesized speech, most existing TTS models are ineffective at modeling prosody, so the synthesized speech used by important applications lacks expressiveness. For example, for applications such as conversational assistants and long-form readers, it is desirable to produce realistic speech by imputing prosodic features that are not conveyed in the input text...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G10L13/10, G06N3/04, G10L13/047
CPC: G10L13/10, G10L13/047, G10L2013/105, G06N3/084, G06N3/047, G06N3/044, G06N3/045, G10L25/30
Inventors: Robert Clark, Chun-an Chan, Vincent Wan
Owner: GOOGLE LLC