Prosodic hierarchy model training method, text-to-speech method and text-to-speech device

Prosodic level prediction and speech synthesis technology in the field of speech. It addresses limited computing resources and storage space, level-by-level error propagation, and prediction errors, with the effects of lowering computing and storage requirements, reducing influencing factors, and reducing the dictionary entry scale.

Active Publication Date: 2016-01-13

BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD

Cites: 2; Cited by: 63


AI Technical Summary

Problems solved by technology

[0007] (1) Although expanding features to the left and right can introduce some contextual information, the expansion window is usually kept small to limit model size and training complexity, so long-distance contextual relations between words cannot be modeled;

[0008] (2) Training level by level can propagate errors from one level to the next: once the prosody prediction at an earlier level is wrong, the error is easily passed down and causes subsequent prediction errors;

[0009] (3) Because the prosody prediction model is trained and applied at word granularity, its performance depends on the word segmentation system. In offline speech synthesis, limited computing resources and storage space constrain the word segmentation system, so it performs worse than the segmenter of an online speech synthesis system, which in turn degrades the final prosody prediction;

[0010] (4) An offline synthesis system has limited computing resources and storage space, so it places strict requirements on the size of the model and resource files. A word-granularity prediction model relies on a dictionary with hundreds of thousands of entries, which occupies a large amount of storage space and computing resources.
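Items (3) and (4) argue for character granularity. A minimal sketch under stated assumptions: the toy corpus is hypothetical, its spaces stand in for a word segmenter's output, and the helper names are illustrative only. It shows that a character-level dictionary covers an unseen word without any segmentation, while the word-level dictionary misses it. On a toy corpus the character set can even be larger than the word set; the size advantage described in the text appears at realistic scale, where word entries number in the hundreds of thousands but the character inventory is bounded by the character set (GB2312, for example, encodes 6,763 Chinese characters).

```python
from typing import Iterable, List, Set


def word_dictionary(segmented: Iterable[str]) -> Set[str]:
    """Word-granularity dictionary: requires text already split into words."""
    return {w for line in segmented for w in line.split()}


def char_dictionary(segmented: Iterable[str]) -> Set[str]:
    """Character-granularity dictionary: every non-space character is an entry."""
    return {ch for line in segmented for ch in line if not ch.isspace()}


def oov(units: List[str], dictionary: Set[str]) -> List[str]:
    """Return the units the dictionary does not cover (out-of-vocabulary)."""
    return [u for u in units if u not in dictionary]


# Hypothetical toy training corpus; spaces mark word boundaries from a segmenter.
train = ["语音 合成", "言语 表达", "合成 音色"]
words = word_dictionary(train)
chars = char_dictionary(train)

# New input containing the unseen word 语言.
new_sentence = "语言表达"
print(oov(["语言", "表达"], words))    # → ['语言']  (needs segmentation and still misses a word)
print(oov(list(new_sentence), chars))  # → []  (every character is already covered)
```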


Embodiment Construction

[0034]Embodiments of the present invention are described in detail below, examples of which are shown in the drawings, wherein the same or similar reference numerals designate the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the figures are exemplary and are intended to explain the present invention and should not be construed as limiting the present invention.

[0035] It can be understood that the purpose of speech synthesis is to convert text into speech and play it to the user, with the goal of sounding like a real person reading the text aloud. Speech synthesis is now fairly mature in terms of intelligibility, but its naturalness and fluency still fall well short of a human reader. A key factor affecting naturalness and fluency is the prosodic pausing in the synthesized speech, and the key factor affecting the fluency of a speech synthesis system is the accuracy of prosodic le...


Abstract

The invention discloses a prosodic hierarchy model training method for text-to-speech, and a text-to-speech method and device that use the prosodic hierarchy model. The training method includes: training on massive untagged corpus data to obtain a character vector for each individual character; obtaining textual features and tags for the training data according to the character vectors and prosodic tagging data; and, based on a deep neural network and a bidirectional LSTM (long short-term memory) neural network, training the prosodic hierarchy model according to the textual features and tags of the training data. Compared with a traditional word-granularity dictionary, the character-granularity dictionary used in the training method effectively reduces the number of entries and lowers the computing and memory requirements of the model and resource files, so the prosody prediction model performs better while remaining usable on embedded intelligent devices.
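The final training step pairs a deep neural network with a bidirectional LSTM that tags each character. Below is a minimal forward-pass sketch in pure Python so it runs without a framework. Several details are assumptions not stated in the text above: the dense (DNN) layer is applied to the character vectors before the BiLSTM, the weights are random rather than trained, and the four output labels (for instance no break, prosodic word, prosodic phrase, intonational phrase) are a hypothetical label scheme. `ProsodyTagger` and the other names are illustrative. What it shows is that the backward LSTM lets each character's prediction depend on the whole sentence instead of a fixed window.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def matvec(w: Matrix, v: Vector) -> Vector:
    """Matrix-vector product with plain lists."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def rand_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def softmax(v: Vector) -> Vector:
    m = max(v)  # subtract the max for numerical stability
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


class LSTM:
    """One-direction LSTM; every gate reads the concatenation [x_t; h_{t-1}]."""

    def __init__(self, in_dim: int, hid: int, rng: random.Random) -> None:
        self.hid = hid
        # i = input gate, f = forget gate, o = output gate, c = candidate cell
        self.w = {g: rand_matrix(hid, in_dim + hid, rng) for g in "ifoc"}
        self.b = {g: [0.0] * hid for g in "ifoc"}

    def run(self, xs: List[Vector]) -> List[Vector]:
        h, c, states = [0.0] * self.hid, [0.0] * self.hid, []
        for x in xs:
            z = x + h  # list concatenation: [x_t; h_{t-1}]
            pre = {g: [a + b for a, b in zip(matvec(self.w[g], z), self.b[g])] for g in "ifoc"}
            i = [sigmoid(v) for v in pre["i"]]
            f = [sigmoid(v) for v in pre["f"]]
            o = [sigmoid(v) for v in pre["o"]]
            cand = [math.tanh(v) for v in pre["c"]]
            c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, cand)]
            h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
            states.append(h)
        return states


class ProsodyTagger:
    """Hypothetical tagger: dense layer -> BiLSTM -> per-character softmax."""

    def __init__(self, emb_dim: int, hid: int, n_labels: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.dense = rand_matrix(hid, emb_dim, rng)  # DNN layer over character vectors
        self.fwd = LSTM(hid, hid, rng)
        self.bwd = LSTM(hid, hid, rng)
        self.out = rand_matrix(n_labels, 2 * hid, rng)

    def predict(self, char_vectors: List[Vector]) -> List[Vector]:
        xs = [[math.tanh(v) for v in matvec(self.dense, x)] for x in char_vectors]
        forward = self.fwd.run(xs)
        backward = self.bwd.run(xs[::-1])[::-1]  # re-align backward states to positions
        return [softmax(matvec(self.out, f + b)) for f, b in zip(forward, backward)]


# Usage with five hypothetical 8-dimensional character vectors.
rng = random.Random(1)
sentence = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(5)]
tagger = ProsodyTagger(emb_dim=8, hid=4, n_labels=4)
probs = tagger.predict(sentence)
print([max(range(4), key=p.__getitem__) for p in probs])  # untrained, so labels are arbitrary
```

Predicting one joint label per character, rather than chaining one model per prosodic level, is one way to avoid the level-by-level error propagation described in [0008]; the excerpt does not state whether the patent uses exactly this labeling.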

Description

Technical field

[0001] The invention relates to the field of speech technology, in particular to a prosodic-level model training method for speech synthesis, and to a speech synthesis method and device that use the prosodic-level model.

Background

[0002] Speech synthesis, also known as text-to-speech (TTS), converts text into speech and reads it aloud. Because prosodic level prediction is the foundation of the whole speech synthesis system, improving synthesis quality depends on improving the accuracy of prosodic level prediction.

[0003] In the related art, there are mainly two methods for prosodic level prediction:

[0004] First, prosodic level prediction usually uses a CRF (conditional random field) model; that is, the CRF-based prosodic level prediction method needs to expand the training features to the left and right in order to introduce c...
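The left/right feature expansion used by the CRF approach can be sketched as follows. This is a hypothetical helper with CRF++-style feature names chosen for illustration, not the feature template used in the patent. With a window of `k`, a token farther than `k` positions away never enters the feature set, which is the long-distance limitation listed as problem (1).

```python
from typing import Dict, List


def window_features(tokens: List[str], i: int, k: int) -> Dict[str, str]:
    """Expand the feature of token i with its neighbours at offsets -k..k.

    Positions outside the sentence get a padding symbol; tokens more than
    k positions away are never seen by the model at position i.
    """
    feats = {}
    for off in range(-k, k + 1):
        j = i + off
        feats[f"w[{off}]"] = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
    return feats


tokens = ["今天", "北京", "的", "天气", "很", "好"]
print(window_features(tokens, 0, 2))
# → {'w[-2]': '<PAD>', 'w[-1]': '<PAD>', 'w[0]': '今天', 'w[1]': '北京', 'w[2]': '的'}
# 好 (index 5) is invisible at position 0 unless k >= 5.
```

Widening `k` brings more context into each position but multiplies the number of feature functions, which is the model-size trade-off the problem statement describes.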


Application Information


IPC(8): G10L13/08; G10L13/10

Inventors: 徐扬凯 (Xu Yangkai), 李秀林 (Li Xiulin), 付晓寅 (Fu Xiaoyin), 陈志杰 (Chen Zhijie)

Owner: BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD
