Speech synthesis device supporting styles of multiple speakers, language switching and controllable rhythm

A speech synthesis technology, applied in speech synthesis, speech analysis, instruments, etc., that addresses three problems: the inability to decouple and separate the styles of multiple speakers, the inability to synthesize mixed-language speech, and the limited, single-aspect control over synthesized speech. It achieves richer functionality, lower deployment cost, and improved fault tolerance.

Active Publication Date: 2021-05-28

杭州一知智能科技有限公司

9 Cites, 7 Cited by

AI Technical Summary

Problems solved by technology

[0003] Existing speech synthesis methods offer only a single form of control over the synthesized speech. They cannot synthesize speech that mixes multiple languages, nor can they decouple the styles of multiple speakers and apply them to other speakers.

Embodiment

[0114] The invention is tested on a text dataset containing 32,500 audio recordings and their corresponding prosody annotations from six speakers: 30,000 in Chinese, 2,000 in English, and 500 mixing Chinese and English. The dataset is preprocessed as follows:

[0115] 1) Extract the Chinese and English phoneme files and the corresponding audio, and use the open-source tool Montreal Forced Aligner to extract the pronunciation duration of each phoneme.

[0116] 2) Extract the mel spectrum of each audio recording, using a window size of 50 milliseconds, a frame shift of 12.5 milliseconds, and 80 dimensions.

[0117] 3) Extract the pitch of each audio recording using the WORLD vocoder.

[0118] 4) Sum the mel spectrum extracted from each audio recording along the mel dimension to obtain the energy of the mel spectrum.
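
The frame arithmetic implied by steps 1), 2), and 4) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names, the cumulative-rounding rule for durations, and any sample rate passed in are hypothetical. Only the 50 ms window, the 12.5 ms frame shift, and the 80 mel dimensions come from the text.

```python
# Values stated in step 2) of the preprocessing.
WINDOW_MS = 50.0
FRAME_SHIFT_MS = 12.5
N_MELS = 80


def ms_to_samples(ms, sample_rate):
    """Convert a length in milliseconds to a sample count.

    The sample rate is an assumption; the source does not state one.
    """
    return int(round(ms * sample_rate / 1000.0))


def durations_to_frames(durations_sec, frame_shift_ms=FRAME_SHIFT_MS):
    """Convert phoneme durations (seconds) to integer mel-frame counts.

    Boundaries are rounded cumulatively (an assumed rule) so that the
    frame counts add up to the rounded total length of the utterance.
    """
    frames = []
    elapsed = 0.0
    prev_boundary = 0
    for d in durations_sec:
        elapsed += d
        boundary = int(round(elapsed * 1000.0 / frame_shift_ms))
        frames.append(boundary - prev_boundary)
        prev_boundary = boundary
    return frames


def frame_energy(mel):
    """Step 4): sum each frame over its mel bins.

    `mel` is a list of frames, each frame a list of mel-bin values
    (80 values per frame in the setup described above).
    """
    return [sum(frame) for frame in mel]
```

With an assumed 16 kHz sample rate, `ms_to_samples(WINDOW_MS, 16000)` gives the window length in samples and `ms_to_samples(FRAME_SHIFT_MS, 16000)` the hop length. The aligner's per-phoneme durations can then be mapped to frame counts with `durations_to_frames`, so that duration, pitch, and energy targets share the same frame grid.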

Abstract

The invention discloses a speech synthesis device supporting multi-speaker styles, language switching, and controllable rhythm, and belongs to the field of speech synthesis. The device comprises: a text acquisition unit and a text preprocessing unit for acquiring and preprocessing different text data; a language switching unit for storing and displaying the speaker tags corresponding to the training data of each language type and for automatically identifying the language type of the text to be synthesized; a style switching unit for specifying a speech synthesis style according to the language type; a speaker switching unit for specifying a speaker; an encoding-decoding unit for obtaining a predicted mel spectrum; a training unit for training the encoding-decoding unit; and a speech synthesis unit for generating the predicted mel spectrum and converting it into a sound signal for playback. The invention allows the speaker and the speaker's style to be controlled separately while generating speech with richer rhythmic variation.
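
The abstract does not say how the language switching unit identifies the language type of the input text. One simple possibility, shown here purely as a hedged sketch, is to classify characters by Unicode range. The function names, the `"zh"`/`"en"`/`"mixed"` labels, and the rule that punctuation and digits attach to the neighboring segment are all assumptions made for illustration.

```python
def char_language(ch):
    """Classify one character: 'zh', 'en', or None for neutral characters."""
    if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs block
        return "zh"
    if ch.isascii() and ch.isalpha():
        return "en"
    return None  # digits, punctuation, whitespace


def split_by_language(text):
    """Split text into (language, segment) runs.

    Neutral characters join the current run; leading neutral characters
    are adopted by the first run that has a language.
    """
    segments = []
    for ch in text:
        lang = char_language(ch)
        if segments and (lang is None or lang == segments[-1][0]):
            segments[-1] = (segments[-1][0], segments[-1][1] + ch)
        elif lang is None:
            segments.append((None, ch))
        elif segments and segments[-1][0] is None:
            segments[-1] = (lang, segments[-1][1] + ch)
        else:
            segments.append((lang, ch))
    return segments


def detect_language(text):
    """Return 'zh', 'en', 'mixed', or 'unknown' for a whole utterance."""
    langs = {lang for lang, _ in split_by_language(text) if lang}
    if langs == {"zh"}:
        return "zh"
    if langs == {"en"}:
        return "en"
    if langs == {"zh", "en"}:
        return "mixed"
    return "unknown"
```

In a design like this, the utterance-level label could drive the choice of default style and speaker, while the per-segment runs could be passed to language-specific phoneme conversion. Whether the patented device works this way is not stated in the source.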

Description

Technical field

[0001] The invention belongs to the field of speech synthesis, and in particular relates to a speech synthesis device supporting multi-speaker styles, language switching, and controllable rhythm.

Background

[0002] In recent years, with the development of deep learning, speech synthesis technology has improved greatly. Speech synthesis has moved from the traditional parametric and concatenative methods to end-to-end methods. End-to-end methods usually first generate the mel spectrum from the text features, and then use a vocoder to synthesize the speech from the mel spectrum. According to their structure, these end-to-end methods can be divided into autoregressive models and non-autoregressive models. Autoregressive models usually use the Encoder-Attention-Decoder mechanism for autoregressive generation: to generate the current data point, all previous data points in the time series must already have been generated and are fed back as model input, as in Tacotron, Tacotron 2, Deep ...
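
The difference between the two model families described in the background can be illustrated with a toy sketch. This is not any real model: the scalar "frames" and the `toy_step` and `toy_frame` functions are hypothetical stand-ins for a decoder. The point is only the data dependency: the autoregressive loop feeds earlier outputs back in, while the non-autoregressive version computes every frame from the encoder output alone.

```python
def autoregressive_generate(encoder_states, step_fn, n_frames):
    """Generate frames one at a time; each step sees all previous frames."""
    frames = []
    for _ in range(n_frames):
        frames.append(step_fn(encoder_states, frames))
    return frames


def non_autoregressive_generate(encoder_states, frame_fn, n_frames):
    """Generate every frame independently of the other output frames.

    Because no frame depends on another, the calls could run in parallel.
    """
    return [frame_fn(encoder_states, t) for t in range(n_frames)]


def toy_step(encoder_states, previous_frames):
    """Hypothetical decoder step: text context plus half the last frame."""
    context = sum(encoder_states) / len(encoder_states)
    last = previous_frames[-1] if previous_frames else 0.0
    return context + 0.5 * last


def toy_frame(encoder_states, t):
    """Hypothetical parallel decoder: text context plus the frame index."""
    context = sum(encoder_states) / len(encoder_states)
    return context + float(t)
```

In the autoregressive loop, frame `t` cannot be computed until frames `0..t-1` exist, which makes generation inherently sequential. The non-autoregressive version has no such dependency.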

Application Information

IPC(8): G10L13/10, G10L19/02, G10L19/26, G10L25/30, G06N3/04, G06N3/08

CPC: G10L13/10, G10L19/02, G10L19/26, G10L25/30, G06N3/08, G06N3/045

Inventor: 盛乐园

Owner: 杭州一知智能科技有限公司
