Speech synthesis device supporting styles of multiple speakers, language switching and controllable rhythm

A speech synthesis and speaker technology, applied in speech synthesis, speech analysis, instruments, etc. It addresses the inability to decouple and separate multiple speakers, the inability to synthesize mixed speech, and the limited control over synthesized speech, with the effects of richer functions, lower deployment cost, and improved fault tolerance.

Active Publication Date: 2021-05-28
杭州一知智能科技有限公司

AI Technical Summary

Problems solved by technology

[0003] Existing speech synthesis methods only control the synthesized speech, and cannot synthesize speech mixed


Examples


Embodiment

[0114] The present invention is tested on a data set of 32,500 audio clips with corresponding text and prosody annotations from six speakers: 30,000 clips in Chinese, 2,000 in English, and 500 in mixed Chinese and English. The data set is preprocessed as follows:

[0115] 1) Extract the Chinese and English phoneme files and the corresponding audio, and use the open-source tool Montreal Forced Aligner to extract the duration of each phoneme.

[0116] 2) Extract the mel spectrum of each audio clip, with a window size of 50 milliseconds, a frame shift of 12.5 milliseconds, and 80 dimensions.

[0117] 3) Extract the pitch of each audio clip using the WORLD vocoder.

[0118] 4) Sum the mel spectrum extracted from each audio clip over its dimensions to obtain the energy of the mel spectrum.
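The framing parameters in step 2 and the energy computation in step 4 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the 16 kHz sample rate is an assumption (the source does not state one), the helper names `ms_to_samples`, `num_frames` and `frame_energy` are hypothetical, and the mel spectrum is a toy list of frames rather than one extracted from real audio.

```python
# Assumption: the source does not give a sample rate; 16 kHz is used for illustration.
SAMPLE_RATE_HZ = 16000
WINDOW_MS = 50.0   # window size from step 2
HOP_MS = 12.5      # frame shift from step 2
N_MELS = 80        # mel spectrum dimension from step 2


def ms_to_samples(duration_ms, sample_rate_hz):
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(duration_ms * sample_rate_hz / 1000.0))


def num_frames(num_samples, window, hop):
    """Count the full windows that fit in a signal when no padding is applied."""
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop


def frame_energy(mel):
    """Sum each frame of a mel spectrum over its mel dimensions (step 4).

    `mel` is a list of frames; each frame is a list of N_MELS values.
    """
    return [sum(frame) for frame in mel]


window = ms_to_samples(WINDOW_MS, SAMPLE_RATE_HZ)
hop = ms_to_samples(HOP_MS, SAMPLE_RATE_HZ)
frames_in_one_second = num_frames(SAMPLE_RATE_HZ, window, hop)

# Toy mel spectrum with three frames, used only to show the per-frame sum.
toy_mel = [[0.5] * N_MELS, [1.0] * N_MELS, [0.0] * N_MELS]
energy = frame_energy(toy_mel)
```

Under the assumed sample rate, the window and frame shift come out as whole numbers of samples, and the energy is one scalar per frame, so it has the same length as the mel spectrum along the time axis.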



Abstract

The invention discloses a speech synthesis device supporting multi-speaker styles, language switching and controllable rhythm, belonging to the field of speech synthesis. The device comprises: a text acquisition unit and a text preprocessing unit for acquiring and preprocessing different text data; a language switching unit for storing and displaying the speaker tags corresponding to the training data of each language type and for automatically identifying the language type of the text to be synthesized; a style switching unit for specifying a speech synthesis style according to the language type; a speaker switching unit for specifying a speaker; an encoding-decoding unit for obtaining a predicted mel spectrum; a training unit for training the encoding-decoding unit; and a speech synthesis unit for generating the predicted mel spectrum and converting it into a sound signal for playback. The invention can control the speaker and the speaker's style separately while generating speech with richer rhythm variation.
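The abstract says the language switching unit automatically identifies the language type of the text to be synthesized but does not say how. One simple way this could be done is sketched below, purely as an assumption: each character is classified by Unicode range, and the text is labelled Chinese, English, or mixed. The function names and the labels `"zh"`, `"en"`, `"mixed"` and `"unknown"` are hypothetical.

```python
def char_language(ch):
    """Classify one character as 'zh', 'en', or None (digits, spaces, punctuation)."""
    if "\u4e00" <= ch <= "\u9fff":      # CJK Unified Ideographs block
        return "zh"
    if ch.isascii() and ch.isalpha():   # ASCII letters
        return "en"
    return None


def detect_language(text):
    """Return 'zh', 'en', 'mixed', or 'unknown' for a text to be synthesized."""
    found = {char_language(ch) for ch in text} - {None}
    if found == {"zh"}:
        return "zh"
    if found == {"en"}:
        return "en"
    if found == {"zh", "en"}:
        return "mixed"
    return "unknown"


labels = [detect_language(t) for t in ["你好", "hello world", "打开 WiFi 设置"]]
```

A rule of this kind only separates Chinese characters from Latin letters; a real system might need finer handling of digits, punctuation and other scripts.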

Description

Technical field

[0001] The invention belongs to the field of speech synthesis, and in particular relates to a speech synthesis device supporting multi-speaker styles, language switching and controllable rhythm.

Background technique

[0002] In recent years, with the development of deep learning, speech synthesis technology has improved greatly. Speech synthesis has moved from the traditional parametric and concatenative methods to end-to-end methods. These usually first generate the mel spectrum from text features, and then use a vocoder to synthesize speech from the mel spectrum. By structure, end-to-end methods can be divided into autoregressive models and non-autoregressive models. Autoregressive models usually use the Encoder-Attention-Decoder mechanism for autoregressive generation: to generate the current data point, all previous data points in the time series must first be generated and fed to the model as input, as in Tacotron, Tacotron 2, Deep ...
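The difference between the two model families can be illustrated with a toy sketch. It does not implement any of the models named in the text: `toy_step` and `toy_frame` are hypothetical stand-ins for a trained network, chosen so that both generation styles produce the same sequence. The point is only the data dependency: the autoregressive loop must finish frame t-1 before it can start frame t, while the non-autoregressive version computes every frame from its position alone.

```python
def generate_autoregressive(step_fn, first_frame, n_frames):
    """Generate frames one at a time; each step receives all previous frames."""
    frames = [first_frame]
    while len(frames) < n_frames:
        frames.append(step_fn(frames))
    return frames


def generate_parallel(frame_fn, n_frames):
    """Generate every frame independently of the others."""
    return [frame_fn(t) for t in range(n_frames)]


def toy_step(history):
    """Toy stand-in for a decoder step: depends on the generated history."""
    return history[-1] + len(history)


def toy_frame(t):
    """Toy stand-in for a parallel decoder: depends only on the position t."""
    return t * (t + 1) // 2


autoregressive_frames = generate_autoregressive(toy_step, 0, 5)
parallel_frames = generate_parallel(toy_frame, 5)
```

Because the iterations of `generate_parallel` do not depend on each other, they could in principle run concurrently, whereas the loop in `generate_autoregressive` is inherently sequential.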


Application Information

IPC(8): G10L13/10, G10L19/02, G10L19/26, G10L25/30, G06N3/04, G06N3/08
CPC: G10L13/10, G10L19/02, G10L19/26, G10L25/30, G06N3/08, G06N3/045
Inventor: 盛乐园
Owner: 杭州一知智能科技有限公司