Rhythm-controllable Chinese and English mixed speech synthesis method and system

A rhythm-controllable speech synthesis technology for mixed Chinese and English text, applied in speech synthesis, speech analysis, neural learning methods, etc. It addresses problems of existing models such as being unusable on low-resource hardware, reduced naturalness in long-sentence synthesis, and model complexity, with the effects of reducing computing resource requirements, simplifying training complexity, and improving fault tolerance.

Active Publication Date: 2021-05-14
HANGZHOU YIWISE INTELLIGENT TECH CO LTD
Cites: 10; Cited by: 8

AI Technical Summary

Problems solved by technology

However, because of their complex network structure and autoregressive form, existing autoregressive speech synthesis models still have several shortcomings in production use:
[0004] (1) The model is complex and demands substantial computing resources, so it cannot run on hardware with limited computing resources;
[0005] (2) Owing to the autoregressive structure, the naturalness of synthesized long sentences degrades;
[0006] (3) For voice control, most models use only duration, energy, and pitch, so control over speech rhythm is incomplete;
[0007] (4) Text that mixes Chinese and English cannot be synthesized.



Examples


Implementation example

[0036] As shown in Figure 1, the rhythm-controllable Chinese-English mixed speech synthesis method of the present invention comprises the following steps:

[0037] Step 1. Preprocess the input Chinese-English text data sequence, which carries prosodic labels, and use it as the input of the jump neural network encoder. The jump neural network encoder consists of an embedding layer, a CBHG module, and a jump module.

[0038] Step 2. Combine the output of the jump neural network encoder with the output of the CBHG module and apply duration adjustment to obtain the duration-adjusted text encoding information.

[0039] Step 3. Feed the duration-adjusted text encoding information to the pitch prediction module and the energy prediction module, respectively, to obtain the predicted pitch and the predicted energy; combine the predicted pitch, the predicted energy, and the duration-adjusted text encoding information and, finally, use them as the input ...
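The duration adjustment in Step 2 can be illustrated with a small sketch. The excerpt does not say how the adjustment is implemented, so the code below assumes a common length-regulator approach: each token encoding is repeated for as many mel frames as its duration. The function name `regulate_length` and the toy values are hypothetical and are not taken from the patent.

```python
import numpy as np


def regulate_length(encodings: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """Repeat each token encoding durations[i] times along the time axis.

    Assumption: this mirrors a generic length regulator; the patent excerpt
    does not describe its own duration-adjustment mechanism in detail.
    """
    if encodings.shape[0] != durations.shape[0]:
        raise ValueError("one duration per token is required")
    if np.any(durations < 0):
        raise ValueError("durations must be non-negative")
    return np.repeat(encodings, durations, axis=0)


# Hypothetical toy input: 4 tokens with 2-dimensional encodings.
token_encodings = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]])
# Hypothetical durations in mel frames; a duration of 0 drops the token.
frame_counts = np.array([3, 1, 0, 2])

frame_level = regulate_length(token_encodings, frame_counts)
print(frame_level.shape)  # (6, 2): one row per mel frame
```

After this expansion the text encoding has the same length as the target mel spectrum, which is what lets frame-level pitch and energy predictors operate on it.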

Embodiment

[0090] The present invention is tested on a dataset of 12,500 audio clips with corresponding text and prosodic annotations: 10,000 clips in Chinese, 2,000 in English, and 500 mixing Chinese and English. The dataset is preprocessed as follows:

[0091] 1) Extract the Chinese and English phoneme files and the corresponding audio, and use the open-source tool Montreal Forced Aligner to extract the pronunciation duration of each phoneme.

[0092] 2) Extract the mel spectrum of each audio clip, with a window size of 50 milliseconds, a frame shift of 12.5 milliseconds, and 80 mel dimensions.

[0093] 3) Extract the pitch of each audio clip using the WORLD vocoder.

[0094] 4) Sum the mel spectrum extracted from each audio clip over its dimensions to obtain the energy of the mel spectrum.
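The arithmetic behind these preprocessing steps can be sketched as follows. Only the 50 ms window, the 12.5 ms frame shift, and the 80 mel dimensions come from the text. The 16 kHz sample rate, the phoneme boundaries, the helper `boundaries_to_frame_counts`, and the random stand-in for a mel spectrum are all assumptions made for illustration. Summing over the mel axis is one reading of "summing over dimensions" in item 4.

```python
import numpy as np

SAMPLE_RATE = 16000  # assumption: the sample rate is not stated in the source
WINDOW_MS = 50.0     # window size from the source
HOP_MS = 12.5        # frame shift from the source
N_MELS = 80          # mel dimensions from the source

# Window and hop sizes expressed in samples.
win_length = int(round(SAMPLE_RATE * WINDOW_MS / 1000))
hop_length = int(round(SAMPLE_RATE * HOP_MS / 1000))


def boundaries_to_frame_counts(boundaries_s, hop_s):
    """Convert cumulative phoneme boundaries (seconds) to per-phoneme frame counts.

    Rounding the boundaries, not the individual durations, keeps the total
    number of frames consistent with the end time of the utterance.
    """
    frame_idx = np.rint(np.asarray(boundaries_s) / hop_s).astype(int)
    return np.diff(frame_idx)


# Hypothetical aligner output: three phonemes ending at 0.10 s, 0.15 s, 0.40 s.
phoneme_boundaries = [0.0, 0.10, 0.15, 0.40]
phoneme_frames = boundaries_to_frame_counts(phoneme_boundaries, HOP_MS / 1000)

# Random placeholder standing in for a real (frames x 80) mel spectrum.
rng = np.random.default_rng(0)
mel = rng.random((int(phoneme_frames.sum()), N_MELS))

# Per-frame energy: sum over the mel dimension.
energy = mel.sum(axis=1)

print(win_length, hop_length, phoneme_frames.tolist(), energy.shape)
```

With these assumed values, the per-phoneme frame counts sum to the number of mel frames, so durations, pitch, and energy can all be aligned frame by frame.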

Abstract

The invention discloses a rhythm-controllable Chinese and English mixed speech synthesis method and system, and belongs to the field of speech synthesis. The method comprises the following steps: 1) extracting phoneme pronunciation duration, energy, pitch, and mel spectrum from Chinese and English texts and audio with rhythm marks as a training set, and learning a text coding representation aligned in length with the mel spectrum; 2) generating predicted energy and pitch through an energy and pitch prediction model to realize energy and pitch control; and 3) combining the predicted energy, pitch, and text coding representation, outputting a synthesized mel spectrum through a decoder, and synthesizing speech through a vocoder. The method uses a jump neural network encoder to better control the rhythm pause information in the synthesized speech, and uses the predicted duration, energy, and pitch to finely control the rhythm pronunciation information of each frame, generating speech with richer rhythm variation. The whole method is carried out by a single model, and the text does not need to be separated by language beforehand.

Description

Technical field

[0001] The invention belongs to the field of speech synthesis and relates to Chinese-English mixed speech synthesis, in particular to a rhythm-controllable Chinese-English mixed speech synthesis method and a system thereof.

Background

[0002] In recent years, with the development of deep learning, speech synthesis technology has improved greatly. Speech synthesis has moved from the traditional parametric and concatenative methods to end-to-end methods. End-to-end methods usually first generate the mel spectrum from the text features and then use a vocoder to synthesize speech from the mel spectrum. By structure, these end-to-end methods can be divided into autoregressive models and non-autoregressive models. Autoregressive models usually use the Encoder-Attention-Decoder mechanism for autoregressive generation: to generate the current data point, all previous data points in the time series must be generated as mo...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G10L13/10, G10L25/30, G06N3/04, G06N3/08
CPC: G10L13/10, G10L25/30, G06N3/084, G06N3/044, G06N3/045
Inventor: 盛乐园
Owner: HANGZHOU YIWISE INTELLIGENT TECH CO LTD