
End-to-end voice synthesis method and system based on DNN-HMM bimodal alignment network

A speech synthesis technology based on dual-modal alignment, applied in the fields of intelligent speech interaction and computer intelligent speech synthesis. It addresses the problems of high model complexity, unusability on low-computing-resource hardware, and unsatisfactory synthesis of long sentences, with the effects of improving pronunciation, reducing complexity, and reducing model parameters.

Active Publication Date: 2020-10-02
ZHEJIANG UNIV

AI Technical Summary

Problems solved by technology

[0007] To solve the problems of existing speech synthesis technology, namely that high model complexity makes it unusable on low computing resources and that long-sentence synthesis is unsatisfactory, the present invention proposes an end-to-end speech synthesis method and system based on a DNN-HMM dual-modal alignment network. A phoneme frame-length sequence is first obtained by training the DNN-HMM dual-modal alignment network, and the end-to-end speech synthesis model is then trained on that basis, thereby avoiding the process by which a traditional end-to-end speech synthesis model obtains text-audio alignment information through autoregressive attention.
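The core idea, replacing autoregressive attention with an explicit phoneme frame-length sequence, can be sketched as a simple duration-based expansion. This is a hypothetical numpy illustration of the general technique, not the patent's actual implementation:

```python
import numpy as np

def expand_by_duration(phoneme_encodings, frame_lengths):
    """Repeat each phoneme encoding by its frame length so the encoder
    output lines up frame-by-frame with the target mel spectrogram.
    frame_lengths plays the role of the phoneme frame-length sequence
    obtained from a DNN-HMM alignment."""
    return np.repeat(phoneme_encodings, frame_lengths, axis=0)

# Toy example: 3 phonemes, 2-dim encodings, durations of 2, 3 and 1 frames.
enc = np.arange(6, dtype=float).reshape(3, 2)
durations = np.array([2, 3, 1])
frames = expand_by_duration(enc, durations)
print(frames.shape)  # (6, 2): 2 + 3 + 1 = 6 spectrogram frames
```

Because the alignment is given explicitly, the decoder never has to learn it through attention, which is what removes the attention parameters and the autoregressive alignment step.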

Method used



Examples


[0035] An end-to-end speech synthesis method based on a DNN-HMM bimodal alignment network mainly includes the following steps.

[0036] 1. Convert the text into a phoneme input sequence, and convert the standard speech audio corresponding to the text into a standard mel spectrum.

[0037] 2. Text-to-speech alignment is performed through the DNN-HMM dual-modal alignment network to obtain a standard phoneme frame length sequence.

[0038] 3. Build a speech synthesis model.

[0039] 4. End-to-end training of the speech synthesis model.

[0040] 5. Convert the text to be processed into a phoneme input sequence and use it as the input of the trained speech synthesis model to obtain the speech corresponding to the text.
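Step 1 above (audio to standard mel spectrum) can be illustrated with a minimal, numpy-only log-mel extraction. The frame parameters (1024-point FFT, 256-sample hop, 80 mel bands) are common TTS defaults assumed for illustration, not values taken from the patent:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(y, sr=16000, n_fft=1024, hop=256, n_mels=80):
    """Minimal log-mel extraction: windowed magnitude STFT followed by
    a triangular mel filterbank and a floored log."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))          # (frames, n_fft//2+1)
    # Triangular filters centered at points evenly spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)
    mel = mag @ fbank.T
    return np.log(np.clip(mel, 1e-5, None))            # (frames, n_mels)

y = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # 1 s test tone
spec = log_mel_spectrogram(y)
print(spec.shape)  # (59, 80): 59 frames of 80 mel bands
```

In practice a library such as librosa would be used for this step; the sketch only makes the frame/filterbank structure of the "standard mel spectrum" concrete.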

[0041] In a specific implementation of the present invention, the text preprocessing process is introduced.

[0042] Step 1-1: obtain the text data passed in through the interface, normalize the text, and check whether there are XML tags; if there are XML ...

Embodiment

[0125] To verify the effect of the present invention, Figure 5 shows a test comparison performed on domestic Chinese open-source data, mainly the Chinese standard female-voice database open-sourced by Biaobei Company. Its speech data is a monophonic recording at a 48 kHz sampling rate with 16-bit depth, in PCM WAV format, comprising 10,000 sentences of Chinese female-voice data and the corresponding text. This embodiment is further compared and described on this open-source data; the specific data set segmentation is shown in Table 1.

[0126] Table 1

[0127]
    Data set                           Training data   Test data   Sampling rate
    Biaobei open-source female voice   9500            500         16K

[0128] According to the above data distribution, 9,500 sentences are used as training data, and a training comparison is carried out with Tacotron2 and the parametric method speec...
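The Table 1 split (9,500 training / 500 test sentences out of 10,000) amounts to a simple shuffle-and-slice. The utterance IDs below are placeholders, not the Biaobei corpus's real file names:

```python
import random

# 10,000 placeholder sentence IDs standing in for the Biaobei corpus.
sentences = [f"utt_{i:05d}" for i in range(10000)]

random.seed(0)                  # fixed seed for a reproducible split
random.shuffle(sentences)
train, test = sentences[:9500], sentences[9500:]
print(len(train), len(test))    # 9500 500
```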



Abstract

The invention discloses an end-to-end voice synthesis method and system based on a DNN-HMM bimodal alignment network, and belongs to the field of intelligent voice interaction. In the method, a frame-length prediction module replaces the traditional end-to-end attention autoregressive structure, and a convolutional transformation module and a bidirectional long short-term memory network are used to construct the encoder and decoder, greatly reducing the number of model parameters. On the basis of the phoneme frame-length sequence obtained by training the DNN-HMM bimodal alignment network, an end-to-end voice synthesis model is trained, avoiding the process by which a traditional end-to-end voice synthesis model obtains text-audio alignment information through autoregressive attention. The trained model not only preserves the high naturalness of audio synthesized by an end-to-end model, but also greatly reduces the computing-resource consumption and time required for voice synthesis, so that end-to-end voice synthesis can be deployed on hardware with low computing resources.
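The encoder described in the abstract (convolution layers followed by a bidirectional LSTM) might be sketched as follows. All dimensions, layer counts, and kernel sizes here are assumptions for illustration; the patent does not disclose them in this summary:

```python
import torch
import torch.nn as nn

class ConvBiLSTMEncoder(nn.Module):
    """Hypothetical convolution + BiLSTM phoneme encoder, loosely
    following the abstract's description (hyperparameters assumed)."""
    def __init__(self, num_phonemes=100, emb_dim=256, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(num_phonemes, emb_dim)
        self.convs = nn.Sequential(
            nn.Conv1d(emb_dim, emb_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(emb_dim, emb_dim, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                              bidirectional=True)

    def forward(self, phoneme_ids):
        x = self.embed(phoneme_ids)                     # (B, T, emb)
        x = self.convs(x.transpose(1, 2)).transpose(1, 2)
        out, _ = self.bilstm(x)                         # (B, T, 2*hidden)
        return out

enc = ConvBiLSTMEncoder()
out = enc(torch.randint(0, 100, (2, 7)))                # batch of 2, 7 phonemes
print(out.shape)  # torch.Size([2, 7, 256])
```

Compared with an attention-based autoregressive decoder, such a structure has no cross-attention parameters, which is consistent with the parameter-reduction claim.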

Description

technical field [0001] The present invention relates to the field of intelligent voice interaction, further to the field of computer intelligent voice synthesis, and in particular to an end-to-end voice synthesis method and system based on a DNN-HMM dual-mode alignment network. Background technique [0002] In recent years, with the rise of deep learning, deep network models have come to dominate many fields of machine learning. Text to Speech (TTS), the process of synthesizing artificial speech from text symbols, is gradually being taken over by end-to-end deep neural networks. In the early exploration of speech synthesis, scholars proposed methods based on statistical parameters. Statistical parametric speech synthesis is mainly based on parametric representations of speech features, such as Mel spectrum, fundamental frequency and other acoustic feature parameters, through hidden Markov model (HMM) modeling and related features of ...

Claims


Application Information

Patent Type & Authority: Application (China)
IPC(8): G10L13/02; G10L15/06; G10L15/14; G06N3/04; G06N3/08
CPC: G10L13/02; G10L15/144; G10L15/063; G06N3/08; G10L2015/0635; G10L2015/0631; G06N3/045
Inventors: 陈飞扬, 赵洲
Owner ZHEJIANG UNIV