Voice synthesis model training method and device, electronic equipment and storage medium

A speech synthesis model training technology, applied in the fields of artificial intelligence and computer intelligent speech. It addresses poor synthesis quality and insufficiently full pronunciation of finals, reducing pronunciation problems and improving the human-computer interaction experience and user stickiness.

Active Publication Date: 2019-12-27
BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD

AI Technical Summary

Problems solved by technology

The Tacotron model tends to learn the pronunciation of function words, which results in a poor synthesis effect.
In addition, using phonemes as the input unit has another problem. Some finals can also stand alone as a complete syllable. A final is pronounced differently when it stands alone than when it is part of a larger syllable: a final that forms a syllable on its own needs a more complete pronunciation process. A phoneme-based model cannot distinguish between the two cases, so the final is not pronounced fully when it stands alone.



Examples


Embodiment 1

[0051] Figure 1 is a schematic flowchart of a speech synthesis model training method provided in Embodiment 1 of the present application. The method can be executed by a speech synthesis model training device or an electronic device, and the device or electronic device can be implemented in software and/or hardware. The device or electronic device can be integrated into any intelligent equipment with a network communication function. As shown in Figure 1, the training method of the speech synthesis model may include the following steps:

[0052] S101. Use the syllable input sequence, phoneme input sequence and Chinese character input sequence of the current sample as the input of the encoder of the model to be trained, and obtain the encoded representations of the syllable input sequence, phoneme input sequence and Chinese character input sequence at the output end of the encoder.

[0053] In a specific embodiment of the present application, the electronic device can...
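Step S101 can be illustrated with a minimal sketch. The excerpt does not describe the encoder architecture, so the single tanh layer below is only a stand-in. The names `make_encoder` and `random_sequence`, the sizes, the use of one shared encoder, and the equal length of the three sequences are all assumptions made for illustration.

```python
import math
import random

def make_encoder(in_dim, out_dim, seed=0):
    """Return a toy position-wise encoder: one tanh layer applied to every vector."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]

    def encode(sequence):
        # Project each input vector to out_dim values and squash them with tanh.
        return [
            [math.tanh(sum(w * x for w, x in zip(row, vec))) for row in weights]
            for vec in sequence
        ]

    return encode

def random_sequence(length, dim, seed):
    """Placeholder input sequence; real inputs would be derived from the text."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(length)]

IN_DIM, HIDDEN_DIM, STEPS = 6, 4, 5  # hypothetical sizes

# The three input sequences of one sample (assumed here to have equal length).
inputs = {
    "syllable": random_sequence(STEPS, IN_DIM, seed=1),
    "phoneme": random_sequence(STEPS, IN_DIM, seed=2),
    "character": random_sequence(STEPS, IN_DIM, seed=3),
}

# One encoded representation per input sequence, taken at the encoder output.
encoder = make_encoder(IN_DIM, HIDDEN_DIM)
encoded = {name: encoder(seq) for name, seq in inputs.items()}
```

The point of the sketch is only the data flow: three sequences go in, and one encoded representation per sequence comes out.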

Embodiment 2

[0063] Figure 2 is a schematic flowchart of the training method of the speech synthesis model provided in Embodiment 2 of the present application. As shown in Figure 2, the training method of the speech synthesis model may include the following steps:

[0064] S201. Convert the phonemes, syllables and Chinese characters in the current sample into their respective fixed-dimensional vector representations.

[0065] In a specific embodiment of the present application, the electronic device may convert the phonemes, syllables and Chinese characters in the current sample into respective fixed-dimensional vector representations. Specifically, the electronic device may convert phonemes in the current sample into vector representations of a first length; convert syllables and Chinese characters in the current sample into vector representations of a second length; wherein the first length is greater than the second length.
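The conversion in S201 can be sketched as a lookup in embedding tables, with phonemes mapped to longer vectors than syllables and Chinese characters. The dimensions, the random initialisation, the sample word and its split into phonemes are illustrative assumptions. `make_table` and `embed` are hypothetical helper names, not taken from the patent.

```python
import random

PHONEME_DIM = 8  # "first length" (hypothetical value)
COARSE_DIM = 4   # "second length", shared by syllables and characters (hypothetical value)

def make_table(vocab, dim, seed):
    """Build a toy embedding table: one fixed-length random vector per symbol."""
    rng = random.Random(seed)
    return {sym: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for sym in vocab}

def embed(symbols, table):
    """Look up the fixed-dimensional vector of each symbol, in order."""
    return [table[s] for s in symbols]

# One illustrative sample: the word 你好, with tones attached to the finals.
phonemes = ["n", "i3", "h", "ao3"]
syllables = ["ni3", "hao3"]
characters = ["你", "好"]

phoneme_vecs = embed(phonemes, make_table(sorted(set(phonemes)), PHONEME_DIM, seed=0))
syllable_vecs = embed(syllables, make_table(sorted(set(syllables)), COARSE_DIM, seed=1))
character_vecs = embed(characters, make_table(sorted(set(characters)), COARSE_DIM, seed=2))
```

In a trained model the tables would be learned parameters; the sketch only shows that the first length exceeds the second.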

[0066] S202. Convert the syllable vecto...

Embodiment 3

[0080] Figure 4 is a schematic structural diagram of a training device for a speech synthesis model provided in Embodiment 3 of the present application. As shown in Figure 4, the device 400 includes an input module 401, a fusion module 402 and an output module 403, wherein:

[0081] The input module 401 is configured to use the syllable input sequence, the phoneme input sequence and the Chinese character input sequence of the current sample as the input of the encoder of the model to be trained, and to obtain, at the output end of the encoder, the encoded representations of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence;

[0082] The fusion module 402 is configured to fuse the encoded representations of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence, to obtain a weighted combination of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence; the weighted combination o...



Abstract

The invention discloses a voice synthesis model training method and device, electronic equipment and a storage medium, and relates to the field of computer intelligent voice. In the specific implementation scheme, a syllable input sequence, a phoneme input sequence and a Chinese character input sequence of a current sample serve as the input of an encoder of a model to be trained, and encoded representations of the three sequences are obtained at the output end of the encoder. The encoded representations of the three sequences are fused to obtain a weighted combination of the three sequences. The weighted combination serves as the input of an attention module, and a weighted average of the weighted combination of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence at each moment is obtained at the output end of the attention module. The weighted average serves as the input of a decoder of the model to be trained, and the voice Mel spectrum output of the current sample is obtained at the output end of the decoder. According to the embodiments of the invention, the pronunciation effect can be effectively improved, and highly expressive, highly natural synthesized Chinese voice is provided for voice products.
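The fusion and attention steps of the abstract can be sketched as follows. The fixed fusion weights, the dot-product scoring, the toy encoded values and the `query` vector standing in for the decoder state are all assumptions made for illustration. The abstract does not say how the fusion weights or the attention scores are computed.

```python
import math

def fuse(encoded, weights):
    """Weighted combination of the encoded sequences, position by position."""
    names = list(encoded)
    steps = len(encoded[names[0]])
    dim = len(encoded[names[0]][0])
    return [
        [sum(weights[n] * encoded[n][t][d] for n in names) for d in range(dim)]
        for t in range(steps)
    ]

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(fused, query):
    """Weighted average of the fused vectors for one decoder time step."""
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in fused]
    probs = softmax(scores)
    dim = len(fused[0])
    context = [sum(p * vec[d] for p, vec in zip(probs, fused)) for d in range(dim)]
    return context, probs

# Toy encoded representations: two positions, two dimensions per sequence.
encoded = {
    "syllable": [[1.0, 0.0], [0.0, 1.0]],
    "phoneme": [[0.0, 2.0], [2.0, 0.0]],
    "character": [[1.0, 1.0], [1.0, 1.0]],
}
weights = {"syllable": 0.25, "phoneme": 0.5, "character": 0.25}  # hypothetical weights

fused = fuse(encoded, weights)
context, probs = attend(fused, query=[1.0, 0.0])
```

In the described method, the `context` vector computed at each moment is what the decoder would consume to produce the Mel spectrum; the decoder itself is left out of the sketch.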

Description

Technical field

[0001] The present application relates to the field of artificial intelligence technology, and further to the field of computer intelligent speech, in particular to a training method and device, electronic equipment and a storage medium for a speech synthesis model.

Background

[0002] In the field of speech synthesis, neural network-based methods such as WaveNet and WaveRNN have greatly improved the sound quality and naturalness of synthesized speech. Such methods usually require a front-end system to extract linguistic features from the text and to predict information such as fundamental frequency and duration. The end-to-end Tacotron model proposed by Google removes the need for a complex front-end system that requires a lot of expert knowledge, and automatically learns the rhythm and emotion of the voice in the sound library through a sequence conversion model. The synthesized speech is particularly outstanding in terms of expressiveness. ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G10L13/02, G10L13/08, G06N3/08
CPC: G10L13/02, G10L13/08, G06N3/08, G10L25/30, G06N3/044, G06N3/045, G10L13/047, G10L13/06
Inventor: 陈智鹏, 白锦峰, 贾磊
Owner: BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD