Voice synthesis model training method and system, voice synthesis method and system, equipment and medium

A speech synthesis and model training technology, applied in speech synthesis, speech analysis, biological neural network models and related fields, which addresses problems such as high cost, slow synthesis speed, and the inability to meet order call requirements

Active Publication Date: 2020-09-04
CTRIP COMP TECH SHANGHAI

AI Technical Summary

Problems solved by technology

[0004] The technical problem to be solved by the present invention is to overcome the defects of speech synthesis technology in the prior art, namely high cost, slow synthesis speed, and failure to meet actual order call requirements, by providing a training method, synthesis method, system, device and medium for a speech synthesis model.



Examples


Embodiment 1

[0091] As shown in Figure 1, the training method of the speech synthesis model of this embodiment includes:

[0092] S101. Obtain several pieces of historical text information and historical voice information corresponding to the historical text information;

[0093] The historical text information is obtained from call records between the e-commerce platform's hotel customer service and hotel merchants; the corresponding historical voice information (historical audio files) is recorded in a recording studio by dedicated human customer service agents. For example, a total of 10,000 historical audio files sampled at 16 kHz were recorded, with a total audio duration of about 10 hours, and the text corresponding to each audio file was checked by dedicated staff.

[0094] S102. Obtain a historical text vector corresponding to each historical text information;

[0095] S103. Construct an initial acoustic model based on a CNN network...
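
The corpus described in paragraph [0093] (clips recorded at 16 kHz, about 10 hours in total) can be sanity-checked before training. The following is a minimal sketch, not the method of the invention: it assumes the recordings are stored as uncompressed WAV files in one directory, and the names `audit_corpus` and `write_silence` are hypothetical helpers introduced only for illustration.

```python
import wave
from pathlib import Path

EXPECTED_RATE_HZ = 16_000  # sample rate stated in paragraph [0093]


def audit_corpus(wav_dir):
    """Return (clip_count, total_seconds, off_rate_files) for the .wav files in wav_dir."""
    clip_count = 0
    total_seconds = 0.0
    off_rate_files = []
    for path in sorted(Path(wav_dir).glob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
            frames = wav.getnframes()
        clip_count += 1
        total_seconds += frames / rate
        # Collect clips whose sample rate differs from the expected 16 kHz.
        if rate != EXPECTED_RATE_HZ:
            off_rate_files.append(path.name)
    return clip_count, total_seconds, off_rate_files


def write_silence(path, seconds, rate=EXPECTED_RATE_HZ):
    """Hypothetical helper: write a mono 16-bit silent clip, used only to demo the audit."""
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
```

For the figures given in [0093], 10 hours spread over 10,000 clips works out to an average of 36,000 s / 10,000 = 3.6 s per clip, which is a useful reference value when reading the audit totals.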

Embodiment 2

[0102] As shown in Figure 2, the training method of the speech synthesis model of this embodiment is a further improvement on Embodiment 1, specifically:

[0103] Step S102 includes:

[0104] S1021. Preprocess the historical text information;

[0105] Preprocessing operations include removing garbled characters and non-standard punctuation marks from the historical text information and converting Chinese punctuation into English punctuation. Because numbers are pronounced differently in different scenarios, each number is replaced with the corresponding Chinese characters according to keywords obtained from matching statistics, and the conversion rules differ between scenarios. For example, the number in "the house price is 318 yuan" should be converted to its cardinal reading ("three hundred and eighteen yuan"), while the number in "room number 318" should be read digit by digit ("room number three one eight").

[0106] S1022. Perform word segmentation processing on the preprocessed historical text information t...
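
The preprocessing in step S1021 (paragraph [0105]) can be sketched as follows. This is an illustration under stated assumptions rather than the rules used by the invention: the punctuation table, the keyword list `DIGIT_BY_DIGIT_KEYWORDS`, the four-character context window, and the choice of cardinal reading for prices versus digit-by-digit reading for room numbers are all assumed for the example, and all function names are hypothetical.

```python
import re

# Assumed subset of Chinese (full-width) punctuation and its English equivalents.
PUNCT_MAP = str.maketrans({
    "，": ",", "。": ".", "！": "!", "？": "?",
    "：": ":", "；": ";", "（": "(", "）": ")",
})

DIGITS = "零一二三四五六七八九"

# Hypothetical keywords that trigger digit-by-digit reading (room number, phone, order number).
DIGIT_BY_DIGIT_KEYWORDS = ("房间号", "房号", "电话", "订单号")


def read_digit_by_digit(number):
    """Read each digit separately, e.g. '318' -> '三一八'."""
    return "".join(DIGITS[int(d)] for d in number)


def read_as_cardinal(number):
    """Read an integer below 10,000 as a Chinese cardinal, e.g. '318' -> '三百一十八'."""
    value = int(number)
    if value == 0:
        return DIGITS[0]
    digits = str(value)
    if len(digits) > 4:
        # Larger numbers are outside this sketch; fall back to digit-by-digit reading.
        return read_digit_by_digit(number)
    units = ["", "十", "百", "千"]
    out = []
    pending_zero = False
    for i, ch in enumerate(digits):
        d = int(ch)
        if d == 0:
            pending_zero = True
            continue
        if pending_zero and out:
            out.append(DIGITS[0])  # a single 零 stands for any run of inner zeros
        pending_zero = False
        out.append(DIGITS[d] + units[len(digits) - 1 - i])
    result = "".join(out)
    # 10 to 19 are read 十, 十一, ... rather than 一十, 一十一, ...
    return result[1:] if 10 <= value < 20 else result


def normalize(text):
    """Convert punctuation, drop unexpected characters, and spell out numbers."""
    text = text.translate(PUNCT_MAP)
    # Keep only CJK ideographs, ASCII letters and digits, whitespace, and basic punctuation.
    text = re.sub(r"[^\u4e00-\u9fffA-Za-z0-9\s,.!?:;()]", "", text)

    def replace(match):
        # Look at up to four characters before the number for a trigger keyword.
        context = text[max(0, match.start() - 4):match.start()]
        if any(keyword in context for keyword in DIGIT_BY_DIGIT_KEYWORDS):
            return read_digit_by_digit(match.group())
        return read_as_cardinal(match.group())

    return re.sub(r"[0-9]+", replace, text)
```

With these assumed rules, `normalize("房价318元")` returns `"房价三百一十八元"` and `normalize("房间号318")` returns `"房间号三一八"`. A production rule set would need many more keywords and number formats (dates, phone numbers, amounts with decimals) than this sketch covers.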

Embodiment 3

[0140] The speech synthesis method in this embodiment is realized by using the speech synthesis model training method in Embodiment 1 or 2.

[0141] As shown in Figure 6, when the target vocoder model includes a generative model, the speech synthesis method of this embodiment includes:

[0142] S201. Obtain target text information;

[0143] S202. Generate a target text vector according to the target text information;

[0144] S203. Input the target text vector into the target acoustic model of the speech synthesis model; the target acoustic model outputs the target mel spectrum from the input target text vector and passes it to the target vocoder model;

[0145] S204. Using the generative model in the target vocoder model, convert the target mel spectrum to obtain target speech synthesis information corresponding to the target text information.
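
Steps S201 to S204 form a three-stage pipeline: text to vector, vector to mel spectrum, mel spectrum to waveform. The sketch below shows only that data flow. The class `SpeechSynthesizer`, the `toy_*` stand-ins, and the constants `N_MELS` and `HOP_LENGTH` are assumptions made for illustration; the trained acoustic model and the vocoder's generative model would replace the placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List

TextVector = List[int]
MelSpectrum = List[List[float]]  # frames x mel bins
Waveform = List[float]

N_MELS = 80       # assumed number of mel bins, not stated in the source
HOP_LENGTH = 256  # assumed number of audio samples produced per mel frame


@dataclass
class SpeechSynthesizer:
    text_to_vector: Callable[[str], TextVector]           # S202
    acoustic_model: Callable[[TextVector], MelSpectrum]   # S203
    vocoder_generator: Callable[[MelSpectrum], Waveform]  # S204

    def synthesize(self, target_text: str) -> Waveform:
        """Run S202 to S204 on the target text obtained in S201."""
        vector = self.text_to_vector(target_text)
        mel = self.acoustic_model(vector)
        return self.vocoder_generator(mel)


def toy_text_to_vector(text: str) -> TextVector:
    """Placeholder: one integer ID per character."""
    return [ord(ch) for ch in text]


def toy_acoustic_model(vector: TextVector) -> MelSpectrum:
    """Placeholder: one all-zero mel frame per input token."""
    return [[0.0] * N_MELS for _ in vector]


def toy_vocoder_generator(mel: MelSpectrum) -> Waveform:
    """Placeholder: HOP_LENGTH silent samples per mel frame."""
    return [0.0] * (len(mel) * HOP_LENGTH)
```

Keeping the three stages as separate callables mirrors the hand-off described in S203, where the acoustic model passes its mel spectrum to the vocoder, and lets each stage be swapped or tested on its own.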

[0146] In this embodiment, based on the speech synthesis model obtained through training, the targe...


Abstract

The invention discloses a voice synthesis model training method and system, a voice synthesis method and system, equipment and a medium. The training method comprises the following steps: acquiring a plurality of pieces of historical text information and the corresponding historical voice information; obtaining a historical text vector of the historical text information; constructing an initial acoustic model based on a CNN and a bidirectional LSTM network; based on the historical text vector and the first Mel spectrum of the historical voice information, performing model training on the initial acoustic model to obtain a target acoustic model; and based on the second Mel spectrum and the historical voice information, performing model training on a preset neural network model to obtain a target vocoder model. According to the invention, the acoustic model is built based on the CNN, the bidirectional LSTM network and a linear layer, and the vocoder model is built based on a generative adversarial network (GAN), thereby greatly improving the voice synthesis speed while guaranteeing the voice synthesis quality, and meeting the demands of an e-commerce platform for a large number of outgoing calls.

Description

Technical field

[0001] The invention relates to the technical field of speech processing, in particular to a training method, synthesis method, system, device and medium for a speech synthesis model.

Background art

[0002] An e-commerce service platform needs to make a large number of outbound calls to hotels and customers every day. To save labor costs, existing intelligent outbound calls to hotels and customers are mainly realized through speech synthesis technology.

[0003] At present, speech synthesis is mainly realized with the splicing method, which pre-records a large amount of speech and then splices the required basic speech units according to the text to be synthesized. Although the quality of speech synthesized this way is relatively high, the amount of audio data to be recorded is huge and the cost is high. In addition, the speech synthesized by the existing speech synthesis ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G10L13/02, G10L13/04, G10L13/08, G06N3/04
CPC: G10L13/02, G10L13/04, G10L13/08, G06N3/049, G06N3/045
Inventor: 周明康, 罗超, 吉聪睿, 李巍, 胡泓
Owner: CTRIP COMP TECH SHANGHAI