Speech synthesis model training method and speech synthesis method

A speech synthesis model technology, applied in speech synthesis, speech analysis, instruments, etc. It addresses three problems: speaker embeddings carry limited timbre information, adapted models cannot fit new data well, and the similarity improvement for out-of-set speakers is limited.

Active Publication Date: 2021-04-09

AISPEECH CO LTD


Problems solved by technology

[0005] Global-level and sentence-level speaker embeddings provide limited speaker timbre information, so they achieve good similarity only on test data that closely resembles the in-set data. For test data that cannot be fitted well, the similarity of the synthesized voice is very poor.

Specifically, the training criterion of the pre-trained speaker embedding method is purely discriminative and does not require the embedding to reconstruct the audio features, so the sentence-level timbre information it can provide is very limited and the similarity does not reach a satisfactory level.

Although a jointly trained reference encoder can provide more speaker information, TTS data contains far fewer speakers than speaker verification data. For unseen speakers it is therefore not effective: it performs no better than pre-trained speaker embedding methods such as d-vector and x-vector, and can be even worse.

[0006] Although frame-level speaker embeddings provide speaker information at a finer granularity, the gain is limited by the reference audio and by the unstable attention computation, so the similarity improvement for out-of-set speakers is very limited.

[0007] Directly updating the model parameters is prone to overfitting and to very unstable synthesized sound quality, because the test data is inaccurately labeled and small in amount.

Methods such as LHUC reduce the number of model parameters that are updated, which alleviates overfitting to some extent. However, if the distribution of the target data differs greatly from that of the original data, such methods cannot fit the new data well.
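To illustrate why LHUC-style adaptation updates so few parameters, the following is a minimal sketch and not taken from the patent. It assumes the commonly described LHUC form, in which each hidden unit of a frozen layer is rescaled by a speaker-dependent amplitude 2 * sigmoid(r). The layer sizes, weight values and the names lhuc_scale and hidden_layer are hypothetical.

```python
import math

def lhuc_scale(r):
    """Per-unit amplitude in (0, 2); r = 0 gives 1.0, i.e. the unadapted layer."""
    return 2.0 / (1.0 + math.exp(-r))

def hidden_layer(x, weights, biases, r=None):
    """Frozen tanh layer; if r is given, each hidden unit is rescaled by LHUC."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(weights, biases)]
    if r is None:
        return h
    return [lhuc_scale(ri) * hi for ri, hi in zip(r, h)]

# Hypothetical layer: 2 hidden units, 3 inputs (values made up for illustration).
weights = [[0.2, -0.1, 0.4], [0.5, 0.3, -0.2]]
biases = [0.0, 0.1]
r = [0.0, 0.0]  # one LHUC parameter per hidden unit, the only values adapted

n_layer_params = sum(len(row) for row in weights) + len(biases)  # stay frozen
n_lhuc_params = len(r)                                           # adapted per speaker
```

Only the entries of r would be updated for a new speaker, which keeps the adapted parameter count small. Because each unit can only be rescaled, this also suggests why such a scheme may struggle when the target data is far from the original distribution.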


Embodiment Construction

[0028] In order to make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below in conjunction with the drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.

[0029] It should be noted that, in the case of no conflict, the embodiments in the present application and the features in the embodiments can be combined with each other.

[0030] The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, progr...


Abstract

The invention discloses a speech synthesis model training method. The speech synthesis model comprises an encoder, a speaker embedding prediction network, a duration expansion module and a decoder. The method comprises the following steps: preprocessing training data to obtain a sample training data set and a target speaker data set; training the speech synthesis model based on the sample training data set; and performing adaptive training on the speaker embedding prediction network based on the target speaker data set, so that it predicts a speaker embedding prediction value from the text to be synthesized. According to the embodiment of the invention, the speech synthesis model is first trained as a whole on the sample training data to obtain a general speech synthesis model. The speaker embedding prediction network in this general model is then adaptively trained on the target speaker data set, so that it learns the timbre characteristics of the target speaker and the audio synthesized during speech synthesis is closer to the target speaker.
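The two training stages described in the abstract can be sketched as follows. This is a toy illustration under strong assumptions, not the patented implementation: each module is reduced to one or two scalar parameters, the duration expansion module is omitted, the loss is a mean squared error, and the data, learning rate and all names (ToyTTS, train, SPEAKER_PARAMS and so on) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ToyTTS:
    # Scalar stand-ins for the modules named in the abstract (hypothetical).
    w_enc: float = 0.5  # encoder
    w_spk: float = 0.0  # speaker embedding prediction network, weight
    b_spk: float = 0.0  # speaker embedding prediction network, bias
    w_dec: float = 0.5  # decoder

    def forward(self, x):
        h = self.w_enc * x               # encoder output for a text feature
        e = self.w_spk * h + self.b_spk  # speaker embedding predicted from text
        return self.w_dec * (h + e)      # decoder output (toy acoustic feature)

ALL_PARAMS = ("w_enc", "w_spk", "b_spk", "w_dec")
SPEAKER_PARAMS = ("w_spk", "b_spk")

def mse(model, data):
    return sum((model.forward(x) - y) ** 2 for x, y in data) / len(data)

def grads(model, data):
    """Analytic gradients of the mean squared error for every parameter."""
    g = dict.fromkeys(ALL_PARAMS, 0.0)
    for x, y in data:
        h = model.w_enc * x
        e = model.w_spk * h + model.b_spk
        err = 2.0 * (model.w_dec * (h + e) - y) / len(data)
        g["w_dec"] += err * (h + e)
        g["b_spk"] += err * model.w_dec
        g["w_spk"] += err * model.w_dec * h
        g["w_enc"] += err * model.w_dec * (1.0 + model.w_spk) * x
    return g

def train(model, data, trainable, lr=0.05, steps=200):
    """Gradient descent that updates only the parameters listed in trainable."""
    for _ in range(steps):
        g = grads(model, data)
        for name in trainable:
            setattr(model, name, getattr(model, name) - lr * g[name])
    return model

# Made-up data: a generic corpus and a target speaker with a constant offset.
generic = [(x, x) for x in (-1.0, -0.5, 0.5, 1.0)]
target = [(x, x + 0.3) for x in (-1.0, -0.5, 0.5, 1.0)]

model = train(ToyTTS(), generic, ALL_PARAMS)  # stage 1: train the whole model
frozen_before = (model.w_enc, model.w_dec)
loss_before = mse(model, target)
train(model, target, SPEAKER_PARAMS)          # stage 2: adapt speaker network only
loss_after = mse(model, target)
```

In this sketch the encoder and decoder parameters are left untouched by stage 2, while the speaker embedding parameters move toward the target speaker data. In a real system the same effect would typically be obtained by freezing every module except the speaker embedding prediction network during adaptation.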

Description

Technical field

[0001] The invention relates to the technical field of speech synthesis, and in particular to a speech synthesis model training method, a speech synthesis method, an electronic device and a computer-readable storage medium.

Background

[0002] In recent years, with the popularization of mobile devices, voice-based human-computer interaction scenarios have become more and more common. Speech, as the most important and natural way for humans to communicate, is considered the most natural entry point for human-computer interaction applications and is now widely used in different interaction scenarios. A complete voice-based human-computer interaction system covers the user's query, machine recognition and understanding, generation of a text reply through natural language generation, and finally feedback to the user through speech synthesis (Text-To-Speech, TTS). Therefore, synthesizing high-definition, high-naturalness and divers...


Application Information


Patent Type & Authority: Applications (China)

IPC(8): G10L13/02, G10L13/08, G10L19/16, G10L25/30

CPC: G10L13/02, G10L13/08, G10L19/16, G10L25/30

Inventor: 俞凯, 徐志航, 陈博, 张辉

Owner: AISPEECH CO LTD
