Chinese and English cross-language speech synthesis method and device, electronic equipment and storage medium

A speech synthesis and cross-language technology, applied in the field of Chinese-English cross-language speech synthesis, that addresses problems such as the high cost of recording and labeling mixed-reading audio, which increases the difficulty of cross-language synthesis tasks.

Pending Publication Date: 2022-06-24

HARBIN INST OF TECH SHENZHEN GRADUATE SCHOOL

Problems solved by technology

Professional voice talent proficient in multiple languages is scarce, so few high-quality Chinese-English mixed-reading recordings exist. Recording and labeling mixed-reading audio is also costly, which increases the difficulty of cross-language synthesis tasks.

Fortunately, the open-source release of a large amount of high-quality monolingual speech data makes it possible to build a cross-language speech synthesis system.

Examples

Embodiment 1

[0040] This embodiment illustrates the principles and steps by which the present invention solves the technical problem. Figure 1 shows a flowchart of the Chinese-English cross-language speech synthesis method according to Embodiment 1 of the present invention. The specific steps are:

[0041] S1. Build the first cross-language acoustic model as a deep-learning sequence-to-sequence task;

[0042] Further, the first cross-language acoustic model is based on the Tacotron model and includes a CBHG-based encoder, a GMMv2b attention mechanism module based on a Gaussian mixture distribution, and a decoder.
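As an illustration of the Gaussian-mixture attention named above, the following is a minimal sketch of a single decoder step. It assumes the commonly published "V2" parameterization (softmax mixture weights, softplus step sizes and widths, monotonically advancing means), with the "b" variant read as adding initial biases to the step and width parameters. The function name `gmm_attention_step`, its argument names, and the default bias values are hypothetical; the patent excerpt does not give these details.

```python
import math


def softplus(x):
    # Numerically stable softplus: log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softmax(xs):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gmm_attention_step(w_hat, delta_hat, sigma_hat, prev_mu, num_encoder_steps,
                       delta_bias=0.0, sigma_bias=0.0):
    """One decoder step of GMM attention (assumed V2-style with biases).

    w_hat, delta_hat, sigma_hat: raw per-component outputs of a (hypothetical)
    network driven by the decoder state. prev_mu: component means from the
    previous decoder step. Returns (alignment over encoder steps, new means).
    """
    weights = softmax(w_hat)                                  # mixture weights
    deltas = [softplus(d + delta_bias) for d in delta_hat]    # positive steps
    sigmas = [softplus(s + sigma_bias) for s in sigma_hat]    # positive widths
    mus = [m + d for m, d in zip(prev_mu, deltas)]            # means only move forward

    alignment = []
    for j in range(num_encoder_steps):
        a = 0.0
        for w, mu, sigma in zip(weights, mus, sigmas):
            z = math.sqrt(2.0 * math.pi) * sigma              # Gaussian normalizer
            a += w / z * math.exp(-((j - mu) ** 2) / (2.0 * sigma ** 2))
        alignment.append(a)
    return alignment, mus


# Usage: two mixture components attending over 20 encoder steps.
alignment, mus = gmm_attention_step(
    w_hat=[0.0, 0.0], delta_hat=[0.0, 0.0], sigma_hat=[0.0, 0.0],
    prev_mu=[2.0, 5.0], num_encoder_steps=20)
```

Because every step size is positive, the means can only advance along the encoder sequence, which is the property that makes this family of attention mechanisms attractive for speech synthesis.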

[0043] In a specific implementation, as shown in Figure 2, the first cross-language acoustic model, CS-Tacotron, includes: a CBHG-based encoder, in which the language embedding, after activation by different linear and nonlinear layers, is added to the convolutional network and serves as the gate of the highway net...
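The gating idea in this paragraph can be sketched with a standard highway layer, y = t * H(x) + (1 - t) * x, reading the gate described above as a highway-network gate. Since the source text is truncated, how the language embedding actually enters the gate is not known; the sketch below assumes, purely for illustration, that the gate t is computed from the language embedding alone. The names `language_gated_highway` and `params`, and the parameter layout, are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights, bias, x):
    # weights is a list of rows; returns W @ x + b as a plain list.
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def language_gated_highway(x, lang_embedding, params):
    """Highway layer whose gate is driven by a language embedding.

    ASSUMPTION: the gate depends only on the language embedding. The patent
    excerpt is cut off before describing the actual construction.
    """
    # Candidate transform H(x): linear layer followed by ReLU.
    h = [max(0.0, v) for v in linear(params["w_h"], params["b_h"], x)]
    # Gate t in (0, 1), one value per feature, from the language embedding.
    t = [sigmoid(g) for g in linear(params["w_g"], params["b_g"], lang_embedding)]
    # Blend the transformed features with the untouched input.
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]


# Usage with toy two-dimensional features and a two-dimensional embedding.
params = {
    "w_h": [[1.0, 0.0], [0.0, 1.0]], "b_h": [0.0, 0.0],
    "w_g": [[0.5, -0.5], [-0.5, 0.5]], "b_g": [0.0, 0.0],
}
y = language_gated_highway([1.0, -2.0], [1.0, 0.0], params)
```

With this layout, each language can learn how much of the transformed signal to let through per feature, while the carry path keeps the input available unchanged.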

Embodiment 2

[0055] This embodiment fine-tunes the constructed CS-Tacotron model, the first cross-language acoustic model. Fine-tuning the cross-language acoustic model on Chinese monolingual data degrades the synthesized Chinese-English cross-language speech. To address this catastrophic forgetting problem in Chinese-English cross-language speech synthesis, the present invention introduces a continual learning method that improves the synthesis quality when the cross-language CS-Tacotron model is fine-tuned on Chinese monolingual recording data.

[0056] Further, the first cross-language acoustic model is fine-tuned using a continual learning method based on experience replay. During fine-tuning, a regularization-based elastic weight consolidation method is used to hold the parameters of the model being fine-tuned close to those of the first cross-language acoustic model before fine-tuning. Within a very small margin of error for th...
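The two ingredients named in this paragraph can be sketched as follows, assuming the regularization-based weight constraint corresponds to the standard elastic weight consolidation (EWC) penalty, (strength / 2) * sum_i F_i * (theta_i - theta*_i)^2, and that experience replay means mixing a fraction of earlier cross-language samples into each fine-tuning batch. The function names, the flat parameter lists, the per-parameter importance values `fisher`, and the `replay_ratio` argument are hypothetical illustrations, not details from the patent.

```python
import random


def ewc_penalty(params, anchor_params, fisher, strength):
    """Quadratic penalty pulling params toward their pre-fine-tuning values.

    fisher holds one importance weight per parameter (for EWC, a diagonal
    Fisher information estimate). All inputs are flat lists of floats.
    """
    return 0.5 * strength * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher))


def total_loss(task_loss, params, anchor_params, fisher, strength):
    # Fine-tuning objective: new-task loss plus the consolidation penalty.
    return task_loss + ewc_penalty(params, anchor_params, fisher, strength)


def replay_batch(new_data, replay_buffer, batch_size, replay_ratio, rng):
    """Build a batch that mixes new samples with replayed earlier samples."""
    n_replay = min(int(round(batch_size * replay_ratio)), len(replay_buffer))
    n_new = min(batch_size - n_replay, len(new_data))
    return rng.sample(new_data, n_new) + rng.sample(replay_buffer, n_replay)


# Usage: Chinese monolingual samples plus a small buffer of earlier data.
rng = random.Random(0)
chinese_only = [("zh", i) for i in range(10)]
earlier_mixed = [("mix", i) for i in range(10)]
batch = replay_batch(chinese_only, earlier_mixed, batch_size=8,
                     replay_ratio=0.25, rng=rng)
loss = total_loss(task_loss=1.5, params=[1.0, 2.0], anchor_params=[1.0, 1.0],
                  fisher=[3.0, 4.0], strength=2.0)
```

Parameters with a large importance weight are held almost fixed, while unimportant ones remain free to adapt to the new speaker data.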

Embodiment 3

[0058] In this embodiment, to improve the expressiveness of speech synthesized by the CS-Tacotron model, the effect of modeling prosodic pauses in speech synthesis is studied. Hierarchical prosody modeling usually mixes the prosodic boundaries of different levels into the input sequence as phonemes, and the model learns the corresponding pause durations on its own from the training data. As an alternative way of modeling prosodic information, this embodiment constructs a hierarchical prosodic graph from the hierarchical prosody of the input cross-language text and models it with a graph neural network, yielding the second, graph-based cross-language acoustic model, GCS-Tacotron.

[0059] Further, the Chinese prosodic structure is extended to Chinese-English cross-language texts, and the specific method includes: using English words or single letters as prosodic words in the Chinese four-level prosodic struct...
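One way to picture a hierarchical prosodic graph is as a tree built bottom-up from boundary labels. The sketch below is an assumption-laden illustration: each token (a Chinese syllable or a whole English word) carries the level of the prosodic boundary that follows it, with 0 meaning no boundary and 1 to 4 standing for the four prosodic levels. The function name `build_prosodic_graph`, the node dictionary layout, and the boundary encoding are hypothetical; the patent's actual graph construction is not given in this excerpt.

```python
def build_prosodic_graph(tokens, boundaries, num_levels=4):
    """Build a hierarchical prosody tree as (nodes, edges).

    tokens: list of strings. boundaries[i]: level of the boundary after
    tokens[i] (0 = none). Edges are (parent_id, child_id) pairs.
    """
    if len(tokens) != len(boundaries):
        raise ValueError("tokens and boundaries must have the same length")

    # Level-0 nodes are the tokens themselves; "end" is the last token covered.
    nodes = [{"level": 0, "label": tok, "end": i} for i, tok in enumerate(tokens)]
    edges = []
    current = list(range(len(nodes)))          # node ids of the level below

    for level in range(1, num_levels + 1):
        next_level, group = [], []
        for node_id in current:
            group.append(node_id)
            end = nodes[node_id]["end"]
            is_last = node_id == current[-1]
            # Close the group at a boundary of this level or higher,
            # or at the end of the sentence.
            if boundaries[end] >= level or is_last:
                parent_id = len(nodes)
                nodes.append({"level": level, "label": "L%d" % level, "end": end})
                edges.extend((parent_id, child) for child in group)
                next_level.append(parent_id)
                group = []
        current = next_level
    return nodes, edges


# Usage: the English word "Python" forms its own level-1 unit.
tokens = ["我", "喜", "欢", "Python", "编", "程"]
boundaries = [1, 0, 1, 2, 0, 4]
nodes, edges = build_prosodic_graph(tokens, boundaries)
```

The resulting node and edge lists are in the form that a graph neural network library would typically consume after conversion to index tensors.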

Abstract

The invention discloses a Chinese-English cross-language speech synthesis method and apparatus, an electronic device, and a storage medium. The method comprises: constructing a first cross-language acoustic model as a deep-learning sequence-to-sequence task; processing the text data set into basic statements, each comprising a phoneme sequence, a tone sequence, and a language sequence; encoding each basic statement into a high-level contextual semantic representation with the model encoder, while introducing language embedding and speaker embedding at multiple positions in the encoder; using an attention mechanism to learn the mapping between the high-level contextual semantic representation and the acoustic feature (the Mel spectrogram), obtaining a linearly weighted high-level contextual semantic representation; and generating the original spectrogram from the linearly weighted representation with the model decoder. Two cross-language acoustic models are constructed by fusing multiple strategies, so that the method overcomes the shortcomings of existing speech synthesis methods.
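The text-processing step in the abstract, which turns text into aligned phoneme, tone, and language sequences, might look like the sketch below. Everything concrete here is an assumption for illustration: the toy lexicon `TOY_LEXICON`, the phoneme symbols, the use of tone 0 for English, and the integer language IDs are hypothetical and not taken from the patent.

```python
# Hypothetical toy lexicon: token -> (language, phonemes, tone).
# Mandarin syllables carry tones 1-5; English words use 0 (assumed "no tone").
TOY_LEXICON = {
    "你": ("zh", ["n", "i"], 3),
    "好": ("zh", ["h", "ao"], 3),
    "hello": ("en", ["HH", "AH", "L", "OW"], 0),
}

# Hypothetical integer IDs that a language-embedding table would index.
LANGUAGE_IDS = {"zh": 0, "en": 1}


def to_model_inputs(tokens):
    """Expand tokens into three equal-length, phoneme-aligned sequences."""
    phonemes, tones, languages = [], [], []
    for token in tokens:
        lang, phones, tone = TOY_LEXICON[token.lower()]
        for phone in phones:
            phonemes.append(phone)
            tones.append(tone)                    # tone repeated per phoneme
            languages.append(LANGUAGE_IDS[lang])  # language repeated per phoneme
    return phonemes, tones, languages


# Usage: a mixed Chinese-English input.
phonemes, tones, languages = to_model_inputs(["你", "好", "Hello"])
```

Keeping the three sequences the same length lets an encoder look up and combine a phoneme embedding, a tone embedding, and a language embedding position by position.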

Description

Technical field

[0001] The present invention relates to the technical field of speech synthesis, in particular to a method, device, electronic device and storage medium for Chinese-English cross-language speech synthesis.

Background

[0002] As mobile phones, tablets, smart homes, and wearable devices all begin to offer voice functions, human-computer interaction has gradually entered the voice era. Unlike traditional human-computer interaction, voice interaction is convenient and intelligent, and it can give the machine a comprehensive, human-like ability to listen, speak, read and write. Speech synthesis is the last link of an intelligent speech interaction system and is responsible for having the machine speak specific text in the voice of a specific speaker. It is divided into two parts: text analysis and acoustic modeling. Text analysis mainly extracts features from text, and provides text-related informat...

Application Information

IPC(8): G10L13/02, G10L13/08, G10L13/10, G10L25/30

CPC: G10L13/02, G10L13/08, G10L13/10, G10L25/30

Inventors: 汤步洲, 刘超

Owner HARBIN INST OF TECH SHENZHEN GRADUATE SCHOOL
