
Chinese and English cross-language speech synthesis method and device, electronic equipment and storage medium

A speech synthesis technology applied to the field of Chinese-English cross-language speech synthesis. It addresses problems such as the high cost of recording and labeling mixed-reading audio, which increases the difficulty of cross-language synthesis tasks.

Pending Publication Date: 2022-06-24
HARBIN INST OF TECH SHENZHEN GRADUATE SCHOOL

AI Technical Summary

Problems solved by technology

Professional recording personnel who are proficient in multiple languages are scarce, so few high-quality Chinese-English mixed-reading recordings exist, and recording and labeling mixed-reading audio is expensive. This increases the difficulty of cross-language synthesis tasks.
Fortunately, the open-source release of large amounts of high-quality monolingual speech data makes it possible to build a cross-language speech synthesis system.


Examples


Embodiment 1

[0040] This embodiment illustrates the principles and steps by which the present invention solves the technical problems. Figure 1 shows a flowchart of the Chinese-English cross-language speech synthesis method according to Embodiment 1 of the present invention. The specific steps are:

[0041] S1. Use the sequence-to-sequence task in deep learning to build the first cross-language acoustic model;

[0042] Further, the first cross-language acoustic model is based on the Tacotron model, including: a CBHG-based encoder, a Gaussian mixture distribution-based GMMv2b attention mechanism module, and a decoder.
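As a rough illustration of what a Gaussian-mixture attention step computes, the sketch below follows the general GMM attention recipe (softmax mixture weights, softplus step sizes and widths, means that only move forward). It is not taken from the patent text: the function name `gmm_attention_step`, the argument layout, and the choice to pass the raw parameters in directly (a model would predict them from the decoder state) are all assumptions made for the example.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gmm_attention_step(prev_means, raw_weights, raw_deltas, raw_scales, memory):
    """One decoder step of a GMM-style attention (illustrative only).

    prev_means : mixture means from the previous decoder step
    raw_*      : unnormalised mixture parameters (hypothetical inputs)
    memory     : list of encoder output vectors, one per input position
    """
    weights = softmax(raw_weights)
    deltas = [softplus(d) for d in raw_deltas]          # > 0, so means move forward
    scales = [softplus(s) + 1e-6 for s in raw_scales]   # small floor avoids division by zero
    means = [m + d for m, d in zip(prev_means, deltas)]

    # Alignment weight for each encoder position j.
    align = []
    for j in range(len(memory)):
        a = 0.0
        for w, mu, sigma in zip(weights, means, scales):
            norm = math.sqrt(2.0 * math.pi) * sigma
            a += w / norm * math.exp(-((j - mu) ** 2) / (2.0 * sigma ** 2))
        align.append(a)

    # Context vector: linear weighting of the encoder outputs.
    dim = len(memory[0])
    context = [sum(a * h[d] for a, h in zip(align, memory)) for d in range(dim)]
    return means, align, context


memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
means, align, context = gmm_attention_step([0.0], [0.0], [0.5413], [0.0], memory)
```

Because the step sizes are strictly positive, the attention window can only advance along the input, which is the property usually cited for this family of mechanisms in long-form synthesis.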

[0043] In a specific implementation, as shown in Figure 2, the first cross-language acoustic model CS-Tacotron includes: an encoder, which is based on CBHG, and the language embedding is added to the convolutional network after activation by different linear and nonlinear layers, serving as the gate of the highway net...

Embodiment 2

[0055] This embodiment fine-tunes the constructed CS-Tacotron model, the first cross-language acoustic model. When the cross-language acoustic model is fine-tuned with Chinese monolingual data, the quality of the synthesized Chinese-English cross-language speech deteriorates. To address this catastrophic forgetting problem in Chinese-English cross-language speech synthesis, the present invention introduces a continual learning method that improves the synthesis quality when Chinese monolingual recording data is used to fine-tune the cross-language CS-Tacotron model.

[0056] Further, the first cross-language acoustic model is fine-tuned using a continual learning method based on experience replay. During fine-tuning, a regularization-based elastic weight consolidation method is used to hold the parameters of the model being fine-tuned close to those of the first cross-language acoustic model before fine-tuning. Within a very small margin of error for th...
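The two ingredients named in this paragraph can be sketched in a few lines. The example below assumes that the regularization-based weight stabilization method refers to elastic weight consolidation in its standard quadratic-penalty form, and that experience replay means mixing a share of earlier training examples into each fine-tuning batch. The function names, the `strength` and `replay_ratio` parameters, and the toy data are hypothetical and do not come from the patent.

```python
import random


def ewc_penalty(params, anchor_params, fisher, strength):
    """Quadratic penalty that pulls params toward their pre-fine-tuning values.

    fisher holds one importance estimate per parameter (assumed precomputed).
    """
    return 0.5 * strength * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )


def total_loss(task_loss, params, anchor_params, fisher, strength=1.0):
    # Fine-tuning objective = loss on the new data + consolidation penalty.
    return task_loss + ewc_penalty(params, anchor_params, fisher, strength)


def replay_batch(new_data, replay_buffer, batch_size, replay_ratio, rng):
    """Mix replayed old examples into a batch of new fine-tuning examples."""
    n_replay = min(len(replay_buffer), int(round(batch_size * replay_ratio)))
    n_new = min(len(new_data), batch_size - n_replay)
    batch = rng.sample(replay_buffer, n_replay) + rng.sample(new_data, n_new)
    rng.shuffle(batch)
    return batch


# Toy usage: Chinese monolingual fine-tuning data plus a buffer of earlier examples.
rng = random.Random(0)
new_data = [("zh", i) for i in range(20)]
buffer = [("earlier", i) for i in range(10)]
batch = replay_batch(new_data, buffer, batch_size=8, replay_ratio=0.25, rng=rng)
```

The penalty is zero when the parameters have not moved and grows fastest along the parameters with the largest importance estimates, which is what limits drift away from the cross-language behaviour learned before fine-tuning.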

Embodiment 3

[0058] In this embodiment, to improve the expressiveness of the speech synthesized by the CS-Tacotron model, the modeling of prosodic pauses in speech synthesis is studied. Hierarchical prosody modeling usually mixes the prosodic boundaries of different levels into the input sequence as phonemes, and the model learns the corresponding pause durations on its own from the training data. Here, a hierarchical prosodic graph is instead constructed from the hierarchical prosody of the input cross-language text, and a graph neural network is introduced to model it. This provides an alternative way of modeling prosodic information and yields the second, graph-based cross-language acoustic model, GCS-Tacotron.

[0059] Further, the Chinese prosodic structure is extended to Chinese-English cross-language texts, and the specific method includes: using English words or single letters as prosodic words in the Chinese four-level prosodic struct...
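One way to picture a hierarchical prosodic graph is as a tree whose leaves are input tokens and whose higher levels group them by prosodic boundary. The sketch below is an illustration under assumptions, not the construction defined in the patent: it assumes each token carries the highest boundary level (0 to 4) that ends right after it, treats an English word as a single leaf that forms its own prosodic word, and uses a hypothetical function name `build_prosody_graph`.

```python
def build_prosody_graph(tokens, max_level=4):
    """Build a tree-shaped prosody graph from boundary-annotated tokens.

    tokens : list of (text, boundary) pairs; boundary is the highest prosodic
             level that ends right after the token (0 means no boundary).
    Returns (nodes, edges), where each edge is a (parent_id, child_id) pair.
    """
    nodes = [{"id": i, "level": 0, "text": text} for i, (text, _) in enumerate(tokens)]
    edges = []
    # Units of the level below, each paired with the boundary that closes it.
    current = [(i, boundary) for i, (_, boundary) in enumerate(tokens)]

    for level in range(1, max_level + 1):
        grouped, pending = [], []
        for idx, (node_id, boundary) in enumerate(current):
            pending.append(node_id)
            is_last = idx == len(current) - 1
            if boundary >= level or is_last:
                parent = len(nodes)
                nodes.append({"id": parent, "level": level, "text": None})
                edges.extend((parent, child) for child in pending)
                grouped.append((parent, boundary))
                pending = []
        current = grouped
    return nodes, edges


# Toy mixed-language input: pinyin syllables plus one English word.
tokens = [("wo3", 1), ("xi3", 0), ("huan1", 1), ("Python", 2), ("bian1", 0), ("cheng2", 4)]
nodes, edges = build_prosody_graph(tokens)
```

The resulting node and edge lists are the kind of input a graph neural network layer would consume; how node features are initialised and which edge types are used are left open here.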


Abstract

The invention discloses a Chinese-English cross-language speech synthesis method and apparatus, an electronic device and a storage medium. The method comprises: constructing a first cross-language acoustic model by using a sequence-to-sequence task in deep learning; processing the text data set into basic statements, each comprising a phoneme sequence, a tone sequence and a language sequence; encoding the basic statement into a high-level context semantic representation with a model encoder, while introducing language embedding and speaker embedding at multiple positions of the model encoder; learning a mapping between the high-level context semantic representation and the acoustic feature (the Mel spectrogram) with an attention mechanism, to obtain the linearly weighted high-level context semantic representation; and generating an original spectrogram from the linearly weighted high-level context semantic representation with a model decoder. The method constructs two cross-language acoustic models based on a fusion of multiple strategies, and thereby overcomes the defects of existing speech synthesis methods.
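To make the "phoneme sequence, tone sequence and language sequence" input concrete, here is a minimal sketch of how mixed Chinese-English text could be turned into three aligned sequences. Everything specific in it is an assumption for illustration: the input is taken to be already converted to tone-numbered pinyin and lowercase English words, the tiny `EN_LEXICON` and the initial/final split are toy stand-ins for a real front end, and English phonemes are given tone 0.

```python
# Pinyin initials, two-letter ones first so that "zh" matches before "z".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l", "g",
            "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

# Hypothetical toy English lexicon (ARPAbet-like symbols without stress marks).
EN_LEXICON = {"python": ["P", "AY", "TH", "AA", "N"]}


def split_pinyin(syllable):
    """Split a tone-numbered syllable such as 'yong4' into phonemes and a tone."""
    tone = int(syllable[-1])
    base = syllable[:-1]
    for initial in INITIALS:
        if base.startswith(initial) and len(base) > len(initial):
            return [initial, base[len(initial):]], tone
    return [base], tone  # syllable without an initial


def build_input_sequences(tokens):
    """tokens: list of (language, unit) pairs, language in {'zh', 'en'}."""
    phonemes, tones, languages = [], [], []
    for language, unit in tokens:
        if language == "zh":
            units, tone = split_pinyin(unit)
        else:
            units, tone = EN_LEXICON[unit.lower()], 0  # assumption: no tone for English
        phonemes.extend(units)
        tones.extend([tone] * len(units))
        languages.extend([language] * len(units))
    return phonemes, tones, languages


phonemes, tones, languages = build_input_sequences(
    [("zh", "wo3"), ("zh", "yong4"), ("en", "python")]
)
```

Keeping the three sequences the same length lets each position be embedded separately (phoneme, tone, language) and combined before the encoder.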

Description

Technical field

[0001] The present invention relates to the technical field of speech synthesis, in particular to a method, device, electronic device and storage medium for Chinese-English cross-language speech synthesis.

Background technique

[0002] As mobile phones, tablets, smart homes, and wearable devices all begin to access voice functions, human-computer interaction has gradually entered the voice era. Different from the traditional human-computer interaction, the voice interaction is convenient and intelligent, which can make the machine have the comprehensive ability of listening, speaking, reading and writing like a human. Speech synthesis is the last link of the intelligent speech interaction system, which is responsible for letting the machine speak specific text and the speech audio of a specific speaker. It is divided into two parts: text analysis and acoustic model modeling. Text analysis mainly extracts features from text, and provides text-related informat...


Application Information

IPC(8): G10L13/02, G10L13/08, G10L13/10, G10L25/30
CPC: G10L13/02, G10L13/08, G10L13/10, G10L25/30
Inventor: 汤步洲, 刘超
Owner: HARBIN INST OF TECH SHENZHEN GRADUATE SCHOOL