Speech synthesis method, device and equipment and storage medium

A speech synthesis technology, applied in speech synthesis, speech analysis, instruments, etc., that addresses the low quality of speech synthesized from cross-lingual text and yields high-quality synthesized speech in a consistent speaking style

Pending Publication Date: 2021-04-30
IFLYTEK CO LTD

AI Technical Summary

Problems solved by technology

People want cross-lingual sentences to be synthesized in a consistent, natural voice, but most current end-to-end models assume that the input is in a single language and use only the original text as input to the synthesis model.
[0003] The inventors found that pronunciation phenomena differ between languages: Chinese tones, Japanese pitch accents, Russian stress and the like are not expressed in the text. Because existing single-language synthesis models use only the original text as model input, the quality of the speech they synthesize for cross-lingual sentences is low.

Examples

Detailed description

[0062] Next, with reference to Figure 1, the speech synthesis method of the present application may comprise the following steps:

[0063] Step S100: Obtain the original text, the phoneme sequence corresponding to the original text, and the speaker characteristics of the speech to be synthesized.

[0064] Specifically, before speech synthesis, it is necessary to obtain the original text to be subjected to speech synthesis. The original text may be text information in a single language, or may be text information in multiple languages, for example, the original text may be text information including two or more languages at the same time.

[0065] Further, different languages have different pronunciation characteristics, and some of these cannot be shown in written text: for example, Chinese tones, Japanese pitch-accent nuclei and Russian stress do not appear in the surface text, but can be displaye...
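
As a minimal sketch of the inputs named in step S100, the following Python snippet builds a phoneme sequence for a mixed Chinese-English text from a tiny hand-written lexicon. The lexicon, the `SynthesisInput` container, the `to_phonemes` and `build_input` helpers, the tone-digit notation and the speaker vector are all illustrative assumptions and do not come from the application; a real system would use a proper grapheme-to-phoneme front end.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical toy lexicon: token -> phonemes. The tone digits on the
# Mandarin entries carry information that the written characters do not show.
TOY_LEXICON: Dict[str, List[str]] = {
    "你": ["n", "i3"],
    "好": ["h", "ao3"],
    "hello": ["HH", "AH0", "L", "OW1"],
}


@dataclass
class SynthesisInput:
    """The three inputs of step S100 (field names are assumptions)."""
    text_tokens: List[str]
    phonemes: List[str]
    speaker_features: List[float]


def to_phonemes(tokens: List[str]) -> List[str]:
    """Look up each token in the toy lexicon and flatten the result."""
    phonemes: List[str] = []
    for token in tokens:
        if token not in TOY_LEXICON:
            raise KeyError(f"token not in toy lexicon: {token!r}")
        phonemes.extend(TOY_LEXICON[token])
    return phonemes


def build_input(tokens: List[str], speaker_features: List[float]) -> SynthesisInput:
    """Bundle the text, its phoneme sequence and the speaker features."""
    return SynthesisInput(list(tokens), to_phonemes(tokens), list(speaker_features))
```

For instance, `build_input(["你", "好", "hello"], [0.1, 0.2])` yields a phoneme sequence in which the third tone of each Chinese syllable is explicit, even though the characters themselves do not show it.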

Embodiment approach

[0092] In an optional implementation manner, the specific implementation process of the above step S120 may include the following steps:

[0093] S1. Perform encoding processing on the fusion feature to obtain an encoded feature.

[0094] Specifically, the fusion feature can be encoded by the text encoder to obtain the encoded feature output by the text encoder.
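
To make the shape of step S1 concrete, here is a deliberately simplified stand-in for a text encoder, written in plain Python. It assumes the fusion feature is a sequence of fixed-length vectors and applies one linear projection followed by `tanh` to each vector. The function names, the weights and the single-layer design are assumptions for illustration only; the application does not specify the encoder architecture in the text shown here, and practical encoders are considerably richer.

```python
import math
from typing import List

Vector = List[float]


def encode_frame(frame: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """Project one fused vector with a linear layer followed by tanh."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, frame)) + b)
        for row, b in zip(weights, bias)
    ]


def text_encoder(fusion: List[Vector], weights: List[Vector], bias: Vector) -> List[Vector]:
    """Encode every position of the fusion feature independently.

    weights has one row per output dimension; each row has one entry
    per input dimension of the fused vectors.
    """
    in_dim = len(weights[0])
    if len(weights) != len(bias):
        raise ValueError("weights and bias disagree on the output size")
    if any(len(frame) != in_dim for frame in fusion):
        raise ValueError("fused vector length does not match the weights")
    return [encode_frame(frame, weights, bias) for frame in fusion]
```

The output has one encoded vector per input position, so the sequence length is preserved while the feature dimension changes from the input size to the number of weight rows.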

[0095] Further, existing end-to-end speech synthesis models all assume that the input is in a single language. As a result, when different languages are mixed in the input text, they often synthesize wrong speech or even skip words outright. At the same time, since it is difficult to obtain speech from the same speaker in different languages, the model may erroneously learn a correlation between speaker characteristics and language, which causes the speaker to switch within the synthesized speech. To prevent this, this embodiment provides a metho...

Abstract

The invention discloses a speech synthesis method, apparatus, device and storage medium. The method comprises: obtaining an original text, a phoneme sequence corresponding to the original text, and speaker features of the speech to be synthesized; fusing the features of the original text and the phoneme sequence to obtain a fusion feature; performing encoding and decoding based on the fusion feature and the speaker features to obtain an acoustic spectrum; and performing speech synthesis based on the acoustic spectrum to obtain synthesized speech. Because the fusion feature combines the original text with the phoneme sequence, the input information is enriched and pronunciation information specific to each language can be exploited: for example, Chinese tones, Japanese pitch-accent nuclei and Russian stress can be represented in the phoneme sequence. The speech synthesized from the resulting acoustic spectrum is therefore more natural, matches the pronunciation characteristics of the corresponding language, and is of higher quality.
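
The data flow in the abstract can be sketched as follows. This is one possible reading, not the method of the application: it assumes that fusion means concatenating, for each phoneme, the embedding of that phoneme with the embedding of the text token it came from, and that an alignment list says which token each phoneme belongs to. The names `fuse` and `synthesize`, the embedding tables and the alignment format are hypothetical, and the acoustic model and vocoder are passed in as plain callables rather than implemented.

```python
from typing import Callable, Dict, List

Vector = List[float]


def fuse(
    text_tokens: List[str],
    phonemes: List[str],
    alignment: List[int],
    text_emb: Dict[str, Vector],
    phone_emb: Dict[str, Vector],
) -> List[Vector]:
    """Concatenate each phoneme embedding with its source token embedding.

    alignment[i] is the index in text_tokens that phoneme i belongs to
    (an assumed convention, not taken from the application).
    """
    if len(alignment) != len(phonemes):
        raise ValueError("alignment must have one entry per phoneme")
    return [
        text_emb[text_tokens[alignment[i]]] + phone_emb[phoneme]
        for i, phoneme in enumerate(phonemes)
    ]


def synthesize(
    fusion: List[Vector],
    speaker_features: Vector,
    acoustic_model: Callable[[List[Vector], Vector], List[Vector]],
    vocoder: Callable[[List[Vector]], List[float]],
) -> List[float]:
    """Fusion feature + speaker features -> acoustic spectrum -> waveform."""
    spectrum = acoustic_model(fusion, speaker_features)
    return vocoder(spectrum)
```

Passing the acoustic model and vocoder as arguments keeps the sketch independent of any particular network; only the order of the stages follows the abstract.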

Description

Technical field

[0001] The present application relates to the technical field of speech signal processing, and more particularly, to a speech synthesis method, apparatus, device and storage medium.

Background

[0002] In recent years, end-to-end speech synthesis systems have achieved good results and can generate synthetic speech close to human speech in real time. With the development of globalization, in important application scenarios of speech synthesis such as social media, informal information, and voice navigation, the mixing of different languages in text or speech is becoming more and more common. One wants to synthesize these cross-lingual sentences with a consistent and natural voice, but most current end-to-end models assume that the input is monolingual and use only raw text as the input to the synthesis model.

[0003] The inventor of the present case found that different languages have different pronunciation phenomena, su...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G10L13/02, G10L17/02, G10L17/04, G10L19/00, G10L19/02
CPC: G10L13/02, G10L17/02, G10L17/04, G10L19/0018, G10L19/02
Inventor: 陈梦楠, 江源, 高丽, 祖漪清
Owner: IFLYTEK CO LTD