Voice conversion system, method and application

A voice conversion technology, applied in speech analysis, speech recognition, instruments, etc. It addresses problems such as inflexibility, sharp increases in computation, and unsuitability for limited computing resources and equipment, with the effects of improving flexibility, shortening training time, and alleviating inaccurate pronunciation.

Active Publication Date: 2020-12-01
NANJING SILICON INTELLIGENCE TECH CO LTD
6 Cites · 18 Cited by

AI Technical Summary

Problems solved by technology

However, this type of algorithm still has defects. For example, the classic Gaussian mixture model method for voice conversion is mostly based on one-to-one conversion tasks: the source speaker and the target speaker must record the same training sentence content, and the spectral features must be aligned frame by frame with Dynamic Time Warping (DTW) before the mapping relationship between spectral features can be obtained through model training. Such a voice conversion method is not flexible enough in practical applications. In addition, when the Gaussian mixture model is used to train the mapping function with global variables taken into account, iterating over the training data causes the amount of computation to rise sharply, and the Gaussian mixture model only achieves a good conversion effect when the training data is sufficient, which makes it unsuitable for limited computing resources and equipment.
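As an illustration of the frame-by-frame alignment step described above, the following is a minimal Python sketch (not taken from the patent) of DTW alignment between the MFCC sequences of a source and a target utterance; the feature dimensions and sequence lengths are arbitrary stand-ins.

```python
# Minimal sketch of frame-level DTW alignment between two MFCC sequences,
# as used by classical GMM-based conversion on parallel corpora.
# The MFCC matrices are random stand-ins; shapes are (frames, coefficients).
import numpy as np

def dtw_align(source, target):
    """Return aligned frame-index pairs minimizing cumulative Euclidean distance."""
    n, m = len(source), len(target)
    # Pairwise frame-to-frame distances
    dist = np.linalg.norm(source[:, None, :] - target[None, :, :], axis=-1)
    # Cumulative cost matrix with an infinite border
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(cost[i - 1, j],      # insertion
                                                  cost[i, j - 1],      # deletion
                                                  cost[i - 1, j - 1])  # match
    # Backtrack the optimal warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

src_mfcc = np.random.randn(120, 13)    # source speaker, 120 frames of 13-dim MFCC
tgt_mfcc = np.random.randn(150, 13)    # target speaker, same sentence content
pairs = dtw_align(src_mfcc, tgt_mfcc)  # aligned (source_frame, target_frame) pairs
```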

Method used



Examples


Embodiment 1

[0046] A voice conversion system,

[0047] comprising:

[0048] (1) A speaker-independent speech recognition (SI-ASR) model adopting a five-layer DNN structure, in which the fourth layer is a Bottleneck layer; it transforms the Mel-frequency cepstral coefficient (MFCC) features of the source speech into source-speech bottleneck features (Bottleneck Features);

[0049] An ASR model converts speech into text: the model outputs the probability of each word corresponding to the audio, and the PPG is the carrier of this probability. The PPG-based method uses the PPG as the output of the SI-ASR model.

[0050] PPG stands for Phonetic PosteriorGram, a matrix that maps each time frame of the audio to the posterior probabilities of the phoneme classes. To a certain extent, the PPG can represent the rhythm and prosody information of the speech content while removing features related to the speaker's timbre, so it is speaker-independent. The PPG is defined as follows:

[0051] P_t = ( p(s | X_t) ), s = 1, …, C, i.e. the vector of posterior probabilities of every phoneme class s given the acoustic feature X_t of frame t; the PPG is the matrix obtained by stacking P_t over all frames t = 1, …, T.
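As a rough illustration of the SI-ASR model described in this embodiment, the sketch below implements a five-layer DNN whose fourth layer is a narrow Bottleneck layer; its activations serve as the Bottleneck features, and a softmax over the final layer yields the per-frame phoneme posteriors (the PPG). The layer widths, MFCC dimension and number of phoneme classes are illustrative assumptions, not values given in the patent.

```python
# Illustrative 5-layer DNN for the SI-ASR model: the 4th (Bottleneck) layer is
# deliberately narrow, its activations are the Bottleneck (BN) features, and a
# softmax over the 5th layer gives the per-frame phoneme posteriors (PPG).
# Layer widths, MFCC dimension and phoneme-class count are assumptions.
import torch
import torch.nn as nn

class SIASRBottleneck(nn.Module):
    def __init__(self, mfcc_dim=39, hidden=1024, bottleneck=256, num_phones=218):
        super().__init__()
        self.layer1 = nn.Sequential(nn.Linear(mfcc_dim, hidden), nn.ReLU())
        self.layer2 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.layer3 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.layer4 = nn.Linear(hidden, bottleneck)      # Bottleneck layer
        self.layer5 = nn.Linear(bottleneck, num_phones)  # phoneme scores

    def forward(self, mfcc_frames):
        h = self.layer3(self.layer2(self.layer1(mfcc_frames)))
        bn = self.layer4(h)          # Bottleneck features (speaker-independent)
        logits = self.layer5(bn)     # unnormalized phoneme scores
        return bn, logits

model = SIASRBottleneck()
mfcc = torch.randn(100, 39)          # 100 frames of 39-dim MFCC (illustrative)
bn_feats, logits = model(mfcc)       # bn_feats: (100, 256)
ppg = torch.softmax(logits, dim=-1)  # PPG: per-frame phoneme posteriors, (100, 218)
```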

Embodiment 2

[0065] This embodiment introduces a training method for the voice conversion system, including the following three parts A1-A3:

[0066] A1, the SI-ASR model (speaker-independent speech recognition model) training phase. This phase trains the SI-ASR model that is used in the training phase of the Attention voice-changing network and in the extraction of Bottleneck features (BN features) during the voice conversion phase. The model is trained on the training corpora of many speakers; after training it can be used for any source speaker, that is, it is speaker-independent (Speaker-Independent, SI), hence the name SI-ASR model. Once trained, it can be reused directly without retraining.
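A minimal sketch of this training phase follows, assuming frame-level cross-entropy training against phoneme labels over a multi-speaker corpus; the optimizer, learning rate and epoch count are assumptions, and `SIASRBottleneck` refers to the illustrative model sketched in Embodiment 1.

```python
# Minimal sketch of the A1 phase: the SI-ASR model is fitted frame-wise with
# cross-entropy against phoneme labels drawn from a multi-speaker corpus, then
# frozen and reused. Data loading, optimizer and epoch count are assumptions.
import torch
import torch.nn as nn

def train_si_asr(model, loader, epochs=10):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for mfcc_frames, phone_labels in loader:    # batches mix many speakers
            _, logits = model(mfcc_frames)
            loss = criterion(logits, phone_labels)  # frame-level phoneme classification
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    model.eval()   # frozen afterwards; reusable for any source speaker
    return model
```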

[0067] The SI-ASR model training phase consists of the following steps (see attached Figure 1):

[0068] B1. Preprocessing the multi-speaker training corpus...

Embodiment 3

[0096] Embodiment 3, a voice conversion method.

[0097] Voice conversion is performed on the input source speech, transforming it into a target speech signal for output; that is, the output speech conforms to the voice characteristics of the target speaker while its content is the same as that of the source speech.

[0098] The voice conversion phase consists of the following steps (see attached Figure 4; a code sketch follows the steps):

[0099] E1. Perform parameter extraction on the source speech to be converted to obtain its MFCC features;

[0100] E2, use the SI-ASR model trained in B3 to transform the MFCC feature into a BN feature;

[0101] E3. Use the Attention voice-changing network trained in C5 to transform the BN feature into an acoustic feature (mel spectrum);

[0102] E4. Use the neural network vocoder trained in D4 to convert the acoustic features (mel spectrum) into speech output.
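A minimal end-to-end sketch of steps E1-E4 is given below; the objects `si_asr`, `attention_converter` and `vocoder` are hypothetical placeholders for the trained SI-ASR model, the Attention voice-changing network and the neural vocoder, and the MFCC settings are assumptions.

```python
# Minimal sketch of the conversion phase E1-E4. The three model objects are
# hypothetical placeholders for the trained components; MFCC settings are assumed.
import librosa
import torch

def convert_voice(source_wav, si_asr, attention_converter, vocoder, sr=16000):
    # E1: parameter extraction - MFCC features of the source speech
    y, _ = librosa.load(source_wav, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=39).T   # (frames, 39)

    with torch.no_grad():
        # E2: SI-ASR model maps MFCC to speaker-independent Bottleneck (BN) features
        bn_feats, _ = si_asr(torch.from_numpy(mfcc).float())
        # E3: Attention voice-changing network maps BN features to the target mel spectrum
        mel = attention_converter(bn_feats)
        # E4: neural vocoder renders the mel spectrum as the target speaker's waveform
        waveform = vocoder(mel)
    return waveform.numpy()
```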

[0103] In this way, the trained speaker-independent speech recognition model can be used for any source speaker...



Abstract

The invention provides a voice conversion scheme trained on non-parallel corpora, removing the dependence on parallel texts and solving the technical problem that voice conversion is difficult to realize when resources and equipment are limited. The invention relates to a voice conversion system, method and application terminal. Compared with the prior art, the scheme has the advantages that: a trained speaker-independent speech recognition model can be used for any source speaker, i.e. it is speaker-independent; the bottleneck features of the audio are more abstract than Phonetic PosteriorGram (PPG) features, can reflect the speech content while being decoupled from the speaker's timbre, and are not tightly bound to phoneme categories in a clear one-to-one correspondence, so the problem of inaccurate pronunciation caused by ASR recognition errors is alleviated to a certain extent. The pronunciation accuracy of audio obtained by voice conversion with bottleneck features is clearly higher than that of the Phonetic PosteriorGram method, with no obvious difference in timbre; and by means of transfer learning, the dependence on training corpora can be greatly reduced.

Description

technical field

[0001] The present invention relates to the field of speech computation algorithms, in particular to a voice conversion system, method and application terminal.

Background technique

[0002] With the continuous development of computer technology and the deepening of the field of artificial intelligence, voice robots aimed at voice interaction have gradually entered the public eye. The emergence of voice robots has changed the nature of existing telephone services. At present, voice robots are used in real estate, education, finance, tourism and other industries to perform voice interaction functions, thereby replacing manual voice interaction with users.

[0003] In order to optimize customer experience, using voice conversion technology to change the voice characteristics of voice robots is one of the important directions for improvement.

[0004] Voice conversion technology is a research branch of speech signal processing. It cov...

Claims


Application Information

Patent Type & Authority: Application (China)
IPC(8): G10L15/06; G10L15/02; G10L19/16; G10L25/24; G10L25/30
CPC: G10L15/063; G10L15/02; G10L19/173; G10L25/24; G10L25/30; G10L2015/025; G10L21/003; G10L2021/0135; G10L15/16
Inventor: 司马华鹏, 毛志强, 龚雪飞
Owner: NANJING SILICON INTELLIGENCE TECH CO LTD