Many-to-many voice conversion method and system based on speaker style feature modeling

A voice conversion technology based on speaker style modeling, applied in speech synthesis, speech analysis, instruments, etc. It addresses the lack of speaker identity information in one-hot speaker labels, as well as information loss and noise in the converted speech.

Active Publication Date: 2020-10-23

NANJING UNIV OF POSTS & TELECOMM


Problems solved by technology

Because C-VAE rests on the idealized assumption that the observed data follow a Gaussian distribution, the output speech of its decoder is overly smooth and the quality of the converted speech is low.

The voice conversion method based on the Cycle-GAN model uses an adversarial loss and a cycle-consistency loss, and learns the forward mapping and the inverse mapping of the acoustic features at the same time. This effectively alleviates over-smoothing and improves the quality of the converted voice. However, Cycle-GAN can only achieve one-to-one voice conversion.
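The cycle-consistency idea mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the L1 distance, the function names, and the toy "generators" (simple shifts standing in for trained networks) are hypothetical and are not taken from the patent.

```python
from typing import Callable, List

Frame = List[float]
Mapping = Callable[[Frame], Frame]


def l1_distance(a: Frame, b: Frame) -> float:
    """Mean absolute difference between two feature frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(x: Frame, y: Frame, g_xy: Mapping, g_yx: Mapping) -> float:
    """Penalize X -> Y -> X and Y -> X -> Y round trips that do not return to the start."""
    forward_cycle = l1_distance(g_yx(g_xy(x)), x)
    backward_cycle = l1_distance(g_xy(g_yx(y)), y)
    return forward_cycle + backward_cycle


# Hypothetical stand-ins for trained generators (assumption, for illustration only).
def shift_up(frame: Frame) -> Frame:
    return [v + 1.0 for v in frame]


def shift_down(frame: Frame) -> Frame:
    return [v - 1.0 for v in frame]


def collapse(frame: Frame) -> Frame:
    # A mapping that discards all information about its input.
    return [0.0 for _ in frame]


x = [0.5, 1.5, -2.0]   # toy source-speaker features
y = [1.0, 0.0, 3.0]    # toy target-speaker features

invertible_pair = cycle_consistency_loss(x, y, shift_up, shift_down)
lossy_pair = cycle_consistency_loss(x, y, shift_up, collapse)
print(invertible_pair, lossy_pair)
```

A pair of mappings that invert each other incurs no penalty, while an inverse mapping that discards information is penalized. Because the loss is defined for one fixed pair of mappings, it ties the model to a single source-target speaker pair, which is the one-to-one limitation noted above.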

[0005] The voice conversion method based on the Star Generative Adversarial Network (StarGAN) model combines the advantages of C-VAE and Cycle-GAN. Because the output attributes are controlled by the speaker identity label, many-to-many speech conversion under non-parallel text conditions can be realized. Three problems remain, however. First, the speaker identity label is only a one-hot vector: it indicates the target speaker but provides no further speaker identity information, and this lack makes it difficult for the generator to reconstruct converted speech with high personality similarity. Second, in the decoding network of the generator the speaker identity label controls the output attributes only through simple concatenation, which cannot fully fuse semantic features with speaker personality features and results in the loss of deep semantic features and speaker personality features in the spectrum. Third, the encoding network and the decoding network in the generator are independent of each other; this simple structure leaves the generator without the ability to extract deep features, which easily causes information loss and noise.
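The first two problems can be made concrete with a small sketch of one-hot conditioning by concatenation. The helper names, the feature values, and the choice of four speakers are assumptions made for illustration; the patent does not give this code.

```python
from typing import List


def one_hot(speaker_index: int, num_speakers: int) -> List[float]:
    """Return a one-hot speaker identity label."""
    if not 0 <= speaker_index < num_speakers:
        raise ValueError("speaker_index out of range")
    return [1.0 if i == speaker_index else 0.0 for i in range(num_speakers)]


def concat_label(frames: List[List[float]], label: List[float]) -> List[List[float]]:
    """Append the same label vector to every feature frame (simple concatenation)."""
    return [frame + label for frame in frames]


def squared_distance(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


num_speakers = 4  # hypothetical number of training speakers
labels = [one_hot(i, num_speakers) for i in range(num_speakers)]

# Two toy feature frames conditioned on speaker 2.
frames = [[0.2, -0.1, 0.7], [0.0, 0.4, -0.3]]
conditioned = concat_label(frames, labels[2])

# Every pair of distinct one-hot labels is equally far apart, so the label
# says which speaker is meant but nothing about how speakers resemble each other.
pairwise = {
    squared_distance(labels[i], labels[j])
    for i in range(num_speakers)
    for j in range(i + 1, num_speakers)
}
print(conditioned[0], pairwise)
```

The label only widens each frame by a constant block of zeros and a single one, and all speakers are equidistant in label space, which matches the paragraph's point that a one-hot vector acts as an indicator rather than a description of the speaker.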

Embodiment Construction

[0076] The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

[0077] The present invention proposes a many-to-many speech conversion method based on speaker style feature modeling. It adds a multi-layer perceptron and a style encoder to the traditional StarGAN neural network so that the speaker's style features are effectively extracted and constrained. Using the speaker style feature instead of the speaker label feature overcomes the shortcomings of the limited speaker information carried by the ...

Abstract

The invention discloses a many-to-many voice conversion method and system based on speaker style feature modeling. First, a multi-layer perceptron and a style encoder are added to a StarGAN neural network to effectively extract and constrain speaker style features, which overcomes the limited speaker information carried by the one-hot vectors of the traditional model. Second, an adaptive instance normalization method is adopted to fully fuse semantic features with speaker personality features, so that the network can learn more semantic information and speaker personality information. Third, the lightweight network module SKNet is introduced into the residual network of the generator, so that the network can adaptively adjust the size of its receptive field according to the multiple scales of the input information and adjust the weight of each feature channel through an attention mechanism, which strengthens the learning of spectral features and refines their details.
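As a rough illustration of adaptive instance normalization, the sketch below normalizes each feature channel over time and then rescales it with per-channel style parameters. It assumes the scale `gamma` and shift `beta` have already been derived from a speaker style feature; how the patent's networks produce them is not shown here, and all names and values are hypothetical.

```python
import math
from typing import List, Tuple


def channel_stats(channel: List[float]) -> Tuple[float, float]:
    """Return the mean and population standard deviation of one channel."""
    mean = sum(channel) / len(channel)
    var = sum((v - mean) ** 2 for v in channel) / len(channel)
    return mean, math.sqrt(var)


def adain(content: List[List[float]], gamma: List[float], beta: List[float],
          eps: float = 1e-5) -> List[List[float]]:
    """Adaptive instance normalization over a channels-by-time feature map.

    Each channel is normalized to zero mean and unit variance, then scaled
    by gamma and shifted by beta. In this sketch gamma and beta are assumed
    to come from a style feature computed elsewhere.
    """
    out = []
    for channel, g, b in zip(content, gamma, beta):
        mean = sum(channel) / len(channel)
        var = sum((v - mean) ** 2 for v in channel) / len(channel)
        std = math.sqrt(var + eps)
        out.append([g * (v - mean) / std + b for v in channel])
    return out


# Toy content features: 2 channels, 4 time steps.
content = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 20.0, 20.0]]
gamma = [2.0, 0.5]    # hypothetical style-derived scales
beta = [-1.0, 3.0]    # hypothetical style-derived shifts

styled = adain(content, gamma, beta)
stats = [channel_stats(ch) for ch in styled]
print(stats)
```

After the operation, each channel's statistics are set by the style parameters while the relative shape of the content within the channel is kept. Unlike concatenating a label, the style information modulates every feature channel directly.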

Description

Technical field

[0001] The invention relates to the technical field of voice conversion, in particular to a many-to-many voice conversion method based on speaker style feature modeling.

Background technique

[0002] Speech conversion is a research branch in the field of speech signal processing, developed and extended on the basis of speech analysis, speech synthesis and speaker recognition. The goal of speech conversion is to change the voice personality characteristics of the source speaker so that the speech has the personality characteristics of the target speaker while the semantic information stays unchanged; that is, after conversion the voice of the source speaker sounds like the voice of the target speaker.

[0003] After years of research on speech conversion technology, many classic conversion methods have emerged. According to the classification of training corpus, they can be divided into conversion methods under parallel text conditions and conversion methods un...

Application Information

IPC(8): G10L13/02, G10L25/30, G10L25/48, G10L25/18, G10L17/18

CPC: G10L13/02, G10L25/30, G10L25/48, G10L25/18, G10L17/18

Inventor: 李燕萍, 张成飞

Owner: NANJING UNIV OF POSTS & TELECOMM
