Multi-speaker speech separation method based on convolutional neural network and deep clustering
A convolutional neural network and speech separation technology, applied to biological neural network models, speech analysis, neural architectures, and the like. It addresses the problems that existing system models cannot separate three-speaker mixed signals, that existing separation models cannot separate individual speakers' speech, and that they cannot be extended to additional speakers, achieving the effects of fewer parameters and improved separation performance.
Examples
Embodiment 1
[0042] An embodiment of the present invention provides a multi-speaker speech separation method based on a convolutional neural network and deep clustering. Its implementation comprises two stages: training the separation network model, and using the trained network to separate single-speaker speech. Figure 1 is a flow chart of the method. The same flow is followed both when training the separation network model and when using it to separate single-speaker speech; the only difference is that during training the network parameters are continuously updated according to the model, whereas when the speech separation system runs to separate single-speaker speech from the mixture, the parameters remain fixed. In addition, when training the network, the flow of Figure 1 executes: mixed speech features -- gated dilated convolutional...
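The paragraph breaks off after "gated dilated convolutional", so the patent's exact layer is not recoverable here. As an illustration only, the following is a minimal sketch of a WaveNet-style gated dilated 1-D convolution block in PyTorch; the module name, channel count, kernel size, and tanh/sigmoid gating are assumptions, not the patent's specification.

```python
# Hypothetical gated dilated convolution block (WaveNet-style gating);
# all hyperparameters below are illustrative, not taken from the patent.
import torch
import torch.nn as nn

class GatedDilatedConv(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation  # "same" padding in time
        self.filter_conv = nn.Conv1d(channels, channels, kernel_size,
                                     padding=pad, dilation=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, kernel_size,
                                   padding=pad, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames) spectral features of the mixture
        return torch.tanh(self.filter_conv(x)) * torch.sigmoid(self.gate_conv(x))

# Stacking blocks with exponentially growing dilation enlarges the receptive
# field while adding few parameters, consistent with the "fewer parameters"
# effect claimed in the summary.
stack = nn.Sequential(*[GatedDilatedConv(64, dilation=2 ** i) for i in range(4)])
```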
Embodiment 2
[0065] Specific description of the speech separation problem
[0066] The goal of mono speech separation is to estimate the individual source signals that are mixed together and overlap in a mono signal. The $S$ source signal sequences in the time domain are denoted $x_s(t)$, $s = 1, \dots, S$, and the mixed signal sequence in the time domain is expressed as:

[0067] $y(t) = \sum_{s=1}^{S} x_s(t)$
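As a minimal numeric illustration of this mixing model (the two sources are synthetic stand-ins, not signals from the patent):

```python
import numpy as np

fs = 8000                                # 8 kHz sampling rate
t = np.arange(fs) / fs                   # one second of samples

# Two stand-in source signals x_s(t), s = 1, 2.
x1 = 0.5 * np.sin(2 * np.pi * 220 * t)
x2 = 0.5 * np.sin(2 * np.pi * 330 * t)

# The observed mono mixture y(t) is the sample-wise sum over the S sources.
y = x1 + x2
```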
[0068] Framing, windowing, and a short-time Fourier transform are applied to the speech signal to obtain its spectrum. Specifically, 32 ms of sampling points form one frame: at an 8 kHz sampling rate one frame is 256 samples, and at a 16 kHz sampling rate one frame is 512 samples; if a segment is shorter than 32 ms, its samples are first zero-padded to 256 or 512. Each frame is then windowed, using a Hamming window or a Hanning window as the windowing function. The corresponding short-time Fourier transform (STFT) is X ...
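A short sketch of this front end using scipy's STFT, with the frame sizes stated above; the frame shift is not given in the text, so the 50% overlap here is an assumption, and the input signal is a synthetic stand-in for the mixture y(t).

```python
import numpy as np
from scipy.signal import stft

fs = 8000                        # sampling rate from the text
frame_len = int(0.032 * fs)      # 32 ms -> 256 samples at 8 kHz (512 at 16 kHz)
hop = frame_len // 2             # frame shift assumed; not specified in the text

# Stand-in mixture; in practice this is the mixed speech signal y(t).
t = np.arange(2 * fs) / fs
y = np.sin(2 * np.pi * 220 * t) + np.sin(2 * np.pi * 330 * t)

# scipy zero-pads a short final segment, matching the zero-padding step above.
f, frames, Y = stft(y, fs=fs, window="hamming",
                    nperseg=frame_len, noverlap=frame_len - hop)

# |Y| (or its log) is the spectral feature fed to the separation network;
# the phase of Y is kept for resynthesising the separated sources.
log_mag = np.log(np.abs(Y) + 1e-8)
```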
Embodiment 3
[0108] Experimental results show that the separation network model of the present invention, based on a convolutional neural network and deep clustering, performs well even when the speakers' voices in the mixture have equal energy (as in the WSJ0 corpus), and even when the mixture to be separated contains speakers who did not participate in training the model (i.e., the model is speaker-agnostic). The experiments show that the trained network model can effectively separate single-speaker speech. For source separation, the deep learning model learns acoustic cues that are independent of both speaker and language, and exploits the local correlation between regions of the magnitude spectrogram.
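For context, deep clustering methods typically recover the individual sources at test time by clustering the network's time-frequency embeddings, e.g. with k-means, and masking the mixture spectrogram. The sketch below assumes that standard procedure and scikit-learn's KMeans; the embedding array is a placeholder for the output of the patent's CNN-based network, whose exact test-time procedure is not detailed in this excerpt.

```python
import numpy as np
from sklearn.cluster import KMeans

def separate_by_clustering(mixture_mag: np.ndarray,
                           embeddings: np.ndarray,
                           n_speakers: int = 2) -> list[np.ndarray]:
    """Assign each time-frequency bin to a speaker and build binary masks.

    mixture_mag : (F, T) magnitude spectrogram of the mixture
    embeddings  : (F * T, D) unit-norm embeddings from the trained network
                  (placeholder for the patent's CNN + deep clustering model)
    """
    labels = KMeans(n_clusters=n_speakers, n_init=10).fit_predict(embeddings)
    masks = [(labels == k).reshape(mixture_mag.shape)
             for k in range(n_speakers)]
    # Each masked magnitude spectrogram is resynthesised with the mixture
    # phase via an inverse STFT to recover single-speaker speech.
    return [mixture_mag * m for m in masks]
```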