
Multi-speaker voice separation method based on convolutional neural network and deep clustering

A speech separation technology based on convolutional neural networks, applicable to biological neural network models, speech analysis, neural architectures, etc. It addresses the problems that existing system models cannot separate three-speaker mixed-signal speech, that a trained separation model cannot be applied to new speakers' speech, and that such models cannot scale to more speakers; the effects achieved are fewer parameters and improved performance.

Publication Date: 2019-11-15 (status: Inactive)
XINJIANG UNIVERSITY
Cites: 5 · Cited by: 35

AI Technical Summary

Problems solved by technology

[0004] Current solutions have the following problems. First, the speech separation model is speaker-dependent: a trained separation model cannot be used to separate the speech of new speakers, so it works only for a closed set of speakers and cannot grow as the number of speakers grows. Second, such models can only separate mixed signals whose sources are of different types (for example, separating noise from speech); they are ineffective on mixtures whose sources are of the same type (such as multiple speakers). Finally, existing source separation models do not scale to an arbitrary number of speakers: if the samples used to train the separation model are mixtures of two speakers, the system cannot separate a three-speaker mixed signal.



Examples


Embodiment 1

[0042] An embodiment of the present invention provides a multi-speaker speech separation method based on a convolutional neural network and deep clustering. Implementation comprises two steps: training the separation network model, and using the separation network to separate single-speaker speech. Figure 1 is a flow chart of the method; the same flow is followed both when training the separation network model and when using it to separate mixed speech into single-speaker speech. The difference is that during training the network parameters are continuously updated according to the model, whereas the parameters remain fixed when the speech separation system runs to separate mixed speech into single-speaker speech. In addition, when training the network, the steps of Figure 1 are executed: mixed speech features -- gated ("threshold") dilated convolution...
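The excerpt breaks off at the gated ("threshold") dilated convolution step, so as a minimal sketch only, the block below shows what a gated dilated 1-D convolution of the kind named here typically looks like in PyTorch. The channel count, kernel size, dilation schedule, and gating form are illustrative assumptions, not details taken from the patent.

```python
# Minimal sketch of a gated dilated convolution block (PyTorch).
# Channel counts, kernel size, and dilation schedule are illustrative
# assumptions; the patent text is truncated before these details.
import torch
import torch.nn as nn

class GatedDilatedConv(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2  # keep the frame count unchanged
        self.filter = nn.Conv1d(channels, channels, kernel_size,
                                dilation=dilation, padding=pad)
        self.gate = nn.Conv1d(channels, channels, kernel_size,
                              dilation=dilation, padding=pad)

    def forward(self, x):
        # tanh filter output modulated by a sigmoid gate, as in gated conv nets
        return torch.tanh(self.filter(x)) * torch.sigmoid(self.gate(x))

# Stack blocks with exponentially growing dilation to enlarge the receptive
# field over the spectrogram frames while keeping the parameter count small.
net = nn.Sequential(*[GatedDilatedConv(64, dilation=2 ** i) for i in range(4)])
features = torch.randn(1, 64, 100)  # (batch, channels, frames)
out = net(features)                 # same shape as the input
```

Stacking dilated blocks this way widens the temporal context cheaply, which is consistent with the patent's stated aim of fewer parameters.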

Embodiment 2

[0065] Specific description of the speech separation problem

[0066] The goal of mono speech separation is to estimate the individual source signals that are mixed together and overlap in a mono signal. Denote the S source signal sequences in the time domain as $x_s(t)$, $s = 1, \ldots, S$, and express the mixed signal sequence in the time domain as:

[0067] $y(t) = \sum_{s=1}^{S} x_s(t)$
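In other words, mixing is sample-wise addition of the source waveforms. A minimal numeric illustration, with random arrays standing in for real recordings:

```python
import numpy as np

# Placeholder waveforms standing in for two speakers' recordings
# (one second at 8 kHz each); real data would be loaded from audio files.
x1 = np.random.randn(8000)  # source 1: x_1(t)
x2 = np.random.randn(8000)  # source 2: x_2(t)

y = x1 + x2                 # mixture y(t) = sum over s of x_s(t)
```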

[0068] Framing, windowing, and a short-time Fourier transform are performed on the speech signal to obtain its spectrum. Specifically, 32 ms of samples are taken as one frame: at a sampling rate of 8 kHz one frame is 256 samples, and at 16 kHz one frame is 512 samples; if a segment is shorter than 32 ms, its samples are first zero-padded to 256 or 512. Then each frame is windowed, using a Hamming or Hanning window as the window function. The corresponding short-time Fourier transform (STFT) is X ...
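As a rough sketch of this framing/windowing/STFT step using scipy (the hop size is not stated in the excerpt, so the 50% overlap below is an assumption):

```python
import numpy as np
from scipy.signal import stft

fs = 8000                        # sampling rate: 8 kHz (use 16000 for 16 kHz)
frame = int(0.032 * fs)          # 32 ms frame -> 256 samples at 8 kHz, 512 at 16 kHz
hop = frame // 2                 # hop size assumed; the excerpt does not state it

y = np.random.randn(fs)          # placeholder for one second of mixed speech
f, t, X = stft(y, fs=fs, window="hamming", nperseg=frame,
               noverlap=frame - hop, nfft=frame)

magnitude = np.abs(X)            # amplitude spectrum fed to the network
phase = np.angle(X)              # phase spectrum, reused at resynthesis
```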

Embodiment 3

[0108] Experimental results show that the separation network model of the present invention, based on a convolutional neural network and deep clustering, performs well even when the speakers' voices in the mixture have equal energy (as in the WSJ0 corpus), and even when the speech to be separated contains speakers who did not participate in training the model (i.e., the model is speaker-agnostic). The results show that the trained network model can effectively separate single-speaker speech. For source separation, the deep learning model learns acoustic cues that are independent of both speaker and language, and it exploits the correlation between regions of the amplitude spectrogram.



Abstract

The invention discloses a multi-speaker voice separation method based on a convolutional neural network and deep clustering. The method comprises the following steps. 1. Training stage: framing, windowing, and a short-time Fourier transform are applied to the single-channel multi-speaker mixed speech and to the corresponding single-speaker speech, and the mixed-speech amplitude spectrum and the single-speaker amplitude spectra are used as inputs to train the neural network model. 2. Testing stage: the mixed-speech amplitude spectrum is fed to the gated ("threshold") dilated-convolution deep clustering model to obtain a high-dimensional embedding vector for each time-frequency unit of the mixed spectrum; a K-means clustering algorithm classifies the vectors according to a preset number of speakers; a time-frequency masking matrix is obtained for each sound source from the time-frequency units corresponding to its vectors, and each matrix is multiplied with the mixed-speech amplitude spectrum to obtain that speaker's spectrum; finally, each speaker spectrum is combined with the mixed-speech phase spectrum, and an inverse short-time Fourier transform yields the separated single-speaker time-domain waveform signals.
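To make the testing stage concrete, here is a minimal sketch of the clustering-and-masking portion of the pipeline using scikit-learn's K-means. The embedding network itself is omitted (random arrays stand in for its output), and the shapes and sampling parameters are assumptions consistent with the 32 ms / 8 kHz framing described above.

```python
import numpy as np
from scipy.signal import istft
from sklearn.cluster import KMeans

# Assumed shapes: the network maps each of the F*T time-frequency units of
# the mixture spectrogram to a D-dimensional embedding vector.
F, T, D, n_speakers = 129, 100, 40, 2            # F = 256 // 2 + 1 frequency bins
embeddings = np.random.randn(F * T, D)           # placeholder network output
mix_mag = np.abs(np.random.randn(F, T))          # mixture amplitude spectrum
mix_phase = np.random.uniform(-np.pi, np.pi, (F, T))  # mixture phase spectrum

# Cluster the time-frequency embeddings into one group per speaker.
labels = KMeans(n_clusters=n_speakers, n_init=10).fit_predict(embeddings)

for s in range(n_speakers):
    mask = (labels == s).reshape(F, T)                # binary T-F masking matrix
    spec = mask * mix_mag * np.exp(1j * mix_phase)    # masked complex spectrum
    _, wav = istft(spec, fs=8000, window="hamming", nperseg=256)  # time domain
```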

Description

technical field

[0001] The present invention relates to the field of monaural speech separation, and in particular to a multi-speaker speech separation method based on a dilated convolutional neural network and deep clustering, which can separate a monaural multi-speaker mixed speech signal into two or three single-speaker speech time-domain waveforms.

Background technique

[0002] With the increasing strategic importance of artificial intelligence, voice is a bridge between human and machine, and powerful speech processing technology is essential. Although the accuracy of automatic speech recognition systems has exceeded the threshold for many practical applications, some difficulties remain to be solved to make speech recognition systems more robust and more widely applicable. One such difficulty is the cocktail party problem: when multiple speakers speak at the same time, or background noise is accompanied by other human voices, track a...


Application Information

Patent Type & Authority: Application (China)
IPC(8): G10L21/028; G10L21/0208; G10L25/30; G06K9/62; G06N3/04
CPC: G10L21/028; G10L21/0208; G10L25/30; G10L2021/02087; G06N3/045; G06F18/23213
Inventors: 董兴磊, 胡英, 黄浩
Owner: XINJIANG UNIVERSITY