
Multi-speaker Speech Separation Method Based on Convolutional Neural Network and Deep Clustering

A convolutional neural network and speech separation technology, applied to biological neural network models, speech analysis, neural architectures, etc. It addresses the problems that existing system models cannot separate three-speaker mixed signals, that a trained separation model cannot separate the speech of unseen speakers, and that such models cannot scale to more speakers, achieving the effects of fewer parameters and improved performance.

Publication Date: 2021-01-12 (status: Inactive)
XINJIANG UNIVERSITY

AI Technical Summary

Problems solved by technology

[0004] Current solutions have the following problems. First, the speech separation model is speaker-dependent: a trained separation model cannot be used to separate the speech of new speakers, i.e., it works only for a closed set of speakers and cannot grow as speakers are added. Second, such models can only separate mixed signals whose sources are of different types (for example, separating noise from a speaker); they are ineffective at separating signals whose sources are of the same type (such as multiple speakers). Finally, existing source separation models do not scale to an arbitrary number of speakers: if the samples used to train the separation model are mixtures of two speakers, the system cannot model the separation of three-speaker mixed signals.


Examples


Embodiment 1

[0042] The embodiment of the present invention provides a multi-speaker speech separation method based on a convolutional neural network and deep clustering, which comprises two steps: training the separation network model, and using the separation network to separate mixed speech into single-speaker speech. Figure 1 is a flow chart of the multi-speaker speech separation method based on convolutional neural network and deep clustering; the same flow is followed both when training the separation network model and when using the separation network to separate mixed speech into single-speaker speech. The only difference is that during training the network parameters are continuously updated according to the model, whereas the parameters remain unchanged while the speech separation system runs to separate mixed speech into single-speaker speech. In addition, when training the network, the pipeline of Figure 1 is executed: mixed speech features -- gated dilated convolutional...
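The objective used to update the network parameters is not reproduced in this excerpt. In standard deep clustering, on which the method is based, the embedding network is trained with the permutation-free affinity loss ||VV^T - YY^T||_F^2. A minimal NumPy sketch follows; the names V (unit-norm embeddings, one per time-frequency unit) and Y (one-hot source labels from the ideal binary mask) are ours, not the patent's:

```python
import numpy as np

def deep_clustering_loss(V, Y):
    """Affinity loss ||V V^T - Y Y^T||_F^2 of standard deep clustering.

    V: (N, D) unit-norm embeddings, one row per time-frequency unit.
    Y: (N, S) one-hot source assignments (ideal binary mask labels).

    The N x N affinity matrices are never formed explicitly; expanding
    the Frobenius norm gives an equivalent O(N * D^2) computation.
    """
    return (np.sum((V.T @ V) ** 2)
            - 2.0 * np.sum((V.T @ Y) ** 2)
            + np.sum((Y.T @ Y) ** 2))
```

Minimizing this loss pulls together the embeddings of time-frequency units dominated by the same speaker and pushes apart those of different speakers, which is what makes the K-means step of the test stage effective.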

Embodiment 2

[0065] Specific description of the speech separation problem

[0066] The goal of mono speech separation is to estimate the individual source signals that are mixed together and overlap in a mono (single-channel) signal. Denote the S source signal sequences in the time domain as x_s(t), s = 1, ..., S, and express the mixed signal sequence in the time domain as:

[0067] y(t) = x_1(t) + x_2(t) + ... + x_S(t)

[0068] Framing, windowing, and the short-time Fourier transform are performed on the speech signal to obtain its spectrum. Specifically, 32 ms of sampling points are taken as one frame: at a sampling rate of 8 kHz one frame is 256 sampling points, and at 16 kHz one frame is 512 sampling points; a segment shorter than 32 ms is first zero-padded to 256 or 512 sampling points. Then each frame is windowed, using a Hamming window or a Hanning window. The corresponding short-time Fourier transform (STFT) is X ...
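For concreteness, the framing, windowing, and STFT just described can be sketched in a few lines of NumPy. This is an illustration, not the patent's code; the frame shift is not specified in this excerpt, so a 50% overlap is assumed:

```python
import numpy as np

def magnitude_spectrum(signal, fs):
    """Frame, window (Hamming), and STFT a speech signal as described above.

    32 ms frames: 256 samples at 8 kHz, 512 samples at 16 kHz.
    A signal shorter than one frame is zero-padded to the frame length.
    Hop size (not specified in the text) is assumed to be half a frame.
    """
    frame_len = int(0.032 * fs)            # 256 @ 8 kHz, 512 @ 16 kHz
    hop = frame_len // 2                   # assumption: 50% overlap
    if len(signal) < frame_len:            # zero-pad short signals
        signal = np.pad(signal, (0, frame_len - len(signal)))
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    X = np.fft.rfft(frames, axis=1)        # STFT, one row per frame
    return np.abs(X), np.angle(X)          # magnitude and phase spectra
```

At 8 kHz this yields 256-sample frames (129 frequency bins); at 16 kHz, 512-sample frames (257 bins).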

Embodiment 3

[0108] Experimental results show that the separation network model based on a convolutional neural network and deep clustering performs well even when the speakers' voices in the mixed speech have equal energy (as in the WSJ0 corpus), and even when the mixture to be separated contains speakers who did not participate in training the model (i.e., the model is "speaker-agnostic"). The trained network model can effectively separate the mixture into single-speaker speech. For source separation, the deep learning model learns acoustic cues that are independent of both speaker and language, and it exploits the correlation between regions of the magnitude spectrogram.



Abstract

The invention discloses a multi-speaker speech separation method based on a convolutional neural network and deep clustering, comprising: 1. Training stage: single-channel multi-speaker mixed speech and the corresponding single-speaker speech are divided into frames, windowed, and transformed by the short-time Fourier transform; the mixed-speech magnitude spectrum and the single-speaker magnitude spectra are used as the input of the neural network model for training. 2. Test stage: the mixed-speech magnitude spectrum is used as the input of the gated dilated convolution deep clustering model to obtain a high-dimensional embedding vector for each time-frequency unit of the mixed spectrum; the K-means clustering algorithm classifies the vectors according to the set number of speakers, yielding a time-frequency masking matrix for the sound source corresponding to each time-frequency unit; each masking matrix is multiplied with the magnitude spectrum of the mixed speech to obtain that speaker's spectrum; finally, each speaker's spectrum is combined with the phase spectrum of the mixed speech, and the inverse short-time Fourier transform yields the separated time-domain waveform signals.
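Read as a recipe, the test stage maps onto a short script. The sketch below is illustrative only: embed stands in for the trained gated dilated convolution deep clustering model (not shown here), K-means comes from scikit-learn, and the inverse STFT from SciPy:

```python
import numpy as np
from scipy.signal import istft
from sklearn.cluster import KMeans

def separate(mix_mag, mix_phase, embed, n_speakers, fs, frame_len):
    """Test-stage pipeline of the abstract: embeddings -> K-means ->
    binary T-F masks -> masked magnitudes + mixture phase -> iSTFT.

    mix_mag, mix_phase: (frames, bins) magnitude/phase of the mixture.
    embed: trained network (assumed, not defined here) mapping the
           magnitude spectrum to one embedding per time-frequency
           unit, shape (frames * bins, D).
    """
    V = embed(mix_mag)                                    # (T*F, D)
    labels = KMeans(n_clusters=n_speakers).fit_predict(V)
    sources = []
    for s in range(n_speakers):
        mask = (labels == s).reshape(mix_mag.shape)       # binary T-F mask
        spec = mask * mix_mag * np.exp(1j * mix_phase)    # speaker spectrum
        _, wav = istft(spec.T, fs=fs, nperseg=frame_len)  # back to time domain
        sources.append(wav)
    return sources
```

As the abstract specifies, only the magnitude is masked; the phase of the mixture is reused unchanged for every separated source.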

Description

technical field

[0001] The present invention relates to the field of monophonic (single-channel) speech separation, and in particular to a multi-speaker speech separation method based on a dilated convolutional neural network and deep clustering, which can separate a monophonic multi-speaker mixture into two or three single-speaker time-domain speech waveforms.

Background technique

[0002] With the growing strategic importance of artificial intelligence, speech is the bridge between human and machine, and powerful speech processing technology is essential. Although the accuracy of automatic speech recognition systems has exceeded the threshold required for many practical applications, some difficulties remain to be solved before speech recognition systems become more robust and more widely applicable, such as the cocktail party problem: when multiple speakers talk at the same time, or background noise is accompanied by other human voices, track a...


Application Information

Patent Type & Authority: Patent (China)
IPC(8): G10L21/028; G10L21/0208; G10L25/30; G06K9/62; G06N3/04
CPC: G10L21/028; G10L21/0208; G10L25/30; G10L2021/02087; G06N3/045; G06F18/23213
Inventors: 董兴磊 (Dong Xinglei); 胡英 (Hu Ying); 黄浩 (Huang Hao)
Owner: XINJIANG UNIVERSITY