Single-channel human voice and background voice separation method based on convolutional recurrent neural network

A technology combining recurrent and convolutional neural networks, applied in speech analysis, instruments, etc. It addresses problems such as poor separation of human voice from background sound and the inability to accurately extract the time-domain and frequency-domain information of speech.

Active Publication Date: 2021-01-22
NANJING SILICON INTELLIGENCE TECH CO LTD


Problems solved by technology

[0004] The purpose of the present invention is to overcome the shortcomings of the existing technology, namely that the time-domain and frequency-domain information in speech cannot be accurately extracted and that the separation of human voice from background sound in mixed speech is poor, by providing a single-channel human voice and background sound separation method based on a convolutional recurrent neural network. By designing two convolution kernels of different sizes in the convolutional neural network, the method captures the time-domain and frequency-domain information of the speech; it simultaneously performs feature dimensionality reduction and extracts local features, combines them with the original mixed-signal amplitude spectrum to form multi-scale features, and inputs these into the recurrent neural network, which can accurately separate the human-voice signal and the background-sound signal of the mixed speech.
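The "two convolution kernels of different sizes" idea can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the patent's implementation: it supposes one kernel elongated along the time axis and one along the frequency axis of the amplitude spectrogram, and stacks the two cropped outputs into a multi-scale feature map.

```python
import numpy as np

def conv2d_valid(x, k):
    # Naive 2-D "valid" cross-correlation, purely for illustration.
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

# Hypothetical mixed-signal amplitude spectrogram: 64 freq bins x 100 frames.
spec = np.abs(np.random.default_rng(0).standard_normal((64, 100)))

# Two kernels of different shapes (assumed sizes, not from the patent):
k_time = np.ones((1, 7)) / 7.0   # long in time: captures temporal context
k_freq = np.ones((7, 1)) / 7.0   # long in frequency: captures spectral context

f_time = conv2d_valid(spec, k_time)   # shape (64, 94)
f_freq = conv2d_valid(spec, k_freq)   # shape (58, 100)

# Crop to a common size and stack into a two-channel multi-scale feature map.
h = min(f_time.shape[0], f_freq.shape[0])
w = min(f_time.shape[1], f_freq.shape[1])
multi_scale = np.stack([f_time[:h, :w], f_freq[:h, :w]])  # (2, 58, 94)
print(multi_scale.shape)
```

In the patent's pipeline this multi-scale feature would then be combined with the original amplitude spectrum before entering the recurrent network.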



Examples


Embodiment 1

[0036] A single-channel human voice and background sound separation method based on a convolutional recurrent neural network, comprising the steps:

[0037] S1. Acquire an original mixed speech signal, the original signal being a single-channel mixture of human voice and background sound;

[0038] S2. Apply framing, windowing, and time-frequency conversion to the acquired original mixed speech signal to obtain the original mixed-signal amplitude spectrum and the original mixed-signal phase spectrum;
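Step S2 (framing, windowing, time-frequency conversion) can be sketched in NumPy. The frame length, hop size, Hann window, and 16 kHz rate below are assumptions for illustration; the patent does not specify them.

```python
import numpy as np

def stft_mag_phase(x, frame_len=512, hop=128):
    """Frame a 1-D signal, apply a Hann window, and FFT each frame;
    return the amplitude spectrum and phase spectrum (frames x bins)."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * win for i in range(n_frames)]
    )
    spec = np.fft.rfft(frames, axis=1)       # one-sided spectrum per frame
    return np.abs(spec), np.angle(spec)      # amplitude, phase

rng = np.random.default_rng(0)
mix = rng.standard_normal(16000)             # 1 s of a hypothetical 16 kHz mixture
mag, phase = stft_mag_phase(mix)
print(mag.shape)                             # (122, 257)
```

The amplitude spectrum `mag` feeds the network (S3); the phase spectrum `phase` is held back for reconstruction (S5).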

[0039] S3. Input the original mixed-signal amplitude spectrum into the convolutional neural network, which comprises a convolutional layer and a pooling layer arranged in sequence; the convolutional layer extracts the local features of the original mixed-signal amplitude spectrum, and the pooling layer reduces the dimensionality of these features, converting them into a low-resolution feature map for output; the convolutional layer includes ...
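The pooling layer's dimensionality reduction in S3 amounts to downsampling the feature map. A minimal sketch, assuming non-overlapping 2x2 max pooling (the pooling size is my assumption, not stated in the excerpt):

```python
import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping max pooling: keeps the strongest activation in each
    size x size block, halving each spatial dimension when size=2."""
    H, W = x.shape
    H2, W2 = H // size, W // size
    return x[:H2 * size, :W2 * size].reshape(H2, size, W2, size).max(axis=(1, 3))

feat = np.arange(36, dtype=float).reshape(6, 6)  # stand-in for a conv feature map
low_res = max_pool2d(feat)                        # (3, 3) low-resolution feature map
print(low_res)
```

Here a 6x6 feature map becomes the 3x3 low-resolution map that, per S4 of the abstract, is passed on to the recurrent network together with the original amplitude spectrum.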

Embodiment 2

[0053] As shown in Figures 1 and 2, on the basis of Embodiment 1, this embodiment further includes an attention layer between the convolutional layer and the pooling layer of the convolutional neural network. The attention layer automatically learns the importance of each feature channel, increases the weight of useful feature channels according to that importance, and suppresses feature channels that contribute little to the current task. Figure 2 depicts the attention layer, which is arranged between the convolutional layer and the pooling layer.

[0054] Preferably, the attention layer uses maximum pooling for its global pooling step.

[0055] Figure 2 provides a schematic diagram of the attention layer of this embodiment. Given an input x with c_1 feature channels, after a series of general transformations such as convolution, a feature with c_2 feature channels...
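The attention layer described in [0053]–[0055] resembles squeeze-and-excitation-style channel attention with global max pooling. A hedged NumPy sketch (the bottleneck width, weight shapes, and random weights below are illustrative assumptions, not the patent's parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(feat, w1, w2):
    """Channel attention sketch: global max pooling per channel, a small
    two-layer bottleneck, then sigmoid weights that rescale each channel
    by its learned importance (useful channels up, unhelpful ones down)."""
    c = feat.shape[0]
    squeezed = feat.reshape(c, -1).max(axis=1)   # global max pooling, shape (c,)
    hidden = np.maximum(0.0, w1 @ squeezed)      # ReLU bottleneck
    weights = sigmoid(w2 @ hidden)               # per-channel weights in (0, 1)
    return feat * weights[:, None, None]

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 16, 16))          # c_2 = 8 feature channels
w1 = rng.standard_normal((2, 8)) * 0.1           # hypothetical bottleneck weights
w2 = rng.standard_normal((8, 2)) * 0.1
out = channel_attention(feat, w1, w2)
print(out.shape)                                 # (8, 16, 16)
```

Because every attention weight lies in (0, 1), the layer can only attenuate channels relative to one another; training would shape which channels are preserved.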



Abstract

The invention discloses a single-channel human voice and background voice separation method based on a convolutional recurrent neural network. The method comprises the following steps: S1, acquiring an original mixed voice signal; S2, obtaining an original mixed signal amplitude spectrum and an original mixed signal phase spectrum; S3, inputting the original mixed signal amplitude spectrum into a convolutional neural network; S4, inputting the low-resolution feature map and the original mixed signal amplitude spectrum into a recurrent neural network, and combining with a time-frequency mask to obtain a predicted value of the human voice after passing through the time-frequency mask and a predicted value of the background sound after passing through the time-frequency mask; and S5, combining the predicted value of the human voice after passing through the time-frequency mask and the predicted value of the background voice after passing through the time-frequency mask with the original mixed signal phase spectrum to obtain a predicted human voice signal and a predicted background voice signal. Compared with the prior art, the separation method provided by the invention has the advantages that the time domain and frequency domain information of the voice can be captured, and the human voice signal and the background sound signal of the mixed voice are separated by the generated multi-scale characteristics.
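Steps S4–S5 of the abstract (masking the mixture spectrum, then recombining with the mixture phase) can be sketched as follows. The soft ratio mask and the random stand-in spectra are illustrative assumptions; the patent does not specify the mask form in this excerpt.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_bins = 100, 257
# Hypothetical network outputs: predicted magnitudes for voice and background.
voice_pred = np.abs(rng.standard_normal((n_frames, n_bins)))
bg_pred = np.abs(rng.standard_normal((n_frames, n_bins)))
mix_mag = np.abs(rng.standard_normal((n_frames, n_bins)))
mix_phase = rng.uniform(-np.pi, np.pi, (n_frames, n_bins))

# Soft time-frequency (ratio) mask: each source's share of the mixture energy.
eps = 1e-8
voice_mask = voice_pred / (voice_pred + bg_pred + eps)
voice_mag = voice_mask * mix_mag            # S4: masked voice magnitude
bg_mag = (1.0 - voice_mask) * mix_mag       # S4: masked background magnitude

# S5: recombine masked magnitudes with the mixture phase, then invert per frame.
voice_spec = voice_mag * np.exp(1j * mix_phase)
bg_spec = bg_mag * np.exp(1j * mix_phase)
voice_frames = np.fft.irfft(voice_spec, axis=1)   # time-domain frames per source
bg_frames = np.fft.irfft(bg_spec, axis=1)
print(np.allclose(voice_mag + bg_mag, mix_mag))   # the two masks partition the mixture
```

A full reconstruction would overlap-add the inverse-FFT frames with the analysis window from S2; that bookkeeping is omitted here.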

Description

Technical field

[0001] The present invention relates to the separation of human voice and background sound, and in particular to a single-channel human voice and background sound separation method based on a convolutional recurrent neural network.

Background technique

[0002] The purpose of speech separation is to separate target speech from background interference. Since the sound collected by a microphone may include noise, other people's voices, background music, and other interference, performing recognition directly without speech separation reduces recognition accuracy. Source separation is therefore of great value in signal-processing fields such as human-voice processing and automatic speech recognition. The separation of human voice and background music in a single channel is a basic and important branch of speech separation.

[0003] In recent years, with the improvement of software and hardware performanc...

Claims


Application Information

Patent Type & Authority Applications(China)
IPC (8): G10L21/0272; G10L21/0308; G10L25/18; G10L25/30; G10L25/45
CPC: G10L21/0272; G10L21/0308; G10L25/18; G10L25/30; G10L25/45
Inventor 孙超
Owner NANJING SILICON INTELLIGENCE TECH CO LTD