A speech enhancement method based on fusion network
By combining an improved empirical mode decomposition (EMD), a temporal convolutional network (TCN), a gated convolutional recurrent network (GCRN), and a feature fusion module, the method addresses insufficient speech feature extraction under low signal-to-noise ratio and achieves better speech enhancement, with the largest gains under low signal-to-noise ratio conditions.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HARBIN UNIV OF SCI & TECH
- Filing Date
- 2023-04-27
- Publication Date
- 2026-06-26
AI Technical Summary
Under low signal-to-noise ratio conditions, traditional neural network speech feature extraction is insufficient, and the speech enhancement effect needs to be improved.
The method fuses an improved empirical mode decomposition (EMD) with a gated convolutional recurrent neural network (GCRN) and a feature fusion module (FFM). Low-frequency signals are processed by a temporal convolutional network (TCN) and high-frequency signals are processed by a multi-layer gated convolutional recurrent neural network (MGCRN). The two feature sets are fused by the FFM, and the speech is finally reconstructed by the inverse short-time Fourier transform (ISTFT).
Under low signal-to-noise ratio conditions, the method significantly improves speech enhancement performance. In particular, at an SNR of -5 dB, fwSegSNR is more than 0.86 dB higher than that of the baseline models.
Figure CN116486826B_ABST
Description
Technical Field
[0001] This invention relates to a speech enhancement method based on a fusion network, belonging to the field of speech enhancement.
Background Technology
[0002] Voice is the primary means of communication between people, playing a vital role in advancing human society. The development of voice processing technology has promoted the advancement of voice-based human-computer interaction technology, thereby enhancing the ability of humans to interact with smart terminals. Voice processing includes many aspects such as voice enhancement, voice separation, and voice recognition. Among these, voice enhancement is the front-end processing technology for voice interaction; therefore, the effectiveness of voice enhancement directly affects the quality of interaction and user experience.
[0003] Early methods for speech enhancement included wavelet thresholding, spectral subtraction, Wiener filtering, and minimum mean square error estimation. However, these methods were not effective for processing non-stationary signals. With the continuous development of signal processing technology, Huang et al. proposed the Empirical Mode Decomposition (EMD) algorithm in 1998. This algorithm adaptively decomposes a signal into oscillatory components called Intrinsic Mode Functions (IMFs). It has since been optimized and is widely used in the analysis of non-stationary and nonlinear signals. Therefore, non-stationary, non-uniform, and nonlinear signals such as speech can be decomposed into IMF components with the EMD algorithm. Combining the speech components extracted from different frequency bands with a neural network can then significantly improve enhancement performance.
[0004] With the improvement of computer performance and the development of artificial intelligence, more and more scholars are applying deep learning algorithms to speech enhancement. In 2014, Xu Yong et al. proposed a fully connected deep neural network (DNN) that learns the complex nonlinear relationship between the log power spectrum of noisy speech and that of clean speech, achieving good results. However, this model is prone to vanishing and exploding gradients during training. Later, scholars proposed convolutional neural networks (CNNs), which are widely used in speech enhancement because of their strong feature extraction capability and local filtering characteristics. Researchers subsequently proposed recurrent neural networks (RNNs). Although RNNs improve on the performance of CNNs to some extent, gradients can still vanish or explode during training. Long Short-Term Memory (LSTM) networks mitigate this phenomenon, but further research found that the vanishing gradient problem still occurs when LSTMs process very long sequences. Building on LSTM, Tan et al. proposed the Convolutional Recurrent Neural Network (CRN), a model that integrates CNN and LSTM. CRN requires fewer training parameters than LSTM and offers better objective intelligibility and perceptual quality. However, as a variant structure based on CNN and LSTM, it still suffers from the vanishing gradient problem as the amount of data increases. Dauphin et al. proposed using Gated Linear Units (GLUs) instead of recurrent connections as a gating mechanism to ease gradient propagation and reduce complexity. Combining the advantages of GLU and CRN, Tan et al. replaced the encoding and decoding convolutional layers of CRN with GLU modules, proposing the GCRN network, which achieved good results in speech enhancement. However, the increased structural complexity of the CRN network resulted in relatively low network efficiency. To address the effect of model complexity on memory usage and computational speed, Ibtehaz et al. proposed the MultiResUNet network in 2020. This network improves efficiency by adding multi-resolution modules to the encoder and decoder, focusing on different features to acquire more feature information. Subsequently, Lea et al. proposed using Temporal Convolutional Networks (TCNs) to process sequential data. This method not only improved training speed but also effectively avoided the error accumulation of RNNs and their variants. Speech signals are nonlinear, discontinuous signals with complex frequency characteristics. When extracting speech features, processing the different frequency bands separately can reduce model parameters and prevent over-processing of low-frequency signals with indistinct features, thereby improving both processing speed and speech enhancement. How to fuse two speech features from different dimensions is a key research focus. Y. Shi et al. used a Feature Fusion Module (FFM) to fuse cochlear cepstral coefficients and energy features, achieving excellent recognition rates and high noise resistance. F. Dang et al. proposed a full-band and sub-band fusion network based on a dual-path transformer, fusing sub-band and full-band information through FFM to obtain good frequency-domain speech enhancement. Therefore, using an FFM module yields better results when fusing different features.
Summary of the Invention
[0005] To address the issues of insufficient speech feature extraction and inadequate speech enhancement performance under low signal-to-noise ratio conditions using traditional neural networks, this invention proposes a speech enhancement method that integrates an improved EMD and GCRN network.
[0006] This invention provides a speech enhancement method that integrates an improved EMD and GCRN network, the method comprising:
[0007] S1. Use AMM-EMD to extract speech features across the entire frequency band and perform preliminary noise reduction;
[0008] S2. Based on feature extraction, FFM is used to integrate feature information from different dimensions in a more complete way;
[0009] S3. Based on the fusion of two features using FFM, construct a dual-input ME-MGFCRN speech enhancement model;
[0010] S4. The signal is reconstructed using ISTFT and converted into a time-domain signal, thus completing the conversion from noisy speech to clean speech and finally obtaining the enhanced speech.
[0011] Preferably, S1 includes:
[0012] S11. Use the TCN module to process low-frequency signals;
[0013] S12. Use MGCRN network to process high-frequency signals.
[0014] Preferably, in S1, the TCN network consists of two residual networks and dilated convolutions. Each residual network comprises two dilated causal convolutions, two weight normalization layers, two activation function layers, and two dropout layers, which are connected by 1×1 convolutional modules. Each dilated causal convolution module in the TCN network uses exponentially increasing dilation to maintain a small size while obtaining more information, thus increasing the receptive field. The receptive field is calculated using the following formula:
[0015] $RF_{d+1} = RF_d + (k-1) \times S_d$
[0016] where $d$ is the index of the convolutional layer, $RF_{d+1}$ is the receptive field of the current layer, $RF_d$ is the receptive field of the previous layer, $k$ is the size of the convolution kernel, and $S_d$ is the product of the step sizes (strides) of all previous layers (excluding the current layer).
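The receptive-field recurrence in [0015] can be applied layer by layer. The following is a minimal sketch, assuming an input receptive field of 1 and illustrative kernel sizes and strides; the function name `receptive_field` and the example values are hypothetical and do not come from the patent.

```python
def receptive_field(kernel_sizes, strides):
    """Apply RF_{d+1} = RF_d + (k - 1) * S_d layer by layer."""
    rf = 1           # assumption: a single input sample sees only itself
    stride_prod = 1  # S_d: product of the strides of all previous layers
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * stride_prod
        stride_prod *= s  # the current layer's stride counts from the next layer on
    return rf


# Illustrative values only: three layers with kernel size 3.
rf_example = receptive_field([3, 3, 3], [1, 2, 2])
```

With all strides equal to 1 the receptive field grows by k - 1 per layer; larger strides in earlier layers make every later layer widen it faster.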
[0017] Preferably, in step S1, based on the GLU network's ability to capture long-term memory and the absence of gradient vanishing, this invention combines the GLU network with CRN units to construct a GCRN network. Simultaneously, each encoder and decoder module of the GCRN network is transformed into a multi-layer structure, with each encoder and decoder module replaced by three batch normalization layers and three different convolutional modules. Residual connections are used between the encoder and decoder modules to construct the MGCRN network of this invention.
[0018] Preferably, in step S2, since the GCRN network is used to process the high-frequency (HF) group and the TCN network to process the low-frequency (LF) group, simply adding the processed features together would inevitably lead to information redundancy or distortion of the speech features. Therefore, a feature fusion module is used to process information from different dimensions, thereby effectively solving this problem. The FFM module contains one fully connected layer (FC) and two BGRU layers, which are connected by a dense network (DN).
[0019] Preferably, in step S3, based on the constructed MGCRN network, and taking into account the advantages of the TCN module and the fusion capability of the FFM module for low-frequency and high-frequency features, an ME-MGFCRN network is created.
[0020] Preferably, in step S4, the noisy speech is first transformed using EMD to obtain IMF components, which are then divided into LF and HF groups based on frequency characteristics. The HF groups are then transformed using the AMM algorithm to obtain time-domain features, and multi-layer gated convolutional recurrent neural network is used to extract these features. Simultaneously, TCN is used to extract the LF group features. Then, FFM is used to fuse these two types of features to construct a dual-input ME-MGFCRN speech enhancement model. The fused output is then reconstructed using ISTFT to convert it into a time-domain signal, thus completing the conversion from noisy speech to clean speech, ultimately obtaining the enhanced speech.
[0021] The present invention addresses the problem of insufficient speech feature extraction and inadequate speech enhancement performance under low signal-to-noise ratio conditions using traditional neural networks. Based on Empirical Mode Decomposition (EMD), Temporal Convolutional Network (TCN), and Gated Linear Units Convolutional Recurrent Neural Network (GCRN), and combined with a Feature Fusion Module (FFM), it proposes an Adaptive Mean Median-Empirical Mode Decomposition-Multilayer Gated Feature Fusion Module Convolutional Recurrent Neural Network (ME-MGFCRN) speech enhancement model. This network model employs a frequency-separated learning strategy to learn low-frequency and high-frequency features. Specifically, it uses TCN and MGFCRN networks to acquire low-frequency and high-frequency features, and then processes these two sets of features through FFM to achieve speech enhancement via feature mapping. The proposed model was subjected to ablation and comparative experiments on a dataset, and the speech enhancement effect was evaluated using PESQ, fwSegSNR, and STOI. The study shows that the proposed model improves upon other baseline models under different noise environments and signal-to-noise ratios (SNRs), especially under low SNR conditions of -5 dB, where fwSegSNR and PESQ are improved by more than 0.86 dB and 0.02, respectively, compared to other baseline models.
Attached Figure Description
[0022] Figure 1 Overall structural diagram of the ME-MGFCRN speech enhancement model;
[0023] Figure 2 TCN structure diagram;
[0024] Figure 3 Feature fusion module diagram;
[0025] Figure 4 MGFCRN network structure diagram;
[0026] Figure 5 fwSegSNR score graphs under different environments;
Detailed Implementation
[0027] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0028] It should be noted that, unless otherwise specified, the embodiments and features described in the present invention can be combined with each other.
[0029] The present invention will be further described below with reference to the accompanying drawings and specific embodiments, but this is not intended to limit the invention. In the speech enhancement method of this embodiment, which fuses an improved EMD with a GCRN network, the noisy speech is first decomposed by EMD into IMF components, which are divided into LF and HF groups based on frequency characteristics. Second, the HF group is processed with the AMM algorithm to obtain time-domain features, and a multilayer gated convolutional recurrent neural network (MGCRN) is used to extract the HF group features. Simultaneously, a TCN is used to extract the LF group features. Finally, FFM is used to fuse the two types of features to construct a dual-input ME-MGFCRN speech enhancement model, and the fused output is reconstructed using ISTFT into a time-domain signal, thereby completing the conversion from noisy speech to clean speech and obtaining the enhanced speech. The overall structure is shown in Figure 1. This embodiment implements a speech enhancement method based on a fusion network, including:
[0030] S1. Utilize AMM-EMD for full-band speech feature extraction and preliminary noise reduction, including:
[0031] S11. Use the TCN module to process low-frequency signals;
[0032] S12. Use MGCRN network to process high-frequency signals.
[0033] Based on the EMD algorithm's ability to both denoise and decompose different frequency components, this implementation divides all IMFs after EMD decomposition into LF and HF groups. An adaptive average median thresholding method is used to denoise all components in the HF group using a soft thresholding function, obtaining each individual threshold point for the HF group and estimating its threshold, which can be expressed as:
[0034]
[0035] where $i$ is the index of the IMF, $Thr_i$ is the estimated threshold, $L$ is the signal length, and $\rho_i$ is the estimated noise level of the i-th IMF, which can be expressed as:
[0036]
[0037] where $AMM_i$, the mean median deviation of the i-th IMF component, can be expressed as:
[0038] $AMM_i = \mathrm{median}\left[ IMF_i(t) - \mathrm{mean}\left[ IMF_i(t) \right] \right]$
[0039] where $IMF_i(t)$ denotes the value of the i-th IMF component in the AMM-EMD algorithm.
[0040] After processing with the adaptive average median threshold, the i-th IMF component $C_i(t)$ of the speech signal at different frequencies is obtained, which can be expressed as:
[0041]
[0042] $C_i(t)$ is then subjected to a Fourier transform and fed into the neural network as a speech feature for learning. Since the feature extraction process only involves mathematical operations, the computational load is relatively small.
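The two operations named in this step, the mean median deviation of [0038] and soft thresholding of the HF components, can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the threshold and noise-level formulas of [0034] and [0036] are not reproduced in the text, so the threshold `thr` is passed in as a plain argument, and the soft thresholding function shown is the standard textbook form, which may differ from the one used in the patent. The function names are hypothetical.

```python
import numpy as np


def amm(imf):
    """Mean median deviation as written in [0038]: median[IMF_i(t) - mean[IMF_i(t)]]."""
    imf = np.asarray(imf, dtype=float)
    return float(np.median(imf - imf.mean()))


def soft_threshold(x, thr):
    """Standard soft thresholding (assumed form): shrink magnitudes by thr and
    set anything whose magnitude is below thr to zero."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - thr, 0.0)


# Illustrative values only.
example_amm = amm([1.0, 2.0, 3.0, 10.0])
example_denoised = soft_threshold([-3.0, -0.5, 0.5, 2.0], thr=1.0)
```

In the method described above, each IMF in the HF group would get its own threshold derived from its estimated noise level; the sketch leaves that derivation out because the source does not show it.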
[0043] To address the limitations of existing LSTM, GRU, and similar variant models, an increasing number of researchers are considering CNN networks as the primary choice for studying large-scale batch sequence problems. This has led to the development of TCN networks, whose structure is shown in Figure 2.
[0044] The TCN network consists of two residual networks and dilated convolutions. Each residual network comprises two dilated causal convolutions, two weight normalization layers, two activation function layers, and two dropout layers, which are connected by 1×1 convolutional modules.
[0045] Traditional CNN or RNN networks acquire more information by increasing network depth and kernel size, but this leads to increased network size and information loss. Each dilated causal convolution module of the TCN network instead uses an exponentially increasing dilation rate. This allows more information to be acquired while keeping the network small, thus increasing the receptive field. The calculation formula is:
[0046]
[0047] where $k$ and $d$ represent the filter size and dilation rate, respectively.
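The formula in [0046] is not reproduced in the text. As an illustration only, the sketch below assumes the commonly used definition of a one-dimensional dilated causal convolution, in which output sample t is the weighted sum of input samples t, t - d, t - 2d, and so on, with zeros before the start of the sequence. The function name and example values are hypothetical.

```python
import numpy as np


def dilated_causal_conv1d(x, w, dilation):
    """y[t] = sum_i w[i] * x[t - dilation * i], with zeros before the sequence start."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    k = len(w)                      # filter size
    pad = (k - 1) * dilation        # left padding keeps the convolution causal
    xp = np.concatenate([np.zeros(pad), x])
    y = np.zeros(len(x))
    for i in range(k):
        start = pad - dilation * i  # tap i looks dilation * i samples into the past
        y += w[i] * xp[start:start + len(x)]
    return y


# Exponentially increasing dilation rates for three stacked layers (illustrative).
dilations = [2 ** j for j in range(3)]
y_example = dilated_causal_conv1d([1, 2, 3, 4, 5], [1, 1], dilation=2)
```

Because the output has the same length as the input and never reads future samples, layers with dilations 1, 2, 4 can be stacked directly, which is how the receptive field grows without enlarging the filters.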
[0048] The TCN network possesses advantages such as parallel processing capability, flexible receptive field, gradient stability, relatively simple network structure, and the ability to preserve the original input feature map, enabling it to process sequences of arbitrary length. Therefore, this implementation chooses TCN to process the features of the LF group to improve the network's operating speed and obtain more feature information.
[0049] Based on the GLU network's ability to capture long-term memory and its absence of gradient vanishing, this implementation combines the GLU network with CRN units to construct the GCRN network. This network is then used to process the HF group.
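The gating idea behind the GLU can be shown in isolation. The sketch below assumes the usual formulation from Dauphin et al., in which one half of the channels is multiplied element-wise by the sigmoid of the other half. In the GCRN the two halves would come from convolutional layers; here the input is simply split along its last axis, which is an assumption made to keep the example short.

```python
import numpy as np


def glu(x):
    """Gated linear unit: split the last axis in half and gate one half with the other."""
    a, b = np.split(np.asarray(x, dtype=float), 2, axis=-1)
    gate = 1.0 / (1.0 + np.exp(-b))  # sigmoid, values in (0, 1)
    return a * gate


# Illustrative input: 4 time steps, 6 channels -> 3 gated output channels.
gated = glu(np.ones((4, 6)))
```

Because the ungated half passes through linearly, gradients have a path that does not go through a saturating nonlinearity, which is the property the text relies on.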
[0050] S2. Based on feature extraction, FFM is used to integrate feature information from different dimensions in a relatively complete way:
[0051] This implementation uses the GCRN network to process the HF group and the TCN network to process the LF group. Simply adding the processed features together would inevitably lead to information redundancy or distortion in the speech features. Therefore, a feature fusion module is used to process information from different dimensions, thus effectively solving this problem. The FFM module contains one fully connected layer (FC) and two BGRU layers, which are connected by a dense network (DN). The structure of the fusion module is shown in Figure 3.
[0052] S3. Based on the fusion of two features using FFM, construct a dual-input ME-MGFCRN speech enhancement model:
[0053] This implementation transforms each encoder and decoder module of the GCRN network into a multi-layer structure, replacing each encoder and decoder module with three batch normalization layers and three different convolutional modules. Residual connections are used between the encoder and decoder modules to construct the MGCRN network of this invention. Simultaneously, leveraging the advantages of the TCN module and considering the fusion capability of the FFM module for low-frequency and high-frequency features, the MGFCRN network of this implementation is created, with the structure shown in Figure 4.
[0054] Since the speech complexity within the LF group is low and contains less noise after AMM-EMD preprocessing, its information can be extracted using the TCN network. The information within the HF group is more complex, so the MGCRN network is used for feature extraction. This implementation first trains the network using the TCN and MGCRN modules, and then performs information fusion using the FFM module. Therefore, it achieves parallel processing of the LF and HF groups, effectively acquiring speech information with different frequency features while improving the network's processing speed.
[0055] S4. The signal is reconstructed using ISTFT and converted into a time-domain signal, thus completing the conversion from noisy speech to clean speech, ultimately obtaining the enhanced speech:
[0056] Noisy speech is transformed using EMD to obtain IMF components, which are then divided into LF and HF groups based on frequency characteristics. The HF group is transformed using the AMM algorithm to obtain temporal features, and a multi-layer gated convolutional recurrent neural network is used to extract these features. Simultaneously, a TCN is used to extract LF group features. Then, the FFM module is used to fuse these two types of features to construct a dual-input ME-MGFCRN speech enhancement model. The fused output is then reconstructed using ISTFT to convert it into a temporal signal, thus completing the transformation from noisy speech to clean speech, ultimately obtaining the enhanced speech.
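The final reconstruction step can be illustrated with a small STFT and ISTFT pair based on overlap-add. This is a sketch under stated assumptions, not the patent's implementation: it uses the 512-sample frame and 256-sample (16 ms at 16 kHz) frame shift from the experimental settings, a periodic Hann window, one frame shift of zero padding at each end, and an input length that is a multiple of the frame shift. In the described method, the spectrum passed to `istft` would be the fused network output rather than the unmodified spectrum used here.

```python
import numpy as np

N_FFT = 512  # frame length in samples
HOP = 256    # frame shift in samples (16 ms at 16 kHz)


def hann_periodic(n):
    """Periodic Hann window; copies shifted by n / 2 sum to exactly 1."""
    return 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(n) / n)


def stft(x, n_fft=N_FFT, hop=HOP):
    """Windowed frames -> one-sided spectra. Assumes len(x) is a multiple of hop."""
    win = hann_periodic(n_fft)
    xp = np.pad(np.asarray(x, dtype=float), (hop, hop))
    n_frames = 1 + (len(xp) - n_fft) // hop
    frames = np.stack([xp[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def istft(spec, length, n_fft=N_FFT, hop=HOP):
    """Inverse FFT of each frame followed by overlap-add, then removal of the padding."""
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame
    return out[hop:hop + length]


signal = np.random.default_rng(0).standard_normal(4096)
restored = istft(stft(signal), len(signal))
```

With an unmodified spectrum the round trip returns the input up to floating-point error, which makes this a convenient sanity check before a network is placed between the two transforms.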
[0057] Experiment
[0058] 1. Experimental Environment
[0059] The experiments were conducted using the TensorFlow 2.0 framework. The hardware consisted of an Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz, 32 GB of RAM, and a GeForce RTX 2080 Ti GPU, running a 64-bit Windows 10 operating system. The experiments were run in GPU mode.
[0060] 2. Speech Dataset
[0061] This invention selects the LibriSpeech ASR database, the NoiseX-92 noise set, and the AURORA noise set for experiments. The LibriSpeech ASR dataset comes from audiobooks read in the LibriVox project and contains 1000 hours of speech with a sampling rate of 16 kHz. For the experiments, 100 hours of speech from the LibriSpeech ASR database were randomly selected as the dataset and split into training and test sets at an 8:2 ratio. Each speech segment was 10 seconds long, and the number of iterations was 30. Noise was taken from the NoiseX-92 and AURORA datasets. Babble, factory, f16, hfchannel, car, train, and airport noise were mixed with the speech data at specified signal-to-noise ratios. These mixed speech samples were used to test the model's enhancement effect.
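Mixing noise into clean speech at a target SNR is usually done by rescaling the noise so that the ratio of speech power to noise power matches the target. The sketch below shows that calculation; it is one common way to do it and is not taken from the patent. The function name is hypothetical, and the random arrays stand in for real speech and noise recordings.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that 10 * log10(P_speech / P_noise) equals snr_db, then add it."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[:len(speech)]  # assumes noise is long enough
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise, scale


rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)       # stand-in for 1 s of speech at 16 kHz
background = rng.standard_normal(16000)  # stand-in for a noise recording
noisy, gain = mix_at_snr(clean, background, snr_db=-5.0)
```

At -5 dB the scaled noise carries roughly three times the power of the speech, which is why this condition is treated as the hard case in the evaluation.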
[0062] The audio sampling rate was set to 16 kHz. A Hanning window with a frame length of 512 points (32 ms) and a frame shift of 16 ms was used. A 16×256×1 processing window was created to limit the dimensions of the training and test sets. The time window was set to 16 frames, with a duration of 0.256 s, to meet low-latency requirements.
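The framing figures quoted above are consistent with each other, as the short calculation below shows. The constant names are hypothetical; the values are the ones given in the text, and the 0.256 s duration is obtained here as 16 frame shifts.

```python
SAMPLE_RATE = 16_000  # Hz
FRAME_LEN = 512       # samples per frame
HOP_MS = 16           # frame shift in milliseconds
N_FRAMES = 16         # frames per processing window

frame_ms = 1000 * FRAME_LEN / SAMPLE_RATE              # 32.0 ms per frame
hop_samples = SAMPLE_RATE * HOP_MS // 1000             # 256 samples per frame shift
window_seconds = N_FRAMES * hop_samples / SAMPLE_RATE  # 0.256 s per processing window
```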
[0063] The Adam optimizer was used to train the network with a learning rate of 0.0004, an exponential decay rate of 0.9 for the first moment estimate, an exponential decay rate of 0.999 for the second moment estimate, a batch size of 64, and 30 epochs.
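To make the role of the listed hyperparameters concrete, the sketch below writes out a single Adam update in NumPy using the learning rate and decay rates stated above. It follows the standard Adam update rule; the small constant `eps` is not given in the text and is an assumption, as are the function name and the example values.

```python
import numpy as np


def adam_step(param, grad, m, v, t, lr=4e-4, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update. eps is an assumed value; the other defaults come from the text."""
    m = beta1 * m + (1.0 - beta1) * grad       # first moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2  # second moment estimate
    m_hat = m / (1.0 - beta1 ** t)             # bias correction for step t (t >= 1)
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v


# Illustrative first step on a single weight.
w, m1, v1 = adam_step(np.array([1.0]), np.array([2.0]), m=0.0, v=0.0, t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the weight moves by approximately the learning rate regardless of the gradient's magnitude.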
[0064] 3. Experimental Results
[0065] To verify the speech enhancement effect of adding the TCN network for the LF group, babble noise from the NoiseX-92 noise library was mixed with the speech signal at an SNR of 0 dB. Comparative and ablation experiments were conducted, and the results are shown in Table 1. Here, MGCRN denotes feeding the noisy speech directly into the MGCRN network; MGCRN+GRU denotes using MGCRN to process the HF group and GRU to process the LF group; MGCRN+TCN denotes using MGCRN to process the HF group and TCN to process the LF group; and MGCRN+TCN+FFM denotes using MGCRN to process the HF group, TCN to process the LF group, and adding an FFM module.
[0066] Table 1 Enhancement performance under different network structures
[0067]
[0068] As shown in Table 1, when the MGCRN+TCN+FFM structure is used, the performance indicators of PESQ, STOI and fwSegSNR are higher than those of other modules. Therefore, the present invention constructs the network model structure of MGCRN+TCN+FFM.
[0069] To determine the optimal structure of the FFM, the impact of its internal structure on speech enhancement performance was examined; the results are shown in Table 2. Here, FC+2BGRU denotes one FC layer plus two BGRU layers; FC+2BGRU+DN denotes one FC layer plus two BGRU layers connected by a dense network; FC+2LSTM denotes one FC layer plus two LSTM layers; and FC+2LSTM+DN denotes one FC layer plus two LSTM layers connected by a dense network.
[0070] Table 2 Speech enhancement performance under different internal structures of FFM
[0071]
[0072] As shown in Table 2, the PESQ and STOI of the FC+2BGRU+DN structure are higher than those of the other three structures. Although its fwSegSNR is 0.11 dB lower than that of the FC+2LSTM+DN model, it is at least 1.1 dB higher than that of the other models. Therefore, based on a comprehensive analysis of the three indices, this invention selects FC+2BGRU+DN as the internal structure of the FFM module.
[0073] To verify the model's generalization ability, train or airport noise from AURORA was mixed with the speech signal under different SNR conditions. The results, evaluated with the fwSegSNR metric, are shown in Figure 5.
[0074] As can be seen from Figure 5, under different signal-to-noise ratio conditions, in both train and airport noise environments, the fwSegSNR score of the model proposed in this invention is better than that of the other models. In particular, when train noise and airport noise are mixed with the speech signal as background noise at an SNR of -5 dB, the fwSegSNR score of the ME-MGFCRN model proposed in this invention is still 3.1 and 1.9 higher than that of the DP-RNN and GCRN models, respectively, and 2.6 and 1.6 higher than those of the GCRN models. Therefore, the model proposed in this invention significantly improves the speech enhancement effect and generalizes well to different noise conditions.
Claims
1. A speech enhancement method based on a fusion network, characterized in that it comprises: S1. Use AMM-EMD to extract speech features across the entire frequency band and perform preliminary noise reduction; S2. Based on feature extraction, FFM is used to integrate feature information from different dimensions in a more complete way; S3. Based on the fusion of two features using FFM, construct a dual-input ME-MGFCRN speech enhancement model; S4. The noisy speech is reconstructed using ISTFT and converted into a time-domain signal, thus completing the conversion from noisy speech to clean speech and finally obtaining the enhanced speech. S1 includes: S11. Use the TCN module to process low-frequency signals; S12. Use the MGCRN network to process high-frequency signals. In S12: based on the GLU network's ability to capture long-term memory and its absence of gradient vanishing, the GLU network is combined with CRN units to construct a GCRN network. At the same time, each encoder module and decoder module of the GCRN network is transformed into a multi-layer structure, and each encoder and decoder module is replaced by three batch normalization layers and three different convolutional modules. Residual connections are adopted between the encoder and decoder modules, thereby constructing the MGCRN network. In S3: based on the constructed MGCRN network, and taking into account the advantages of the TCN module and the fusion capability of the FFM module for low-frequency and high-frequency features, an ME-MGFCRN network is created.
2. The speech enhancement method based on a fusion network according to claim 1, characterized in that, in S11: the TCN network consists of two residual networks and dilated convolutions. Each residual network comprises two dilated causal convolutions, two weight normalization layers, two activation function layers, and two dropout layers, which are connected by 1×1 convolutional modules that perform the residual connection. Each dilated causal convolution module in the TCN network uses exponentially increasing dilation to obtain more information while maintaining a small size, thereby increasing the receptive field. The receptive field is calculated using the following formula: $RF_{d+1} = RF_d + (k-1) \times S_d$, where $d$ is the index of the convolutional layer, $RF_{d+1}$ is the receptive field of the current layer, $RF_d$ is the receptive field of the previous layer, $k$ is the size of the convolution kernel, and $S_d$ is the product of the step sizes of all layers preceding the current layer.
3. The speech enhancement method based on a fusion network according to claim 1, characterized in that, in S2: since the GCRN network is used to process the HF group and the TCN network is used to process the LF group, simply adding the features processed by the two would inevitably lead to information redundancy or distortion of the speech features. The feature fusion module is used to process information of different dimensions. The FFM module contains a fully connected layer and two BGRU layers, which are connected by a dense network.
4. The speech enhancement method based on a fusion network according to claim 1, characterized in that, in step S4: first, the noisy speech is transformed by EMD to obtain IMF components, which are divided into LF and HF groups based on frequency characteristics. The HF group is transformed by the AMM algorithm to obtain time-domain features. Then, a multi-layer gated convolutional recurrent neural network is used to extract the features of the HF group. At the same time, TCN is used to extract the features of the LF group. Then, FFM is used to fuse the two features to construct a dual-input ME-MGFCRN speech enhancement model. The fused output is then reconstructed by ISTFT and converted into a time-domain signal, thereby completing the conversion from noisy speech to clean speech and finally obtaining the enhanced speech.