A two-branch speech enhancement algorithm based on a structured state-space sequence model

By using a dual-branch speech enhancement algorithm to process amplitude and complex spectral features in parallel, and by introducing an interactive module and an S4D model, the problem of insufficient utilization of amplitude and phase information in traditional methods is solved, thereby improving speech quality and intelligibility while reducing computational complexity.

CN117219109B · Active · Publication Date: 2026-06-23 · NANJING UNIV OF POSTS & TELECOMM

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
NANJING UNIV OF POSTS & TELECOMM
Filing Date
2023-10-17
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing speech enhancement techniques cannot effectively utilize the potential relationship between amplitude and phase information when dealing with noise, resulting in limited speech quality and intelligibility. Furthermore, deep neural networks have high computational complexity when dealing with long-term dependencies and global context.

Method used

A two-branch speech enhancement algorithm based on a structured state-space sequence model is adopted. The amplitude spectrum and complex spectrum features of the speech signal are processed in parallel by a coarse amplitude estimation branch and a complex refinement estimation branch. An interaction module is introduced to promote information exchange between the branches, and temporal modeling is performed with the S4D model.

Benefits of technology

It improves the quality and intelligibility of speech signals, reduces computational complexity and cost, and achieves more efficient speech enhancement.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a double-branch speech enhancement algorithm based on a structured state space sequence model, which comprises the following steps: obtaining amplitude spectrum and complex spectrum features of noisy speech, and inputting the features into an amplitude rough estimation branch and a complex refinement estimation branch respectively to obtain real and imaginary components of the rough estimated speech and the refined speech; introducing an interaction module to realize the flow of the amplitude spectrum and the complex spectrum features between the two branches; superimposing the real and imaginary components of the rough estimated speech and the refined speech to reconstruct a target signal complex spectrum; and evaluating the performance of the double-branch enhancement algorithm based on the structured state space sequence model. The amplitude spectrum and the complex spectrum are estimated simultaneously, and the interaction module is introduced to promote information exchange, so that the features learned from one branch can supplement the missing information of the other branch; and a diagonalized state space model is used to model the speech feature sequence, so that the parameter quantity of the model is reduced, and the algorithm performance is improved.

Description

Technical Field

[0001] This invention relates to the field of speech enhancement technology, and specifically to a two-branch speech enhancement algorithm based on a structured state-space sequence model.

Background Technology

[0002] In daily life, more and more products involve speech enhancement technology, such as mobile phones, hearing aids, smart home control systems, and military walkie-talkies. In the real world, most of what we hear is speech mixed with noise, not clean speech. Noise severely degrades the quality of target speech, and noise removal has always been a major challenge in speech signal processing, which has driven the development of speech enhancement technology. The purpose of speech enhancement is to separate clean speech from noisy speech in order to improve speech quality and intelligibility. Traditional monaural (single-channel) speech enhancement methods mainly include spectral subtraction, nonnegative matrix factorization, and computational auditory scene analysis. In the past decade, monaural speech enhancement based on deep neural networks (DNNs) has been widely used, and its performance has improved significantly. DNN-based speech enhancement methods can be divided into two categories according to the domain in which the signal is processed. One is the time-domain method, which directly processes the one-dimensional raw waveform of the speech signal. The other is the time-frequency (TF) domain method, which processes the two-dimensional spectrogram of the speech.

[0003] Time-domain methods are one mainstream approach in speech enhancement. They take the raw speech waveform directly as input, thus avoiding the phase estimation problem. For example, Conv-TasNet uses a linear convolutional encoder-decoder framework instead of the traditional STFT-ISTFT pair for speech separation. However, the PHASEN paper points out that speech and noise are easier to distinguish in the TF domain than in the time domain.

[0004] TF-domain methods are another mainstream approach in speech enhancement. For many years, phase was generally considered unimportant, and Williamson et al. pointed out that the phase spectrum has no apparent structure in polar coordinates. This led early DNN-based TF methods to focus on the amplitude spectrum of the input noisy speech without altering the phase. However, recent research has emphasized the importance of phase information. To address the challenge of phase estimation, time-domain and complex-domain methods have been proposed. The former directly estimate the enhanced speech from the raw time-domain speech without using any TF representation, indirectly avoiding the phase prediction problem. For the latter, researchers address the phase estimation problem from the perspective of complex masking or complex spectral mapping. A typical example of complex masking is the complex ideal ratio mask (cIRM), which recovers the TF representation by estimating amplitude and phase jointly. We found that for speech enhancement, cIRM achieves the best prediction performance compared with other ideal masks, such as the ideal binary mask (IBM), the ideal ratio mask (IRM), and the phase-sensitive mask (PSM).

[0005] With advancements in deep neural networks, many speech researchers have developed autoencoders, recurrent neural networks (RNNs) such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs), and convolutional neural networks (CNNs) and their variants, such as Temporal Convolutional Networks (TCNs), Convolutional Recurrent Networks (CRNs), and Gated Convolutional Recurrent Networks (GCRNs). While these methods have achieved satisfactory results in speech enhancement, some limitations remain. For example, because their receptive fields are limited, CNNs cannot capture global information, and they do not process the input sequentially, which makes it difficult to handle the long-term dependencies and global context of speech signals and leads to more expensive inference in speech enhancement tasks. As natural time-series models, RNNs are well suited to processing speech signals, but they have high computational complexity and cannot be parallelized. A recently proposed diagonal version of the Structured State-Space Sequence model (S4D) can handle sequences with long-range dependencies at lower computational and memory cost. It can take the place of CNNs for efficient parallel training or of RNNs for fast autoregressive generation. Furthermore, the complex spectrum and the amplitude spectrum are considered two key features for speech enhancement. However, previous methods dealt with only one of these features, ignoring the potential relationship between them.

[0006] In view of this, it is necessary to design a two-branch speech enhancement algorithm based on a structured state-space sequence model to solve the above problems.

Summary of the Invention

[0007] The purpose of this invention is to overcome the shortcomings of the prior art and provide a two-branch speech enhancement algorithm based on a structured state-space sequence model.

[0008] To achieve the above objectives, the present invention adopts the following technical solution, including:

[0009] Step S1: Preprocess the speech signal to obtain the amplitude spectrum features and complex spectrum features of the noisy speech;

[0010] Step S2: Input the amplitude spectrum features of the noisy speech into the amplitude coarse estimation branch network to obtain the amplitude spectrum features of the estimated speech. Combined with the phase of the noisy speech, the first real component and the first imaginary component of the coarsely estimated speech are finally obtained.

[0011] Step S3: Input the complex spectral features of the noisy speech into the complex refinement estimation branch network. Through the enhancement performed by the complex refinement estimation branch network, the second real component and the second imaginary component of the refined speech are obtained.

[0012] Step S4: During the parallel process of steps S2 and S3, an interactive module is introduced to realize the flow of the amplitude spectrum feature and the complex spectrum feature between the coarse amplitude estimation branch network and the complex refined estimation branch network;

[0013] Step S5: Superimpose the first real component and the first imaginary component with the second real component and the second imaginary component to reconstruct the complex spectrum of the target signal;

[0014] Step S6: Evaluate the performance of the enhancement algorithm built on the coarse amplitude estimation branch network and the complex refinement estimation branch network, both based on the structured state-space sequence model.

[0015] As a further improvement of the present invention, step S1 includes:

[0016] Step S11: Resample all speech at a frequency of 16kHz and divide the speech into segments lasting 3 seconds;

[0017] Step S12: Short time segments are extracted using a Hamming window. The window length of the Hamming window is set to 25ms, and there is 25% overlap between adjacent frames. The number of points in the Fast Fourier Transform (FFT) is 512.

[0018] Step S13: Use STFT to obtain the STFT features of the preprocessed noisy signal, and extract the corresponding amplitude spectrum features and complex spectrum features.
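The framing parameters in steps S11 to S13 can be turned into concrete sizes with a short sketch. The following is an illustration only, not the patent's implementation: it assumes no padding at the segment edges, uses a naive DFT in place of an optimized FFT, and the helper names (hamming, stft_frame) and the 1 kHz test tone are hypothetical.

```python
import cmath
import math

SAMPLE_RATE = 16_000   # Hz, step S11
SEGMENT_SECONDS = 3    # step S11
WINDOW_MS = 25         # step S12
OVERLAP = 0.25         # step S12: 25% overlap between adjacent frames
N_FFT = 512            # step S12

win_len = SAMPLE_RATE * WINDOW_MS // 1000       # samples per frame
hop = int(win_len * (1 - OVERLAP))              # frame shift in samples
n_bins = N_FFT // 2 + 1                         # one-sided frequency bins
n_samples = SAMPLE_RATE * SEGMENT_SECONDS
n_frames = 1 + (n_samples - win_len) // hop     # assumption: no edge padding


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def stft_frame(frame, n_fft=N_FFT):
    """Naive one-sided DFT of one windowed frame, zero-padded to n_fft."""
    padded = list(frame) + [0.0] * (n_fft - len(frame))
    return [
        sum(x * cmath.exp(-2j * math.pi * k * n / n_fft) for n, x in enumerate(padded))
        for k in range(n_fft // 2 + 1)
    ]


# Toy input: one frame of a 1 kHz tone (hypothetical test signal).
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(win_len)]
windowed = [s * w for s, w in zip(tone, hamming(win_len))]
spectrum = stft_frame(windowed)

magnitude = [abs(c) for c in spectrum]               # amplitude spectrum feature
complex_feat = [(c.real, c.imag) for c in spectrum]  # complex spectrum feature
peak_bin = max(range(n_bins), key=magnitude.__getitem__)
```

Under these assumptions each 3-second segment yields frames of 400 samples with a shift of 300 samples, and each frame yields 257 frequency bins. The amplitude feature has one channel and the complex feature has two (real and imaginary), which matches the input shapes given in steps S22 and S32.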

[0019] As a further improvement of the present invention, both the coarse amplitude estimation branch network and the complex refinement estimation branch network include an encoder, an S4D module, and a decoder. The encoder includes six convolutional modules, each of which includes a two-dimensional convolutional layer, a BN layer, and a LeakyReLU activation function. The S4D module includes residual connections, a dropout (regularization) layer, a one-dimensional convolutional layer, a gated linear unit, and a layer normalization layer. The decoder includes six deconvolutional modules, each of which includes a two-dimensional deconvolutional layer, a BN layer, and a LeakyReLU activation function.

[0020] As a further improvement of the present invention, step S2 includes:

[0021] Step S21: Construct the coarse amplitude estimation branch network;

[0022] Step S22: Expand the dimension of the amplitude spectrum feature so that its input shape is [BatchSize, 1, Frequency, Time]; set the number of channels of the convolution module to [16, 32, 64, 128, 256, 256], and set the number of channels of the deconvolution module to [256, 256, 128, 64, 32, 16];

[0023] Step S23: Input the amplitude spectrum features into the amplitude coarse estimation branch network. The amplitude spectrum of the target speech is used as the training target of the amplitude coarse estimation branch network. The amplitude coarse estimation branch network uses a spectral estimation method to estimate the amplitude spectrum of the target speech.

[0024] Step S24: Combine the denoised amplitude spectrum features with the phase of the noisy signal to roughly derive the first real component and the first imaginary component of the target speech.
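Step S24 can be illustrated with a minimal sketch: the estimated magnitude is paired with the unchanged noisy phase and converted to rectangular form. The function name coarse_components and the sample values are hypothetical, and the magnitudes stand in for the output of the branch network.

```python
import cmath
import math


def coarse_components(est_magnitude, noisy_spectrum):
    """Combine estimated magnitudes with the noisy phase, bin by bin."""
    real, imag = [], []
    for mag, x in zip(est_magnitude, noisy_spectrum):
        phase = cmath.phase(x)               # phase of the noisy bin, left unchanged
        real.append(mag * math.cos(phase))   # first real component
        imag.append(mag * math.sin(phase))   # first imaginary component
    return real, imag


# Three noisy TF bins and hypothetical magnitude estimates for them.
noisy = [3 + 4j, -1 + 1j, 0 - 2j]
est_mag = [2.5, 1.0, 1.0]
real1, imag1 = coarse_components(est_mag, noisy)
```

For the first bin the noisy magnitude 5 is reduced to 2.5 while the phase is kept, so the result is 1.5 + 2j. Any phase error in the noisy input is carried over, which is the gap the complex refinement branch is meant to fill.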

[0025] As a further improvement of the present invention, step S3 includes:

[0026] Step S31: Construct a complex refinement estimation branch network, wherein the complex refinement estimation branch network uses two different decoder modules to estimate the real part and imaginary part of the target signal respectively;

[0027] Step S32: Expand the dimension of the complex spectral features so that the input shape is [BatchSize, 2, Frequency, Time], and the number of channels of the convolutional module and the deconvolutional module are [16, 32, 64, 128, 256, 256] and [256, 256, 128, 64, 32, 16], respectively.

[0028] Step S33: Input the complex spectral features into the complex refinement estimation branch network, which uses an implicit mask to estimate the complex spectrum of the target speech. The mask is defined as follows:

[0029] $$M = \frac{X_r S_r + X_i S_i}{X_r^2 + X_i^2} + j\,\frac{X_r S_i - X_i S_r}{X_r^2 + X_i^2}$$

[0030] Here, $X_r$ and $X_i$ denote the real and imaginary parts of the noisy speech, $S_r$ and $S_i$ denote the real and imaginary parts of the clean speech, $M = M_r + jM_i$ denotes the implicit mask estimated by the complex refinement estimation branch network, and $j$ is the imaginary unit.

[0031] Step S34: Combine the mask described in step S33 with the complex spectrum of the noisy speech to obtain the second real component and the second imaginary component.

[0032] $$\tilde{S}_C^{CSR} = M \cdot X_C$$

[0033] $$\tilde{S}_r^{CSR} = M_r X_r - M_i X_i, \qquad \tilde{S}_i^{CSR} = M_r X_i + M_i X_r$$

[0034] Here, $X_C = X_r + jX_i$, $\tilde{S}_C^{CSR}$, $\tilde{S}_r^{CSR}$, and $\tilde{S}_i^{CSR}$ denote the complex spectrum of the noisy speech, the estimated complex spectrum of the speech, the second real component, and the second imaginary component, respectively.
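As a worked check of steps S33 and S34, the sketch below computes a complex ratio mask for a single TF bin and applies it by complex multiplication. It assumes the mask is multiplicative, in the style of the complex ideal ratio mask discussed in the background. In the actual method the mask is produced by the network rather than computed from the clean speech. The function names are hypothetical.

```python
def complex_ratio_mask(noisy, clean):
    """Mask M such that M * noisy == clean (assumes noisy != 0)."""
    xr, xi = noisy.real, noisy.imag
    sr, si = clean.real, clean.imag
    denom = xr * xr + xi * xi
    return complex((xr * sr + xi * si) / denom, (xr * si - xi * sr) / denom)


def apply_mask(mask, noisy):
    """Complex multiplication written out in real and imaginary parts."""
    real = mask.real * noisy.real - mask.imag * noisy.imag   # second real component
    imag = mask.real * noisy.imag + mask.imag * noisy.real   # second imaginary component
    return real, imag


noisy = 1.0 + 2.0j    # one noisy TF bin (toy value)
clean = 0.5 - 1.0j    # the corresponding clean bin (toy value)
mask = complex_ratio_mask(noisy, clean)
real2, imag2 = apply_mask(mask, noisy)
```

Here the mask evaluates to -0.3 - 0.4j, and applying it to the noisy bin recovers 0.5 - 1.0j, so both magnitude and phase are corrected by a single complex-valued gain.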

[0035] As a further improvement of the present invention, step S4 includes:

[0036] Step S41: Construct a bidirectional interaction module network, which includes a two-dimensional convolutional layer, a BatchNorm, and a sigmoid activation function;

[0037] Step S42: Construct the post-flow features of the coarse amplitude estimation branch: the amplitude spectrum feature $F_{CME}$ is concatenated with the complex spectral feature $F_{CSR}$ to obtain the first feature, which is then input into a two-dimensional convolutional layer to produce a gain function $G(F_{CME}, F_{CSR})$ that automatically learns to filter and retain different regions of the complex spectral feature $F_{CSR}$; the gain function $G(F_{CME}, F_{CSR})$ is multiplied element-wise with the complex spectral feature $F_{CSR}$ to yield the filtered features of the complex refinement estimation branch; finally, the amplitude spectrum feature $F_{CME}$ is added to these filtered features to yield the final features of the coarse amplitude estimation branch, denoted as:

[0038] $$\hat{F}_{CME} = F_{CME} + G(F_{CME}, F_{CSR}) \odot F_{CSR}$$

[0039] Step S43: Construct the post-flow features of the complex refinement estimation branch: the complex spectral feature $F_{CSR}$ is concatenated with the amplitude spectrum feature $F_{CME}$ to obtain the second feature, which is then input into a two-dimensional convolutional layer to produce a gain function $G(F_{CSR}, F_{CME})$ that automatically learns to filter and retain different regions of the amplitude spectrum feature $F_{CME}$; the gain function $G(F_{CSR}, F_{CME})$ is multiplied element-wise with the amplitude spectrum feature $F_{CME}$ to yield the filtered features of the coarse amplitude estimation branch; finally, the complex spectral feature $F_{CSR}$ is added to these filtered features to yield the final features of the complex refinement estimation branch, denoted as:

[0040] $$\hat{F}_{CSR} = F_{CSR} + G(F_{CSR}, F_{CME}) \odot F_{CME}$$

[0041] Here, $G(\cdot)$ represents the joint operation of concatenation, convolution, and sigmoid, $\odot$ denotes element-wise multiplication, and $\hat{F}_{CME}$ and $\hat{F}_{CSR}$ denote the final features of the two branches.
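The gating in steps S42 and S43 can be sketched on a toy single-channel feature map. This is a simplified illustration with several assumptions: the two-dimensional convolution is reduced to a 1x1 convolution with scalar weights, BatchNorm is omitted, and the names interact, w_self, w_other, and bias are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def interact(f_self, f_other, w_self, w_other, bias):
    """Return f_self + G(f_self, f_other) * f_other, element by element.

    G concatenates the two features along the channel axis, applies a
    1x1 convolution (here: two scalar weights and a bias), then a sigmoid.
    """
    out = []
    for row_s, row_o in zip(f_self, f_other):
        out_row = []
        for s, o in zip(row_s, row_o):
            gain = sigmoid(w_self * s + w_other * o + bias)  # value in (0, 1)
            out_row.append(s + gain * o)                     # filtered o added to s
        out.append(out_row)
    return out


# Toy [frequency][time] feature maps for the two branches.
f_cme = [[1.0, 2.0], [0.0, -1.0]]
f_csr = [[2.0, 0.0], [4.0, 2.0]]

# Both directions read the original features, so neither update sees the other.
new_cme = interact(f_cme, f_csr, w_self=0.0, w_other=0.0, bias=0.0)
new_csr = interact(f_csr, f_cme, w_self=0.0, w_other=0.0, bias=0.0)
```

With all weights at zero the gain is 0.5 everywhere, so each branch receives half of the other branch's feature. Trained weights would instead let the gain vary per TF position, passing the regions of the other branch that are useful and suppressing the rest.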

[0042] As a further improvement of the present invention, step S6 includes:

[0043] Step S61: Conduct an ablation experiment on the speech enhancement algorithm based on the coarse amplitude estimation branch network and the complex refined estimation branch network of the structured state-space sequence model to verify the effectiveness of the dual-branch structure;

[0044] Step S62: Keeping the framework of the dual-branch structure unchanged, replace the S4D module with a traditional time series modeling network to verify the superiority of the S4D module in time series modeling tasks;

[0045] Step S63: Compare the dual-branch speech enhancement algorithm based on the structured state space sequence model with current advanced speech enhancement algorithms to verify the advancement of the dual-branch speech enhancement algorithm based on the structured state space sequence model.

[0046] The beneficial effects of this invention are as follows:

[0047] This invention proposes a joint two-branch SE framework that can estimate the amplitude spectrum and complex spectrum in parallel. Considering the close correlation between the information in the two branches, we also introduce an information exchange module so that each branch can compensate for the other's missing information. Furthermore, a diagonal version of the Structured State Space Sequence model (S4D) is introduced as the temporal modeling network for speech enhancement, demonstrating that S4D-based deep neural networks can serve as a powerful alternative to traditional architectures (such as RNNs and CNNs), offering the ability to model speech enhancement tasks with low computational cost and high performance.

Description of the Drawings

[0048] Figure 1 is a system block diagram of the present invention;

[0049] Figure 2 is a schematic diagram of the S4D module of the present invention;

[0050] Figure 3 is a schematic diagram of the interaction module of the present invention;

[0051] Figure 4 is a schematic diagram of the grouping strategy of the present invention;

[0052] Figure 5 is a schematic diagram of the TCM module of the present invention.

Detailed Implementation

[0053] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

[0054] It should also be noted that, in order to avoid obscuring the invention with unnecessary details, only the structures and / or processing steps closely related to the present invention are shown in the accompanying drawings, while other details that are not closely related to the present invention are omitted.

[0055] As shown in Figure 1, this embodiment provides a two-branch speech enhancement algorithm based on a structured state-space sequence model. In practical applications, the acoustic features of the speech signal are generally extracted and input into the enhancement model for training, and the trained model is then used for speech enhancement.

[0056] Traditional time-frequency domain speech enhancement methods either enhance only the amplitude spectrum features without altering the phase, which contributes to intelligibility and harmonic structure, or only estimate the complex spectrum features composed of real and imaginary parts, limiting the accuracy of amplitude and phase recovery. To address this issue, a joint bi-branch structured state-space model is proposed. In the proposed network, these two branches predict the amplitude and complex spectrum features of the speech signal, respectively. Unlike information fusion performed only at the final output layer, we introduce an interaction module between the two branches to facilitate information exchange, allowing features learned from one branch to compensate for missing parts in the other. Furthermore, to reduce model complexity, we introduce a structured state-space sequence (S4) model in both branches for denoising the speech feature sequences. The following is a detailed discussion of the implementation of this embodiment.

[0057] Step S1: Preprocess the speech signal to obtain the amplitude spectrum and complex spectrum features of the noisy speech.

[0058] Speech signals are non-stationary, but they can typically be considered approximately stationary within a relatively short timeframe of 10ms to 40ms. Before extracting speech signal features, stationary frame signals are usually obtained by segmenting the signal into frames, and then amplitude spectrum features and complex spectrum features are extracted sequentially.

[0059] Steps S2 and S3: Construct the amplitude coarse estimation branch network and the complex spectrum refinement branch network, respectively.

[0060] Both the coarse amplitude estimation branch network and the complex spectrum refinement branch network are based on an encoder-decoder structure. The difference is that the input to the coarse amplitude estimation branch is the amplitude spectrum $X_M$, whereas the input to the complex spectrum refinement branch network is the complex spectrum $X_C$. In the complex spectrum refinement branch network, the real and imaginary parts of the complex spectrum are treated as two different input features. Here, the encoder module and the S4D module are shared between the real and imaginary parts. During the decoding stage, two different decoder modules are used to estimate the real and imaginary parts.

[0061] The encoder-decoder method is currently one of the most commonly used methods for single-channel speech enhancement. The encoder is mainly responsible for extracting features from noisy speech to obtain high-level feature representations, while the decoder is used to reconstruct the feature spectrogram back to the original input size. In this embodiment, the encoder consists of 6 coding blocks, each consisting of a 2D convolutional layer, a normalization layer, and a Leaky ReLU activation function. Similarly, the decoder also consists of 6 decoding blocks, each consisting of a 2D deconvolutional layer, a normalization layer, and a Leaky ReLU activation function. It is worth noting that deconvolution can be viewed as the inverse operation of 2D convolution.
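To see how the six coding blocks shrink the frequency axis, the following sketch applies the usual strided-convolution size formula, using the kernel size 5 and stride 2 along frequency that the experimental setup reports. Zero padding along frequency and a 257-bin input (from a 512-point FFT) are assumptions, since the text does not state the padding. The helper conv_out is hypothetical.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output length of a strided convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


ENC_CHANNELS = [16, 32, 64, 128, 256, 256]   # encoder channels from the text
DEC_CHANNELS = ENC_CHANNELS[::-1]            # decoder mirrors the encoder
KERNEL_F, STRIDE_F = 5, 2                    # frequency-axis kernel and stride

n_freq = 257                                 # one-sided bins of a 512-point FFT
freq_dims = []
for _ in ENC_CHANNELS:
    n_freq = conv_out(n_freq, KERNEL_F, STRIDE_F)   # assumption: no padding
    freq_dims.append(n_freq)

# Features per time frame handed to the temporal (S4D) module.
bottleneck_features = ENC_CHANNELS[-1] * freq_dims[-1]
```

Under these assumptions the frequency axis shrinks as 257, 127, 62, 29, 13, 5, 1, leaving 256 features per frame at the bottleneck. That equals the S4D hidden dimension of 256 given in the experimental setup, which suggests, but does not confirm, that no frequency padding is used.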

[0062] In this embodiment, we also introduce a diagonalized version of the structured state-space model (S4D), where the relationships between latent states can be represented by linear transformations. S4D possesses the characteristics of a recurrent neural network (RNN), allowing for autoregressive generation and saving memory usage during inference. Furthermore, S4D can replace convolutional neural networks (CNNs) to achieve parallel training.

[0063] S4D is based on Linear State-Space Layers (LSSL). Given a continuous-time input sequence $u(t)$, a continuous-time output sequence $y(t)$, and a latent state sequence $x(t)$, their relationship can be represented by the following equations:

[0064] $$x'(t) = Ax(t) + Bu(t),$$

[0065] $$y(t) = Cx(t) + Du(t).$$

[0066] Here, A represents the state transition matrix, and B, C, and D represent other trainable parameter matrices.

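[0067] Using the bilinear discretization method with a trainable step size $\Delta$, the linear state-space layer (LSSL) for a sampled discrete-time sequence can be expressed as:

[0068] $$x_k = \bar{A}x_{k-1} + \bar{B}u_k$$

[0069] $$y_k = \bar{C}x_k + \bar{D}u_k$$

[0070] $$\bar{A} = \left(I - \tfrac{\Delta}{2}A\right)^{-1}\left(I + \tfrac{\Delta}{2}A\right)$$

[0071] $$\bar{B} = \left(I - \tfrac{\Delta}{2}A\right)^{-1}\Delta B$$

[0072] $$\bar{C} = C, \qquad \bar{D} = D$$

[0073] According to these equations, the Linear State Space Layer (LSSL) exhibits the characteristics of a Recurrent Neural Network (RNN). Unrolling the recurrence from a zero initial state shows that the LSSL can also be viewed as a Convolutional Neural Network (CNN), expressed by the following formula:

[0074] $$y_k = \bar{C}\bar{A}^{k}\bar{B}u_0 + \bar{C}\bar{A}^{k-1}\bar{B}u_1 + \cdots + \bar{C}\bar{B}u_k + \bar{D}u_k$$

[0075] Therefore, the output $y_k$ can be calculated using the convolution kernel $K$, as follows:

[0076] $$y_k = (K * u)_k + \bar{D}u_k,$$

[0077] $$K = \left(\bar{C}\bar{B},\ \bar{C}\bar{A}\bar{B},\ \ldots,\ \bar{C}\bar{A}^{L-1}\bar{B}\right)$$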

[0078] Here, L represents the convolutional kernel size. In this implementation, we propose a dual-branch encoder-decoder architecture based on S4D. Specifically, S4D blocks replace traditional sequence-to-sequence networks, such as TCN and LSTM, for speech denoising. The S4D block includes residual connections, dropout, convolutional layers, gated linear units, and layer normalization. The activation functions of the convolutional layers and gated linear units provide nonlinearity after the S4D layers.
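The equivalence between the recurrent view and the convolutional view of a diagonal state-space layer can be checked numerically. The sketch below is a toy illustration under stated assumptions: the state matrix is diagonal so each state is a scalar, the bilinear transform from the S4 literature is used for discretization, the skip term D is omitted for brevity, the state starts at zero, and all parameter values and function names are hypothetical.

```python
def discretize(a, b, delta):
    """Bilinear transform of one diagonal state with scalars a and b."""
    denom = 1 - delta / 2 * a
    return (1 + delta / 2 * a) / denom, delta * b / denom


def run_recurrent(a_bar, b_bar, c, u):
    """RNN view: x_k = a_bar * x_{k-1} + b_bar * u_k, y_k = Re(sum(c * x_k))."""
    x = [0j] * len(a_bar)
    y = []
    for u_k in u:
        x = [ab * xn + bb * u_k for ab, bb, xn in zip(a_bar, b_bar, x)]
        y.append(sum(cn * xn for cn, xn in zip(c, x)).real)
    return y


def ssm_kernel(a_bar, b_bar, c, length):
    """CNN view: K_k = Re(sum(c * a_bar**k * b_bar)) for k = 0..length-1."""
    return [
        sum(cn * ab ** k * bb for cn, ab, bb in zip(c, a_bar, b_bar)).real
        for k in range(length)
    ]


def causal_conv(kernel, u):
    """y_i = sum_j kernel[j] * u[i - j] over the available past."""
    return [sum(kernel[j] * u[i - j] for j in range(i + 1)) for i in range(len(u))]


# Hypothetical diagonal parameters; negative real parts keep the system stable.
A = [-0.5 + 1.0j, -0.5 - 1.0j, -1.0]
B = [1.0, 1.0, 1.0]
C = [0.5 + 0.2j, 0.5 - 0.2j, 1.0]
DELTA = 0.1

pairs = [discretize(a, b, DELTA) for a, b in zip(A, B)]
a_bar = [p[0] for p in pairs]
b_bar = [p[1] for p in pairs]

u = [1.0, 0.0, -1.0, 2.0, 0.5]
y_rnn = run_recurrent(a_bar, b_bar, C, u)
y_cnn = causal_conv(ssm_kernel(a_bar, b_bar, C, len(u)), u)
```

Both views give the same output for a real-valued input. The recurrent form needs only the current state at inference time, while the kernel form lets a whole sequence be processed in parallel during training, which is the trade-off the text attributes to S4D.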

[0079] The Coarse Amplitude Estimation (CME) branch network estimates the amplitude spectrum of the target speech based on spectral mapping techniques. Its main function is to filter out major noise in the speech. The Complex Refinement (CSR) branch network is primarily used to fill in missing details and estimate phase information, mainly based on implicit masking techniques. First, this branch estimates an implicit mask, which, combined with the input complex spectral features, yields the complex spectral features of the target speech.

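[0080] Therefore, the coarse magnitude estimation (CME) branch network can be expressed by the following formulas, where $\mathcal{F}_{CME}$ denotes the CME branch network, $X_M$ the amplitude spectrum of the noisy speech, and $\theta_X$ the phase of the noisy speech:

[0081] $$|\tilde{S}^{CME}| = \mathcal{F}_{CME}(X_M)$$

[0082] $$\tilde{S}_r^{CME} = |\tilde{S}^{CME}| \cos\theta_X$$

[0083] $$\tilde{S}_i^{CME} = |\tilde{S}^{CME}| \sin\theta_X$$

[0084] The complex refinement (CSR) branch network can be expressed by the following formulas, where $\mathcal{F}_{CSR}$ denotes the CSR branch network, $X_C = X_r + jX_i$ the complex spectrum of the noisy speech, and $M = M_r + jM_i$ the implicit mask:

[0085] $$M = \mathcal{F}_{CSR}(X_C)$$

[0086] $$\tilde{S}_r^{CSR} = M_r X_r - M_i X_i$$

[0087] $$\tilde{S}_i^{CSR} = M_r X_i + M_i X_r$$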

[0088] Step S4: Introduce an interactive module to facilitate the flow of information between the two branches.

[0089] The coarse amplitude estimation branch network and the complex spectrum refinement branch network are designed to estimate the amplitude spectrum and the complex spectrum, respectively, indicating that the two branches jointly contribute to the spectrum recovery process. The main function of the interaction module is to exchange information between the two branches. In this way, features from the coarse amplitude estimation branch can serve as auxiliary information to guide the complex spectrum refinement branch to pay more attention to spectral details that may have been lost in the coarse amplitude estimation branch, and vice versa.

[0090] Taking the complex refinement branch as an example, the feature $F_{CSR}$ of the complex refinement estimation branch is first concatenated with the feature $F_{CME}$ from the coarse amplitude estimation branch. The concatenated features are then passed through a 2D convolutional layer, a BatchNorm layer, and a sigmoid activation layer to obtain the gain function $G(F_{CSR}, F_{CME})$. The main function of the gain function is to automatically learn to filter and retain different regions of $F_{CME}$. Multiplying the gain function $G(F_{CSR}, F_{CME})$ element-wise with $F_{CME}$ yields the filtered representation. Finally, the features of the complex refinement estimation branch are added to the filtered features of the coarse amplitude estimation branch to obtain the final feature representation of the complex refinement estimation branch.

[0091] The same applies in the opposite direction. The entire process can be expressed by the following formulas, where $\odot$ denotes element-wise multiplication and $\hat{F}_{CME}$ and $\hat{F}_{CSR}$ denote the final features of the two branches:

[0092] $$\hat{F}_{CME} = F_{CME} + G(F_{CME}, F_{CSR}) \odot F_{CSR}$$

[0093] $$\hat{F}_{CSR} = F_{CSR} + G(F_{CSR}, F_{CME}) \odot F_{CME}$$

[0094] Step S5: Reconstruct the target estimated speech.

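[0095] The real and imaginary components coarsely estimated in step S2, $\tilde{S}_r^{CME}$ and $\tilde{S}_i^{CME}$, are added to the real and imaginary components refined in step S3, $\tilde{S}_r^{CSR}$ and $\tilde{S}_i^{CSR}$, to reconstruct the real part $\tilde{S}_r$ and imaginary part $\tilde{S}_i$ of the target complex spectrum. The formula can be expressed as:

[0096] $$\tilde{S}_r = \tilde{S}_r^{CME} + \tilde{S}_r^{CSR}$$

[0097] $$\tilde{S}_i = \tilde{S}_i^{CME} + \tilde{S}_i^{CSR}$$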
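Step S5 amounts to a per-bin sum of the two branch outputs. A minimal sketch with hypothetical values and a hypothetical function name:

```python
def reconstruct(real1, imag1, real2, imag2):
    """Superimpose coarse (first) and refined (second) components per TF bin."""
    return [
        complex(r1 + r2, i1 + i2)
        for r1, i1, r2, i2 in zip(real1, imag1, real2, imag2)
    ]


# Hypothetical branch outputs for two TF bins.
real1, imag1 = [1.5, 0.0], [2.0, -1.0]      # coarse amplitude branch
real2, imag2 = [0.1, -0.2], [-0.5, 0.25]    # complex refinement branch

target = reconstruct(real1, imag1, real2, imag2)
magnitudes = [abs(s) for s in target]
```

Because the sum is taken on real and imaginary parts rather than on magnitudes, the refinement branch can adjust both the amplitude and the phase of the coarse estimate.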

[0098] Step S6: Evaluate the performance of the proposed dual-branch enhancement algorithm based on the structured state-space sequence model.

[0099] Database and Experiment Setup

[0100] The datasets used in this embodiment are of three types: VoiceBank+DEMAND, VoiceBank+NOISE92, and TIMIT+NOISE92.

[0101] VoiceBank+DEMAND: The training set contains 11,572 noisy-clean speech pairs, and the test set contains 824 pairs. For the training set, audio samples are mixed with one of 10 noise types, comprising two artificially generated noises (babble noise and speech-shaped noise) and eight real-world noise recordings from the DEMAND database, at four signal-to-noise ratios (SNRs): {0dB, 5dB, 10dB, 15dB}. Test sentences are created using five unseen noise types from the DEMAND database at SNRs of {2.5dB, 7.5dB, 12.5dB, 17.5dB}.

[0102] VoiceBank+NOISE92: In our ablation experiments, we selected 3500 sentences from the VoiceBank corpus for training and 400 sentences for testing. The noise signals were the Babble and Factory noises from the NOISE92 dataset. For both the training and testing sets, we randomly extracted segments from the noise recordings. We then mixed the clean speech with the noise at randomly chosen signal-to-noise ratios (SNRs) to create a noisy dataset covering several SNRs (-5dB, -3dB, 0dB, 3dB, 5dB).
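The mixing procedure described above can be sketched as follows: pick a random noise segment, scale it so that the clean-to-noise power ratio hits the target SNR, and add it to the clean speech. The synthetic sine "speech", Gaussian "noise", fixed seed, and function names are stand-ins for illustration, not the corpora or code used in the experiments.

```python
import math
import random

SNR_LEVELS_DB = [-5, -3, 0, 3, 5]   # SNR values listed in the text


def power(x):
    """Mean power of a signal."""
    return sum(v * v for v in x) / len(x)


def random_segment(noise, length, rng):
    """Cut a random segment of the given length out of a longer recording."""
    start = rng.randrange(len(noise) - length + 1)
    return noise[start:start + length]


def mix_at_snr(clean, noise, snr_db):
    """Scale noise to the target SNR and add it to the clean signal."""
    scale = math.sqrt(power(clean) / (power(noise) * 10 ** (snr_db / 10)))
    scaled = [scale * n for n in noise]
    return [c + n for c, n in zip(clean, scaled)], scaled


rng = random.Random(0)   # fixed seed so the sketch is reproducible
clean = [math.sin(2 * math.pi * 440 * t / 16_000) for t in range(1600)]
noise_recording = [rng.gauss(0.0, 1.0) for _ in range(8000)]

achieved = {}
for snr_db in SNR_LEVELS_DB:
    segment = random_segment(noise_recording, len(clean), rng)
    noisy, scaled = mix_at_snr(clean, segment, snr_db)
    achieved[snr_db] = 10 * math.log10(power(clean) / power(scaled))
```

The achieved values match the requested SNRs up to floating-point error, which is a quick sanity check worth running on any mixing script before generating a full training set.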

[0103] All audio was resampled to 16kHz and divided into 3-second segments. For all models, a 25ms window length was used, with 25% overlap between adjacent frames, and an FFT length of 512. For both the CME and CSR streams, the number of channels in the encoding layers was set to [16, 32, 64, 128, 256, 256], and the decoding layers used the reverse order. The convolutional kernel size and stride were (2, 5) and (1, 2), respectively, along the (time, frequency) axes. The hidden dimension and state dimension of the S4D module were 256 and 64, respectively. For each branch, the number of S4D blocks was set to 4. In the experiments, all algorithms were implemented in PyTorch, and all networks were trained with the Adam optimizer. In the ablation experiments, all models were trained for 35 epochs with a batch size of 24. For the comparison with the baseline models, a batch size of 4 was used for 50 epochs. The learning rate was initially set to 0.001 and was reduced by a factor of 10 when the loss on the validation set stopped improving. Additionally, this invention uses the SI-SNR loss function to constrain the distance between the estimated signal and the target signal.
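The SI-SNR loss named above is commonly defined by projecting the estimate onto the target and comparing the energy of the projection with the energy of the residual. The sketch below follows that common definition; the text does not spell out the exact variant used, so the mean removal, the eps value, and the function names are assumptions.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB."""
    # Remove the mean so that a DC offset does not affect the score.
    e_mean = sum(estimate) / len(estimate)
    t_mean = sum(target) / len(target)
    e = [v - e_mean for v in estimate]
    t = [v - t_mean for v in target]

    # Project the estimate onto the target direction.
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps
    s_target = [dot / t_energy * b for b in t]
    e_noise = [a - s for a, s in zip(e, s_target)]

    ratio = (sum(s * s for s in s_target) + eps) / (sum(n * n for n in e_noise) + eps)
    return 10 * math.log10(ratio)


def si_snr_loss(estimate, target):
    """Training loss: negative SI-SNR, so that lower is better."""
    return -si_snr(estimate, target)


# Toy signals: the distortion is orthogonal to the target by construction.
N = 64
target = [math.sin(2 * math.pi * 3 * n / N) for n in range(N)]
distortion = [math.cos(2 * math.pi * 7 * n / N) for n in range(N)]
estimate = [t + 0.1 * d for t, d in zip(target, distortion)]

score = si_snr(estimate, target)
score_rescaled = si_snr([3.0 * v for v in estimate], target)
```

With a distortion amplitude one tenth of the target the score comes out at about 20 dB, and rescaling the estimate leaves it essentially unchanged, which is the scale invariance that gives the measure its name.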

[0104] Experimental performance evaluation:

[0105] First, to demonstrate the effectiveness of each proposed submodule, we conducted comprehensive ablation experiments with variants that omit the CSR branch or the CME branch. All models were optimized using the same loss function and implemented with the optimal configuration. As shown in Table I, we calculated the average results over the test set sentences to evaluate the performance of these variants. In the first variant, we removed the CSR branch and retained only the CME branch for comparison, denoted "w/o CSR". In the second variant, we removed the CME branch and kept the CSR branch unchanged, denoted "w/o CME". We observed that in all cases, the CSR network outperformed the CME network. This demonstrates the importance of phase estimation for improving speech quality and intelligibility. Furthermore, when the CME and CSR branches are combined, the dual-branch structure outperforms either single-branch method, demonstrating the effectiveness of the dual-branch strategy. This implies that the synergy between CME and CSR improves the removal of background noise. The CME branch is responsible for filtering out the main noise, while the CSR branch compensates for the lost phase information, thereby improving the overall performance of the system.

[0106] Table I:

[0107]

[0108] Next, to evaluate the performance of the S4D module in modeling speech sequences for enhancement, we conducted a comparative analysis in which the S4D module was replaced with a TCM (Figure 5) and with a grouped LSTM (Figure 4). The two resulting architectures, named Dual-TCM and Dual-gLSTM, were evaluated on time-series modeling and compared on speech enhancement performance using the relevant metrics. This experiment used the VoiceBank+NOISE92 corpus, the same corpus used in the ablation experiments. However, in this part of the experiment we tested only on Babble noise. Table II presents the evaluation results for the three architectures. We can observe that Dual-S4D outperforms Dual-gLSTM and Dual-TCM in terms of PESQ and STOI. Table II also shows the model size of each architecture. The proposed model has the fewest parameters of the three, about 10.8M compared with 26.5M for Dual-TCM and 60.2M for Dual-gLSTM, while achieving the best scores.

[0109] Table II:

[0110]
Methods       Params   PESQ   STOI
Dual-S4D      10.8M    2.41   0.81
Dual-TCM      26.5M    2.33   0.79
Dual-gLSTM    60.2M    2.35   0.80
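The size and quality comparison in Table II can be restated as relative figures. The short sketch below only re-computes ratios from the numbers printed in Table II; the dictionary layout and the helper name `param_reduction` are illustrative and not part of the patent.

```python
# Parameter counts (in millions) and scores copied from Table II.
table_ii = {
    "Dual-S4D": {"params_m": 10.8, "pesq": 2.41, "stoi": 0.81},
    "Dual-TCM": {"params_m": 26.5, "pesq": 2.33, "stoi": 0.79},
    "Dual-gLSTM": {"params_m": 60.2, "pesq": 2.35, "stoi": 0.80},
}


def param_reduction(baseline: str, proposed: str = "Dual-S4D") -> float:
    """Fraction of parameters saved by `proposed` relative to `baseline`."""
    return 1.0 - table_ii[proposed]["params_m"] / table_ii[baseline]["params_m"]


for name in ("Dual-TCM", "Dual-gLSTM"):
    saved = param_reduction(name)
    d_pesq = table_ii["Dual-S4D"]["pesq"] - table_ii[name]["pesq"]
    d_stoi = table_ii["Dual-S4D"]["stoi"] - table_ii[name]["stoi"]
    print(f"vs {name}: {saved:.1%} fewer parameters, "
          f"PESQ +{d_pesq:.2f}, STOI +{d_stoi:.2f}")
```

On these figures, Dual-S4D uses roughly 59% fewer parameters than Dual-TCM and roughly 82% fewer than Dual-gLSTM while scoring higher on both metrics.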

[0111] Finally, we validated the effectiveness of the dual-branch speech enhancement algorithm based on the structured state-space sequence model. We conducted experiments on the publicly available VoiceBank+DEMAND dataset, using WB-PESQ, STOI, CSIG, CBAK, and COVL as evaluation metrics. In this experiment, the model was trained for 50 epochs using the Adam optimizer with a batch size of 4. Table III compares the proposed algorithm with other state-of-the-art algorithms, namely SEGAN, RNNoise, Wave-U-Net, MMSEGAN, DiffWave, SEFlow, CDiffuSE, and MOSE. As shown in Table III, the proposed model improves on the time-domain methods (SEGAN, Wave-U-Net, DiffWave, SEFlow, CDiffuSE, and MOSE) on these five metrics, with one exception: its CBAK is slightly lower than that of Wave-U-Net, DiffWave, and SEFlow. On all other metrics it achieves the best results. We also compared the proposed model with frequency-domain methods (MMSEGAN and RNNoise), and the results show significant improvements on the metrics.

[0112] Table III:

[0113]

[0114]

[0115] In summary, this invention proposes a dual-branch speech enhancement algorithm based on a structured state-space sequence model that estimates the amplitude spectrum and the complex spectrum in parallel. Because the two branches are closely correlated, an interaction module is introduced so that each branch can compensate for the information the other lacks. Furthermore, the diagonal version of the structured state-space sequence model (S4D) is introduced as the temporal modeling network for speech enhancement, demonstrating that S4D-based deep neural networks can serve as a powerful alternative to traditional architectures (such as RNNs and CNNs) and can model speech enhancement tasks with low computational cost and high performance. Compared to other methods, the method in this embodiment shows improvements in various metrics and has significant reference value in practical applications.
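To illustrate why a diagonal state-space layer is cheap to evaluate, the following is a minimal sketch of the S4D idea as it is generally described in the literature, not the patent's implementation. It assumes zero-order-hold discretization, an S4D-Lin style initialization A_n = -1/2 + jπn, and toy values for B, C, and the step size dt; all function and variable names are hypothetical. With a diagonal A, the convolution kernel is a sum of scalar geometric sequences, and convolving with it gives the same output as running the recurrence step by step.

```python
import cmath
import math


def discretize(A, B, dt):
    """Zero-order hold for a diagonal SSM: A_bar = exp(dt*A), B_bar = (A_bar - 1) / A * B."""
    A_bar = [cmath.exp(dt * a) for a in A]
    B_bar = [(ab - 1) / a * b for ab, a, b in zip(A_bar, A, B)]
    return A_bar, B_bar


def s4d_kernel(A, B, C, dt, length):
    """K[l] = 2 * Re(sum_n C_n * B_bar_n * A_bar_n**l).

    The factor 2 and the real part stand in for the conjugate half of the
    state, which is not stored explicitly.
    """
    A_bar, B_bar = discretize(A, B, dt)
    kernel = []
    for l in range(length):
        acc = sum(c * bb * ab ** l for c, bb, ab in zip(C, B_bar, A_bar))
        kernel.append(2 * acc.real)
    return kernel


def causal_conv(u, k):
    """y[t] = sum_{s<=t} k[s] * u[t-s]; direct O(L^2) form for clarity (an FFT would be used in practice)."""
    return [sum(k[s] * u[t - s] for s in range(t + 1)) for t in range(len(u))]


def s4d_recurrence(A, B, C, dt, u):
    """Same model evaluated as a recurrence: x_t = A_bar * x_{t-1} + B_bar * u_t."""
    A_bar, B_bar = discretize(A, B, dt)
    x = [0j] * len(A)
    y = []
    for u_t in u:
        x = [ab * x_n + bb * u_t for ab, bb, x_n in zip(A_bar, B_bar, x)]
        y.append(2 * sum(c * x_n for c, x_n in zip(C, x)).real)
    return y


# Toy parameters (assumptions for illustration only).
N = 4
A = [complex(-0.5, math.pi * n) for n in range(N)]
B = [1 + 0j] * N
C = [complex(0.5, -0.25 * n) for n in range(N)]
dt = 0.1
u = [1.0, 0.0, -0.5, 2.0, 0.25]

y_conv = causal_conv(u, s4d_kernel(A, B, C, dt, len(u)))
y_rec = s4d_recurrence(A, B, C, dt, u)
```

The two outputs agree up to floating-point error. Each state contributes only three complex numbers (A_n, B_n, C_n), which is one way to see where the parameter saving of a diagonal model comes from.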

[0116] The above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims

1. A dual-branch speech enhancement algorithm based on a structured state-space sequence model, characterized in that it includes the following steps:
Step S1: Preprocess the speech signal to obtain the amplitude spectrum features and the complex spectrum features of the noisy speech;
Step S2: Input the amplitude spectrum features of the noisy speech into the coarse amplitude estimation branch network to obtain the amplitude spectrum features of the estimated speech, and combine them with the phase of the noisy speech to obtain the first real component and the first imaginary component of the coarsely estimated speech;
Step S3: Input the complex spectrum features of the noisy speech into the complex refinement estimation branch network, whose enhancement yields the second real component and the second imaginary component of the refined speech;
Step S4: While steps S2 and S3 run in parallel, introduce an interaction module to realize the flow of the amplitude spectrum features and the complex spectrum features between the coarse amplitude estimation branch network and the complex refinement estimation branch network;
Step S5: Superimpose the first real component and the first imaginary component with the second real component and the second imaginary component to reconstruct the complex spectrum of the target signal;
Step S6: Evaluate the performance of the enhancement algorithm built from the coarse amplitude estimation branch network and the complex refinement estimation branch network based on the structured state-space sequence model;
Both the coarse amplitude estimation branch network and the complex refinement estimation branch network include an encoder, an S4D module, and a decoder. The encoder includes six convolutional modules, each of which includes a two-dimensional convolutional layer, a BN layer, and a LeakyReLU activation function. The S4D module includes residual connections, a regularization layer, a one-dimensional convolutional layer, a gated recurrent unit, and a normalization layer. The decoder includes six deconvolutional modules, each of which includes a two-dimensional deconvolutional layer, a BN layer, and a LeakyReLU activation function.
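Steps S2 and S5 of claim 1 can be sketched on a handful of time-frequency bins. The example below is a simplified illustration that assumes "superimpose" means element-wise addition of the real and imaginary parts, represents each bin as a Python complex number, and uses made-up values in place of network outputs; the function names are hypothetical.

```python
import cmath


def coarse_components(est_mag, noisy_spec):
    """Step S2 (sketch): attach the noisy phase to the estimated magnitude.

    Returns one complex value per bin, whose real and imaginary parts are the
    first real component and the first imaginary component.
    """
    return [cmath.rect(m, cmath.phase(y)) for m, y in zip(est_mag, noisy_spec)]


def superimpose(coarse, refined):
    """Step S5 (sketch): add real parts and imaginary parts bin by bin."""
    return [c + r for c, r in zip(coarse, refined)]


# Toy values standing in for real spectra and network outputs.
noisy_spec = [3 + 4j, 0 - 2j]          # noisy complex spectrum
est_mag = [2.5, 1.0]                   # magnitude from the coarse branch
refined = [0.1 - 0.2j, 0.05 + 0.1j]    # output of the refinement branch

coarse = coarse_components(est_mag, noisy_spec)
target = superimpose(coarse, refined)
```

For the first bin the noisy phase is atan2(4, 3), so the coarse estimate is 2.5 · (0.6 + 0.8j) = 1.5 + 2j, and adding the refinement term gives 1.6 + 1.8j.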

2. The dual-branch speech enhancement algorithm based on a structured state-space sequence model according to claim 1, characterized in that Step S1 includes:
Step S11: Resample all speech to 16 kHz and divide the speech into segments lasting 3 seconds;
Step S12: Extract short-time frames using a Hamming window, where the window length is set to 25 ms, adjacent frames overlap by 25%, and the number of points in the Fast Fourier Transform (FFT) is 512;
Step S13: Apply the STFT to the preprocessed noisy signal to obtain its STFT features, and extract the corresponding amplitude spectrum features and complex spectrum features.
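The framing parameters in claim 2 translate into sample counts as follows. This sketch only does the arithmetic; it assumes no padding at the segment edges and that each 400-sample frame is zero-padded to the 512-point FFT, neither of which is stated in the claim. The constant and function names are illustrative.

```python
import math

SAMPLE_RATE = 16_000      # Hz, step S11
SEGMENT_SECONDS = 3       # step S11
WINDOW_MS = 25            # step S12
OVERLAP = 0.25            # step S12: 25% overlap between adjacent frames
N_FFT = 512               # step S12

segment_len = SAMPLE_RATE * SEGMENT_SECONDS        # samples per segment
win_len = SAMPLE_RATE * WINDOW_MS // 1000          # samples per window
hop_len = int(win_len * (1 - OVERLAP))             # samples between frame starts
n_bins = N_FFT // 2 + 1                            # one-sided frequency bins
n_frames = 1 + (segment_len - win_len) // hop_len  # assumes no edge padding


def hamming(n):
    """Symmetric Hamming window: 0.54 - 0.46 * cos(2*pi*i / (n - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


window = hamming(win_len)
print(segment_len, win_len, hop_len, n_bins, n_frames)
```

Under these assumptions a 3-second segment has 48000 samples, the window is 400 samples with a 300-sample hop, and each segment yields 159 frames of 257 frequency bins.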

3. The dual-branch speech enhancement algorithm based on a structured state-space sequence model according to claim 1, characterized in that Step S2 includes:
Step S21: Construct the coarse amplitude estimation branch network;
Step S22: Expand the dimension of the amplitude spectrum features so that the input shape is [BatchSize, 1, Frequency, Time]; set the numbers of channels of the convolution modules to [16, 32, 64, 128, 256, 256], and set the numbers of channels of the deconvolution modules to [256, 256, 128, 64, 32, 16];
Step S23: Input the amplitude spectrum features into the coarse amplitude estimation branch network, with the amplitude spectrum of the target speech as the training target; the coarse amplitude estimation branch network uses a spectral estimation method to estimate the amplitude spectrum of the target speech;
Step S24: Combine the denoised amplitude spectrum features with the phase of the noisy signal to coarsely derive the first real component and the first imaginary component of the target speech.
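The shape bookkeeping in step S22 can be traced without a deep-learning framework. The claim gives the channel counts but not the kernel size, stride, or padding, so the sketch below assumes, purely for illustration, a kernel of 3, stride of 2, and padding of 1 along the frequency axis, with the time axis left unchanged. The batch size of 4 and the 257 × 159 input are likewise example values, and all names are hypothetical.

```python
ENC_CHANNELS = [16, 32, 64, 128, 256, 256]  # from step S22


def conv_out(size, kernel, stride, padding):
    """Standard convolution output length: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_shapes(batch, freq, time, in_ch=1, k_f=3, s_f=2, p_f=1):
    """Trace [BatchSize, Channels, Frequency, Time] through the six conv modules.

    k_f, s_f, p_f are ASSUMED frequency-axis hyperparameters; the patent text
    does not specify them.
    """
    shapes = [(batch, in_ch, freq, time)]
    for ch in ENC_CHANNELS:
        freq = conv_out(freq, k_f, s_f, p_f)
        shapes.append((batch, ch, freq, time))
    return shapes


shapes = encoder_shapes(batch=4, freq=257, time=159)
for s in shapes:
    print(s)
```

With these assumed settings the frequency axis shrinks 257 → 129 → 65 → 33 → 17 → 9 → 5 while the channel count grows to 256. For the complex branch of claim 4 the same trace applies with `in_ch=2`.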

4. The dual-branch speech enhancement algorithm based on a structured state-space sequence model according to claim 3, characterized in that Step S3 includes:
Step S31: Construct the complex refinement estimation branch network, wherein the complex refinement estimation branch network uses two different decoder modules to estimate the real part and the imaginary part of the target signal, respectively;
Step S32: Expand the dimension of the complex spectrum features so that the input shape is [BatchSize, 2, Frequency, Time], and set the numbers of channels of the convolution modules and the deconvolution modules to [16, 32, 64, 128, 256, 256] and [256, 256, 128, 64, 32, 16], respectively;
Step S33: Input the complex spectrum features into the complex refinement estimation branch network, which uses an implicit mask to estimate the complex spectrum of the target speech. The mask is defined as follows: [formula not reproduced in this text], where the symbols denote, respectively, the real and imaginary parts of the noisy speech and of the clean speech, the implicit mask estimated by the complex refinement estimation branch network, and the imaginary unit;
Step S34: Combine the mask described in step S33 with the complex spectrum of the noisy speech to obtain the second real component and the second imaginary component: [formulas not reproduced in this text], where the symbols denote, respectively, the complex spectrum of the noisy speech, the estimated complex spectrum of the speech, the second real component, and the second imaginary component.
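The formulas of steps S33 and S34 are not legible in this text. One common definition that fits the surrounding description is the complex ratio mask, shown below as an assumption rather than as the patent's own formula. The symbols are chosen here for illustration: Y_r, Y_i and S_r, S_i are the real and imaginary parts of the noisy and clean spectra, M is the ideal mask, \hat{M} is the mask estimated by the network, and j is the imaginary unit.

```latex
% Ideal complex ratio mask, obtained from S = M \cdot Y with Y = Y_r + jY_i, S = S_r + jS_i
M = M_r + jM_i
  = \frac{Y_r S_r + Y_i S_i}{Y_r^2 + Y_i^2}
  + j\,\frac{Y_r S_i - Y_i S_r}{Y_r^2 + Y_i^2}

% Applying an estimated mask \hat{M} = \hat{M}_r + j\hat{M}_i to the noisy spectrum
\hat{S} = \hat{M} \cdot Y, \qquad
\hat{S}_r = \hat{M}_r Y_r - \hat{M}_i Y_i, \qquad
\hat{S}_i = \hat{M}_r Y_i + \hat{M}_i Y_r
```

Under this reading, \hat{S}_r and \hat{S}_i play the role of the second real component and the second imaginary component.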

5. The dual-branch speech enhancement algorithm based on a structured state-space sequence model according to claim 1, characterized in that Step S4 includes:
Step S41: Construct a bidirectional interaction module network, which includes a two-dimensional convolutional layer, a BatchNorm layer, and a sigmoid activation function;
Step S42: Construct the feature flow into the coarse amplitude estimation branch: concatenate the amplitude spectrum features with the complex spectrum features to obtain a first feature; input the first feature into the two-dimensional convolutional layer, which produces a gain function that automatically learns to filter and preserve different regions of the complex spectrum features; multiply the gain function element-wise with the complex spectrum features to obtain the filtered features of the complex refinement estimation branch; finally, add the amplitude spectrum features to the filtered features to obtain the final features of the coarse amplitude estimation branch: [formula not reproduced in this text];
Step S43: Construct the feature flow into the complex refinement estimation branch: concatenate the complex spectrum features with the amplitude spectrum features to obtain a second feature; input the second feature into the two-dimensional convolutional layer, which produces a gain function that automatically learns to filter and preserve different regions of the amplitude spectrum features; multiply the gain function element-wise with the complex spectrum features to obtain the filtered features of the coarse amplitude estimation branch; finally, add the complex spectrum features to the filtered features of the coarse amplitude estimation branch to obtain the final features of the complex refinement estimation branch: [formula not reproduced in this text], where the operator in the formulas denotes the combined operation of concatenation, convolution, and sigmoid.
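The formulas of steps S42 and S43 are missing from this text. A plausible reconstruction, given here only as a hedged sketch, follows the symmetric reading in which each branch's gain filters the features coming from the other branch. As printed, step S43 multiplies its gain with the complex spectrum features, but the sketch assumes the amplitude features are intended there, since the result is described as the filtered features of the amplitude branch. The symbols F_A, F_C, G, and \mathcal{M} are assumed names and do not come from the source.

```latex
% \mathcal{M}(\cdot,\cdot): concatenation, 2-D convolution (with BatchNorm), then sigmoid
\mathcal{M}(U, V) = \sigma\big(\mathrm{BN}(\mathrm{Conv2d}(\mathrm{Concat}(U, V)))\big)

% Step S42: flow into the coarse amplitude estimation branch
G_{C \to A} = \mathcal{M}(F_A, F_C), \qquad
F_A^{\mathrm{out}} = F_A + G_{C \to A} \odot F_C

% Step S43: flow into the complex refinement estimation branch (symmetric reading, assumed)
G_{A \to C} = \mathcal{M}(F_C, F_A), \qquad
F_C^{\mathrm{out}} = F_C + G_{A \to C} \odot F_A
```

Here F_A and F_C are the intermediate features of the amplitude and complex branches at the same depth, and \odot is element-wise multiplication. Because the sigmoid bounds the gain to (0, 1), the module can only attenuate what it passes across, so each branch keeps its own features intact and receives a filtered supplement from the other.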

6. The dual-branch speech enhancement algorithm based on a structured state-space sequence model according to claim 1, characterized in that Step S6 includes:
Step S61: Conduct ablation experiments on the speech enhancement algorithm built from the coarse amplitude estimation branch network and the complex refinement estimation branch network based on the structured state-space sequence model, to verify the effectiveness of the dual-branch structure;
Step S62: Keeping the dual-branch framework unchanged, replace the S4D module with a traditional time-series modeling network to verify the superiority of the S4D module in time-series modeling tasks;
Step S63: Compare the dual-branch speech enhancement algorithm based on the structured state-space sequence model with current advanced speech enhancement algorithms to verify its advantage over them.