A multi-task speech enhancement method and device

A multi-task speech enhancement method uses encoding modules and an adaptive gating module for signal feature extraction and fusion. It reduces the high complexity of separate noise and echo processing in existing technologies and achieves efficient speech enhancement on resource-constrained devices.

CN120766701B · Active · Publication Date: 2026-06-23 · BEIJING UNIV OF POSTS & TELECOMM +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING UNIV OF POSTS & TELECOMM
Filing Date
2025-06-17
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing speech enhancement methods are complex and resource-intensive when dealing with noise and echo, and they struggle to handle nonlinear interference. This results in poor speech enhancement performance and makes them especially unsuitable for resource-constrained devices.

Method used

A multi-task speech enhancement method is adopted. It uses encoding modules for signal feature extraction and compression, combines a dynamic time delay alignment module with an adaptive gating module for signal fusion, and extracts time-domain and frequency-domain features through multi-scale dilated convolutional layers to achieve synergistic optimization of noise suppression and echo cancellation.

Benefits of technology

While reducing computing resources, it improves voice call quality, achieves lightweight deployment, and is suitable for resource-constrained application scenarios.

✦ Generated by Eureka AI based on patent content.

Abstract

The embodiments of the application provide a multi-task speech enhancement method and device. A speech enhancement model performs feature extraction on two signals, a microphone signal and a reference signal. A dynamic time delay alignment module performs adaptive time alignment on the two signals, and an adaptive gating module fuses the features extracted from the two signal branches, extracting multi-scale time features and frequency-domain features from the time and frequency dimensions and improving the accuracy of noise, echo, and speech feature extraction. The application can jointly perform noise suppression and echo cancellation, realize collaborative optimization of multi-task speech enhancement, improve voice call quality, reduce the required computing resources, enable lightweight deployment, and suit resource-limited application scenarios.

Description

Technical Field

[0001] This application relates to the field of speech processing technology, and in particular to a multi-task speech enhancement method and apparatus.

Background Technology

[0002] With the development of mobile communication technology, users have increasingly higher requirements for voice quality. Voice enhancement technology improves voice quality by processing voice signals to suppress noise, echo, and other interference factors. Existing voice enhancement methods can remove noise based on one neural network model or cancel echo based on another. Using both models for voice signal processing is not only complex and resource-intensive, but also struggles to handle various nonlinear interferences that may occur during the separate processing of noise and echo, resulting in poor voice enhancement performance and making it unsuitable for resource-constrained devices.

Summary of the Invention

[0003] In view of this, the purpose of this application is to provide a multi-task speech enhancement method and apparatus to solve the problem of multi-task speech enhancement that simultaneously eliminates noise and echo.

[0004] To achieve the above objectives, embodiments of this application provide a multi-task speech enhancement method, including:

[0005] The microphone time-frequency signal is upsampled using the first encoding module to obtain the first microphone time-frequency signal feature; the first microphone time-frequency signal feature is upsampled using the second encoding module to obtain the second microphone time-frequency signal feature; the second microphone time-frequency signal feature is compressed and dimensionality reduced using the first bottleneck layer to obtain the third microphone time-frequency signal feature.

[0006] The reference time-frequency signal is upsampled using a third encoding module to obtain a first reference time-frequency signal feature; the first microphone time-frequency signal feature and the first reference time-frequency signal feature are softly aligned using a dynamic delay alignment module to obtain an aligned reference time-frequency signal; the aligned reference time-frequency signal is upsampled using a fourth encoding module to obtain a second reference time-frequency signal feature; and the second reference time-frequency signal feature is compressed and dimensionality reduced using a second bottleneck layer to obtain a third reference time-frequency signal feature.

[0007] The adaptive gating module is used to perform feature extraction and fusion on the third microphone time-frequency signal features and the third reference time-frequency signal features to obtain the fused time-frequency signal features.

[0008] The fused time-frequency signal features are downsampled using a decoding module to obtain key time-frequency signal features; the enhanced speech signal is then reconstructed based on these key time-frequency signal features.

[0009] Optionally, the echo soft alignment of the first microphone time-frequency signal features and the first reference time-frequency signal features is performed using a dynamic time-delay alignment module to obtain an aligned reference time-frequency signal, including:

[0010] In the microphone signal processing branch, the first microphone time-frequency signal features are processed by a convolutional layer to obtain a query vector. The first microphone time-frequency signal features are processed by two convolutional layers to obtain a key vector. The query vector and the key vector are matrix-multiplied to obtain an attention score. The attention score is normalized. The normalized attention score is multiplied by the first microphone time-frequency signal features to obtain the first feature part.

[0011] In the reference signal processing branch, the first reference time-frequency signal feature is processed by a sliding window to obtain a value vector. The first reference time-frequency signal feature is processed by a convolutional layer and a sliding window to obtain a key vector. The key vector is multiplied by the query vector to obtain an attention score. The attention score is normalized. The normalized attention score is multiplied by the value vector and then weighted and fused to obtain the second feature part.

[0012] The first feature part and the second feature part are concatenated to obtain the aligned reference time-frequency signal.

[0013] Optionally, using the adaptive gating module to perform feature extraction and fusion on the third microphone time-frequency signal features and the third reference time-frequency signal features to obtain the fused time-frequency signal features includes:

[0014] The third microphone time-frequency signal features and the third reference time-frequency signal features are subjected to feature fusion processing to obtain the fused features;

[0015] The fused features are extracted using a multi-scale temporal feature extraction layer to obtain multi-timescale features.

[0016] The fused features are extracted using a frequency-domain multi-scale dilated convolutional layer to obtain multi-frequency-domain scale features.

[0017] Based on the aforementioned multi-timescale features and multi-frequency domain scale features, a first weight is generated through a gating function;

[0018] The third reference time-frequency signal features are multiplied by the first weight to obtain a first calculation result;

[0019] The third microphone time-frequency signal features are multiplied by the second weight to obtain a second calculation result; wherein the sum of the first weight and the second weight is 1.

[0020] The first and second calculation results are superimposed, and the features are fused through a convolutional layer to obtain the fused time-frequency signal features.

[0021] Optionally, the temporal multi-scale dilated convolutional layer employs multi-level dilation rate adjustable convolution to extract echo features and noise features at different time scales;

[0022] The frequency domain multi-scale dilated convolutional layer uses independent convolutional kernels and combines multi-level dilation to separate the harmonic components of high-frequency noise and low-frequency echo.

[0023] Optionally, before upsampling the microphone time-frequency signal using the first encoding module, the method further includes:

[0024] Perform a short-time Fourier transform on the input microphone signal to obtain the transformed microphone time-frequency signal;

[0025] The transformed microphone time-frequency signal is subjected to bandwidth compression processing to obtain a compressed microphone time-frequency signal;

[0026] Upsampling the microphone time-frequency signal using the first encoding module then comprises: upsampling the compressed microphone time-frequency signal using the first encoding module.

[0027] Optionally, before upsampling the reference time-frequency signal using the third encoding module, the method further includes:

[0028] Perform a short-time Fourier transform on the input reference signal to obtain the transformed reference time-frequency signal;

[0029] The transformed reference time-frequency signal is subjected to bandwidth compression processing to obtain a compressed reference time-frequency signal;

[0030] The upsampling of the reference time-frequency signal using the third encoding module is as follows: the compressed reference time-frequency signal is upsampled using the third encoding module.

[0031] Optionally, the frequency band compression process includes:

[0032] Low-frequency components less than or equal to 4 kHz are retained, while high-frequency components greater than 4 kHz are compressed and merged.

[0033] Optionally, the reconstructed and enhanced speech signal based on the key time-frequency signal features includes:

[0034] The key time-frequency signal features are subjected to frequency band decompression processing to obtain the decompressed time-frequency signal;

[0035] The decompressed time-frequency signal is reconstructed by using a complex ratio mask generated by the mask layer to obtain the reconstructed time-frequency signal.

[0036] The reconstructed time-frequency signal is subjected to inverse short-time Fourier transform to obtain the transformed speech signal.
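The masking step in [0035] can be illustrated with a minimal sketch. It assumes that applying a complex ratio mask means multiplying each time-frequency bin by a complex mask value, which is the usual meaning of the term; the function name and the sample values below are hypothetical, and the mask layer that would produce the mask is not modeled.

```python
def apply_complex_ratio_mask(spec, mask):
    """Multiply each time-frequency bin of `spec` by the matching complex mask value."""
    return [
        [m * x for m, x in zip(mask_row, spec_row)]
        for mask_row, spec_row in zip(mask, spec)
    ]


# Two frames x two frequency bins of a decompressed spectrum (hypothetical values).
noisy = [[1 + 1j, 2 + 0j], [0 + 2j, 1 - 1j]]
# Hypothetical mask: halve the first bin, rotate the second bin by 90 degrees.
mask = [[0.5 + 0j, 0 + 1j], [0.5 + 0j, 0 + 1j]]

enhanced = apply_complex_ratio_mask(noisy, mask)
print(enhanced)
```

Because the mask is complex, it can adjust both the magnitude and the phase of each bin, which a real-valued magnitude mask cannot do.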

[0037] Optionally, the method is implemented based on a speech enhancement model, the loss function of which is:

[0038]

[0039] Where MSE is the mean squared error and SISNR is the scale-invariant signal-to-noise ratio. Ŝ represents the speech signal predicted by the model, S is the clean speech signal, S_noisy is the speech signal containing noise and echo, and α and β are the weights.
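The two terms named in [0039] can be sketched as follows. This is only an illustration of the standard definitions of MSE and SI-SNR on time-domain samples; how the patent combines them with α, β, and S_noisy is not reproduced in the text above, so no combined loss is shown. The function names and the `eps` stabilizer are assumptions.

```python
import math


def mse(estimate, target):
    """Mean squared error between two equal-length sample sequences."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB, using the common zero-mean formulation."""
    n = len(target)
    e_mean = sum(estimate) / n
    t_mean = sum(target) / n
    e = [x - e_mean for x in estimate]
    t = [x - t_mean for x in target]
    # Project the estimate onto the target to isolate the "signal" part.
    scale = sum(a * b for a, b in zip(e, t)) / (sum(b * b for b in t) + eps)
    s_target = [scale * b for b in t]
    e_noise = [a - s for a, s in zip(e, s_target)]
    num = sum(s * s for s in s_target)
    den = sum(x * x for x in e_noise) + eps
    return 10 * math.log10(num / den + eps)


clean = [math.sin(0.3 * n) for n in range(64)]
estimate = [c + 0.1 * math.cos(1.7 * n) for n, c in enumerate(clean)]
louder = [3 * x for x in estimate]

print(mse(estimate, clean), si_snr(estimate, clean), si_snr(louder, clean))
```

Rescaling the estimate leaves SI-SNR essentially unchanged while MSE grows, which is one reason the two measures are often used together.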

[0040] This application embodiment also provides a multi-task voice enhancement device, including:

[0041] The microphone signal processing module is used to upsample the microphone time-frequency signal using the first encoding module to obtain the first microphone time-frequency signal features; to upsample the first microphone time-frequency signal features using the second encoding module to obtain the second microphone time-frequency signal features; and to compress and reduce the dimensions of the second microphone time-frequency signal features using the first bottleneck layer to obtain the third microphone time-frequency signal features.

[0042] The reference signal processing module is used to upsample the reference time-frequency signal using the third encoding module to obtain a first reference time-frequency signal feature; to perform soft alignment of the first microphone time-frequency signal feature and the first reference time-frequency signal feature using the dynamic delay alignment module to obtain an aligned reference time-frequency signal; to upsample the aligned reference time-frequency signal using the fourth encoding module to obtain a second reference time-frequency signal feature; and to compress and reduce the second reference time-frequency signal feature using the second bottleneck layer to obtain a third reference time-frequency signal feature.

[0043] The signal fusion processing module is used to perform feature extraction and fusion on the third microphone time-frequency signal features and the third reference time-frequency signal features using the adaptive gating module to obtain the fused time-frequency signal features.

[0044] The signal reconstruction module is used to downsample the fused time-frequency signal features using the decoding module to obtain key time-frequency signal features; and to reconstruct the enhanced speech signal based on the key time-frequency signal features.

[0045] As can be seen from the above, the multi-task speech enhancement method and apparatus provided in this application utilize a speech enhancement model to perform feature extraction processing on both the microphone signal and the reference signal. A dynamic delay alignment module performs adaptive time alignment of the two signals, and an adaptive gating module fuses the features extracted from the two signal branches. This extracts multi-scale temporal and frequency domain features from both time and frequency dimensions, improving the accuracy of noise, echo, and speech feature extraction. This application can jointly execute noise suppression and echo cancellation tasks, achieving collaborative optimization of multi-task speech enhancement, improving voice call quality, reducing required computational resources, enabling lightweight deployment, and making it suitable for resource-constrained application scenarios.

Attached Figure Description

[0046] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0047] Figure 1 is a schematic diagram of the method flow of an embodiment of this application;

[0048] Figure 2 is a schematic diagram of the speech enhancement model training process according to an embodiment of this application;

[0049] Figure 3 is a block diagram of the speech enhancement model structure according to an embodiment of this application;

[0050] Figure 4 is a block diagram of the dynamic delay alignment module in an embodiment of this application;

[0051] Figure 5 is a block diagram of the adaptive gating module structure according to an embodiment of this application;

[0052] Figure 6 is a block diagram of the device structure according to an embodiment of this application;

[0053] Figure 7 is a block diagram of the electronic device structure according to an embodiment of this application.

Detailed Implementation

[0054] To make the objectives, technical solutions, and advantages of this disclosure clearer, the following detailed description is provided in conjunction with specific embodiments and the accompanying drawings.

[0055] It should be noted that, unless otherwise defined, the technical or scientific terms used in the embodiments of this application should have the ordinary meaning understood by one of ordinary skill in the art to which this disclosure pertains. The terms "first," "second," and similar terms used in the embodiments of this application do not indicate any order, quantity, or importance, but are merely used to distinguish different components. Terms such as "comprising" or "including" mean that the element or object preceding the word encompasses the elements or objects listed after the word and their equivalents, without excluding other elements or objects. Terms such as "connected" or "linked" are not limited to physical or mechanical connections, but can include electrical connections, whether direct or indirect. Terms such as "upper," "lower," "left," and "right" are only used to indicate relative positional relationships; when the absolute position of the described object changes, the relative positional relationship may also change accordingly.

[0056] As shown in Figures 1-3, this application provides a multi-task speech enhancement method, including:

[0057] S101: The microphone time-frequency signal is upsampled using the first encoding module to obtain the first microphone time-frequency signal features; the first microphone time-frequency signal features are upsampled using the second encoding module to obtain the second microphone time-frequency signal features; the second microphone time-frequency signal features are compressed and dimension-reduced using the first bottleneck layer to obtain the third microphone time-frequency signal features.

[0058] In this embodiment, a speech enhancement model is used to process both the microphone signal and the reference signal simultaneously. The microphone signal contains noise, echo, and normal speech signal, while the reference signal contains echo and normal speech signal. The speech enhancement model is divided into two branches to process the input microphone signal and the reference signal respectively, extracting noise features, echo features, and speech features. Based on the extracted features, the noise and echo are masked to reconstruct a clean normal speech signal.

[0059] For the microphone signal processing branch, a short-time Fourier transform is performed on the input microphone signal to obtain the transformed microphone time-frequency signal; then, bandwidth compression is performed on the transformed microphone time-frequency signal to obtain the compressed microphone time-frequency signal. In other words, the time-domain microphone signal is converted to a microphone time-frequency signal via a short-time Fourier transform, and then bandwidth compression is applied to the microphone time-frequency signal to reduce computational complexity.
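As an illustration of the first step in [0059], the following sketch computes a short-time Fourier transform with a direct DFT. The frame length, hop size, and Hann window are assumptions chosen for readability; the text does not specify them, and a practical system would use an FFT routine.

```python
import cmath
import math


def stft(signal, frame_len=16, hop=8):
    """Naive one-sided STFT: Hann-windowed frames followed by a direct DFT."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = [
            sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(spectrum)
    return frames


# A test tone that sits exactly on DFT bin 4 of a 16-point frame.
tone = [math.sin(2 * math.pi * 4 * n / 16) for n in range(64)]
spec = stft(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: abs(spec[0][k]))
print(len(spec), len(spec[0]), peak_bin)
```

Each row of `spec` is one time frame and each column one frequency bin, which is the time-frequency representation that the later modules operate on.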

[0060] In some implementations, an equivalent rectangular bandwidth (ERB) merging module is used to downsample the microphone time-frequency signal. Low-frequency components less than or equal to 4 kHz are retained in their original frequency bands, so that the fundamental frequency and main harmonics of speech are captured accurately and the naturalness of speech is preserved. High-frequency components greater than 4 kHz are compressed and merged into several ERB frequency bands to simulate the wideband perception characteristics of the human ear. This significantly reduces computational complexity while compressing the frequency band as far as possible within the range of human auditory perception.
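The band compression in [0060] can be sketched for a single spectral frame as below. The 4 kHz cutoff comes from the text; the number of merged bands, the use of averaging within each band, and the Glasberg-Moore ERB-rate formula for spacing the bands are assumptions made for illustration, since the text does not give the exact merging rule.

```python
import math


def erb_rate(freq_hz):
    """Glasberg-Moore ERB-rate scale (approximate number of ERBs below freq_hz)."""
    return 21.4 * math.log10(1.0 + 0.00437 * freq_hz)


def compress_bands(magnitudes, sample_rate, cutoff_hz=4000.0, n_high_bands=8):
    """Keep bins at or below cutoff_hz; average bins above it into ERB-spaced bands."""
    n_bins = len(magnitudes)
    bin_hz = (sample_rate / 2) / (n_bins - 1)
    low = [m for k, m in enumerate(magnitudes) if k * bin_hz <= cutoff_hz]
    lo_erb, hi_erb = erb_rate(cutoff_hz), erb_rate(sample_rate / 2)
    step = (hi_erb - lo_erb) / n_high_bands
    groups = [[] for _ in range(n_high_bands)]
    for k in range(len(low), n_bins):
        idx = int((erb_rate(k * bin_hz) - lo_erb) / step)
        groups[min(idx, n_high_bands - 1)].append(magnitudes[k])
    # Empty groups (possible for very small spectra) are dropped.
    high = [sum(g) / len(g) for g in groups if g]
    return low + high


# One frame of a 512-point spectrum at 16 kHz: 257 bins, 31.25 Hz apart.
frame = [1.0] * 257
compressed = compress_bands(frame, sample_rate=16000)
print(len(frame), "->", len(compressed))
```

With these hypothetical settings the 129 bins up to 4 kHz pass through unchanged and the 128 bins above are reduced to 8 values, so later layers see roughly half as many frequency channels.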

[0061] The compressed microphone time-frequency signal is input into the first encoding module, which performs preliminary upsampling on the input signal to obtain the first microphone time-frequency signal features. The second encoding module then upsamples these features to obtain the second microphone time-frequency signal features. Both the first and second encoding modules consist of multiple layers of convolutional layers, batch normalization layers, and activation functions, employing a layer-by-layer stacking approach for deep feature learning. Each encoding module contains point convolutions and multiple sets of dilated convolutions to capture the temporal and frequency features of the signal. Dilated convolutions effectively expand the receptive field, capturing echo path features with long-range time dependencies without significantly increasing the number of model parameters, thus reducing computational complexity.

[0062] The second microphone time-frequency signal features are input into the first bottleneck layer. The first bottleneck layer compresses and reduces the dimensionality of the second microphone time-frequency signal features to obtain the third microphone time-frequency signal features. This reduces computational complexity, removes redundant information, and improves the generalization ability of the model.

[0063] S102: The reference time-frequency signal is upsampled using the third encoding module to obtain the first reference time-frequency signal feature; the first microphone time-frequency signal feature and the first reference time-frequency signal feature are softly aligned using the dynamic time delay alignment module to obtain the aligned reference time-frequency signal; the aligned reference time-frequency signal is upsampled using the fourth encoding module to obtain the second reference time-frequency signal feature; the second reference time-frequency signal feature is compressed and dimensionality reduced using the second bottleneck layer to obtain the third reference time-frequency signal feature;

[0064] In this embodiment, for the reference signal processing branch, a short-time Fourier transform is performed on the input reference signal to obtain the transformed reference time-frequency signal; then, the transformed reference time-frequency signal undergoes bandwidth compression processing to obtain the compressed reference time-frequency signal. That is, the time-domain reference signal is converted into a reference time-frequency signal through a short-time Fourier transform, and then bandwidth compression is performed on the reference time-frequency signal to reduce computational complexity. Bandwidth compression also employs an equivalent rectangular bandwidth combining module to downsample the reference time-frequency signal, retaining low-frequency components less than or equal to 4 kHz, and compressing and combining high-frequency components greater than 4 kHz, achieving the effect of bandwidth compression as much as possible within the range of human hearing.

[0065] The compressed reference time-frequency signal is input into the third encoding module, which performs preliminary upsampling on the input reference time-frequency signal to obtain the first reference time-frequency signal characteristics.

[0066] In some embodiments, considering the dynamic time delay between the microphone signal and the reference signal, the first microphone time-frequency signal features and the first reference time-frequency signal features are input into the dynamic time delay alignment module, and the dynamic time delay alignment module is used to perform soft alignment on the first microphone time-frequency signal features and the first reference time-frequency signal features to obtain the aligned reference time-frequency signal.

[0067] As shown in Figure 4, the dynamic delay alignment module is used to model the time-frequency correlation between the microphone signal and the reference signal. The dynamic delay alignment module includes a microphone signal processing branch and a reference signal processing branch. In the microphone signal processing branch, the first microphone time-frequency signal features are processed by a convolutional layer to obtain the query vector Q_mic. The first microphone time-frequency signal features are processed by two convolutional layers to obtain a key vector. The query vector and the key vector are matrix-multiplied to obtain an attention score. The attention score is normalized by an activation function. The normalized attention score is multiplied by the first microphone time-frequency signal features to obtain the first feature part output by the microphone signal processing branch.

[0068] In the reference signal processing branch, the first reference time-frequency signal features are processed by a sliding window to obtain the value vector V_sw. The first reference time-frequency signal features are processed by a convolutional layer and a sliding window to obtain the key vector K_sw. The key vector K_sw is matrix-multiplied with the query vector Q_mic of the microphone signal processing branch to obtain an attention score. This score dynamically aligns the temporal offset and is normalized using an activation function to obtain a normalized attention score (the unsqueeze() function is used for vector dimension alignment). The normalized attention score is matrix-multiplied with the value vector V_sw, and weighted fusion is performed to obtain the second feature part output by the reference signal processing branch.

[0069] Next, the first feature portion output from the microphone signal processing branch and the second feature portion output from the reference signal processing branch are concatenated to output a time-frequency signal with time delay alignment. Optionally, point convolution operations are performed on the feature portions of the two branches to achieve feature fusion. The fused features include aligned speech time-frequency features and echo time-frequency features, and are aligned with the dimensions of the subsequent fourth coding module.
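The reference branch of the alignment module can be illustrated with a minimal sliding-window attention sketch. It assumes dot-product scores and softmax normalization, and it omits the convolutional layers that produce Q_mic, K_sw, and V_sw; the function names, the fixed `max_delay`, and the toy features are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_align(mic_query, ref_key, ref_value, max_delay):
    """For each frame t, attend over reference frames t, t-1, ..., t-max_delay+1."""
    dim = len(ref_value[0])
    aligned, weights = [], []
    for t, q in enumerate(mic_query):
        scores = []
        for d in range(max_delay):
            if t - d >= 0:
                scores.append(sum(a * b for a, b in zip(q, ref_key[t - d])))
            else:
                scores.append(float("-inf"))  # frames before the start are masked out
        w = softmax(scores)
        aligned.append([
            sum(w[d] * ref_value[t - d][c] for d in range(max_delay) if t - d >= 0)
            for c in range(dim)
        ])
        weights.append(w)
    return aligned, weights


# Toy features: the microphone sees the reference delayed by two frames.
ref = [[5.0 if c == t else 0.0 for c in range(8)] for t in range(8)]
mic = [ref[t - 2] if t >= 2 else [0.0] * 8 for t in range(8)]

aligned, weights = soft_align(mic, ref, ref, max_delay=4)
best_delay = max(range(4), key=lambda d: weights[5][d])
print(best_delay)
```

The attention weights concentrate on the delay that best matches the microphone frame, so the output is a softly time-shifted copy of the reference rather than the result of a hard delay estimate.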

[0070] In some implementations, a soft alignment mechanism using sliding window attention is employed in the reference signal processing branch. This mechanism uses a sliding window operation that adds a time dimension to capture features. The window size of the sliding window is a learnable parameter, which is discretized into an integer delay parameter. This allows the model to adaptively adjust the delay range and identify the optimal delay parameter, achieving time-delay alignment with minimal computational resource consumption. The learnable delay parameter dynamically constrains the window range, dividing the reference signal into multiple sliding windows in the time dimension and generating time-delay-sensitive key-value pairs. The time-domain features of the reference signal are then extracted through the sliding window. By dynamically weighting features with different time delays, the dynamic time-delay alignment module improves attention to important time frames and enhances the time alignment accuracy of the signal, thus aiding in the processing of linear and nonlinear echo components.

[0071] The aligned reference time-frequency signal is input into the fourth encoding module, which upsamples the signal to obtain the second reference time-frequency signal features. The second bottleneck layer then compresses and reduces the dimensionality of these features to obtain the third reference time-frequency signal features, reducing computational complexity, removing redundant information, and improving the model's generalization ability. Both the third and fourth encoding modules consist of multiple layers of convolutional layers, batch normalization layers, and activation functions, employing a layer-by-layer stacking approach for deep feature learning.

[0072] S103: Use the adaptive gating module to perform feature extraction and fusion on the third microphone time-frequency signal features and the third reference time-frequency signal features to obtain the fused time-frequency signal features;

[0073] In this embodiment, for the third microphone time-frequency signal features obtained after processing the microphone signal processing branch and the third reference time-frequency signal features obtained after processing the reference signal processing branch, an adaptive gating module is used to perform feature extraction and fusion processing on the third microphone time-frequency signal features and the third reference time-frequency signal features to obtain fused time-frequency signal features, including:

[0074] The third microphone time-frequency signal features and the third reference time-frequency signal features are fused to obtain the fused features.

[0075] The fused features are extracted using a multi-scale temporal feature extraction layer to obtain multi-timescale features.

[0076] The fused features are extracted using a frequency-domain multi-scale dilated convolutional layer to obtain multi-frequency-domain scale features.

[0077] Based on multi-time-scale features and multi-frequency-domain scale features, the first weight is generated through a gating function;

[0078] The third reference time-frequency signal features are multiplied by the first weight to obtain the first calculation result;

[0079] The third microphone time-frequency signal features are multiplied by the second weight to obtain the second calculation result; wherein the sum of the first weight and the second weight is 1.

[0080] The first and second calculation results are superimposed, and the features are fused through a convolutional layer to obtain the fused time-frequency signal features.

[0081] As shown in Figure 5, the adaptive gating module unifies the echo cancellation and noise suppression tasks, utilizing a gating mechanism to achieve synergistic enhancement of echo cancellation and noise suppression, avoiding the error accumulation problem caused by cascaded processing. The adaptive gating module includes a temporal multi-scale dilated convolutional layer and a frequency-domain multi-scale dilated convolutional layer. For the input third microphone time-frequency signal features and the third reference time-frequency signal features, feature fusion is first performed. The fused features are then processed by both temporal and frequency-domain multi-scale dilated convolutional layers. The temporal multi-scale dilated convolutional layer extracts multi-scale temporal features from the input fused features, capturing short-time echo noise, mid-time echo noise, and long-time echo noise features from the time dimension using convolutional kernels with different dilation rates, while avoiding feature redundancy. The frequency-domain multi-scale dilated convolutional layer extracts multi-scale frequency features from the input fused features, modeling harmonic correlations from the frequency dimension using independent convolutional kernels to separate overlapping echo noise.

[0082] After extracting multi-timescale and multi-frequency-scale features, a first weight G is dynamically generated using a gating function (e.g., the sigmoid activation function). The third reference time-frequency signal feature is then multiplied with this first weight using a matrix multiplication, and the third microphone time-frequency signal feature is multiplied with a second weight (1-G) using a matrix multiplication. The calculation results from the two branches are then fused using a convolutional layer to achieve sufficient feature fusion, resulting in the fused time-frequency signal features. By dynamically generating weights, the module accurately selects features for echo cancellation and noise suppression paths, improving the system's robustness, enhancing its adaptability to various interference types, and improving signal clarity and audio quality.
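The complementary weighting in [0082] can be sketched on plain feature vectors. This sketch assumes the weights are applied element by element, which is a common reading of gated fusion, whereas the text above describes the operation as a matrix multiplication. The gate logits stand in for the output of the multi-scale layers, and the final convolutional fusion is omitted. All names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(mic_feat, ref_feat, gate_logits):
    """Blend two feature vectors with complementary weights G and 1 - G."""
    fused = []
    for m, r, z in zip(mic_feat, ref_feat, gate_logits):
        g = sigmoid(z)                       # first weight G, applied to the reference
        fused.append(g * r + (1.0 - g) * m)  # second weight 1 - G, applied to the microphone
    return fused


mic_feat = [1.0, 2.0]
ref_feat = [3.0, 4.0]

balanced = gated_fusion(mic_feat, ref_feat, [0.0, 0.0])       # G = 0.5 everywhere
ref_only = gated_fusion(mic_feat, ref_feat, [100.0, 100.0])   # G close to 1
mic_only = gated_fusion(mic_feat, ref_feat, [-100.0, -100.0]) # G close to 0
print(balanced, ref_only, mic_only)
```

Because the two weights sum to 1, the gate interpolates between the two branches per feature instead of scaling them independently.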

[0083] In some implementations, the temporal multi-scale dilated convolutional layer employs multi-level adjustable dilation rates. By adaptively adjusting the dilation rate, it dynamically matches the temporal scale requirements of different scenarios, extracting echo and noise features at different time scales. For example, dilation rates of 1, 3, and 5 are used to capture the dependencies of short-term local echo noise, medium-term periodic echo noise, and long-term far-field reflection echo noise, respectively. The frequency-domain multi-scale dilated convolutional layer uses independent convolutional kernels in the frequency dimension, combined with multi-level dilation, such as dilation rates of 1, 2, and 4, to separate high-frequency noise from the harmonic components of low-frequency echoes.
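The effect of the dilation rates mentioned in [0083] can be seen in a one-dimensional sketch. The dilation rates 1, 3, and 5 come from the text; the three-tap kernel and the ramp input are hypothetical, and the adaptive adjustment of dilation rates is not modeled.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Valid-mode 1D correlation with kernel taps spaced `dilation` samples apart."""
    span = (len(kernel) - 1) * dilation + 1  # samples covered by one output value
    return [
        sum(kernel[j] * signal[i + j * dilation] for j in range(len(kernel)))
        for i in range(len(signal) - span + 1)
    ]


def receptive_field(kernel_size, dilation):
    """Number of input samples that a single output sample depends on."""
    return (kernel_size - 1) * dilation + 1


signal = list(range(12))  # a simple ramp
kernel = [1, 0, -1]       # difference between the first and last tap

multi_scale = {d: dilated_conv1d(signal, kernel, d) for d in (1, 3, 5)}
for d, out in multi_scale.items():
    print(d, receptive_field(len(kernel), d), out)
```

The same three weights cover 3, 7, and 11 input samples at dilation rates 1, 3, and 5, which is how dilated convolutions widen the receptive field without adding parameters.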

[0084] S104: The decoding module is used to downsample the fused time-frequency signal features to obtain key time-frequency signal features; the enhanced speech signal is reconstructed based on the key time-frequency signal features.

[0085] In this embodiment, the time-frequency signal features obtained after processing by the adaptive gating module are input to the decoding module. The decoding module performs a deconvolution operation on the input time-frequency signal features to carry out the downsampling and restoration process, and extracts the key time-frequency signal features. The decoding module adopts a symmetrical structure and gradually restores the frequency resolution, so that important information is not lost during the encoding process. As shown in Figure 3, skip connections are used between the first encoding module and the decoding module, and between the fourth encoding module and the decoding module. Fusing information across layers alleviates the degradation problem of deep networks and reduces information loss during the encoding process.

[0086] Each encoding module upsamples the compressed time-frequency signal in the high-frequency portion, while the decoding module combines deconvolution and skip connections to recover high-frequency details. Considering the sensitivity of the human ear to different frequency bands, this reduces the model's complexity while preserving key features and information from the microphone and reference signals. By dividing the spectral feature processing of the microphone and reference signals into low-frequency and high-frequency components, the model can more efficiently extract and learn important information from the speech signal, reducing the computational burden on subsequent network layers.
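
The split into low-frequency and high-frequency components can be illustrated with the following Python sketch, which keeps every frequency bin up to 4 kHz (the boundary recited in claim 6) and merges the bins above it. The sketch rests on several assumptions that are not taken from the embodiment: a 16 kHz sampling rate, merging by averaging uniform groups of two bins (the embodiment refers to an equivalent rectangular bandwidth module instead), and decompression by simple repetition. All function and parameter names are hypothetical.

```python
def compress_bands(frame, sample_rate=16000, cutoff_hz=4000.0, group=2):
    """Keep bins at or below cutoff_hz; average groups of bins above it.

    `frame` holds one-sided spectrum values for a single time frame,
    from 0 Hz up to sample_rate / 2.
    """
    n_bins = len(frame)
    bin_hz = (sample_rate / 2) / (n_bins - 1)        # spacing between bins
    n_low = min(int(cutoff_hz / bin_hz) + 1, n_bins)  # bins <= cutoff
    low = list(frame[:n_low])
    high = frame[n_low:]
    merged = []
    for i in range(0, len(high), group):
        chunk = high[i:i + group]
        merged.append(sum(chunk) / len(chunk))
    return low + merged, n_low


def decompress_bands(compressed, n_low, n_bins, group=2):
    """Restore the original bin count by repeating each merged value."""
    low = list(compressed[:n_low])
    high = []
    for value in compressed[n_low:]:
        high.extend([value] * group)
    return low + high[:n_bins - n_low]


if __name__ == "__main__":
    frame = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0, 12.0]  # 9 bins, 1 kHz apart
    packed, n_low = compress_bands(frame)
    print(packed, n_low)
    print(decompress_bands(packed, n_low, len(frame)))
```

In this toy frame the nine bins shrink to seven, and the saving grows with the number of bins above the cutoff, which is the source of the reduced computational burden on later layers.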

[0087] In some embodiments, reconstructing the enhanced speech signal based on the key time-frequency signal features includes:

[0088] Key time-frequency signal features are subjected to frequency band decompression processing to obtain the decompressed time-frequency signal;

[0089] The complex ratio mask generated by the mask layer is used to perform speech reconstruction on the decompressed time-frequency signal to obtain the reconstructed time-frequency signal;

[0090] The reconstructed time-frequency signal is subjected to inverse short-time Fourier transform to obtain the transformed speech signal.

[0091] In this embodiment, the key time-frequency signal features output by the decoding module are subjected to frequency band decompression processing to obtain the decompressed time-frequency signal. Then, a complex ratio mask generated by a mask layer is used to perform speech reconstruction on the decompressed time-frequency signal to obtain the reconstructed time-frequency signal. Finally, an inverse short-time Fourier transform is performed on the reconstructed time-frequency signal to obtain the transformed speech signal. Specifically, an equivalent rectangular bandwidth separation module performs the frequency band decompression processing on the key time-frequency signal features, restoring them to the original resolution of the input signal. The complex ratio mask generated by the mask layer masks the noise and echo components of the decompressed time-frequency signal, retaining only the clean time-frequency signal portion. The clean time-frequency signal portion undergoes an inverse short-time Fourier transform to recover the enhanced speech signal.
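
The masking step can be illustrated with the following Python sketch, which applies a complex ratio mask to a complex spectrum by complex multiplication, so that both magnitude and phase of each time-frequency bin can be adjusted. The mask values here are hand-picked for illustration; in the embodiment they are generated by the mask layer. The function names are hypothetical, and the frequency band decompression and the inverse short-time Fourier transform are not shown.

```python
def crm_real_imag(x_real, x_imag, m_real, m_imag):
    """Complex multiplication written out in real and imaginary parts."""
    return (m_real * x_real - m_imag * x_imag,
            m_real * x_imag + m_imag * x_real)


def apply_complex_ratio_mask(spectrum, mask):
    """Multiply every time-frequency bin by its complex mask value.

    `spectrum` and `mask` are 2-D lists of Python complex numbers
    with identical shape (frames x frequency bins).
    """
    return [[m * x for x, m in zip(spec_row, mask_row)]
            for spec_row, mask_row in zip(spectrum, mask)]


if __name__ == "__main__":
    spectrum = [[3 + 4j, 1 + 1j, 2 - 2j]]
    mask = [[0.5 + 0j,  # halve the magnitude, keep the phase
             0j,        # suppress the bin entirely
             1j]]       # keep the magnitude, rotate the phase by 90 degrees
    print(apply_complex_ratio_mask(spectrum, mask))
```

A purely real mask would only scale magnitudes; the imaginary part is what allows the phase of the microphone signal to be corrected as well.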

[0092] As shown in Figure 2, in some embodiments, the training method for the speech enhancement model includes:

[0093] Speech, echo, and noise datasets are acquired, and data augmentation methods are used to generate a dataset of microphone signals containing noise and echo, clean speech signals, and reference signals containing echo. The dataset combines data resources from the ICASSP Echo Cancellation Challenge and Noise Suppression Challenge. Several hours of synthesized far-end single-talk samples are obtained from the Echo Cancellation Challenge. For each noisy audio clip from the Noise Suppression Challenge, an echo signal from the Echo Cancellation Challenge is randomly selected and mixed with the noisy audio to construct a new microphone signal. During this process, the signal-to-noise ratio of echo to noise is uniformly distributed within a predetermined threshold range. To further enrich the dataset, several hours of double-talk scene data are constructed, and a Chinese speech dataset recorded with professional equipment is used for further training in specific scenarios to improve the model's generality and robustness.
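
The mixing step can be sketched in Python as follows. The example draws a power ratio in decibels from a uniform distribution and scales the echo signal so that the mixture has exactly that ratio. It is a minimal sketch under stated assumptions: the ratio is interpreted as noisy-audio power over echo power, the bounds of -5 dB and 5 dB are placeholders for the predetermined threshold, both signals are non-silent and of equal length, and all function names are hypothetical.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in signal) / len(signal)


def scale_to_ratio(target, interferer, ratio_db):
    """Scale `interferer` so 10*log10(P_target / P_interferer) == ratio_db."""
    gain = math.sqrt(power(target) / (power(interferer) * 10 ** (ratio_db / 10)))
    return [gain * v for v in interferer]


def build_microphone_signal(noisy, echo, low_db, high_db, rng):
    """Mix an echo into noisy audio at a uniformly sampled power ratio."""
    ratio_db = rng.uniform(low_db, high_db)
    scaled_echo = scale_to_ratio(noisy, echo, ratio_db)
    mixed = [a + b for a, b in zip(noisy, scaled_echo)]
    return mixed, ratio_db


if __name__ == "__main__":
    rng = random.Random(0)               # fixed seed for repeatability
    noisy = [1.0, -1.0, 1.0, -1.0]       # toy stand-in for a noisy clip
    echo = [2.0, 2.0, -2.0, -2.0]        # toy stand-in for an echo clip
    mixed, ratio_db = build_microphone_signal(noisy, echo, -5.0, 5.0, rng)
    print(round(ratio_db, 2), mixed)
```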

[0094] Microphone signal samples and reference signal samples from the dataset are input into the speech enhancement model for training. The speech enhancement model processes the microphone signal samples and reference signal samples and outputs a speech signal. Based on the output speech signal and the corresponding clean speech signal in the dataset, a loss value is calculated according to the loss function. The model uses this loss value for backpropagation and continues forward inference. Following this process, after multiple rounds of training, the loss value gradually decreases and eventually converges, completing the model training and obtaining the trained speech enhancement model. During the training process, the adaptive gating module dynamically adjusts the gradient backpropagation path, prioritizing the optimization of task-related features.

[0095] In some implementations, the time-domain scale-invariant signal-to-noise ratio and frequency-domain amplitude spectrum loss are combined to balance speech quality and spectral fidelity. The designed loss function is as follows:

[0096]

[0097] Where MSE is the mean squared error and SISNR is the scale-invariant signal-to-noise ratio; Ŝ is the speech signal predicted by the model, S is the clean speech signal, and S_noisy is the speech signal containing echo and noise; α and β are weights that can be adjusted dynamically according to task requirements.
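
The two ingredients of the loss can be illustrated with the following Python sketch, which computes the scale-invariant signal-to-noise ratio in its commonly used form and a mean squared error, and combines them with weights alpha and beta. The combination shown is an assumption made purely for illustration: the formula of this embodiment is not reproduced in the present text, and unlike that formula the sketch does not use the noisy signal S_noisy. The function names and the small constant eps are hypothetical.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two time-domain signals."""
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e = [v - mean_e for v in estimate]   # remove DC offset
    t = [v - mean_t for v in target]
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps
    proj = [dot / t_energy * b for b in t]       # part of e along the target
    noise = [a - p for a, p in zip(e, proj)]     # everything else
    num = sum(p * p for p in proj)
    den = sum(n * n for n in noise) + eps
    return 10 * math.log10(num / den + eps)


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def combined_loss(est_wave, clean_wave, est_mag, clean_mag, alpha, beta):
    """Hypothetical weighted sum: spectral MSE minus time-domain SI-SNR."""
    return alpha * mse(est_mag, clean_mag) - beta * si_snr(est_wave, clean_wave)


if __name__ == "__main__":
    clean = [1.0, -1.0, 1.0, -1.0]
    estimate = [1.0, -1.0, 0.5, -0.5]
    print(si_snr(estimate, clean))
    print(si_snr([2 * v for v in estimate], clean))  # unchanged by scaling
```

The SI-SNR term enters with a negative sign because a higher ratio indicates a better estimate, whereas a loss is minimized.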

[0098] The multi-task speech enhancement method provided in this application utilizes a speech enhancement model to extract features from both the microphone signal and the reference signal, improving the performance of signal feature fusion and enhancement. By employing appropriate upsampling and downsampling, it preserves key features from the echo cancellation and noise suppression processes. A dynamic delay alignment module adaptively aligns the two signals, adapting to changes in nonlinear echo paths and dynamic environmental interference for more accurate signal alignment. An adaptive gating module fuses the features extracted from the two signal branches, extracting multi-scale temporal and frequency features from both time and frequency dimensions, improving the accuracy of noise, echo, and speech feature extraction. This method can achieve high performance by jointly executing noise suppression and echo cancellation tasks with minimal computational resource consumption, realizing collaborative optimization of multi-task speech enhancement, improving voice call quality, and enabling lightweight deployment suitable for resource-constrained applications.

[0099] It should be noted that the method in this embodiment can be executed by a single device, such as a computer or server. The method can also be applied in a distributed scenario, where multiple devices cooperate to complete the task. In such a distributed scenario, one of these devices may execute only one or more steps of the method in this embodiment, and the multiple devices will interact with each other to complete the method described.

[0100] It should be noted that the above description describes specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recorded in the claims may be performed in a different order than that shown in the embodiments and still achieve the desired results. Furthermore, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

[0101] As shown in Figure 6, an embodiment of this application provides a multi-task speech enhancement device, including:

[0102] The microphone signal processing module is used to upsample the microphone time-frequency signal using the first encoding module to obtain the first microphone time-frequency signal features; to upsample the first microphone time-frequency signal features using the second encoding module to obtain the second microphone time-frequency signal features; and to compress and reduce the dimensionality of the second microphone time-frequency signal features using the first bottleneck layer to obtain the third microphone time-frequency signal features.

[0103] The reference signal processing module is used to upsample the reference time-frequency signal using the third encoding module to obtain the first reference time-frequency signal features; to perform soft alignment of the first microphone time-frequency signal features and the first reference time-frequency signal features using the dynamic delay alignment module to obtain an aligned reference time-frequency signal; to upsample the aligned reference time-frequency signal using the fourth encoding module to obtain the second reference time-frequency signal features; and to compress and reduce the dimensionality of the second reference time-frequency signal features using the second bottleneck layer to obtain the third reference time-frequency signal features.

[0104] The signal fusion processing module is used to perform feature extraction and fusion on the third microphone time-frequency signal features and the third reference time-frequency signal features using the adaptive gating module to obtain the fused time-frequency signal features.

[0105] The signal reconstruction module is used to downsample the fused time-frequency signal features using the decoding module to obtain key time-frequency signal features; and to reconstruct the enhanced speech signal based on the key time-frequency signal features.

[0106] For ease of description, the above device is described in terms of function and divided into various modules. Of course, in implementing the embodiments of this application, the functions of each module can be implemented in one or more pieces of software and/or hardware.

[0107] The apparatus described above is used to implement the corresponding methods in the foregoing embodiments and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.

[0108] Figure 7 illustrates a more specific hardware structure of an electronic device provided in this embodiment. The device may include a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. The processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively connected to one another within the device via the bus 1050.

[0109] The processor 1010 can be implemented using a general-purpose CPU (Central Processing Unit), microprocessor, application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in the embodiments of this specification.

[0110] The memory 1020 can be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), static storage device, dynamic storage device, etc. The memory 1020 can store the operating system and other applications. When the technical solutions provided in the embodiments of this specification are implemented by software or firmware, the relevant program code is stored in the memory 1020 and is called and executed by the processor 1010.

[0111] The input / output interface 1030 is used to connect input / output modules to realize information input and output. Input / output modules can be configured as components within the device (not shown in the figure) or externally connected to the device to provide corresponding functions. Input devices may include keyboards, mice, touchscreens, microphones, various sensors, etc., while output devices may include displays, speakers, vibrators, indicator lights, etc.

[0112] The communication interface 1040 is used to connect a communication module (not shown in the figure) to enable communication between this device and other devices. The communication module can communicate via wired means (such as USB or an Ethernet cable) or wireless means (such as a mobile network, Wi-Fi, or Bluetooth).

[0113] Bus 1050 includes a pathway for transmitting information between various components of the device, such as processor 1010, memory 1020, input / output interface 1030, and communication interface 1040.

[0114] It should be noted that although the above-described device only shows the processor 1010, memory 1020, input / output interface 1030, communication interface 1040, and bus 1050, in specific implementations, the device may also include other components necessary for normal operation. Furthermore, those skilled in the art will understand that the above-described device may only include the components necessary for implementing the embodiments of this specification, and not necessarily all the components shown in the figures.

[0115] The electronic devices described above are used to implement the corresponding methods in the foregoing embodiments and have the beneficial effects of the corresponding method embodiments, which will not be repeated here.

[0116] The computer-readable medium of this embodiment includes permanent and non-permanent, removable and non-removable media, and information storage can be implemented by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.

[0117] Those skilled in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope of this disclosure (including the claims) is limited to these examples; within the framework of this disclosure, the technical features of the above embodiments or different embodiments can also be combined, the steps can be implemented in any order, and there are many other variations of different aspects of the embodiments of this application as described above, which are not provided in the details for the sake of brevity.

[0118] Additionally, to simplify the description and discussion, and to avoid obscuring the embodiments of this application, the well-known power / ground connections to integrated circuit (IC) chips and other components may or may not be shown in the provided drawings. Furthermore, the apparatus may be shown in block diagram form to avoid obscuring the embodiments of this application, and this also takes into account the fact that the details of the implementation of these block diagram apparatuses are highly dependent on the platform on which the embodiments of this application will be implemented (i.e., these details should be fully understood by those skilled in the art). While specific details (e.g., circuits) have been set forth to describe exemplary embodiments of this disclosure, it will be apparent to those skilled in the art that the embodiments of this application can be implemented without these specific details or with variations thereof. Therefore, these descriptions should be considered illustrative rather than restrictive.

[0119] Although this disclosure has been described in conjunction with specific embodiments thereof, many substitutions, modifications, and variations of these embodiments will be apparent to those skilled in the art from the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may be used with the embodiments discussed.

[0120] The embodiments of this application are intended to cover all such substitutions, modifications, and variations that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the embodiments of this application should be included within the protection scope of this disclosure.

Claims

1. A multi-task speech enhancement method, characterized in that it includes: upsampling the microphone time-frequency signal using the first encoding module to obtain the first microphone time-frequency signal features; upsampling the first microphone time-frequency signal features using the second encoding module to obtain the second microphone time-frequency signal features; compressing and reducing the dimensionality of the second microphone time-frequency signal features using the first bottleneck layer to obtain the third microphone time-frequency signal features; upsampling the reference time-frequency signal using the third encoding module to obtain the first reference time-frequency signal features; performing soft alignment of the first microphone time-frequency signal features and the first reference time-frequency signal features using a dynamic time-delay alignment module to obtain an aligned reference time-frequency signal; upsampling the aligned reference time-frequency signal using the fourth encoding module to obtain the second reference time-frequency signal features; compressing and reducing the dimensionality of the second reference time-frequency signal features using the second bottleneck layer to obtain the third reference time-frequency signal features; performing feature extraction and fusion on the third microphone time-frequency signal features and the third reference time-frequency signal features using the adaptive gating module to obtain fused time-frequency signal features, which includes: fusing the third microphone time-frequency signal features and the third reference time-frequency signal features to obtain fused features; extracting multi-scale temporal features from the fused features using a temporal multi-scale dilated convolutional layer to obtain multi-time-scale features; extracting multi-scale frequency-domain features from the fused features using a frequency-domain multi-scale dilated convolutional layer to obtain multi-frequency-scale features; generating a first weight based on the multi-time-scale features and the multi-frequency-scale features using a gating function; performing matrix multiplication between the third reference time-frequency signal features and the first weight to obtain a first calculation result; performing matrix multiplication between the third microphone time-frequency signal features and a second weight to obtain a second calculation result, wherein the sum of the first weight and the second weight is 1; and superimposing the first calculation result and the second calculation result and fusing them through a convolutional layer to obtain the fused time-frequency signal features; downsampling the fused time-frequency signal features using a decoding module to obtain key time-frequency signal features; and reconstructing the enhanced speech signal based on the key time-frequency signal features.

2. The method according to claim 1, characterized in that performing soft alignment of the first microphone time-frequency signal features and the first reference time-frequency signal features using the dynamic time-delay alignment module to obtain the aligned reference time-frequency signal includes: in the microphone signal processing branch, processing the first microphone time-frequency signal features with a convolutional layer to obtain a query vector; processing the first microphone time-frequency signal features with two convolutional layers to obtain a key vector; performing matrix multiplication of the query vector and the key vector to obtain an attention score; normalizing the attention score; and multiplying the normalized attention score by the first microphone time-frequency signal features to obtain a first feature part; in the reference signal processing branch, processing the first reference time-frequency signal features with a sliding window to obtain a value vector; processing the first reference time-frequency signal features with a convolutional layer and a sliding window to obtain a key vector; multiplying the key vector by the query vector to obtain an attention score; normalizing the attention score; and multiplying the normalized attention score by the value vector and performing weighted fusion to obtain a second feature part; and concatenating the first feature part and the second feature part to obtain the aligned reference time-frequency signal.

3. The method according to claim 1, characterized in that the temporal multi-scale dilated convolutional layer employs convolution with multi-level adjustable dilation rates to extract echo features and noise features at different time scales; and the frequency-domain multi-scale dilated convolutional layer uses independent convolutional kernels combined with multi-level dilation to separate high-frequency noise from the harmonic components of low-frequency echo.

4. The method according to claim 1, characterized in that, before upsampling the microphone time-frequency signal using the first encoding module, the method further includes: performing a short-time Fourier transform on the input microphone signal to obtain the transformed microphone time-frequency signal; and subjecting the transformed microphone time-frequency signal to frequency band compression processing to obtain a compressed microphone time-frequency signal; wherein upsampling the microphone time-frequency signal using the first encoding module is: upsampling the compressed microphone time-frequency signal using the first encoding module.

5. The method according to claim 1, characterized in that, before upsampling the reference time-frequency signal using the third encoding module, the method further includes: performing a short-time Fourier transform on the input reference signal to obtain the transformed reference time-frequency signal; and subjecting the transformed reference time-frequency signal to frequency band compression processing to obtain a compressed reference time-frequency signal; wherein upsampling the reference time-frequency signal using the third encoding module is: upsampling the compressed reference time-frequency signal using the third encoding module.

6. The method according to claim 4 or 5, characterized in that the frequency band compression processing includes: retaining low-frequency components less than or equal to 4 kHz, and compressing and merging high-frequency components greater than 4 kHz.

7. The method according to claim 1, characterized in that reconstructing the enhanced speech signal based on the key time-frequency signal features includes: subjecting the key time-frequency signal features to frequency band decompression processing to obtain the decompressed time-frequency signal; performing speech reconstruction on the decompressed time-frequency signal using a complex ratio mask generated by the mask layer to obtain the reconstructed time-frequency signal; and subjecting the reconstructed time-frequency signal to an inverse short-time Fourier transform to obtain the transformed speech signal.

8. The method according to claim 1, characterized in that the method is implemented based on a speech enhancement model, the loss function of which is: (1) where MSE is the mean squared error, SISNR is the scale-invariant signal-to-noise ratio, Ŝ is the speech signal predicted by the model, S is the clean speech signal, S_noisy is the speech signal containing noise and echo, and α and β are weights.

9. A multi-task speech enhancement device, characterized in that it includes: a microphone signal processing module, used to upsample the microphone time-frequency signal using the first encoding module to obtain the first microphone time-frequency signal features; to upsample the first microphone time-frequency signal features using the second encoding module to obtain the second microphone time-frequency signal features; and to compress and reduce the dimensionality of the second microphone time-frequency signal features using the first bottleneck layer to obtain the third microphone time-frequency signal features; a reference signal processing module, used to upsample the reference time-frequency signal using the third encoding module to obtain the first reference time-frequency signal features; to perform soft alignment of the first microphone time-frequency signal features and the first reference time-frequency signal features using a dynamic time-delay alignment module to obtain an aligned reference time-frequency signal; to upsample the aligned reference time-frequency signal using the fourth encoding module to obtain the second reference time-frequency signal features; and to compress and reduce the dimensionality of the second reference time-frequency signal features using the second bottleneck layer to obtain the third reference time-frequency signal features; a signal fusion processing module, used to perform feature extraction and fusion on the third microphone time-frequency signal features and the third reference time-frequency signal features using an adaptive gating module to obtain fused time-frequency signal features, which includes: fusing the third microphone time-frequency signal features and the third reference time-frequency signal features to obtain fused features; extracting multi-scale temporal features from the fused features using a temporal multi-scale dilated convolutional layer to obtain multi-time-scale features; extracting multi-scale frequency-domain features from the fused features using a frequency-domain multi-scale dilated convolutional layer to obtain multi-frequency-scale features; generating a first weight based on the multi-time-scale features and the multi-frequency-scale features through a gating function; performing matrix multiplication between the third reference time-frequency signal features and the first weight to obtain a first calculation result; performing matrix multiplication between the third microphone time-frequency signal features and a second weight to obtain a second calculation result, wherein the sum of the first weight and the second weight is 1; and superimposing the first calculation result and the second calculation result and fusing them through a convolutional layer to obtain the fused time-frequency signal features; and a signal reconstruction module, used to downsample the fused time-frequency signal features using the decoding module to obtain key time-frequency signal features, and to reconstruct the enhanced speech signal based on the key time-frequency signal features.
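
By way of non-limiting illustration, the reference-branch soft alignment recited in claim 2 can be sketched in Python as follows. For each microphone frame, the sketch scores a sliding window of current and past reference frames against the query, normalizes the scores with a softmax, and forms a weighted sum of the value vectors. The sketch makes several assumptions: the query, key, and value vectors are passed in directly instead of being produced by convolutional layers, the normalization is taken to be a softmax, the window covers delays from 0 to max_delay frames, and the microphone-branch feature part and the final concatenation are omitted. All names are hypothetical.

```python
import math


def softmax(scores):
    """Normalize scores into weights that sum to 1."""
    peak = max(scores)                       # subtract max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_align(queries, keys, values, max_delay):
    """Weighted fusion of delayed reference frames for every frame t.

    queries: one vector per microphone frame.
    keys, values: one vector per reference frame.
    The window for frame t holds reference frames t, t-1, ..., t-max_delay
    (clipped at the start of the signal).
    """
    dim = len(values[0])
    aligned = []
    for t, q in enumerate(queries):
        delays = [d for d in range(max_delay + 1) if t - d >= 0]
        scores = [sum(a * b for a, b in zip(q, keys[t - d])) for d in delays]
        weights = softmax(scores)
        frame = [sum(w * values[t - d][i] for w, d in zip(weights, delays))
                 for i in range(dim)]
        aligned.append(frame)
    return aligned


if __name__ == "__main__":
    n = 5
    # Toy data: each reference frame has a distinct key, and the
    # microphone query at frame t matches the reference frame t - 2.
    keys = [[20.0 if i == j else 0.0 for i in range(n)] for j in range(n)]
    values = [[float(j)] for j in range(n)]
    queries = [[1.0 if i == max(t - 2, 0) else 0.0 for i in range(n)]
               for t in range(n)]
    print(soft_align(queries, keys, values, max_delay=3))
```

Because the weights are continuous rather than a hard choice of one delay, the alignment can follow gradual changes in the echo path, which is the sense in which it is "soft".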