Self-supervised speech denoising method and device

By introducing the TCM module and ONT strategy into the DCUnet network and training a speech denoising model using noisy speech data, the adaptability and computational resource issues of existing technologies in complex noisy environments and multi-source interleaving scenarios are solved, achieving efficient and robust speech denoising effects.

CN120496488B · Active · Publication Date: 2026-06-23 · HAINACORD (HUBEI) TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: HAINACORD (HUBEI) TECH CO LTD
Filing Date: 2025-04-21
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

Existing speech denoising models are not adaptable enough to complex noise environments and multi-source interleaving scenarios, making it difficult to adapt to dynamically changing noise scenarios. They also have high computational resource requirements, affecting real-time performance and generalization ability.

Method used

We employ a deep complex-domain DCUnet network combined with a time-channel modeling (TCM) module, and train the model using the self-supervised ONT strategy. Training uses only noisy speech data, which improves the model's adaptability to complex environments and its noise reduction performance.

Benefits of technology

It improves the reconstruction quality and noise reduction performance of speech signals, reduces the dependence on training data, enhances the model's adaptability and generalization ability in complex environments, reduces the demand for computing resources, and achieves more efficient speech signal processing.


Abstract

The application provides a self-supervised speech denoising method and device, relating to the field of acoustic signal processing. The method comprises the following steps: S1, obtaining a DCUnet network and constructing a speech denoising model by adding a TCM module to the DCUnet network; S2, obtaining a noisy speech set and training the speech denoising model on it to obtain a trained speech denoising model; S3, denoising speech with the trained speech denoising model. The application constructs the speech denoising model on a deep complex-domain DCUnet network and, combined with a holistic complex-domain processing strategy, improves the reconstruction quality and denoising performance of the speech signal. The TCM module dynamically captures the changing characteristics of the noise scene, enhancing the model's adaptability to dynamic signals in complex environments. The model is trained with the ONT strategy using noisy speech as training data, so no clean target speech is needed, which reduces the cost of collecting training data and improves the generalization ability of the model.

Description

Technical Field

[0001] This invention relates to the field of acoustic signal processing, and more particularly to a self-supervised speech denoising method and apparatus.

Background Technology

[0002] In complex urban environments, noise interference poses a significant challenge to the separation and analysis of sound sources. Especially in scenarios with multiple intertwined sound sources, noise not only significantly reduces the effectiveness of speech signal separation but may also mask crucial speech or environmental sound information, thereby affecting the performance of downstream tasks such as smart city monitoring, noise control assessment, and the development of multimodal sensing systems. Therefore, the research and application of speech denoising technology in complex environments is extremely important.

[0003] Currently, most speech denoising tasks rely on the "Noisy-Clean Training" (NCT) strategy. This type of method trains a network to map noisy speech to clean speech, thus achieving the denoising goal. However, obtaining completely clean speech signals requires expensive recording equipment and a strictly controlled recording environment; data collection is both time-consuming and costly, and often limited in scale and diversity. To address this issue, self-supervised learning strategies have emerged, such as "Noisy-Noisy Training" (NNT) and "Noisier-Noisy Training" (NerNT). These methods achieve denoising by constructing mappings between noisy speech signals, but they still face performance limitations when dealing with multi-source noise in complex environments. Furthermore, these methods are highly dependent on the characteristics of noise distribution, making it difficult to adapt to dynamically changing noise scenarios, and the models' generalization ability in high-noise scenarios is insufficient, potentially leading to over-smoothing of the speech signal or loss of important details.

[0004] Furthermore, from a model optimization perspective, most existing speech denoising networks are based on real-valued domain computation, typically focusing only on estimating spectral amplitudes while neglecting phase information modeling, thus limiting signal reconstruction quality. Although deep complex networks (such as DCUnet) significantly improve signal reconstruction capabilities by incorporating complex time-frequency masks to optimize signal-to-noise ratio loss, existing methods are still insufficient in handling long-range dependencies and global contextual information. Moreover, the complexity and high computational resource requirements of these methods also limit their practical applications.

[0005] Therefore, developing a speech denoising method that does not rely on clean speech targets and can effectively model complex noisy environments has become an important research direction in the field of speech signal processing. Denoising can not only significantly improve the signal-to-noise ratio of the target speech, but also provide clearer signal input for the separation of mixed sounds, thereby reducing mutual interference between multiple sound sources and improving the overall performance of the separation algorithm. In practical applications, denoising technology lays the foundation for feature extraction and classification of mixed signals and is an indispensable and crucial step in achieving multi-source separation. Optimizing the joint representation of time-frequency domain signals by combining complex domain processing and context modeling capabilities will be one of the key paths for the development of speech denoising technology.

[0006] While self-supervised learning methods such as NNT and NerNT have to some extent removed the dependence on clean speech target data, they still have limitations when dealing with complex noisy environments and dynamic multi-source scenarios. These methods typically assume that the noise distribution has zero mean or specific statistical properties, but the noise distribution in real-world environments is complex and variable, often contradicting these assumptions and leading to a significant decrease in noise reduction performance. Furthermore, these methods have high requirements for initial model settings and the number of noise samples; when noise characteristics are not adequately covered, the model may overfit specific noise patterns and lack sufficient generalization ability.

[0007] Besides the limitations of training strategies, existing denoising models also face challenges in performance improvement. While denoising networks like Deep Complex U-Net (DCUnet) significantly optimize signal amplitude and phase modeling capabilities by introducing complex time-frequency masks, their architecture still suffers from the following shortcomings: First, DCUnet has limited ability to model long-range dependencies and global contextual information, resulting in its inability to fully capture the dynamic interactions between multiple sound sources in complex signal separation tasks; second, the model's processing complexity for complex signals is high, requiring significant computational resources and limiting its application in resource-constrained environments; finally, DCUnet's ability to reconstruct spectral details in high-noise environments is weak, potentially leading to the loss of key features of the speech signal or excessive smoothing.

[0008] On the one hand, existing denoising models generally fall short in adaptability to dynamic signal changes. Traditional self-supervised learning methods are poorly adapted to dynamic noise scenarios and struggle to adjust model parameters in real time to handle rapidly changing noise characteristics. This is mainly because these methods lack comprehensive utilization of signal temporal information and ignore the correlation between noise and target signals in the time dimension, thus limiting the denoising performance and stability of the model.

[0009] On the other hand, existing denoising methods still need improvement in terms of real-time performance and computational efficiency. Especially when facing multi-source, dynamically changing, and highly complex noise scenarios, models typically require more computational resources to achieve accurate signal separation and reconstruction. This not only affects the real-time performance of denoising algorithms but also limits their deployment in resource-constrained devices such as smart devices and portable voice assistants. Furthermore, existing models often require separate processing of the real and imaginary parts when optimizing complex domain calculations. This design increases computational complexity and may affect the overall signal modeling effect.

Summary of the Invention

[0010] In view of this, the purpose of this invention is to provide a self-supervised speech denoising method and device to solve the technical problems of insufficient performance of existing denoising models in terms of adaptability to dynamic signal changes, real-time performance, and computational efficiency.

[0011] This invention provides a self-supervised speech denoising method, comprising the following steps:

[0012] S1: Obtain the DCUnet network and build a speech denoising model by adding the TCM module to the DCUnet network;

[0013] S2: Obtain a set of noisy speech samples, train the speech denoising model using the noisy speech samples, and obtain a trained speech denoising model;

[0014] S3: Denoise speech using a trained speech denoising model.

[0015] Preferably:

[0016] The speech denoising model consists of an encoder, a TCM module, a Complex-TSTM module, and a decoder connected in sequence.

[0017] Preferably:

[0018] The encoder consists of N Conv2D modules connected in sequence, with the Conv2D modules numbered from 1 to N;

[0019] The decoder consists of N Conv2D modules connected in sequence, with the Conv2D modules numbered from N to 1;

[0020] The Conv2D modules with the same number in the encoder and decoder are interconnected.

[0021] Preferably:

[0022] The TCM module includes a head token generation module, a multi-head self-attention mechanism module, and a classification token enhancement module connected in sequence.

[0023] The encoder is connected to the head token generation module, and the classification token enhancement module is connected to the Complex-TSTM module.

[0024] Preferably:

[0025] The Complex-TSTM module includes a Real TSTM part and an Imag TSTM part;

[0026] The classification token enhancement module is connected to the Real TSTM part and the Imag TSTM part;

[0027] The Real TSTM part and the Imag TSTM part are connected to the decoder.

[0028] Preferably, step S2 specifically includes:

[0029] S21: Convert the noisy speech into a noisy image, input the noisy image into the encoder of the speech denoising model for encoding, and obtain the first feature image;

[0030] S22: Input the first feature image into the TCM module to enhance its feature representation capability and obtain the second feature image;

[0031] S23: Input the second feature image into the Complex-TSTM module to calculate the comprehensive loss, and then input the second feature image into the decoder for decoding to obtain the initial denoised speech;

[0032] S24: Superimpose the initial denoised speech with the noisy speech to obtain the final denoised speech, and adjust the parameters of the speech denoising model using the final denoised speech;

[0033] S25: Repeat steps S21-S24 until the overall loss is less than the preset value to obtain the trained speech denoising model.

[0034] Preferably:

[0035] The formula for calculating the overall loss L is:

[0036]

[0037] where the terms of the formula are the time-domain loss, the frequency-domain loss, the weighted signal-to-noise ratio loss, and the regularization loss, and α and β are hyperparameters.

[0038] A storage medium storing instructions and data for implementing the self-supervised speech denoising method.

[0039] A self-supervised speech denoising device includes: a processor and a storage medium; the processor loads and executes instructions and data in the storage medium to implement the self-supervised speech denoising method.

[0040] The present invention has the following beneficial effects:

[0041] A speech denoising model is constructed based on the DCUnet network in the deep complex domain. Combined with a comprehensive complex domain processing strategy, the reconstruction quality and denoising performance of the speech signal are improved. The TCM module in the speech denoising model dynamically captures the changing characteristics of the noise scene, enhancing the model's adaptability to dynamic signals in complex environments. The ONT strategy is used to train the speech denoising model, using noisy speech as training data, eliminating the need for clean target speech data, reducing the cost of training data collection, and improving the model's generalization ability.

Attached Figure Description

[0042] Figure 1 is a flowchart of a method according to an embodiment of the present invention;

[0043] Figure 2 is a structural diagram of the DCUnet network;

[0044] Figure 3 is a structural diagram of the speech denoising model;

[0045] Figure 4 is a structural diagram of the TCM module;

[0046] Figure 5 is a structural diagram of the device according to an embodiment of the present invention;

[0047] The realization of the objective, functional features and advantages of the present invention will be further explained in conjunction with the embodiments and with reference to the accompanying drawings.

Detailed Implementation

[0048] It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

[0049] Referring to Figure 1, to address the shortcomings of existing speech denoising methods in complex noise environments and multi-source interleaving scenarios, such as insufficient adaptability, limited modeling capability, and poor real-time performance, this invention proposes a self-supervised speech denoising method that combines a Temporal-Channel Modeling (TCM) module with complex-domain optimization, comprising the following steps:

[0050] S1: Obtain the DCUnet network and build a speech denoising model by adding the TCM module to the DCUnet network;

[0051] As one example, the network architecture uses the Deep Complex U-Net (DCUnet network) as the base model and optimizes and extends it. The structure of the DCUnet network is shown in Figure 2. Complex 2D convolution and complex-valued deconvolution operations are combined in the encoder and decoder modules to improve the model's adaptability and robustness when processing complex noisy signals. At the same time, high-resolution feature information is passed between the encoder and decoder through skip connections, effectively preserving the semantics and details of the signal. The structure of the speech denoising model is shown in Figure 3.
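As an illustration of the complex convolution mentioned above, the following minimal sketch (1-D for brevity) shows how a complex-valued convolution can be computed from four real-valued convolutions. The function names `conv1d_real` and `conv1d_complex` and the toy values are hypothetical and do not come from the patent.

```python
def conv1d_real(x, w):
    """Valid-mode 1-D cross-correlation of real sequences x and w."""
    n = len(x) - len(w) + 1
    return [sum(x[t + k] * w[k] for k in range(len(w))) for t in range(n)]


def conv1d_complex(x, w):
    """Complex convolution built from four real convolutions.

    With x = x_r + j*x_i and w = w_r + j*w_i:
        real part = x_r*w_r - x_i*w_i
        imag part = x_r*w_i + x_i*w_r
    """
    x_r, x_i = [v.real for v in x], [v.imag for v in x]
    w_r, w_i = [v.real for v in w], [v.imag for v in w]
    rr, ii = conv1d_real(x_r, w_r), conv1d_real(x_i, w_i)
    ri, ir = conv1d_real(x_r, w_i), conv1d_real(x_i, w_r)
    return [complex(a - b, c + d) for a, b, c, d in zip(rr, ii, ri, ir)]


# Toy complex spectrogram row and a hypothetical complex kernel.
x = [1 + 2j, 0 - 1j, 3 + 0j, -2 + 1j]
w = [0.5 + 0.5j, 1 - 1j]
out = conv1d_complex(x, w)

# Reference: the same operation using Python's native complex arithmetic.
ref = [sum(x[t + k] * w[k] for k in range(len(w)))
       for t in range(len(x) - len(w) + 1)]
```

Splitting the operation this way is what lets a complex layer be assembled from ordinary real-valued convolution kernels.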

[0052] The speech denoising model consists of an encoder, a TCM module, a Complex-TSTM module, and a decoder connected in sequence.

[0053] As one example:

[0054] The encoder consists of N Conv2D modules connected in sequence, with the Conv2D modules numbered from 1 to N;

[0055] The decoder consists of N Conv2D modules connected in sequence, with the Conv2D modules numbered from N to 1;

[0056] The Conv2D modules with the same number in the encoder and decoder are interconnected.
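The pairing of equally numbered encoder and decoder modules can be sketched as below. This is a toy under stated assumptions: `down` and `up` are hypothetical stand-ins for the Conv2D modules, and the skip connection is merged by addition purely for illustration (the text does not say how the connected features are combined).

```python
def down(x):
    """Stand-in for an encoder module: halve the length by pair averaging."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]


def up(x):
    """Stand-in for a decoder module: double the length by repetition."""
    return [v for v in x for _ in range(2)]


def unet_forward(x, n_stages):
    """Run encoder modules 1..N, then decoder modules N..1 with skips."""
    skips = {}
    h = x
    for k in range(1, n_stages + 1):      # encoder modules 1..N
        skips[k] = h                      # feature map at resolution k
        h = down(h)
    for k in range(n_stages, 0, -1):      # decoder modules N..1
        h = up(h)
        # Skip connection from the encoder module with the same number.
        h = [a + b for a, b in zip(h, skips[k])]
    return h


y = unet_forward([float(i) for i in range(8)], n_stages=3)
```

Because module k of the decoder receives the feature map stored at module k of the encoder, the output recovers the input resolution.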

[0057] As one embodiment, the TCM module designed in this invention significantly enhances the model's ability to capture the dependencies between the time dimension and the channel dimension.

[0058] The TCM module includes a head token generation module, a multi-head self-attention mechanism module, and a classification token enhancement module connected in sequence.

[0059] The encoder is connected to the head token generation module, and the classification token enhancement module is connected to the Complex-TSTM module.

[0060] Specifically, the structure of the TCM module is shown in Figure 4, in which Head Token Generation denotes the head token generation module, Multi-Head Self-Attention denotes the multi-head self-attention mechanism module, and Classification Token Enrichment denotes the classification token enhancement module.

[0061] (1) Head token generation module:

[0062] The TCM module first extracts the channel information of the input signal through the head token generation module. The input sequence consists of a classification token (CLS) and T time tokens. The time tokens are divided into H segments along the channel dimension, each of dimension d = D / H, where D is the token dimension and H is the number of attention heads. Each segment yields a channel feature through temporal average pooling, which is then projected back to D dimensions through a fully connected layer with a GELU activation to form a head token. The H head tokens represent different parts of the channel information and are concatenated with the input sequence to form a time-channel token sequence of length T+H+1.
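The head token generation described above can be sketched as follows. This is a minimal illustration, assuming one fully connected layer per segment; the weights `W`, biases `b`, and the sizes T, D, H are hypothetical values chosen only to make the example run.

```python
import math
import random


def gelu(v):
    """Exact GELU: 0.5 * v * (1 + erf(v / sqrt(2)))."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def head_tokens(time_tokens, num_heads, weights, biases):
    """Build H head tokens from T time tokens of dimension D.

    weights[h] is a hypothetical (D x d) projection for segment h and
    biases[h] a length-D bias; both are illustrative assumptions.
    """
    T, D = len(time_tokens), len(time_tokens[0])
    d = D // num_heads
    tokens = []
    for h in range(num_heads):
        # Temporal average pooling over channel segment h (d values).
        pooled = [sum(tok[h * d + c] for tok in time_tokens) / T
                  for c in range(d)]
        # Fully connected layer back to D dimensions, then GELU.
        tokens.append([
            gelu(sum(weights[h][j][c] * pooled[c] for c in range(d))
                 + biases[h][j])
            for j in range(D)
        ])
    return tokens


random.seed(0)
T, D, H = 5, 8, 4
cls = [0.0] * D
time_tokens = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
W = [[[random.gauss(0, 0.1) for _ in range(D // H)] for _ in range(D)]
     for _ in range(H)]
b = [[0.0] * D for _ in range(H)]

# Time-channel token sequence: CLS + T time tokens + H head tokens.
sequence = [cls] + time_tokens + head_tokens(time_tokens, H, W, b)
```

The resulting sequence has T+H+1 tokens, each of dimension D, matching the length stated in the text.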

[0063] (2) Multi-head self-attention mechanism module:

[0064] In the TCM module, the Multi-Head Self-Attention (MHSA) mechanism works like conventional MHSA, except that its input sequence contains channel (head) tokens as well as time tokens. To learn the interaction between time and channel tokens, MHSA transforms the time-channel tokens into Query (Q), Key (K), and Value (V) through the corresponding linear projection matrices. The time-channel tokens are projected H times, where i denotes the index of the attention head, and each projection produces a d-dimensional representation. The self-attention operation then computes appropriate weights along the time axis from the correlation between tokens through a scaled dot product; this is performed in parallel across the H attention heads. Finally, the outputs of all attention heads are concatenated and converted to the output embedding by the final linear projection matrix $W^{O}$. The overall formula for multi-head self-attention is as follows:

[0065] $\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H)\, W^{O}$

[0066]
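For reference, the standard scaled dot-product attention that matches the description in paragraph [0064] can be written as below. The symbols $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ for the per-head projection matrices are assumed names, since they are not legible in the text; whether the patent uses exactly this form is likewise an assumption.

```latex
\begin{aligned}
Q_i &= X W_i^{Q}, \qquad K_i = X W_i^{K}, \qquad V_i = X W_i^{V}, \\
\mathrm{head}_i &= \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d}}\right) V_i,
  \qquad i = 1, \dots, H, \\
\mathrm{MultiHead}(X) &= \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H)\, W^{O}.
\end{aligned}
```

Here X is the time-channel token sequence of length T+H+1 and d = D / H is the per-head dimension.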

[0067] (3) Classification token enhancement module:

[0068] Although the classification token (CLS) can already extract information from the time and channel tokens in MHSA, the TCM module further enhances its representation: it separates the time tokens and head tokens in the MHSA output and average-pools each group. The resulting time average token and head average token are used to enrich the classification token, providing more comprehensive information for the final prediction.
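A minimal sketch of this enrichment step follows. The text does not specify how the pooled tokens are merged into the classification token, so element-wise addition is used here as an assumption; `mean_token` and `enrich_cls` are hypothetical helper names.

```python
def mean_token(tokens):
    """Element-wise average of a list of equal-length token vectors."""
    n = len(tokens)
    return [sum(tok[j] for tok in tokens) / n for j in range(len(tokens[0]))]


def enrich_cls(mhsa_output, num_time, num_heads):
    """Enrich the classification token with pooled time and head tokens.

    mhsa_output is ordered [CLS, T time tokens, H head tokens]. Combining
    by element-wise addition is an assumption made for illustration.
    """
    cls = mhsa_output[0]
    time_avg = mean_token(mhsa_output[1:1 + num_time])
    head_avg = mean_token(mhsa_output[1 + num_time:1 + num_time + num_heads])
    return [c + t + h for c, t, h in zip(cls, time_avg, head_avg)]


# Toy MHSA output: 1 CLS token, T = 2 time tokens, H = 2 head tokens, D = 3.
mhsa_out = [
    [1.0, 1.0, 1.0],                    # CLS
    [0.0, 2.0, 4.0], [2.0, 4.0, 6.0],   # time tokens
    [1.0, 1.0, 1.0], [3.0, 3.0, 3.0],   # head tokens
]
enriched = enrich_cls(mhsa_out, num_time=2, num_heads=2)
# enriched == [4.0, 6.0, 8.0]
```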

[0069] As one example:

[0070] The Complex-TSTM module includes a Real TSTM part and an Imag TSTM part;

[0071] The classification token enhancement module is connected to the Real TSTM part and the Imag TSTM part;

[0072] The Real TSTM part and the Imag TSTM part are connected to the decoder.

[0073] Specifically, in the Complex-TSTM module, the input complex features are split into real and imaginary parts, which are processed by the Real TSTM and the Imag TSTM, respectively. The two operate independently during computation, extracting speech features from different dimensions. After processing, the real and imaginary results are recombined through complex arithmetic to produce a complete complex output for the subsequent decoder.
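One plausible reading of "recombined through complex arithmetic" is the complex multiplication rule, sketched below. This is an assumption rather than the patent's stated formula, and `real_branch` and `imag_branch` are hypothetical stand-ins for the Real TSTM and Imag TSTM parts.

```python
def complex_tstm(x, real_tstm, imag_tstm):
    """Apply two real-valued branches to a complex feature sequence.

    Recombination follows the complex multiplication rule
        out_r = R(x_r) - I(x_i),  out_i = R(x_i) + I(x_r),
    which is an assumed form; the text only states that the parts are
    recombined through complex arithmetic.
    """
    x_r = [v.real for v in x]
    x_i = [v.imag for v in x]
    out_r = [a - b for a, b in zip(real_tstm(x_r), imag_tstm(x_i))]
    out_i = [a + b for a, b in zip(real_tstm(x_i), imag_tstm(x_r))]
    return [complex(r, i) for r, i in zip(out_r, out_i)]


def real_branch(seq):
    """Hypothetical stand-in for the Real TSTM part (scales by 2)."""
    return [2.0 * v for v in seq]


def imag_branch(seq):
    """Hypothetical stand-in for the Imag TSTM part (scales by 0.5)."""
    return [0.5 * v for v in seq]


features = [1 + 1j, 2 - 4j]
result = complex_tstm(features, real_branch, imag_branch)
# With these linear stand-ins the result equals features * (2 + 0.5j).
```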

[0074] S2: Obtain a set of noisy speech samples, train the speech denoising model using the noisy speech samples, and obtain a trained speech denoising model;

[0075] As one embodiment, to achieve efficient training, this invention employs the ONT strategy to generate training pairs from a single noisy speech sample, without relying on clean speech target data. The ONT strategy generates conditionally independent audio pairs through subsampling and adds a regularization loss term to optimize network performance, significantly reducing the training's dependence on noise distribution assumptions and thereby improving the model's generalization ability in complex dynamic noise environments.
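The subsampling step can be sketched as follows. The text only says that audio pairs are produced by subsampling a single noisy sample; the neighbour scheme below (one sample of each adjacent pair goes to each output, chosen at random) is an assumption, and `subsample_pair` is a hypothetical name.

```python
import random


def subsample_pair(x, rng):
    """Split a noisy waveform into two sub-sampled signals s1(x), s2(x).

    Each block of two adjacent samples contributes one sample to s1 and
    the other to s2, chosen at random. This neighbour sub-sampling scheme
    is an assumption; the text only says that pairs come from subsampling.
    """
    s1, s2 = [], []
    for k in range(0, len(x) - 1, 2):
        a, b = x[k], x[k + 1]
        if rng.random() < 0.5:
            a, b = b, a
        s1.append(a)
        s2.append(b)
    return s1, s2


rng = random.Random(0)
noisy = [0.1, 0.3, -0.2, 0.4, 0.0, -0.5, 0.7, 0.2]
s1, s2 = subsample_pair(noisy, rng)
```

Both outputs are half the length of the input and share the same underlying speech content while carrying different noise samples, which is what allows one to serve as the training target for the other.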

[0076] Step S2 is as follows:

[0077] S21: Convert the noisy speech into a noisy image, input the noisy image into the encoder of the speech denoising model for encoding, and obtain the first feature image;

[0078] S22: Input the first feature image into the TCM module to enhance its feature representation capability and obtain the second feature image;

[0079] S23: Input the second feature image into the Complex-TSTM module to calculate the comprehensive loss, and then input the second feature image into the decoder for decoding to obtain the initial denoised speech;

[0080] Specifically, a comprehensive loss is constructed by combining time-domain loss, frequency loss, weighted signal-to-noise ratio loss, and regularization loss.

[0081] The formula for calculating the overall loss L is:

[0082]

[0083] where the terms of the formula are the time-domain loss, the frequency-domain loss, the weighted signal-to-noise ratio loss, and the regularization loss, and α and β are hyperparameters.

[0084] Specifically, the time-domain loss is the mean square error (MSE) between the enhanced and clean waveforms, defined as follows:

[0085]

[0086] where $s_i$ and $\hat{s}_i$ denote the i-th sample of the clean speech and of the denoised speech, respectively, and N is the total number of audio samples.
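As a worked example of the mean square error described above (the sum of squared sample differences divided by N), consider the following sketch. The function name `time_domain_loss` and the sample values are hypothetical.

```python
def time_domain_loss(clean, enhanced):
    """Mean squared error between two equal-length waveforms."""
    if len(clean) != len(enhanced):
        raise ValueError("waveforms must have the same length")
    n = len(clean)
    return sum((s - s_hat) ** 2 for s, s_hat in zip(clean, enhanced)) / n


# Squared errors are 0, 0.25, 0, 1, so the mean over N = 4 is 0.3125.
loss_t = time_domain_loss([0.0, 1.0, -1.0, 2.0], [0.0, 0.5, -1.0, 1.0])
```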

[0087] The frequency-domain loss supervises the model so that it learns more information, thereby improving the intelligibility and perceptual quality of speech. It is defined as:

[0088]

[0089] where $S$ and $\hat{S}$ denote the clean spectrum and the enhanced spectrum, respectively; r and i denote the real and imaginary parts of a complex number, respectively; and T and F denote the number of frames and the number of frequency bins, respectively.

[0090] The weighted signal-to-noise ratio loss directly optimizes a commonly used evaluation metric in the time domain and is defined as follows:

[0091]

[0092] where x denotes the noisy samples, y the target samples, $\hat{y}$ the estimated output, and α the energy ratio between the target speech and the noise.
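The symbols listed above match the weighted SDR loss commonly used with DCUnet, which is sketched below for illustration. Treating the patent's loss as this exact form is an assumption, as is the specific definition of α as target energy over target-plus-noise energy; `neg_cosine` and `wsdr_loss` are hypothetical names.

```python
import math


def neg_cosine(a, b, eps=1e-8):
    """Negative cosine similarity: -<a, b> / (||a|| * ||b||)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return -dot / (norm_a * norm_b + eps)


def wsdr_loss(x, y, y_hat, eps=1e-8):
    """Weighted SDR loss in its commonly used form (assumed here).

    x: noisy samples, y: target samples, y_hat: estimated output.
    alpha is the energy of the target divided by target-plus-noise energy.
    """
    noise = [a - b for a, b in zip(x, y)]          # true noise x - y
    noise_hat = [a - b for a, b in zip(x, y_hat)]  # estimated noise x - y_hat
    e_y = sum(v * v for v in y)
    e_n = sum(v * v for v in noise)
    alpha = e_y / (e_y + e_n + eps)
    return (alpha * neg_cosine(y, y_hat, eps)
            + (1 - alpha) * neg_cosine(noise, noise_hat, eps))


y = [1.0, -1.0, 0.5, 0.0]
x = [1.2, -0.7, 0.4, 0.3]
perfect = wsdr_loss(x, y, y)                      # estimate equals the target
poor = wsdr_loss(x, y, [0.0, 0.1, 0.0, -0.1])     # estimate far from target
```

The loss is bounded between -1 and 1 and approaches -1 as the estimate approaches the target, so a better estimate always gives a lower value.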

[0093] For audio pairs s1(x) and s2(x) sampled from noisy speech x, this invention uses a regularization loss as an additional constraint, defined as:

[0094]

[0095] where $f_\theta$ denotes the denoising network. To stabilize training, the gradients of $s_1(f_\theta(x))$ and $s_2(f_\theta(x))$ are stopped during training, and the hyperparameter γ is gradually increased to achieve the best training effect.
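The quantities named above suggest a regularizer of the form f(s1(x)) - s2(x) - (s1(f(x)) - s2(f(x))), as used in neighbour-subsampling denoising schemes. The sketch below assumes that form; the deterministic sub-samplers and toy "denoisers" are hypothetical stand-ins, and the stop-gradient is only noted in a comment because plain Python has no gradients.

```python
def s1(x):
    """Deterministic stand-in sub-sampler: even-indexed samples."""
    return x[0::2]


def s2(x):
    """Deterministic stand-in sub-sampler: odd-indexed samples."""
    return x[1::2]


def reg_loss(f, x):
    """Mean squared value of f(s1(x)) - s2(x) - (s1(f(x)) - s2(f(x))).

    In an autograd framework the second bracket would be computed with
    gradients stopped; here it is simply treated as a constant.
    """
    fx = f(x)                       # treated as a constant (stop-gradient)
    pred = f(s1(x))
    target = s2(x)
    diff = [p - t - (a - b)
            for p, t, a, b in zip(pred, target, s1(fx), s2(fx))]
    return sum(d * d for d in diff) / len(diff)


def identity(x):
    """Toy 'denoiser' that returns its input unchanged."""
    return list(x)


def halve(x):
    """Toy 'denoiser' that scales its input by 0.5."""
    return [0.5 * v for v in x]


x = [0.2, 0.4, -0.1, 0.3, 0.5, 0.1]
loss_identity = reg_loss(identity, x)   # both brackets cancel exactly
loss_halve = reg_loss(halve, x)         # residual is -0.5 * s2(x)
```

The term penalizes a network whose output on a sub-sampled signal disagrees with sub-sampling its output on the full signal, which counteracts the over-smoothing that training on sub-sampled pairs alone can cause.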

[0096] S24: Superimpose the initial denoised speech with the noisy speech to obtain the final denoised speech, and adjust the parameters of the speech denoising model using the final denoised speech;

[0097] S25: Repeat steps S21-S24 until the overall loss is less than the preset value to obtain the trained speech denoising model.

[0098] S3: Denoise speech using a trained speech denoising model.

[0099] Specifically, a well-trained speech denoising model can comprehensively model complex speech signals, significantly improving denoising performance, computational efficiency, and adaptability in practical applications, providing an efficient and robust innovative solution for the field of speech signal processing. Furthermore, this model is applicable to various scenarios such as multi-source separation, speech enhancement, and noise suppression in complex environments, providing technical support for fields such as intelligent monitoring, speech recognition, and human-computer interaction.

[0100] Referring to Figure 5, which is a schematic diagram of the hardware device in operation according to an embodiment of the present invention, the hardware device specifically includes: a self-supervised speech denoising device 401, a processor 402, and a storage medium 403.

[0101] A self-supervised speech denoising device 401: The self-supervised speech denoising device 401 implements the self-supervised speech denoising method.

[0102] Processor 402: The processor 402 loads and executes the instructions and data in the storage medium 403 to implement the self-supervised speech denoising method.

[0103] Storage medium 403: The storage medium 403 stores instructions and data; the storage medium 403 is used to implement the self-supervised speech denoising method.

[0104] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or system. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or system that includes that element.

[0105] The sequence numbers of the above embodiments of the present invention are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments. In the unit claims listing several devices, several of these devices may be embodied by the same hardware item. The use of the terms first, second, and third, etc., does not indicate any order and can be interpreted as identifiers.

[0106] The above are merely preferred embodiments of the present invention and do not limit the patent scope of the present invention. Any equivalent structural or procedural transformations made based on the content of the present invention's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the patent protection scope of the present invention.

Claims

1. A self-supervised speech denoising method, characterized in that it comprises the following steps: S1: obtain the DCUnet network and build a speech denoising model by adding the TCM module to the DCUnet network; the speech denoising model consists of an encoder, a TCM module, a Complex-TSTM module, and a decoder connected in sequence; the TCM module includes a head token generation module, a multi-head self-attention mechanism module, and a classification token enhancement module connected in sequence; the encoder is connected to the head token generation module, and the classification token enhancement module is connected to the Complex-TSTM module; the Complex-TSTM module includes a Real TSTM part and an Imag TSTM part; the classification token enhancement module is connected to the Real TSTM part and the Imag TSTM part; the Real TSTM part and the Imag TSTM part are connected to the decoder; S2: obtain a set of noisy speech samples, train the speech denoising model using the noisy speech samples, and obtain a trained speech denoising model; step S2 specifically includes: S21: convert the noisy speech into a noisy image, input the noisy image into the encoder of the speech denoising model for encoding, and obtain the first feature image; S22: input the first feature image into the TCM module to enhance its feature representation capability and obtain the second feature image; S23: input the second feature image into the Complex-TSTM module to calculate the comprehensive loss, and then input the second feature image into the decoder for decoding to obtain the initial denoised speech; S24: superimpose the initial denoised speech with the noisy speech to obtain the final denoised speech, and adjust the parameters of the speech denoising model using the final denoised speech; S25: repeat steps S21-S24 until the overall loss is less than the preset value to obtain the trained speech denoising model; S3: denoise speech using the trained speech denoising model.

2. The self-supervised speech denoising method according to claim 1, characterized in that: the encoder consists of N Conv2D modules connected in sequence, with the Conv2D modules numbered from 1 to N; the decoder consists of N Conv2D modules connected in sequence, with the Conv2D modules numbered from N to 1; and the Conv2D modules with the same number in the encoder and decoder are interconnected.

3. The self-supervised speech denoising method according to claim 1, characterized in that the overall loss L is calculated by a formula whose terms are the time-domain loss, the frequency-domain loss, the weighted signal-to-noise ratio loss, and the regularization loss, with α and β as hyperparameters.

4. A self-supervised speech denoising device, characterized in that it includes a processor and a storage medium; the processor loads and executes instructions and data in the storage medium to implement the self-supervised speech denoising method according to any one of claims 1 to 3.